
Inefficient enqueue implementation in CUDA and HIP plugins #4647

Closed
@alexbatashev

Description

On `piEnqueueKernelLaunch`, the CUDA and HIP plugins call wait on each event in the wait list at kernel submission time. For relatively narrow but deep graphs, this effectively makes SYCL queues in-order: each submission blocks until the preceding tasks have completed. See

```cpp
if (event_wait_list) {
  retError = cuda_piEnqueueEventsWait(
      command_queue, num_events_in_wait_list, event_wait_list, nullptr);
}
```
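One non-blocking alternative would be to express the dependency on the device side instead of the host side: `cuStreamWaitEvent` enqueues a wait into the stream and returns immediately, so the host thread is never blocked. A rough sketch of what the snippet above could look like (the `get()` accessor for the underlying `CUevent` and the `PI_CHECK_ERROR` wrapper are assumptions about the plugin's internals, not taken from the source):

```cpp
// Sketch only: enqueue a device-side wait per dependency instead of
// blocking the host. cuStreamWaitEvent returns immediately; subsequent
// commands on cuStream will not run until each event has completed.
if (event_wait_list) {
  for (pi_uint32 i = 0; i < num_events_in_wait_list; ++i) {
    // Assumption: the PI event wraps a CUevent reachable via get().
    retError = PI_CHECK_ERROR(
        cuStreamWaitEvent(cuStream, event_wait_list[i]->get(), 0));
  }
}
```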

Other APIs take a different approach. Level Zero (similarly to Vulkan) has the notion of a command list: commands are batched before submission, and events are not explicitly waited on. OpenCL has native support for wait lists, and the implementation is free to handle them however it likes, so again there is no explicit wait on the plugin side.

I believe the CUDA and HIP plugins could either implement some sort of thread pool to wait on events, or make use of the graph APIs to do the same thing Level Zero does, i.e. collect commands into a sub-graph and submit it once a threshold is reached or when the user explicitly waits on an event.

Tagging @AerialMantis and @npmiller

Metadata

Labels

cuda (CUDA back-end), enhancement (New feature or request), hip (Issues related to execution on HIP backend), performance (Performance related issues)
