Description
In piEnqueueKernelLaunch, the CUDA and HIP plugins call wait on each event in the wait list upon kernel submission. For relatively narrow but deep graphs, this essentially makes SYCL queues in-order, and each submission is blocked until the tasks it depends on have completed. See llvm/sycl/plugins/cuda/pi_cuda.cpp, lines 2639 to 2642 in 1717a14.
Other APIs take a different approach. Level Zero (similarly to Vulkan) has the notion of a command list: commands are batched before submission, and events are not explicitly waited on. OpenCL has native support for wait lists, and the implementation is free to handle them however it likes, so again there is no explicit wait on the plugin side.
I believe the CUDA and HIP plugins could either implement some sort of thread pool to wait on events, or make use of the graph APIs to do the same thing Level Zero does, i.e. collect commands into a sub-graph and submit it once a threshold is reached or when the user explicitly waits on an event.
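To make the batching idea a bit more concrete, here is a rough host-only sketch of what I have in mind. It is not plugin code: `DeferredQueue`, `enqueue`, `wait` and the threshold parameter are all hypothetical names, and no CUDA or HIP calls are made. Commands are just callables that get recorded and later run in recording order, standing in for a sub-graph that a real backend might build with the graph APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a plugin-side queue. None of these names exist in
// the PI CUDA or HIP plugins; this only models the submission policy.
class DeferredQueue {
public:
  using Command = std::function<void()>;
  using EventId = std::size_t;

  explicit DeferredQueue(std::size_t BatchThreshold)
      : Threshold(BatchThreshold) {
    assert(Threshold > 0 && "batch threshold must be positive");
  }

  // Records a command and returns an event handle without waiting on any
  // dependency. Dependencies must refer to earlier events, so running the
  // batch in recording order is enough to honour them in this sketch.
  EventId enqueue(Command Cmd, std::vector<EventId> WaitList = {}) {
    for (EventId Dep : WaitList) {
      assert(Dep < NextId && "wait list refers to an unknown event");
      (void)Dep;
    }
    Pending.push_back(std::move(Cmd));
    EventId Id = NextId++;
    if (Pending.size() >= Threshold)
      flush(); // threshold reached: submit the whole batch
    return Id;
  }

  // Explicit wait from the user: submit the batch if the event is still in it.
  void wait(EventId Id) {
    if (Id >= Submitted)
      flush();
  }

  std::size_t pendingCount() const { return Pending.size(); }
  std::size_t submittedCount() const { return Submitted; }

private:
  // Assumption: a real backend would turn the batch into a sub-graph and
  // launch it here. The sketch simply runs the commands in order.
  void flush() {
    for (Command &Cmd : Pending)
      Cmd();
    Submitted += Pending.size();
    Pending.clear();
  }

  std::size_t Threshold;
  std::size_t Submitted = 0;
  EventId NextId = 0;
  std::vector<Command> Pending;
};
```

With a threshold of 3, enqueueing two dependent commands returns immediately and runs nothing. The batch is only submitted when the user waits on one of the events or when a third command arrives. The point is that the submitting thread never blocks in `enqueue` itself.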
Tagging @AerialMantis and @npmiller