
Inefficient enqueue implementation in CUDA and HIP plugins #4647

Closed
@alexbatashev

Description

On `piEnqueueKernelLaunch`, the CUDA and HIP plugins call wait on each event in the wait list at kernel submission time. For relatively narrow but deep graphs, this effectively makes SYCL queues in-order: each submission blocks until the preceding tasks have completed. See

```cpp
if (event_wait_list) {
  retError = cuda_piEnqueueEventsWait(
      command_queue, num_events_in_wait_list, event_wait_list, nullptr);
}
```
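One non-blocking alternative would be to express the dependency on the device side instead of the host side: `cuStreamWaitEvent` enqueues a wait into the stream and returns immediately, so the host thread is never blocked. A rough sketch of what the snippet above could look like (the `get()` accessor for the underlying `CUevent` and the `PI_CHECK_ERROR` wrapper are assumptions about the plugin's internals, not taken from the source):

```cpp
// Sketch only: enqueue a device-side wait per dependency instead of
// blocking the host. cuStreamWaitEvent returns immediately; subsequent
// commands on cuStream will not run until each event has completed.
if (event_wait_list) {
  for (pi_uint32 i = 0; i < num_events_in_wait_list; ++i) {
    // Assumption: the PI event wraps a CUevent reachable via get().
    retError = PI_CHECK_ERROR(
        cuStreamWaitEvent(cuStream, event_wait_list[i]->get(), 0));
  }
}
```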

Other APIs take a different approach. Level Zero (similarly to Vulkan) has the notion of a command list: commands are batched before submission, and events are not explicitly waited on. OpenCL has native support for wait lists, and the implementation is free to handle them however it likes, so again there is no explicit wait on the plugin side.

I believe the CUDA and HIP plugins could either implement some sort of thread pool to wait on events, or make use of the graph APIs to do the same thing Level Zero does, i.e. collect commands into a sub-graph and submit it once a threshold is reached or when the user explicitly waits on an event.

Tagging @AerialMantis and @npmiller

Metadata

Labels

cuda (CUDA back-end), enhancement (New feature or request), hip (Issues related to execution on HIP backend), performance (Performance related issues)
