[SYCL][CUDA][PI] Introduce multiple streams in each queue #6102
sycl/plugins/cuda/pi_cuda.cpp (Outdated)

```diff
@@ -341,10 +366,34 @@ pi_result cuda_piEventRetain(pi_event event);
 /// \endcond
 _pi_event::_pi_event(pi_command_type type, pi_context context, pi_queue queue)
 _pi_queue::native_type _pi_queue::get_compute() {
```
I suspect that with these changes the definition of `_pi_queue::native_type` may have to change, given the native type is no longer a single stream but rather two vectors of streams. As such, it would probably be safer to use `CUstream` here rather than the alias.
We have discussed this internally and decided that it makes sense to leave it as it is now, so as not to overcomplicate things. We expect that the current interface can likely handle all use cases. If a request comes in for access to multiple streams, that can be changed later.
sycl/plugins/cuda/pi_cuda.cpp (Outdated)

```diff
@@ -2357,7 +2404,8 @@ pi_result cuda_piQueueFlush(pi_queue command_queue) {
 /// \return PI_SUCCESS
 pi_result cuda_piextQueueGetNativeHandle(pi_queue queue,
                                          pi_native_handle *nativeHandle) {
-  *nativeHandle = reinterpret_cast<pi_native_handle>(queue->get());
+  ScopedContext active(queue->get_context());
+  *nativeHandle = reinterpret_cast<pi_native_handle>(queue->get_compute());
```
As mentioned in a previous comment, I suspect it would make sense to change the native handles for the CUDA backend. I am okay with leaving it as-is for now though.
@AerialMantis - What are your thoughts on this? Should the CUDA backend return the compute and transfer streams as the native handles for the queue?
Hi @steffenlarsen, apologies for the delay in replying; I was on holiday the past week. We discussed this a fair bit as well. There could be a benefit to extending the backend interop such that `get_native_queue` returns multiple streams; however, that would introduce the problem of how the user should interpret and use those streams, which would complicate the simpler use case. So we decided to leave it as it is for now, and once we have more user experience with it we should revisit this.
Thank you, @AerialMantis! I think the primary problem I could see is users extracting the `CUstream` and expecting that synchronizing on it would be the same as synchronizing on the SYCL queue, which would not necessarily be the case. Likewise, they could try to recreate the SYCL queue from the handle and expect it to act the same way as the old one. I don't think it is necessarily an immediate problem, but it should be considered while writing the backend documentation.
The HIP failure looks unrelated to the changes in this PR. Except for removing a test, all changes here are limited to the CUDA plugin.

The HIP failure should be addressed with intel/llvm-test-suite#1020

@steffenlarsen Can this be merged?
LGTM!
So far each queue had only one underlying CUstream, making it de facto in-order. This PR introduces multiple streams in each queue. To improve opportunities for concurrent execution, the streams are split into two pools: one for compute (kernels) and one for memory transfers. The streams in the pools are created dynamically when first needed. When a pool is full, previously created streams are reused. By default each queue has space for up to 128 compute streams and 64 transfer streams.

This PR also removes a test for the internal workings of the queue. Introducing dynamic stream creation puts more work into `_pi_queue::get()`, making it depend on helper functions in `pi_cuda.cpp`, so I had to move its definition from the header to `pi_cuda.cpp`. This, however, caused a problem with linking the test, as `lib_pi_cuda.so` is created using a custom linker script that only exposes functions whose names start with "pi". This improves linking performance, but prevents any other function from being tested. Looking at the other tests for the plugin internals, I noticed that other functions that would need anything from `pi_cuda.cpp` are also untested, so I deleted this test as well.

This does not change any user-facing interface, so there are no accompanying changes to the test suite.
Improves the performance of `piQueueFinish`, and therefore of `queue::wait()`, on the CUDA backend by reducing the number of `cuStreamSynchronize()` calls invoked. In most use cases this fixes the `queue::wait()` slowdown introduced in #6102. This does not change any interface, so there are no changes to the test suite.
Closely mimics the functionality of the CUDA plugin (#6102).