[SYCL][CUDA][PI] Introduce multiple streams in each queue #6102
sycl/plugins/cuda/pi_cuda.cpp (Outdated)

```diff
@@ -341,10 +366,34 @@ pi_result cuda_piEventRetain(pi_event event);
 /// \endcond
 _pi_event::_pi_event(pi_command_type type, pi_context context, pi_queue queue)
 _pi_queue::native_type _pi_queue::get_compute() {
```
I suspect that with these changes the definition of `_pi_queue::native_type` may have to change, given the native type is no longer a single stream but rather two vectors of streams. As such, it would probably be safer to use `CUstream` here rather than the alias.
We have discussed this internally and decided that it makes sense to leave it as it is now, so as not to overcomplicate things. We expect that the current interface can likely handle all use cases. If a request comes in for access to multiple streams, that can be changed later.
sycl/plugins/cuda/pi_cuda.cpp (Outdated)

```diff
@@ -2357,7 +2404,8 @@ pi_result cuda_piQueueFlush(pi_queue command_queue) {
 /// \return PI_SUCCESS
 pi_result cuda_piextQueueGetNativeHandle(pi_queue queue,
                                          pi_native_handle *nativeHandle) {
-  *nativeHandle = reinterpret_cast<pi_native_handle>(queue->get());
+  ScopedContext active(queue->get_context());
+  *nativeHandle = reinterpret_cast<pi_native_handle>(queue->get_compute());
```
As mentioned in a previous comment, I suspect it would make sense to change the native handles for the CUDA backend. I am okay with leaving it as-is for now though.
@AerialMantis - What are your thoughts on this? Should the CUDA backend return the compute and transfer streams as the native handles for the queue?
Hi @steffenlarsen, apologies for the delay in replying; I was on holiday the past week. We discussed this a fair bit as well. There could be a benefit to extending the backend interop such that `get_native_queue` returns multiple streams; however, that would introduce the problem of how the user should interpret and use those streams, which would complicate the simpler use case. So we decided to leave it as it is for now, and once we have more user experience with it we should revisit this.
Thank you, @AerialMantis! I think the primary problem I could see is users extracting the `CUstream` and expecting that synchronizing on it would be the same as synchronizing on the SYCL queue, which would not necessarily be the case. Likewise, they could try to recreate the SYCL queue from the handle and expect it to act the same way as the old one. I don't think it is necessarily an immediate problem, but it should be considered while writing the backend documentation.
The HIP failure looks unrelated to the changes in this PR. Except for removing a test, all changes here are limited to the CUDA plugin.

The HIP failure should be addressed with intel/llvm-test-suite#1020

@steffenlarsen Can this be merged?
LGTM!
So far each queue had only one underlying CUstream, making it de facto in-order. This PR introduces multiple streams in each queue. To improve opportunities for concurrent execution, the streams are split into two pools: one for compute (kernels) and one for memory transfers. The streams in the pools are created dynamically when first needed. When a pool is full, previously created streams are reused. By default each queue has space for up to 128 compute streams and 64 transfer streams.

This PR also removes a test for the internal workings of the queue. Introducing dynamic stream creation puts more work into `_pi_queue::get()`, making it depend on helper functions in `pi_cuda.cpp`, so I had to move its definition from the header to `pi_cuda.cpp`. This, however, caused a problem with linking the test, as `lib_pi_cuda.so` is created using a custom linker script that only exposes functions whose names start with "pi". This improves linking performance, but prevents any other function from being tested. Looking at the other tests for the plugin internals, I noticed that other functions that would need anything from `pi_cuda.cpp` are also untested, so I deleted this test as well.

This does not change any user-facing interface, so there are no accompanying changes to the test suite.
Improves the performance of `piQueueFinish`, and therefore of `queue::wait()`, on the CUDA backend by reducing the number of `cuStreamSynchronize()` calls invoked. In most use cases this fixes the `queue::wait()` slowdown introduced in #6102. This does not change any interface, so there are no changes to the test suite.
Closely mimics the functionality of the CUDA plugin (#6102).