[SYCL][HIP][PI] Multiple HIP streams per SYCL queue #6325
Conversation
@AerialMantis Can you get us some help with reviewing this PR?
static constexpr int default_num_compute_streams = 64;
static constexpr int default_num_transfer_streams = 16;
Just curious, what was the reason for going for half the number of streams CUDA plugin uses?
The number of concurrent compute/DMA streams on AMD devices (gfx906, gfx908, gfx90a, etc.) is limited to 4 IIRC, per device (or per GCD when a device has multiple). The vector sizes above were chosen to map efficiently onto that hardware limit of 4.
The choice of 64 and 16 was motivated by the CUDA hardware concurrent stream counts (Volta = 16, Ampere = 32) versus AMD's count of 4. Hence the CUDA plugin's values were halved, which is somewhat arbitrary, while also ensuring that default_num_compute_streams and default_num_transfer_streams remain multiples of 4.
Please let me know if you would prefer sticking with the same counts as CUDA or something else, since the choice above is not really a sound one.
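For context, the plugin hands streams out round-robin, so a pool size that is a multiple of the hardware's concurrent-stream count (4 on these AMD parts) cycles indices evenly over the hardware pipes. A minimal sketch of that selection logic (illustrative names, with `int` standing in for `hipStream_t`; not the plugin's actual members):

```cpp
#include <atomic>
#include <vector>

// Illustrative round-robin stream pool.
struct stream_pool {
  std::vector<int> streams_;          // would hold hipStream_t handles
  std::atomic<unsigned int> idx_{0};  // monotonically increasing token

  explicit stream_pool(unsigned int n) : streams_(n) {
    for (unsigned int i = 0; i < n; ++i)
      streams_[i] = static_cast<int>(i);  // stand-in for hipStreamCreate
  }

  // Each call gets the next stream; indices wrap modulo the pool size,
  // so a pool of 64 distributes evenly over 4 hardware pipes (64 % 4 == 0).
  int next() { return streams_[idx_++ % streams_.size()]; }
};
```

A pool size that is not a multiple of the hardware count would leave some pipes busier than others as the token wraps around.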
I am fine with these numbers.
@abagusetty We made another improvement to multiple streams for CUDA in #6333. Please make the same changes in this PR.
@t4c1, can you merge if everything looks good?
I cannot merge PRs. You will need approval from someone at Intel.
@steffenlarsen Can you please help with reviewing and merging this PR? It mostly mimics the CUDA plugin.
Very sorry for the delayed review.
Thanks a lot for caring about thread-safety; I have several questions/comments.
std::lock_guard compute_sync_guard(compute_stream_sync_mutex_);
std::lock_guard<std::mutex> compute_guard(compute_stream_mutex_);
Could you please use std::scoped_lock here to lock both mutexes simultaneously? It helps avoid deadlock.
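A sketch of the suggested change, assuming a hypothetical surrounding function (the two mutex names are taken from the diff above; the counter is a stand-in for the real synchronization work):

```cpp
#include <mutex>

std::mutex compute_stream_sync_mutex_;
std::mutex compute_stream_mutex_;
int synced_count = 0;  // stand-in for the actual sync work

void sync_compute_streams() {
  // std::scoped_lock (C++17) acquires both mutexes via the std::lock
  // deadlock-avoidance algorithm, so callers need no manual lock-ordering
  // convention between the two.
  std::scoped_lock guard(compute_stream_sync_mutex_, compute_stream_mutex_);
  ++synced_count;  // real code would synchronize compute_streams_ here
}
```

With two separate std::lock_guard declarations, two threads taking the mutexes in opposite orders can deadlock; std::scoped_lock removes that failure mode.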
unsigned int size = static_cast<unsigned int>(transfer_streams_.size());
if (size > 0) {
I think this should also be under std::lock_guard<std::mutex> transfer_guard(transfer_stream_mutex_);
to be thread-safe.
No, size does not change.
if (delay_compute_[stream_i % compute_streams_.size()]) {
  delay_compute_[stream_i % compute_streams_.size()] = false;
Should we guard access to the delay_compute_ with a mutex too?
Not really. A race condition on delay_compute_ is unlikely, and even when it happens it will not affect the correctness of the execution.
if (stream_token) {
  *stream_token = stream_i;
}
return compute_streams_[stream_i % compute_streams_.size()];
It looks like this access to compute_streams_ must be guarded with the mutex compute_stream_mutex_ too.
We need a lock just for accessing streams, but the size is constant.
  }
};
{
  unsigned int size = static_cast<unsigned int>(compute_streams_.size());
Should we read the size of compute_streams_ under lock?
No, see the explanation above.
// commands enqueued after it and the one we are about to enqueue to run
// concurrently
bool is_last_command =
    (compute_stream_idx_ - stream_token) <= compute_streams_.size();
Shouldn't we read the size of compute_streams_ under lock?
The size does not change after it is set. However, we cannot mark compute_streams_ const, as the streams themselves are initialized later.
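The pattern being described here — fix the container's size at construction, then fill the elements lazily under a lock — can be sketched as follows (illustrative names only, with `int` standing in for `hipStream_t`; this is not the plugin's actual code):

```cpp
#include <mutex>
#include <vector>

struct queue_sketch {
  std::vector<int> compute_streams_;  // size fixed at construction
  std::mutex compute_stream_mutex_;   // guards element initialization
  unsigned int num_created_ = 0;      // streams created so far

  explicit queue_sketch(unsigned int n) : compute_streams_(n) {}

  // size() never changes after the constructor runs, so reading it
  // without the lock is safe; only element creation needs the mutex.
  unsigned int pool_size() const {
    return static_cast<unsigned int>(compute_streams_.size());
  }

  int get(unsigned int token) {
    unsigned int i = token % compute_streams_.size();  // lock-free read
    std::lock_guard<std::mutex> guard(compute_stream_mutex_);
    for (; num_created_ <= i; ++num_created_)
      compute_streams_[num_created_] =
          static_cast<int>(num_created_);  // stand-in for hipStreamCreate
    return compute_streams_[i];
  }
};
```

Because the vector is never resized, concurrent lock-free calls to size() cannot observe a torn or stale length; only the lazily created elements need mutex protection.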
// When compute_stream_sync_mutex_ and compute_stream_mutex_ both need to be
// locked at the same time, compute_stream_sync_mutex_ should be locked first
// to avoid deadlocks
std::mutex compute_stream_sync_mutex_;
It looks like compute_stream_mutex_ is used to guard access to compute_streams_, and transfer_stream_mutex_ is used to guard access to transfer_streams_.
Could you please clarify what the purpose of compute_stream_sync_mutex_ is? It is probably worth having a comment for it.
And should we have an additional mutex for delay_compute_?
Could you please clarify what the purpose of compute_stream_sync_mutex_ is?

It ensures a compute stream is not reused while the streams are being synchronized, since that could mess up the synchronization.

And should we have an additional mutex for delay_compute_?

We do not need one: a race condition on delay_compute_ is unlikely, and even when it happens it will not affect the correctness of the execution.
@abagusetty, @t4c1, can you check the HIP failures?
I am looking into the HIP failures. Will update soon.
@t4c1 Can you help with parsing the CUDA test-suite failures? They all look unrelated, though; could you confirm?
@abagusetty - We are seeing the CUDA failure in other PRs as well, so it is unlikely to be caused by this. @t4c1 - Would you mind giving this another pass?
LGTM, the CUDA assert failure seems unrelated to changes in this PR. The failure might be coming from #6370
Closely mimics the functionality of the CUDA plugin (#6102).