[SYCL][HIP][PI] Multiple HIP streams per SYCL queue #6325


Merged: 10 commits merged on Jul 4, 2022

Conversation

abagusetty
Contributor

Closely mimics the functionality of the CUDA plugin (#6102).

@abagusetty abagusetty requested a review from a team as a code owner June 19, 2022 02:51
@abagusetty abagusetty requested a review from againull June 19, 2022 02:51
@abagusetty
Contributor Author

@AerialMantis Can you help us get this PR reviewed?

t4c1
t4c1 previously approved these changes Jun 20, 2022
Contributor

@t4c1 t4c1 left a comment


LGTM!

Note: this also mimics changes from #6201 and #6224.

Comment on lines +375 to +376
static constexpr int default_num_compute_streams = 64;
static constexpr int default_num_transfer_streams = 16;
Contributor

Just curious, what was the reason for going with half the number of streams the CUDA plugin uses?

Contributor Author


IIRC, the number of concurrent compute/DMA streams on AMD devices (gfx906, gfx908, gfx90a, etc.) is limited to 4, per device (or per GCD when a device has multiple). The sizes of the above vectors were chosen to map efficiently onto that hardware limit of 4.

The choice of 64 and 16 was motivated by the CUDA hardware concurrent stream counts (Volta = 16, Ampere = 32) versus AMD's count of 4, hence halving the CUDA plugin's values. The halving is somewhat arbitrary, but it also keeps default_num_compute_streams and default_num_transfer_streams as multiples of 4.

Please let me know if you prefer sticking with the same counts as CUDA, or something else, since the choice above is not a particularly principled one.
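The multiples-of-4 relationship described above could even be encoded as a compile-time check. A minimal sketch, assuming the constant names from the diff hunk; hw_concurrent_streams is a hypothetical name for the AMD hardware limit of 4 mentioned in this thread:

```cpp
// Hypothetical sketch: enforce that the default pool sizes map evenly
// onto the hardware's concurrent stream count. The default_* names
// mirror the PR's constants; hw_concurrent_streams is illustrative.
static constexpr int hw_concurrent_streams = 4;
static constexpr int default_num_compute_streams = 64;
static constexpr int default_num_transfer_streams = 16;

static_assert(default_num_compute_streams % hw_concurrent_streams == 0,
              "compute stream pool should map evenly onto hardware queues");
static_assert(default_num_transfer_streams % hw_concurrent_streams == 0,
              "transfer stream pool should map evenly onto hardware queues");
```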

Contributor

I am fine with these numbers.

@t4c1
Contributor

t4c1 commented Jun 21, 2022

@abagusetty We made another improvement to multiple streams for CUDA in #6333. Please make the same changes in this PR.

@abagusetty
Contributor Author

@t4c1 Can you merge if everything looks good?

@t4c1
Contributor

t4c1 commented Jun 23, 2022

I cannot merge PRs. You will need approval from someone at Intel.

@abagusetty
Contributor Author

@steffenlarsen Can you please help review and merge this PR? It mostly mimics the CUDA plugin.

Contributor

@againull againull left a comment


Very sorry for the delayed review.

Thanks a lot for caring about thread-safety. I have several questions/comments.

Comment on lines +506 to +507
std::lock_guard compute_sync_guard(compute_stream_sync_mutex_);
std::lock_guard<std::mutex> compute_guard(compute_stream_mutex_);
Contributor

Could you please use std::scoped_lock here to lock both mutexes simultaneously? It avoids deadlock.
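A minimal sketch of the suggestion, assuming the two mutex names from the diff hunk; the function name is hypothetical, not the PR's actual code:

```cpp
#include <mutex>

// Sketch: std::scoped_lock acquires both mutexes using the std::lock
// deadlock-avoidance algorithm, so no manual lock-ordering rule is needed.
// Two separate std::lock_guard objects, by contrast, can deadlock if
// another thread takes the same mutexes in the opposite order.
std::mutex compute_stream_sync_mutex_;
std::mutex compute_stream_mutex_;

void sync_streams_safely() {
  std::scoped_lock guard(compute_stream_sync_mutex_, compute_stream_mutex_);
  // ... synchronize/reset streams while both locks are held ...
}
```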

Comment on lines +527 to +528
unsigned int size = static_cast<unsigned int>(transfer_streams_.size());
if (size > 0) {
Contributor

I think this should also be under std::lock_guard<std::mutex> transfer_guard(transfer_stream_mutex_); to be thread-safe.

Contributor

No, size does not change.

Comment on lines +445 to +446
if (delay_compute_[stream_i % compute_streams_.size()]) {
delay_compute_[stream_i % compute_streams_.size()] = false;
Contributor

Should we guard access to the delay_compute_ with a mutex too?

Contributor

Not really: a race condition on delay_compute_ is unlikely, and even if it happens it will not affect the correctness of execution.

if (stream_token) {
*stream_token = stream_i;
}
return compute_streams_[stream_i % compute_streams_.size()];
Contributor

It looks like this access to compute_streams_ must also be guarded by the mutex compute_stream_mutex_.

Contributor

We only need a lock for accessing the streams themselves; the size is constant.
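The pattern being discussed can be sketched roughly as follows. This is an illustration, not the PR's exact code: the member names mirror the PR, but the pool stores plain ints as a stand-in for hipStream_t, and get_next_compute_stream is a hypothetical simplification of the real selection logic.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

// Illustrative sketch: the pool's size is fixed after construction, so
// size() can be read without a lock; only the element access, which may
// race with lazy stream (re)initialization, needs compute_stream_mutex_.
struct stream_pool {
  std::vector<int> compute_streams_;                 // stand-in for hipStream_t
  std::mutex compute_stream_mutex_;                  // guards element access
  std::atomic<std::uint32_t> compute_stream_idx_{0}; // round-robin counter

  int get_next_compute_stream() {
    std::uint32_t token = compute_stream_idx_.fetch_add(1);
    std::lock_guard<std::mutex> guard(compute_stream_mutex_);
    return compute_streams_[token % compute_streams_.size()];
  }
};
```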

}
};
{
unsigned int size = static_cast<unsigned int>(compute_streams_.size());
Contributor

Should we read size of compute_streams_ under lock?

Contributor

No, see the explanation above.

// commands enqueued after it and the one we are about to enqueue to run
// concurrently
bool is_last_command =
(compute_stream_idx_ - stream_token) <= compute_streams_.size();
Contributor

Shouldn't we read size of compute_streams_ under lock?

Contributor

The size does not change after it is set. However, we cannot mark compute_streams_ const, as the streams are initialized later.
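The check quoted above measures the distance between the monotonically increasing stream index and a command's token. A simplified illustration, assuming unsigned counters (the function name and standalone form are hypothetical, not the PR's code):

```cpp
#include <cstddef>
#include <cstdint>

// Simplified illustration: a command is considered "recent" if no more
// than pool_size commands were issued after it. Unsigned subtraction
// keeps the distance correct even after the counter wraps around.
bool is_last_command(std::uint32_t current_idx, std::uint32_t token,
                     std::size_t pool_size) {
  return (current_idx - token) <= pool_size;
}
```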

// When compute_stream_sync_mutex_ and compute_stream_mutex_ both need to be
// locked at the same time, compute_stream_sync_mutex_ should be locked first
// to avoid deadlocks
std::mutex compute_stream_sync_mutex_;
Contributor
@againull againull Jun 24, 2022

It looks like compute_stream_mutex_ guards access to compute_streams_, and transfer_stream_mutex_ guards access to transfer_streams_.
Could you please clarify the purpose of compute_stream_sync_mutex_? It is probably worth a comment.
And should we have an additional mutex for delay_compute_?

Contributor

> Could you please clarify what is the purpose of compute_stream_sync_mutex_?

It ensures a compute stream is not reused while the streams are being synchronized. Reuse at that point could break the synchronization.

> And should we have an additional mutex for delay_compute_?

We do not need one: a race condition on delay_compute_ is unlikely, and even if it happens it will not affect the correctness of execution.
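The role of the sync mutex can be sketched as follows. This is an illustration under assumptions, not the PR's exact code: the struct and its int streams are stand-ins, and the real plugin would call hipStreamSynchronize in the loop.

```cpp
#include <mutex>
#include <vector>

// Illustrative sketch: synchronizing the whole pool must exclude
// concurrent stream reuse, otherwise a command enqueued mid-iteration
// could be missed by the synchronization pass. std::scoped_lock takes
// both mutexes atomically; the ordering comment in the diff
// (compute_stream_sync_mutex_ first) covers paths that lock separately.
struct queue_sketch {
  std::mutex compute_stream_sync_mutex_; // excludes reuse during sync
  std::mutex compute_stream_mutex_;      // guards compute_streams_ access
  std::vector<int> compute_streams_;     // stand-in for hipStream_t handles

  void sync_all_streams() {
    std::scoped_lock guard(compute_stream_sync_mutex_, compute_stream_mutex_);
    for (int &s : compute_streams_) {
      (void)s; // hipStreamSynchronize(s) in the real plugin
    }
  }
};
```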

@againull againull self-requested a review June 27, 2022 19:24
againull
againull previously approved these changes Jun 27, 2022
@pvchupin
Contributor

@abagusetty, @t4c1, can you check the HIP failures?

@abagusetty
Contributor Author

> @abagusetty, @t4c1, can you check the HIP failures?

I am looking into the HIP failures. Will update soon.

@abagusetty
Contributor Author

@pvchupin @t4c1 The HIP tests are now fixed.
@againull The last commit made your review stale. Could you please take one more pass?

@againull againull self-requested a review June 30, 2022 15:53
@abagusetty
Contributor Author

@t4c1 Can you help with parsing the CUDA test-suite failures? They all look unrelated to this PR; can you confirm?

@steffenlarsen
Contributor

@abagusetty - We are seeing the CUDA failure in other PRs as well, so it is unlikely to be caused by this.

@t4c1 - Would you mind giving this another pass?

@steffenlarsen steffenlarsen requested a review from t4c1 July 1, 2022 08:18
Contributor

@t4c1 t4c1 left a comment


LGTM. The CUDA assert failure seems unrelated to the changes in this PR; it might be coming from #6370.

@steffenlarsen steffenlarsen merged commit e0c40a9 into intel:sycl Jul 4, 2022
@abagusetty abagusetty deleted the hip_multiple_streams_per_queue branch August 8, 2022 18:03