[SYCL][HIP][PI] Multiple HIP streams per SYCL queue #6325


Merged: 10 commits merged on Jul 4, 2022

Conversation

abagusetty
Contributor

Closely mimics the functionality of the CUDA plugin (#6102).

@abagusetty abagusetty requested a review from a team as a code owner June 19, 2022 02:51
@abagusetty abagusetty requested a review from againull June 19, 2022 02:51
@abagusetty
Contributor Author

@AerialMantis Can you help us get this PR reviewed?

t4c1
t4c1 previously approved these changes Jun 20, 2022
Contributor

@t4c1 t4c1 left a comment


LGTM!

Note: this also mimics changes from #6201 and #6224.

Comment on lines +375 to +376
static constexpr int default_num_compute_streams = 64;
static constexpr int default_num_transfer_streams = 16;
Contributor

Just curious, what was the reason for going with half the number of streams the CUDA plugin uses?

Contributor Author


IIRC, the number of concurrent compute/DMA streams on AMD devices (gfx906, gfx908, gfx90a, etc.) is limited to 4, per device (or per GCD when a device has multiple). The sizes of the above vectors were chosen to map efficiently onto that hardware limit of 4.

The choice of 64 and 16 was motivated by the CUDA hardware concurrent stream counts (Volta = 16, Ampere = 32) versus AMD's count of 4, hence halving the CUDA plugin's values. The halving is somewhat arbitrary, but it also keeps default_num_compute_streams and default_num_transfer_streams as multiples of 4.

Please let me know if you prefer sticking with the same counts as CUDA, or something else, since the choice above is not a particularly principled one.
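The multiples-of-4 relationship described above could even be encoded as a compile-time check. A minimal sketch, assuming the constant names from the diff hunk; hw_concurrent_streams is a hypothetical name for the AMD hardware limit of 4 mentioned in this thread:

```cpp
// Hypothetical sketch: enforce that the default pool sizes map evenly
// onto the hardware's concurrent stream count. The default_* names
// mirror the PR's constants; hw_concurrent_streams is illustrative.
static constexpr int hw_concurrent_streams = 4;
static constexpr int default_num_compute_streams = 64;
static constexpr int default_num_transfer_streams = 16;

static_assert(default_num_compute_streams % hw_concurrent_streams == 0,
              "compute stream pool should map evenly onto hardware queues");
static_assert(default_num_transfer_streams % hw_concurrent_streams == 0,
              "transfer stream pool should map evenly onto hardware queues");
```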

Contributor

I am fine with these numbers.

@t4c1
Contributor

t4c1 commented Jun 21, 2022

@abagusetty We made another improvement to multiple streams for CUDA in #6333. Please make the same changes in this PR.

@abagusetty
Contributor Author

@t4c1 Can you merge if everything looks good?

@t4c1
Contributor

t4c1 commented Jun 23, 2022

I cannot merge PRs. You will need approval from someone at Intel.

@abagusetty
Contributor Author

@steffenlarsen Can you please help review and merge this PR? It mostly mimics the CUDA plugin.

Contributor

@againull againull left a comment


Very sorry for the delayed review.

Thanks a lot for caring about thread-safety. I have several questions/comments.

Comment on lines +506 to +507
std::lock_guard compute_sync_guard(compute_stream_sync_mutex_);
std::lock_guard<std::mutex> compute_guard(compute_stream_mutex_);
Contributor

Could you please use std::scoped_lock here to lock both mutexes simultaneously? It avoids deadlock.
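A minimal sketch of the suggestion, assuming the two mutex names from the diff hunk; the function name is hypothetical, not the PR's actual code:

```cpp
#include <mutex>

// Sketch: std::scoped_lock acquires both mutexes using the std::lock
// deadlock-avoidance algorithm, so no manual lock-ordering rule is needed.
// Two separate std::lock_guard objects, by contrast, can deadlock if
// another thread takes the same mutexes in the opposite order.
std::mutex compute_stream_sync_mutex_;
std::mutex compute_stream_mutex_;

void sync_streams_safely() {
  std::scoped_lock guard(compute_stream_sync_mutex_, compute_stream_mutex_);
  // ... synchronize/reset streams while both locks are held ...
}
```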

Comment on lines +527 to +528
unsigned int size = static_cast<unsigned int>(transfer_streams_.size());
if (size > 0) {
Contributor

I think this should also be under std::lock_guard<std::mutex> transfer_guard(transfer_stream_mutex_); to be thread-safe.

Contributor

No, size does not change.

Comment on lines +445 to +446
if (delay_compute_[stream_i % compute_streams_.size()]) {
delay_compute_[stream_i % compute_streams_.size()] = false;
Contributor

Should we guard access to the delay_compute_ with a mutex too?

Contributor

Not really: a race condition on delay_compute_ is unlikely, and even if it happens it will not affect the correctness of execution.

if (stream_token) {
*stream_token = stream_i;
}
return compute_streams_[stream_i % compute_streams_.size()];
Contributor

It looks like this access to compute_streams_ must also be guarded by the mutex compute_stream_mutex_.

Contributor

We only need a lock for accessing the streams themselves; the size is constant.
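The pattern being discussed can be sketched roughly as follows. This is an illustration, not the PR's exact code: the member names mirror the PR, but the pool stores plain ints as a stand-in for hipStream_t, and get_next_compute_stream is a hypothetical simplification of the real selection logic.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

// Illustrative sketch: the pool's size is fixed after construction, so
// size() can be read without a lock; only the element access, which may
// race with lazy stream (re)initialization, needs compute_stream_mutex_.
struct stream_pool {
  std::vector<int> compute_streams_;                 // stand-in for hipStream_t
  std::mutex compute_stream_mutex_;                  // guards element access
  std::atomic<std::uint32_t> compute_stream_idx_{0}; // round-robin counter

  int get_next_compute_stream() {
    std::uint32_t token = compute_stream_idx_.fetch_add(1);
    std::lock_guard<std::mutex> guard(compute_stream_mutex_);
    return compute_streams_[token % compute_streams_.size()];
  }
};
```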

}
};
{
unsigned int size = static_cast<unsigned int>(compute_streams_.size());
Contributor

Should we read size of compute_streams_ under lock?

Contributor

No, see the explanation above.

// commands enqueued after it and the one we are about to enqueue to run
// concurrently
bool is_last_command =
(compute_stream_idx_ - stream_token) <= compute_streams_.size();
Contributor

Shouldn't we read size of compute_streams_ under lock?

Contributor

The size does not change after it is set. However, we cannot mark compute_streams_ const, as the streams are initialized later.
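The check quoted above measures the distance between the monotonically increasing stream index and a command's token. A simplified illustration, assuming unsigned counters (the function name and standalone form are hypothetical, not the PR's code):

```cpp
#include <cstddef>
#include <cstdint>

// Simplified illustration: a command is considered "recent" if no more
// than pool_size commands were issued after it. Unsigned subtraction
// keeps the distance correct even after the counter wraps around.
bool is_last_command(std::uint32_t current_idx, std::uint32_t token,
                     std::size_t pool_size) {
  return (current_idx - token) <= pool_size;
}
```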

// When compute_stream_sync_mutex_ and compute_stream_mutex_ both need to be
// locked at the same time, compute_stream_sync_mutex_ should be locked first
// to avoid deadlocks
std::mutex compute_stream_sync_mutex_;
Contributor
@againull againull Jun 24, 2022

It looks like compute_stream_mutex_ guards access to compute_streams_, and transfer_stream_mutex_ guards access to transfer_streams_.
Could you please clarify the purpose of compute_stream_sync_mutex_? It is probably worth a comment.
And should we have an additional mutex for delay_compute_?

Contributor

> Could you please clarify what is the purpose of compute_stream_sync_mutex_?

It ensures a compute stream is not reused while the streams are being synchronized. Reuse at that point could break the synchronization.

> And should we have an additional mutex for delay_compute_?

We do not need one: a race condition on delay_compute_ is unlikely, and even if it happens it will not affect the correctness of execution.
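The role of the sync mutex can be sketched as follows. This is an illustration under assumptions, not the PR's exact code: the struct and its int streams are stand-ins, and the real plugin would call hipStreamSynchronize in the loop.

```cpp
#include <mutex>
#include <vector>

// Illustrative sketch: synchronizing the whole pool must exclude
// concurrent stream reuse, otherwise a command enqueued mid-iteration
// could be missed by the synchronization pass. std::scoped_lock takes
// both mutexes atomically; the ordering comment in the diff
// (compute_stream_sync_mutex_ first) covers paths that lock separately.
struct queue_sketch {
  std::mutex compute_stream_sync_mutex_; // excludes reuse during sync
  std::mutex compute_stream_mutex_;      // guards compute_streams_ access
  std::vector<int> compute_streams_;     // stand-in for hipStream_t handles

  void sync_all_streams() {
    std::scoped_lock guard(compute_stream_sync_mutex_, compute_stream_mutex_);
    for (int &s : compute_streams_) {
      (void)s; // hipStreamSynchronize(s) in the real plugin
    }
  }
};
```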

@againull againull self-requested a review June 27, 2022 19:24
againull
againull previously approved these changes Jun 27, 2022
@pvchupin
Contributor

@abagusetty, @t4c1, can you check the HIP failures?

@abagusetty
Contributor Author

> @abagusetty, @t4c1, can you check the HIP failures?

I am looking into the HIP failures. Will update soon.

@abagusetty
Contributor Author

@pvchupin @t4c1 The HIP tests are now fixed.
@againull The last commit made your review stale. Could you please take one more pass?

@againull againull self-requested a review June 30, 2022 15:53
@abagusetty
Contributor Author

@t4c1 Can you help with parsing the CUDA test-suite failures? They all look unrelated to this PR; can you confirm?

@steffenlarsen
Contributor

@abagusetty - We are seeing the CUDA failure in other PRs as well, so it is unlikely to be caused by this.

@t4c1 - Would you mind giving this another pass?

@steffenlarsen steffenlarsen requested a review from t4c1 July 1, 2022 08:18
Contributor

@t4c1 t4c1 left a comment


LGTM. The CUDA assert failure seems unrelated to the changes in this PR; it might be coming from #6370.

@steffenlarsen steffenlarsen merged commit e0c40a9 into intel:sycl Jul 4, 2022
@abagusetty abagusetty deleted the hip_multiple_streams_per_queue branch August 8, 2022 18:03