
[SYCL][CUDA] Fix context setup for device infos #8124


Merged
merged 2 commits into intel:sycl
Feb 2, 2023

Conversation

npmiller
Contributor

Extend the ScopedContext to work with just a device; in that case it will simply use the primary context.

This is helpful for entry points that only have a pi_device and no pi_context but still need to make CUDA calls that require an active context, such as the device info queries.

This addresses a bug where getting the amount of free memory before creating any queues or contexts would simply crash.

This was partially solved in a previous PR (#7906); however, that PR released the primary context while leaving it active on the current thread, so getting the device info twice in a row would end up crashing again, since the second query would just use the active but already released primary context.

This should address: #8117
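
To illustrate the failure mode (not a test from this PR): a minimal sketch, assuming a CUDA GPU and the `sycl_ext_intel_device_info` extension's `free_memory` query, of the kind of code that used to crash when no queue or context had been created yet.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Only a device is created here: no sycl::queue and no sycl::context yet,
  // so the CUDA plugin has no pi_context it could activate.
  sycl::device Dev{sycl::gpu_selector_v};

  // With the bug, this free-memory query crashed because no CUDA context was
  // active on the calling thread; with this patch the plugin activates the
  // device's primary context for the query.
  if (Dev.has(sycl::aspect::ext_intel_free_memory)) {
    auto FreeBytes =
        Dev.get_info<sycl::ext::intel::info::device::free_memory>();
    std::cout << "Free device memory: " << FreeBytes << " bytes\n";
  }
}
```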

@npmiller npmiller requested a review from a team as a code owner January 27, 2023 11:04
@npmiller npmiller requested a review from jchlanda January 27, 2023 11:04
@npmiller npmiller temporarily deployed to aws January 27, 2023 11:28 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 27, 2023 12:52 — with GitHub Actions Inactive
@zjin-lcf
Contributor

Thank you for explaining the cause!

@abagusetty
Contributor

Thanks! Works on a multi-GPU node.

@npmiller npmiller temporarily deployed to aws January 30, 2023 14:23 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 30, 2023 17:13 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 30, 2023 18:30 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 31, 2023 11:57 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 31, 2023 13:12 — with GitHub Actions Inactive
@npmiller
Contributor Author

@intel/llvm-reviewers-cuda does anyone know what is going on with the CUDA job on this PR? It keeps failing, but in the logs it looks like it doesn't run at all; there's nothing past the checkout step.

@steffenlarsen
Contributor

@intel/llvm-reviewers-cuda does anyone know what is going on with the CUDA job on this PR? It keeps failing, but in the logs it looks like it doesn't run at all; there's nothing past the checkout step.

I saw the same happen in another PR recently, so it may just be the CI being temperamental. Did your recent rebase use the current intel/llvm sycl branch?

@npmiller
Contributor Author

Yeah, I rebased it yesterday and again this morning; still no luck.

@npmiller npmiller temporarily deployed to aws January 31, 2023 14:24 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 31, 2023 15:42 — with GitHub Actions Inactive
@bader
Contributor

bader commented Jan 31, 2023

https://github.com/intel/llvm/actions/runs/4053625730

The self-hosted runner: aws-cuda_4053625730-2_g5.2xlarge_on-demand lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I don't see any problems with this job in other pre-commits launched within the last 14 hours.
I think this issue might be related to the changes in the patch. Could it cause "starvation for CPU/Memory" during the LIT test run?

@npmiller npmiller temporarily deployed to aws February 1, 2023 11:06 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 1, 2023 11:53 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 1, 2023 12:59 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 1, 2023 14:17 — with GitHub Actions Inactive
@npmiller
Contributor Author

npmiller commented Feb 1, 2023

https://github.com/intel/llvm/actions/runs/4053625730

The self-hosted runner: aws-cuda_4053625730-2_g5.2xlarge_on-demand lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I don't see any problems with this job in other pre-commits launched within the last 14 hours. I think this issue might be related to the changes in the patch. Could it cause "starvation for CPU/Memory" during the LIT test run?

After testing it, you are right: it seems this patch is causing issues I had missed, which end up making the job time out. I will look into it further, thanks!

@npmiller npmiller temporarily deployed to aws February 1, 2023 16:26 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 1, 2023 17:45 — with GitHub Actions Inactive
@abagusetty
Contributor

@npmiller I was wondering if this test would be helpful to include in your PR as well. I have had it in one of my branches for other functionality:
abagusetty@e1128b2

@npmiller
Contributor Author

npmiller commented Feb 1, 2023

Yes! We definitely need more testing for this, thank you!

So I actually know what the issue is: in this patch I originally moved the ScopedContext to cover all the device info cases, which meant that the primary context was created and released for all of them. It ended up being created and then destroyed on every query, which is very expensive and caused certain tests that look up properties a lot to time out.

I've updated this to only use the ScopedContext for the free memory property, and that fixes the CI issue, but querying that specific property will still be very slow.

This works but is clearly not ideal, so I'm currently evaluating a rework of the way we handle contexts in the plugin; it's overly complicated, likely inefficient, and hacky in some parts.
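
For illustration only, a rough sketch of the device-only scoped-context pattern and of the free-memory query using it via the CUDA driver API. The class name `ScopedDeviceContext` and the helper `queryFreeGlobalMemory` are made up for this sketch and error handling is elided; the plugin's actual `ScopedContext` handles more cases than this.

```cpp
#include <cuda.h>
#include <cstddef>

// RAII helper: retain the device's primary context, make it current, and on
// scope exit restore the previous context rather than leaving a released
// context active on the thread (the bug described in this PR).
class ScopedDeviceContext {
  CUdevice Dev;
  CUcontext Previous = nullptr;

public:
  explicit ScopedDeviceContext(CUdevice Device) : Dev(Device) {
    CUcontext Primary = nullptr;
    cuDevicePrimaryCtxRetain(&Primary, Dev);
    cuCtxGetCurrent(&Previous);
    cuCtxSetCurrent(Primary);
  }

  ~ScopedDeviceContext() {
    cuCtxSetCurrent(Previous);      // don't leave the primary context active
    cuDevicePrimaryCtxRelease(Dev); // drop our reference to it
  }
};

// Only the free-memory query pays the cost of activating the primary context;
// other device info queries don't need an active CUDA context.
size_t queryFreeGlobalMemory(CUdevice Device) {
  ScopedDeviceContext Active(Device);
  size_t FreeBytes = 0, TotalBytes = 0;
  cuMemGetInfo(&FreeBytes, &TotalBytes); // requires an active CUDA context
  return FreeBytes;
}
```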

@bader bader requested a review from steffenlarsen February 1, 2023 19:01
@npmiller npmiller temporarily deployed to aws February 2, 2023 11:06 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 2, 2023 11:39 — with GitHub Actions Inactive
@steffenlarsen steffenlarsen merged commit 7e6b3c2 into intel:sycl Feb 2, 2023
npmiller added a commit to npmiller/llvm that referenced this pull request Feb 6, 2023
This patch moves the CUDA context from the PI context to the PI device,
and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a
single device, and that they are required to be active on a thread for
most calls to the CUDA driver API.

As shown in intel#8124 and intel#7526, the current mapping of
CUDA context to PI context causes issues for device-based entry points
that still need to call the CUDA APIs. We have workarounds to solve that,
but they're a bit hacky, inefficient, and have a lot of edge-case
issues.

The peer-to-peer interface proposal in intel#6104 is also device-based,
but enabling peer-to-peer for CUDA is done on the CUDA contexts,
so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context
from the SYCL context and simply managing the CUDA contexts in the
devices. It also changes the CUDA context management to always use the
primary context.

This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device based entry points
* Likely more efficient in the general case, with fewer opportunities to
  accidentally cause costly CUDA context switches.
* Easier and likely more efficient interactions with CUDA runtime
  applications.
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks from the previous approach:

* Drops support for `make_context` interop, as there is no sensible "native handle"
  to pass in (`get_native` is still supported fine).
* No opportunity for users to separate their work into different CUDA
  contexts. It's unclear if there's any actual use case for this; it
  seems very uncommon in CUDA codebases to have multiple CUDA contexts
  for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we
could revisit if we run into an edge case that would need more
fine-grained CUDA context management.
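
A minimal sketch of the direction described above, with the primary context retained by the device for its whole lifetime; the struct and member names here are illustrative, not the plugin's actual declarations.

```cpp
#include <cuda.h>

struct cuda_device {
  CUdevice Dev;
  CUcontext PrimaryCtx = nullptr; // retained once, for the device's lifetime

  explicit cuda_device(CUdevice Device) : Dev(Device) {
    // Every entry point that needs CUDA calls can simply make this context
    // current, instead of juggling per-SYCL-context CUcontext objects.
    cuDevicePrimaryCtxRetain(&PrimaryCtx, Dev);
  }

  ~cuda_device() { cuDevicePrimaryCtxRelease(Dev); }

  // Non-copyable: this object owns the primary context reference.
  cuda_device(const cuda_device &) = delete;
  cuda_device &operator=(const cuda_device &) = delete;
};
```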
bader pushed a commit that referenced this pull request Feb 9, 2023