
[SYCL][CUDA] Fix context setup for device infos #8124


Merged
merged 2 commits into intel:sycl
Feb 2, 2023

Conversation

npmiller
Contributor

Extend the ScopedContext to work with just a device; in that case it will simply use the primary context.

This is helpful for entry points that only have a pi_device and no pi_context but still need to make CUDA calls that require an active context, such as the device info queries.

This addresses a bug where getting the amount of free memory before creating any queues or contexts would simply crash.

This was partially solved in a previous PR (#7906); however, that PR released the primary context while leaving it active on the current thread, so getting the device info twice in a row would end up crashing again, since the second query would just use the active but already released primary context.

This should address: #8117
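
To illustrate the failure mode (not a test from this PR): a minimal sketch, assuming a CUDA GPU and the `sycl_ext_intel_device_info` extension's `free_memory` query, of the kind of code that used to crash when no queue or context had been created yet.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Only a device is created here: no sycl::queue and no sycl::context yet,
  // so the CUDA plugin has no pi_context it could activate.
  sycl::device Dev{sycl::gpu_selector_v};

  // With the bug, this free-memory query crashed because no CUDA context was
  // active on the calling thread; with this patch the plugin activates the
  // device's primary context for the query.
  if (Dev.has(sycl::aspect::ext_intel_free_memory)) {
    auto FreeBytes =
        Dev.get_info<sycl::ext::intel::info::device::free_memory>();
    std::cout << "Free device memory: " << FreeBytes << " bytes\n";
  }
}
```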

@npmiller npmiller requested a review from a team as a code owner January 27, 2023 11:04
@npmiller npmiller requested a review from jchlanda January 27, 2023 11:04
@npmiller npmiller temporarily deployed to aws January 27, 2023 11:28 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 27, 2023 12:52 — with GitHub Actions Inactive
@zjin-lcf
Contributor

Thank you for explaining the cause!

@abagusetty
Contributor

Thanks! Works on a multi-GPU node.

@npmiller npmiller temporarily deployed to aws January 30, 2023 14:23 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 30, 2023 17:13 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 30, 2023 18:30 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 31, 2023 11:57 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 31, 2023 13:12 — with GitHub Actions Inactive
@npmiller
Contributor Author

@intel/llvm-reviewers-cuda does anyone know what is going on with the CUDA job on this PR? It keeps failing, but in the logs it looks like it doesn't run at all; there's nothing past the checkout step.

@steffenlarsen
Contributor

@intel/llvm-reviewers-cuda does anyone know what is going on with the CUDA job on this PR? It keeps failing, but in the logs it looks like it doesn't run at all; there's nothing past the checkout step.

I saw the same happen in another PR recently, so it may just be the CI being temperamental. Did your recent rebase use the current intel/llvm sycl branch?

@npmiller
Contributor Author

Yeah, I rebased it yesterday and again this morning; still no luck.

@npmiller npmiller temporarily deployed to aws January 31, 2023 14:24 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 31, 2023 15:42 — with GitHub Actions Inactive
@bader
Contributor

bader commented Jan 31, 2023

https://github.com/intel/llvm/actions/runs/4053625730

The self-hosted runner: aws-cuda_4053625730-2_g5.2xlarge_on-demand lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I don't see any problems with this job in other pre-commits launched within the last 14 hours.
I think this issue might be related to the changes in the patch. Could it cause "starvation for CPU/Memory" during the LIT test run?

@npmiller npmiller temporarily deployed to aws February 1, 2023 11:06 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 1, 2023 11:53 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 1, 2023 12:59 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 1, 2023 14:17 — with GitHub Actions Inactive
@npmiller
Contributor Author

npmiller commented Feb 1, 2023

https://github.com/intel/llvm/actions/runs/4053625730

The self-hosted runner: aws-cuda_4053625730-2_g5.2xlarge_on-demand lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I don't see any problems with this job in other pre-commits launched within the last 14 hours. I think this issue might be related to the changes in the patch. Could it cause "starvation for CPU/Memory" during the LIT test run?

After testing it, you are right: it seems this patch is causing issues I had missed, which end up making the job time out. I will look into it further, thanks!

@npmiller npmiller temporarily deployed to aws February 1, 2023 16:26 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 1, 2023 17:45 — with GitHub Actions Inactive
@abagusetty
Contributor

@npmiller I was wondering if this test would be helpful to include in your PR as well. I have had it in one of my branches for other functionality:
abagusetty@e1128b2

@npmiller
Contributor Author

npmiller commented Feb 1, 2023

Yes! We definitely need more testing for this, thank you!

So I actually know what the issue is: in this patch I originally moved the ScopedContext to cover all the device info cases, which meant that the primary context was created and released for all of them. It ended up being created and then destroyed on every query, which is very expensive and caused certain tests that look up properties a lot to time out.

I've updated this to only use the ScopedContext for the free memory property, and that fixes the CI issue, but querying that specific property will still be very slow.

This works but is clearly not ideal, so I'm currently evaluating a rework of the way we handle contexts in the plugin; it's overly complicated, likely inefficient, and hacky in some parts.
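
For illustration only, a rough sketch of the device-only scoped-context pattern and of the free-memory query using it via the CUDA driver API. The class name `ScopedDeviceContext` and the helper `queryFreeGlobalMemory` are made up for this sketch and error handling is elided; the plugin's actual `ScopedContext` handles more cases than this.

```cpp
#include <cuda.h>
#include <cstddef>

// RAII helper: retain the device's primary context, make it current, and on
// scope exit restore the previous context rather than leaving a released
// context active on the thread (the bug described in this PR).
class ScopedDeviceContext {
  CUdevice Dev;
  CUcontext Previous = nullptr;

public:
  explicit ScopedDeviceContext(CUdevice Device) : Dev(Device) {
    CUcontext Primary = nullptr;
    cuDevicePrimaryCtxRetain(&Primary, Dev);
    cuCtxGetCurrent(&Previous);
    cuCtxSetCurrent(Primary);
  }

  ~ScopedDeviceContext() {
    cuCtxSetCurrent(Previous);      // don't leave the primary context active
    cuDevicePrimaryCtxRelease(Dev); // drop our reference to it
  }
};

// Only the free-memory query pays the cost of activating the primary context;
// other device info queries don't need an active CUDA context.
size_t queryFreeGlobalMemory(CUdevice Device) {
  ScopedDeviceContext Active(Device);
  size_t FreeBytes = 0, TotalBytes = 0;
  cuMemGetInfo(&FreeBytes, &TotalBytes); // requires an active CUDA context
  return FreeBytes;
}
```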

@bader bader requested a review from steffenlarsen February 1, 2023 19:01
@npmiller npmiller temporarily deployed to aws February 2, 2023 11:06 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws February 2, 2023 11:39 — with GitHub Actions Inactive
@steffenlarsen steffenlarsen merged commit 7e6b3c2 into intel:sycl Feb 2, 2023
npmiller added a commit to npmiller/llvm that referenced this pull request Feb 6, 2023
This patch moves the CUDA context from the PI context to the PI device,
and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a
single device, and that they are required to be active on a thread for
most calls to the CUDA driver API.

As shown in intel#8124 and intel#7526, the current mapping of
CUDA context to PI context causes issues for device-based entry points
that still need to call the CUDA APIs. We have workarounds to solve that,
but they're a bit hacky, inefficient, and have a lot of edge-case
issues.

The peer-to-peer interface proposal in intel#6104 is also device-based,
but enabling peer-to-peer for CUDA is done on the CUDA contexts,
so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context
from the SYCL context and simply managing the CUDA contexts in the
devices. It also changes the CUDA context management to always use the
primary context.

This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device based entry points
* Likely more efficient in the general case, with fewer opportunities to
  accidentally cause costly CUDA context switches.
* Easier and likely more efficient interactions with CUDA runtime
  applications.
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks from the previous approach:

* Drops support for `make_context` interop, as there is no sensible "native handle"
  to pass in (`get_native` is still supported fine).
* No opportunity for users to separate their work into different CUDA
  contexts. It's unclear if there's any actual use case for this; it
  seems very uncommon in CUDA codebases to have multiple CUDA contexts
  for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we
could revisit if we run into an edge case that would need more
fine-grained CUDA context management.
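
A minimal sketch of the direction described above, with the primary context retained by the device for its whole lifetime; the struct and member names here are illustrative, not the plugin's actual declarations.

```cpp
#include <cuda.h>

struct cuda_device {
  CUdevice Dev;
  CUcontext PrimaryCtx = nullptr; // retained once, for the device's lifetime

  explicit cuda_device(CUdevice Device) : Dev(Device) {
    // Every entry point that needs CUDA calls can simply make this context
    // current, instead of juggling per-SYCL-context CUcontext objects.
    cuDevicePrimaryCtxRetain(&PrimaryCtx, Dev);
  }

  ~cuda_device() { cuDevicePrimaryCtxRelease(Dev); }

  // Non-copyable: this object owns the primary context reference.
  cuda_device(const cuda_device &) = delete;
  cuda_device &operator=(const cuda_device &) = delete;
};
```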
bader pushed a commit that referenced this pull request Feb 9, 2023