-
Notifications
You must be signed in to change notification settings - Fork 788
[SYCL][CUDA] Handle the case of not having any CUDA device #1212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks better than my patch, which is now closed.
sycl/plugins/cuda/pi_cuda.cpp
Outdated
@@ -34,6 +34,8 @@ pi_result map_error(CUresult result) { | |||
return PI_INVALID_OPERATION; | |||
case CUDA_ERROR_INVALID_CONTEXT: | |||
return PI_INVALID_CONTEXT; | |||
case CUDA_ERROR_NO_DEVICE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error code seems to only come up with CUDA-OpenGL interop according to the docs - is that something you see "in the wild" otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - this is actually a leftover from the first attempt I made, that was returning PI_INVALID_PLATFORM
if there are no CUDA devices.
It should be fine to drop these two lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if PI_INVALID_PLATFORM
is the best match for the device error as I am unsure which OpenCL (extension) call would relate to OpenGL in this case?
fe20c0a
to
f51ec95
Compare
sycl/plugins/cuda/pi_cuda.cpp
Outdated
initFlag, | ||
[](pi_result &err) { | ||
CUresult result = cuInit(0); | ||
if (result == CUDA_ERROR_NO_DEVICE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not seeing this error code in the 10.2.89 CUDA Driver API docs docs - does an earlier version return it?
Should we instead check as follows?
if (result == CUDA_ERROR_NO_DEVICE) { | |
if (result != CUDA_SUCCESS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not in the docs, but it is returned when there are no available devices (I'm testing with CUDA 10.2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was to special case a working installation which simply has no devices (for example because of CUDA_VISIBLE_DEVICES=
).
I guess it also makes sense not to return the platform, instead of callingabort()
, if cuInit()
fails for other reasons.
I can make the change (and remove the call to PI_CHECK_ERROR(result)
) if this is the preferred behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that cuInit()
should be allowed to fail for any reason. At that point we should just not get any CUDA devices at runtime. The application should still be allowed to sue OpenCL/host devices at that point.
So yeah, I would just check with CUDA_SUCCESS
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm CUDA_ERROR_NO_DEVICE is reported by cuInit, despite of it not being documented.
I think some of the other error causes are worth of throwing error, otherwise users will simply get "no cuda platform" without any other information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😕 Do you think production code should behave that way? If I want to run an application on "some platform", I mainly want it to run. If it can't use any CUDA devices, then fine, it should still use any other device that it can, and not just die on me.
So an exception is really not what I would go for here. For whatever reason the cuInit()
function is dying. At the most I would print an "info message" if the return value of cuInit()
is something really non-expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some of the other error causes are worth of throwing error, otherwise users will simply get "no cuda platform" without any other information.
I do not know how the plugins should work in the long term.
For the moment I understand that - if they are built - both the OpenCL and the CUDA plugins will be used, without a way to select/deselect them at runtime.
Now, the OpenCL plugin seems happy in not reporting any platforms for which it doesn't find any devices.
I think the CUDA plugin should never abort (or throw an exception that causes the runtime to do so); for example because that would break also SYCL applications that do not care about the PTX/CUDA backend.
Whether to print an error or silently report no CUDA devices (and thus no CUDA platform) is less clear, but I would suggest to use the same approach as the OpenCL plugin (i.e. be silent) and let the application or the user deal with the lack of devices.
If, on the other hand, one prefers a different behaviour (e.g. warning about a CUDA installation without device, or an NVIDIA device without CUDA drivers, etc.) I think the OpenCL backend should behave in the same way with respect to OpenCL devices.
f51ec95
to
170352b
Compare
Should we check in |
Studying OpenCL 2.1 specs:
To not deviate too much from the OpenCL spec, can we really support a |
Is it possible to call |
No, though at the moment we always return a platform - even if we set the internal number to |
For what is worth, my interpretation is that if the
if the
I do not know if "the function is executed successfully" implies that at least one platform with at least one device should be available. |
Just checked the only part of SYCL that calls piPlatformsGet and it checks the number of returned platforms. This should therefore be fine for the SYCL implementation but we have to check out unit tests. |
@fwyzard thank you for the back and forth to allow me to wrap my head around the intricacies! |
My pleasure. By the way, here is what I get on a machine with both OpenCL and CUDA available:
and here is what I get with the proposed changes, if I "hide" the NVIDIA GPU via
which looks like the behaviour I would expect :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a couple of nits.
If CUDA initialisation fails or there are no CUDA devices, do not return a CUDA-based SYCL platform rather than aborting or throwing an exception. Signed-off-by: Andrea Bocci <andrea.bocci@cern.ch>
…ctor_tests * origin/sycl: (32 commits) [SYCL] Fix circular reference between events and queues (intel#1226) [CI][Doc] Use SSH to deploy GitHub Pages (intel#1232) [SYCL][CUDA][Test] Testing for use of CUDA primary context (intel#1174) [SYCL] allow underscore symbol in temporary directory name [SYCL] Reject zero length arrays (intel#1153) [SYCL] Fix static code analyzis concerns (intel#1189) [SYCL] Add more details about the -fintelfpga option (intel#1218) [SYCL][CUDA] Select only NVPTX64 device binaries (intel#1223) [SYCL] Reverse max work-group size order (intel#1177) [SYCL][Doc] Add GroupAlgorithms extension (intel#1079) [SYCL] Fix SYCL internal enumerators conflict with user defined macro (intel#1188) [SYCL][CUDA] Fixes context release and unnamed context scope (intel#1207) [SYCL][CUDA] Fix context creation property parsing [CUDA][PI] clang-format pi.h [SYCL][CUDA] Handle the case of not having any CUDA device (intel#1212) [SYCL] Fix check-sycl-deploy target problems (intel#1165) [SYCL] Disable tests which take more than 5 minutes (intel#1220) [SYCL] Make context constructors explicit to avoid unintended conversions (intel#1219) [SYCL][NFC] Add clang-format configuration file for SYCL LIT tests (intel#1224) [SYCL] Fix command cleanup invoked from multiple threads (intel#1214) ...
If there are no CUDA devices, do not return any CUDA-based SYCL platform rather than aborting or throwing an exception.
Fixes #1198.