[SYCL][CUDA] Handle the case of not having any CUDA device #1212

fwyzard · 2020-02-28T07:23:10Z

If there are no CUDA devices, do not return any CUDA-based SYCL platform rather than aborting or throwing an exception.

Fixes #1198.

Ruyk

Thanks, this looks better than my patch, which is now closed.

sycl/plugins/cuda/pi_cuda.cpp

bjoernknafla · 2020-02-28T13:54:42Z

sycl/plugins/cuda/pi_cuda.cpp

@@ -34,6 +34,8 @@ pi_result map_error(CUresult result) {
    return PI_INVALID_OPERATION;
  case CUDA_ERROR_INVALID_CONTEXT:
    return PI_INVALID_CONTEXT;
+  case CUDA_ERROR_NO_DEVICE:


This error code seems to only come up with CUDA-OpenGL interop according to the docs - is that something you see "in the wild" otherwise?

Sorry - this is actually a leftover from the first attempt I made, that was returning PI_INVALID_PLATFORM if there are no CUDA devices.

It should be fine to drop these two lines.

Wondering if PI_INVALID_PLATFORM is the best match for the device error as I am unsure which OpenCL (extension) call would relate to OpenGL in this case?

bjoernknafla · 2020-02-28T14:31:41Z

sycl/plugins/cuda/pi_cuda.cpp

+        initFlag,
+        [](pi_result &err) {
+          CUresult result = cuInit(0);
+          if (result == CUDA_ERROR_NO_DEVICE) {


I am not seeing this error code in the 10.2.89 CUDA Driver API docs docs - does an earlier version return it?

Should we instead check as follows?

Suggested change

if (result == CUDA_ERROR_NO_DEVICE) {

if (result != CUDA_SUCCESS) {

it's not in the docs, but it is returned when there are no available devices (I'm testing with CUDA 10.2).

The idea was to special case a working installation which simply has no devices (for example because of CUDA_VISIBLE_DEVICES=).

I guess it also makes sense not to return the platform, instead of callingabort(), if cuInit() fails for other reasons.

I can make the change (and remove the call to PI_CHECK_ERROR(result)) if this is the preferred behaviour.

I would argue that cuInit() should be allowed to fail for any reason. At that point we should just not get any CUDA devices at runtime. The application should still be allowed to sue OpenCL/host devices at that point.

So yeah, I would just check with CUDA_SUCCESS.

I can confirm CUDA_ERROR_NO_DEVICE is reported by cuInit, despite of it not being documented.
I think some of the other error causes are worth of throwing error, otherwise users will simply get "no cuda platform" without any other information.

😕 Do you think production code should behave that way? If I want to run an application on "some platform", I mainly want it to run. If it can't use any CUDA devices, then fine, it should still use any other device that it can, and not just die on me.

So an exception is really not what I would go for here. For whatever reason the cuInit() function is dying. At the most I would print an "info message" if the return value of cuInit() is something really non-expected.

I think some of the other error causes are worth of throwing error, otherwise users will simply get "no cuda platform" without any other information.

I do not know how the plugins should work in the long term.
For the moment I understand that - if they are built - both the OpenCL and the CUDA plugins will be used, without a way to select/deselect them at runtime.

Now, the OpenCL plugin seems happy in not reporting any platforms for which it doesn't find any devices.

I think the CUDA plugin should never abort (or throw an exception that causes the runtime to do so); for example because that would break also SYCL applications that do not care about the PTX/CUDA backend.

Whether to print an error or silently report no CUDA devices (and thus no CUDA platform) is less clear, but I would suggest to use the same approach as the OpenCL plugin (i.e. be silent) and let the application or the user deal with the lack of devices.

If, on the other hand, one prefers a different behaviour (e.g. warning about a CUDA installation without device, or an NVIDIA device without CUDA drivers, etc.) I think the OpenCL backend should behave in the same way with respect to OpenCL devices.

sycl/plugins/cuda/pi_cuda.cpp

bjoernknafla · 2020-02-28T17:27:00Z

Should we check in cuda_piDevicesGet if there is a platform and return PI_INVALID_PLATFORM otherwise?

bjoernknafla · 2020-02-28T17:34:26Z

Studying OpenCL 2.1 specs:

Without the cl_khr_icd extension (which we do not report/support) there is now way to not return a platform without an error code, but no error code exists to report that a platform does not exist...

To not deviate too much from the OpenCL spec, can we really support a 0 for number of platforms? Or do we always have to return a platform but then not report any devices?

fwyzard · 2020-02-28T17:34:55Z

Is it possible to call cuda_piDevicesGet if cuda_piPlatformsGet does not return any platforms ?

bjoernknafla · 2020-02-28T17:40:23Z

No, though at the moment we always return a platform - even if we set the internal number to 0 - and I think we have to return a platform if we do not want to return an error. And I believe we do not want to return an error if there are no devices as this might trip up an implementation that is ok with finding any platform with devices even if CUDA isn't the one.

fwyzard · 2020-02-28T17:42:22Z

For what is worth, my interpretation is that clGetPlatformIDs returns (ignoring the cases about invalid inputs and out-of-memory conditions):

if the cl_khr_icd extension is enabled:

CL_SUCCESS if the function is executed successfully and there are a non zero number of platforms available.
CL_PLATFORM_NOT_FOUND_KHR if no platforms are found.

if the cl_khr_icd extension is not enabled:

CL_SUCCESS if the function is executed successfully.

I do not know if "the function is executed successfully" implies that at least one platform with at least one device should be available.

bjoernknafla · 2020-02-28T17:47:44Z

Just checked the only part of SYCL that calls piPlatformsGet and it checks the number of returned platforms.

This should therefore be fine for the SYCL implementation but we have to check out unit tests.

bjoernknafla · 2020-02-28T17:48:19Z

@fwyzard thank you for the back and forth to allow me to wrap my head around the intricacies!

fwyzard · 2020-02-28T17:51:44Z

My pleasure.

By the way, here is what I get on a machine with both OpenCL and CUDA available:

Available SYCL platforms:
  - Intel(R) OpenCL, driver version OpenCL 2.1 LINUX
    - Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, driver version 2020.9.1.0.18

  - Intel(R) OpenCL HD Graphics, driver version OpenCL 2.1 
    - Intel(R) Gen9 HD Graphics NEO, driver version 20.07.15711

  - NVIDIA CUDA, driver version CUDA 10.20
    - Tesla K40c, driver version CUDA 10.20

  - SYCL host platform, driver version 1.2
    - SYCL host device, driver version 1.2

and here is what I get with the proposed changes, if I "hide" the NVIDIA GPU via CUDA_VISIBLE_DEVICES=:

Available SYCL platforms:
  - Intel(R) OpenCL, driver version OpenCL 2.1 LINUX
    - Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, driver version 2020.9.1.0.18

  - Intel(R) OpenCL HD Graphics, driver version OpenCL 2.1 
    - Intel(R) Gen9 HD Graphics NEO, driver version 20.07.15711

  - SYCL host platform, driver version 1.2
    - SYCL host device, driver version 1.2

which looks like the behaviour I would expect :-)

bader

LGTM with a couple of nits.

sycl/plugins/cuda/pi_cuda.cpp

If CUDA initialisation fails or there are no CUDA devices, do not return a CUDA-based SYCL platform rather than aborting or throwing an exception. Signed-off-by: Andrea Bocci <andrea.bocci@cern.ch>

…ctor_tests * origin/sycl: (32 commits) [SYCL] Fix circular reference between events and queues (intel#1226) [CI][Doc] Use SSH to deploy GitHub Pages (intel#1232) [SYCL][CUDA][Test] Testing for use of CUDA primary context (intel#1174) [SYCL] allow underscore symbol in temporary directory name [SYCL] Reject zero length arrays (intel#1153) [SYCL] Fix static code analyzis concerns (intel#1189) [SYCL] Add more details about the -fintelfpga option (intel#1218) [SYCL][CUDA] Select only NVPTX64 device binaries (intel#1223) [SYCL] Reverse max work-group size order (intel#1177) [SYCL][Doc] Add GroupAlgorithms extension (intel#1079) [SYCL] Fix SYCL internal enumerators conflict with user defined macro (intel#1188) [SYCL][CUDA] Fixes context release and unnamed context scope (intel#1207) [SYCL][CUDA] Fix context creation property parsing [CUDA][PI] clang-format pi.h [SYCL][CUDA] Handle the case of not having any CUDA device (intel#1212) [SYCL] Fix check-sycl-deploy target problems (intel#1165) [SYCL] Disable tests which take more than 5 minutes (intel#1220) [SYCL] Make context constructors explicit to avoid unintended conversions (intel#1219) [SYCL][NFC] Add clang-format configuration file for SYCL LIT tests (intel#1224) [SYCL] Fix command cleanup invoked from multiple threads (intel#1214) ...

fwyzard mentioned this pull request Feb 28, 2020

[SYCL][CUDA] Avoid failure when no devices available #1210

Closed

Ruyk previously approved these changes Feb 28, 2020

View reviewed changes

sycl/plugins/cuda/pi_cuda.cpp Show resolved Hide resolved

bader assigned romanovvlad Feb 28, 2020

bader added the cuda CUDA back-end label Feb 28, 2020

bjoernknafla reviewed Feb 28, 2020

View reviewed changes

sycl/plugins/cuda/pi_cuda.cpp Show resolved Hide resolved

bjoernknafla reviewed Feb 28, 2020

View reviewed changes

fwyzard dismissed Ruyk’s stale review via f51ec95 February 28, 2020 14:12

fwyzard force-pushed the fix_issue_1198 branch from fe20c0a to f51ec95 Compare February 28, 2020 14:12

bjoernknafla suggested changes Feb 28, 2020

View reviewed changes

fwyzard force-pushed the fix_issue_1198 branch from f51ec95 to 170352b Compare February 28, 2020 15:48

bjoernknafla previously approved these changes Feb 28, 2020

View reviewed changes

bader reviewed Mar 2, 2020

View reviewed changes

sycl/plugins/cuda/pi_cuda.cpp Outdated Show resolved Hide resolved

sycl/plugins/cuda/pi_cuda.cpp Outdated Show resolved Hide resolved

bader previously approved these changes Mar 2, 2020

View reviewed changes

[SYCL][CUDA] Handle the case of not having any CUDA device

16ba1c5

If CUDA initialisation fails or there are no CUDA devices, do not return a CUDA-based SYCL platform rather than aborting or throwing an exception. Signed-off-by: Andrea Bocci <andrea.bocci@cern.ch>

fwyzard dismissed stale reviews from bader and bjoernknafla via 16ba1c5 March 2, 2020 11:25

fwyzard force-pushed the fix_issue_1198 branch from 170352b to 16ba1c5 Compare March 2, 2020 11:25

romanovvlad approved these changes Mar 2, 2020

View reviewed changes

bader merged commit 745e759 into intel:sycl Mar 2, 2020

fwyzard deleted the fix_issue_1198 branch March 11, 2020 21:44

	if (result == CUDA_ERROR_NO_DEVICE) {
	if (result != CUDA_SUCCESS) {

[SYCL][CUDA] Handle the case of not having any CUDA device #1212

[SYCL][CUDA] Handle the case of not having any CUDA device #1212

Uh oh!

Conversation

fwyzard commented Feb 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ruyk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bjoernknafla Feb 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bjoernknafla commented Feb 28, 2020

Uh oh!

bjoernknafla commented Feb 28, 2020

Uh oh!

fwyzard commented Feb 28, 2020

Uh oh!

bjoernknafla commented Feb 28, 2020

Uh oh!

fwyzard commented Feb 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bjoernknafla commented Feb 28, 2020

Uh oh!

bjoernknafla commented Feb 28, 2020

Uh oh!

fwyzard commented Feb 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bader left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fwyzard commented Feb 28, 2020 •

edited

Loading

bjoernknafla Feb 28, 2020 •

edited

Loading

fwyzard commented Feb 28, 2020 •

edited

Loading

fwyzard commented Feb 28, 2020 •

edited

Loading