Skip to content

[Doc] Add design doc for dynamic linking of device code feature #3210

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
May 31, 2021
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
276 changes: 276 additions & 0 deletions sycl/doc/SharedLibraries.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,276 @@
# Shared DPC++ libraries

This document describes purpose and design of Shared DPC++ libraries feature.

## Background
Sometimes users want to provide *device* functions via shared libraries.
Simple source example:
```
// App:

CGH.parallel_for<app_kernel>(/* ... */ {
library_function();
});


// Shared library:
SYCL_EXTERNAL void library_function() {
// do something
}
```
It is possible to manually create `sycl::program` in both app and shared
library, then use `link` SYCL API to get a single program and launch kernels
using it. But it is not user-friendly and it is very different from regular
C/C++ workflow.

The main purpose of this feature is to provide a mechanism which allows to
provide *device* functions via shared libraries and works as close as possible
to regular shared libraries.

## Requrements:
User's code is compiled into a shared library which consists of some host API,
device code and device API (`SYCL_EXTERNAL` functions). The library is linked to
a user's application which also contains some device code and performs
computations using DPC++/SYCL.
For this combination the following statements must be true:

- `SYCL_EXTERNAL` functions from library can be called (directly or indirectly)
from device code of the application.
- Function pointers taken in application should work inside the library.
- Specific code changes are not required, i.e. the mechanism of linking works
as close as possible to regular shared libraries.

## Design
The overall idea is simple:

- Each device image is supplied with an information about exported and imported
symbols using device image properties
- DPC++ RT performs *device images collection* task by grouping all device
images required to execute a kernel based on the list of exports/imports
- Besides symbol names, additional attributes are taken into account (like
device image format: SPIR-V or device asm)
- Actual linking is performed by underlying backend (OpenCL/L0/etc.)

Next sections describe details of changes in each component.

### DPC++ front-end changes

DPC++ front-end generates `module-id` attribute on each `SYCL_EXTERNAL` function.
It was generated only on kernels earlier. There are two reasons to start
generating this attribute on `SYCL_EXTERNAL` functions:

- Later in pipeline, this attribute will be used by `sycl-post-link` tool to
separate `SYCL_EXTERNAL` functions from non-`SYCL_EXTERNAL` functions with
external linkage.
- `module-id` attribute also contains information about source file where the
function comes from. This information will be used to perform device code
split on device images that contain only exported functions.

### sycl-post-link changes

`sycl-post-link` performs 3 important tasks:
- Arranges `SYCL_EXTERNAL` functions into a separate device image(s)
- Supplies device images containing exports with an information about exported
symbols
- Supplies each device image with an information about imported symbols

`sycl-post-link` outlines `SYCL_EXTERNAL` functions with all their reachable
dependencies (functions with definitions called from `SYCL_EXTERNAL` ones)
into a separate device image(s) in order to create minimal self-contained
device images that can be linked from the user's app. There are several
notable moments though.

If a `SYCL_EXTERNAL` function is used within a kernel defined in a shared
library, it will be duplicated: one instance will be stored in the kernel's
device image and the function won't exported from this device image, while the
other will be stored in a special device image for other `SYCL_EXTERNAL`
functions and will be marked as exported there. Such duplication is need for
two reasons:
- We aim to make device images with kernels self-contained so no JIT linker
invocations would be needed if we have definitions of all called functions.
Also note that if AOT is requested, it would be impossible to link anything
at runtime.
- We could export `SYCL_EXTERNAL` functions from device images with kernels,
but it would mean that when user's app calls `SYCL_EXTERNAL` function, it has
to link a whole kernel and all its dependencies - not only it increases the
amount of unnecessary linked code, but might also lead to build errors if the
kernel uses some features, which are not supported by target device (and they
are not used in the `SYCL_EXTERNAL` function).
Besides separating `SYCL_EXTERNAL` functions from kernels, they can be further
split into separate device images if device code split is requested. This is
done by grouping them using `module-id` attribute. Non-`SYCL_EXTERNAL` functions
used by `SYCL_EXTERNAL` functions with different `module-id` attributes are
copied to device images corresponding to those `SYCL_EXTERNAL` functions
to make them self-contained
In case one `SYCL_EXTERNAL` function uses another `SYCL_EXTERNAL` function
with different `module-id` attribute, the second one is not copied to the
device image with the first function, but dependency between those device images
is recorder instead.

After `SYCL_EXTERNAL` functions are arranged into a separate device image(s),
all non-`SYCL_EXTERNAL` functions are internalized to avoid multiple definition
errors during runtime linking.
Device images with `SYCL_EXTERNAL` functions will also have a list of names
of exported functions.

**NOTE**: If device code split is enabled, it seems reasonable to perform
exports arrangement before device code split procedure.

In orger to collect information about imported symbols `sycl-post-link` looks
through LLVM IR and for each declared but not defined symbol records its name,
except the following cases:
- Declarations with `__` prefix in demangled name are not recorded as imported
functions
- Declarations with `__spirv_*` prefix should not be recorded as dependencies
since they represent SPIR-V operations and will be transformed to SPIR-V
instructions during LLVM->SPIR-V translation.
- Based on some attributes which could be defined later
- This is needed to have possibility to call device-specific builtins not
starting with `__` by forward-declaring them in DPC++ code

**NOTE**: If device code split is enabled, imports collection is performed after
split and it is performed on splitted images.

All collected information is attached to a device image via properties
mechanism.

Each device image is supplied with an array of property sets:
```
struct pi_device_binary_struct {
...
// Array of property sets
pi_device_binary_property_set PropertySetsBegin;
pi_device_binary_property_set PropertySetsEnd;
};

```
Each property set is represent by the following struct:
```
// Named array of properties.
struct _pi_device_binary_property_set_struct {
char *Name; // the name
pi_device_binary_property PropertiesBegin; // array start
pi_device_binary_property PropertiesEnd; // array end
};
```
It contains name of property set and array of properties. Each property is
represented by the following struct:
```
struct _pi_device_binary_property_struct {
char *Name; // null-terminated property name
void *ValAddr; // address of property value
uint32_t Type; // _pi_property_type
uint64_t ValSize; // size of property value in bytes
};
```

List of imported symbols is represented as a single property set with name
`ImportedSymbols` recorded in the `Name` field of property set.
Each property in this set holds name of the particular imported symbol recorded
in the `Name` field of the property.
List of exported symbols is represented in the same way, except the
corresponding set has the name `ExportedSymbols`.

### DPC++ runtime changes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it very difficult to understand the big picture for the DPC++runtime changes. It would be nice to have an overview up front. Maybe something like this:

The overall strategy is as follows. When the application submits an offload kernel to execution on a device, the runtime uses the SYCL/imported symbols attribute from the kernel's device image to find the set of symbols it imports. The runtime then searches for device images that export these same symbols by searching their SYCL/exported symbols attributes. This symbol resolution algorithm results in a list of device images that need to be linked together.

The design currently supports only the case when all these device images are in SPIR-V format. The runtime passes the list of device images to the PI API, which compiles and links the SPIR-V modules together into a single native device image. This native device image is then added to the cache to avoid symbol resolution, compilation, and linking for any future attempts to invoke kernels from this device image.

Note that the PI API does not use the SYCL/imported symbols or SYCL/exported symbols attributes to perform the linking of the SPIR-V modules. Instead, functions in the SPIR-V modules must be annotated with the SPIR-V Import and Export linkage types. The PI API layer uses these annotations to perform that actual linking of function references across SPIR-V modules.

Note that I'm not really sure this is how the design will work, but I think this is what you intend. Can you confirm if my overall understanding of the runtime design is accurate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your understanding is correct, except that the design doesn't actually depend on device images format. I.e. the algorithm of searching won't be changed if device images are pre-compiled native device binaries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, the symbol-search part of the algorithm can probably work for device images in native code format. However, I think the online linking part will only work for device images in SPIR-V format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wanted to stress that I think it's important to add some description similar to what I suggest to the introduction to the "DPC++ runtime changes" section. I think this will clarify the following points which are currently unclear:

  • The SYCL/imported symbols and SYCL/exported symbols attributes are used only by the runtime's symbol resolution algorithm. The PI API layer relies on the SPIR-V Import and Export linkage annotations, so these must also be present.

  • The linking part of the design works only for device images in SPIR-V format.

  • The only things added to the cache are fully linked device images. We do not cache device images with SYCL_EXTERNAL functions that are compiled and not yet linked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linking part of the design works only for device images in SPIR-V format.

I don't agree with you here. Yes, right now we don't have backends that support linking of native device binaries. But we don't say that it is not possible in the future. I think format of device image shouldn't affect design of runtime changes. Most likely when linking of native device binaries is supported, it won't matter for runtime which format device image has, it will just call some PI API for link.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smaslov-intel, my understanding is that in OpenCL terminology, compiled != built. compiled is an object form, i.e. front-end is passed, but no executable binary was produced yet. I can't find strong proof of that in the spec, but I'm sure that our (Intel) CPU & GPU compilers won't accept native binaries in clLinkProgram

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/clCompileProgram.html

The compiled binary can be queried using clGetProgramInfo(program, CL_PROGRAM_BINARIES, ...) and can be specified to clCreateProgramWithBinary to create a new program object.

Compile does produce "binary", otherwise the above wouldn't be possible.
Who can clarify it exactly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compile does produce "binary", otherwise the above wouldn't be possible.
Who can clarify it exactly?

I think @bashbaug should be able to clarify this.

In the meantime:

From clCreateKernel:

Parameters
program
A program object with a successfully built executable.

From clBuildProgram:

Builds (compiles and links) a program executable from the program source or binary.

From clLinkProgram:

Links a set of compiled program objects and libraries for all the devices or a specific device(s) in the OpenCL context and creates an executable.

From clCompileProgram:

Compiles a program’s source for all the devices or a specific device(s) in the OpenCL context associated with program.

It seems to me that intent was that clCompileProgram does not generate an executable, but only creates some intermediate object file instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.
From clGetProgramInfo(https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetProgramInfo.html) desc:

Return the program binaries for all devices associated with program. For each device in program, the binary returned can be the binary specified for the device when program is created with clCreateProgramWithBinary or it can be the executable binary generated by clBuildProgram. If program is created with clCreateProgramWithSource, the binary returned is the binary generated by clBuildProgram. The bits returned can be an implementation-specific intermediate representation (a.k.a. IR) or device specific executable bits or both. The decision on which information is returned in the binary is up to the OpenCL implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision on which information is returned in the binary is up to the OpenCL implementation.

Can we somehow query OpenCL about what is in the binary such that we could properly control it's linking?


DPC++ RT performs *device images collection* task by grouping all device
images required to execute a kernel based on the list of exports/imports and
links them together using PI API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that we add a sub-section about this PI API below: this API is important, as this is a part of interface between plugins and DPC++ RT and we don't even have its name listed here.

The sub-section could describe the API, its behavior (whether we assume that it is capable to link native binaries regardless of device or that it is some optional capability which should be checked before usage), possible implementation/limitations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, please see lines 236-259.


Given that all exports will be arranged to a separate device images without
kernels it is reasonable to store device images with exports in a separate data
structure.

## Corner cases and limitations

It is not guaranteed that behaviour of host shared libraries and device shared
libraries will always match. There are several cases when it can occur, the
next sections will cover details of such cases.

### ODR violations

C++ standard defines One Definition Rule as:
> Every program shall contain exactly one definition of every non-inline
function or variable that is odr-used in that program outside of a discarded
statement; no diagnostic required.
The definition can appear explicitly in the program, it can be found in the
standard or a user-defined library, or (when appropriate) it is implicitly
defined.


Here is an example:

![ODR violation](images/ODR-shared-libraries.svg)

Both libraries libB and libC provide two different definitions of function
`b()`, so this example illustrates ODR violation. Technically this case has
undefined behaviour, however it is possible to run and compile this example on
Linux and Windows. Whereas on Linux only function `b()` from library libB is
called, on Windows both versions of function `b()` are used.
Most of backends online linkers act like static linkers, i.e. just merge
device images with each other, so it is not possible to correctly imitate
Windows behaviour in device code linking because attempts to do it will result
in multiple definition errors.

Given that, it is not guaranteed that behaviour of shared host libraries and
shared device libraries will always match in case of such ODR violations.

#### LD_PRELOAD

Another way to violate ODR is `LD_PRELOAD` environment variable on Linux. It
allows to load specified shared library before any other shared libraries so it
will be searched for symbols before other shared libraries. It allows to
substitute functions from regular shared libraries by functions from preloaded
library.
Device code registration is implemented using global constructors. Order of
global constructors calling is not defined across different translation units,
so with current design of device shared libraries and device code registration
mechanism it is not possible to understand which device code comes from
preloaded library and which comes from regular shared libraries.

Here is an example:

![LD_PRELOAD](images/LD-preload-shared-libraries.svg)

"libPreload" library is preloaded using `LD_PRELOAD` environment variable.
In this example, device code from "libPreload" might be registered after
device code from "libA".

To implement basic support, for each device image we can record name of the
library where this device image comes from and parse content of `LD_PRELOAD`
environment variable to choose the proper images. However such implementation
will only allow to substitute a whole device image and not an arbitrary
function (unless it is the only function in a device image), because partial
substitution will cause multiple definition errors during runtime linking.

### Run-time libraries loading

It is possible to load shared library during run-time. Both Linux and Windows
provide a way to do so (for example `dlopen()` on Linux or `LoadLibrary` on
Windows).
In case run-time loading is used to load some shared library, the symbols from
this shared library do not appear in the namespace of the main program. It means
that even though shared library is loaded successfully in run-time, it is not
possible to use symbols from it directly. The symbols from run-time loaded
library can be accessed by address which can be obtained using corresponding
OS-dependent API (for example `dlsym()` on Linux).

The problem here is that even though symbols from run-time loaded shared
library are not part of application's namespace, the library is loaded through
standard mechanism, i.e. its global constructors are invoked which means that
device code from it is registered, so it is not possible to
understand whether device code comes from run-time loaded library or not.
If such run-time loaded library exports device symbols and they
somehow match with symbols that actually directly used in device code
somewhere, it is possible that symbols from run-time loaded library
will be unexpectedly used.

To resolve this problem we need to ensure that device code registered from
run-time loaded library appears at the end of symbols search list, however
having that device code registration is triggered by global constructors, it
doesn't seem possible.

One more possible mitigation would be to record name of the library from which
each symbol should be imported, but it still won't resolve all potential
issues with run-time library loading, because user can load the library with the
same name as one of the explicitly linked libraries.
1 change: 1 addition & 0 deletions sycl/doc/images/LD-preload-shared-libraries.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
1 change: 1 addition & 0 deletions sycl/doc/images/ODR-shared-libraries.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.