diff --git a/README_SYCL.md b/README_SYCL.md
index ccac62293474..3361f5b48681 100644
--- a/README_SYCL.md
+++ b/README_SYCL.md
@@ -13,14 +13,12 @@ At the moment alpaka's SYCL back-end can only be used together with Intel oneAPI
 ### General CMake options

 * `alpaka_ACC_SYCL_ENABLE`: set to `ON` to enable the SYCL back-end. Requires the activation of at least one oneAPI hardware target (see below).
-* `alpaka_SYCL_ENABLE_IOSTREAM`: set to `ON` to enable device-side printing. Force-enabled if `BUILD_TESTING` is enabled.
-* `alpaka_SYCL_IOSTREAM_KIB`: Kibibytes per block reserved as output buffer for device-side printing. This cannot exceed the amount of shared memory per block. Only takes effect if `alpaka_SYCL_ENABLE_IOSTREAM` is enabled. Default: `64`.
 * `MKL_ROOT`: always required. Set to the root of your oneMKL installation. Example: `/opt/intel/oneapi/mkl/2023.1.0`.
 * `oneDPL_DIR`: always required. Set to the CMake path of your oneDPL installation. Example: `/opt/intel/oneapi/dpl/2022.1.0/lib/cmake/oneDPL`.

-### Building for Intel CPUs
+### Building for CPUs

-The following CMake flags can be set for Intel CPUs:
+The following CMake flags can be set for CPUs:

 * `alpaka_SYCL_ONEAPI_CPU`: set to `ON` to enable compilation for Intel CPUs.
 * `alpaka_SYCL_ONEAPI_CPU_ISA`: the Intel ISA to compile for. Look at the possible `--march` options listed in the output of `opencl-aot --help`. Default: `avx2`.
@@ -37,7 +35,8 @@ Note: Intel FPGAs cannot be targeted together with other Intel hardware. This is
 ### Building for Intel GPUs

 * `alpaka_SYCL_ONEAPI_GPU`: set to `ON` to enable compilation for Intel GPUs.
-* `alpaka_SYCL_ONEAPI_GPU_DEVICES`: semicolon-separated list of one or more Intel GPUs to compile for. Check the output of `ocloc compile --help` and look at the possible values for the `-device` argument for valid values to supply here. Default: `bdw`.
+* `alpaka_SYCL_ONEAPI_GPU_DEVICES`: semicolon-separated list of one or more Intel GPUs to compile for. The possible values for the devices are listed in the [UsersManual](https://intel.github.io/llvm-docs/UsersManual.html#generic-options) under the flag `-fsycl-targets`. Default: `intel_gpu_pvc`.
+ NOTE: currently only one target at a time can be specified (limitation of the Intel Compiler)

 ## Standalone mode

@@ -48,10 +47,7 @@ Using the SYCL back-end always requires the following flags:
 * `-fsycl` (compiler and linker)
 * `-fsycl-standard=2020` (compiler)

-To enable device-side printing add the following compiler flags:
-
-* `-DALPAKA_SYCL_IOSTREAM_ENABLED`: to enable device-side printing.
-* `-DALPAKA_SYCL_IOSTREAM_KIB=`: `` (without the brackets) defines the kibibytes per block to be reserved for device-side printing. `` cannot exceed the amount of shared memory per block.
+Device-side printing is possible with `printf`, it calls `sycl::ext::oneapi::experimental::printf` that emulates the standard one. This is an extension of the SYCL standard, still in an experimental state, therefore may not always work correctly.

 ### Building for x86 64-bit CPUs

@@ -75,8 +71,7 @@ To enable device-side printing add the following compiler flags:

 1. `#include ` in your C++ code.
 2. Add the following flags:
- * `-fsycl-targets=spir64_gen` (compiler and linker): to enable GPU compilation. Note: If you are using multiple SYCL hardware targets (like CPU and GPU) separate them by comma here.
- * `-Xsycl-target-backend=spir64_gen "-device "` (linker): to choose the Intel GPU(s) to compile for. Multiple devices can either be separated by comma or by supplying a range of devices. Refer to the output of `ocloc compile --help` and look for the `-device` flag for the possible values.
+ * `-fsycl-targets=intel_gpu_pvc` (compiler and linker): to enable GPU compilation. Note: If you are using multiple SYCL hardware targets (like CPU and GPU) separate them by comma here.

 ## Using the SYCL back-end

@@ -84,9 +79,9 @@ To enable device-side printing add the following compiler flags:

 In contrast to the other back-ends the SYCL back-end comes with multiple different accelerators which should be chosen according to your requirements:

-* `alpaka::experimental::AccCpuSycl` for targeting Intel and AMD CPUs. In contrast to the other CPU back-ends this will use Intel's OpenCL implementation for CPUs under the hood.
-* `alpaka::experimental::AccFpgaSyclIntel` for targeting Intel FPGAs.
-* `alpaka::experimental::AccGpuSyclIntel` for targeting Intel GPUs.
+* `alpaka::AccCpuSycl` for targeting Intel and AMD CPUs. In contrast to the other CPU back-ends this will use Intel's OpenCL implementation for CPUs under the hood.
+* `alpaka::AccFpgaSyclIntel` for targeting Intel FPGAs.
+* `alpaka::AccGpuSyclIntel` for targeting Intel GPUs.

 These can be used interchangeably (some restrictions apply - see below) with the non-experimental alpaka accelerators to compile an existing alpaka code for SYCL-capable hardware.

@@ -102,3 +97,14 @@ These can be used interchangeably (some restrictions apply - see below) with the
 ```
 See [Intel's FAQ](https://github.com/intel/compute-runtime/blob/master/opencl/doc/FAQ.md#feature-double-precision-emulation-fp64) for more information.
 * The FPGA back-end does not support atomics. alpaka will not check this.
+* device global variables are not supported in SYCL yet.
+
+### Choosing the sub-group size (warp size)
+
+There is a trait to specify at compile time the sub-group size to use for a kernel. The default behaviour, when no sub-group size is specified, is to let the back-end compiler pick the preferred size.
+
+Before launching a kernel with a compile-time sub-group size the user should query the sizes supported by the device, and choose accordingly. If the device does not support the requested size, the SYCL runtime will throw a synchronous exception.
+
+During just-in-time (JIT) compilation this guarantees that a kernel is compiled only for the sizes supported by the device. During ahead-of-time (AOT) compilation this is not enough, because the device is not known at compile time. The SYCL specification mandates that the back-end compilers should not fail if a kernel uses unsupported features, like unsupported sub-group sizes. Unfortunately the Intel OpenCL CPU and GPU compilers currently fail with a hard error. To work around this limitation, use the preprocessor macros defined when compiling AOT for the new SYCL targets to enable the compilation only for the sub-group sizes supported by each device.
+
+NOTE: while the CPU OpenCL back-end does support a sub-group size of 64, the SYCL code currently does not. To avoid issues with the sub-group primitives always consider the sub-group size of 64 as not supported by the device.
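
Reviewer note on the "Choosing the sub-group size" text added in the last hunk: the "query the supported sizes, then choose" step might look roughly like the sketch below. It is plain C++ and does not use alpaka or SYCL. `isUsableSubGroupSize` and `pickSubGroupSize` are hypothetical helper names introduced only for illustration; they are not part of either API. In a real SYCL program the `supported` list would presumably come from `sycl::device::get_info<sycl::info::device::sub_group_sizes>()`, which returns a `std::vector<std::size_t>`. The sketch also applies the NOTE above by always treating a size of 64 as unsupported.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not an alpaka or SYCL function).
// Returns true if `requested` is among the sizes reported by the device.
// A size of 64 is always rejected, following the NOTE in the README text.
inline bool isUsableSubGroupSize(std::vector<std::size_t> const& supported, std::size_t requested)
{
    assert(requested > 0);
    if(requested == 64)
        return false;
    return std::find(supported.begin(), supported.end(), requested) != supported.end();
}

// Hypothetical helper: walks `preferred` in order and returns the first usable size.
// An empty optional means "specify no size and let the back-end compiler pick".
inline std::optional<std::size_t> pickSubGroupSize(
    std::vector<std::size_t> const& supported,
    std::vector<std::size_t> const& preferred)
{
    for(std::size_t size : preferred)
    {
        if(isUsableSubGroupSize(supported, size))
            return size;
    }
    return std::nullopt;
}
```

A caller would pass the device-reported sizes together with its own preference order, for example `pickSubGroupSize(supported, {32, 16})`, and only request a compile-time sub-group size when a value is returned. This avoids the synchronous exception described above for JIT builds. For AOT builds the preprocessor-macro guard described in the README text is still needed.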