
Rewrite the SYCL backend for the SYCL 2020 standard and USM allocations #1845

Merged
merged 10 commits into from
Jul 27, 2023

Conversation

Parsifal-2045
Contributor

@Parsifal-2045 Parsifal-2045 commented Nov 29, 2022

Rewrite the SYCL backend to support the SYCL 2020 standard, using USM allocations instead of SYCL buffers and accessors.

A few highlights:

  • full support for the SYCL 2020 standard and the Unified Shared Memory (USM) model;
  • rewrite the N-dimensional Copy and Set memory operations to support pitched memory buffers, based on the Cpu implementation;
  • generalise the SYCL CpuSelector to non-Intel CPUs;
  • implement support for random number generators based on the Intel oneAPI libraries;
  • support compile-time sub-group sizes in SYCL kernels (see below);
  • update the documentation, tests and examples.
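As a rough illustration of the pitched-memory strategy behind the rewritten Copy and Set operations, here is a minimal sketch in plain C++ (the function name is hypothetical; the actual alpaka implementation generalises this loop to N dimensions via meta::ndLoopIncIdx):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch of a pitched 2-D copy: iterate over the outer dimension
// and copy one innermost row at a time, so that the source and destination
// buffers may have different pitches (row strides in bytes).
inline void copy2DPitched(
    unsigned char* dst, std::size_t dstPitchBytes,
    unsigned char const* src, std::size_t srcPitchBytes,
    std::size_t widthBytes, std::size_t height)
{
    for(std::size_t row = 0; row < height; ++row)
        std::memcpy(dst + row * dstPitchBytes, src + row * srcPitchBytes, widthBytes);
}
```

Because each row is copied separately, a single logical copy may issue multiple memcpy calls, which is why the SYCL implementation uses the queue rather than a single command group handler.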

Kernel trait for compile-time sub-group size

Introduce a new optional trait to describe at compile time the warp size that a kernel should use. The default behaviour is to let the back-end compiler pick the preferred size.

Before launching a kernel with a compile-time sub-group size the user should query the sizes supported by the device, and choose accordingly. If the device does not support the requested size, the SYCL runtime will throw a synchronous exception.

During just-in-time (JIT) compilation this guarantees that a kernel is compiled only for the sizes supported by the device. During ahead-of-time (AOT) compilation this is not enough, because the device is not known at compile time. The SYCL specification mandates that the back-end compilers should not fail if a kernel uses unsupported features, like unsupported sub-group sizes. Unfortunately the Intel OpenCL CPU and GPU compilers currently fail with a hard error. To work around this limitation, use the preprocessor macros defined when compiling AOT for the new SYCL targets to enable the compilation only for the sub-group sizes supported by each device.

Note: while the CPU OpenCL back-end does support a sub-group size of 64, the SYCL code currently does not. To avoid issues with the sub-group primitives always consider the sub-group size of 64 as not supported by the device.
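The query-then-choose step can be sketched in plain C++ (a hypothetical helper; the `supported` list stands in for the sizes reported by the device, e.g. via the sycl::info::device::sub_group_sizes query):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper for the selection logic described above. Per the note
// above, a size of 64 is always treated as unsupported. Returning 0 means
// "let the back-end compiler pick the preferred size".
inline std::uint32_t pickSubGroupSize(
    std::vector<std::uint32_t> const& supported,
    std::uint32_t requested)
{
    if(requested == 0 || requested == 64)
        return 0;
    bool const found
        = std::find(supported.begin(), supported.end(), requested) != supported.end();
    return found ? requested : 0;
}
```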

Member

@bernhardmgruber bernhardmgruber left a comment


I started reviewing the changeset, but it is cluttered with too many namespace removals to stay focused. Can we keep the removal of the experimental namespace out of this PR? We can either remove the namespace after your work is integrated, or even schedule it before your PR. @j-stephan and @psychocoderHPC, what would you prefer?

include/alpaka/atomic/AtomicGenericSycl.hpp Outdated Show resolved Hide resolved
include/alpaka/dev/DevGenericSycl.hpp Outdated Show resolved Hide resolved
include/alpaka/dev/DevGenericSycl.hpp Outdated Show resolved Hide resolved
Comment on lines 138 to 139
auto get_device() const -> sycl::device
{
Member

Should we maybe call those get_native_device? We tend to use the word native when we expose platform specifics. See also getNativeHandle, which btw already offers this functionality. Should we have your function at all, then?

Contributor Author

@Parsifal-2045 Parsifal-2045 Nov 30, 2022

The two functions get_device and get_context were implemented as conveniences for other pieces of code more than anything. In particular, sometimes we only need one of the two (for example, to allocate pinned memory on the host we just need the device's context), and I find it clearer to ask for either the device or the context with a specific function instead of relying on the pair. I might reimplement them using getNativeHandle and maybe rename them, following the same convention, as getNativeDevice and getNativeContext. Otherwise, removing them entirely shouldn't be much of an issue.

Member

Yes, these should be in camelCase. However, I'm not sure I'm a fan. Are there use cases where you would need a sycl::device without its sycl::context? Even in this PR they are used together. This is why we decided to return a std::pair<sycl::device, sycl::context> in getNativeHandle().

Contributor Author

There are a few cases where we need the context but not the device: the allocation of pinned / mapped host memory.
I'll look into all the debug prints (also, in the sycl::free method we only need the context)

return m_impl->get_device();
}

auto get_context() const -> sycl::context
Member

Same as previous comment.

include/alpaka/event/EventGenericSycl.hpp Outdated Show resolved Hide resolved
@fwyzard
Contributor

fwyzard commented Nov 29, 2022

In particular, to allocate pinned memory on the host, SYCL needs the context of the specific device. For this reason the BufAllocMapped is templated on the device (which contains the context) and not on the platform like the other implementations. Two solutions can be proposed:

  • Keep the SYCL implementation as is and template all the other specialised traits on the device (while internally still only using its platform)
  • Move the sycl::context from the Alpaka device to the Alpaka platform, thus making the platform an object and, at the same time, allowing SYCL to be templated on it like all the other implementations do.

The difference is a bit more nuanced.

Currently, allocMappedBuf takes the platform as a template argument:

auto buf = alpaka::allocMappedBuf<PltfAcc, Val, Idx>(host, extent);

The approach used in this PR is to pass the accelerator device along, and get the sycl::context from there:

auto buf = alpaka::allocMappedBuf<Val, Idx>(host, device, extent);

This works, but IMHO it is not the preferred solution, because the mapped memory buffer is associated with all the devices in the same SYCL context, not only with the given one.

One of the options discussed this morning is to have the SYCL context as a data member of the PltfGenericSycl type, and change the Platform types to actual concrete types that one has to instantiate and pass around (explicitly, or implicitly through their devices).
This does require a change in interface:

auto buf = alpaka::allocMappedBuf<Val, Idx>(host, platform, extent);

The other option mentioned this morning (possibly what @j-stephan is looking into) is to have the SYCL context as a static data member of the PltfGenericSycl type.
This would indeed let us keep the current interface:

auto buf = alpaka::allocMappedBuf<PltfAcc, Val, Idx>(host, extent);

Personally, I'm neither in favour of nor against the approach based on static variables, if used correctly.

The first requirement that comes to mind is some form of lazy initialisation: the SYCL context should be initialised only if and when a SYCL device on the given platform is used, not at global object construction time. This is highly desirable because some SYCL platforms take a long time to initialise (e.g. SYCL's CUDA backend with more than one NVIDIA GPU present). I think it is also needed to support debugging (I've seen cuda-gdb fail if CUDA was initialised before the call to main()).

The other obvious requirement is thread safety: sharing the same SYCL context across threads should be safe both at construction time and during the rest of the program execution. This should be easy assuming the underlying sycl::context object is itself thread safe.

The last point is whether it should be possible to explicitly destroy the Alpaka platform and the associated SYCL context, or if it should only be destroyed automatically (e.g. by ref counting), or never (only implicitly at the end of the process, which might have consequences on debugging and profiling).

By the way, if the platforms gain a state (static or not), we could (re)consider also adding lazy refs to the devices, so that alpaka::getDevByIdx<Pltf>(i) may return the same device object instead of a new instance every time...
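The lazy-initialisation requirement above can be sketched with a function-local static (a Meyers singleton); this is a minimal plain-C++ sketch, with the SYCL context replaced by a hypothetical stand-in type:

```cpp
// Minimal sketch of lazy, thread-safe initialisation via a function-local
// static, which C++11 guarantees is constructed exactly once, on first use.
// FakeContext is a hypothetical stand-in for sycl::context: the point is
// that the expensive platform setup runs only if and when the platform is
// actually used, not at program start-up.
struct FakeContext
{
    FakeContext() { ++constructions; }
    static int constructions; // counts how many times the context was built
};
int FakeContext::constructions = 0;

inline FakeContext& platformContext()
{
    static FakeContext ctx; // constructed lazily, thread-safely, exactly once
    return ctx;
}
```

This covers the lazy-initialisation and construction-time thread-safety requirements; explicit destruction or ref-counted teardown, as discussed above, would need additional machinery.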

@Parsifal-2045
Contributor Author

Can we keep the removal of the experimental namespace out of this PR? We can either remove the namespace after your work is integrated or even schedule it before your PR.

That removal was mostly motivated by the ease of testing we gained without it. Adding it back, although certainly possible, would require a bit more work to reimplement all of the latest changes and would also make testing much more difficult on our side, since the generic alpaka interface would no longer work.

Member

@j-stephan j-stephan left a comment

Thanks! This looks like a promising effort. A few comments:

include/alpaka/acc/AccGenericSycl.hpp Outdated Show resolved Hide resolved
include/alpaka/kernel/TaskKernelGenericSycl.hpp Outdated Show resolved Hide resolved
Comment on lines 86 to 91
struct Accessor<
detail::SyclAccessor<TElem, DimInt<TDim>::value, TAccessModes>,
TElem,
TIdx,
TDim,
TAccessModes>
Member

Did clang-format do this?

Contributor Author

Yes, but it might be due to some other changes; it will probably be reverted.

Comment on lines 165 to 162
# if ALPAKA_DEBUG >= ALPAKA_DEBUG_FULL
auto const widthBytes = width * static_cast<TIdx>(sizeof(TElem));
std::cout << __func__ << " ew: " << width << " ewb: " << widthBytes << '\n';
# endif
Member

These shouldn't be entirely removed as we require that info for debugging purposes.

Contributor

In the new version of this method we don't have different cases to deal with the different dimensions, but a way to keep these debug prints could be something like this, lines 171 to 199. Does that work for you?

Comment on lines 193 to 191
# if ALPAKA_DEBUG >= ALPAKA_DEBUG_FULL
auto const widthBytes = width * static_cast<TIdx>(sizeof(TElem));
std::cout << __func__ << " ew: " << width << " eh: " << height << " ed: " << depth
<< " ewb: " << widthBytes << " pitch: " << widthBytes << '\n';
# endif
Member

And here.

include/alpaka/mem/buf/BufGenericSycl.hpp Outdated Show resolved Hide resolved
Comment on lines 117 to 132
if(static_cast<std::size_t>(this->m_extent.prod()) != 0u)
{
meta::ndLoopIncIdx(
extentWithoutInnermost,
[&](Vec<DimMin1, ExtentSize> const& idx)
{
queue.getNativeHandle().memcpy(
reinterpret_cast<void*>(
this->m_dstMemNative
+ (castVec<DstSize>(idx) * dstPitchBytesWithoutOutmost)
.foldrAll(std::plus<DstSize>())),
reinterpret_cast<void const*>(
this->m_srcMemNative
+ (castVec<SrcSize>(idx) * srcPitchBytesWithoutOutmost)
.foldrAll(std::plus<SrcSize>())),
static_cast<std::size_t>(this->m_extentWidthBytes));
});
}
Member

This would launch many memcpy operations if we are dealing with 2D or 3D buffers. I assume you are doing this in order to deal with offsets / slices in one of the source / destination views. Wouldn't it be more effective to write a specialized copy kernel for the 2D / 3D case?

Contributor

Does alpaka support the cases where

  • the source and destination buffers have different pitches;
  • only a subset of the source and destination buffers is copied?

Member

At least the CUDA implementation seems to support both points, so I'd argue for SYCL to do the same. CC @psychocoderHPC, please correct me if I'm wrong.

Comment on lines 253 to 258
//! The SYCL non-blocking device queue scalar copy enqueue trait specialization.
template<typename TPltf, typename TExtent, typename TViewSrc, typename TViewDst>
struct Enqueue<
alpaka::QueueGenericSyclNonBlocking<TPltf>,
alpaka::detail::TaskCopySycl<DimInt<0u>, TViewDst, TViewSrc, TExtent>>
{
Member

Why are you adding specializations for the Enqueue trait? Shouldn't the existing design (using objects that can be used as SYCL command groups) also fit for this use case?

Contributor

I don't think the existing design would work without the specialization, because we are using the memcpy method of the sycl::queue, not of the sycl::handler as it was before.

Member

And is there a reason for not calling cgh.memcpy and using the queue version instead?

Contributor

Not really, we were using the queue version in our application. We've just tried cgh.memcpy and it seems to work; we'll do some more tests tomorrow and then change it.

Contributor Author

In both cases, moving from the sycl::queue's methods to the sycl::handler's methods means we lose the ability to Copy/Set in N dimensions as we are doing right now. We would need a specialised method for that, since one handler task can issue only a single memory operation / kernel.

Comment on lines 107 to 117
meta::ndLoopIncIdx(
extentWithoutInnermost,
[&](Vec<DimMin1, ExtentSize> const& idx)
{
queue.getNativeHandle().memset(
reinterpret_cast<void*>(
this->m_dstMemNative
+ (castVec<DstSize>(idx) * dstPitchBytesWithoutOutmost)
.foldrAll(std::plus<DstSize>())),
this->m_byte,
static_cast<std::size_t>(this->m_extentWidthBytes));
});
Member

The same questions as for the copy operations also apply to this file.

@j-stephan
Member

Some other points:

  1. The namespace changes should be part of a separate PR. They add a lot of noise for an otherwise (relatively) small change (in terms of LOC).
  2. The AMD/Xilinx types need special handling because the AMD/Xilinx implementation doesn't support USM pointers yet.

The other option mentioned this morning (possibly what @j-stephan is looking into) is to have the SYCL context as a static data member of the PltfGenericSycl type.

This is indeed what I'm currently investigating. The challenge here is that we would need to have the context not as part of PltfGenericSycl but as part of the various hardware-specific platforms. Otherwise it wouldn't be possible to mix platforms from multiple vendors in the same alpaka program.

However, this requires a refactoring of the existing (alpaka) SYCL platform design. In the current design, having separate contexts in PltfCpuSyclIntel and PltfGpuSyclIntel would mean that we cannot share memory objects between them which wouldn't be very useful.

@fwyzard
Contributor

fwyzard commented Dec 2, 2022

However, this requires a refactoring of the existing (alpaka) SYCL platform design. In the current design, having separate contexts in PltfCpuSyclIntel and PltfGpuSyclIntel would mean that we cannot share memory objects between them which wouldn't be very useful.

You cannot share memory objects between the Intel OpenCL CPU and GPU platforms/devices.

@fwyzard
Contributor

fwyzard commented Dec 2, 2022

  1. The namespace changes should be part of a separate PR. They add a lot of noise for an otherwise (relatively) small change (in terms of LOC).

What I understood from @Parsifal-2045 is that the experimental namespace approach makes it impossible to use the SYCL backend in the usual tests.

The removal of the experimental namespace can be delayed to a separate PR only if there is a way to still use the tests with it.

  • The AMD/Xilinx types need special handling because the AMD/Xilinx implementation doesn't support USM pointers yet.

From our point of view, it means that those backends are simply unusable.
Well, from the point of view of the Alpaka tests, as well.

@j-stephan
Member

You cannot share memory objects between the Intel OpenCL CPU and GPU platforms/devices.

Wow, really? I actually never tested that because I assumed it would just work, everything being oneAPI and all. Is that documented somewhere? A quick Google search didn't bring up anything useful.

From our point of view, it means that those backends are simply unusable.

I tend to agree and I'm not a big fan of having separate code paths for them. However, our FPGA setup at HZDR is Xilinx-based so we need to keep it.

@fwyzard
Contributor

fwyzard commented Dec 2, 2022

You cannot share memory objects between the Intel OpenCL CPU and GPU platforms/devices.

Wow, really? I actually never tested that because I assumed it would just work, everything being oneAPI and all. Is that documented somewhere? A quick Google search didn't bring up anything useful.

SYCL memory objects are associated to the SYCL context.
The OpenCL CPU and GPU devices are in different contexts, so they cannot share memory objects.

@fwyzard
Contributor

fwyzard commented Dec 2, 2022

From our point of view, it means that those backends are simply unusable.

I tend to agree and I'm not a big fan of having separate code paths for them. However, our FPGA setup at HZDR is Xilinx-based so we need to keep it.

Maybe we can introduce an intermediate layer:

  • generic SYCL device/platform/etc.
    • USM-based SYCL device/platform/etc.
      • Intel CPUs
      • Intel GPUs
      • Intel/Altera FPGAs
      • CUDA GPUs
      • ROCm GPUs
    • accessor-based SYCL device/platform/etc.
      • Xilinx/AMD FPGAs
      • possibly the others as well

and keep as much of the code as possible common?

@j-stephan
Member

j-stephan commented Dec 2, 2022

Maybe we can introduce an intermediate layer [...] and keep as much of the code as possible common?

Yes, that is what I'm attempting in my version of this PR. So far I basically have buffer / memcpy / memset specializations for Xilinx, and everything else is shared.

@Parsifal-2045
Contributor Author

Parsifal-2045 commented Dec 19, 2022

The latest push implements static members inside the alpaka platform for SYCL, as suggested in #1865. This allows allocMappedBuf to be used in the same way as in the other backends (two arguments, host and extent, while being templated on the platform, element type, TDim and TIdx). A previous push also implemented N-dimensional copy and set tasks for SYCL.

@fwyzard
Contributor

fwyzard commented Jan 17, 2023

@Parsifal-2045 now that #1865 has been merged, would you have time to rebase and update this PR?

@Parsifal-2045
Contributor Author

I can take a look in the next few days

@fwyzard
Contributor

fwyzard commented Jan 19, 2023

@Parsifal-2045 @j-stephan I see that with these changes the tests fail pretty soon.
What could be a viable strategy to eventually merge it?

For example, would it help if I prepared a separate PR with only the removal of the alpaka::experimental namespace for the SYCL classes?

@j-stephan
Member

Hi @fwyzard, I've been on vacation since last Wednesday. I'll be back on 06 February and will look into this.

@fwyzard
Contributor

fwyzard commented Jan 24, 2023

Hi @fwyzard, I've been on vacation since last Wednesday.

I heard this morning - congratulations :-)

@Parsifal-2045
Contributor Author

I have opened a supporting PR (#1910) with just the removal of the experimental namespace, if that makes it easier to review

@bernhardmgruber
Member

@Parsifal-2045 We merged #1910, so this PR can be rebased. I would strongly recommend squashing all changes into one commit before rebasing. If you need help, I can do that for you.

@bernhardmgruber
Member

@bernhardmgruber I think the platform changes broke the SYCL backend :-(

I am sorry to hear that. For the sake of getting this PR done, it's fine for me if you merge this PR with a workaround and add a FIXME comment in the code + a github issue to finish the transformation back to platform objects for the SYCL backend.

@SimeonEhrig
Member

@SimeonEhrig I have finally been able to build most of the tests with this PR on my laptop with Ubuntu 22.04 and oneAPI 2023.2.0:

cmake \
  -DCMAKE_CXX_COMPILER=/opt/intel/oneapi/compiler/latest/linux/bin/icpx \
  -DoneDPL_ROOT=/opt/intel/oneapi/dpl/latest \
  -DoneDPL_DIR=/opt/intel/oneapi/dpl/latest/lib/cmake/oneDPL \
  -DMKL_ROOT=/opt/intel/oneapi/mkl/latest \
  -DMKL_DIR=/opt/intel/oneapi/mkl/latest/lib/cmake/mkl \
  -DTBB_ROOT=/opt/intel/oneapi/tbb/latest \
  -DTBB_DIR=/opt/intel/oneapi/tbb/latest/lib/cmake/tbb \
  -DBUILD_TESTING=ON \
  -Dalpaka_ACC_SYCL_ENABLE=ON \
  -Dalpaka_SYCL_PLATFORM_ONEAPI=ON \
  -Dalpaka_SYCL_ONEAPI_CPU=ON \
  ../alpaka/

make -j4

make test

The explicit _ROOT and _DIR are needed to convince CMake to use the version of the libraries from oneAPI instead of the system ones.

Thank you. I will test it when I start developing the SYCL CI.

@j-stephan
Member

@fwyzard: Do you still need to set those paths when you execute source /opt/intel/oneapi/setvars.sh before?

@fwyzard
Contributor

fwyzard commented Jul 26, 2023

@fwyzard: Do you still need to set those paths when you execute source /opt/intel/oneapi/setvars.sh before?

At least some of them yes, otherwise CMake may pick some of the libraries from /usr/lib, and fail at link time.

Parsifal-2045 and others added 10 commits July 26, 2023 19:17
…ns (part 1)

Initial work to support the SYCL 2020 standard, using USM allocations instead of
SYCL buffers and accessors:
  - bring the SYCL interface in line with the other backends, and remove the last
    uses of the alpaka::experimental namespace;
  - reimplement the alpaka memory buffers, memset and memcpy tasks for the USM
    SYCL backend;
  - make the SYCL native handles more consistent with the other backends;
  - use the oneAPI printf extension, and implement a workaround for the OpenCL
    limitation on variadic functions and the conflict with AMD HIP/ROCm device code;
  - add more debug print messages;
  - various fixes for kernel names, memory_scope Grid and atomics;
  - update copyright information.

Initial work on the SYCL random number generators (not fully working yet).
…ns (part 2)

More changes to the SYCL backend:
  - move printf to alpaka/core and use it in ALPAKA_CHECK;
  - remove IsView -> false in mem/buf/sycl/Accessor;
  - remove wrong attribute in mem/buf/sycl/Copy;
  - remove the SYCL experimental BuildAccessor<BufGenericSycl>, use the default
    implementation from alpaka/mem/view.

Fix the examples to work with the SYCL backend:
  - fix the accelerator in the vectorAdd example;
  - move AccCpuSerial at the end in the ExampleDefaultAcc, as it was preventing
    the SYCL accelerators from being selected.

Complete the work on the SYCL random number generators.
…ns (part 3)

Update the documentation.

Implement various fixes to the SYCL math functions:
  - add missing "if constexpr" to rsqrt();
  - do not call math function with mixed arguments; this fixes errors due to
    the implicit conversion between floating point types of different sizes
    in sycl::atan2() and sycl::pow();
  - add explicit type casts to silence warnings;
  - cast the result of isfinite/isinf/isnan to bool.

Implement various fixes to the SYCL atomic functions:
  - fix the cas/compare_exchange loops;
  - clarify which atomic types are supported.

Implement various fixes to the SYCL warp-level functions:
  - fix compilation warnings;
  - extract bits from sub_group_mask.

Mark the use of global device variables and constants as unsupported: the SYCL
backend does not support global device variables and constants, yet.

Add explicit checks on the dimensionality of the SYCL accelerator and work division.

Silence warnings about the use of GNU extensions, and those coming from the
Intel oneMKL and oneDPL headers.

Update more tests for the SYCL backend:
  - add a special case for 0-dimensional tests;
  - disable the use of STL rand;
  - disable the test of global device variables and constants.
…ns (part 4)

Update the documentation related to FPGAs.

Various fixes and updates to the SYCL backend and tests, the copyright
information and code formatting.
Rewrite the N-dimensional Copy and Set memory operations to support pitched
memory buffers, based on the Cpu implementation. This may require more than one
memset or memcpy call per operation, which is not supported by command group
handlers. Rewrite the Copy and Set memory operations to use queues instead.
Introduce a new optional trait to describe at compile time the warp size that a
kernel should use.  The default behaviour is to let the back-end compiler pick
the preferred size.

Before launching a kernel with a compile-time sub-group size the user should
query the sizes supported by the device, and choose accordingly.  If the device
does not support the requested size, the SYCL runtime will throw a synchronous
exception.

During just-in-time (JIT) compilation this guarantees that a kernel is compiled
only for the sizes supported by the device.  During ahead-of-time (AOT)
compilation this is not enough, because the device is not known at compile
time.  The SYCL specification mandates that the back-end compilers should not
fail if a kernel uses unsupported features, like unsupported sub-group sizes.
Unfortunately the Intel OpenCL CPU and GPU compilers currently fail with a hard
error.  To work around this limitation, use the preprocessor macros defined
when compiling AOT for the new SYCL targets to enable the compilation only for
the sub-group sizes supported by each device.

Note: while the CPU OpenCL back-end does support a sub-group size of 64, the
SYCL code currently does not.  To avoid issues with the sub-group primitives
always consider the sub-group size of 64 as not supported by the device.

Other changes:
  - remove the use of SYCL streams in favour of the printf() extension;
  - remove the ALPAKA_FN_HOST attribute;
  - fix the GetSize test for the different sub-group sizes;
  - fix the use of sycl::exceptions;
  - use different member names for nd_item in different classes, to avoid
    ambiguous name lookup error when accessing the nd_item in the accelerator
    object.
  - add the missing specialization of CreateViewPlainPtr for SYCL devices
  - improve the comments on the ALPAKA_FN_INLINE macro
  - remove unnecessary ALPAKA_FN_HOST attributes
  - rename QueueGenericSyclBase::m_impl to m_spQueueImpl, to align with the other back-ends
@fwyzard
Contributor

fwyzard commented Jul 26, 2023

@bernhardmgruber it turns out that the changes to the platforms were good.
The problem comes from the delegating constructor of KernelExecutionFixture, that uses the data member m_platformAcc before it has been initialised.

A fix is in #2021.

@bernhardmgruber
Member

@bernhardmgruber it turns out that the changes to the platforms were good. The problem comes from the delegating constructor of KernelExecutionFixture, that uses the data member m_platformAcc before it has been initialised.

A fix is in #2021.

You are amazing! Thank you so much :) The proposed PR also LGTM. Great work!

@fwyzard fwyzard requested a review from j-stephan July 27, 2023 06:22
@fwyzard fwyzard dismissed j-stephan’s stale review July 27, 2023 08:27

Agreed to keep the full set of changes in this PR

@fwyzard fwyzard merged commit b5d541b into alpaka-group:develop Jul 27, 2023
20 checks passed
@fwyzard
Contributor

fwyzard commented Jul 27, 2023

🎉

@j-stephan
Member

Well, thanks for the good work! This was our largest PR so far (in terms of comments and reviews). Glad to see it accepted!

Successfully merging this pull request may close these issues.

  • Fail to compile SYCL backend
  • Add USM pointers to Intel SYCL back-ends
6 participants