Rewrite the SYCL backend for the SYCL 2020 standard and USM allocations #1845
Conversation
I started reviewing the changeset, but it is cluttered with too many namespace removals to stay focused. Can we keep the removal of the experimental namespace out of this PR? We can either remove the namespace after your work is integrated, or even schedule it before your PR. @j-stephan and @psychocoderHPC, what would you prefer?
auto get_device() const -> sycl::device
{
Should we maybe call those get_native_device? We tend to use the word native when we expose platform specifics. See also getNativeHandle, which btw already offers this functionality. Should we have your function at all then?
The two functions get_device and get_context were implemented as commodities for other pieces of code more than anything. In particular, sometimes we only need one of the two (for example, to allocate pinned memory on the host we just need the device's context) and I find it clearer to ask for either the device or the context with a specific function instead of relying on the pair. I might reimplement them using getNativeHandle and maybe rename them, following the same convention, as getNativeDevice and getNativeContext. Otherwise, removing them entirely shouldn't be much of an issue.
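To illustrate that refactoring idea, here is a minimal plain-C++ sketch (not the actual alpaka API; `Device`, `Context` and `DevSketch` are hypothetical stand-ins for `sycl::device`, `sycl::context` and the device class) showing the two convenience accessors layered on top of a `getNativeHandle()` that returns a pair:

```cpp
#include <string>
#include <utility>

// Stand-ins for sycl::device and sycl::context.
struct Device  { std::string name; };
struct Context { std::string name; };

struct DevSketch
{
    Device  m_device{"gpu0"};
    Context m_context{"ctx0"};

    // Existing accessor: returns both handles as a pair.
    auto getNativeHandle() const -> std::pair<Device, Context>
    {
        return {m_device, m_context};
    }

    // Convenience accessors layered on top of the pair, following the
    // camelCase naming convention discussed above.
    auto getNativeDevice() const -> Device   { return getNativeHandle().first; }
    auto getNativeContext() const -> Context { return getNativeHandle().second; }
};
```

With this layering, call sites that only need the context (e.g. pinned host allocations) can ask for just that piece, while `getNativeHandle()` remains the single source of truth.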
Yes, these should be in camelCase. However, I'm not sure I'm a fan. Are there use cases where you would need a sycl::device without its sycl::context? Even in this PR they are used together. This is why we decided to return a std::pair<sycl::device, sycl::context> in getNativeHandle().
There are a few cases where we need the context but not the device: the allocation of pinned / mapped host memory. I'll look into all the debug prints (also, in the sycl::free method we only need the context).
    return m_impl->get_device();
}

auto get_context() const -> sycl::context
Same as previous comment.
The difference is a bit more nuanced. Currently, a mapped buffer is allocated as:

```cpp
auto buf = alpaka::allocMappedBuf<PltfAcc, Val, Idx>(host, extent);
```

The approach used in this PR is to pass the accelerator device along:

```cpp
auto buf = alpaka::allocMappedBuf<Val, Idx>(host, device, extent);
```

This works, but IMHO is not the preferred solution, because the mapped memory buffer is associated to all the devices in the same SYCL context, not only to the given one.

One of the options discussed this morning is to have the SYCL context as a data member of the platform object:

```cpp
auto buf = alpaka::allocMappedBuf<Val, Idx>(host, platform, extent);
```

The other option mentioned this morning (possibly what @j-stephan is looking into) is to have the SYCL context as a static member, keeping the current interface:

```cpp
auto buf = alpaka::allocMappedBuf<PltfAcc, Val, Idx>(host, extent);
```

Personally, I'm neither in favour of nor against the approach based on stateful platforms.

The first requirement that comes to mind is some form of lazy initialisation: the SYCL context should be initialised only if and when a SYCL device on the given platform is used, not at global object construction time. This is highly desirable because some SYCL platforms take a long time to initialise (e.g. SYCL's CUDA backend with more than one NVIDIA GPU present). I think it is also needed to support debugging (I've seen cuda-gdb fail if CUDA was initialised before the call to …).

The other obvious requirement is thread safety: sharing the same SYCL context across threads should be safe both at construction time and during the rest of the program execution. This should be easy, assuming the underlying SYCL implementation is itself thread safe.

The last point is whether it should be possible to explicitly destroy the Alpaka platform and the associated SYCL context, or if it should only be destroyed automatically (e.g. by ref counting), or never (only implicitly at the end of the process, which might have consequences on debugging and profiling).

By the way, if the platforms gain a state (…)
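The lazy-initialisation requirement can be sketched in plain C++ with a function-local static (a hypothetical stand-in, not alpaka code): the object is constructed on first use rather than at program start, and since C++11 that construction is guaranteed to happen exactly once even under concurrent first calls.

```cpp
#include <string>

// Stand-in for sycl::context; assume its constructor is expensive
// (platform initialisation).
struct Context
{
    std::string name{"ctx"};
};

// The context is created the first time any code asks for it; all callers
// (and all threads) share the same instance. C++11 guarantees that the
// initialisation of a function-local static is thread safe.
inline auto platformContext() -> Context&
{
    static Context ctx;  // constructed lazily, on first call
    return ctx;
}
```

This covers both requirements above: nothing is initialised at global construction time, and concurrent first use is safe without explicit locking.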
That removal was mostly motivated by the testing ease that we gained without it. Adding it back in, although certainly possible, would require a bit more work to reimplement all of the latest changes and would also make testing way more difficult on our side, since the generic alpaka interface wouldn't work anymore.
Thanks! This looks like a promising effort. A few comments:
struct Accessor<
    detail::SyclAccessor<TElem, DimInt<TDim>::value, TAccessModes>,
    TElem,
    TIdx,
    TDim,
    TAccessModes>
Did clang-format do this?
Yes, but it might be due to some other changes; it will probably be reverted.
# if ALPAKA_DEBUG >= ALPAKA_DEBUG_FULL
auto const widthBytes = width * static_cast<TIdx>(sizeof(TElem));
std::cout << __func__ << " ew: " << width << " ewb: " << widthBytes << '\n';
# endif
These shouldn't be entirely removed as we require that info for debugging purposes.
In the new version of this method we don't have different cases to deal with the different dimensions, but a way to keep these debug prints could be this, lines 171 to 199. Does it work for you?
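A hedged sketch of what such a dimension-generic debug print could look like (this is an illustration only, not the code in the linked lines; the labels mirror the prints shown in the diffs):

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Build a debug line like "ew: <w> eh: <h> ed: <d> bytes: <total>" for an
// extent of rank 1 to 3, replacing the per-dimension special cases with a
// single loop over the extent.
template<std::size_t Dim>
auto debugLine(std::array<std::size_t, Dim> const& extent, std::size_t elemBytes)
    -> std::string
{
    static constexpr char const* labels[] = {"ew", "eh", "ed"};
    static_assert(Dim >= 1 && Dim <= 3, "only 1D..3D extents are labelled");

    std::ostringstream os;
    std::size_t elems = 1;
    for(std::size_t i = 0; i < Dim; ++i)
    {
        os << labels[i] << ": " << extent[i] << ' ';
        elems *= extent[i];
    }
    os << "bytes: " << elems * elemBytes;
    return os.str();
}
```

One such helper can serve the 1D, 2D and 3D code paths, so the debug output survives the merge of the per-dimension specializations.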
# if ALPAKA_DEBUG >= ALPAKA_DEBUG_FULL
auto const widthBytes = width * static_cast<TIdx>(sizeof(TElem));
std::cout << __func__ << " ew: " << width << " eh: " << height << " ed: " << depth
          << " ewb: " << widthBytes << " pitch: " << widthBytes << '\n';
# endif
And here.
include/alpaka/mem/buf/sycl/Copy.hpp
Outdated
if(static_cast<std::size_t>(this->m_extent.prod()) != 0u)
{
    meta::ndLoopIncIdx(
        extentWithoutInnermost,
        [&](Vec<DimMin1, ExtentSize> const& idx)
        {
            queue.getNativeHandle().memcpy(
                reinterpret_cast<void*>(
                    this->m_dstMemNative
                    + (castVec<DstSize>(idx) * dstPitchBytesWithoutOutmost)
                          .foldrAll(std::plus<DstSize>())),
                reinterpret_cast<void const*>(
                    this->m_srcMemNative
                    + (castVec<SrcSize>(idx) * srcPitchBytesWithoutOutmost)
                          .foldrAll(std::plus<SrcSize>())),
                static_cast<std::size_t>(this->m_extentWidthBytes));
        });
}
This would launch many memcpy operations if we are dealing with 2D or 3D buffers. I assume you are doing this in order to deal with offsets / slices in one of the source / destination views. Wouldn't it be more effective to write a specialized copy kernel for the 2D / 3D case?
Does Alpaka support the cases where
- the source and destination buffers have different pitch
- only a subset of the source and destination buffers is copied
?
At least the CUDA implementation seems to support both points, so I'd argue for SYCL to do the same. CC @psychocoderHPC, please correct me if I'm wrong.
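Those two cases (different pitches, partial copies) are exactly what forces a per-row copy. A plain-C++ sketch of the pattern (a hypothetical helper, not the alpaka implementation):

```cpp
#include <cstddef>
#include <cstring>

// Copy a widthBytes x height region between two pitched 2D allocations.
// When srcPitch != dstPitch, or when only a sub-region is copied, the rows
// are not contiguous in memory, so one memcpy per row is required.
inline void copy2D(
    unsigned char* dst, std::size_t dstPitch,
    unsigned char const* src, std::size_t srcPitch,
    std::size_t widthBytes, std::size_t height)
{
    for(std::size_t row = 0; row < height; ++row)
        std::memcpy(dst + row * dstPitch, src + row * srcPitch, widthBytes);
}
```

This is the same structure as the ndLoopIncIdx loop in the diff above: the outer dimensions are iterated on the host, and only the innermost, contiguous run of widthBytes is handed to memcpy.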
include/alpaka/mem/buf/sycl/Copy.hpp
Outdated
//! The SYCL non-blocking device queue scalar copy enqueue trait specialization.
template<typename TPltf, typename TExtent, typename TViewSrc, typename TViewDst>
struct Enqueue<
    alpaka::QueueGenericSyclNonBlocking<TPltf>,
    alpaka::detail::TaskCopySycl<DimInt<0u>, TViewDst, TViewSrc, TExtent>>
{
Why are you adding specializations for the Enqueue trait? Shouldn't the existing design (using objects that can be used as SYCL command groups) also fit for this use case?
I don't think the existing design without the specialization would work, because we are using the memcpy method of the sycl::queue, not of the sycl::handler as it was before.
And is there a reason for not calling cgh.memcpy and using the queue version instead?
Not really, we were using the queue version in our application. We've just tried with cgh.memcpy and it seems to work; we'll do some more tests tomorrow and then change it.
In both cases, moving from the sycl::queue's methods to the sycl::handler's methods means we lose the ability to copy/set in N dimensions as we are doing right now. We would need a specialised method for that, since one handler task can call a single memory operation / kernel.
include/alpaka/mem/buf/sycl/Set.hpp
Outdated
meta::ndLoopIncIdx(
    extentWithoutInnermost,
    [&](Vec<DimMin1, ExtentSize> const& idx)
    {
        queue.getNativeHandle().memset(
            reinterpret_cast<void*>(
                this->m_dstMemNative
                + (castVec<DstSize>(idx) * dstPitchBytesWithoutOutmost)
                      .foldrAll(std::plus<DstSize>())),
            this->m_byte,
            static_cast<std::size_t>(this->m_extentWidthBytes));
    });
The same questions as for the copy operations also apply to this file.
Some other points: …
This is indeed what I'm currently investigating. The challenge here is that we would need to have the context not as part of … However, this requires a refactoring of the existing (alpaka) SYCL platform design. In the current design, having separate contexts in …
You cannot share memory objects between the Intel OpenCL CPU and GPU platforms/devices.
What I understood from @Parsifal-2045 is that the …
The removal of the …
From our point of view, it means that those backends are simply unusable.
Wow, really? I actually never tested that because I assumed it would just work, everything being oneAPI and all. Is that documented somewhere? A quick Google search didn't bring up anything useful.
I tend to agree and I'm not a big fan of having separate code paths for them. However, our FPGA setup at HZDR is Xilinx-based, so we need to keep it.
SYCL memory objects are associated to the SYCL context.
Maybe we can introduce an intermediate layer: … and keep as much as possible of the code common?
Yes, that is what I'm trying to attempt in my version of this PR. So far I basically have buffer / memcpy / memset specializations for Xilinx and everything else is shared.
The latest push implements static members inside the Alpaka platform for SYCL, as suggested in #1865. This allows us to use …
@Parsifal-2045 now that #1865 has been merged, would you have time to rebase and update this PR?
I can take a look in the next few days
Force-pushed from 15eb7aa to 4894703
@Parsifal-2045 @j-stephan I see that with these changes the tests fail pretty soon. Would it help if I prepare a separate PR with only the removal of the experimental namespace?
Hi @fwyzard, I've been on vacation since last Wednesday. I'll be back on 06 February and will look into this.
I heard this morning - congratulations :-)
I have opened a supporting PR (#1910) with just the removal of the experimental namespace, if that makes it easier to review |
@Parsifal-2045 We merged #1910, so this PR can be rebased. I would strongly recommend squashing all changes into one commit before rebasing. If you need help, I can do that for you.
I am sorry to hear that. For the sake of getting this PR done, it's fine for me if you merge this PR with a workaround and add a FIXME comment in the code + a GitHub issue to finish the transformation back to platform objects for the SYCL backend.
Thank you. I will test it when I start developing the SYCL CI.
@fwyzard: Do you still need to set those paths when you execute …?
At least some of them, yes, otherwise CMake may pick some of the libraries from …
…ns (part 1) Initial work to support the SYCL 2020 standard, using USM allocations instead of SYCL buffers and accessors: - bring the SYCL interface in line with the other backends, and remove the last uses of the alpaka::experimental namespace; - reimplement the alpaka memory buffers, memset and memcpy tasks for the USM SYCL backend; - make the SYCL native handles more consistent with the other backends; - use the oneAPI printf extension, and implement a workaround for the OpenCL limitation on variadic functions and the conflict with AMD HIP/ROCm device code; - add more debug print messages; - various fixes for kernel names, memory_scope Grid and atomics; - update copyright information. Initial work on the SYCL random number generators (not fully working yet).
…ns (part 2) More changes to the SYCL backend: - move printf to alpaka/core and use it in ALPAKA_CHECK; - remove IsView -> false in mem/buf/sycl/Accessor; - remove wrong attribute in mem/buf/sycl/Copy; - remove the SYCL experimental BuildAccessor<BufGenericSycl>, use the default implementation from alpaka/mem/view. Fix the examples to work with the SYCL backend: - fix the accelerator in the vectorAdd example; - move AccCpuSerial at the end in the ExampleDefaultAcc, as it was preventing the SYCL accelerators from being selected. Complete the work on the SYCL random number generators.
…ns (part 3) Update the documentation. Implement various fixes to the SYCL math functions: - add missing "if constexpr" to rsqrt(); - do not call math functions with mixed arguments; this fixes errors due to the implicit conversion between floating point types of different sizes in sycl::atan2() and sycl::pow(); - add explicit type casts to silence warnings; - cast the result of isfinite/isinf/isnan to bool. Implement various fixes to the SYCL atomic functions: - fix the cas/compare_exchange loops; - clarify which atomic types are supported. Implement various fixes to the SYCL warp-level functions: - fix compilation warnings; - extract bits from sub_group_mask. Mark the use of global device variables and constants as unsupported: the SYCL backend does not support global device variables and constants, yet. Add explicit checks on the dimensionality of the SYCL accelerator and work division. Silence warnings about the use of GNU extensions, and those coming from the Intel oneMKL and oneDPL headers. Update more tests for the SYCL backend: - add a special case for 0-dimensional tests; - disable the use of STL rand; - disable the test of global device variables and constants.
…ns (part 4) Update the documentation related to FPGAs. Various fixes and updates to the SYCL backend and tests, the copyright information and code formatting.
Rewrite the N-dimensional Copy and Set memory operations to support pitched memory buffers, based on the Cpu implementation. This may require more than one memset or memcpy call per operation, which is not supported by command group handlers. Rewrite the Copy and Set memory operations to use queues instead.
Introduce a new optional trait to describe at compile time the warp size that a kernel should use. The default behaviour is to let the back-end compiler pick the preferred size. Before launching a kernel with a compile-time sub-group size the user should query the sizes supported by the device, and choose accordingly. If the device does not support the requested size, the SYCL runtime will throw a synchronous exception. During just-in-time (JIT) compilation this guarantees that a kernel is compiled only for the sizes supported by the device. During ahead-of-time (AOT) compilation this is not enough, because the device is not known at compile time. The SYCL specification mandates that the back-end compilers should not fail if a kernel uses unsupported features, like unsupported sub-group sizes. Unfortunately the Intel OpenCL CPU and GPU compilers currently fail with a hard error. To work around this limitation, use the preprocessor macros defined when compiling AOT for the new SYCL targets to enable the compilation only for the sub-group sizes supported by each device. Note: while the CPU OpenCL back-end does support a sub-group size of 64, the SYCL code currently does not. To avoid issues with the sub-group primitives always consider the sub-group size of 64 as not supported by the device. Other changes: - remove the use of SYCL streams in favour of the printf() extension; - remove the ALPAKA_FN_HOST attribute; - fix the GetSize test for the different sub-group sizes; - fix the use of sycl::exceptions; - use different member names for nd_item in different classes, to avoid ambiguous name lookup error when accessing the nd_item in the accelerator object.
- add the missing specialization of CreateViewPlainPtr for SYCL devices - improve the comments on the ALPAKA_FN_INLINE macro - remove unnecessary ALPAKA_FN_HOST attributes - rename QueueGenericSyclBase::m_impl to m_spQueueImpl, to align with the other back-ends
@bernhardmgruber it turns out that the changes to the platforms were good. A fix is in #2021.
You are amazing! Thank you so much :) The proposed PR also LGTM. Great work!
Agreed to keep the full set of changes in this PR
🎉
Well, thanks for the good work! This was our largest PR so far (in terms of comments and reviews). Glad to see it accepted!
Rewrite the SYCL backend to support the SYCL 2020 standard, using USM allocations instead of SYCL buffers and accessors.
Few highlights:
Kernel trait for compile-time sub-group size
Introduce a new optional trait to describe at compile time the warp size that a kernel should use. The default behaviour is to let the back-end compiler pick the preferred size.
Before launching a kernel with a compile-time sub-group size the user should query the sizes supported by the device, and choose accordingly. If the device does not support the requested size, the SYCL runtime will throw a synchronous exception.
During just-in-time (JIT) compilation this guarantees that a kernel is compiled only for the sizes supported by the device. During ahead-of-time (AOT) compilation this is not enough, because the device is not known at compile time. The SYCL specification mandates that the back-end compilers should not fail if a kernel uses unsupported features, like unsupported sub-group sizes. Unfortunately the Intel OpenCL CPU and GPU compilers currently fail with a hard error. To work around this limitation, use the preprocessor macros defined when compiling AOT for the new SYCL targets to enable the compilation only for the sub-group sizes supported by each device.
Note: while the CPU OpenCL back-end does support a sub-group size of 64, the SYCL code currently does not. To avoid issues with the sub-group primitives always consider the sub-group size of 64 as not supported by the device.
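The AOT work-around described above can be illustrated with plain preprocessor guards (the macro names below are made up for this sketch; the real ones are the target macros defined by the compiler when building AOT for each SYCL device):

```cpp
#include <vector>

// Pretend the compiler defined these for a target that supports sub-group
// sizes 8 and 16 (hypothetical names, for illustration only).
#define SKETCH_TARGET_HAS_SG8 1
#define SKETCH_TARGET_HAS_SG16 1
// SKETCH_TARGET_HAS_SG64 intentionally not defined: as noted above, size 64
// is always treated as unsupported by the SYCL code.

// Each branch would wrap the instantiation of the kernel for that
// compile-time sub-group size; here it just records which sizes compile.
inline std::vector<int> compiledSubGroupSizes()
{
    std::vector<int> sizes;
#ifdef SKETCH_TARGET_HAS_SG8
    sizes.push_back(8);
#endif
#ifdef SKETCH_TARGET_HAS_SG16
    sizes.push_back(16);
#endif
#ifdef SKETCH_TARGET_HAS_SG64
    sizes.push_back(64);
#endif
    return sizes;
}
```

Because the guards are resolved by the preprocessor, kernels for unsupported sub-group sizes are never handed to the AOT back-end compiler, sidestepping the hard error from the Intel OpenCL CPU and GPU compilers.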