

## Bringing performant support for Nvidia hardware to SYCL

Ruyman Reyes Castro

Principal Software Engineer, Programming Models



IWOCL 2020

### Codeplay - Enabling AI to be Open, Safe and Accessible to all

**(**) codeplay

SOFTWARE

#### Products

#### Acoran

Integrates all the industry standard technologies needed to support a very wide range of AI and HPC

C++ platform via the SYCL<sup>™</sup> open standard, enabling vision & machine learning e.g. TensorFlow<sup>™</sup>

▲ **Compute**Aorta<sup>™</sup> The heart of Codeplay's compute technology enabling OpenCL<sup>™</sup>, SPIR-V<sup>™</sup>, HSA<sup>™</sup> and Vulkan<sup>™</sup>

#### Company

Leaders in enabling high-performance software solutions for new AI processing systems

Enabling the toughest processors with tools and middleware based on open standards

Established 2002 in Scotland with ~80 employees

#### Addressable Markets

High Performance Compute (HPC) Automotive ADAS, IoT, Cloud Compute Smartphones & Tablets Medical & Industrial

> **Technologies:** Artificial Intelligence Vision Processing Machine Learning Big Data Compute





## What have we done

## CUDA backend for Intel SYCL implementation

- Does not require OpenCL
- All contributions in the open

## SYCL Standard contributions

- Experience of porting SYCL to non-OpenCL backend
- Multiple extensions that enable CUDA-specific features
- Overall porting experience

## DPC++ and SYCL (and Codeplay)

Data Parallel C++ : C++ and SYCL\* standard and extensions

 "Incorporates" the SYCL standard for data parallelism and heterogeneous programming Data Parallel C++ ⇔ DPC++ DPC++ Extends SYCL 1.2.1

- Fast-moving open collaboration feeding into the SYCL standard
- Open source implementation with goal of upstream LLVM
- DPC++ extensions aim to become core SYCL, or Khronos extensions

#### Codeplay involvement

- Contribute back to the community from an independent codebase
- Explore extensions and actively participate on oneAPI initiative



## Disclaimer and Trademarks

- NVIDIA, the NVIDIA logo and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries
- Codeplay is not associated with NVIDIA for this work and it is purely using public documentation and widely available code

## Summary

- Using SYCL for CUDA
- Overall design of SYCL for CUDA
- Compiler implementation
- Runtime implementation
- Interoperability with existing libraries
- Conclusions and future work





## Using SYCL for CUDA





## Using SYCL for CUDA

- Build or get a binary package of DPC++
  - Daily builds of master in <a href="https://github.com/intel/llvm/releases">https://github.com/intel/llvm/releases</a>
  - Detailed instructions in <a href="https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md">https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md</a>
- Compile your code using the CUDA target triple

clang++ -fsycl-targets=nvptx64-nvidia-cuda-sycldevice simple-sycl-app.cpp -o simple-sycl-app-cuda.exe No changes required to your SYCL code

• Run your application with the CUDA backend enabled

SYCL\_BE=PI\_CUDA ./simple-sycl-app-cuda.exe

Env var used by default device selection



## Design of SYCL for CUDA





## SYCL for CUDA

#### SYCL 1.2.1 was intended for OpenCL 1.2

- If a SYCL 2.2 ever existed, it was based on OpenCL 2.2
- What could be a good alternative target to demonstrate SYCL as a High Level Model?
- Let's have an open discussion about SYCL for non-OpenCL!

#### Sure let's do Vulkan!

- Not that simple, SYCL was never designed for Graphics
- Already a potential path via clspv + clvk

#### Have you heard about CUDA?

- Existing OpenCL + PTX path (available on ComputeCpp) not great
  - Difficult to maintain but no customer base
- Native CUDA support will be better to expand the ecosystem

## SYCL 1.2.1 on CUDA

- What can work?
  - Platform model (Platform/Device/Context)
  - Buffers, copy
  - NDRange kernels
- What cannot work
  - Interoperability (no OpenCL types!)
  - Images and samplers
    - CUDA images are sampled on construction
    - SYCL/OpenCL Images are sampled in the kernel
  - SYCL program class
    - OpenCL compilation model does not match CUDA (e.g. options are different)

#### We have created a number of proposals and provide feedback to the SYCL WG to make those implementable on a future SYCL version



## Main outcome: SYCL "generalization"



https://github.com/KhronosGroup/SYCL-Shared/blob/master/proposals/sycl\_generalization.md

## SYCL module objects

#### module<module status status> kern0 kern0 kern0 device image device image device image **SPIR-V SPIR-V** ISA with debug

A SYCL module represents a collection of functions and symbols that can be used for all devices in the associated context.

A SYCL module can store different versions of the same functions and symbols in different representations. Each of these versions is called a device image.

https://github.com/KhronosGroup/SYCL-Shared/blob/master/proposals/sycl\_modules.md

#### This is a high-level abstraction, NOT a mapping of a SPIR-V or LLVM module

## Host task

```
auto cgH = [=] (handler& cgh) {
  auto accB = bufB.get_access<access::mode::write,</pre>
                              access::target::host buffer>(cgh);
  h.codeplay_host_task([=]() {
    std::ifstream ifs(some file name, std::ifstream::in);
    std::for_each(std::begin(accB), std::end(accA), [&](auto& elem) {
                                                                               Command group that runs a task
      if (!ifs.good()) {
        elem = 0;
                                                                               on the host inside the SYCL DAG
      } else {
        elem = ifs.get();
   });
 });
qA.submit(cgH);
```

#### https://github.com/codeplaysoftware/standards-proposals/blob/master/host\_task/host



## Compiler implementation





## Leveraging existing CUDA support

- Current LLVM tip has CUDA support
- This was contributed by Google back in 2016
   <a href="https://research.google/pubs/pub45226/">https://research.google/pubs/pub45226/</a>
- Includes a CUDA runtime implementation and a PTX backend
- The PTX backend is the interesting part!



## Driver (file compilation)



## Driver (linking)







## Converting local memory to Shared memory

```
buffer<cl::sycl::cl int, 1> buf(data, range<1>(size));
52
53
         myQueue.submit([&](handler& cgh) {
54
           auto ptr = buf.get access<access::mode::read write>(cgh);
55
56
           accessor<cl::sycl::cl int, 1, access::mode::read write,</pre>
57
                    access::target::local>
                                                                              Local memory allocation
58
               tile(range<1>(2), cgh);
59
60
           cgh.parallel for<example kernel>(
61
               nd range<1>(range<1>(size), range<1>(2)), [=](nd item<1> item) {
62
```

tile[pos] = ptr[item.get\_global\_linear\_id()];

Usage as an accessor

#### Multiple allocations of local memory are allowed





## Converting local memory to Shared memory



CUDA Dynamic Shared memory Declarations, each pointer refers To an element

Using CUDA Dynamic Shared memory in the CUDA runtime: Passing the total size of the allocation as last argument

myKernel<<<gridSize, blockSize, nI\*sizeof(int)+nF\*sizeof(float)+nC\*sizeof(char)>>>(...);

https://devblogs.nvidia.com/using-shared-memory-cuda-cc/





## Local to Shared transformation

define void @kernel(i8 addrspace(3)\* %arg1, i32 addrspace(3)\* %arg2) {

@kernel.shared = external addrspace(3) global [0 x i8], align 4

define void @kernel(i8 addrspace(3)\* %arg1, i32 addrspace(3)\* %arg2) {



21

Transform all pointers to CUDA shared memory into a 32 bit integer

Create a global symbol

to the CUDA shared

memory address space

Replace all uses of the pointers by offsets into the shared memory

```
@kernel.shared = external addrspace(3) global [0 x i8], align 4
define void @kernel(i32 %0, i32 %1) {
```

@kernel\_shared\_mem = external addrspace(3) global [0 x i8], align 4



## Runtime implementation







## The PI API

#### https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/pi.h



## PI CUDA plugin equivalent (example)



https://github.com/intel/llvm/blob/sycl/sycl/plugins/cuda/pi\_cuda.cpp





# PI CUDA plugin equivalent (example) Construct a PI mem object auto piMemObj = std::unique\_ptr(pi\_mem) new \_pi\_mem{context, parentBuffer, allocMode, ptr, host\_ptr, size}); /// PI Mem mapping to a CUDA memory allocation

/// struct pi mem { using native\_type = CUdeviceptr; using pi context = pi context \*; pi\_context context\_; pi mem parent ; native\_type ptr\_; PI Mem object, no longer a 1 void \*hostPtr\_; 2 1 map! size t size ; size\_t mapOffset\_; void \*mapPtr\_; cl\_map\_flags mapFlags\_; std::atomic\_uint32\_t refCount\_

enum class alloc\_mode { classic, use\_host\_ptr } allocMode\_;



## Interoperability





## Using native libraries in SYCL

```
auto cgH = [=] (codeplay::handler& cgh) {
    //- Get device accessor to SYCL buffer (cannot be dereferenced directly in interop_task).
    auto accA = bufA.get_access<access::mode::read>(cgh);
    auto accB = bufB.get_access<access::mode::read_write>(cgh);
    h.interop_task([=](codeplay::interop_handle &handle) {
        Third_party_api(handle.get_queue(), // Get the OpenCL command queue to use, can be the fallback
            handle.get_buffer(accA), // Get the OpenCL mem object behind accA
            handle.get_buffer(accB)); // Get the OpenCL mem object behind accB
            // Assumes call has finish when exiting the task
        });
    };
    qA.submit(cgH);
```

https://github.com/codeplaysoftware/standards-proposals/blob/master/interop\_task/



## Calling CUDA libraries



29

## Conclusions and future work





## Preliminary performance results

BabelStream FP32 MB/s



31

#### Platform: CUDA 10.1 on Ge



http://uob-hpc.github.io/BabelStream

## Internal experimental branch

BabelStream FP32 MB/s



**()** codeplay<sup>®</sup>

32

## Conclusions

- DPC++ is a working SYCL 1.2.1 compiler with many extensions that enable oneAPI features
- CUDA backend is integrated into main trunk and is part of the DPC++ release
- Already lots of comments from community, issues and even contributed pull requests!
- Currently working towards conformance (as much as is possible) in SYCL 1.2.1





## Participate!

• Join us in the intel/llvm repository







@codeplaysoft info@codeplay.com

We're Hiring! colepan.com/caeers/

codeplay.com

