# The AMD ROCm<sup>™</sup> Platform (and its GPU programming models)

Speaker: Michael Klemm, Principal Member of Technical Staff (Compilers, Languages, Runtimes & Tools; Machine Learning & Software Engineering)

> AMD together we advance\_

#### **Developing for AMD Hardware**

- AMD Optimized CPU Compiler (AOCC)
- AMD Optimized CPU Libraries (AOCL)
- AMD ZenDNN
- AMD µProf
- AMD ROCm<sup>™</sup>
  - Heterogeneous-computing Interface for Portability (HIP)
  - OpenMP API
  - Machine Learning Frameworks
  - Acceleration Libraries
  - ROCm<sup>™</sup> Communication Libraries (RCCL)

In addition to numerous options with open source, community tools

## **AMD Compilers**

- AMD Optimizing C/C++ Compiler (AOCC)
  - Targets x86 AMD CPUs (no offloading)
  - C, C++ and Fortran compilers based on LLVM with extensive optimizations for AMD EPYC<sup>™</sup> processors
- ROCm<sup>™</sup> Compiler Collection
  - Supports offloading to AMD GPUs
  - C, C++ and Fortran compilers based on LLVM with additional open-source features and optimizations

#### **AMD Next-Gen Fortran Compiler – Public Downloads**

#### Introducing AMD's Next-Gen Fortran Compiler

We are excited to share a brief preview of AMD's Next-Gen Fortran Compiler, our new open source Fortran compiler supporting OpenMP offloading. AMD's Next-Gen Fortran Compiler is a downstream flavor of LLVM Flang, optimized for AMD GPUs. Our Next-Gen Fortran Compiler enables OpenMP offloading and offers a direct interface to ROCm and HIP. In this blog post you will:

1. Learn how to use AMD's Next-Gen Fortran Compiler to deploy and accelerate your Fortran codes on AMD GPUs using OpenMP offloading.
2. Learn how to use AMD's Next-Gen Fortran Compiler to interface with and invoke HIP and ROCm kernels.
3. See how AMD's Next-Gen Fortran Compiler OpenMP offloading exhibits competitive performance against native HIP/C++ codes, benchmarking on AMD GPUs.
4. Learn how to access a pre-production build of AMD's new Next-Gen Fortran Compiler.

#### **Our commitment to Fortran**

Fortran is a powerful programming language for scientific and engineering high performance computing applications and is core to many, some very crucial, HPC codebases. Fortran remains under active development as a standard, supporting both legacy and modern codebases. The need for a more modern Fortran compiler motivated the creation of the LLVM Flang project and AMD fully supports that path. In following with community trends, AMD's Next-Gen Fortran Compiler will be a downstream flavor of LLVM Flang and will in time supplant the current AMD Flang compiler, a downstream flavor of "Classic Flang".

Source: https://rocm.blogs.amd.com

#### [flang][driver] rename flang-new to flang #110023

Merged: kiranchandramo... merged 7 commits into llvm:main from BerkeleyLab:rename-flang-new 2 weeks ago

Source: https://github.com/llvm/llvm-project/



**AMD ROCm<sup>™</sup> Open Software Platform on AMD Instinct<sup>™</sup> GPUs**

| Layer | Components |
|-------|------------|
| Benchmarks & App Support | HPC Applications and Optimized Training / Inference Models: HPL/HPCG, Life Science, Geo Science, MLPERF |
| Operating Systems Support | Ubuntu, RHEL, SLES, CentOS |
| Cluster Deployment | Docker<sup>®</sup>, Singularity, Kubernetes<sup>®</sup>, SLURM |
| Framework Support | Kokkos/RAJA, PyTorch, TensorFlow |
| Libraries | BLAS, RAND, FFT, MIGraphX, MIVisionX, PRIM, SOLVER, ALUTION, SPARSE, THRUST, MIOpen, RCCL |
| Programming Models | HIP API, OpenMP<sup>®</sup> API, OpenCL™ |
| Development Toolchain | Compiler, Profiler, Tracer, Debugger, HIPIFY, GPUFort |
| Drivers & Runtime | GPU Device Drivers and ROCm Runtime |
| Deployment Tools | ROCm Validation Suite, ROCm Data Center Tool, ROCm SMI |

## **Common LLVM Compiler Backend**









#### **HIP Grid Fundamentals**





#### **HIP Kernel Example**

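```
#include <hip/hip_runtime.h>

// Each thread updates one element: d_y[i] = a * d_x[i] + d_y[i].
__global__
void saxpy(float a, const float* d_x, float* d_y, unsigned int size) {
  // Global index of this thread within the 1D grid.
  const unsigned int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Guard: the grid may contain more threads than there are elements.
  if (global_idx < size)
    d_y[global_idx] = a * d_x[global_idx] + d_y[global_idx];
}
```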
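To make the index arithmetic in the kernel concrete, the following host-only C++ sketch walks a 1D grid sequentially and applies the same per-thread logic. It is an illustration, not HIP code: the `Dim` struct and the functions `saxpy_thread` and `emulate_saxpy_grid` are hypothetical names introduced here, and a real GPU runs these iterations concurrently rather than in loop order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for HIP's built-in index variables (x dimension only).
struct Dim { unsigned int x; };

// Same body as the saxpy kernel, with the built-ins passed in explicitly.
void saxpy_thread(Dim blockIdx, Dim blockDim, Dim threadIdx,
                  float a, const float* x, float* y, unsigned int size) {
  const unsigned int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (global_idx < size)  // threads past the end of the arrays do nothing
    y[global_idx] = a * x[global_idx] + y[global_idx];
}

// Visits every (block, thread) pair of a 1D grid one after another.
void emulate_saxpy_grid(unsigned int num_blocks, unsigned int block_size,
                        float a, const std::vector<float>& x,
                        std::vector<float>& y) {
  assert(x.size() == y.size());
  const unsigned int size = static_cast<unsigned int>(x.size());
  for (unsigned int b = 0; b < num_blocks; ++b)
    for (unsigned int t = 0; t < block_size; ++t)
      saxpy_thread(Dim{b}, Dim{block_size}, Dim{t}, a, x.data(), y.data(), size);
}
```

Calling `emulate_saxpy_grid(2, 4, ...)` on five-element vectors runs eight logical threads; the last three fail the bounds check and leave memory untouched.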

#### **HIP Kernel Launch Example**

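```
float* d_x{}; float* d_y{};
// Allocate device buffers and copy the host vectors x and y to the GPU.
hipMalloc(&d_x, size_bytes);
hipMalloc(&d_y, size_bytes);
hipMemcpy(d_x, x.data(), size_bytes, hipMemcpyHostToDevice);
hipMemcpy(d_y, y.data(), size_bytes, hipMemcpyHostToDevice);
// Assumed continuation (not in the source): launch with 256 threads per block.
saxpy<<<(size + 255) / 256, 256>>>(a, d_x, d_y, size);
hipMemcpy(y.data(), d_y, size_bytes, hipMemcpyDeviceToHost);
hipFree(d_x); hipFree(d_y);
```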
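A launch has to supply at least `size` threads in total, so the number of blocks is usually computed with a ceiling division. The sketch below shows that arithmetic in plain C++; `blocks_for` and `idle_threads` are hypothetical helper names, and the code assumes `size + block_size` does not overflow `unsigned int`.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks such that
// num_blocks * block_size >= size (ceiling division).
unsigned int blocks_for(unsigned int size, unsigned int block_size) {
  assert(block_size > 0);
  return (size + block_size - 1) / block_size;
}

// Number of launched threads that fail the kernel's `global_idx < size` test.
unsigned int idle_threads(unsigned int size, unsigned int block_size) {
  return blocks_for(size, block_size) * block_size - size;
}
```

For 1000 elements and 256 threads per block this yields 4 blocks, of which the last contains 24 threads that the bounds check masks off.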

#### **HIP Porting From CUDA**

- **HIP** supports AMDGPU and CUDA
- Allows incremental porting
- Text-based **HIP** translation
- Compiler-based **HIP** translation
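As a rough intuition for what text-based translation means, the toy C++ sketch below renames a handful of CUDA runtime identifiers to their HIP counterparts by plain string replacement. It is not how the HIPIFY tools are implemented: `toy_hipify` is a hypothetical function and its table is a tiny hand-picked subset of the API.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy illustration only: replace a few known CUDA runtime names with
// their HIP equivalents wherever they occur in the source text.
std::string toy_hipify(std::string source) {
  static const std::vector<std::pair<std::string, std::string>> table = {
      {"cudaMemcpyHostToDevice", "hipMemcpyHostToDevice"},
      {"cudaMemcpy", "hipMemcpy"},
      {"cudaMalloc", "hipMalloc"},
      {"cudaFree", "hipFree"},
  };
  for (const auto& entry : table) {
    std::size_t pos = 0;
    while ((pos = source.find(entry.first, pos)) != std::string::npos) {
      source.replace(pos, entry.first.size(), entry.second);
      pos += entry.second.size();  // continue after the inserted text
    }
  }
  return source;
}
```

Plain substitution like this has no notion of scopes, macros, or types, which is one reason a compiler-based translation path is offered alongside the text-based one.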








## **OpenMP® API Fundamentals**












#### **OpenMP® API and C++**

```
void saxpy(float a, const float* x, float* y, unsigned int size) {
  // ...
}
```
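The body of this function did not survive on the slide. A plausible reconstruction, assuming it mirrors the Fortran `target teams distribute parallel do` version of saxpy shown later in this deck, is sketched below. The array sections `x[0:size]` and `y[0:size]` and the `std::vector` wrapper are assumptions added for illustration.

```cpp
#include <cassert>
#include <vector>

// Assumed reconstruction: C++ counterpart of the Fortran saxpy.
// The array sections tell the runtime how many elements to map.
// Without OpenMP enabled the pragma is ignored and the loop runs on the host.
void saxpy(float a, const float* x, float* y, unsigned int size) {
  #pragma omp target teams distribute parallel for \
      map(to: x[0:size]) map(tofrom: y[0:size])
  for (unsigned int i = 0; i < size; ++i)
    y[i] = a * x[i] + y[i];
}

// Hypothetical convenience wrapper so the routine can be called on vectors.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
  assert(x.size() == y.size());
  saxpy(a, x.data(), y.data(), static_cast<unsigned int>(x.size()));
}
```

Because the directive does not change the meaning of the loop, the same source can be built and checked on a machine without a GPU before enabling offloading.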

#### **OpenMP® API and C++ - Attribute Syntax**

```
void saxpy(float a, const float* x, float* y, unsigned int size) {
  // ...
}
```
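The slide body is missing here as well. OpenMP 5.1 allows a directive to be written as a C++ attribute instead of a pragma, so a likely shape of the example is the sketch below. The function name `saxpy_attr` and the array sections are assumptions, and a compiler without support for the attribute syntax may emit a warning and run the loop sequentially.

```cpp
#include <cassert>

// Assumed reconstruction using the OpenMP 5.1 attribute syntax: the text
// that would follow "#pragma omp" is wrapped in [[omp::directive(...)]]
// and attached to the loop statement.
void saxpy_attr(float a, const float* x, float* y, unsigned int size) {
  assert(x != nullptr && y != nullptr);
  [[omp::directive(target teams distribute parallel for
                   map(to: x[0:size]) map(tofrom: y[0:size]))]]
  for (unsigned int i = 0; i < size; ++i)
    y[i] = a * x[i] + y[i];
}
```

The attribute form keeps the directive inside the C++ grammar, which can be convenient for tooling that does not handle preprocessor pragmas well.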


#### **OpenMP® API and Fortran**

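```
subroutine saxpy(a, x, y, size)
  ! Declare size before it is used as an array bound.
  integer :: size, i
  real :: a, x(size), y(size)
  ! Offload the loop: x is copied to the device, y is copied both ways.
  !$omp target teams distribute parallel do map(to:x) map(tofrom:y)
  do i = 1, size
    y(i) = a*x(i)+y(i)
  enddo
end subroutine saxpy
```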
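The `map(to:x)` and `map(tofrom:y)` clauses describe data movement between host and device. The host-only C++ sketch below emulates that movement for a device with separate memory, using ordinary vectors as stand-in "device" buffers. The function name `saxpy_mapped` and the buffer names are hypothetical, and a real runtime may avoid some of these copies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only emulation of the copies implied by map(to:x) map(tofrom:y).
void saxpy_mapped(float a, const std::vector<float>& x, std::vector<float>& y) {
  assert(x.size() == y.size());
  std::vector<float> dev_x(x);  // map(to:x): host -> device on entry
  std::vector<float> dev_y(y);  // map(tofrom:y): host -> device on entry
  for (std::size_t i = 0; i < dev_x.size(); ++i)  // the offloaded loop
    dev_y[i] = a * dev_x[i] + dev_y[i];
  y = dev_y;                    // map(tofrom:y): device -> host on exit
  // dev_x is discarded here: data mapped with "to" is not copied back.
}
```

Choosing `to` for read-only inputs avoids a transfer back to the host that `tofrom` would otherwise perform.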

#### **OpenMP® API and Fortran**

```
subroutine saxpy(a, x, y, size)
  ! Declare size before it is used as an array bound.
  integer :: size
  real :: a, x(size), y(size)
  ! Whole-array statement distributed across the teams.
  !$omp target teams workdistribute map(to:x) map(tofrom:y)
  y = a * x + y
  !$omp end target teams workdistribute
end subroutine saxpy
```





#### C++ and StdPar

```
void saxpy(float a, std::vector<float> &x, std::vector<float> &y) {
  std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                 [a](float xi, float yi) { return a * xi + yi; });
}
```

#### C++ and StdPar

```
void saxpy(float a, std::vector<float> &x, std::vector<float> &y) {
  // The execution policy permits parallel and vectorized execution.
  std::transform(std::execution::par_unseq,
                 x.begin(), x.end(), y.begin(), y.begin(),
                 [a](float xi, float yi) { return a * xi + yi; });
}
```

#### Fortran and "StdPar"

```
subroutine saxpy(a, x, y, size)
  integer :: size, i
  real :: a, x(size), y(size)
  do concurrent (i=1:size)
    y(i) = a*x(i)+y(i)
  enddo
end subroutine saxpy
```

The `do concurrent` loop can be parallelized for OpenMP host threads (available now) and for AMD GPUs (work in progress).

#### **ROCm<sup>™</sup> Libraries**

```
rocblas_create_handle(&handle);
// -- Allocate and initialize/copy device d_x and d_y; h_alpha on host
rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);
rocblas_saxpy(handle, n, &h_alpha, d_x, incx, d_y, incy);
// -- Copy result back to host
rocblas_destroy_handle(handle);
```
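The `n`, `incx`, and `incy` arguments follow the usual BLAS convention: `n` elements are processed, and the increments give the stride between consecutive elements of each vector. A host reference in plain C++ can be useful for checking device results. The sketch below is such a reference, not part of rocBLAS: `saxpy_reference` is a hypothetical name, and it assumes positive increments only.

```cpp
#include <cassert>

// Host reference for the BLAS saxpy operation: y := alpha * x + y over n
// elements, reading x with stride incx and updating y with stride incy.
// Assumes positive increments; negative strides are not handled here.
void saxpy_reference(int n, float alpha, const float* x, int incx,
                     float* y, int incy) {
  assert(incx > 0 && incy > 0);
  for (int i = 0; i < n; ++i)
    y[i * incy] = alpha * x[i * incx] + y[i * incy];
}
```

With `incx = 2` the routine reads every other element of `x`, which is how a strided view such as a matrix column can be passed without copying.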







#### **AMD ROCm<sup>™</sup> Platform**

#### **High Performance**



- Powers the TOP500 list leader
- Solutions for HPC and AI
- Compilers, Libraries, Frameworks

#### **Open Source**



- Committed to open ecosystem
- Active community engagement
- Driving development

#### **Portable**



- Portable / Standardized languages
- Solutions for evolving accelerators
- Support via third-party libraries

#### Disclaimer

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

#### © 2025 Advanced Micro Devices, Inc. All rights reserved.

AMD, the AMD Arrow logo, EPYC, Instinct, ROCm and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. OpenCL is a trademark of Apple Inc. used by permission by Khronos Group, Inc. The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
