# Programming RISC-V accelerators via Fortran
**Authors**: Nick Brown, Jake Davies, Felix LeClair
> Corresponding author: Nick Brown
RISC-V Summit Europe, Paris, 12-15th May 2025
((1) EPCC, The University of Edinburgh, 47 Potterrow, Edinburgh, UK; (2) Tenstorrent, 2600 Great America Way, Santa Clara, CA, USA)
## 1 Introduction
Whilst RISC-V has grown rapidly in areas such as embedded computing, it is yet to gain significant traction in High Performance Computing (HPC). However, as we move further into the exascale era the HPC community will be faced with a range of new challenges, for instance the requirement to decarbonise workloads, and RISC-V has the potential to play an important role in addressing them.
Arguably, we are likely to first see adoption of RISC-V in HPC via PCIe accelerator cards. These can be easily fitted into existing systems and, because they enable centres to dip their toe into the RISC-V ecosystem whilst other parts of the supercomputer remain the same, they limit risk. Indeed, a range of RISC-V PCIe accelerator cards are already shipping, such as Esperanto’s ET-SoC and the Tensix family from Tenstorrent, with other products such as Inspire Semiconductor’s Thunderbird having been announced. However, the major challenge associated with all of these is that, to actually run codes on them, the developer must learn a new programming model, restructure their code to map to the architecture, and leverage the vendor’s API.
Fortran is the lingua franca of scientific computing: around 65% of codes running on ARCHER2, the UK national supercomputer, are written in Fortran, and these account for around 70% of the machine’s runtime. Ultimately, developers of these high performance codes want to run more complex problems at reduced time to solution, and the specialisation provided by RISC-V accelerators means that they can potentially deliver this whilst also providing energy benefits. However, a major barrier to the adoption of such technologies is the requirement for scientific programmers to significantly restructure their codes, potentially also having to rewrite them in different programming languages.
### 1.1 MLIR
Since it was first merged into mainstream LLVM in 2019, MLIR has become a popular framework for developing compilers. It comprises Intermediate Representation (IR) dialects, and transformations which undertake optimisations and convert the IR between dialects; because dialects sit at different levels of abstraction, they can be mixed and progressively lowered between. Ultimately, MLIR enables the reuse of compiler infrastructure, and via the MLIR framework developers can define their own IR dialects and transformations.
### 1.2 Flang
Flang is the LLVM community’s Fortran compiler and leverages MLIR by providing its own Fortran IR (FIR) and High Level Fortran IR (HLFIR) dialects. However, only a subset of the standard MLIR dialects are integrated with Flang, and Flang itself transforms straight from HLFIR and FIR into LLVM-IR without using any of the existing MLIR transformations or optimisations.
Conversely, the mlir-opt MLIR driver tool is unaware of the Flang dialects, so it is not possible to drive the wide range of MLIR transformations and optimisations from Flang’s IR. To this end, in [brown2024fully] we developed a transformation pass that lowers Flang’s HLFIR and FIR dialects into standard MLIR dialects. The first benefit of this is that the user is then able to leverage the existing MLIR transformations, which are developed and maintained by a large community including many vendors, to generate LLVM-IR. The second benefit is that it provides a much wider range of potential target architectures, including GPUs.
## 2 Flang for RISC-V accelerators
![Compilation flowchart: Fortran source is parsed by Flang into HLFIR & FIR, lowered to standard MLIR dialects, then branches into a CPU path (host dialect, func dialect, LLVM dialect, LLVM IR) and a RISC-V accelerator path (accelerator device-side dialects, optimised and printed as C/C++ calling the device API).](extracted/6552213/images/flang-to-rv-acclerator.png)
Figure 1: Illustration of our approach lowering Flang to target RISC-V accelerators
Figure 1 provides an overview of our approach, where the existing work of [brown2024fully] lowers Fortran into the standard dialects, and a transformation is provided which lowers these into the specific host and device dialects for the RISC-V accelerator. Some PCIe based RISC-V accelerators already provide an MLIR-based compiler stack, and so we can leverage their existing dialects in combination with their compilation pipeline to generate the resulting executables.
However, many of these accelerators either do not provide an MLIR stack or such a stack is immature. In such cases, as per Figure 1, we develop a backend for these accelerators which comprises host and device-side dialects that map one-to-one to the accelerator API. A lowering is then developed that converts the host-side dialect into the func dialect, calling runtime functions, and eventually into LLVM-IR. On the device side we develop a printer which accepts the device-specific dialects and standard MLIR dialects, such as memref for memory management, and prints out target code in a high-level language, commonly C or C++, which calls into the device’s API.
### 2.1 Tenstorrent example
Tenstorrent ship RISC-V PCIe accelerator cards that are built upon their Tensix technology. Each Tensix core comprises five RISC-V cores: one for inbound data movement, one for outbound data movement, and three that drive a 16384-wide vector unit. The Wormhole n300, for example, contains 128 Tensix cores. The decoupling of data movement from compute makes this a very interesting potential architecture for HPC, and indeed early experiments porting a scientific computing workload to the Grayskull delivered similar performance to a 24-core Xeon Platinum CPU whilst using around five times less energy [brown2024accelerating]. However, to run codes on this architecture programmers must learn a new architecture and significantly recast their applications.
We developed a Tenstorrent specific backend which comprises a host dialect and three device-side dialects, one for data movement, one for circular buffers between RISC-V cores, and one for compute. We also developed a printer that, from the device-side dialects, generates C++ that calls into the Metalium API.
```fortran
subroutine saxpy(a, x, y, n)
  …
  !$omp target teams distribute parallel &
  !$omp& do simd num_teams(2) simdlen(32)
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do
  !$omp end target teams distribute parallel do simd
end subroutine
```
Listing 1: Example Fortran code running Single-precision A times X Plus Y (SAXPY) on the Tenstorrent accelerator card (argument declarations omitted for brevity)
A question is how, in Figure 1, to lower from the standard dialects into the device-specific ones that map to the RISC-V accelerator. The programmer drives this via OpenMP target offload, and Listing 1 illustrates Single-precision A times X Plus Y (SAXPY) written in Fortran and offloaded to the Tenstorrent PCIe accelerator via OpenMP. HPC programmers are already familiar with OpenMP, both for threaded and GPU programming, so it is a natural choice. The code in Listing 1 will run the loop in parallel over two Tensix cores, due to the num_teams(2) clause, leveraging the SIMD capabilities of each Tensix core.
## 3 Conclusions
We have described offloading Fortran code to RISC-V based accelerators via Flang. OpenMP provides a clear abstraction which can be used to drive such an offloading, and MLIR is a powerful compiler technology for supporting these accelerators because it enables the sharing of compiler infrastructure between them.