## TPU-MLIR: A Compiler For TPU Using MLIR
Pengchao Hu, Man Lu, Lei Wang, Guoyue Jiang
{pengchao.hu, man.lu, lei.wang, guoyue.jiang}@sophgo.com
Sophgo Inc.
## Abstract
Multi-level intermediate representations (MLIR) show great promise for reducing the cost of building domain-specific compilers by providing a reusable and extensible compiler infrastructure. This work presents TPU-MLIR, an end-to-end compiler based on MLIR that deploys pre-trained neural network (NN) models to a custom ASIC called a Tensor Processing Unit (TPU). TPU-MLIR defines two new dialects to implement its functionality: 1. a Tensor Operation (TOP) dialect that encodes the deep learning graph semantics and is independent of the deep learning framework, and 2. a TPU kernel dialect that provides standard kernel computation on TPU. A NN model is translated to the TOP dialect and then lowered to the TPU dialect for different TPUs according to the chip's configuration. We demonstrate how to use the MLIR pass pipeline to organize and perform optimization on TPU to generate machine code. The paper also presents a verification procedure to ensure the correctness of each transform stage.
## 1. Introduction
The development of deep learning (DL) has profoundly impacted various scientific fields, including speech recognition, computer vision, and natural language processing. To facilitate the process of training deep learning models, industry and academia have developed many frameworks, such as Caffe, Tensorflow, Pytorch, Mxnet, and PaddlePaddle, which boost deep learning in many areas. However, each framework has its own proprietary graph representation, which creates substantial deployment work, as many DL model formats must be supported.
At the same time, matrix multiplication and high-dimensional tensor convolution dominate the computation in DL, which has motivated chip architects to design customized DL accelerators that achieve high performance at low energy. Although GPUs remain the leading hardware for training DL models and all the
DL frameworks have contributed much work to support this general-purpose hardware, GPUs are not ideal for DL inference: they are built for gaming, graphics rendering, scientific computation, and much more, not tailored for DL alone. Thus, many DL accelerators, such as Google TPU, Apple Bonic, Graphcore IPU, and SOPHGO TPU, are more energy efficient than GPUs and benefit many of these emerging DL applications.
In addition, the DL community has resorted to domain-specific compilers for rescue to address the drawback of DL libraries and alleviate the burden of manually optimizing the DL models on each DL hardware. The DL compilers take the model described in the DL frameworks as inputs and generate efficient code for various DL hardware as outputs. The transformation between a model definition and specific code implementation is highly optimized, considering the model specification and hardware architecture. Several popular DL compilers, such as TVM, Tensor Comprehension, and XLA, have been proposed by industry and academia. Specifically, they incorporate DL-oriented optimizations such as layer and operator fusion, which enables highly efficient code generation.
Herein, we provide TPU-MLIR, an open-source DL compiler for TPU. In particular, we chose Open Neural Network Exchange (ONNX)[1] as the DL format for our compiler's input models and use Multi-Level Intermediate Representation (MLIR) [7], a modern open-source compiler infrastructure for multi-level intermediate representation, to design the TPU-MLIR 1 compiler.
In this work, we will introduce our compiler by
- presenting the overall design and architecture of the compiler,
- introducing two new dialects: TOP dialect to encode the deep learning graph semantics independent of the deep learning framework and TPU dialect to provide a common lowering point for all TOP dialect operations but device-dependent,
1 https://github.com/sophgo/tpu-mlir
- detailing each compile stage, such as converting NN models to the device-independent TOP dialect and then converting TOP to TPU for various chips and types,
- defining WeightOp for weight operation and store weight data in the NumPy npz file, and
- providing InferenceInterface for TOP and TPU to ensure correct conversions.
We organize the remainder of the paper as follows. In Sec. 2, we briefly discuss MLIR and ONNX, on which our compiler is based, and the calibration processing, which tailors computation for TPU. In Sec. 3, we introduce our compiler's design principles and architecture and discuss the TOP and TPU dialects. We also discuss using inference to ensure correctness in each conversion stage. Finally, we conclude our paper and discuss future work in Sec. 4.
## 2. Background
## 2.1. MLIR
MLIR, which is highly reusable and extensible, is a novel approach for constructing new domain-specific compilers. An open ecosystem is its most significant difference from LLVM. MLIR standardizes Static Single Assignment (SSA)-based IR data structures, allowing one to express a range of concepts as first-class operations. Operations can represent many different levels of abstraction and computation, from dataflow graphs to target-specific instructions and even hardware circuitry. They take and produce zero or more values, called operands and results, respectively. A value represents data at runtime and is associated with a type known at compile-time, whereas types model compile-time information about values. Complementary to this, attributes attach compile-time information to operations. Operations, attributes, and type systems are open and extensible. Custom types, operations, and attributes are logically grouped into dialects. A dialect is one of the most fundamental aspects of MLIR that enables the infrastructure to implement a stack of reusable abstractions. Each abstraction encodes and preserves transformation validity preconditions directly in its IR, reducing the complexity and cost of analysis passes. The MLIR IR has a recursive structure where operations contain a list of regions, regions contain a list of blocks, and blocks, in turn, contain a list of operations.
In particular, MLIR features operation, attribute and type interfaces providing a generic way of interacting with the IR. Interfaces allow transformations and analyses to work with abstract properties rather than fixed lists of supported concepts. Interfaces can be implemented separately from operations and mixed in using MLIR's registration mechanism, thus fully separating IR concepts from transformations. Furthermore, transformations can be written as compositions of orthogonal localized 'match and rewrite' primitives. These are often decomposed further into rewriting rules when applied within a dialect and lowering rules when converting from a higher-level dialect to a lower-level dialect. Throughout the compilation, separate dialects can co-exist to form a hybrid program representation. The ability to progressively lower dialects to the target hardware during the compilation process has made MLIR an excellent compiler infrastructure for domainspecific languages.
This article relies on several MLIR dialects and types, briefly described below.
## 2.1.1 Ranked Tensor Type
Values with tensor type represent aggregate N-dimensional homogeneous data indicated by an element type and a fixed rank with a list of dimensions 2 . Each dimension can be a static non-negative integer constant or be dynamically determined (marked by ? ).
This abstracted runtime representation carries both the tensor data values and information about the tensor shape, but the compiler has not decided on its representation in memory. Tensor values are immutable and subject to def-use SSA semantics[9]. Operations on tensors are often free of side effects, and operations always create new tensors with a value. The textual format of the tensor type is tensor<d1xd2x...xdNxdtype>, where d1, d2, ..., dN are integers or the symbol ? representing the dimensions of a tensor, and dtype is the type of the elements in a tensor, e.g., f32 for float32. A tensor can be unranked when its shape is unknown. MLIR uses tensor<*xdtype> to represent unranked tensor types.
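For illustration, a small helper (our own, not part of MLIR) can render this textual format from a shape list, using None to mark a dynamic dimension:

```python
def tensor_type(shape, dtype="f32"):
    """Render an MLIR ranked tensor type string; None marks a dynamic dim (?)."""
    dims = ["?" if d is None else str(d) for d in shape]
    return "tensor<" + "x".join(dims + [dtype]) + ">"

# tensor_type([1, 16, 100, 100])  -> "tensor<1x16x100x100xf32>"
# tensor_type([None, 256], "f16") -> "tensor<?x256xf16>"
```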
## 2.1.2 Quantization Dialect
Quantization dialect 3 provides a family of quantized types and type-conversion operations. 'Quantization' here refers to the conversion of floating-point computations to corresponding variants expressed in integer math for inference, as supported by low-bit-depth inference engines such as various accelerator hardware and many DSPs. There are three types defined in the quantization dialect: UniformQuantizedType, UniformQuantizedPerAxisType, and CalibratedQuantizedType.

2 https://mlir.llvm.org/docs/Dialects/Builtin/#rankedtensortype

3 https://mlir.llvm.org/docs/Dialects/QuantDialect

UniformQuantizedType and UniformQuantizedPerAxisType represent the mapping between expressed values (e.g., a floating-point computer type) and storage values (typically of an integral computer type), expressing the affine transformation from uniformly spaced points to the real number line. The relationship is realValue = scale × (quantizedValue − zeroPoint) and will be discussed in more detail in Section 2.3. CalibratedQuantizedType holds the range given by the min and max values of the histogram data of a tensor and is used to record the tensor's statistics. UniformQuantizedPerAxisType applies the affine transformation individually to each index along a specific axis of a tensor type, whereas UniformQuantizedType applies the affine transformation to every value within the target type. The quantization dialect also defines three operations for converting between types based on a QuantizedType and its expressed and storage sub-types: quant.qcast converts from an expressed type to a QuantizedType, quant.dcast converts from a QuantizedType to its expressed type, and quant.scast converts between a QuantizedType and its storage type.
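The affine mapping realValue = scale × (quantizedValue − zeroPoint) can be sketched in NumPy for both the per-tensor and per-axis cases; the helper name dequantize and its signature are our own, not part of the quantization dialect:

```python
import numpy as np

def dequantize(q, scale, zero_point, axis=None):
    """realValue = scale * (quantizedValue - zeroPoint).

    axis=None models UniformQuantizedType (one scale/zp for the whole tensor);
    an integer axis models UniformQuantizedPerAxisType (one scale/zp per slice).
    """
    q = q.astype(np.float32)
    if axis is None:
        return scale * (q - zero_point)
    # broadcast per-axis parameters along the chosen axis
    shape = [1] * q.ndim
    shape[axis] = -1
    s = np.asarray(scale, np.float32).reshape(shape)
    zp = np.asarray(zero_point, np.float32).reshape(shape)
    return s * (q - zp)
```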
## 2.2. ONNX
ONNX is an open-source, framework-independent format widely used for exchanging computation graph models, covering both deep learning and traditional machine learning. It was accepted as a graduated project in the Linux Foundation AI and is maintained by open-source communities. ONNX defines an extensible computation graph model, operators, and standard data types for deep learning, and it provides specifications for converting a model into the base ONNX format and for recovering the model from that form. It is an ideal tool for framework interoperability, especially when deploying a model to specific hardware[5].
ONNX reduces the friction of moving trained DL models among AI frameworks and platforms. ONNX uses the Protocol Buffers language for its syntax and provides rich documents and tools to formalize each operation's semantics and verify its correctness.
## 2.3. Quantization
Quantization is a promising technique to reduce deep learning models' memory footprint, inference latency, and power consumption by replacing high-cost floating-point (typically F32) computation with low-cost fixed-point numbers[4] (e.g., INT8/INT16) or lower-precision floating-point (e.g., BF16/F16). Because most current DL models are heavily over-parameterized and robust to extreme discretization, there is much opportunity for reducing numerical precision without impacting the model's accuracy, which yields an ample search space for tuning. Although many quantization methods have emerged, there is not a single well-posed or well-conditioned problem being solved[3]. Instead, one is interested in some error metric (based on classification quality, data similarity, etc.) to guide the quantization process. However, due to the over-parameterization, it is possible to have a high error between a quantized and the original model while still attaining excellent generalization performance. Finally, different layers in a neural net have a different impact on the loss function, which motivates a mixed-precision approach to quantization.
## 2.3.1 Uniform Quantization
The quantization process is a function mapping real values r to integer values. A quantization function of the form

$$\mathrm{quant}(r) = \mathrm{round}\left(\frac{r}{s}\right) + zp \qquad (1)$$

where quant is the quantization operator, r is a real-valued input (activation or weight), s is a floating-point scaling factor, and zp is an integer zero point, is known as uniform quantization, as the resulting quantized values are evenly spaced.
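The uniform quantization function can be sketched as follows; the saturation to the b-bit integer range is our assumption, since the formula itself does not specify clamping:

```python
def quantize(r, s, zp, bits=8, signed=True):
    """quant(r) = round(r / s) + zp, clamped to the b-bit integer range (assumed)."""
    lo, hi = (-(1 << (bits - 1)), (1 << (bits - 1)) - 1) if signed else (0, (1 << bits) - 1)
    q = round(r / s) + zp
    return max(lo, min(hi, q))

# quantize(4.0, 0.5, 2) -> 10; out-of-range inputs saturate at -128/127
```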
## 2.3.2 Symmetric and Asymmetric Quantization
A crucial factor in uniform quantization is choosing the scaling factor s in Equation 1. This scaling factor, also known as the resolution, divides a given range of real values r into several partitions: $s = \frac{\beta - \alpha}{2^b - 1}$, where [α, β] denotes the clipping range with which we clip the real values, and b is the quantization bit width[4][6]. Therefore, one should determine the clipping range [α, β] before generating the scaling factor. If α = −β, we get symmetric quantization; otherwise, we get asymmetric quantization. The asymmetric method often yields a tighter clipping range than symmetric quantization, which is especially important when the dynamic range of the tensor is imbalanced, e.g., the result of ReLU always has non-negative values.
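A small sketch of deriving (s, zp) from a clipping range under both schemes; the zero-point convention here (mapping α onto the most negative b-bit code) is an assumption for illustration, not taken from the paper:

```python
def scale_and_zp(alpha, beta, bits=8, symmetric=True):
    """Derive scale s = (beta - alpha) / (2**b - 1) and a zero point
    from the clipping range [alpha, beta]."""
    if symmetric:
        beta = max(abs(alpha), abs(beta))  # force alpha = -beta
        alpha = -beta
    s = (beta - alpha) / (2 ** bits - 1)
    # assumed convention: alpha maps to the most negative b-bit code
    zp = 0 if symmetric else -(1 << (bits - 1)) - round(alpha / s)
    return s, zp
```

With a ReLU-like range [0, 2.55] and 8 bits, the asymmetric scheme uses the whole INT8 range (zp = −128), whereas the symmetric scheme would waste the negative half.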
## 2.3.3 Calibration
The process of choosing the clipping range is called 'calibration.' One popular method is to run a series of inferences on some sample data and collect the distribution of each tensor in the graph. Using the min/max of the signal for both symmetric and asymmetric quantization is typical in most
Figure 1: Architecture of TPU-MLIR.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Flowchart: Model Conversion and Inference Pipeline
### Overview
The flowchart illustrates a multi-stage pipeline for converting and optimizing machine learning models across frameworks (e.g., PyTorch, TensorFlow) and hardware (TPU). It highlights key steps in model transformation, optimization passes, and inference result comparisons between ONNX and TPU execution.
### Components/Axes
- **Top Section (NN Framework)**:
- Input: `sample.onnx` (ONNX model format).
- Frameworks: PyTorch, TensorFlow, and others.
- Key Steps:
1. **OnnxConverter**: Converts `sample.onnx` to `origin.mlir`.
2. **canonicalize**: Transforms `origin.mlir` to `canonical.mlir`.
3. **calibration pass**: Generates `cali.mlir` for quantization.
4. **lowering**: Reduces precision (F32/BF16/F16 → int8).
5. **Inference**: Executes `canonical.mlir` for ONNX Results.
- **Bottom Section (Tpu)**:
- Key Steps:
1. **lowering int8**: Further optimizes for TPU.
2. **Conversion**: Transforms `tpu.mlir` to `lg.mlir` (layer group pass).
3. **mem assign pass**: Generates `addr.mlir` for memory allocation.
4. **codegen pass**: Produces `sample.model` for PyRuntime.
5. **Inference**: Executes `tpu.mlir` for TPU Results.
- **Comparison**: "VS" arrows link ONNX Results and TPU Results to Top Results.
### Detailed Analysis
- **Flow Direction**:
- Top-to-bottom flow from framework-specific models (`sample.onnx`) to optimized TPU models (`sample.model`).
- Parallel paths for ONNX and TPU inference, converging at Top Results.
- **Critical Passes**:
- **canonicalize**: Ensures model compatibility across frameworks.
- **calibration pass**: Prepares for quantization (e.g., int8).
- **layer group pass (lg.mlir)**: Optimizes TPU-specific layer groupings.
- **mem assign pass**: Allocates memory efficiently for TPU execution.
- **Outputs**:
- **ONNX Results**: Inference output from `canonical.mlir`.
- **TPU Results**: Inference output from `tpu.mlir`.
- **Chip Results**: Final output from PyRuntime execution of `sample.model`.
### Key Observations
1. **Branching Logic**: After `canonicalize`, the pipeline splits into calibration (for ONNX) and lowering (for TPU).
2. **Precision Reduction**: Explicit lowering steps (F32 → BF16 → F16 → int8) indicate quantization for efficiency.
3. **Hardware-Specific Optimization**: TPU-specific passes (layer grouping, memory assignment) highlight hardware-aware optimizations.
4. **Result Comparison**: "VS" arrows suggest benchmarking ONNX and TPU inference performance.
### Interpretation
This flowchart represents a model optimization workflow for deploying ML models on TPUs. The process begins with framework-agnostic ONNX models, which undergo canonicalization and quantization for cross-framework compatibility. For TPU deployment, additional passes (layer grouping, memory allocation) tailor the model to TPU architecture, improving inference speed and resource utilization. The comparison of ONNX and TPU results underscores the trade-offs between framework flexibility and hardware-specific optimization. The pipeline emphasizes precision reduction (e.g., int8) and memory efficiency, critical for edge or cloud deployment. The absence of numerical data suggests the focus is on architectural flow rather than performance metrics.
</details>
cases. However, this approach is susceptible to outlier data in the activations, which can unnecessarily increase the range and reduce quantization resolution. One approach to address this is to use a percentile, or to select α and β to minimize the KL divergence between the real and the quantized values[8][11]. Other metrics for finding the best range include minimizing the mean squared error (MSE)[2], entropy, and cosine similarity.
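A hedged sketch of the two simplest calibration strategies described above, min/max and percentile; the function name and signature are ours:

```python
import numpy as np

def calibrate(samples, method="minmax", percentile=99.99):
    """Pick a clipping range [alpha, beta] from observed activation samples.

    'minmax' takes the raw extremes; 'percentile' trims outliers first,
    which keeps one stray large activation from inflating the range.
    """
    x = np.concatenate([np.ravel(s) for s in samples])
    if method == "minmax":
        return float(x.min()), float(x.max())
    lo, hi = np.percentile(x, [100 - percentile, percentile])
    return float(lo), float(hi)
```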
## 3. Compiler
This section introduces the compiler, TPU-MLIR, which creates two layers by the TOP and TPU dialects for converting NN models to executable files by various types and chips. We discuss TPU-MLIR's overall architecture first.
## 3.1. Overview
Figure 1 shows the overall architecture of TPU-MLIR. We divide it into three parts: NN Framework, Top, and Tpu.
- 1) NN Framework : TPU-MLIR supports ONNX models directly. Models from other NN frameworks, such as Pytorch and Tensorflow, need to be converted to ONNX first.
- 2) TOP : refers to the TOP dialect, the top abstraction level representing NN models in the MLIR language. It is device-independent.
- 3) TPU : refers to the TPU dialect, the TPU abstraction level representing TPU operations. It is device-dependent.
We first convert an NN model to the TOP abstraction, expressed with the TOP dialect and the built-in dialect defined in MLIR (which we call the top mlir file), using a Python script, i.e., OnnxConverter in Figure 1. We then lower the top mlir file to the TPU abstraction, expressed with the TPU dialect and the built-in dialect (which we call tpu mlir), through passes such as the canonicalization pass and the calibration pass. Finally, we convert tpu mlir to TPU models through passes such as the layer group pass and the memory assign pass. These passes are discussed in later sections.
## 3.2. Module
We introduce our module definition with a simple mlir file shown in Listing 1:
The module has several attributes: module.name is related to the NN model name; module.weight_file is an npz 4 file that stores the weight data needed by operations. We use the location to express the operation name. For example, '%2 = 'top.Weight'()' (Line 6 in Listing 1) is a weight op whose location is 'filter_conv1', so the real weight data is stored in the 'conv2d_weight.npz' file under the name 'filter_conv1'.
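The weight-file mechanism can be mimicked with NumPy's npz format, keying each tensor by its location name; the tensor shape and names below are hypothetical placeholders, not taken from a real model:

```python
import io
import numpy as np

# Hypothetical weight keyed by its location name, as module.weight_file does.
buf = io.BytesIO()  # stands in for the .npz file on disk
np.savez(buf, filter_conv1=np.ones((32, 16, 3, 3), np.float32))
buf.seek(0)

weights = np.load(buf)
filt = weights["filter_conv1"]  # read back a tensor by its location name
```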
## 3.3. Top Dialect
The TOP dialect is very similar to the TOSA (Tensor Operator Set Architecture) 5 dialect in MLIR. So why don't we use the TOSA dialect? There are two reasons: first, we need to run inference for each operation and may add new features in the future; second, we need to retain the extensibility to support various NN models.
TOP Dialect is defined as below:
```
```
4 https://numpy.org/neps/nep-0001-npy-format.html
5 https://www.mlplatform.org/tosa
```
```
Listing 1: Simple convolution computation represented by TPU dialect.
```
```
A TOP Op has two interfaces: 'InferenceInterface' and 'FlopsInterface'. 'InferenceInterface' is used to run inference for an operation, as introduced later. 'FlopsInterface' is used to count the FLOPs (floating-point operations) of an operation; we are interested in the FLOPs of an NN model, and we also use it to evaluate chip performance after running on the chip.
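As an illustration of what a FlopsInterface implementation might count for a convolution, using the shapes from Figure 2; the 2-FLOPs-per-multiply-accumulate convention is our assumption:

```python
def conv2d_flops(out_shape, kernel_shape, in_channels, groups=1, with_bias=True):
    """FLOPs of a 2-D convolution: one multiply-accumulate per output
    element per kernel tap over the grouped input channels."""
    n, oc, oh, ow = out_shape
    kh, kw = kernel_shape
    macs = n * oc * oh * ow * kh * kw * (in_channels // groups)
    flops = 2 * macs  # count multiply and add separately (assumed convention)
    if with_bias:
        flops += n * oc * oh * ow  # one extra add per output element
    return flops

# Figure 2's conv: input 1x16x100x100, output 1x32x100x100, 3x3 kernel
# conv2d_flops((1, 32, 100, 100), (3, 3), 16) -> 92480000
```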
TOP operations are defined based on TOP BaseOp or TOP Op. Here we use ConvOp and WeightOp as examples.
## 3.3.1 top::ConvOp
ConvOp is defined as below:
```
```
```
```
ConvOp represents the convolution operation of NN models, as shown in Figure 2:
and is expressed in an mlir file as below:
```
```
## 3.3.2 top::WeightOp
WeightOp is a special operation for weight data, defined as below:
```
```
Figure 2: Convolution operation defined in ONNX.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Diagram: Convolutional Neural Network Layer (Conv)
### Overview
The image depicts a technical diagram of a convolutional neural network (CNN) layer, specifically a "Conv" layer. It includes a flowchart on the left and a structured table on the right. The flowchart illustrates the flow of data through the layer, while the table details the layer's attributes, inputs, and outputs.
### Components/Axes
#### Left Diagram (Flowchart):
- **Input**: Labeled as "input" with dimensions `1x16x100x100`.
- **Conv Layer**: Contains parameters:
- `W (32x16x3x3)` (weights)
- `B (32)` (bias)
- **Output**: Labeled as "output" with dimensions `1x32x100x100`.
- **Arrows**: Indicate the flow from input → Conv layer → output.
#### Right Table (Attributes/Inputs/Outputs):
- **Attributes**:
- `dilations`: `1, 1`
- `group`: `1`
- `kernel_shape`: `3, 3`
- `pads`: `1, 1, 1, 1`
- `strides`: `1, 1`
- **Inputs**:
- `X` (name: "input")
- `W` (name: "weight")
- `B` (name: "bias")
- **Outputs**:
- `Y` (name: "output")
### Detailed Analysis
- **Input Dimensions**: The input tensor has a shape of `1x16x100x100`, indicating a batch of one with 16 channels and a height and width of 100.
- **Conv Layer Parameters**:
- **Kernel Shape**: `3x3` (3x3 spatial kernel).
- **Pads**: `1, 1, 1, 1` (padding added to all sides of the input).
- **Strides**: `1, 1` (step size for sliding the kernel).
- **Dilations**: `1, 1` (no dilation, standard convolution).
- **Group**: `1` (single group, no grouped convolution).
- **Output Dimensions**: The output tensor has a shape of `1x32x100x100`, indicating 32 output channels (feature maps) with the same spatial dimensions as the input.
- **Inputs/Outputs**:
- `X` (input) is the original data.
- `W` (weights) and `B` (bias) are learnable parameters.
- `Y` (output) is the result of the convolution operation.
### Key Observations
- The Conv layer preserves spatial dimensions (`100x100`) due to padding (`1` on all sides) and stride (`1`).
- The number of output channels (`32`) is determined by the weight tensor `W` (32x16x3x3), where the first dimension (32) corresponds to the number of filters.
- The `kernel_shape` of `3x3` suggests a standard convolution for edge detection or feature extraction.
- The `group=1` indicates no grouped convolution, meaning all input channels are connected to all output channels.
### Interpretation
This diagram illustrates a standard 2D convolutional layer in a CNN. The input is processed with a `3x3` kernel, padded to maintain spatial dimensions, and transformed into 32 feature maps. The parameters (`dilations`, `group`, `kernel_shape`, `pads`, `strides`) define how the kernel interacts with the input. The output retains the same spatial resolution as the input but increases the number of channels, enabling the network to learn hierarchical features. The table on the right provides a structured summary of the layer's configuration, critical for implementation or debugging.
No numerical trends or anomalies are present, as this is a static diagram of a layer configuration rather than a dynamic dataset.
</details>
WeightOp corresponds to the weight operation. Weight data is stored in 'module.weight_file'; a WeightOp can read data from the weight file via its read method, and a new WeightOp can be created via its create method.
## 3.4. TPU Dialect
TPU Dialect is defined as below:
```
```
The TPU dialect targets TPU chips; we support SOPHGO AI chips first. It is used to generate chip command instruction sequences from tpu operations.
In the TPU dialect, TPU BaseOp and TPU Op are defined as:
```
```
```
```
A TPU Op has two interfaces: 'GlobalGenInterface' and 'InferenceInterface'. 'GlobalGenInterface' is used to generate chip commands; 'InferenceInterface' is used to run inference for tpu operations.
TPU operations are defined based on TPU BaseOp or TPU Op. Here we use tpu::Conv2DOp, tpu::CastOp, and tpu::GroupOp as examples.
## 3.4.1 tpu::Conv2DOp
Conv2DOp is defined as below:
```
```
Compared to top::ConvOp, tpu::Conv2DOp has some new attributes: 'multiplier', 'rshift', and 'group_info'. 'multiplier' and 'rshift' are used for INT8 convolution after quantization and are unused if the convolution is float. 'group_info' is used for the layer group, which we discuss later.
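A sketch of how 'multiplier' and 'rshift' could realize the INT8 rescaling in pure integer arithmetic; the rounding and saturation details are assumptions about typical fixed-point requantization, not the chip's documented behavior:

```python
def requantize(acc, multiplier, rshift):
    """Scale an INT32 accumulator back to INT8: integer multiply,
    rounding right shift, then saturate (assumed behavior)."""
    v = acc * multiplier
    if rshift > 0:
        v = (v + (1 << (rshift - 1))) >> rshift  # round-to-nearest shift
    return max(-128, min(127, v))

# multiplier=131, rshift=17 approximates a float scale of 131 / 2**17
```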
## 3.4.2 tpu::CastOp
CastOp is defined as below:
```
```
```
```
Listing 2: tpu.Cast operation converting a quant.calibrated type to a quant.uniform type.
```
```
Listing 3: tpu.Cast operation converting a quant.uniform type to the float32 type.
CastOp transfers a tensor from one type to another. It can convert the F32 type to the BF16[10], F16, or INT8 type, and the other way around.
In particular, if the input is the F32 type and the output is a quantized type, as in Listing 2, then

$$quantizedValue = \mathrm{round}\left(\frac{realValue}{scale}\right) + zeroPoint.$$

If the input is a quantized type and the output is the F32 type, as in Listing 3, then

$$realValue = scale \times (quantizedValue - zeroPoint).$$
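The two CastOp conversions can be sketched with NumPy; the helper names are ours, and the saturation to the INT8 range is an added assumption:

```python
import numpy as np

def cast_to_int8(x, scale, zero_point):
    """Float -> quantized: q = round(x / scale) + zeroPoint, saturated to INT8."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def cast_to_f32(q, scale, zero_point):
    """Quantized -> float: x = scale * (q - zeroPoint)."""
    return scale * (q.astype(np.float32) - zero_point)
```

A round trip loses at most half a quantization step, i.e., scale / 2.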
## 3.4.3 tpu::GroupOp
GroupOp is defined as below:
```
```
GroupOp contains a series of operations that can run inference in TPU local memory. We discuss it later.
## 3.5. Conversion
In this section, we discuss how to convert top ops to tpu ops.
We define 'ConvertTopToTpu' pass like this:
```
```
There are three options: mode, chip, and isAsymmetric.
- 1) mode : sets the quantization mode, e.g., INT8, BF16, F16, or F32. The type should be supported by the chip.
- 2) chip : sets the chip name. TPU operations will act according to this chip.
- 3) isAsymmetric : if mode is INT8, set true for asymmetric quantization and false for symmetric quantization.
Typically, most attributes stay the same when converting from TOP ops to TPU ops at a float type (F32/BF16/F16). However, if the conversion is from a float type to INT8, we perform PTQ (post-training quantization)[4] at the TOP dialect and add some extra quantization attributes to the TPU ops. At the same time, weight data, inputs, and outputs are quantized to INT8. In addition, the inputs and outputs
Figure 3: The neural network model conversion flow of TPU-MLIR.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Diagram Type: Model Transformation and Similarity Metrics Flowchart
### Overview
The image presents a technical diagram illustrating the transformation of machine learning models through different precision levels (F32, Int8) and the mathematical definitions of similarity metrics used in these models. It combines flowchart elements with code snippets and mathematical operations.
### Components/Axes
#### Flowchart Components
1. **Origin Model**
- Input (Float)
- Network (Float)
- Output (Float)
2. **Top Mlir**
- Input (Float)
- Network (Float)
- Output (Float)
3. **F32 Tpu Mlir**
- Input (Float)
- Network (Float)
- Output (Float)
4. **Int8 Tpu Mlir**
- Input (Float)
- Network (Int8)
- Output (Float)
#### Code Sections
1. **Cosine Similarity Function**
```python
def cosine_similarity(self, x, y):
numerator = sum(a*b for a, b in zip(x,y))
denominator = self.square_root(x)*self.square_root(y)
return round(numerator/float(denominator),3)
```
2. **Euclidean Similarity Function**
```python
def euclidean_similarity(x, y):
ed = euclidean_distance(x, y)
sr = square_root((x+y)/2)
return 1 - ed/sr
```
#### Mathematical Definitions
- **Square Root Function**
```python
def square_rooted(self, x):
return sqrt(sum([a*a for a in x]))
```
- **Cosine Similarity Formula**
```python
cosine_similarity = (sum(a*b for a,b in zip(x,y))) / (sqrt(sum(a**2 for a in x)) * sqrt(sum(b**2 for b in y)))
```
- **Euclidean Similarity Definition**
```python
euclidean_similarity = 1 - (euclidean_distance(x, y) / square_root((x+y)/2))
```
### Detailed Analysis
#### Model Transformation Flow
1. **Origin Model → Top Mlir**
- Direct transformation via "To Top Mlir" arrow
- Maintains Float precision throughout
2. **Top Mlir → F32 Tpu Mlir**
- Explicit "To F32 Mlir" conversion
- Preserves Float precision
3. **Top Mlir → Int8 Tpu Mlir**
- Requires "CastOp (Int8)" operation
- Reduces precision from Float to Int8
- Maintains Float output
#### Code Snippets
- **Square Root Function**
- Computes L2 norm via element-wise squaring and summation
- Returns `sqrt(sum([a*a for a in x]))`
- **Cosine Similarity**
- Calculates dot product divided by product of magnitudes
- Rounds result to 3 decimal places
- **Euclidean Similarity**
- Uses Euclidean distance normalized by square root of average
- Returns similarity score between 0 and 1
### Key Observations
1. **Precision Optimization**
- Model transformations show deliberate precision reduction (Float → Int8)
- Suggests hardware efficiency optimization for TPU inference
2. **Mathematical Consistency**
- All similarity metrics use standardized mathematical operations
- Rounding to 3 decimals implies precision requirements for model stability
3. **Hardware-Specific Adaptation**
- Int8 Tpu Mlir introduces quantization (CastOp) while maintaining Float outputs
- Indicates post-quantization calibration steps
### Interpretation
The diagram reveals a systematic approach to model optimization for edge deployment:
1. **Quantization Strategy**
- Maintains Float precision in Top Mlir stage
- Applies 8-bit quantization (Int8) only in final Tpu Mlir stage
- Suggests staged optimization balancing accuracy and efficiency
2. **Similarity Metric Implementation**
- Cosine similarity uses explicit rounding, critical for stable comparisons
- Euclidean similarity combines distance calculation with normalization
- Both metrics form the foundation for model decision-making processes
3. **Hardware-Software Co-Design**
- Model architecture (F32/Int8) directly maps to TPU capabilities
- Mathematical operations align with hardware acceleration requirements
- CastOp operations indicate explicit precision management for hardware compatibility
The absence of numerical data points suggests this is a conceptual architecture diagram rather than empirical results. The emphasis on precision management and standardized mathematical operations indicates a focus on reproducible, hardware-optimized model deployment.
</details>
of an NN model need CastOps inserted if the conversion type is not F32. The conversion flow is shown in Figure 3.
## 3.6. Inference
This section discusses why we need inferences and how to support inferences for TOP and TPU dialects.
## 3.6.1 Why
Running inference with the TOP dialect produces inference results, which have three uses.
- 1) They can be compared with the original model results to make sure the NN model converts to the TOP dialect correctly.
- 2) They can be used for calibration, which runs a few sampled inputs through the top mlir file and collects every intermediate result to compute the proper min/max thresholds used by quantization.
- 3) They can be compared with the inference results of the TPU dialect to ensure tpu mlir is correct.
The TPU dialect also runs inference, and its results are compared with the top mlir results. If tpu mlir is in F32 mode, the results should be identical. If tpu mlir is in BF16/F16 mode, the TPU results may have some loss but should still show good cosine (> 0.95) and Euclidean (> 0.85) similarity. If tpu mlir is in INT8 mode, based on experience, the cosine similarity should be greater than 0.9 and the Euclidean similarity greater than 0.5. If these similarities are not satisfied, the correctness of the conversion from top to tpu is not guaranteed.
Cosine similarity is defined as below:

$$\mathrm{cosine}(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$
At last, after compilation, the model can be deployed on the tpu device and its results checked against the tpu mlir results to ensure codegen is correct. If the results are not similar, there is a bug in codegen.
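These similarity checks can be sketched in Python. The cosine formula is standard; the exact Euclidean-similarity normalization used by TPU-MLIR is not spelled out here, so the form below (1 - ||a - b|| / ||b||) is an assumption.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two flattened tensors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_similarity(a, b):
    # Assumed normalization: 1 - ||a - b|| / ||b||, with b the reference.
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dist / nb

def int8_ok(out, ref):
    # INT8-mode thresholds from the text: cosine > 0.9, Euclidean > 0.5.
    return cosine_similarity(out, ref) > 0.9 and euclidean_similarity(out, ref) > 0.5
```

For F32 mode the comparison degenerates to exact equality; BF16/F16 and INT8 modes only need to clear the looser thresholds above.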
## 3.6.2 How
NN models run on their own runtimes; for example, ONNX models run on ONNX Runtime. The TOP and TPU dialects run inference through 'InferenceInterface', which is defined as below:
```
// Sketch reconstructed from the description in this section; the exact
// TPU-MLIR declaration may differ.
struct InferenceParameter {
  std::vector<float *> inputs;   // input buffers of the operation
  std::vector<float *> outputs;  // output buffers of the operation
  void *handle = nullptr;        // optional handle to a third-party engine
};

// Interface implemented by TOP and TPU operations:
//   init/deinit: set up / release the third-party engine handle, if any
//   inference:   compute 'outputs' from 'inputs'
LogicalResult init(InferenceParameter &p);
LogicalResult inference(InferenceParameter &p);
void deinit(InferenceParameter &p);
```
Figure 4: Buffer allocation in TPU-MLIR.
'inputs' and 'outputs' in 'InferenceParameter' point to the input and output buffers of the operation. All buffers that the tensors need are allocated after the mlir file is loaded. Each buffer size is calculated from the Value's type. For example, the tensor<1x32x100x100xf32> needs 1 × 32 × 100 × 100 × sizeof(float) = 1280000 bytes. Figure 4 shows an example.
Weights are allocated and loaded first, and then activations are allocated. Before inference, the model inputs are loaded into the input buffers; then inference runs, and afterwards the results are stored in the corresponding activation buffers.
'handle' in 'InferenceParameter' optionally points to a third-party execution engine.
'InferenceInterface' has three functions: 'init', 'inference', and 'deinit'. 'init' and 'deinit' initialize and release the handle of the third-party engine if one is needed, and otherwise do nothing. 'inference' runs inference on the 'inputs' in 'InferenceParameter' and stores the results in its 'outputs'.
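The buffer-size calculation above can be sketched as follows; the type-string parsing and the byte-width table are simplified assumptions for illustration.

```python
# Byte widths of a few element types (simplified assumption).
DTYPE_BYTES = {"f32": 4, "f16": 2, "bf16": 2, "i8": 1}

def buffer_bytes(type_str):
    # Compute the buffer size for a type like "tensor<1x32x100x100xf32>".
    inner = type_str[type_str.index("<") + 1 : type_str.rindex(">")]
    *dims, dtype = inner.split("x")
    size = DTYPE_BYTES[dtype]
    for d in dims:
        size *= int(d)
    return size
```

For tensor<1x32x100x100xf32> this yields 1 × 32 × 100 × 100 × 4 = 1280000 bytes, matching the example above.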
## 3.7. Layer Group
A layer group in TPU-MLIR is a set of layers composed into one group that executes on the TPU chip. A layer here is the same thing as an operation in MLIR. Typically, the RAM on the chip is tiny, such as 256KB, while the off-chip DDR is very large, such as 4GB. We need layers to run on the chip successively to achieve high performance, but the on-chip RAM is too small to support it. So we slice layers into small pieces to make sure the layers in a group can run successively. Usually, we slice layers along the N or H dimension. Figure 5 shows an example.
In mlir, we define group attributes for tpu operations:
Figure 5: Slicing the H dimension in a layer group.
```
```
TPUs with different architectures may have different attributes; those in 'LayerGroup' are examples:
- 1) out addr: output address in on-chip RAM
- 2) out size: output memory size in on-chip RAM
- 3) buffer addr: buffer address for the operation in on-chip RAM
- 4) buffer size: buffer size in on-chip RAM
- 5) eu align: whether the data arranged in on-chip RAM is aligned
- 6) h idx: offset position in the h dimension after it has been sliced
- 7) h slice: size of each piece after slicing
- 8) n idx, n slice: the same for the n dimension
Listing 4 shows an MLIR file with groups (to keep it simple, unrelated information has been removed from the file). Layers in a group execute on the chip successively, and the DMA loads data from off-chip DDR to on-chip RAM and stores results back to DDR at the boundary of each group.
```
```
Listing 4: MLIR file with layer groups.
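To illustrate the h idx/h slice attributes, here is a minimal sketch of how the H dimension might be partitioned into contiguous pieces; the actual slicing heuristics in TPU-MLIR (which must also respect per-layer buffer constraints) are more involved.

```python
def slice_h(h, num_slices):
    # Split dimension h into num_slices contiguous pieces and return
    # (h_idx, h_slice) pairs: the offset and size of each piece.
    base, rem = divmod(h, num_slices)
    pairs, idx = [], 0
    for i in range(num_slices):
        size = base + (1 if i < rem else 0)  # spread the remainder evenly
        pairs.append((idx, size))
        idx += size
    return pairs
```

For example, slicing H = 40 into two pieces gives [(0, 20), (20, 20)].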
## 3.8. Workflow
In this section we discuss the workflow of TPU-MLIR, especially the main passes.
- 1) OnnxConverter: use the Python interface to convert ONNX NN models to TOP dialect mlir.
- 2) Canonicalize for TOP: do graph optimization on top operations. For example, we fuse top::ReluOp into top::ConvOp, and we replace the BatchNorm operation with a depthwise convolution.
- 3) Calibration for TOP: use a few sampled inputs to run inference through the top mlir file and collect every intermediate result in order to compute proper min/max thresholds. We use quant::CalibratedQuantizedType to express this calibration information. For example, if a value has type tensor<1x16x100x100xf32> and its calibration information is min = -4.178, max = 4.493, threshold = 4.30, then the new type would be tensor<1x16x100x100x!quant.calibrated<f32<-4.178:4.493>>> for asymmetric quantization and tensor<1x16x100x100x!quant.calibrated<f32<-4.30:4.30>>> for symmetric quantization. Calibration is done only for int8 quantization; there is no need for it in float conversion.
- 4) Conversion for TOP: convert top operations to tpu operations, as discussed above.
- 5) Layer group for TPU: determine groups of operations that execute successively in on-chip RAM, as discussed above.
- 6) Memory assign for TPU: after the TPU operations are ready, all operations outside a group need memory assigned in DDR, in particular a physical address. We store the physical address in the tensor type, such as 4295618560 in tensor<1x32x100x100xf32, 4295618560:i64>. We do not discuss how to assign memory optimally here.
- 7) Codegen for TPU: each TPU operation has a codegen interface for different chips, and the corresponding TPU commands are packaged in one kernel API. Codegen therefore just calls these APIs for each tpu operation and collects the commands into one model.
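To illustrate the calibration step above, a minimal sketch: collect the per-tensor min/max over the sampled inputs, use (min, max) for asymmetric quantization, and derive a symmetric threshold. Here the threshold is simply max(|min|, |max|), which is an assumption; TPU-MLIR evidently uses a more elaborate method, since the example's threshold 4.30 is below max(|min|, |max|) = 4.493.

```python
def calibrate(samples):
    # samples: list of flattened tensors observed for one Value while
    # running inference over the sampled inputs.
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    threshold = max(abs(lo), abs(hi))  # simple symmetric choice (assumption)
    return lo, hi, threshold
```

The (lo, hi) pair maps to the asymmetric !quant.calibrated type, and (-threshold, threshold) to the symmetric one.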
## 4. Conclusion
We are developing TPU-MLIR to compile NN models for TPU. We design the TOP and TPU dialects to be device-independent and device-dependent, respectively. NN models are first converted to the device-independent TOP dialect, and TOP is then converted to TPU for the various chips and data types. We define WeightOp for weight operations and store weight data in NumPy npz files. We design 'InferenceInterface' for top and tpu to ensure correct conversions. In the future, we will support more TPU chips and NN models from various NN frameworks.