## End-to-End DNN Inference on a Massively Parallel Analog In-Memory Computing Architecture
Nazareno Bruschi, Giuseppe Tagliavini, Angelo Garofalo, Francesco Conti, Irem Boybat, Luca Benini, Davide Rossi

University of Bologna, Bologna, Italy; ETH Zurich, Zurich, Switzerland; IBM Research, Zurich, Switzerland
Abstract—The demand for computation resources and energy efficiency of Convolutional Neural Network (CNN) applications requires a new paradigm to overcome the 'Memory Wall'. Analog In-Memory Computing (AIMC) is a promising paradigm, since it performs matrix-vector multiplications, the critical kernel of many ML applications, in place in the analog domain within memory arrays structured as crossbars of memory cells. However, several factors limit the full exploitation of this technology, including the physical fabrication of the crossbar devices, which constrains the memory capacity of a single array. Multi-AIMC architectures have been proposed to overcome this limitation, but they have been demonstrated only on tiny, custom CNNs or by performing some layers off-chip. In this work, we present the full inference of an end-to-end ResNet-18 DNN on a 512-cluster heterogeneous architecture coupling a mix of AIMC cores and digital RISC-V cores, achieving up to 20.2 TOPS. Moreover, we analyze the mapping of the network on the available non-volatile cells, compare it with state-of-the-art models, and derive guidelines for next-generation many-core architectures based on AIMC devices.
Index Terms—In-Memory Computing, heterogeneous systems, many-core architectures, Convolutional Neural Networks
## I. INTRODUCTION
Matrix-Vector Multiplication (MVM) is the critical operation in modern Deep Neural Networks (DNN), and its optimization has been tackled from different perspectives, from software kernels to hardware accelerators. In recent years, Analog In-Memory Computing (AIMC) has been a widely studied computing paradigm, since it promises outstanding performance and energy efficiency on MVM operations [1]. However, the large-scale usage of AIMC in commercial products is limited by technological issues, especially in fabricating large arrays. AIMC can be employed using very different memory technologies, which can be classified as volatile and non-volatile.
The former has a more mature community, especially for SRAM technology, due to its robustness and viability for large-scale integration in any CMOS node. Several SRAM-based chips have been developed targeting a wide range of DNN requirements [2]. However, they generally require moving network parameters from large off-chip memories to be temporarily stored in the on-chip computational memory, negatively impacting energy consumption [3].
Non-volatile AIMC (nvAIMC) instead merges parameter storage with computational memory. In this way, parameters do not need to be transferred from on- or off-chip storage through the memory hierarchy. However, the limited write access speed [4] of nvAIMC devices introduces the need for a static mapping strategy to preserve their performance capability. Moreover, a further challenge is the fabricable size of nvAIMC devices, which is de facto limited to 1024 × 1024 with up to 8-bit equivalent memory cells [4].
During the last few years, the fabrication of several prototypes exploiting these technologies [5]-[7] demonstrated the feasibility of the approach, despite several open challenges related to the intrinsic variability of analog computing and the need for specialized training to address analog noise and non-idealities. On the other hand, most of these works aimed at demonstrating the technology rather than targeting end-to-end inference of deep neural networks. One of the limitations of nvAIMC cores is their limited flexibility, since they can implement only MVMs.
For this reason, a few recent works [8]-[10] proposed the integration of nvAIMC cores into digital Systems-on-Chip (SoC), exploiting a mix of nvAIMC cores and more flexible specialized and programmable digital processors. Thanks to this mix, they demonstrated remarkable performance on the full inference of neural networks in the mobile domain, such as MobileNetV2, time-multiplexing computations on several nvAIMC cores [9], [10]. Indeed, in the mobile domain it is common that only one sample is processed at a time, relaxing the requirements of layer pipelining. This constraint significantly limits the potential of nvAIMC, since only one core can be active at a given time.
This constraint can be relaxed when leaving the mobile domain: high-performance inference of DNNs typically exploits batching due to the large number of images processed in HPC and data-center applications. Several recent works exploited this feature, proposing many-core data-flow architectures. On the other hand, most of these works made strong assumptions about the characteristics of the networks to be processed, to better fit the shape of the DNN on the proposed architectures. For example, Dazzi et al. [11] targeted a relatively small ResNet-like network for the CIFAR-10 dataset, while Shafiee et al. [12] and Ankit et al. [13] targeted VGG-like networks featuring no residual layers, which map nicely onto pipelined data-flow architectures. However, DNNs generally feature data-flow graph loops (e.g., residual layers) that make a straightforward pipelining implementation much more challenging. Moreover, most of these architectures only feature specialized accelerators for implementing digital functions such as ReLU and MaxPool, limiting the flexibility of their approach.
Fig. 1. A) Cluster architecture. B) Massively parallel system architecture. C) IMA subsystem. D) Router model.
<details>
<summary>Image 1 Details</summary>

Figure 1 shows: (A) the cluster, with the multi-banked L1 TCDM (banks B0-BN) and a crossbar interconnecting the event unit (EV), DMA, IMA, and RISC-V cores (RV) with a shared I-Cache; (B) the chip-level hierarchy, with L1/L2/L3 quadrants connected through a Wrapper and HBM controller to the off-chip HBM; (C) the IMA subsystem, with streamers, input/output buffers, DACs/ADCs, a control block, and the analog-domain AIMC core; (D) the router model, with master-to-slave connections between the L2 quadrants (L2_Q0-L2_Q3), L2, and L3, and separate write and read paths.

</details>
In this work, we tackle the problem from another perspective. We present a general-purpose system based on RISC-V cores for digital computations and nvAIMC cores for analog-amenable operations, such as 2D convolutions. A scalable hierarchical network-on-chip interconnects the system to maximize on-chip bandwidth and reduce communication latency. We evaluate all the system inefficiencies, especially those due to non-ideal mappings and communication-infrastructure bottlenecks, by running a real-life network for state-of-the-art applications: ResNet-18 inference on a 256 × 256 image dataset. We perform an experimental assessment on an extended version of an open-source system-level simulator [14], achieving up to 20.2 TOPS and 6.5 TOPS/W for the whole ResNet-18 inference of a batch of 16 256 × 256 images in 4.8 ms. The hardware and software described in this work are open-source, with the intention of supporting and boosting an innovation ecosystem for next-generation computing platforms.
## II. MASSIVELY PARALLEL HETEROGENEOUS SYSTEM ARCHITECTURE
This section presents the proposed heterogeneous many-core SoC architecture. It consists of multiple heterogeneous (analog/digital) clusters communicating through a hierarchical AXI interconnect gathering data from a shared High-Bandwidth Memory (HBM), as shown in Fig. 1B.
1) Cluster: The core of the proposed system architecture consists of heterogeneous analog/digital clusters (Fig. 1A). Each cluster contains a set of RISC-V cores (CORES) [15], a shared multi-bank scratchpad data memory (L1) enabling Single Program Multiple Data (SPMD) computations, a hardware synchronizer to accelerate common parallel programming primitives such as thread dispatching and barriers, and a DMA for cluster-to-cluster and cluster-to-HBM communication. Each cluster also includes an nvAIMC Accelerator (IMA) sharing the same multi-banked memory as the CORES for efficient communication, similar to the architecture presented in Garofalo et al. [9].
2) IMA: The IMA is built around a Phase-Change Memory (PCM) computational memory organized as a 2D array featuring horizontal word lines and vertical bit lines (Fig. 1C). In computational memory, the PCM cells are exploited as
TABLE I
## GVSOC ARCHITECTURE PARAMETERS
| Parameter | Value |
|---------------------------------------------|------------------------|
| Number of clusters | 512 |
| Number of IMA per cluster | 1 |
| Number of CORES per cluster | 16 |
| L1 memory size | 1 MB |
| HBM size | 1.5 GB |
| Operating frequency | 1 GHz |
| Number of streamers ports (read and write) | 16 |
| IMA crossbar size | 256 × 256 |
| Analog latency (MVM operation) | 130 ns |
| Quadrant factor (HBM link, wrapper, L3, L2, L1) | (1, 8, 4, 4, 4) |
| Data width (HBM link, wrapper, L3, L2, L1) | (64, 64, 64, 64, 64) bytes |
| Latency (HBM link, wrapper, L3, L2, L1) | (100, 4, 4, 4, 4) cycles |
programmable resistors placed at the cross-points between the word lines and the bit lines, which allows the implementation of MVM in the analog domain with high parallelism and efficiency. In this work, we assume an MVM to be executed in 130 ns, as reported in Khaddam et al. [7]. At the beginning of each word line and at the end of each bit line, Digital-to-Analog Converters (DAC) and Analog-to-Digital Converters (ADC) perform the conversion between the analog and digital domains, respectively. The ADCs and DACs connect to two digital buffers connected to the L1 memory through a set of streamers featuring programmable address generation.
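As a sketch of the datapath just described, the following toy model runs one MVM as DAC quantization, an ideal crossbar product, and ADC quantization. It is an illustrative, noiseless approximation, not the actual analog model: the 8-bit converter resolutions and the per-MVM ADC full scale are assumptions, while the 256 × 256 crossbar size follows Table I.

```python
import numpy as np

ROWS, COLS = 256, 256          # word lines x bit lines (Table I)
DAC_BITS, ADC_BITS = 8, 8      # assumed converter resolutions

def quantize(x, bits, full_scale):
    """Uniform symmetric quantization over [-full_scale, full_scale]."""
    levels = 2 ** (bits - 1) - 1
    clipped = np.clip(x, -full_scale, full_scale)
    return np.round(clipped / full_scale * levels) / levels * full_scale

def ima_mvm(weights, ifm, full_scale=1.0):
    """One analog MVM: DAC -> crossbar -> ADC (ideal, noiseless sketch)."""
    assert weights.shape == (ROWS, COLS) and ifm.shape == (ROWS,)
    x = quantize(ifm, DAC_BITS, full_scale)   # one DAC per word line
    y = weights.T @ x                         # in-place analog dot products
    fs_out = float(np.abs(y).max()) or 1.0    # assumed per-MVM ADC full scale
    return quantize(y, ADC_BITS, fs_out)      # one ADC per bit line
```

In the real device, noise and cell non-idealities perturb `y` before conversion, which is why the specialized training mentioned in Sec. I is needed.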
3) Interconnect: The interconnect infrastructure connecting the clusters in the proposed many-core architecture consists of a highly parametrizable hierarchical network composed of a set of AXI4 nodes, as proposed in Kurth et al. [16]. The network topology specifies different regions called quadrants connecting groups of clusters: the Level 1 nodes connect N_1 quadrants (clusters), the Level 2 nodes connect N_2 Level 1 quadrants, and the Level N nodes connect N_N Level N-1 quadrants, as shown in Fig. 1B. The Quadrant Factor for a given level N defines the number of quadrants (either clusters or Level N-1 quadrants) connected to the AXI node at that level. Clusters feature a master and a slave port, which means that a transaction can either be initiated by the target cluster through its master port or by any other cluster through the target cluster's slave port. The same concept applies to the whole hierarchy of quadrants. In both cases, transactions can be either read or write transactions, with full support for bursts according to the AXI4 specification. The last level of the interconnect architecture, called the Wrapper, connects all the levels below to the off-chip HBM through an HBM controller.
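The quadrant hierarchy above can be sketched as a product of per-level fanouts. The factors below follow Table I (wrapper: 8, L3: 4, L2: 4, L1: 4), which indeed yields the 512 clusters of the evaluated configuration; the level names are those of Fig. 1B.

```python
from math import prod

# Quadrant factors from Table I, walking from the wrapper down to clusters.
QUADRANT_FACTORS = {"wrapper": 8, "L3": 4, "L2": 4, "L1": 4}

def total_clusters(factors):
    """Clusters reachable below the wrapper = product of per-level fanouts."""
    return prod(factors.values())

def nodes_per_level(factors):
    """Number of AXI nodes instantiated at each level of the tree."""
    counts, n = {}, 1
    for level, fanout in factors.items():  # top-down traversal
        counts[level] = n
        n *= fanout
    return counts
```

With these factors, `total_clusters` gives 512 and the tree instantiates 1 wrapper, 8 L3, 32 L2, and 128 L1 nodes.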
## III. SIMULATION INFRASTRUCTURE
We modeled the proposed architecture by extending an open-source simulator named GVSOC [14], meant to simulate RISC-V-based clustered multi-core architectures. It is a C++ event-based simulator featuring a simulation speed of 25 MIPS and an accuracy of more than 90% compared to a cycle-accurate equivalent architecture when simulating a full DNN in a single cluster, as reported in Bruschi et al. [14].
The main components integrated into the simulator are the IMA and the interconnect infrastructure, which extend the capabilities of the simulator towards many-core accelerators (i.e., up to 512 clusters and 8192 RISC-V cores). The IMA is integrated into the cluster as a master of the cluster crossbar. All the components of the IMA have been modeled, including the input and output buffers and the streamers. At the system level, the interconnect infrastructure has been modeled as a set of parametric router components with configurable data width, latency, and number of master and slave ports, combined to create the topology described in Fig. 1D. Table I lists the configuration parameters of the platform used in this work. All the modules in the simulator have been calibrated against cycle-accurate RTL and FPGA equivalents. A 256 × 256 IMA size has been used since it has been demonstrated in more works and shows better technological feasibility at this time [7]. This infrastructure allows the execution of a full ResNet-18 on 512 instantiated clusters to be simulated in less than 20 minutes on a machine with 32 GB of RAM and an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz.

Fig. 2. A) Directed Acyclic Graph (DAG) of the ResNet-18 execution. B) Mapping example on 512 clusters. C) High-level description of the pipelining computational model.
<details>
<summary>Image 2 Details</summary>

Figure 2 shows: (A) the ResNet-18 layer graph, with nodes 0-27 labeled conv, pool, res, and FC, and residual bypass connections; (B) a grid visualization of batches flowing through the pipelined clusters over time, colored by layer type; (C) the pipeline stages, where each stage performs "in", "compute", and "out" steps and a new batch starts after the maximum of the three times for the previous batch.

</details>
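The parametric router components just mentioned can be sketched with a simple cost model: each router adds a fixed latency and serializes a burst over its data width. The widths and latencies below follow Table I (64-byte links, 4-cycle routers, a 100-cycle HBM access), but the additive formula itself is an assumption, not the simulator's exact timing.

```python
import math

def burst_cycles(num_bytes, width_bytes, latency_cycles):
    """Cycles for one AXI burst through a single router (assumed model)."""
    return latency_cycles + math.ceil(num_bytes / width_bytes)

def path_cycles(num_bytes, hops):
    """Cycles along a path; `hops` is a list of (width, latency) pairs."""
    return sum(burst_cycles(num_bytes, w, lat) for w, lat in hops)

# Example: a cluster reading 4 KiB from HBM through the L1, L2, L3, and
# wrapper routers (4 cycles each), plus the HBM access (100 cycles).
HBM_PATH = [(64, 4)] * 4 + [(64, 100)]
```

Under this model, `path_cycles(4096, HBM_PATH)` shows how the 100-cycle HBM latency dominates short transfers, motivating the DMA bursts and double buffering described in Sec. IV.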
## IV. COMPUTATIONAL MODEL
This section presents the computational model of the proposed massively parallel heterogeneous architecture, detailing its main characteristics: Layer Mapping, IMA execution, Pipelining, Data Tiling, and Self-Timed Execution Flow.
1) Static Layer Mapping: According to the computational model of the proposed many-core architecture, each layer of a DNN is statically mapped to a certain number of clusters, while the input/output feature maps (IFM/OFM) are streamed from producer to consumer clusters. Fig. 2B shows the mapping of ResNet-18 on the architecture: each node of the graph in Fig. 2A represents a CNN layer, grouped by color according to the IFM dimensions, and each layer is mapped on a different set of clusters of the system. The number of clusters used to map a specific layer depends on the number of parameters of the layer. For example, Layer 22 features 2.3M parameters, requiring 40 clusters for the mapping, assuming each 256 × 256 IMA can store 64K parameters.

Fig. 3. IMA execution model.
<details>
<summary>Image 3 Details</summary>

Figure 3 shows the IMA execution model: IFM tiles are streamed in from the L1 TCDM to the input buffer, converted by the DAC, multiplied against the weights crossbar (W^T × IFM_i = OFM_i), converted back by the ADC, and streamed out from the output buffer to L1.

</details>
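The parameter-driven cluster count can be sketched as a lower bound: the number of clusters needed to hold a layer's parameters, assuming one 256 × 256 IMA (64K parameters) per cluster as in Table I. Note that for Layer 22 this bound gives 36 clusters, while the actual mapping uses 40, since the crossbar shape constraints discussed in Sec. V prevent packing the parameters densely.

```python
import math

IMA_ROWS = IMA_COLS = 256
PARAMS_PER_IMA = IMA_ROWS * IMA_COLS  # 65 536 cells per crossbar

def min_clusters(num_params):
    """Lower bound on clusters needed to hold a layer's parameters,
    with one IMA per cluster (ignores crossbar shape constraints)."""
    return math.ceil(num_params / PARAMS_PER_IMA)
```

For example, `min_clusters(2_300_000)` returns 36 for Layer 22's 2.3M parameters.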
2) IMA Execution: As described in Sec. II, the IMA subsystem communicates directly with the L1 of the cluster, acting as a master of the TCDM interconnect. Assuming the DNN parameters of a specific layer have been pre-loaded into the non-volatile array, IMA execution is composed of three distinct phases, as shown in Fig. 3. Stream-in fetches the IFM of the layer and moves it to the input buffer of the IMA. Compute performs the input data conversion by the DACs, the analog MVM execution on the crossbar, and the ADC conversion. Stream-out moves the digital MVM result from the output buffers to the L1 memory. Input and output buffers are duplicated to enable double buffering, completely overlapping the cost of transfers between L1 and the buffers with the computation and maximizing the computational efficiency of the accelerator.
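The double-buffering argument can be made concrete with a simplified timing model (an illustration, not the simulator's timing): with duplicated buffers, stream-in and stream-out of one tile overlap the compute of another, so the steady-state time per tile is the maximum of the three phase times rather than their sum.

```python
def total_time(n_tiles, t_in, t_compute, t_out):
    """End-to-end time for n tiles through the three-phase IMA pipeline
    with double buffering (fill + steady state + drain)."""
    if n_tiles == 0:
        return 0
    # First tile pays all three phases; each extra tile only pays the
    # slowest phase, since the other two are hidden behind it.
    return t_in + t_compute + t_out + (n_tiles - 1) * max(t_in, t_compute, t_out)
```

For instance, with a 130 ns compute phase (Sec. II) and hypothetical 10 ns transfers, 16 tiles take 150 + 15 × 130 = 2100 ns, so the accelerator is compute-bound as intended.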
3) Pipelining: When the inference starts, the IFM of the first layer is streamed into the first set of clusters, which process it and generate the OFM, which is then passed to the second set of clusters, and so on. The availability of large batches of images allows the creation of the software pipeline described in Fig. 2C, where different chunks of data are processed by different sets of clusters simultaneously, fully overlapping the data movements (in charge of the DMA) with the computation (in charge of the IMA and/or CORES). Ensuring that all pipeline stages execute in the same amount of time is essential when creating such a pipeline structure. Techniques to speed up pipeline stages, such as parallelization and data-replication, are discussed in Sec. V.
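The importance of stage balancing can be sketched with the usual pipeline cost model (an assumption for illustration; stage times are placeholders): once the pipeline is full, one batch completes every `max(stage_times)`, so the slowest stage alone sets the throughput and is the candidate for the parallelization and data-replication techniques of Sec. V.

```python
def pipeline_time(stage_times, n_batches):
    """Total time to push n_batches through the layer pipeline."""
    fill = sum(stage_times)    # first batch traverses every stage
    steady = max(stage_times)  # bottleneck stage sets the rate afterwards
    return fill + (n_batches - 1) * steady

def bottleneck(stage_times):
    """Index of the stage limiting throughput (to parallelize/replicate)."""
    return max(range(len(stage_times)), key=stage_times.__getitem__)
```

With hypothetical stage times `[3, 5, 2]`, four batches take 10 + 3 × 5 = 25 time units, and stage 1 is the one worth splitting.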
4) Data Tiling: To fit the IFM/OFM of large DNN models within the limited memory resources of the clusters (1 MB of L1 memory is assumed in this work), we split the IFM/OFM into smaller chunks of data called tiles, processed by the clusters as soon as the input data is transferred to the L1 memory. In particular, data tiling is always performed along the W_in and W_out dimensions for input and output, respectively. In this work, we assume a static tiling strategy, and W_in/out implicitly defines the batching dimension; therefore, the batches are composed of vertical slices of the IFM/OFM. The other dimensions (C_in and H_in) are, when necessary, tiled across other clusters to fit the memory requirements (parameter mapping) or to speed up the computation (parallelization).
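The W-dimension tiling rule can be sketched as follows. The fitting condition is a simplified assumption (one input tile plus one output tile must fit in the 1 MB L1, ignoring halo overlap between neighbouring slices and buffer duplication); the dimension names follow Fig. 4A.

```python
L1_BYTES = 1 << 20  # 1 MB of L1 per cluster (Table I)

def w_tile(c_in, h_in, c_out, h_out, w, elem_bytes=1):
    """Largest vertical W-slice such that one IFM tile plus one OFM tile
    fit in L1 (simplified: no halo, no double-buffer duplication)."""
    per_column = (c_in * h_in + c_out * h_out) * elem_bytes
    return min(w, L1_BYTES // per_column)
```

For a hypothetical layer with 64 channels and 64-pixel height on both sides, `w_tile(64, 64, 64, 64, 256)` returns 128, i.e., the 256-wide feature map is split into two vertical slices.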
5) Self-Timed Execution: To implement the pipeline between the tiled structures described in Sec. IV-4, we exploit a data-flow self-timed execution model. Computation in a cluster can be performed by the CORES, the IMA, or both in parallel. While software execution on the CORES is synchronous, IMA execution is managed asynchronously (like DMA transfers). A cluster can perform a certain computation whenever three conditions are satisfied: a) chunk N+1 from the producers can be loaded into the L1 memory, b) the consumers are ready to accept the output data of chunk N-1, and c) both the IMA and the CORES are free to compute chunk N. If all the conditions are satisfied, the new iteration can start with the following execution flow: 1) CORE0 (i.e., the master core) first waits for the events from the input and output DMA channels and the IMA, 2) CORE0 configures and triggers the I/O DMA channels and the IMA for the computation of the next tile, 3) digital processing is performed in parallel on the CORES, and 4) all the CORES go to sleep, waiting for the events described in point 1).
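The readiness check above can be sketched with boolean flags standing in for the hardware events that CORE0 actually waits on (a behavioural sketch, not the runtime's implementation):

```python
def can_start(input_ready, consumer_ready, ima_idle, cores_idle):
    """Conditions a)-c): producer data loadable, consumer ready,
    and both compute engines free."""
    return input_ready and consumer_ready and ima_idle and cores_idle

def iteration(state):
    """One loop of the self-timed flow driven by CORE0 (simplified)."""
    if not can_start(**state):
        return "sleep"    # 4) cores sleep until the next event
    # 2) trigger the I/O DMA channels and the IMA for the next tile,
    # 3) then run the digital part of the current tile on the CORES.
    return "compute"
```

If any single condition is false, the cluster stalls, which is exactly how back-pressure propagates through the pipeline without any global synchronization.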
Fig. 4. A) Generic layer IFM, parameters and OFM. B) Data-replication and parallelization. C) Reduction operation.
<details>
<summary>Image 4 Details</summary>

Figure 4 shows: (A) the tiling of a generic layer's IFM (C_in × H_in × W_in), parameters (C_out × C_in × K_x × K_y), and OFM (C_out × H_out × W_out); (B) data-replication/parallelization, with input tiles of size H_in_tile × C_in processed in parallel by multiple IMAs/CORES into output tiles of size H_out_tile × C_out; (C) the reduction operation, where partial output tiles are summed through a tree of additions spread across IMAs, CORES, and pipeline stages.

</details>
## V. EVALUATION: RESNET-18
In this section, we present the mapping of ResNet-18 onto the proposed many-core architecture, providing insights on the adaptations required to the baseline mapping and execution model presented in Sec. IV to map all the layers and balance the pipeline optimally. The key building block of ResNet-18 is a sequence of two 3x3 2D convolutions followed by a tensor addition (i.e., residual layer) between the OFM of the previous layer and the OFM of the previous residual layer. The first two layers are a 7x7 2D convolution followed by a MaxPool layer, which starts propagating the residuals. The topology is shown in Fig. 2A.
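As a shape-level sanity check, the IFM dimensions of the layer groups later reported in Fig. 7 can be reproduced from the standard ResNet-18 stride/padding configuration applied to a 256 × 256 input. The helper below is an illustrative sketch (function names are ours, not from the paper's code):

```python
# Sketch: spatial shape propagation through the ResNet-18 front-end and the
# stride-2 residual groups, starting from a 256x256x3 input. Assumes the
# standard ResNet-18 kernel/stride/padding values.

def conv_out(size, kernel, stride, pad):
    """Output spatial size of a conv/pool layer (floor convention)."""
    return (size + 2 * pad - kernel) // stride + 1

def resnet18_group_shapes(in_size=256):
    shapes = []
    # Layer 0: 7x7 conv, stride 2, pad 3, 64 output channels
    s = conv_out(in_size, 7, 2, 3)
    shapes.append((s, s, 64))
    # MaxPool: 3x3, stride 2, pad 1
    s = conv_out(s, 3, 2, 1)
    shapes.append((s, s, 64))
    # Stride-2 group transitions: spatial halves, channels double
    c = 64
    for _ in range(3):
        s = conv_out(s, 3, 2, 1)
        c *= 2
        shapes.append((s, s, c))
    return shapes

print(resnet18_group_shapes())
# [(128, 128, 64), (64, 64, 64), (32, 32, 128), (16, 16, 256), (8, 8, 512)]
```

These shapes match the IFM shapes of layer groups 1-5 in Fig. 7 (group 0 being the 256x256x3 input itself).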
- 1) Multi-cluster layers : When mapping a real-life network onto the many-core architecture, the ideal condition would be a perfect match between the parameters of every layer and the size of the IMA. Unfortunately, this does not hold in the general case. Two types of situations might arise, depending on the dimensions of the IFM/OFM. When the number of input channels multiplied by the kernel size ( C in × K x × K y ) is greater than the number of rows of the IMA (i.e., 256), multiple IMAs are needed to compute partial outputs, and a reduction among these partial outputs has to be performed to compute the OFM. On the other hand, when the number of output channels is larger than the number of columns of the IMA (i.e., 256), the inputs need to be broadcast to all the IMAs involved (each storing the parameters of a different set of output channels), and multiple IMAs compute part of the output channels at the same time. In some layers of ResNet-18, the two situations arise concurrently. This mapping approach has to be applied to all the layers computed in the analog domain, excluding Layer 0 .
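The splitting rule above can be sketched as follows; the helper is illustrative (names and the example layer are ours), assuming the 256 × 256 IMA geometry stated in the text:

```python
import math

# Sketch of the multi-cluster mapping rule of Sec. V-1: a layer whose unrolled
# weight matrix exceeds one 256x256 IMA is split along rows (requiring a
# reduction of partial outputs) and/or columns (requiring input broadcast).

IMA_ROWS = 256  # one column accumulates C_in * K_x * K_y inputs
IMA_COLS = 256  # one array holds up to 256 output channels

def imas_needed(c_in, k_x, k_y, c_out):
    row_splits = math.ceil(c_in * k_x * k_y / IMA_ROWS)  # -> partial-output reduction
    col_splits = math.ceil(c_out / IMA_COLS)             # -> input broadcast
    return row_splits, col_splits, row_splits * col_splits

# A deep 3x3 layer with 512 input and 512 output channels needs both splits:
print(imas_needed(512, 3, 3, 512))  # (18, 2, 36)
```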
- 2) Data-replication and Parallelization : In a pipelined computational model, the throughput of the whole pipeline is limited by the latency of the slowest stage. Hence, the pipeline has to be balanced as much as possible to achieve high throughput. Imbalance might have many causes, some of which we explain in Sec. VI. A technique to speed up layers (i.e., pipeline stages) executed in the analog domain (i.e., on the IMA) is data-replication . Data-replication increases parallelism by replicating the parameters of a layer on different IMAs and computing multiple jobs on multiple chunks of the IFM at the same time. With this approach, the speed-up, net of overheads due to communication and data tiling, is theoretically equal to the number of replications, at the cost of area, as the same layer parameters are stored on multiple IMAs (Fig. 4B). In this work, we use this technique extensively, especially for the first layers of the network. If the bottleneck of the pipeline is a layer executed in the digital domain (e.g., residual, reductions, pooling), one option is to parallelize the computation on the CORES over multiple clusters. A plain parallelization scheme is used for pooling and residual layers (i.e., Layers 1, 4, 7, 13, 19).
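The balancing effect of data-replication can be sketched numerically; the stage latencies below are illustrative, not measured values from the paper:

```python
# Sketch: pipeline throughput is set by the slowest stage, and replicating a
# stage's parameters on r IMAs ideally divides its latency by r (net of
# communication and tiling overheads, which are ignored here).

def pipeline_throughput(stage_latencies_ms):
    return 1.0 / max(stage_latencies_ms)  # jobs completed per ms, steady state

stages = [4.0, 1.0, 1.0, 1.0]      # stage 0 is the bottleneck (illustrative)
replication = [4, 1, 1, 1]          # replicate stage 0's parameters on 4 IMAs

balanced = [t / r for t, r in zip(stages, replication)]
print(pipeline_throughput(stages), pipeline_throughput(balanced))  # 0.25 1.0
```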
- 3) Reduction Management : A different approach has to be adopted for the reduction, since this operation requires a hierarchical tree (Fig. 4C) featuring limited and decreasing parallelism. In particular, the level of parallelism of this operation is implicit in the structure of the network. In ResNet-18, we might have to sum up the partial products of up to 20 clusters (i.e., Layers 20-21, 23-24), according to the multi-cluster mapping strategy described in Sec. V-1. In this context, the computation of the reductions might form a bottleneck for the pipeline, since this operation has to be performed by the CORES in the digital domain. To accelerate these layers, we split the hierarchical tree into several pipeline stages and assign each pipeline stage to a logarithmically decreasing number of clusters with well-balanced latency. This approach has been exploited in all reduction layers.
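A minimal model of the tree's decreasing parallelism, assuming pairwise additions per stage (the scheduling helper is ours, not the paper's implementation):

```python
# Sketch of the hierarchical reduction tree of Sec. V-3: N partial outputs are
# summed pairwise over roughly log2(N) stages, with a logarithmically
# decreasing number of parallel additions (hence clusters) per stage.

def reduction_schedule(n_partials):
    """Return the number of pairwise additions performed at each tree stage."""
    schedule = []
    n = n_partials
    while n > 1:
        adds = n // 2        # pairs summed in parallel at this stage
        schedule.append(adds)
        n -= adds            # partial sums plus a possible odd leftover
    return schedule

# 20 partial outputs, as for Layers 20-21 and 23-24 of ResNet-18:
print(reduction_schedule(20))  # [10, 5, 2, 1, 1]
```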
- 4) Residuals Management : In an ideal pipelined data flow, data are exchanged only among consecutive pipeline stages. In many modern DNNs such as ResNet-18, however, this is not the case due to the presence of residual layers.
Fig. 5. ResNet-18 inference results. A) Throughput with different mapping optimizations. B) Execution time on every cluster in naive implementation. C) Execution time on every cluster in data-replication and parallelization implementation. D) Execution time on every cluster in final implementation.
<details>
<summary>Image 5 Details</summary>

### Visual Description
A) Bar chart of performance [TOPS] per mapping: naive (~6.5 TOPS, 261 clusters), data-replication and parallelization (~10.5 TOPS, 322 clusters, 1.6x over naive), and residual management (~19.5 TOPS, 324 clusters, 1.9x further gain). B-D) Per-cluster latency breakdowns [ms], split into EXEC (analog bound), EXEC (digital bound), IDLE, synchronization overhead, and communication overhead: in the naive mapping (B) idle time dominates (up to ~12 ms); after data-replication and parallelization (C) communication overhead dominates (up to ~8 ms); in the final mapping (D) latency stays below ~5 ms, with idle and communication the main non-compute contributors.
</details>
Unfortunately, given the limited memory storage (1 MB per cluster) and the residual's data lifetime between when it is produced and when it is consumed, external temporary storage has to be used for this data. In particular, in our pipeline, ResNet-18 requires 1.6 MB to simultaneously store all the residuals of the whole network, where the minimum dimension of each buffer can be calculated as C out × H out . A first intuitive approach to tackle this issue is to exploit the off-chip HBM memory due to its large capacity; however, using it as temporary storage for residual blocks significantly increases the traffic towards this high-latency memory controller, which becomes a bottleneck for the whole pipeline and reduces its overall performance. A better solution is to exploit the L1 memory of clusters not used for computation as residual storage, improving performance over the baseline approach by 1.9 × .
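The spare-cluster solution can be sketched as a simple first-fit allocation of residual buffers into 1 MB L1 memories; the individual buffer sizes below are illustrative (the paper only reports the 1.6 MB total):

```python
# Sketch: placing residual buffers into the 1 MB L1 of clusters not used for
# computation (first-fit bin packing), instead of round-tripping through the
# high-latency HBM. Buffer sizes are illustrative assumptions.

L1_BYTES = 1 << 20  # 1 MB of L1 per cluster

def place_residuals(buffer_sizes, l1_bytes=L1_BYTES):
    """First-fit allocation; returns per-spare-cluster occupancy in bytes."""
    clusters = []
    for size in buffer_sizes:
        for i, used in enumerate(clusters):
            if used + size <= l1_bytes:
                clusters[i] += size
                break
        else:
            clusters.append(size)  # open a new spare cluster
    return clusters

# 1.6 MB of residuals, split into 4 illustrative 400 KB chunks,
# fits in 2 spare clusters (matching the 2 extra clusters used in Sec. VI):
print(len(place_residuals([400 * 1024] * 4)))  # 2
```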
## VI. RESULTS AND DISCUSSION
In this section, we analyze the results of the ResNet-18 execution mapped on the proposed many-core heterogeneous architecture. To extract reliable physical implementation information, we performed the physical implementation (down to a ready-for-silicon layout) of the cluster in 22nm FDX technology from Global Foundries. We used Synopsys Design Compiler for physical synthesis, Cadence Innovus for Place&Route, Siemens Questasim for extracting the value change dump for activity annotation, and Synopsys PrimeTime for power analysis. Area, frequency, and power figures are
Fig. 6. Performance degradation considering non-idealities due to static mapping, network topology, and communication.
<details>
<summary>Image 6 Details</summary>

### Visual Description
Bar chart of performance [TOPS] across scenarios: ideal (~510), global mapping (~320, 1.6x degradation), local mapping (~110, 3.0x), intra-layer unbalance (~20, 5.0x), and communication (~15, 1.2x). The cumulative degradation vs. the ideal case is annotated as 23.8x up to intra-layer unbalance and 28.4x including communication.
</details>
Fig. 7. Area efficiency per group of clusters as defined in Fig. 2 without communication inefficiencies.
<details>
<summary>Image 7 Details</summary>

### Visual Description
Horizontal bar chart of area efficiency [GOPS/mm²] per layer group, colored by IFM shape: group 0 (256x256x3) ~140, group 1 (128x128x64) ~210, group 3 (32x32x128) ~400, group 4 (16x16x256) ~200, group 5 (8x8x512) ~60; no bar is shown for group 2 (64x64x64).
</details>
then scaled to a 5nm technology node, more suitable for modern HPC architectures. Fig. 5D shows the execution time of a batch of 16 256 × 256 images on the architecture. For each cluster, it shows the time spent on computation, communication, synchronization, and sleeping. Since analog and digital computations are performed in parallel, execution bars are shown in green when analog bound and in red when digital bound. We can note the expected increasing trend of latency with the cluster ID, caused by the head and tail of the pipeline execution (i.e., idle times waiting for the first and the last batches propagate through the pipeline).
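The head/tail trend can be illustrated with a simple pipeline timing model; this assumes uniform stage latency, which is a simplification of the measured behavior:

```python
# Sketch of the head/tail effect visible in Fig. 5D: in a P-stage pipeline
# processing B jobs with uniform stage latency T, stage i sits idle for i*T at
# the head (waiting for the first job to arrive) and finishes its last job at
# (i + B)*T, so completion time grows with the stage/cluster index.

def stage_timeline(p_stages, b_jobs, t_ms=1.0):
    return [{"stage": i,
             "head_idle_ms": i * t_ms,
             "finish_ms": (i + b_jobs) * t_ms}
            for i in range(p_stages)]

tl = stage_timeline(p_stages=4, b_jobs=16)  # e.g., a batch of 16 images
print(tl[0]["finish_ms"], tl[-1]["finish_ms"])  # 16.0 19.0
```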
Fig. 5A shows the performance gain achieved thanks to the techniques described in Sec. IV. Fig. 5B shows the latency breakdown of the clusters in a naive implementation, where all the network parameters are mapped onto the architecture exploiting the multi-cluster technique described in Sec. V-1 but with no further optimizations. It is possible to note the large unbalance between the first layers and the deeper layers of the network. Fig. 5C shows the latency breakdown after data-replication and parallelization , which better balances the pipeline and improves performance by 1.6 × at the cost of 61 more clusters. Reducing compute latency moves the bottleneck of the execution to communication, due to large contentions on the HBM mainly caused by residual management. Fig. 5D shows the optimized mapping of residuals described in Sec. V-4, further improving performance by 1.9 × at the cost of 2 more clusters (to exploit 2 MB of available on-chip memory).
To provide insights into the sources of inefficiency highlighted in Fig. 6, we analyze the mapping and latency breakdown of the ResNet-18 inference. The first source of inefficiency (global mapping) is that not all the clusters are used for mapping network parameters: in our mapping, 322 clusters out of 512 are exploited. This is an intrinsic characteristic of all systolic architectures exploiting pipelining as a computational model, worsened by the mapping constraints imposed by the IMA. However, this only affects area efficiency since, in such a regular architecture, each cluster can easily be clock- and power-gated, minimizing the impact on energy efficiency. The second source of inefficiency (local mapping) is that even when a specific cluster is used, the mapping might under-utilize its analog and digital resources. In some cases, the parameters cannot fill the whole IMA; in others, the array is not used at all. The same happens for digital computing, e.g., in the case of purely digital layers. A possible solution to mitigate this degradation would be to integrate heterogeneous clusters better tailored to these cases, such as an IMA with a single CORE (i.e., analog clusters) or 16 CORES without an IMA (i.e., digital clusters).
The third source of inefficiency is pipeline unbalance. Different layers feature different computational efficiency, as shown in Fig. 7, where the layer groups are defined depending on the IFM dimension. Some layer groups feature significant area efficiency, thanks to large IFM/OFM implying high data reuse (i.e., several iterations over the same parameters statically mapped on the IMA). In particular, Layer 12 (i.e., group 3) is executed on 10 clusters with a data-replication factor of 2, leading to a peak efficiency of 600 GOPS/mm 2 . Conversely, deeper layers in the network, because of the stride, feature analog layers with very poor parameter reuse interleaved with stages of reductions executed by the CORES. In particular, Layers 20, 21, 23, and 24 (i.e., group 5) are executed on 40 clusters each. This results in extremely short execution of the analog layers (less than 0.2 ms), which translates into lower area efficiency (50 GOPS/mm 2 ) compared to the first layers. A possible approach to tackle this inefficiency is to exploit heterogeneity further, coupling the IMA and CORES with a set of more compact, specialized digital accelerators better suited for low-data-reuse layers, improving silicon efficiency. Another approach could be to use larger IMA arrays [17]; however, this would require more data transfers per cluster.
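The degradation sources of Fig. 6 compose multiplicatively; the factors below are approximate readings from the figure, and the small mismatch with the annotated end-to-end totals (23.8x, 28.4x) comes from rounding:

```python
# Sketch: per-source performance degradation factors (approximate readings
# from Fig. 6) compose multiplicatively into the end-to-end loss vs. ideal.

factors = {
    "global mapping": 1.6,
    "local mapping": 3.0,
    "intra-layer unbalance": 5.0,
    "communication": 1.2,
}

cumulative = 1.0
for name, f in factors.items():
    cumulative *= f

print(round(cumulative, 1))  # 28.8, close to the ~28.4x annotated in Fig. 6
```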
Despite the analyzed sources of inefficiency, the proposed architecture delivers 20.2 TOPS (i.e., 3303 images/s) and 42 GOPS/mm 2 on the end-to-end inference of ResNet-18. Performing the inference in 9.2 ms and 15 mJ, corresponding to an energy efficiency of 6.5 TOPS/W, paves the way for a new generation of general-purpose many-core architectures exploiting a mix of analog and digital computing.
## VII. CONCLUSION
In this work, we have proposed a general-purpose heterogeneous many-core architecture based on a PULP cluster augmented with nvAIMC accelerators to efficiently execute real end-to-end networks, exploiting the throughput of this paradigm. We have proposed a mapping based on a pipelined execution flow combined with several techniques to increase parallelism and split the workload among the nvAIMC cores. We have shown the results of the inference of a batch of 16 256 × 256 images on a ResNet-18, obtaining up to 20.2 TOPS and 6.5 TOPS/W for a 480 mm 2 architecture. Finally, we have provided an exhaustive performance analysis, considering a realistic traffic scenario and highlighting the criticalities that arise when nvAIMC is used in real applications, providing several insights on how to mitigate these effects and drive the design and usage of nvAIMC architectures as general-purpose platforms for DNN acceleration.
## ACKNOWLEDGEMENT
This work was supported by the WiPLASH project (g.a. 863337), funded by the European Union's Horizon 2020 research and innovation program.
## REFERENCES
- [1] A. Sebastian et al. , 'Memory devices and applications for in-memory computing,' Nature nanotechnology , vol. 15, no. 7, pp. 529-544, 2020.
- [2] J.-s. Seo et al. , 'Digital Versus Analog Artificial Intelligence Accelerators: Advances, trends, and emerging designs,' IEEE Solid-State Circuits Magazine , vol. 14, no. 3, pp. 65-79, 2022.
- [3] J.-M. Hung et al. , 'Challenges and Trends of Nonvolatile In-Memory-Computation Circuits for AI Edge Devices,' IEEE Open Journal of the Solid-State Circuits Society , vol. 1, pp. 171-183, 2021.
- [4] S. Yu et al. , 'Compute-in-Memory Chips for Deep Learning: Recent Trends and Prospects,' IEEE Circuits and Systems Magazine , vol. 21, no. 3, pp. 31-56, 2021.
- [5] H. Jia et al. , '15.1 A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing,' in 2021 IEEE International Solid-State Circuits Conference (ISSCC) , vol. 64, pp. 236-238, 2021.
- [6] I. A. Papistas et al. , 'A 22 nm, 1540 TOP/s/W, 12.1 TOP/s/mm2 In-Memory Analog Matrix-Vector-Multiplier for DNN Acceleration,' in 2021 IEEE Custom Integrated Circuits Conference (CICC) , pp. 1-2, 2021.
- [7] R. Khaddam-Aljameh et al. , 'HERMES-Core: A 1.59-TOPS/mm2 PCM on 14-nm CMOS In-Memory Compute Core Using 300-ps/LSB Linearized CCO-Based ADCs,' IEEE Journal of Solid-State Circuits , vol. 57, no. 4, pp. 1027-1038, 2022.
- [8] H. Jia et al. , 'A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing,' IEEE Journal of Solid-State Circuits , vol. 55, no. 9, pp. 2609-2621, 2020.
- [9] A. Garofalo et al. , 'A Heterogeneous In-Memory Computing Cluster for Flexible End-to-End Inference of Real-World Deep Neural Networks,' IEEE Journal on Emerging and Selected Topics in Circuits and Systems , vol. 12, no. 2, pp. 422-435, 2022.
- [10] C. Zhou et al. , 'AnalogNets: ML-HW co-design of noise-robust TinyML models and always-on analog compute-in-memory accelerator,' arXiv preprint arXiv:2111.06503 , 2021.
- [11] M. Dazzi et al. , 'Efficient Pipelined Execution of CNNs Based on In-Memory Computing and Graph Homomorphism Verification,' IEEE Transactions on Computers , vol. 70, no. 6, pp. 922-935, 2021.
- [12] A. Shafiee et al. , 'ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,' in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) , pp. 14-26, 2016.
- [13] A. Ankit et al. , 'PUMA: A Programmable Ultra-Efficient Memristor-Based Accelerator for Machine Learning Inference,' in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems , ASPLOS '19, (New York, NY, USA), pp. 715-731, Association for Computing Machinery, 2019.
- [14] N. Bruschi et al. , 'GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors,' in 2021 IEEE 39th International Conference on Computer Design (ICCD) , pp. 409-416, 2021.
- [15] M. Gautschi et al. , 'Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices,' IEEE Transactions on Very Large Scale Integration (VLSI) Systems , vol. 25, no. 10, pp. 2700-2713, 2017.
- [16] A. Kurth et al. , 'An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication,' IEEE Transactions on Computers , pp. 1-1, 2021.
- [17] P. Narayanan et al. , 'Fully On-Chip MAC at 14 nm Enabled by Accurate Row-Wise Programming of PCM-Based Weights and Parallel Vector-Transport in Duration-Format,' IEEE Transactions on Electron Devices , vol. 68, no. 12, pp. 6629-6636, 2021.