2305.10459
## AnalogNAS: A Neural Network Design Framework for Accurate Inference with Analog In-Memory Computing
Hadjer Benmeziane∗, Corey Lammie†, Irem Boybat†, Malte Rasch‡, Manuel Le Gallo†, Hsinyu Tsai§,
Ramachandran Muralidhar‡, Smail Niar∗, Hamza Ouarnoughi∗, Vijay Narayanan‡,
Abu Sebastian† and Kaoutar El Maghraoui‡
∗ Univ. Polytechnique Hauts-de-France, CNRS, UMR 8201 - LAMIH, F-59313 Valenciennes, France
† IBM Research Europe, 8803 Rüschlikon, Switzerland
‡ IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
§ IBM Research Almaden, 650 Harry Road, San Jose, CA, USA
Abstract—The advancement of Deep Learning (DL) is driven by efficient Deep Neural Network (DNN) design and new hardware accelerators. Current DNN design is primarily tailored for general-purpose use and deployment on commercially viable platforms. Inference at the edge requires low-latency, compact, and power-efficient models, and must be cost-effective. Digital processors based on typical von Neumann architectures are not conducive to edge AI, given the large amounts of required data movement in and out of memory. Conversely, analog/mixed-signal in-memory computing hardware accelerators can easily transcend the memory wall of von Neumann architectures when accelerating inference workloads. They offer increased area and power efficiency, which is paramount in resource-constrained edge environments. In this paper, we propose AnalogNAS , a framework for automated DNN design targeting deployment on analog In-Memory Computing (IMC) inference accelerators. We conduct extensive hardware simulations to demonstrate the performance of AnalogNAS on State-Of-The-Art (SOTA) models in terms of accuracy and deployment efficiency on various Tiny Machine Learning (TinyML) tasks. We also present experimental results that show AnalogNAS models achieving higher accuracy than SOTA models when implemented on a 64-core IMC chip based on Phase Change Memory (PCM). The AnalogNAS search code is released 1 .
Index Terms -Analog AI, Neural Architecture Search, Optimization, Edge AI, In-memory Computing
## I. INTRODUCTION
WITH the growing demands of real-time DL workloads, today's conventional cloud-based AI deployment approaches do not meet the ever-increasing bandwidth, real-time, and low-latency requirements. Edge computing brings storage and local computation closer to the data sources produced by the sheer number of Internet of Things (IoT) objects, without overloading network and cloud resources. As
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. 1 https://github.com/IBM/analog-nas
DNNs are becoming more memory and compute intensive, edge AI deployments on resource-constrained devices pose significant challenges. These challenges have driven the need for specialized hardware accelerators for on-device Machine Learning (ML) and a plethora of tools and solutions targeting the development and deployment of power-efficient edge AI solutions. One such promising technology for edge hardware accelerators is analog-based IMC, which is herein referred to as analog IMC .
Analog IMC [1] can provide radical improvements in performance and power efficiency by leveraging the physical properties of memory devices to perform computation and storage at the same physical location. Many types of memory devices, including Flash memory, PCM, and Resistive Random Access Memory (RRAM), can be used for IMC [2]. Most notably, analog IMC can be used to perform Matrix-Vector Multiplication (MVM) operations, the most dominant operation in DNN acceleration, in O (1) time complexity [3]. In this novel approach, the weights of linear, convolutional, and recurrent DNN layers are mapped to crossbar arrays (tiles) of Non-Volatile Memory (NVM) elements. By exploiting Kirchhoff's circuit laws, MVMs can be performed by encoding inputs as Word-Line (WL) voltages and weights as device conductances. For most computations, this removes the need to pass data back and forth between Central Processing Units (CPUs) and memory. This back-and-forth data movement is inherent in traditional digital computing architectures and is often referred to as the von Neumann bottleneck . Because data movement is greatly reduced, tasks can be performed in a fraction of the time, and with much less energy.
NVM crossbar arrays and analog circuits, however, have inherent non-idealities, such as noise, temporal conductance drift, and non-linear errors, which can lead to imprecision and noisy computation [4]. These effects need to be properly quantified and mitigated to ensure the high accuracy of DNN models. In addition to the hardware constraints that are prevalent in edge devices, there is the added complexity of designing DNN
Fig. 1. The effect of PCM conductance drift after one day on standard CNN architectures and one architecture ( AnalogNAS\_T500 ) obtained using HW-NAS, evaluated using CIFAR-10. FP refers to the original network accuracy, and 1-day to the simulated analog network accuracy after 1-day device drift.
architectures which are optimized for the edge on a variety of hardware platforms. Hardware-software co-design approaches are required to tackle this complexity, as manually designed architectures are often tailored to specific hardware platforms. For instance, MobileNet [5] uses a depth-wise separable convolution that enhances CPU performance but is inefficient for Graphics Processing Unit (GPU) parallelization [6]. Such bespoke solutions are often hard to implement and do not generalize to other platforms.
HW-NAS [7] is a promising approach that seeks to automatically identify efficient DNN architectures for a target hardware platform. In contrast to traditional Neural Architecture Search (NAS) approaches that focus on searching for the most accurate architectures, HW-NAS searches for highly accurate models while optimizing hardware-related metrics. Existing HW-NAS strategies cannot be readily used with analog IMC processors without significant modification, for three reasons: (i) their search spaces contain operations and blocks that are not suitable for analog IMC, (ii) they lack a benchmark of hardware-aware trained architectures, and (iii) their search strategies do not account for noise injection and temporal drift on weights.
To address these challenges, we propose AnalogNAS , a novel HW-NAS strategy to design dedicated DNN architectures for efficient deployment on edge-based analog IMC inference accelerators. This approach considers the inherent characteristics of analog IMC hardware in both the search space and the search strategy. Fig. 1 depicts the necessity of our approach. As can be seen, when traditional DNN architectures are deployed on analog IMC hardware, non-idealities, such as conductance drift, drastically reduce network performance. Networks designed by AnalogNAS are extremely robust to these non-idealities and have far fewer parameters than equivalently robust traditional networks. Consequently, they have reduced resource utilization.
Our specific contributions can be summarized as follows:
- We design and construct a search space for analog IMC, which contains ResNet-like architectures, including ResNext [8] and Wide-ResNet [9], with blocks of varying widths and depths;
- We train a collection of networks using Hardware-Aware (HWA) training for image classification, Visual Wake Words (VWW), and Keyword Spotting (KWS) tasks. Using these networks, we build a surrogate model to rank the architectures during the search and predict robustness to conductance drift;
- We propose a global search strategy that uses evolutionary search to explore the search space and efficiently finds the right architecture under different constraints, including the number of network parameters and analog tiles;
- We conduct comprehensive experiments to empirically demonstrate that AnalogNAS can be efficiently utilized to carry out architecture search for various edge tiny applications, and investigate what attributes of networks make them ideal for implementation using analog AI;
- We validate a subset of networks on hardware using a 64-core IMC chip based on PCM.
The rest of the paper is structured as follows. In Section II, we present related work. In Section III, relevant notations and terminology are introduced. In Section IV, the search space and surrogate model are presented. In Section V, the search strategy is presented. In Section VI, the methodology for all experiments is discussed. The simulation results are presented in Section VI-B, along with the hardware validation and performance estimation in Section VII. The results are discussed in Section VIII. Section IX concludes the paper.
## II. RELATED WORK
## A. NAS for TinyML
HW-NAS has been successfully applied to a variety of edge hardware platforms [7], [10] used to deploy networks for TinyMLPerf tasks [11] such as image classification, VWW, KWS, and anomaly detection. MicroNets [12] leverages NAS for DL model deployment on micro-controllers and other embedded systems. It utilizes a differentiable search space [13] to find efficient architectures for different TinyMLPerf tasks. For each task, the search space is an extension of current SOTA architectures. µNAS [14] includes peak memory usage and a number of other parameters as constraints. Its search strategy combines aging evolution and Bayesian optimization to estimate the objectives and explore a granular search space efficiently. It constructs its search space from a standard CNN and modifies the operators' hyper-parameters and the number of layers.
## B. NAS for Mixed-Signal IMC Accelerators
Many works [15]-[18] target IMC accelerators using HW-NAS. FLASH [15] uses a small search space inspired by DenseNet [19] and searches for the number of skip connections that efficiently satisfies the trade-off between accuracy, latency, energy consumption, and chip area. Its surrogate model uses linear regression on the number of skip connections to predict model accuracy. NAS4RRAM [17] uses HW-NAS to find an efficient DNN for a specific RRAM-based accelerator. It uses an evolutionary algorithm, trains each sampled architecture without HWA training, and evaluates each network on a specific hardware instance. NACIM [16] uses co-exploration strategies to find the most efficient architecture and the associated hardware platform. For each sampled architecture, networks are trained considering noise variations. This approach is limited to a small search space due to the high time complexity of training. UAE [18] uses a Monte-Carlo simulation-based experimental flow to measure the device uncertainty induced in a handful of DNNs. Similar to NACIM [16], evaluation is performed using HWA training with noise injection. AnalogNet [20] extends the work of MicroNets by converting its final models to analog-friendly models, replacing depth-wise convolutions with standard convolutions and tuning hyper-parameters.
Compared to the above-mentioned SOTA HW-NAS strategies, AnalogNAS is better tailored to analog IMC hardware for two reasons: (i) our search space is much larger and more representative, featuring ResNet-like connections, which enables us to answer the key question of which architectural characteristics are suitable for analog IMC, a question that cannot be addressed with small search spaces; (ii) we consider the inherent characteristics of analog IMC hardware directly in the objectives and constraints of our search strategy, in addition to the noise injection during HWA training used by existing approaches.
## III. PRELIMINARIES
## A. Analog IMC Accelerator Mechanisms
Analog IMC accelerators are capable of performing MVM operations Y^T = X^T W using the laws of physics, where W is an M × N matrix, X is an M × 1 vector, and Y is an N × 1 vector. When arranged in an M × N crossbar configuration, NVM devices can be used to compute MVM operations. This is done by encoding elements of X as WL voltages, denoted using V, and elements of W as conductances of the unit cells, denoted using G. Negative conductance states cannot be directly encoded/represented using NVM devices. Consequently, differential weight mapping schemes are commonly employed, where positive weights, i.e., W⁺ = max(W, 0), and negative weights, i.e., W⁻ = −min(W, 0), are encoded within unit cells, using alternate columns, or on different tiles [3]. The analog computation, i.e., I = VG, is performed, where the current flowing to the end of the N-th column is $I_N = \sum_{i=1}^{M} G_{i,N} V_i$. Typically, Digital-to-Analog Converters (DACs) are required to encode WL voltages, and Analog-to-Digital Converters (ADCs) are required to read the output currents of each column. The employed analog IMC tile, its weight mapping scheme, and computation mechanism are depicted in Fig. 2.
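As a numerical sanity check of this mapping, the differential weight-encoding scheme can be sketched in a few lines (an idealized model that ignores DAC/ADC quantization and device noise; all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 3
W = rng.standard_normal((M, N))  # signed weight matrix
x = rng.standard_normal(M)       # input vector

# Differential mapping: NVM devices can only encode non-negative
# conductances, so the signed weights are split into two arrays.
G_pos = np.maximum(W, 0.0)       # W+ = max(W, 0)
G_neg = -np.minimum(W, 0.0)      # W- = -min(W, 0)

# Inputs are encoded as word-line voltages; each column's output
# current is the dot product of the voltages and conductances.
V = x
I = V @ G_pos - V @ G_neg        # per-column currents

# The (ideal) analog result matches the digital MVM y^T = x^T W.
assert np.allclose(I, x @ W)
```

The subtraction of the two column currents is what the alternate-column (or two-tile) scheme realizes in hardware.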
## B. Temporal Drift of Non-Volatile Memory Devices
Many types of NVM devices, most prominently PCM, exhibit a temporal evolution of their conductance values, referred to as conductance drift. This poses challenges for reliably maintaining synaptic weights [2]. Conductance drift is most commonly modeled using Eq. (1), as follows:
$$G(t) = G(t_0) \left( \frac{t}{t_0} \right)^{-\nu} \qquad (1)$$
where G(t_0) is the conductance at time t_0 and ν is the drift exponent. In practice, conductance drift is highly stochastic, because ν depends on the programmed conductance state and varies across devices. Consequently, when reporting the network accuracy at a given time instance (after device programming), it is computed across multiple experiment instances (trials) to properly capture the accuracy variation.
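Eq. (1) can be simulated directly; the sketch below draws a per-device drift exponent ν to mimic device-to-device variability and averages the conductance decay over several trials (the distribution parameters are illustrative, not hardware-calibrated):

```python
import numpy as np

def drift_conductance(G_t0, t, t0=20.0, nu_mean=0.05, nu_std=0.01, rng=None):
    """Apply Eq. (1): G(t) = G(t0) * (t / t0) ** (-nu),
    with a stochastic per-device drift exponent nu."""
    rng = np.random.default_rng(rng)
    nu = rng.normal(nu_mean, nu_std, size=np.shape(G_t0))
    return G_t0 * (t / t0) ** (-nu)

rng = np.random.default_rng(1)
G0 = rng.uniform(1.0, 25.0, size=1000)  # programmed conductances (uS)

one_day = 24 * 3600.0                   # seconds after programming
trials = [drift_conductance(G0, one_day, rng=seed) for seed in range(10)]

# Mean conductance ratio after one day of drift: below 1, i.e. decay.
mean_ratio = np.mean([np.mean(g / G0) for g in trials])
assert mean_ratio < 1.0
```

With hardware-calibrated statistics for ν, the same loop over trials yields the trial-to-trial accuracy spread discussed above.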
## C. HWA Training and Analog Hardware Accuracy Evaluation Simulation
To simulate training and inference on analog IMC accelerators, the IBM Analog Hardware Acceleration Kit (AIHWKIT) [21] is used. The AIHWKIT is an open-source Python toolkit for exploring and using the capabilities of in-memory computing devices in the context of artificial intelligence, and has been used for HWA training of standard DNNs with hardware-calibrated device noise and drift models [22].
## D. Hardware-aware Neural Architecture Search (HW-NAS)
HW-NAS refers to the task of automatically finding the most efficient DNN for a specific dataset and target hardware platform. HW-NAS approaches often employ black-box optimization methods such as evolutionary algorithms [23], reinforcement learning [24], [25], and Bayesian optimization [26], [27]. The optimization problem is cast either as a constrained or a multi-objective optimization [7]. In AnalogNAS, we chose constrained optimization over multi-objective optimization for several reasons. First, constrained optimization is more computationally efficient, which is important in the context of HW-NAS to allow searching a large search space in a practical time frame. Multi-objective optimization is computationally expensive and can result in a
Fig. 2. Employed analog IMC tile and weight mapping scheme.
Fig. 3. Resnet-like macro architecture.
large number of non-dominated solutions that can be difficult to interpret. Second, by using constrained optimization, we can explicitly incorporate the specific constraints of the analog hardware into our search strategy. This enables us to find DNN architectures that are optimized for the unique requirements and characteristics of analog IMC hardware, rather than simply optimizing for multiple objectives.
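Schematically, the resulting constrained search can be written as follows (a sketch only; the symbols $P_{\max}$ and $T_{\max}$ stand for the parameter and analog-tile budgets mentioned among the contributions, and the concrete objective and constraints are those of the search strategy in Section V):

```latex
\max_{a \in \mathcal{A}} \; \mathrm{Acc}_{\text{1-day}}(a)
\quad \text{s.t.} \quad
\mathrm{Params}(a) \le P_{\max}, \qquad
\mathrm{Tiles}(a) \le T_{\max}
```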
## IV. ANALOG-NAS
The objective of AnalogNAS is to find an efficient network architecture under different analog IMC hardware constraints. AnalogNAS comprises three main components: (i) a ResNet-like search space, (ii) an analog-accuracy surrogate model, and (iii) an evolutionary-based search strategy. We detail each component in the following subsections.
## A. Resnet-like Search Space
ResNet-like architectures have inspired many manually designed SOTA DL architectures, including Wide ResNet [9] and EfficientNet [28]. Their block-wise structure offers a flexible and searchable macro-architecture for NAS [29]. ResNet-like architectures can be implemented efficiently using IMC processors, as they are composed of a large number of MVM and element-wise operations. Additionally, due to the highly parallel nature of IMC, ResNet architectures effectively get free processing of additional input/output channels. This makes ResNet-like architectures highly amenable to analog implementation.
Fig. 3 depicts the macro-architecture used to construct all architectures in our search space. The architecture consists of a series of M distinct main blocks. Each main block contains R residual blocks. The residual blocks use skip connections with or without downsampling. Downsampling is performed using 1x1 convolution layers when required, i.e., when the input size does not match the output size. Each residual block can have B branches, and each branch uses a convolution block. We use different types of convolution blocks so that the search space contains all standard architectures, such as ResNets [30], ResNext [8], and Wide ResNets [9]. The standard convolution blocks used in ResNets, commonly referred to as BottleNeckBlock and BasicBlock , are denoted as A and B, respectively. We include variants of A and B in which we reverse the order of the ReLU and batch normalization operations; the resulting blocks are denoted as C and D. Table I summarizes the searchable hyper-parameters and their respective ranges. The widening factor scales the width of the residual block. We sample architectures with different depths by changing the number of main and residual blocks. The total size of the search space is approximately 73 billion architectures. The largest architecture contains 240 convolutions, starting from 128 output channels and multiplying that number by 4 every 16 blocks.
## B. Analog-accuracy Surrogate Model
- 1) Evaluation Criteria: To efficiently explore the search space, a search strategy requires evaluating the objectives of each sampled architecture. Training the sampled architectures is very time-consuming, especially when HWA retraining is performed, as noise injection and I/O quantization modeling greatly increase the computational complexity. Consequently, we build a surrogate model capable of estimating the objectives of each sampled architecture on IMC devices. To find architectures that maximize accuracy, stability, and resilience against IMC noise and drift characteristics, we have identified the following three objectives.
- a) The 1-day accuracy: is the primary objective that most NAS algorithms aim to maximize. It measures the performance of an architecture on a given dataset. When weights are encoded using IMC devices, the accuracy of the architecture can drop over time due to conductance drift. Therefore, we have selected the 1-day accuracy as a metric to measure the architecture's performance.
- b) The Accuracy Variation over One Month (AVM): is the difference between the 1-month and 1-sec accuracy. This objective is essential to measure the robustness over a fixed time duration. A 30-day period allows for a reasonable trade-off between capturing meaningful accuracy changes and avoiding short-term noise and fluctuations that may not reflect long-term trends.
- c) The 1-day accuracy standard deviation: measures the variation of the architecture's performance across experiments, as discussed in Section III-B. A lower standard deviation indicates that the architecture produces consistent results on hardware deployments, which is essential for real-world applications.
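Given per-trial accuracies at several time points after programming, the three objectives above can be computed as in the following sketch (the array contents are illustrative placeholders, not measured results):

```python
import numpy as np

# Simulated test accuracies (%) over 10 trials at three time points
# after device programming: 1 second, 1 day, and 1 month.
rng = np.random.default_rng(0)
acc_1sec   = 93.0 + rng.normal(0.0, 0.2, size=10)
acc_1day   = 92.0 + rng.normal(0.0, 0.3, size=10)
acc_1month = 91.0 + rng.normal(0.0, 0.4, size=10)

one_day_accuracy = acc_1day.mean()          # (a) primary objective
avm = acc_1month.mean() - acc_1sec.mean()   # (b) Accuracy Variation over one Month
one_day_std = acc_1day.std()                # (c) consistency across trials

assert avm < 0          # accuracy degrades as conductances drift
assert one_day_std < 1  # low spread means consistent hardware deployments
```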
TABLE I SEARCHABLE HYPER-PARAMETERS AND THEIR RESPECTIVE RANGES.
| Hyper-parameter | Definition | Range |
|-----------------|------------|-------|
| OC 0 | First layer's output channel | Discrete Uniform [8, 128] |
| KS 0 | First layer's kernel size | Discrete Uniform [3, 7] |
| M | Number of main blocks | Discrete Uniform [1, 5] |
| R* | Number of residual blocks per main block | Discrete Uniform [1, 16] |
| B* | Number of branches per main block | Discrete Uniform [1, 12] |
| CT* | Convolution block type per main block | Uniform Choice [A; B; C; D] |
| WF* | Widening factor per main block | Uniform [1, 4] |
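A uniform draw from the Table I ranges can be sketched as follows (the dictionary fields are illustrative, not AnalogNAS's actual encoding; parameters marked * are sampled once per main block):

```python
import random

CONV_TYPES = ["A", "B", "C", "D"]  # BottleNeck/Basic blocks and their
                                   # ReLU/BatchNorm-reversed variants

def sample_architecture(rng):
    """Draw one configuration from the Table I ranges."""
    m = rng.randint(1, 5)                      # number of main blocks (M)
    return {
        "out_channel_0": rng.randint(8, 128),  # OC_0
        "kernel_size_0": rng.randint(3, 7),    # KS_0
        "main_blocks": m,
        # Per-main-block hyper-parameters (R*, B*, CT*, WF*):
        "residual_blocks": [rng.randint(1, 16) for _ in range(m)],
        "branches":        [rng.randint(1, 12) for _ in range(m)],
        "conv_types":      [rng.choice(CONV_TYPES) for _ in range(m)],
        "widening":        [rng.uniform(1, 4) for _ in range(m)],
    }

arch = sample_architecture(random.Random(42))
assert 1 <= arch["main_blocks"] <= 5
assert all(1 <= r <= 16 for r in arch["residual_blocks"])
```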
To build the surrogate model, we follow two steps: Dataset creation and Model training:
- 2) Dataset Creation: The surrogate model predicts an architecture's rank based on the 1-day accuracy, and estimates the AVM and the 1-day accuracy standard deviation by minimizing the Mean Squared Error (MSE). Since the search space is large, care has to be taken when sampling the dataset of architectures that will be used to train the surrogate model.
The architectures of the search space are sampled using two methods: (i) Latin Hypercube Sampling (LHS) [31] and (ii) NAS with full training. A more detailed description of the AnalogNAS algorithm is presented in Section V. We use LHS to sample architectures distributed evenly over the search space. This ensures good overall coverage of different architectures and their accuracies. NAS with full training is performed using an evolutionary algorithm to collect high-performance architectures. This ensures good exploitation when reaching well-performing regions. In Fig. 4, we present a visualization of the search space coverage, which does not show any clustering of similarly performing architectures at the edge of the main cloud of points. Thus, it is not evident that architectures with similar performance are located close to each other in the search space. This suggests that traditional search methods that
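The LHS step can be sketched as a simple stratified draw over the numeric hyper-parameters (a minimal implementation for illustration; in practice a library routine such as SciPy's `LatinHypercube` sampler can be used, and continuous samples are rounded to the discrete grid of Table I):

```python
import numpy as np

def latin_hypercube(n_samples, bounds, rng=None):
    """Stratified sampling: each dimension is split into n_samples
    equal bins, and each bin is used exactly once per dimension."""
    rng = np.random.default_rng(rng)
    d = len(bounds)
    # For each dimension, a random permutation of bin indices,
    # jittered uniformly within each bin, then scaled to [0, 1).
    bins = rng.permuted(np.tile(np.arange(n_samples), (d, 1)), axis=1).T
    u = (bins + rng.uniform(size=(n_samples, d))) / n_samples
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    return lo + u * (hi - lo)

# Bounds for (OC_0, KS_0, M, R, B) from Table I.
bounds = [(8, 128), (3, 7), (1, 5), (1, 16), (1, 12)]
samples = latin_hypercube(16, bounds, rng=0)
assert samples.shape == (16, 5)
```

Because every bin along every dimension is hit exactly once, 16 samples already cover the full range of each hyper-parameter, which is what gives LHS its even coverage of the search space.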
Fig. 4. t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization of the sampled architectures for CIFAR-10.
<details>
<summary>Image 4 Details</summary>

Scatter plot of the sampled CIFAR-10 architectures projected onto the first two t-SNE components (both axes spanning roughly -100 to 100). Points are colored by architecture ranking on a red (0) to purple (1000) gradient, and the 400 held-out test points are shown as gray circles dispersed throughout the main cloud. Rankings are interspersed without a clear red-to-purple gradient across the components, i.e., architectures with similar performance are not grouped into distinct regions of the embedding.
</details>
Fig. 5. Surrogate models comparison.
<details>
<summary>Image 5 Details</summary>

Grouped bar chart comparing four surrogate models (XGBoost, LGBoost, MLP, MLP Ensemble) on three metrics: Kendall Tau ranking correlation (left y-axis, 0-1), MSE of the AVM prediction, and MSE of the 1-day standard deviation prediction (right y-axis, 0-15). Approximate values, for XGBoost / LGBoost / MLP / MLP Ensemble respectively: Kendall Tau 0.98 / 0.94 / 0.85 / 0.89; MSE AVM 0.35 / 0.60 / 1.00 / 0.85; MSE 1-day STD-DEV 0.12 / 0.15 / 0.35 / 0.20. XGBoost attains the highest correlation and the lowest errors on all three metrics.
</details>
rely on local optimization may not be effective in finding the best-performing architectures. Instead, population-based search strategies, which explore a diverse set of architectures, could be more effective. Our search strategy extracted 400 test points, and we found that these architectures were distributed throughout the main cloud, indicating that our dataset covers a diverse portion of the search space despite its limited size of only 1,000 architectures.
Each sampled architecture is trained using different levels of weight noise and HWA training hyper-parameters using the AIHWKIT [21]. Specifically, we modify the standard deviation of the added weight noise between [0.1, 5.0] in increments of 0.1. The tile size was assumed to be symmetric and varied in [256, 512], representing 256-by-256 and 512-by-512 arrays respectively. Training with different configurations allowed us to generalize the use of the surrogate model across a range of IMC hardware configurations, and to increase the size of the constructed dataset.
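As a plain-Python illustration of this sweep (with hypothetical dictionary keys, not the AIHWKIT configuration API), the noise levels and tile sizes can be enumerated as a configuration grid:

```python
from itertools import product

def build_hwa_config_grid():
    """Enumerate the hardware-aware training sweep described above:
    weight-noise standard deviations in [0.1, 5.0] with 0.1 increments,
    and symmetric tile sizes of 256 or 512 (i.e., 256x256 / 512x512)."""
    noise_stds = [round(0.1 * i, 1) for i in range(1, 51)]  # 0.1 .. 5.0
    tile_sizes = [256, 512]
    return [
        {"weight_noise_std": std, "tile_size": size}
        for std, size in product(noise_stds, tile_sizes)
    ]

configs = build_hwa_config_grid()
print(len(configs))  # 100 configurations (50 noise levels x 2 tile sizes)
```

Each entry of the grid would correspond to one HWA retraining run of a sampled architecture, enlarging the surrogate's training dataset accordingly.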
- 3) Model training: To train the surrogate model, we used a hinge pair-wise ranking loss [32] with margin m = 0.1. The hinge loss, defined in Eq. (2), allows the model to learn the relative ranking order of architectures rather than absolute accuracy values [32], [33].
$$\mathcal{L} = \sum_{(i,j)\,:\,y_i > y_j} \max\big(0,\; m - \left(P(a_i) - P(a_j)\right)\big) \qquad (2)$$
a j refers to the architecture indexed j , and y j to its corresponding 1-day accuracy. P ( a ) is the predicted score of architecture a ; during training, this score is optimized to correlate with the actual ranks of the architectures. Several algorithms were tested; after an empirical comparison, we adopted Kendall's Tau ranking correlation [34] as the direct criterion for evaluating ranking surrogate model performance. Fig. 5 shows the comparison of different ML algorithms used to predict the rankings and AVMs. Our dataset is tabular: it contains each architecture and its corresponding features. XGBoost outperforms the other surrogate models in predicting the architectures' ranking order, the AVM of each architecture, and the 1-day standard deviation.
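A minimal sketch of the pairwise hinge ranking loss, assuming the loss is averaged over all ordered pairs in which one architecture's 1-day accuracy exceeds the other's (the exact normalization in [32] may differ):

```python
def pairwise_hinge_ranking_loss(scores, accuracies, margin=0.1):
    """Average hinge loss over all ordered pairs (i, j) where architecture
    i has higher 1-day accuracy than j: max(0, m - (P(a_i) - P(a_j)))."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if accuracies[i] > accuracies[j]:
                total += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return total / max(pairs, 1)

# Well-separated scores in the correct order incur zero loss.
perfect = pairwise_hinge_ranking_loss([3.0, 2.0, 1.0], [0.9, 0.8, 0.7])
print(perfect)  # 0.0
```

Mis-ordered or insufficiently separated scores incur a positive penalty, which pushes the surrogate's outputs toward the correct relative ordering.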
Fig. 6. Overview of the AnalogNAS framework.
<details>
<summary>Image 6 Details</summary>

Two-stage flowchart of the AnalogNAS framework. The off-search stage samples architectures uniformly from the search space, trains them with noise under the given RPU hardware configurations, and constructs a dataset recording accuracy, standard deviation, and AVM, from which the surrogate model is trained. The search stage generates an initial population, scores it with the surrogate model, and checks each architecture's AVM against the threshold T AVM; the best architectures are selected, unioned with their mutations to form the next population, and the loop repeats until the iteration budget is exhausted, at which point the best architecture is returned.
</details>
## V. SEARCH STRATEGY
Fig. 6 depicts the overall search framework. Given a dataset and a hardware configuration readable by AIHWKIT, the framework starts by building the surrogate model presented in Section IV-B. Then, we use an optimized evolutionary search to efficiently explore the search space using the surrogate model. Similar to traditional evolutionary algorithms, we use real number encoding. Each architecture is encoded into a vector, and each element of the vector contains the value of the hyper-parameter, as listed in Table I.
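The encoding can be illustrated with a hypothetical fixed-length scheme that pads the per-main-block lists of Table I to the maximum of five main blocks and maps the convolution block types {A, B, C, D} to integers; the actual AnalogNAS encoding may differ in detail:

```python
MAX_BLOCKS = 5                               # maximum number of main blocks M
CT_CODES = {"A": 1, "B": 2, "C": 3, "D": 4}  # conv block types as integers

def encode(arch):
    """Encode a Table I architecture as a fixed-length integer vector,
    padding the per-main-block lists to MAX_BLOCKS with zeros."""
    def pad(values):
        values = list(values)
        return values + [0] * (MAX_BLOCKS - len(values))
    vec = [arch["OC0"], arch["KS0"], arch["M"]]
    vec += pad(arch["R"])                          # residual blocks per main block
    vec += pad(arch["B"])                          # branches per main block
    vec += pad(CT_CODES[ct] for ct in arch["CT"])  # conv block types
    vec += pad(arch["WF"])                         # widening factors
    return vec

resnet32 = {"OC0": 64, "KS0": 7, "M": 3, "R": (5, 5, 5), "B": (1, 1, 1),
            "CT": ("B", "B", "B"), "WF": (1, 1, 1)}
print(encode(resnet32))  # 23-element vector
```

A fixed-length vector of this kind is what evolutionary operators (selection, mutation) and the surrogate model can consume directly.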
## A. Problem Formulation
Given the search space S , our goal is to find an architecture α , that maximizes the 1-day accuracy while minimizing the 1-day standard deviation, subject to constraints on the number of parameters and the AVM. The number of parameters is an important metric in IMC, because it directly impacts the amount of on-chip memory required to store the weights of a DNN. Eq. (3) formally describes the optimization problem as follows:
$$\max_{\alpha \in S} \;\big(\mathrm{ACC}(\alpha),\; -\sigma(\alpha)\big) \quad \text{s.t.} \quad \psi(\alpha) \leq T_p, \;\; \mathrm{AVM}(\alpha) \leq T_{AVM} \qquad (3)$$
ACC refers to the 1-day accuracy objective, σ denotes the 1-day accuracy's standard deviation, and ψ is the number of parameters. T p and T AVM are user-defined thresholds that correspond to the maximum number of parameters and AVM, respectively.
## B. Search Algorithm
Our evolutionary search algorithm, AnalogNAS, is formally defined in Algorithm 1. AnalogNAS finds the most accurate and robust neural network architecture
## Algorithm 1 AnalogNAS algorithm.
```
Input:  Search space S; RPU configuration rpu_config; target task task;
        population size population_size; AVM threshold T_AVM;
        parameter threshold T_p; number of iterations N; time budget
Output: Most efficient architecture for rpu_config in S

Begin
    D = sample(S, dataset_size)
    HW_Train(D, task)
    AVMs = compute_AVM(D)
    surrogate_model = XGBoost_train(surrogate_model, D, AVMs)
    repeat
        population = LHS(population_size, T_p)
        AVMs, ranks = surrogate_model(population)
    until AVMs > T_AVM
    while i < N or time < time_budget do
        top_50 = select(population, ranks)
        mutated = mutation(top_50, T_p)
        population = top_50 ∪ mutated
        AVMs, ranks = surrogate_model(population)
    end while
    return top_1(population, ranks)
End
```
for a given analog IMC configuration and task. The algorithm begins by generating a dataset of neural network architectures, which are trained on the task and evaluated using AIHWKIT. A surrogate model is then created to predict the efficiency of new architectures. The algorithm then generates a population of architectures using an LHS technique and selects the topperforming architectures to be mutated and generate a new population. The process is repeated until a stopping criterion is met, such as a maximum number of iterations or a time budget. Finally, the most robust architecture is returned. In the following, we detail how the population initialization, fitness evaluation, and mutations are achieved.
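The main loop of Algorithm 1 can be sketched as follows, with a toy surrogate and mutation operator standing in for the trained XGBoost model and the operators detailed below:

```python
import random

def evolutionary_search(init_population, surrogate, mutate, iterations=200):
    """Skeleton of the Algorithm 1 loop: rank the population with the
    surrogate, keep the top 50%, and union it with its mutations."""
    population = list(init_population)
    for _ in range(iterations):
        ranked = sorted(population, key=surrogate, reverse=True)
        top_half = ranked[: max(1, len(ranked) // 2)]
        mutated = [mutate(arch) for arch in top_half]
        population = top_half + mutated  # top 50% ∪ mutations
    return max(population, key=surrogate)

# Toy usage: maximize the sum of a 3-element integer vector in [0, 9].
random.seed(0)
pop = [[random.randint(0, 9) for _ in range(3)] for _ in range(8)]

def toy_mutate(arch):
    child = arch[:]
    i = random.randrange(3)
    child[i] = min(9, child[i] + 1)  # nudge one coordinate upward
    return child

best = evolutionary_search(pop, sum, toy_mutate, iterations=50)
```

Because the top half always survives into the next population, the best individual found so far is never lost between iterations.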
- 1) Population Initialization: The search starts by generating an initial population. Using the LHS algorithm, we sample the population uniformly from the search space. LHS ensures that the initial population contains architectures with different architectural features. LHS is made faster with parallelization by dividing the sampling into multiple independent subsets, which can be generated in parallel using multiple threads.
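A minimal LHS sketch over three of the Table I ranges, stratifying each dimension into one interval per sample and shuffling the strata independently per dimension (a library implementation such as scipy's `LatinHypercube` would typically be used instead):

```python
import random

def latin_hypercube(n_samples, bounds, rng=None):
    """Draw n_samples points, one per stratum and per dimension, then map
    the unit-interval samples onto the discrete [lo, hi] ranges."""
    rng = rng or random.Random(42)
    columns = []
    for _ in bounds:
        strata = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(strata)  # decorrelate strata across dimensions
        columns.append(strata)
    return [[lo + int(columns[d][i] * (hi - lo + 1))
             for d, (lo, hi) in enumerate(bounds)]
            for i in range(n_samples)]

# First-layer output channels, kernel size, and number of main blocks (Table I).
population = latin_hypercube(10, [(8, 128), (3, 7), (1, 5)])
```

Since each dimension's columns are generated independently, the per-dimension work parallelizes naturally across threads, as noted above.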
- 2) Fitness Evaluation: We evaluate the population using the aforementioned analog-accuracy surrogate model. In addition to the rankings, the surrogate model predicts the AVM of each architecture. As previously described, the AVM is used to gauge the robustness of a given network. If the AVM is below a defined threshold, T AVM, the architecture is replaced by a randomly sampled architecture. The new architecture is constrained to be sampled from the same hypercube dimension as the previous one. This ensures efficient exploration.
- 3) Selection and Mutation: We select the top 50% architectures from the population using the predicted rankings. These architectures are mutated. The mutation functions are classified
as follows:
- a) Depth-related mutations: modify the depth of the architectures. Mutations include adding or removing a main block (increasing or decreasing M ), adding or removing a residual block R , or modifying the type of convolution block, i.e., { A, B, C, D } , for each main block.
- b) Width-related mutations: modify the width of the architectures. Mutations include modifying the widening factor WF of a main block, adding or removing a branch B , or modifying the output channel size of the first convolution, OC .
- c) Other mutations: modify the kernel size of the first convolution, KS , and/or add skip connections, denoted using ST .
Depth- and width-related mutations are applied with the same probability of 80%. The other mutations are applied with a 50% probability. In each class, the same probability is given to each mutation. The top 50% architectures in addition to the mutated architectures constitute the new population. For the remaining iterations, we verify the ranking correlation of the surrogate model. If the surrogate model's ranking correlation is degraded, we fine-tune the surrogate model with the population's architectures. The degradation is computed every 100 iterations. The surrogate model is tested on the population architectures after training them. It is fine-tuned if Kendall's tau correlation drops below 0.9.
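Under illustrative field names and simplified operators (e.g., the depth mutation here only adds main blocks or swaps a block type, and not every mutation listed above is implemented), the mutation classes can be sketched as a single function:

```python
import random

def mutate(arch, rng=None):
    """Apply the three mutation classes to a Table I-style architecture:
    depth- and width-related mutations with probability 0.8, others 0.5."""
    rng = rng or random.Random(0)
    a = {k: (list(v) if isinstance(v, (list, tuple)) else v)
         for k, v in arch.items()}                     # deep-enough copy
    if rng.random() < 0.8:                             # depth-related
        if rng.random() < 0.5 and a["M"] < 5:          # add a main block
            a["M"] += 1
            for key, default in (("R", 1), ("B", 1), ("CT", "A"), ("WF", 1)):
                a[key].append(default)
        else:                                          # change a conv block type
            i = rng.randrange(a["M"])
            a["CT"][i] = rng.choice("ABCD")
    if rng.random() < 0.8:                             # width-related
        a["WF"][rng.randrange(a["M"])] = rng.randint(1, 4)
    if rng.random() < 0.5:                             # other mutations
        a["KS0"] = rng.choice([3, 5, 7])
    return a

resnet32 = {"OC0": 64, "KS0": 7, "M": 3, "R": (5, 5, 5), "B": (1, 1, 1),
            "CT": ("B", "B", "B"), "WF": (1, 1, 1)}
child = mutate(resnet32)
```

Each mutation returns a new architecture and leaves the parent intact, so the top 50% and their mutated children can coexist in the next population.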
## VI. EXPERIMENTS
This section describes the experiments used to evaluate AnalogNAS on three tasks: CIFAR-10 image classification, VWW, and KWS. The AIHWKIT was used to perform hardware simulations.
## A. Experimental Setup
- 1) Training Details: We detail the hyper-parameters used to train the surrogate model and different architectures on CIFAR-10, VWW, and KWS tasks.
- a) Surrogate model training: For each task, we trained a surrogate model on a dataset of HWA-trained DNN architectures. The dataset sizes were 1,200, 600, and 1,500, respectively. An additional 500 architectures were collected during the search trials for validation. All architectures were first trained without noise injection (i.e., using vanilla training routines), and then converted to AIHWKIT models for HWA retraining. The surrogate model architecture used was XGBoost. For VWW and KWS, the surrogate model was fine-tuned from the image classification XGBoost model.
- b) Image classification training: We first trained the network architectures using the CIFAR-10 dataset [35], which contains 50,000 training and 10,000 test samples, evenly distributed across 10 classes. We augmented the training images with random crops and cutouts only. For training, we used Stochastic Gradient Descent (SGD) with a learning rate of 0.05 and a momentum of 0.9 with a weight decay of 5e-4. The learning rate was adjusted using a cosine annealing learning rate scheduler with a starting value of 0.05 and a maximum number of 400 iterations.
- c) Visual Wake Words (VWW) training: We first trained the network architectures using the VWW dataset [36], which contains 82,783 training and 40,504 test images. Images are labeled 1 when a person is detected, and 0 when no person is present. The image pre-processing pipeline included horizontal and vertical flipping, scale augmentation [37], and random Red Green Blue (RGB) color shift. To train the architectures, we used the RMSProp optimizer [38] with a momentum of 0.9, a learning rate of 0.01, a batch normalization momentum of 0.99, and an l 2 weight decay of 1e-5.
- d) Keyword Spotting (KWS) training: We first trained the network architectures using the KWS dataset [39], which contains 1-second long incoming audio clips. These are classified into one of twelve keyword classes, including 'silence' and 'unknown' keywords. The dataset contains 85,511 training, 10,102 validation, and 4,890 test samples. The input was transformed to 49 × 10 × 1 features from the Mel-frequency cepstral coefficients [40]. The data pre-processing pipeline included applying background noise and random timing jitter. To train the architectures, we used the Adam optimizer [41] with a decay of 0.9, a learning rate of 3e-05, and a linear learning rate scheduler with a warm-up ratio of 0.1.
- 2) Search Algorithm: The search algorithm was run five times to compute the variance. The evolutionary search was executed with a population size of 200. If not explicitly mentioned, the AVM threshold was set to 10%. The width and depth mutation probability was set to 0.8. The other mutations' probability was set to 0.5. The total number of iterations was 200. After the search, the obtained architecture for each task was compared to SOTA baselines for comparison.
## B. Results
The final architecture compositions for the three tasks are listed in Table II. In addition, Fig. 10 highlights the architectural differences between AnalogNAS T500 and Resnet32. We modified T p to find smaller architectures. To denote the optimal architecture for different parameter thresholds, we use T X , where X represents the threshold T p in thousands of parameters (e.g., T100 refers to the architecture found with a threshold of 100K parameters). When searching for T200 and T100, the probability of increasing the widening factor or depth to their highest values was reduced to 0.2.
In Fig. 7, the simulated hardware comparison of the three tasks is depicted. Our models outperform SOTA architectures with respect to both accuracy and resilience to drift. On CIFAR-10, after training the surrogate model, the search took 17 minutes to run. We categorize the results based on the number of parameters threshold into two distinct groups. We consider edge models with a number of parameters below 1M and above 400k. Below 400K, architectures are suitable for TinyML deployment. The final architecture, T500, is smaller than Resnet32, and achieved +1.86% better accuracy and a drop of 1.8% after a month of inference, compared to 5.04%. This model is ∼ 86 × smaller than Wide Resnet [9], which has 36.5M parameters. Our smallest model, T100,
Fig. 7. Simulated hardware comparison results on three benchmarks: (a,b) CIFAR-10, (c)VWW, and (d) KWS. The size of the marker represents the size (i.e., the number of parameters) of each model. The shaded area corresponds to the standard deviation at that time.
<details>
<summary>Image 7 Details</summary>

Four line plots of test accuracy (%) versus drift time (floating point, 1 s, 1 h, 1 day, 10 days, 30 days), with shaded bands indicating the standard deviation. (a) CIFAR-10 edge models: Resnet32 (464K), AnalogNAS_T500 (422K), AnalogNAS_T1M (860K), Resnext29 (25M), and Wide Resnet (26.5M). (b) CIFAR-10 tiny models: Resnet_V1 (74K), AnalogNAS_T100 (91K), AnalogNAS_T300 (240K), and Resnet20 (270K). (c) VWW: Micro_Nets (130K), Analog_Net (221K), AnalogNAS_T200 (188K), AnalogNAS_T400 (364K), MCU_Net (1.2M), and Mobilenet_V1 (4.2M). (d) KWS: Micro_Nets (130K), Analog_Net (179K), AnalogNAS_T200 (188K), AnalogNAS_T500 (456K), and DSCNN (1.1M). In all four panels, the AnalogNAS models retain accuracy over time better than the baselines, whose accuracy and confidence intervals degrade more steeply as drift accumulates.
</details>
TABLE II FINAL ARCHITECTURES FOR CIFAR-10, VWW, AND KWS. OTHER NETWORKS FOR VWW AND KWS ARE NOT LISTED, AS THEY CANNOT EASILY BE REPRESENTED USING OUR MACRO-ARCHITECTURE.
| Network | OC 0 | KS 0 | M | R* | B* | CT* | WF* |
|---------|------|------|---|----|----|-----|-----|
| **CIFAR-10** | | | | | | | |
| Resnet32 | 64 | 7 | 3 | (5, 5, 5) | (1, 1, 1) | (B, B, B) | (1, 1, 1) |
| AnalogNAS T100 | 32 | 3 | 1 | (2, ) | (1, ) | (C, ) | (2, ) |
| AnalogNAS T300 | 32 | 3 | 2 | (3, 3) | (1, 1) | (A, B) | (2, 1) |
| AnalogNAS T500 | 64 | 5 | 1 | (3, ) | (3, ) | (A, ) | (2, ) |
| AnalogNAS T1M | 32 | 5 | 2 | (3, 3) | (2, 2) | (A, A) | (3, 3) |
| **VWW** | | | | | | | |
| AnalogNAS T200 | 24 | 3 | 3 | (2, 2, 2) | (1, 2, 1) | (B, A, A) | (2, 2, 2) |
| AnalogNAS T400 | 68 | 3 | 2 | (3, 5) | (2, 1) | (C, C) | (3, 2) |
| **KWS** | | | | | | | |
| AnalogNAS T200 | 80 | 1 | 1 | (1, ) | (2, ) | (C, ) | (4, ) |
| AnalogNAS T400 | 68 | 1 | 2 | (2, 1) | (1, 2) | (B, B) | (3, 3) |
was 1.23× bigger than Resnet-V1, the SOTA model benchmarked by MLPerf [11]. Despite not containing any depth-wise convolutions, Resnet-V1 is extremely small, with only 70K parameters. Our model offers a +7.98% accuracy increase, with a 5.14% drop after a month of drift compared to a 10.1% drop for Resnet-V1. Moreover, our largest model, AnalogNAS T1M, outperforms Wide Resnet by +0.86% in 1-day accuracy, with a drop of only 1.16% compared to 6.33%. In addition, the found models exhibit greater consistency across experiment trials, with an average standard deviation of 0.43 over multiple drift times, as opposed to 0.97 for SOTA models.
Similar conclusions can be drawn for VWW and KWS. For VWW, current baselines use depth-wise separable convolutions that incur a high accuracy drop on analog devices. Compared to AnalogNet-VWW and Micronets-VWW, the current SOTA networks for VWW on analog and edge devices, our T200 model has a similar number of parameters (1.23× smaller) with +2.44% and +5.1% 1-day accuracy increases, respectively. AnalogNAS was able to find more robust and consistent networks, with an average AVM of 2.63% and a standard deviation of 0.24. MCUNet [42] and MobileNet-V1 present the highest AVM, due to their sole use of depth-wise separable convolutions.
On KWS, the baseline architectures, including DSCNN [43], are hybrid networks containing recurrent cells and convolutions. The recurrent part of the model ensures high robustness to noise. While current models are already robust, with an average accuracy drop of 4.72%, our model outperforms tiny SOTA models with 96.8% accuracy and a drop of only 2.3% after a month of drift. Critically, our AnalogNAS models exhibit greater consistency across experiment trials, with an average standard deviation of 0.17 over multiple drift times, as opposed to 0.36 for SOTA models.
## C. Comparison with HW-NAS
In accordance with commonly accepted NAS methodologies, we conducted a comparative analysis of our search approach against Random Search. Results, presented in Fig. 8, were obtained across five experiment instances. Our findings indicate that Random Search was unable to match the 1-day accuracy of our final models, even after four hours of search using the same surrogate model. We further conducted an ablation study to evaluate the effectiveness of our approach by analyzing the impact of the LHS algorithm and the surrogate model. The use of a
Fig. 8. Ablation study comparison against HW-NAS. Mean and standard deviation values are reported across five experiment instances (trials).
<details>
<summary>Image 8 Details</summary>

Line chart of 1-day accuracy (%) versus search time on a logarithmic axis (10² to 10⁴ s) for six configurations: AnalogNAS w/ surrogate and LHS, RS (Random Search) w/ surrogate, AnalogNAS w/o surrogate, RS w/o surrogate, AnalogNAS w/o LHS, and FLASH [15]. AnalogNAS with both the surrogate model and LHS reaches the highest accuracy (roughly 95%) within about 10³ s; removing either the surrogate model or LHS lowers the attainable accuracy, RS without the surrogate stays flat near 60%, and FLASH climbs only slowly, from roughly 70% to 75%. All curves plateau at longer search times, indicating diminishing returns.
</details>
TABLE III AVM THRESHOLD VARIATION RESULTS ON CIFAR-10.
| T_AVM (%) | 1.0 | 3.0 | 5.0* |
|-------------------|-------|--------|--------|
| 1-day Accuracy | 88.7% | 93.71% | 93.71% |
| AVM | 0.85% | 1.8% | 1.8% |
| Search Time (min) | 34.65 | 28.12 | 17.65 |
*Overall results computed with T_AVM = 5.0%.
random sampling strategy in place of LHS, combined with exclusion of the surrogate model, resulted in a significant increase in search time. The LHS algorithm helped the search start from a diverse initial population, improving exploration efficiency, while the surrogate model played a crucial role in keeping search times practical.
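The role of LHS in seeding the initial population can be illustrated with a minimal sketch. The three-parameter encoding below (`depth`, `widen`, `branches`) is hypothetical and much simpler than the actual AnalogNAS search space:

```python
import random

def latin_hypercube_sample(n_samples, param_ranges, seed=0):
    """Draw n_samples points so that, for every parameter, each of the
    n_samples equal-width strata of its range is hit exactly once."""
    rng = random.Random(seed)
    samples = [{} for _ in range(n_samples)]
    for name, (low, high) in param_ranges.items():
        # Shuffle the stratum indices so each sample lands in a
        # different stratum of this parameter's range.
        strata = list(range(n_samples))
        rng.shuffle(strata)
        step = (high - low) / n_samples
        for i, s in enumerate(strata):
            samples[i][name] = low + (s + rng.random()) * step
    return samples

# Hypothetical architecture parameters for an 8-individual population.
population = latin_hypercube_sample(
    8, {"depth": (8, 32), "widen": (1.0, 4.0), "branches": (1, 3)})
```

Compared with plain uniform sampling, this stratification guarantees that the initial population covers the full range of every parameter, which is what drives the improved exploration observed in the ablation.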
Moreover, AnalogNAS surpasses both FLASH [15] and µ-nas [14] in performance and search time. FLASH's search strategy is not adequate for large search spaces such as ours. As for µ-nas, although it achieves acceptable results, its complex optimization algorithm slows the search process, reducing efficiency.
## D. Search Time and Accuracy Variation over One Month (AVM) Threshold Trade-Off
During the search, we culled architectures based on their predicted AVM: any architecture whose AVM exceeded the threshold T_AVM was discarded. As listed in Table III, we varied this threshold to investigate the trade-off between T_AVM and search time. As T_AVM is decreased, the gap between the achieved AVM and T_AVM narrows significantly. The correlation between search time and T_AVM is observed to be non-linear.
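The culling step amounts to a simple filter over candidates. In this sketch the surrogate is a stand-in lambda (the real predictor is the trained surrogate model, and the architecture encoding is hypothetical):

```python
def cull_by_avm(candidates, predict_avm, t_avm=5.0):
    """Keep only architectures whose predicted one-month accuracy
    variation (AVM, in %) is at most the threshold t_avm."""
    return [arch for arch in candidates if predict_avm(arch) <= t_avm]

# Stand-in surrogate: pretend deeper networks drift more.
candidates = [{"depth": d} for d in (10, 16, 24, 40)]
survivors = cull_by_avm(candidates, lambda a: a["depth"] / 8.0, t_avm=3.0)
# predicted AVMs 1.25, 2.0, 3.0 are kept; 5.0 exceeds the threshold
```

Lowering `t_avm` culls more aggressively, which shortens per-candidate evaluation but, as Table III shows, does not shorten the overall search monotonically.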
## VII. EXPERIMENTAL HARDWARE VALIDATION AND ARCHITECTURE PERFORMANCE SIMULATIONS
## A. Experimental Hardware Validation
An experimental hardware accuracy validation study was performed using a 64-core IMC chip based on PCM [44]. Each core comprises a crossbar array of 256×256 PCM-based unit-cells along with a local digital processing unit [45]. This study verifies whether the simulated network accuracy values and rankings are representative of those obtained when the networks are deployed on real physical hardware. We deployed two networks for the CIFAR-10 image classification task on hardware: the AnalogNAS T500 and baseline ResNet32 [30] networks from Fig. 7(a).
To implement the aforementioned models on hardware, a number of steps were carried out after HWA training. First, the unit weights of linear (dense) and unrolled convolutional layers were exported from the AIHWKIT to a state dictionary file, which was used to map network parameters to the corresponding network layers. Additionally, the computational inference graph of each network was exported. These files were used to generate proprietary data-flows to be executed in-memory. As only hardware accuracy validation was being performed, all operations other than MVMs were performed on a host machine connected to the chip
TABLE IV EXPERIMENTAL HARDWARE ACCURACY VALIDATION AND SIMULATED POWER PERFORMANCE ON THE IMC SYSTEM IN [46].
| Architecture | ResNet32 | AnalogNAS T500 |
|------------------------------------------|----------|----------------|
| **Hardware Experiments** | | |
| FP Accuracy | 94.34% | 94.54% |
| Hardware accuracy* | 89.55% | 92.07% |
| **Simulated Hardware Power Performance** | | |
| Total weights | 464,432 | 416,960 |
| Total tiles | 43 | 27 |
| Network Depth | 32 | 17 |
| Execution Time (msec) | 0.434 | 0.108 |
| Inferences/s/W | 43,956.7 | 54,502 |
*The mean accuracy is reported across five experiment repetitions.
through a Field-Programmable Gate Array (FPGA). The measured hardware accuracy was 92.07% for T500 and 89.55% for ResNet32, as reported in Table IV. Hence, the T500 network performs significantly better than ResNet32 when implemented on real hardware as well. This further validates that our proposed AnalogNAS approach is able to find networks with a similar number of parameters that are more accurate and robust on analog IMC hardware.
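The weight-export step described above can be sketched in plain PyTorch. The helpers `unroll_conv_weight` and `export_state_dict` are hypothetical illustrations of the idea (a convolution kernel is flattened into the 2-D matrix used for crossbar MVM mapping); the actual data-flow generation is proprietary:

```python
import torch
import torch.nn as nn

def unroll_conv_weight(conv: nn.Conv2d) -> torch.Tensor:
    """Flatten a Conv2d kernel (C_out, C_in, kH, kW) into the 2-D
    matrix (C_in*kH*kW, C_out) used for crossbar MVM mapping."""
    w = conv.weight.detach()
    return w.reshape(w.shape[0], -1).t().contiguous()

def export_state_dict(model: nn.Module, path: str) -> dict:
    """Collect unrolled convolutional and dense weights, keyed by
    layer name, and save them to a state dictionary file."""
    state = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            state[name] = unroll_conv_weight(module)
        elif isinstance(module, nn.Linear):
            state[name] = module.weight.detach().t().contiguous()
    torch.save(state, path)
    return state
```

For example, a `Conv2d(3, 16, 3)` layer unrolls to a 27×16 matrix, i.e., 27 crossbar rows (3 input channels × 3 × 3 kernel) feeding 16 output columns.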
## B. Simulated Hardware Energy and Latency
We conducted power performance simulations for the AnalogNAS T500 and ResNet32 models using a 2D-mesh based heterogeneous analog IMC system with the simulation tool presented in [46]. The simulated IMC system consists of one analog fabric with 48 analog tiles of size 512×512, on-chip digital processing units, and digital memory for activation orchestration between CNN layers. Unlike the accuracy validation experiments on the 64-core IMC chip, the simulated power performance assumes all intermediate operations to be mapped and executed on-chip. Our results, provided in Table IV, show that AnalogNAS T500 outperformed ResNet32 in terms of both execution time and energy efficiency.
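The relative gains reported in Table IV can be checked directly (values copied from the table):

```python
# Execution time and efficiency from Table IV (simulated on the system of [46]).
t_resnet32, t_t500 = 0.434, 0.108          # msec per inference
eff_resnet32, eff_t500 = 43_956.7, 54_502  # inferences/s/W

speedup = t_resnet32 / t_t500              # ~4.0x faster execution
efficiency_gain = eff_t500 / eff_resnet32  # ~1.24x higher energy efficiency
```

These ratios are the source of the ">4× execution time reduction" and ">1.2× energy efficiency" figures quoted in the conclusion.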
We believe this power-performance benefit is realized because, in analog IMC hardware, wider layers can be computed in parallel, leveraging the O(1) MVM latency of analog tiles, and are therefore preferred over deeper layers. It is noted that both networks exhibit poor tile utilization, and that the tile utilization and efficiency of these networks could be further improved by incorporating these metrics as explicit constraints. This is left to future work and is beyond the scope of AnalogNAS.
## VIII. DISCUSSION
During the search, we analyzed architecture characteristics to study which types of architectures perform best on IMC inference processors. The favored architectures combine robustness to noise with high accuracy. Fig. 9 shows the evolution of the average depth, the average widening factor, the average number of branches, and the average output channel size of the first convolution in the search population, sampled every 20 iterations. The depth represents the number of convolutions.
Fig. 9. Evolution of architecture characteristics in the population during the search for CIFAR-10. Random individual networks are shown.
Fig. 10. Architectural differences between AnalogNAS T500 and ResNet32.
A sampled architecture has a widening factor per block. To compute the average widening factor, we first computed the per-architecture average by dividing the sum of the widening factors by the number of blocks in the architecture; we then averaged across all architectures. Similar computations were performed for the average number of branches.
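The averaging described above is a mean of per-architecture means, sketched here with a hypothetical per-block encoding:

```python
def mean_widening_factor(architectures):
    """Average widening factor: first averaged per architecture
    (over its blocks), then averaged across all architectures."""
    per_arch = [
        sum(block["widen"] for block in arch) / len(arch)
        for arch in architectures
    ]
    return sum(per_arch) / len(per_arch)

# Two hypothetical architectures with per-block widening factors.
pop = [
    [{"widen": 1.0}, {"widen": 3.0}],                   # per-arch mean 2.0
    [{"widen": 2.0}, {"widen": 2.0}, {"widen": 5.0}],   # per-arch mean 3.0
]
# mean_widening_factor(pop) -> 2.5
```

Averaging per architecture first keeps each individual's contribution equal regardless of how many blocks it contains; a flat mean over all blocks would overweight deep architectures.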
For each plot, the search was run five times and each point represents the mean; the plotted error corresponds to one standard deviation from that mean. Starting from a random population obtained using LHS, the population evolves through different width- and depth-related mutations. This analysis addresses the following questions: (i) does the search favor wide or deep networks, and are wider architectures therefore more noise resilient? (ii) which architectures does the search exploit for different tasks when the number of parameters is constrained?
## A. Are Wider or Deeper Networks More Robust to PCM Device Drift?
From Fig. 9, it can be observed that the depth of all networks decreases during the search. This trend is especially pronounced when we constrain the model's size to 100K and 500K parameters. During the search, the widening factor also increases, allowing the blocks to have wider convolutions. The number of branches is highly dependent on T_p and averages between 1 and 2. The branches are the parallel convolutions in a block, disregarding the skip connection. In the literature, architectures that support a higher number of branches, such as ResNeXt [8], have around 30M parameters. It is still interesting to obtain blocks with two branches, which also widens the network by increasing the number of features extracted within the same block. The average output channel size of the first convolution decreases during the search. Its final value is around the same number of output channels as standard architectures, i.e., between 16 and 32. This follows the general trend of having wider convolutions in deeper network positions.
## B. Types Of Architectures
The architectures and parameter constraints differ for each task, but all exhibit an increasing expansion ratio in the convolution blocks. This allows the convolutions to utilize the tile effectively and to mitigate noise from unused cells in the crossbar. For CIFAR-10, architectures behave like Wide ResNet [9] while respecting the parameter-count constraint. For the VWW task, the architectures are deeper: the input resolution is 224×224, which requires more feature extraction blocks. However, they are still smaller than SOTA architectures, with a maximum depth of 22. As depth is essential to obtain high accuracy on VWW, no additional branches are added. For the KWS task, the architectures are the widest possible, maximizing the tile utilization of each convolutional layer.
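The tile-utilization argument can be made concrete. Under the simple mapping assumed here (a k×k convolution unrolled onto square crossbar tiles, with C_in·k·k rows and C_out columns; the actual mapping in the simulator of [46] may differ), utilization is the fraction of allocated cells that hold weights:

```python
import math

def tile_utilization(c_in, c_out, k, tile=512):
    """Fraction of allocated crossbar cells actually holding weights
    when a k x k convolution is unrolled onto square tiles."""
    rows, cols = c_in * k * k, c_out
    n_tiles = math.ceil(rows / tile) * math.ceil(cols / tile)
    return (rows * cols) / (n_tiles * tile * tile)

# A narrow layer wastes most of a 512x512 tile; a wider one fills it better.
narrow = tile_utilization(16, 16, 3)   # occupies 144 x 16 of one tile
wide = tile_utilization(56, 128, 3)    # occupies 504 x 128 of one tile
```

Since a layer that fills only a corner of a tile still consumes the whole tile, the search's preference for wider convolutions directly improves utilization and reduces noise contributions from unused cells.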
## IX. CONCLUSION
In this paper, we proposed AnalogNAS, an efficient NAS methodology dedicated to analog in-memory computing for TinyML tasks. The obtained models are accurate, noise- and drift-resilient, and small enough to run on resource-constrained devices. Experimental results demonstrate that our method outperforms SOTA models on analog hardware for three tasks of the MLPerf Tiny benchmark: image classification on CIFAR-10, VWW, and KWS. Our AnalogNAS T500 model implemented on physical hardware demonstrates >2% higher accuracy experimentally on the CIFAR-10 benchmark than ResNet32. Calculated speed and energy efficiency estimates reveal a >4× reduction in execution time, in addition to >1.2× higher energy efficiency, for AnalogNAS T500 compared with ResNet32 when evaluated using a system-level simulator. While our paper has focused on a ResNet-like search space, our search strategy is adaptable and can be extended in future work to explore a broader range of architectures.
## ACKNOWLEDGEMENT
We would like to thank Thanos Vasilopoulos and Julian Büchel from IBM Research, Zurich lab, for their help with the development of the hardware-software infrastructure. We also thank the IBM Research Analog AI team for their feedback. We thank the computational support from AiMOS, an AI supercomputer made available by the IBM Research AI Hardware Center and Rensselaer Polytechnic Institute's Center for Computational Innovations (CCI).
## REFERENCES
- [1] A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, and E. Eleftheriou, 'Memory devices and applications for in-memory computing,' Nature Nanotechnology , vol. 15, no. 7, pp. 529-544, Jul 2020.
- [2] V. Joshi, M. L. Gallo, I. Boybat, S. Haefeli, C. Piveteau, M. Dazzi, B. Rajendran, A. Sebastian, and E. Eleftheriou, 'Accurate deep neural network inference using computational phase-change memory,' CoRR , vol. abs/1906.03138, 2019.
- [3] C. Lammie, W. Xiang, B. Linares-Barranco, and M. Rahimi Azghadi, 'Memtorch: An open-source simulation framework for memristive deep learning systems,' Neurocomputing , vol. 485, pp. 124-133, 2022.
- [4] I. Boybat, B. Kersting, S. G. Sarwat, X. Timoneda, R. L. Bruce, M. BrightSky, M. L. Gallo, and A. Sebastian, 'Temperature sensitivity of analog in-memory computing using phase-change memory,' in 2021 IEEE International Electron Devices Meeting (IEDM) , 2021, pp. 28.3.1-28.3.4.
- [5] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, 'Mobilenets: Efficient convolutional neural networks for mobile vision applications,' CoRR , vol. abs/1704.04861, 2017.
- [6] G. Lu, W. Zhang, and Z. Wang, 'Optimizing depthwise separable convolution operations on gpus,' IEEE Transactions on Parallel and Distributed Systems , vol. 33, no. 1, pp. 70-87, 2022.
- [7] H. Benmeziane, K. E. Maghraoui, H. Ouarnoughi, S. Niar, M. Wistuba, and N. Wang, 'A comprehensive survey on hardware-aware neural architecture search,' CoRR , vol. abs/2101.09336, 2021.
- [8] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, 'Aggregated residual transformations for deep neural networks,' in IEEE Conference on Computer Vision and Pattern Recognition, CVPR . IEEE Computer Society, 2017, pp. 5987-5995.
- [9] S. Zagoruyko and N. Komodakis, 'Wide residual networks,' in Proceedings of the British Machine Vision Conference, BMVC , R. C. Wilson, E. R. Hancock, and W. A. P. Smith, Eds. BMVA Press, 2016.
- [10] L. Sekanina, 'Neural architecture search and hardware accelerator co-search: A survey,' IEEE Access , vol. 9, pp. 151 337-151 362, 2021.
- [11] C. R. Banbury, V. J. Reddi, P. Torelli, N. Jeffries, C. Király, J. Holleman, P. Montino, D. Kanter, P. Warden, D. Pau, U. Thakker, A. Torrini, J. Cordaro, G. D. Guglielmo, J. M. Duarte, H. Tran, N. Tran, W. Niu, and X. Xu, 'Mlperf tiny benchmark,' in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual , J. Vanschoren and S. Yeung, Eds., 2021.
- [12] C. R. Banbury, C. Zhou, I. Fedorov, R. M. Navarro, U. Thakker, D. Gope, V. J. Reddi, M. Mattina, and P. N. Whatmough, 'Micronets: Neural network architectures for deploying tinyml applications on commodity microcontrollers,' in Proceedings of Machine Learning and Systems , A. Smola, A. Dimakis, and I. Stoica, Eds. mlsys.org, 2021.
- [13] H. Liu, K. Simonyan, and Y. Yang, 'DARTS: differentiable architecture search,' CoRR , vol. abs/1806.09055, 2018. [Online]. Available: http://arxiv.org/abs/1806.09055
- [14] E. Liberis, L. Dudziak, and N. D. Lane, 'µnas: Constrained neural architecture search for microcontrollers,' in EuroMLSys@EuroSys 2021, Proceedings of the 1st Workshop on Machine Learning and Systems, Virtual Event, Edinburgh, Scotland, UK, 26 April, 2021 , E. Yoneki and P. Patras, Eds. ACM, 2021, pp. 70-79.
- [15] G. Li, S. K. Mandal, Ü. Y. Ogras, and R. Marculescu, 'FLASH: fast neural architecture search with hardware optimization,' ACM Trans. Embed. Comput. Syst. , vol. 20, no. 5s, pp. 63:1-63:26, 2021.
- [16] W. Jiang, Q. Lou, Z. Yan, L. Yang, J. Hu, X. S. Hu, and Y. Shi, 'Device-circuit-architecture co-exploration for computing-in-memory neural accelerators,' IEEE Trans. Computers , vol. 70, no. 4, pp. 595-605, 2021.
- [17] Z. Yuan, J. Liu, X. Li, L. Yan, H. Chen, B. Wu, Y. Yang, and G. Sun, 'NAS4RRAM: neural network architecture search for inference on rram-based accelerators,' Sci. China Inf. Sci. , vol. 64, no. 6, 2021.
- [18] Z. Yan, D.-C. Juan, X. S. Hu, and Y. Shi, 'Uncertainty modeling of emerging device based computing-in-memory neural accelerators with application to neural architecture search,' in 2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC) , 2021, pp. 859-864.
- [19] G. Li, M. Zhang, J. Li, F. Lv, and G. Tong, 'Efficient densely connected convolutional neural networks,' Pattern Recognit. , vol. 109, p. 107610, 2021.
- [20] C. Zhou, F. Redondo, J. Buchel, I. Boybat, X. Comas, S. R. Nandakumar, S. Das, A. Sebastian, M. L. Gallo, and P. N. Whatmough, 'Ml-hw co-design of noise-robust tinyml models and always-on analog compute-in-memory edge accelerator,' IEEE Micro , vol. 42, no. 06, pp. 76-87, 2022.
- [21] M. J. Rasch, D. Moreda, T. Gokmen, M. L. Gallo, F. Carta, C. Goldberg, K. E. Maghraoui, A. Sebastian, and V. Narayanan, 'A flexible and fast pytorch toolkit for simulating training and inference on analog crossbar arrays,' in 3rd IEEE International Conference on Artificial Intelligence Circuits and Systems . IEEE, 2021, pp. 1-4.
- [22] M. J. Rasch, C. Mackin, M. L. Gallo, A. Chen, A. Fasoli, F. Odermatt, N. Li, S. R. Nandakumar, P. Narayanan, H. Tsai, G. W. Burr, A. Sebastian, and V. Narayanan, 'Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators,' 2023. [Online]. Available: https://arxiv.org/abs/2302.08469
- [23] C. Termritthikun, Y. Jamtsho, J. Ieamsaard, P. Muneesawang, and I. Lee, 'Eeea-net: An early exit evolutionary neural architecture search,' Eng. Appl. Artif. Intell. , vol. 104, p. 104397, 2021.
- [24] X. Dai, A. Wan, P. Zhang, B. Wu, Z. He, Z. Wei, K. Chen, Y. Tian, M. Yu, P. Vajda, and J. E. Gonzalez, 'Fbnetv3: Joint architecture-recipe search using predictor pretraining,' in IEEE Conference on Computer Vision and Pattern Recognition, CVPR . Computer Vision Foundation / IEEE, 2021, pp. 16 276-16 285.
- [25] W. Jiang, X. Zhang, E. H. Sha, L. Yang, Q. Zhuge, Y. Shi, and J. Hu, 'Accuracy vs. efficiency: Achieving both through fpga-implementation aware neural architecture search,' in Proceedings of the 56th Annual Design Automation Conference 2019, DAC 2019, Las Vegas, NV, USA, June 02-06, 2019 . ACM, 2019, p. 5.
- [26] A. Sarah, D. Cummings, S. N. Sridhar, S. Sundaresan, M. Szankin, T. Webb, and J. P. Munoz, 'A hardware-aware system for accelerating deep neural network optimization,' CoRR , vol. abs/2202.12954, 2022.
- [27] Z. Dong, Y. Gao, Q. Huang, J. Wawrzynek, H. K. H. So, and K. Keutzer, 'HAO: hardware-aware neural architecture optimization for efficient inference,' in 29th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2021, Orlando, FL, USA, May 9-12, 2021 . IEEE, 2021, pp. 50-59.
- [28] M. Tan and Q. V. Le, 'Efficientnetv2: Smaller models and faster training,' in Proceedings of the 38th International Conference on Machine Learning, ICML , ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 2021, pp. 10 096-10 106.
- [29] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, 'Learning transferable architectures for scalable image recognition,' in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 . Computer Vision Foundation / IEEE Computer Society, 2018, pp. 8697-8710.
- [30] K. He, X. Zhang, S. Ren, and J. Sun, 'Deep residual learning for image recognition,' in IEEE Conference on Computer Vision and Pattern Recognition, CVPR . IEEE Computer Society, 2016, pp. 770-778.
- [31] M. D. McKay, 'Latin hypercube sampling as a tool in uncertainty analysis of computer models,' in Proceedings of the 24th Winter Simulation Conference, Arlington, VA, USA, December 13-16, 1992 , R. C. Crain, Ed. ACM Press, 1992, pp. 557-564.
- [32] X. Ning, Y. Zheng, T. Zhao, Y. Wang, and H. Yang, 'A generic graph-based neural architecture encoding scheme for predictor-based NAS,' in Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIII , ser. Lecture Notes in Computer Science, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, Eds., vol. 12358. Springer, 2020, pp. 189-204.
- [33] H. Benmeziane, S. Niar, H. Ouarnoughi, and K. E. Maghraoui, 'Pareto rank surrogate model for hardware-aware neural architecture search,' in International IEEE Symposium on Performance Analysis of Systems and Software, ISPASS . IEEE, 2022, pp. 267-276.
- [34] H. Abdi, 'The kendall rank correlation coefficient,' Encyclopedia of Measurement and Statistics. Sage, Thousand Oaks, CA , pp. 508-510, 2007.
- [35] A. Krizhevsky, G. Hinton et al. , 'Learning multiple layers of features from tiny images,' 2009.
- [36] A. Chowdhery, P. Warden, J. Shlens, A. Howard, and R. Rhodes, 'Visual wake words dataset,' CoRR , vol. abs/1906.05721, 2019.
- [37] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha, M. Li, and A. J. Smola, 'Resnest: Split-attention networks,' in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPR . IEEE, 2022, pp. 2735-2745.
- [38] M. C. Mukkamala and M. Hein, 'Variants of rmsprop and adagrad with logarithmic regret bounds,' in Proceedings of the 34th International Conference on Machine Learning, ICML , ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. PMLR, 2017, pp. 2545-2553.
- [39] P. Warden, 'Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition,' ArXiv e-prints , Apr. 2018. [Online]. Available: https://arxiv.org/abs/1804.03209
- [40] J. Martínez, H. Pérez-Meana, E. E. Hernández, and M. M. Suzuki, 'Speaker recognition using mel frequency cepstral coefficients (MFCC) and vector quantization (VQ) techniques,' in 22nd International Conference on Electrical Communications and Computers, CONIELECOMP , P. B. Sánchez, R. Rosas-Romero, and M. J. O. Galindo, Eds. IEEE, 2012, pp. 248-251.
- [41] D. P. Kingma and J. Ba, 'Adam: A method for stochastic optimization,' in 3rd International Conference on Learning Representations , Y. Bengio and Y. LeCun, Eds., 2015.
- [42] J. Lin, W. Chen, Y. Lin, J. Cohn, C. Gan, and S. Han, 'Mcunet: Tiny deep learning on iot devices,' in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems , H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.
- [43] P. M. Sørensen, B. Epp, and T. May, 'A depthwise separable convolutional neural network for keyword spotting on an embedded system,' EURASIP J. Audio Speech Music. Process. , vol. 2020, no. 1, p. 10, 2020.
- [44] M. L. Gallo, R. Khaddam-Aljameh, M. Stanisavljevic, A. Vasilopoulos, B. Kersting, M. Dazzi, G. Karunaratne, M. Braendli, A. Singh, S. M. Mueller, J. Buechel, X. Timoneda, V. Joshi, U. Egger, A. Garofalo, A. Petropoulos, T. Antonakopoulos, K. Brew, S. Choi, I. Ok, T. Philip, V. Chan, C. Silvestre, I. Ahsan, N. Saulnier, V. Narayanan, P. A. Francese, E. Eleftheriou, and A. Sebastian, 'A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference,' 2022. [Online]. Available: https://arxiv.org/abs/2212.02872
- [45] R. Khaddam-Aljameh, M. Stanisavljevic, J. F. Mas, G. Karunaratne, M. Braendli, F. Liu, A. Singh, S. M. Müller, U. Egger, A. Petropoulos, T. Antonakopoulos, K. Brew, S. Choi, I. Ok, F. L. Lie, N. Saulnier, V. Chan, I. Ahsan, V. Narayanan, S. R. Nandakumar, M. L. Gallo, P. A. Francese, A. Sebastian, and E. Eleftheriou, 'Hermes core - a 14nm cmos and pcm-based in-memory compute core using an array of 300ps/lsb linearized cco-based adcs and local digital processing,' in 2021 Symposium on VLSI Technology , 2021, pp. 1-2.
- [46] S. Jain, H. Tsai, C.-T. Chen, R. Muralidhar, I. Boybat, M. M. Frank, S. Woźniak, M. Stanisavljevic, P. Adusumilli, P. Narayanan, K. Hosokawa, M. Ishii, A. Kumar, V. Narayanan, and G. W. Burr, 'A heterogeneous and programmable compute-in-memory accelerator architecture for analog AI using dense 2-d mesh,' IEEE Transactions on Very Large Scale Integration (VLSI) Systems , vol. 31, no. 1, pp. 114-127, 2023.