## Analog, In-memory Compute Architectures for Artificial Intelligence
Patrick Bowen ∗
Neurophos, 212 W Main St. #301, Durham, North Carolina 27701, USA and Center for Metamaterials and Integrated Plasmonics and Department of Electrical and Computer Engineering, Duke University, P.O. Box 90291, Durham, North Carolina 27708, USA
Guy Regev †and Nir Regev ‡
AlephZero, 5141 Beeman Ave. Valley Village, CA 91607, USA and
Department of Electrical and Computer Engineering,
Ben-Gurion University of the Negev, David Ben-Gurion Blvd. 1, Be'er Sheva, Israel
Bruno Umbria Pedroni §
AlephZero, 5141 Beeman Ave., Valley Village, CA 91607, USA and
Department of Bioengineering, UC San Diego, 9500 Gilman Dr., La Jolla, CA 92093, USA
Edward Hanson ¶ and Yiran Chen ∗∗
Duke University, Electrical and Computer Engineering Department, Science Dr, Durham, NC 27710 (Dated: February 14, 2023)
This paper presents an analysis of the fundamental limits on energy efficiency in both digital and analog in-memory computing architectures, and compares their performance to single instruction, single data (scalar) machines specifically in the context of machine inference. The focus of the analysis is on how efficiency scales with the size, arithmetic intensity, and bit precision of the computation to be performed. It is shown that analog, in-memory computing architectures can approach arbitrarily high energy efficiency as both the problem size and processor size scale.
## I. INTRODUCTION
This work is focused on minimizing the energy required to evaluate neural networks, particularly in the linear layers which comprise the overwhelming majority of the computation. The linear operators that describe convolutional neural network layers can often be characterized by three qualities: they are sparse, high in dimensionality, and high in arithmetic intensity, where arithmetic intensity is defined as the ratio between the number of basic operations (i.e. multiplications and additions) and the number of bytes read and written. This paper shows that, in the context of operators that are both high in dimensionality and arithmetic intensity, an analog in-memory computing device can drastically reduce the energy required to evaluate the operator compared to a von Neumann machine. Moreover, the degree of increased efficiency of the analog processor is related to the scale of the processor.
In a classical von Neumann machine, the energy required to evaluate an operator can be broken into two components: memory access energy and computational energy. Within a typical CPU, and depending on the workload, these components can consume the same order of magnitude of the total energy.
∗ ptbowen@neurophos.com
† guy@alephzero.ai
‡ nir@alephzero.ai
§ bruno@alephzero.ai
¶ edward.t.hanson@duke.edu
∗∗ yiran.chen@duke.edu
Memory access related energy can easily outgrow computational energy consumption, particularly when evaluating sequences of large linear operators like those used in neural network inference. The goal of this paper is to find high-level architectures that can reduce the energy consumption of neural network algorithms by orders of magnitude, which requires addressing both memory access energy and computational energy. Here we show that an in-memory compute accelerator architecture can reduce memory access energy when applied to an operator/algorithm with high arithmetic intensity, while an analog processor/accelerator can reduce computational energy when specialized for particular classes of linear operators. A processor architecture that combines in-memory compute with analog computation can in principle reduce the overall computational energy consumption by orders of magnitude, with the amount of reduction depending on the scale and arithmetic intensity of the algorithm to be performed and the analog processor's specialization in performing a specific set of operators.
In-memory compute architectures were originally designed to speed up processing of algorithms that are parallelizable and applied to large datasets. One of the earliest examples dates back to the 1960s with Westinghouse's Solomon project. The goal of that project was to accelerate the speed of the computer up to 1 GFLOPS by applying a single instruction to a large array of Arithmetic Logic Units (ALUs). This is perhaps the first instance of several closely related concepts: single instruction, multiple data (SIMD) machines, vector/array processors, systolic arrays, and in-memory/near-memory compute devices.
Today, exploiting parallelism in high-arithmetic-intensity algorithms using parallel hardware remains a well-known technique to accelerate a computation along the time dimension. More recently, however, vector/array processors have been utilized to decrease compute energy rather than compute time, which they accomplish by reducing the energy associated with memory accesses. Google's TPU is a good example of a systolic array being used as a near-memory compute device with digital processing elements [1, 2]. In sec. III, we explain how in-memory compute devices can reduce memory access energy in the case of linear operators with high arithmetic intensity.
Separately, analog computing has recently been proposed as an approach to reduce the computational energy consumption, again for large, linear operations. In sec. IV we present a general model of analog computation that focuses on how energy consumption scales with problem size and bit precision, and show that computational energy can be reduced by orders of magnitude by using an analog processor that is specialized to implement specific classes of operators. Reconfigurable analog processors are by nature in-memory compute devices, and so these classes of processors are shown to reduce overall computational energy by orders of magnitude for particular operators.
## II. CPU ENERGY CONSUMPTION
We begin by finding the energy efficiency of a computer performing multiply-accumulate (MAC) operations, which are the core of linear operators used in deep learning. The total energy required to perform a linear operation can be decomposed into memory access energy and computational energy:
$$E _ { t o t } = N _ { m } e _ { m } + N _ { o p } e _ { o p } , \quad ( 1 )$$
where N m is the number of memory accesses, e m is the average energy per access, N op is the number of operations required to evaluate the overall operator, and e op is the average energy per operation (e.g., add, multiply, etc). We define the computational efficiency as the number of operations per unit energy performed by the computer:
$$\eta \equiv N _ { o p } / E _ { t o t } = \frac { 1 } { ( N _ { m } / N _ { o p } ) e _ { m } + e _ { o p } } . \quad ( 2 )$$
In a simple CPU with a single instruction, single data (SISD) architecture in Flynn's taxonomy and a flat memory hierarchy, for each operation that is performed, a value is read from memory for the current partial sum, the operator weight, and the input activation. The three values are operated upon, and the result is written back to memory. Therefore, regardless of the actual size of the weights or activations, the number of memory accesses per MAC will always be four (i.e. three reads and one write), and the number of computational operations (one multiply and one add) will be two. This results in N m = 2 N op and a computational efficiency of
$$\eta = \frac { 1 } { 2 e _ { m } + e _ { o p } } . \quad ( 3 )$$
In modern CMOS devices, both e m and e op are on the order of magnitude of 1 pJ [3], as will later be shown in table IV. This places an approximate limit on the computational efficiency of most traditional architectures on the order of 0.1-1 TOPS/W, which is consistent with state of the art performance [4].
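As an illustrative sketch (not part of the original analysis), the access counting above and the bound of eq. (3) can be checked numerically; the ~1 pJ figures are the order-of-magnitude values just quoted, and the weight/activation values are arbitrary:

```python
# Sketch of the SISD model: count memory accesses in a naive MAC loop
# (y += w[i] * x[i]) and evaluate the efficiency bound of eq. (3).
w = [0.5, -1.0, 2.0]   # arbitrary example weights
x = [1.0, 2.0, 3.0]    # arbitrary example activations
y, reads, writes, ops = 0.0, 0, 0, 0
for i in range(len(w)):
    reads += 3           # partial sum, weight, and activation
    y = y + w[i] * x[i]  # one multiply + one add
    ops += 2
    writes += 1          # write the partial sum back
assert reads + writes == 2 * ops  # N_m = 2 * N_op, as in the text

e_m = 1e-12   # ~1 pJ per memory access (order of magnitude from the text)
e_op = 1e-12  # ~1 pJ per arithmetic operation
eta = 1.0 / (2 * e_m + e_op)       # eq. (3), in operations per joule
print(f"{eta / 1e12:.2f} TOPS/W")  # ~0.33, within the quoted 0.1-1 range
```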
## III. MINIMIZING MEMORY ACCESS ENERGY WITH IN-MEMORY COMPUTE
One of the major downsides of SISD machines is that they can end up accessing the same memory element multiple times in the course of evaluating a large operator, which wastes memory access energy. This is ultimately reflected in the ratio N op /N m = 1 / 2 that is fixed by the nature of a SISD machine. Alternatively, one can imagine a hypothetical architecture that is arranged in some energetically optimal way such that all of the inputs are read only once from memory, and all outputs are written only once to memory in the course of the computation. If that were done, this would represent the minimum total access energy required to evaluate the linear operator. In other words, N m would reach its minimum value, and the ratio N op /N m would be maximized.
While a particular processor might only be able to implement a certain N op /N m ratio, this ratio is also limited by the algorithm being performed, and is commonly referred to as the arithmetic intensity of the algorithm:
$$a \equiv N _ { o p } / N _ { m } . \quad ( 4 )$$
An in-memory compute device [5] as illustrated in fig. 1 can leverage the arithmetic intensity of an algorithm by reading a large set of both operator data and input vector data from memory at once and operating on all of the data together before writing the output back to memory. If the in-memory compute device is sufficiently large and complex, all of the necessary operations involving this data can be performed without any of the inputs being read a second time from memory in the future.
Returning to eq. (1), we set a lower bound on the amount of memory access energy that must be expended for the von Neumann machine to evaluate the operator in terms of the arithmetic intensity. This in turn leads to a limit on the computational efficiency:
$$\eta = \frac { 1 } { e _ { m } / a + e _ { o p } } \quad ( 5 )$$
FIG. 1: Illustration of a digital compute-in-memory processor.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Diagram: Compute-In-Memory Processor Architecture
### Overview
The image presents a block diagram illustrating the architecture of a Compute-In-Memory processor. It shows the flow of data between Weight Memory, the Compute-In-Memory processor, and Activation Memory.
### Components/Axes
* **Weight Memory:** An orange rectangular block on the left, labeled "Weight Memory".
* **Compute-In-Memory processor:** A blue rounded rectangular block in the center, labeled "Compute-In-Memory processor" and "(i.e. vector processor)".
* **Activation Memory:** An orange rounded rectangular block at the bottom, labeled "Activation Memory".
* **Arrows:**
* A black line labeled "K" connects Weight Memory to the Compute-In-Memory processor.
* A black line labeled "N" connects Activation Memory to the Compute-In-Memory processor.
* A curved yellow arrow labeled "M" connects the Compute-In-Memory processor to Activation Memory.
### Detailed Analysis or ### Content Details
The diagram illustrates the data flow in a Compute-In-Memory architecture. Weight data is fetched from the Weight Memory to the Compute-In-Memory processor (indicated by arrow K). Activation data is fetched from the Activation Memory to the Compute-In-Memory processor (indicated by arrow N). The results of the computation are then written back to the Activation Memory (indicated by arrow M).
### Key Observations
* The diagram highlights the central role of the Compute-In-Memory processor in performing computations directly within the memory.
* The use of separate Weight and Activation Memories suggests a separation of concerns for storing weights and intermediate activation values.
* The feedback loop from the Compute-In-Memory processor to the Activation Memory indicates that the architecture supports iterative computations.
### Interpretation
The diagram illustrates a typical architecture for Compute-In-Memory processing, where computations are performed directly within the memory units. This approach can potentially reduce data movement and improve energy efficiency compared to traditional processor architectures. The separation of weight and activation data, along with the feedback loop, suggests that this architecture is suitable for applications such as deep learning, where iterative computations and large amounts of data are common. The labels K, N, and M likely represent data transfer rates or sizes.
</details>
The contribution to computational efficiency from memory access energy can therefore be brought arbitrarily low when implementing an operator with arbitrarily high arithmetic intensity. The reduction in the contribution from memory access energy with increasing arithmetic intensity in eq. (5) is reflective of the energy savings in systolic arrays and TPUs [1, 2].
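To make the scaling of eq. (5) concrete, a short sketch (assuming the same ~1 pJ per-access and per-operation figures used earlier):

```python
def efficiency(a, e_m=1e-12, e_op=1e-12):
    """Eq. (5): efficiency in ops/J at arithmetic intensity a (ops per access)."""
    return 1.0 / (e_m / a + e_op)

# As a grows, the memory term e_m/a vanishes and eta saturates at 1/e_op,
# i.e. the SISD case a = 1/2 is the worst case and high-intensity operators
# make memory access energy negligible.
for a in (2, 10, 100, 1000):
    print(f"a = {a:4d}: {efficiency(a) / 1e12:.3f} TOPS/W")
```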
We note that the kind of analysis presented in eq. (5) is analogous to roofline models of processors [6]; however, the emphasis here is on energy consumption, whereas roofline models focus on identifying bottlenecks in processor speed.
In order to sample the degree of advantage in-memory compute devices can bring, we examine a few examples of linear operators and present their arithmetic intensities. For a general matrix multiplication of a matrix of size L × N times a matrix of dimension N × M , the total number of memory accesses is N m = LN + NM + LM , and the number of operations is N op = 2 NML , where additions and multiplications are treated as separate operations. The arithmetic intensity in this case is:
$$a = \frac { 2 N M L } { L N + N M + L M } , \quad ( 6 )$$
which approaches ∞ as N,M,L →∞ collectively.
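Eq. (6) is easy to explore numerically; a minimal sketch (the dimension values are arbitrary examples):

```python
def matmul_intensity(L, N, M):
    """Eq. (6): arithmetic intensity of an (L x N) times (N x M) product."""
    n_op = 2 * L * N * M           # one multiply + one add per MAC
    n_mem = L * N + N * M + L * M  # read each operand once, write result once
    return n_op / n_mem

# For square matrices (L = N = M = s) the intensity reduces to 2s/3,
# growing without bound as the matrices scale together.
for s in (3, 48, 768):
    print(s, matmul_intensity(s, s, s))
```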
For a convolution, the arithmetic intensity can similarly become arbitrarily large, since a convolution can be implemented as a matrix-matrix multiplication. This is typically done by rearranging the input data into a Toeplitz matrix using what is known as an im2col() operation. The general algorithm of implementing convolution using matrix multiplication in a systolic array is shown in fig. 2, where n × n is the size of one input channel, C i is the number of input channels, k × k is the size of one of the kernel channels, and C i +1 is the number of output channels (and, consequently, also the number of individual 3-D kernels). The Toeplitz matrix formed by replicating and rearranging the activation data results in an ( n -k +1) 2 × k 2 C i matrix. A convolution is performed by multiplying this with a k 2 C i × C i +1 matrix containing the weights. Therefore, when implementing a convolution using matrix multiplication we generally have matrix dimensions,
$$L = ( n - k + 1 ) ^ { 2 } \approx n ^ { 2 } \quad ( 7 a )$$
$$N = k ^ { 2 } C _ { i } \quad ( 7 b )$$
$$M = C _ { i + 1 } . \quad ( 7 c )$$
which results in an arithmetic intensity,
$$a = \frac { 2 n ^ { 2 } k ^ { 2 } C _ { i } C _ { i + 1 } } { n ^ { 2 } k ^ { 2 } C _ { i } + k ^ { 2 } C _ { i } C _ { i + 1 } + n ^ { 2 } C _ { i + 1 } } . \quad ( 8 )$$
However, since the activation data was replicated approximately k 2 times in order to form the input matrix, the arithmetic intensity is significantly reduced relative to a processor that natively implements convolution instead of general matrix multiplication. To see this, consider again the convolutional layer of an n × n input image with C i input channels, C i +1 output channels, and a k × k kernel. The input vector size is N i = n 2 C i , and the number of kernel weights is K = k 2 C i C i +1 . If only the necessary weight and activation data were required to be read, the arithmetic intensity of the i th layer would become
$$a \approx { \frac { 2 n ^ { 2 } k ^ { 2 } C _ { i } C _ { i + 1 } } { n ^ { 2 } ( C _ { i } + C _ { i + 1 } ) + k ^ { 2 } C _ { i } C _ { i + 1 } } } . \quad ( 9 )$$
In the limit where n 2 >> k 2 C i , the arithmetic intensity is roughly a factor of k 2 higher than when convolution is implemented using matrix multiplication.
Whether convolution is implemented natively or using matrix-matrix multiplication, eq. (9) shows that, as n, k, C i → ∞ , arithmetic intensity becomes arbitrarily large, making the contribution from memory access energy in eq. (5) arbitrarily small. Indeed, in most modern convolutional neural networks, these parameters are large and yield high arithmetic intensity, as shown in table I. Depending on the size of the memory banks (which determine memory access energy), and based on the reference numbers given in table IV for SRAM access energy and digital MAC operation, an in-memory compute processor implementing an algorithm with high arithmetic intensity can be made to expend negligible memory access energy relative to the computational energy.
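The two intensities in eqs. (8) and (9) can be compared directly; the layer dimensions below are arbitrary examples, not values taken from table I:

```python
def a_im2col(n, k, ci, co):
    """Eq. (8): arithmetic intensity of convolution via im2col + matmul."""
    n_op = 2 * n**2 * k**2 * ci * co
    return n_op / (n**2 * k**2 * ci + k**2 * ci * co + n**2 * co)

def a_native(n, k, ci, co):
    """Eq. (9): arithmetic intensity of a natively implemented convolution."""
    n_op = 2 * n**2 * k**2 * ci * co
    return n_op / (n**2 * (ci + co) + k**2 * ci * co)

# A native convolution avoids the ~k^2-fold replication of activation data
# performed by im2col, so its intensity is several-fold higher in the
# n^2 >> k^2 * ci regime.
n, k, ci, co = 224, 3, 64, 64
print(a_im2col(n, k, ci, co), a_native(n, k, ci, co))
```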
## IV. REDUCING COMPUTATIONAL ENERGY WITH ANALOG COMPUTING
Unfortunately, by Amdahl's law, even if the memory access energy is made arbitrarily small, the computational energy consumed by the logical units will limit the overall performance gains to be made. In order to improve the overall efficiency by orders of magnitude, both contributions need to be addressed.
FIG. 2: Algorithmic implementation of a convolution using matrix multiplication in a weight-stationary systolic array. The input data is converted into a Toeplitz matrix and fed into the systolic array, with each row delayed one time step behind the one above it.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Convolutional Neural Network Layer Diagram
### Overview
The image is a diagram illustrating the process of convolution in a convolutional neural network (CNN) layer. It shows how input data, kernels, and weights interact to produce output channels. The diagram visualizes the transformation of input features through convolution operations, highlighting the spatial relationships and data flow within the layer.
### Components/Axes
* **Kernels (C<sub>i+1</sub> kernels of size k x k x C<sub>i</sub>):** Four 3D cubes, each representing a kernel. Each cube is divided into 8 smaller cubes, with labels "A", "B", "C", and "D" on the faces. The dimensions are labeled as 'k', 'k', and 'C<sub>i</sub>'.
* **Kernel matrix (C<sub>i+1</sub> x k<sup>2</sup>C<sub>i</sub>):** A matrix formed by arranging the kernel elements. It shows the repeating sequence of "ABCD" in each row.
* **Inputs (n x n x C<sub>i</sub>):** A 3D cube representing the input data. The dimensions are labeled as 'n', 'n', and 'C<sub>i</sub>'. The cube's faces are labeled with letters 'a' through 'p'.
* **Outputs (m x m x C<sub>i+1</sub>):** A 3D cube representing the output data. The dimensions are labeled as 'm', 'm', and 'C<sub>i+1</sub>'. The cube's faces are labeled with letters 'a' through 'i'.
* **Preload weights:** A column of blocks labeled "Inputs", each containing four sub-blocks labeled "A", "B", "C", and "D". These blocks are connected to the "Kernel matrix" via dashed lines.
* **Output channels:** A column of blocks representing the output channels, each containing four sub-blocks.
* **m<sup>2</sup>:** Indicates the number of rows in the intermediate matrices.
* **k<sup>2</sup>C<sub>i</sub>:** Indicates the transformation from the input cube to the intermediate matrix.
* **Time:** An arrow indicating the direction of processing.
### Detailed Analysis
1. **Kernels:**
* Four kernel cubes are shown, each with dimensions k x k x C<sub>i</sub>.
* Each kernel cube has faces labeled with "A", "B", "C", and "D".
* The colors of the cubes are blue, purple, orange, and green.
2. **Kernel Matrix:**
* The kernel matrix is formed by arranging the elements of the kernels.
* Each row of the matrix contains a repeating sequence of "ABCD".
* The matrix has dimensions C<sub>i+1</sub> x k<sup>2</sup>C<sub>i</sub>.
3. **Inputs:**
* The input data is represented as a 3D cube with dimensions n x n x C<sub>i</sub>.
* The faces of the cube are labeled with letters 'a' through 'p'.
4. **Outputs:**
* The output data is represented as a 3D cube with dimensions m x m x C<sub>i+1</sub>.
* The faces of the cube are labeled with letters 'a' through 'i'.
5. **Intermediate Matrices:**
* Several matrices are shown between the "Inputs" and "Outputs".
* Each row of these matrices contains a sequence of letters, such as "kjigfecba", "lkhgfdcb", "onmkjiigfe", and "ponlkjhgf".
* The number of rows in these matrices is indicated as m<sup>2</sup>.
6. **Preload Weights:**
* The "Preload weights" column shows blocks labeled "A", "B", "C", and "D".
* These blocks are connected to the "Kernel matrix" via dashed lines, indicating the application of weights to the kernel elements.
7. **Output Channels:**
* The "Output channels" column shows blocks representing the final output.
* Each block contains four sub-blocks, similar to the "Preload weights".
### Key Observations
* The diagram illustrates the flow of data from the input cube through the convolution operation to the output cube.
* The kernel matrix represents the weights applied during the convolution.
* The intermediate matrices show the result of applying the kernels to different regions of the input data.
* The "Preload weights" and "Output channels" columns highlight the role of weights in the convolution process.
### Interpretation
The diagram provides a visual representation of the convolution operation in a CNN layer. It demonstrates how the input data is transformed by applying kernels and weights to produce output features. The diagram highlights the spatial relationships between the input data, kernels, and output features. The use of intermediate matrices helps to visualize the intermediate steps in the convolution process. The diagram suggests that the convolution operation involves sliding the kernels across the input data, applying weights to the kernel elements, and summing the results to produce the output features. The diagram also shows how multiple kernels can be used to produce multiple output channels.
</details>
TABLE I: Summary of convolutional layer parameters of various well-known neural networks considering a 1-Mpixel (per channel) input image.
| Network | # of layers | median n | median C i | max N | avg. k | total K | median C i +1 | median a |
|-------------------|---------------|------------|--------------|---------|----------|-----------|-----------------|------------|
| DenseNet201 | 200 | 62 | 128 | 1.6e+07 | 2 | 1.8e+07 | 128 | 292 |
| GoogLeNet | 59 | 61 | 480 | 3.9e+06 | 2.1 | 6.1e+06 | 128 | 200 |
| InceptionResNetV2 | 244 | 60 | 320 | 8e+06 | 1.9 | 8e+07 | 192 | 291 |
| InceptionV3 | 94 | 60 | 192 | 8e+06 | 2.4 | 3.7e+07 | 192 | 295 |
| ResNet152 | 155 | 63 | 256 | 1.6e+07 | 1.7 | 5.8e+07 | 256 | 390 |
| VGG16 | 13 | 249 | 256 | 6.4e+07 | 3 | 1.5e+07 | 256 | 2262 |
| VGG19 | 16 | 186 | 256 | 6.4e+07 | 3 | 2e+07 | 384 | 2527 |
| YOLOv3 | 75 | 62 | 256 | 3.2e+07 | 2 | 6.2e+07 | 256 | 504 |
Recently, various types of analog computing, from electrical to optical, have been proposed as techniques to reduce computational energy consumption. Electronic analog computing typically centers around crossbar arrays of resistive memory (or ReRAM) [7-9]. Optical analog processors are commonly based on silicon photonics [10-13]. Optical 4F systems have been explored since the 1980s as a higher dimensional form of compute [14, 15], and simple scattering off of optical surfaces is also being explored [16-18].
The argument for analog computing is fundamentally a scaling one: analog computing has particular advantages when applied to large, linear operators with low bit precision [19]. To see this, consider a general analog processor (shown in fig. 3(a)) that takes N numbers of B -bit precision input data, produces M numbers of B -bit precision output data, and is configured by K weights of B -bit precision that represent the matrix. The analog processor is first configured by converting the K weights using digital-to-analog converters (DACs) and applying these values to the modulators in the analog processor. Then the N inputs are read from memory, and DACs are used to apply N analog inputs to the processor. By the physics of the processor, this naturally results in M
FIG. 3: (a) System-level view analog, in-memory compute processors. The analog device is configured using DACs to either hold activations or weights, while the other is provided as input. (b) Detailed view of a ReRAM crossbar analog electronic in-memory compute processor. Each transistor is connected to a reconfigurable resistor, the conductance of which determines the effective weight of each element in the matrix. (c) Detailed view of a silicon photonic in-memory compute processor. Each transistor is connected to an electro-optic element that changes the scattering parameters through each intersection.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Circuit Diagrams: Analog Device Architectures
### Overview
The image presents three circuit diagrams (a, b, c) illustrating different architectures for analog devices, likely used in machine learning or signal processing applications. Diagram (a) shows a high-level block diagram, while (b) and (c) depict more detailed circuit implementations.
### Components/Axes
**Diagram (a): High-Level Block Diagram**
* **Blocks:**
* Activation Memory (Blue rectangle, top-left)
* Passive, Analog Device (Gray rectangle, center)
* Weight Memory (Blue rectangle, top)
* DAC 1, DAC 2, ..., DAC K (Orange rectangles, left of Passive Analog Device)
* DAC 1, DAC 2, ..., DAC M (Gray rectangles, above Passive Analog Device)
* ADC 1, ADC 2, ..., ADC M (Blue rectangles, below Passive Analog Device)
* **Labels:**
* N (next to DAC K)
* M (next to DAC M and ADC M)
* **Connections:**
* Activation Memory connected to DACs (DAC 1 to DAC K)
* DACs (DAC 1 to DAC M) connected to Weight Memory
* DACs (DAC 1 to DAC K) connected to Passive, Analog Device
* Passive, Analog Device connected to ADCs (ADC 1 to ADC M)
* ADCs (ADC 1 to ADC M) connected back to Activation Memory
**Diagram (b): Detailed Circuit Implementation (First Architecture)**
* **Regions:**
* DAC array (Red rectangle, top-left)
* Labels: γdac, εdac,1, γdac, εdac,2
* ADC array (Blue rectangle, bottom)
* Labels: γadc, εadc
* Transistor Array (Gray rectangle, center)
* **Labels:**
* N inputs (left)
* M programming lines (top)
* M outputs (bottom)
* K = MN weights (right)
**Diagram (c): Detailed Circuit Implementation (Second Architecture)**
* **Regions:**
* DAC array (Red rectangle, top-left)
* Labels: γdac, εdac,1, γdac, εdac,2
* ADC array (Blue rectangle, bottom)
* Labels: γadc, εadc
* Bitline selector (Gray rectangle, center)
* **Labels:**
* N inputs (left)
* M programming lines (top)
* M outputs (bottom)
* K = MN weights (right)
* Source (Gray circle, left)
* VOA (Variable Optical Attenuator)
* PD (Photodiode)
* **Connections:**
* Input "source" splits into N paths, connected to the bitline selector.
* The bitline selector has connections to the ADC array.
### Detailed Analysis or ### Content Details
**Diagram (a):**
* The diagram shows a system with an Activation Memory providing inputs to a Passive, Analog Device.
* DACs (Digital-to-Analog Converters) are used to convert digital signals from the Activation Memory and Weight Memory into analog signals suitable for the Passive, Analog Device.
* ADCs (Analog-to-Digital Converters) convert the analog output of the Passive, Analog Device back into digital signals.
* The number of DACs connected to the Activation Memory is K, while the number of DACs/ADCs connected to the Passive, Analog Device is M.
**Diagram (b):**
* This diagram shows a circuit implementation using a DAC array, a transistor array, and an ADC array.
* The DAC array receives N inputs and generates analog signals that control the transistors in the transistor array.
* The transistor array performs a weighted sum of the inputs, with the weights determined by the programming lines.
* The ADC array converts the analog output of the transistor array into M digital outputs.
* The total number of weights is K = MN.
**Diagram (c):**
* This diagram shows an alternative circuit implementation using a DAC array, a bitline selector, and an ADC array.
* The DAC array receives N inputs from a "source" and generates analog signals that control the bitline selector.
* The bitline selector routes the analog signals to the ADC array based on the programming lines.
* The ADC array converts the analog signals into M digital outputs.
* The circuit includes Variable Optical Attenuators (VOA) and Photodiodes (PD).
* The total number of weights is K = MN.
### Key Observations
* Diagram (a) provides a high-level overview of the system architecture.
* Diagrams (b) and (c) present two different circuit implementations for the Passive, Analog Device.
* Both implementations use DACs and ADCs to interface between digital and analog domains.
* Diagram (c) utilizes optical components (VOA and PD) and a bitline selector, suggesting a different approach to signal processing compared to diagram (b).
### Interpretation
The diagrams illustrate different approaches to implementing analog devices for machine learning or signal processing. Diagram (a) sets the stage by showing the overall system architecture, highlighting the interaction between digital memory and an analog processing unit. Diagrams (b) and (c) then delve into specific circuit implementations. Diagram (b) uses a transistor array to perform weighted sums, while diagram (c) employs a bitline selector and optical components. The choice of implementation depends on factors such as performance requirements, power consumption, and area constraints. The use of optical components in diagram (c) suggests a potential for higher bandwidth and lower power consumption compared to the purely electronic implementation in diagram (b). The common goal is to efficiently perform analog computations, which can be advantageous for certain machine learning tasks.
</details>
analog outputs, which are converted back to the digital domain using analog-to-digital converters (ADCs). If the analog processor is somehow already configured, or never needs to be reconfigured, then the total energy consumed will be only that of the DACs for the inputs and ADCs for the outputs:
$$E _ { o p } \equiv N _ { o p } e _ { o p } = N ( e _ { d a c , 1 } + e _ { a d c } ) , \quad ( 1 0 )$$
where we have assumed N = M for simplicity. While the right-hand side of eq. (10) represents the computational energy consumed by the analog processor, the left-hand side represents the equivalent number of digital operations performed ( N op ) times the energy that each of those operations would have to take ( e op ) in order for a digital computer to achieve the same efficiency as the analog computer. Since N op = 2 N 2 for matrix multiplication, if this operation were performed digitally, the expended computational energy would be proportional to the number of operations: E op = 2 e op N 2 . The conclusion is that analog computing reduces matrix multiplication from O ( N 2 ) in energy to O ( N ) in energy. This furthermore implies that the effective energy per operation of analog computing scales inversely with the size of the problem, i.e.
$$e _ { o p } \propto 1 / N . \quad ( 1 1 )$$
We note that in practice the scale N is set either by the size of the processor or the size of the problem, whichever is smaller.
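A sketch of this scaling argument, using placeholder ~1 pJ converter and operation energies rather than values from any specific device:

```python
def digital_energy(N, e_op=1e-12):
    """Digital matrix multiply energy: E_op = 2 * e_op * N^2."""
    return 2 * e_op * N**2

def analog_energy(N, e_dac=1e-12, e_adc=1e-12):
    """Eq. (10): a pre-configured analog processor pays only converter energy."""
    return N * (e_dac + e_adc)

# The analog/digital energy ratio falls off as 1/N: the effective energy
# per operation scales as e_op ~ 1/N, as in eq. (11).
for N in (64, 1024):
    print(N, analog_energy(N) / digital_energy(N))
```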
## A. Vector-Matrix Multiplication
For most problems involving neural networks, the analog processors that can be created are not large enough to store the entire neural network. In this case, reconfiguring the weights of the analog processor can itself destroy the O ( N ) scaling advantage. To see this, consider the multiplication of a vector of length N with a matrix of dimensions N × M . In this case, we have,
$$N _ { o p } e _ { o p } = 2 N e _ { d a c , 1 } + 2 M N e _ { d a c , 2 } + 2 M e _ { a d c } . \quad ( 1 2 )$$
We have also separated the DAC energies e dac, 1 and e dac, 2 since different physical mechanisms and loads are sometimes used to configure an analog computer versus feed it with analog inputs. Here, e dac, 1 is used to represent the energy required per input, while e dac, 2 is used to represent the energy required per reconfiguration.
Typically, analog in-memory compute devices can store either only positive numbers (as in the example of memristors) or fully complex numbers (as in the case of coupled Mach-Zehnder interferometers). If only positive numbers can be represented, then the entire calculation must be done twice and the difference of the results taken in order to account for both positive and negative matrix values. On the other hand, when complex values are allowed, as in the case of silicon photonic MZIs, two voltages (and hence two DAC operations) are required to configure each coupled MZI modulator. Additionally, for coherent optical measurements, an interference technique must be used to recover the positive and negative field components from the photodetectors, which can only measure the norm square of the field. Hence, regardless of the analog compute scheme, each term in eq. (12) must in practice be multiplied by a factor of two in order to handle both positive and negative values.
Applying eq. (12) to vector-matrix multiplication, we obtain:
$$e_{op} = e_{dac,1}/M + e_{dac,2} + e_{adc}/N, \quad (13)$$
in which case the middle term is proportional neither to 1 /N nor 1 /M .
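For concreteness, eqs. (12)-(13) can be written directly in code; this is a sketch with our own function name and argument conventions, not an implementation from the paper:

```python
def e_op_vector_matrix(N, M, e_dac1, e_dac2, e_adc):
    """Effective energy per operation for an N-vector times an N x M
    matrix on an analog processor, following eqs. (12)-(13)."""
    # eq. (12): total energy of the analog evaluation
    E_total = 2 * N * e_dac1 + 2 * M * N * e_dac2 + 2 * M * e_adc
    N_op = 2 * N * M  # number of equivalent digital operations
    return E_total / N_op  # = e_dac1/M + e_dac2 + e_adc/N
```

Because the $e_{dac,2}$ term is not divided by any problem dimension, growing the processor or the problem does not amortize the reconfiguration cost.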
## B. Matrix-Matrix Multiplication
The aforementioned situation is relieved in the case of matrix-matrix multiplication. In this case the configuration of the analog computer itself is reused for every row of the input matrix, restoring the energy cost per operation to be inversely proportional to the problem scaling. In the case of an L × N matrix times an N × M matrix, we have
$$e_{op} = e_{dac,1}/M + e_{dac,2}/L + e_{adc}/N \quad (14)$$
since N op = 2 NML in this case. Since each of the three separate contributions to the energy consumption is decreased by a factor proportional to the three different dimensions associated with the matrices being multiplied, the effective energy per operation decreases as the problem scale increases. In the case of a finite-sized analog processor, the last two contributions will ultimately be limited by the two dimensions (number of inputs and outputs) of the analog processor itself.
At this point, a distinction needs to be made between the size of the matrices involved in the neural net architecture and the physical dimensions of the analog processor. We label the matrix dimensions with primes, i.e. $M'$, $N'$, and $L'$, and label the physical dimensions of the processor with hats: $\hat{M}$, $\hat{N}$. The actual factors by which energy is saved (i.e. $M$ and $N$ in eq. (14)) are given by the smaller of these two numbers:
$$M = \min\{\hat{M}, M'\} \quad (15a)$$
$$N = \min\{\hat{N}, N'\}. \quad (15b)$$
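A minimal sketch of eqs. (14)-(15), with hypothetical names for the problem and processor dimensions:

```python
def e_op_matrix_matrix(L, N_prime, M_prime, N_hat, M_hat,
                       e_dac1, e_dac2, e_adc):
    """Energy per op for an L x N' times N' x M' product on an analog
    processor with physical dimensions N_hat x M_hat, eqs. (14)-(15)."""
    M = min(M_hat, M_prime)  # eq. (15a): amortization clamped by hardware
    N = min(N_hat, N_prime)  # eq. (15b)
    return e_dac1 / M + e_dac2 / L + e_adc / N  # eq. (14)
```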
## C. Convolution
As in the case of digital processors, analog processors can also implement convolution using matrix-matrix multiplication. The mapping of the kernel and activation data to matrix dimensions remains the same, i.e.
$$L' = (n - k + 1)^2 \approx n^2 \quad (16a)$$
$$N' = k^2 C_i \quad (16b)$$
$$M' = C_{i+1} \quad (16c)$$
when a weight-stationary scheme is implemented. These numbers are permuted for an activation-stationary scheme. As with digital processors, one of the unfortunate aspects of representing convolution as pure matrix multiplication is
TABLE II: Median values of L ′ , N ′ , and M ′ as per eq. (16) for the convolutional layers of various well-known neural networks. The values were obtained considering a 1-Mpixel (per channel) input image.
| Network | # of layers | L ′ | N ′ | M ′ |
|-------------------|---------------|-------|-------|-------|
| DenseNet201 | 200 | 3844 | 1152 | 128 |
| GoogLeNet | 59 | 3721 | 528 | 128 |
| InceptionResNetV2 | 244 | 3600 | 432 | 192 |
| InceptionV3 | 94 | 3600 | 768 | 192 |
| ResNet152 | 155 | 3969 | 1024 | 256 |
| VGG16 | 13 | 62001 | 2304 | 256 |
| VGG19 | 16 | 38688 | 2304 | 384 |
| YOLOv3 | 75 | 3844 | 1024 | 256 |
that the input activations get duplicated $k^2$ times, which means $k^2$ more DAC operations (and possibly memory accesses as well) than in a processor that natively implements convolution rather than general matrix multiplication. A consequence is that $M'$ is by far the smallest of the numbers in eq. (16), and therefore analog processors that implement convolution as matrix multiplication get the least amortization over their input DACs in eq. (14). The median values of $L'$, $N'$, and $M'$ for various neural networks are presented in table II.
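The weight-stationary mapping of eq. (16) can be sketched as follows (variable names are ours):

```python
def conv_as_matmul_dims(n, k, C_i, C_o):
    """Matrix dimensions of a conv layer under the weight-stationary
    mapping of eq. (16): activations become an L' x N' (im2col) matrix,
    weights an N' x M' matrix."""
    L = (n - k + 1) ** 2  # eq. (16a): number of output pixels (~n^2)
    N = k * k * C_i       # eq. (16b): unrolled receptive field
    M = C_o               # eq. (16c): output channels
    return L, N, M
```

For example, a hypothetical 64x64 input with a 3x3 kernel and 128 input and output channels gives $(L', N', M') = (3844, 1152, 128)$; note how $M'$ is by far the smallest dimension.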
## V. OPERATOR-SPECIALIZED ANALOG PROCESSORS
Thus far, we have seen that 1) the contribution of memory access energy to compute efficiency can be brought arbitrarily low by implementing networks with large arithmetic intensity on specialized processors, and 2) analog processors can further reduce computational energy consumption when performing matrix multiplication. The reduction in computational energy is proportional to the size of the matrix the analog processor can handle.
One of the inherent disadvantages of planar, matrix-multiplication-based processors in performing convolutions is that the matrix formed from the input has dimensions $(n - k + 1)^2 \times k^2 C_i$, which is a factor of $k^2$ larger than the actual activation data. When the convolution is performed digitally this is of little consequence, because the number of MACs required is the same for this matrix multiplication as it is for convolution: $(n - k + 1)^2 k^2 C_i$. However, performing the matrix multiplication with an analog processor using a matrix with $k^2$ more rows than necessary requires $k^2$ more DAC operations than should be theoretically necessary. Even worse, unless some additional logic is used to set up the matrix between the SRAM and the processor (which also consumes energy), it requires $k^2$ more memory reads than are in principle necessary, significantly increasing the memory access energy. Furthermore, since the numbers of channels in adjacent layers are often comparable (the output channels of one layer become the input channels of the next), the weight data loaded into the analog processor, which has dimensions $k^2 C_i \times C_{i+1}$, is highly rectangular. This increases $N$ relative to $M$, which in turn increases the contribution of the input-data DACs to the energy consumption per operation.
In contrast to analog processors designed for general matrix multiplication, there are classes of analog processors specialized for implementing convolutions. One technique for creating such a processor is to restrict it to operate on operators sharing one particular set of eigenvectors. While any linear operator may be expressed as a matrix, a matrix $X$ may be factored into the product of three matrices using eigendecomposition:
$$X = U \Lambda U^T, \quad (17)$$
where $U$ is a unitary (i.e. lossless) matrix of the eigenvectors of $X$, and $\Lambda$ is a purely diagonal matrix of the eigenvalues of $X$. The eigenvectors of a convolution are waves, and so when $X$ is a matrix representing a convolution, the eigenvector matrix $U$ represents a Fourier transform, while $U^T$ represents an inverse Fourier transform.
One technique for creating an operator-specialized processor is to statically implement the matrices $U$ and $U^T$ and to dynamically reconfigure only the eigenvalues $\Lambda$. In this case, changing from one linear operator to another requires changing only the diagonal entries of $\Lambda$. In other words, if the matrix $X$ is of size $m \times m$, changing the matrix to another convolution matrix only requires the modulation of $m$ weights in the analog processor instead of $m^2$ weights. In the particular case where $X$ represents a convolution, these eigenvalues are the Fourier transform of the kernel data. By tuning this set of $m$ elements, the matrix $X$ implemented by the analog processor can span the range of linear operators with the eigenvectors given by $U$.
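The claim that a convolution is specified by its $m$ eigenvalues can be checked numerically. The sketch below (our construction, assuming circular boundary conditions) builds a circulant matrix and verifies that multiplying by it equals a Fourier transform, a pointwise scaling by the kernel spectrum, and an inverse Fourier transform:

```python
import numpy as np

# A circular-convolution matrix is diagonalized by the Fourier basis:
# multiplying by X is equivalent to FFT -> scale by eigenvalues -> inverse FFT.
rng = np.random.default_rng(0)
m = 8
kernel = rng.standard_normal(m)
x = rng.standard_normal(m)

# Build the m x m circulant matrix X: column j is the kernel rolled by j
X = np.array([np.roll(kernel, j) for j in range(m)]).T

# Direct matrix-vector product: O(m^2) multiplies
y_direct = X @ x

# Eigenvalue route: the eigenvalues Lambda are the FFT of the kernel,
# so only m weights need to be reconfigured to change the operator.
Lam = np.fft.fft(kernel)
y_fourier = np.fft.ifft(Lam * np.fft.fft(x)).real

assert np.allclose(y_direct, y_fourier)
```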
Eigendecomposition is possible for planar analog processors, and has in fact been demonstrated in silicon photonic processors [11, 13]. However, there is an alternative to silicon photonics for implementing a convolution-specialized processor, called an optical 4F system, which has a particular set of advantages relative to planar convolution processors.
In planar analog processors, data is inserted into the processor as a one-dimensional array and is processed as it propagates along the second dimension. An optical 4F system, by contrast, is a volumetric processor: data is represented as a two-dimensional array, while the computation happens as light propagates in the third dimension. While this does bring dramatically higher information density and computational density, the most significant difference is that it allows the processor to scale to numbers of inputs that are entirely impractical for planar processors. Since the efficiency of analog compute was shown in eq. (11) to scale proportionally to the dimensions of the analog processor (in the limit of infinite arithmetic intensity), optical 4F systems can in theory reach computational efficiencies orders of
FIG. 4: Illustration of a transmission-mode optical 4F system performing convolutions with parallelized input channels. The input activation data can be tiled on the object plane, while the input filters can be tiled with appropriate padding before the Fourier transform is taken and the data is applied to the second SLM in the Fourier plane. In this arrangement one complete output channel is produced per measurement.
magnitude higher than planar processors.
An example of an optical 4F system processor is shown in fig. 4. It is composed of two spatial light modulators (SLMs), which might be based on either liquid crystal cells or dynamic metasurfaces, placed one focal length away from either side of a lens. A lens naturally performs a Fourier transform between these two planes, so the light transmitted through the first SLM is Fourier-transformed upon passing through the lens. The first SLM therefore provides the input data, and the first lens represents multiplication by the unitary Fourier matrix $U$. The second SLM is loaded with the Fourier transform of the kernel data, and the light transmitted through it is therefore the product of the Fourier transform of the input data with the Fourier transform of the kernel data. The second SLM thus represents multiplication by the diagonal eigenvalue matrix $\Lambda$.
A second lens is then placed one focal length after the second SLM, which represents multiplication by the second eigenvector matrix $U^T$. Finally, a detector is placed one focal length beyond the second lens, and the light impinging on the detector is therefore the convolution of the input data with the kernel data. The detector itself is sensitive only to the intensity (i.e. the norm square) of the incident field; however, the complex value of the field can nonetheless be recovered using interferometric methods. Alternatively, as others have pointed out, the nonlinear measurement performed by the optical absorption of semiconductors can be used naturally as the nonlinear activation of the neurons.
As shown in fig. 4, more than one input channel can be processed in parallel if the kernel data is appropriately padded before the Fourier transform is taken and the data is applied to the second SLM. This allows greater SLM utilization when small kernels are being used.
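What the 4F optics computes can be emulated digitally. The following sketch (single channel, circular boundary conditions, illustrative sizes) stands in for lens 1, the Fourier-plane SLM, and lens 2:

```python
import numpy as np

# Sketch of the 4F computation: the first lens takes a 2D Fourier
# transform of the object-plane data, the Fourier-plane SLM multiplies it
# pointwise by the (precomputed) kernel spectrum, and the second lens
# transforms back.
rng = np.random.default_rng(1)
n, k = 16, 3
activation = rng.standard_normal((n, n))
kernel = np.zeros((n, n))
kernel[:k, :k] = rng.standard_normal((k, k))  # zero-padded to the SLM size

# "Lens 1 -> Fourier-plane SLM -> Lens 2"
out = np.fft.ifft2(np.fft.fft2(activation) * np.fft.fft2(kernel)).real

# Reference: direct circular convolution, O(n^2 k^2) MACs
ref = np.zeros((n, n))
for dy in range(k):
    for dx in range(k):
        ref += kernel[dy, dx] * np.roll(activation, (dy, dx), axis=(0, 1))
assert np.allclose(out, ref)
```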
Unfortunately, from a compute systems perspective, traditional optical 4F systems have a fatal flaw: the output data from the convolution is measured four focal lengths away from the input data, which presumably
FIG. 5: Illustration of a reflection-mode optical 4F system (folded into a 2F overall length) processing a full convolutional layer with all input and output channels in two phases: (a) phase one, where an optical Fourier transform of the input activation data is taken and loaded into the Fourier plane SLM, (b) phase two, where the input channels are tiled onto the object plane SLM and the convolution of all input channels are measured in parallel. The process is repeated for each output channel.
must be physically implemented in its own chip. Since this convolution operation represents only the connections between two layers of neurons, implementing a deep neural network with more than two layers requires somehow bringing the output data from the detector chip back to the input spatial light modulator. Communicating this massive amount of data off-chip would entail energy costs large enough to negate all the advantages brought by the large-scale analog compute.
However, an optical 4F system can be folded using reflection-mode SLMs, as shown in fig. 5, in order to consolidate the first SLM and the CMOS image sensor side-by-side on a single chip and to use only a single lens. In this architecture, all significant data transfer between the two chips happens optically instead of electronically. On either side of the lens sit two chips, each split into two halves: an SLM (or metasurface) and a CMOS image sensor. Both chips are placed one focal length away from either side of the lens such that, whenever light passes between the two chips, a Fourier transform is taken by the lens.
This system computes convolutions in two phases: a loading phase and a compute phase. The first, loading phase is shown in fig. 5(a); its purpose is to take the Fourier transform of the activation data and load it into the second metasurface. A set of input feature maps is written to the input SLM in the first chip, which is illuminated. The Fourier transform of the reflected light is delivered to the CMOS image sensor (CIS) in the second chip, and this data is electronically transferred to the second SLM within the same chip using ADC and DAC operations. As with the transmission-mode, unfolded 4F system in fig. 4, reflection-mode 4F systems like the one in fig. 5 can take the convolution of multiple input channels in parallel. The final result of this phase is that the SLM in the second chip is configured with the Fourier transform of the activation data.
In the second, compute phase, the input kernel weight data is applied to the first SLM. This is then illuminated at a slightly oblique angle so that the reflected light impinges upon the SLM in the second chip. When this light is reflected, the lens takes another Fourier transform, and the light impinging on the CIS in the first chip is the convolution of the input feature map data with the kernel data.
If the input data requires $n^2 C_i$ total pixels, loading the optical Fourier transform of the activation data will cost
$$E_{fft} = n^2 C_i (2 e_{adc} + 4 e_{dac}) \quad (18)$$
energy. One DAC operation per pixel is required to write the input data to the first metasurface, while two ADC operations and two DAC operations are required in order to reconstruct the complex field data from the intensity data and then apply it to the second SLM.
Since input channels can be processed in parallel while looping over the output channels, the second phase involves $2K = 2 k^2 C_i C_{i+1}$ DAC operations and $2 n^2 C_{i+1}$ ADC operations in the CIS to recover the field:
$$E_{conv} = 2 k^2 C_i C_{i+1} e_{dac} + 2 n^2 C_{i+1} e_{adc}. \quad (19)$$
Therefore the total energy associated with the analog compute of this layer is $E_{fft} + E_{conv}$,
$$E_{op} = 2 n^2 (C_i + C_{i+1}) e_{adc} + 2 C_i (2 n^2 + k^2 C_{i+1}) e_{dac}. \quad (20)$$
The total number of operations performed is $N_{op} = 2 n^2 k^2 C_i C_{i+1}$. Therefore the efficiency of the approach is
$$\eta = \frac{1}{e_m/a + e_{adc} / \left( \frac{k^2 C_i C_{i+1}}{C_i + C_{i+1}} \right) + 2 e_{dac} / (k^2 C_{i+1}) + e_{dac} / n^2}, \quad (21)$$
in the limit that the metasurfaces are large enough to handle all of the activation or weight data.
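Eqs. (18)-(21) can be collected into a single per-layer energy model; the sketch below uses our own function name and argument conventions:

```python
def optical_4f_efficiency(n, k, C_i, C_o, e_dac, e_adc, e_m, a):
    """Efficiency (operations per joule) of the folded 4F system for one
    conv layer, following eqs. (18)-(21). `a` is the arithmetic
    intensity and `e_m` the memory access energy."""
    E_fft = n**2 * C_i * (2 * e_adc + 4 * e_dac)                    # eq. (18)
    E_conv = 2 * k**2 * C_i * C_o * e_dac + 2 * n**2 * C_o * e_adc  # eq. (19)
    N_op = 2 * n**2 * k**2 * C_i * C_o
    e_op = (E_fft + E_conv) / N_op  # analog energy per operation, eq. (20)
    return 1.0 / (e_m / a + e_op)   # eq. (21)
```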
In order to take into account the finite size of the metasurfaces, which may not be large enough to fit the activation data from all channels at once, we first find the number of input channels that can practically be handled simultaneously. For a metasurface of dimension $n_x \times n_y \equiv \hat{N}$, the number of input channels that can be included at once, $C'$, is
$$C' = \lfloor \hat{N} / n^2 \rfloor. \quad (22)$$
Using this in place of the actual number of software-defined input channels, we can derive the factors by which energy is saved in the optical 4F system in the case that $C' \geq 1$:
$$L = n^2 \quad (23a)$$
$$N = \frac{k^2 C' C_{i+1}}{C' + C_{i+1}} \quad (23b)$$
$$M = k^2 C_{i+1} / 2. \quad (23c)$$
In terms of these parameters, the energy per operation of the optical 4F system is given in the usual way,
$$e_{op} = e_{dac}/M + e_{dac}/L + e_{adc}/N. \quad (24)$$
For an optical 4F system, the median values of $L$, $N$, and $M$ as per eq. (23) for various neural networks are presented in table III.
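For completeness, eqs. (22)-(24) in code form (a sketch; the names are ours):

```python
def optical_4f_factors(n, k, C_o, N_hat):
    """Amortization factors for a finite n_x x n_y = N_hat metasurface,
    eqs. (22)-(23), assuming C' >= 1."""
    C_prime = N_hat // (n * n)                   # eq. (22)
    L = n * n                                    # eq. (23a)
    N = k**2 * C_prime * C_o / (C_prime + C_o)   # eq. (23b)
    M = k**2 * C_o / 2                           # eq. (23c)
    return L, N, M

def e_op_4f(L, N, M, e_dac, e_adc):
    """Effective energy per operation of the 4F system, eq. (24)."""
    return e_dac / M + e_dac / L + e_adc / N
```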
## VI. ANALYTIC RESULTS
The formulas given in eqs. (3), (5), (14), and (24) can be used to estimate the efficiency of evaluating a given CNN layer on any one of those four compute platforms. They depend on the energy values for memory access, DAC/ADC operations, and digital multiplication. Estimates for many of these quantities are given in table IV,
TABLE III: Median values of $L$, $N$, and $M$ for the convolutional layers of various well-known neural networks, considering an optical 4F system as the computational substrate. The values were obtained considering a 1-Mpixel (per channel) input image and an infinitely large metasurface (i.e. $C' \to \infty$).
| Network | # of layers | L | N | M |
|-------------------|---------------|-------|------|------|
| DenseNet201 | 200 | 3844 | 272 | 136 |
| GoogLeNet | 59 | 3721 | 128 | 64 |
| InceptionResNetV2 | 244 | 3600 | 224 | 112 |
| InceptionV3 | 94 | 3600 | 240 | 120 |
| ResNet152 | 155 | 3969 | 1024 | 512 |
| VGG16 | 13 | 62001 | 2304 | 1152 |
| VGG19 | 16 | 38688 | 3456 | 1728 |
| YOLOv3 | 75 | 3844 | 512 | 256 |
TABLE IV: Energy per operation for various operations of digital and analog computers. These assume a 45 nm technology node, a voltage of 0.9 V, and 8-bit values per operation. The memory access energy assumes a bank size of 96 kB, since this is the bank size used to construct the TPU SRAM bank.
| Quantity | Energy |
|----------------------------------------------|---------|
| e_m (96 kB SRAM) [3] | 4.3 pJ |
| e_mac [3] | 0.23 pJ |
| e_adc [20] | 0.25 pJ |
| e_dac [21] | 0.01 pJ |
| e_opt [eq. (A8)] | 0.01 pJ |
| e_load for 4 µm pitch, N = 256 [eq. (A6)] | 0.08 pJ |
| e_load for 250 µm pitch, N = 40 [eq. (A6)] | 0.8 pJ |
| e_load for 2.5 µm pitch, N = 2048 [eq. (A6)] | 0.04 pJ |
and formulas for deriving the loads used to estimate the DAC energies of various analog compute platforms are given in the appendix.
Each of these values depends on the CMOS technology node, but scaling laws can be used to interpolate between technology nodes [22]. We compare the various compute platforms by considering a CNN layer with the parameters given in table V; the resulting efficiencies are plotted as a function of technology node in fig. 6.
While all processors improve with technology node, there is roughly an order of magnitude difference between digital in-memory compute processors and silicon photonic processors, and yet another order of magnitude difference to be expected between silicon photonic pro-
TABLE V: Convolution parameters used to estimate efficiencies of various processors in fig. 6. The arithmetic intensity follows from the other parameters by eq. (9).
| Parameter | Symbol | Value |
|----------------------|-----------|-------|
| Input channels | C_i | 128 |
| Output channels | C_{i+1} | 128 |
| Filter size | k | 3 |
| Input size | n | 512 |
| Arithmetic intensity | a | 230 |
FIG. 6: Efficiencies from analytic models of various compute architectures as a function of technology node.
<details>
<summary>Image 6 Details</summary>

Line chart of efficiency [TOPS/W, logarithmic scale] versus CMOS technology node [nm] (180 down to 7) for CPU, digital in-memory, silicon photonic, and optical 4F architectures. Every architecture improves as the node shrinks; approximate efficiencies at 180 nm and 7 nm are 0.003 and 0.03 (CPU), 0.5 and 10 (digital in-memory), 5 and 50 (silicon photonic), and 50 and 200 (optical 4F) TOPS/W.
</details>
FIG. 7: Contributions of energy consumption per operation for various processor types. DIM is digital in-memory, SP is silicon photonic, and O4F is optical 4F system architectures. The CNN layer parameters are in table V, and assumptions about architectural details are given in the text. The technology node is assumed to be 32 nm for all processor types.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Bar Chart: Energy per Operation by Processor Type
### Overview
The image is a bar chart comparing the energy per operation (in picojoules) for memory and computation across four processor types: CPU, DIM, SP, and O4F. The y-axis uses a logarithmic scale.
### Components/Axes
* **X-axis:** Processor Type (CPU, DIM, SP, O4F)
* **Y-axis:** Energy per operation [pJ] (logarithmic scale)
* Axis markers: 10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2
* **Legend:** Located in the top-right corner.
* Blue: Memory
* Orange: Computation
### Detailed Analysis
* **CPU:**
* Memory (Blue): Approximately 100 pJ
* Computation (Orange): Approximately 0.2 pJ
* **DIM:**
* Memory (Blue): Approximately 0.006 pJ
* Computation (Orange): Approximately 0.15 pJ
* **SP:**
* Memory (Blue): Approximately 0.006 pJ
* Computation (Orange): Approximately 0.03 pJ
* **O4F:**
* Memory (Blue): Approximately 0.006 pJ
* Computation (Orange): Approximately 0.004 pJ
### Key Observations
* For the CPU, memory operations consume significantly more energy than computation.
* For DIM, SP, and O4F, computation consumes more energy than memory.
* CPU memory operations require substantially more energy than memory operations on the other processor types.
* The energy consumption for memory operations is roughly the same for DIM, SP, and O4F.
### Interpretation
The chart highlights the energy efficiency differences between processor types for memory and computation operations. The CPU exhibits a high energy cost for memory operations, suggesting it may be less efficient for memory-intensive tasks compared to DIM, SP, and O4F. The DIM, SP, and O4F processors show a lower energy footprint for memory but a higher relative cost for computation. This suggests they may be optimized for computational tasks while maintaining memory efficiency. The data suggests a trade-off between memory and computation energy efficiency across different processor architectures.
</details>
cessors and optical 4F systems. While this difference is clearly algorithm-dependent, the underlying hardware for analog compute systems must be large enough to be able to exploit the potential algorithmic advantages, which is what is enabled by moving from a two-dimensional silicon photonic processor to a fundamentally three-dimensional processor akin to an optical 4F system.
The breakdown of improvements into memory and computational energy reductions is shown in fig. 7, which plots the contribution to the energy per operation from memory and computational elements separately for each processor type. Exploiting high arithmetic intensity with in-memory compute vastly reduces energy consumption relative to CPUs by first driving memory energy well below computational energy. The analog processors in turn reduce the computational energy itself, with less computational energy on a per-operation basis for analog processors with more inputs.
It is worth noting that the efficiencies reported in fig. 6 for the digital in-memory processor are significantly higher than those of the Google TPU, which reported 0.3-2 TOPS/W depending on the CNN architecture, for a chip manufactured at a 28-nm node. The in-memory compute digital processor modeled here has the same architectural parameters as the TPU: a 256 by 256 systolic array and 24 MiB of SRAM divided into 256 96-KB banks. Here we predict that number should be roughly 5 TOPS/W, which is a significantly higher efficiency than reported in the literature [1]. However, we note that this estimate simplifies the energy costs associated with the digital multiplication and with storing and transporting data in and between the processing elements of the systolic array.
The silicon photonics processor modelled in figs. 6 and 7 assumed an array size of 40 by 40, which is typical of most processors reported in the literature [10-13], since the various modulator technologies typically require array pitches in the 100-400 µm range. The computational energy consumption is strongly limited by the optical modulator technology, which currently stands at around 7 pJ/byte, as discussed in section A1. We assume in our model that this will improve to 0.5 pJ over time, but even with this assumed advantage it is clear in fig. 6 that silicon photonics will have a difficult time maintaining an efficiency advantage over digital in-memory compute technologies unless the processor sizes can be scaled up. We also assume a 24-MiB SRAM for the silicon photonics processor, divided into 40 banks of 600 KB each, following the TPU architecture.
The optical 4F system is based on the architecture in fig. 5, with 4-Mpixel SLMs and a 24-MiB SRAM divided into 2048 12-KB banks, again following the TPU architecture. The SLM pitch for the DAC loads involved in active-matrix addressing of the SLMs was assumed to be 2.5 µm, which results in a line capacitance of 0.9 fF and a load energy of 40 fJ as shown in table IV. The optical energy per pixel is based on 1550-nm light, and contributes 10 fJ/pixel per operation as shown in table V. The large array sizes enabled by realistic SLM dimensions reduce computational energy consumption even below the memory consumption in fig. 7.
## VII. COMPUTATIONAL RESULTS
Thus far, we have provided simple analytic formulas that estimate the efficiency of various AI inference platforms on the basis of how they scale. These formulas are approximations with several limitations, the biggest of which is that they do not account for situations where the matrices involved exceed either the capacity of the in-memory compute device or its inputs. In that circumstance the problem must be broken up into several smaller matrix multiplications. To get around this limitation we developed cycle-accurate models of a systolic array and of an in-reflection optical 4F system, and tested those models by evaluating various CNNs for a given input image size. The more accurate computational results are then compared with the analytic models from the previous sections.
## A. Systolic array efficiency estimation
For analyzing the energy efficiency of a systolic array, we considered an architecture similar to that of the Google TPU [ref:TPU], with a weight-stationary systolic array of size 256 × 256 tiles. Each of the 256 ports of the array has access to an individual 96-KB SRAM block, totaling 24 MiB of buffer memory for storing activations (i.e. inputs/outputs of a convolutional layer). The weights are stored in DRAM and accessed according to the convolutional layer being executed. The activations and weights are 8-bit fixed point.
In terms of energy costs, we used as reference the SRAM and MAC energy values for a 45-nm process at 0.9 V from [3]: an SRAM read/write of 1.25 pJ/byte (8-KB memory) and an 8-bit MAC operation of 0.23 pJ. To align with the 96-KB SRAM block size of the TPU, the 8-KB SRAM energy cost was scaled in size by a factor of √(96K/8K) = 3.46 in accordance with eq. (A2), resulting in 4.33 pJ/byte. Associated with each MAC operation, we also included the energy costs of the load and of the memory read/write inside each array tile (to store/propagate the 8-bit input and 32-bit accumulation, i.e. 40 bits). A load energy cost of 2.82 fJ/bit was computed using eq. (A6), where the distance between array tiles was approximated from the 256 × 256 array's area occupancy (24%) of the entire TPU chip (331 mm²), resulting in a distance of 34.8 µm between tiles. The internal array memory energy cost was obtained by scaling the 8-KB SRAM block to 40 bits, resulting in 1.25 pJ/byte × √(5/8K) = 31.25 fJ/byte.
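The size-scaling arithmetic above can be checked with a short script. This is a minimal sketch: `scale_sram` is our hypothetical helper, and the 8-KB reference energy is the 1.25 pJ/byte figure from [3].

```python
import math

def scale_sram(e_ref, n_ref_bytes, n_bytes):
    """Scale a reference SRAM access energy to a new bank size via eq. (A2):
    energy grows as the square root of the memory size."""
    return e_ref * math.sqrt(n_bytes / n_ref_bytes)

e_8k = 1.25e-12  # pJ/byte for an 8-KB bank at 45 nm, from [3]

# 96-KB TPU activation bank: factor sqrt(96K/8K) = 3.46
print(f"96-KB bank: {scale_sram(e_8k, 8e3, 96e3) * 1e12:.2f} pJ/byte")  # 4.33 pJ/byte

# 40-bit (5-byte) in-tile register: factor sqrt(5/8K)
print(f"40-bit tile storage: {scale_sram(e_8k, 8e3, 5) * 1e15:.2f} fJ/byte")  # 31.25 fJ/byte
```

Both printed values match the figures quoted in the text.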
Lastly, using the techniques presented in [22], we scaled all the energy values (except for the load, since it is not directly process-dependent) from the 45-nm process to the appropriate technology nodes, ranging from 180 nm down to 7 nm. The results are presented in fig. 8. Both the analytic expression and the cycle-accurate model follow the same trend, with a slight divergence as the technology node is reduced. This is accounted for by the fact that e_load does not depend on the technology node, so its cost becomes a dominant factor in the overall energy as the other energy sources diminish with node size.
## B. Optical computer efficiency estimation
For the optical 4F system we considered 4-Mpixel SLMs, along with the same 24-MiB SRAM as in the systolic array analysis. With this, the SRAM is partitioned into 2048 equal parts (one per metasurface row), resulting in a size-scaled SRAM read/write energy of 1.55 pJ/byte. The DAC, ADC and laser energies were obtained using the values in table IV, considering a 2.5-µm pitch.
FIG. 8: Efficiency comparison between a cycle-accurate model and the analytic expression given by eq. (8) and the values in table I. Both models are running YOLOv3 (1-Mpixel input image) using a 256 × 256 weight-stationary systolic array and a 24-MiB SRAM (as in the Google TPUv1).
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Line Chart: Efficiency vs. Node Size
### Overview
The image is a line chart comparing the efficiency (in TOPS/W) of two models, "Analytic expression" and "Cycle-accurate Model," across different node sizes (in nm). The chart shows how efficiency changes as node size decreases.
### Components/Axes
* **X-axis:** Node [nm]. The scale ranges from 7 to 180 nm, with markers at 7, 10, 14, 16, 20, 32, 45, 65, 90, 130, and 180.
* **Y-axis:** Efficiency [TOPS/W]. The scale ranges from 0 to 8, with markers at every integer value.
* **Legend:** Located in the top-left corner.
* Red line: "Analytic expression"
* Blue line: "Cycle-accurate Model"
### Detailed Analysis
* **Analytic expression (Red line):**
* Trend: Generally increasing as node size decreases.
* Data Points:
* 180 nm: ~0.5 TOPS/W
* 130 nm: ~0.7 TOPS/W
* 90 nm: ~1.1 TOPS/W
* 65 nm: ~1.6 TOPS/W
* 45 nm: ~2.3 TOPS/W
* 32 nm: ~3.4 TOPS/W
* 20 nm: ~6.0 TOPS/W
* 16 nm: ~6.5 TOPS/W
* 14 nm: ~6.8 TOPS/W
* 10 nm: ~7.1 TOPS/W
* 7 nm: ~7.4 TOPS/W
* **Cycle-accurate Model (Blue line):**
* Trend: Generally increasing as node size decreases, but consistently lower than the "Analytic expression" line.
* Data Points:
* 180 nm: ~0.4 TOPS/W
* 130 nm: ~0.6 TOPS/W
* 90 nm: ~0.9 TOPS/W
* 65 nm: ~1.3 TOPS/W
* 45 nm: ~1.9 TOPS/W
* 32 nm: ~2.9 TOPS/W
* 20 nm: ~5.3 TOPS/W
* 16 nm: ~6.0 TOPS/W
* 14 nm: ~6.2 TOPS/W
* 10 nm: ~6.6 TOPS/W
* 7 nm: ~7.0 TOPS/W
### Key Observations
* Both models show increasing efficiency as node size decreases.
* The "Analytic expression" model consistently shows higher efficiency than the "Cycle-accurate Model."
* The rate of increase in efficiency is more pronounced between 32 nm and 20 nm for both models.
### Interpretation
The chart suggests that smaller node sizes generally lead to higher efficiency in terms of TOPS/W. The "Analytic expression" model provides an optimistic estimate of efficiency, while the "Cycle-accurate Model" provides a more conservative estimate. The significant jump in efficiency between 32 nm and 20 nm indicates a potential technological shift or optimization in that range. The consistent difference between the two models suggests that the analytic expression may not fully capture all the factors affecting efficiency in a real-world implementation.
</details>
FIG. 9: Comparison of eq. (24) with a cycle-accurate model of the optical 4F processor running YOLOv3 (1-Mpixel input image) using 4-Mpixel SLMs and a 24-MiB SRAM.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Chart: Efficiency vs. Node Size
### Overview
The image is a line chart comparing the efficiency (TOPS/W) of two models, "Analytic expression" and "Cycle-accurate model", across different node sizes (nm). The x-axis represents the node size in nanometers (nm), ranging from 180nm to 7nm. The y-axis represents efficiency in TOPS/W, ranging from 0 to 500.
### Components/Axes
* **Title:** There is no explicit title on the chart.
* **X-axis:**
* Label: "Node [nm]"
* Scale: 180, 130, 90, 65, 45, 32, 20, 16, 14, 10, 7
* **Y-axis:**
* Label: "Efficiency [TOPS/W]"
* Scale: 0, 100, 200, 300, 400, 500
* **Legend:** Located in the top-left corner.
* Red line: "Analytic expression"
* Blue line: "Cycle-accurate model"
### Detailed Analysis
* **Analytic expression (Red line):**
* Trend: Generally increasing as node size decreases.
* Data points:
* 180 nm: ~50 TOPS/W
* 130 nm: ~60 TOPS/W
* 90 nm: ~90 TOPS/W
* 65 nm: ~130 TOPS/W
* 45 nm: ~180 TOPS/W
* 32 nm: ~250 TOPS/W
* 20 nm: ~420 TOPS/W
* 16 nm: ~450 TOPS/W
* 14 nm: ~460 TOPS/W
* 10 nm: ~480 TOPS/W
* 7 nm: ~520 TOPS/W
* **Cycle-accurate model (Blue line):**
* Trend: Generally increasing as node size decreases, but at a slower rate than the "Analytic expression".
* Data points:
* 180 nm: ~30 TOPS/W
* 130 nm: ~40 TOPS/W
* 90 nm: ~60 TOPS/W
* 65 nm: ~80 TOPS/W
* 45 nm: ~100 TOPS/W
* 32 nm: ~130 TOPS/W
* 20 nm: ~230 TOPS/W
* 16 nm: ~240 TOPS/W
* 14 nm: ~245 TOPS/W
* 10 nm: ~255 TOPS/W
* 7 nm: ~270 TOPS/W
### Key Observations
* The "Analytic expression" model consistently shows higher efficiency than the "Cycle-accurate model" across all node sizes.
* The efficiency of both models increases as the node size decreases.
* The "Analytic expression" model experiences a more significant increase in efficiency between 32nm and 20nm compared to the "Cycle-accurate model".
### Interpretation
The chart demonstrates the relationship between node size and efficiency for two different models. The data suggests that smaller node sizes generally lead to higher efficiency in both models. The "Analytic expression" model appears to be more sensitive to changes in node size, exhibiting a steeper increase in efficiency as node size decreases compared to the "Cycle-accurate model". This could indicate that the "Analytic expression" model is better optimized for smaller node sizes or that it benefits more from the advancements in smaller node technologies. The significant jump in efficiency for the "Analytic expression" model between 32nm and 20nm could be attributed to a specific technological breakthrough or optimization that was more effectively utilized by this model.
</details>
A comparison between the analytic expression and a cycle-accurate model of the optical 4F system is presented in fig. 9. The figure provides an overall curve for the efficiency, with significant gain when constructing the device with smaller technology nodes. The main differences that explain the divergence between the analytic and cycle-accurate models include:
- The cycle-accurate model considers the exact number of metasurface executions to account for output detector ADC read operations, output memory accesses, and total laser energy consumed.
- Equation (23) assumes that the dimensions of the output are the same as those of the input (i.e. m = n), which naturally does not account for strides greater than 1.
- The value of e_dac in eq. (24) comprises e_dac,1 + e_load + e_opt, resulting in an energy cost based on the number of active pixels in the metasurface. However, the cycle-accurate model estimates the energy costs more precisely by separating the pixel-wise energy (e_dac,1 + e_load) from the metasurface-size-dependent laser energy (e_opt).
## C. Optical computer energy cost distribution
The cycle-accurate model for the optical 4F system can provide a detailed summary of the energy cost distribution based on four different system components: DAC, ADC, SRAM, and laser. These results for VGG19 and YOLOv3 across different technology nodes are presented in fig. 10, with the values specified in picojoules per MAC operation.
Naturally, as the node size is reduced, the ADC and SRAM energy costs decrease. The DAC energy, on the other hand, includes the dominating e_load term, and since the latter is independent of technology node, we see very little reduction in the overall DAC energy cost across the different nodes. Just as with e_load, the laser energy e_opt does not change with technology node and is thus constant.
Comparing the energy cost distributions between VGG19 (left) and YOLOv3 (right), it is curious to note that a network with a much larger arithmetic intensity, as in the case of VGG19 (refer to table I), presents a higher SRAM energy per MAC operation. This is explained by the fact that the cycle-accurate model takes into account the sizes of the SLMs and of the inputs, making VGG19 slightly less efficient in terms of placement of the input image pixels onto the metasurface, since it presents (on average per layer) larger input images with more channels. This results in more metasurface executions, and consequently more output activation buffering (SRAM reads/writes), to complete the convolutions in the network. If we consider an infinitely large metasurface, this artifact naturally goes away and VGG19 becomes more efficient than YOLOv3 in terms of SRAM energy per MAC operation.
## VIII. CONCLUSIONS
In-memory compute and analog compute techniques are both effective techniques to address different contributions to total processor energy cost. While in-memory compute is able to reduce memory access energy per operation in the context of a high arithmetic intensity algorithm, analog compute is able to reduce the computational energy itself in proportion to the scale of the analog processor. Convolutional neural networks are a perfect application for such analog, in-memory compute architectures since they have high arithmetic intensity, large linear operators, and typically require low bit precision for forward propagation.
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Bar Charts: Energy per MAC vs. Node Size for VGG19 and YOLOv3
### Overview
The image presents two bar charts comparing the energy per MAC (multiply-accumulate operation) for different components (DAC, ADC, SRAM, Laser) across various node sizes (180nm to 7nm) for two neural network architectures: VGG19 (left) and YOLOv3 (right). The y-axis represents energy per MAC in picojoules (pJ) on a logarithmic scale, while the x-axis represents the node size in nanometers (nm).
### Components/Axes
**Left Chart (VGG19):**
* **Title:** VGG19
* **Y-axis:** Energy per MAC [pJ] (logarithmic scale)
* Markers: 10<sup>-1</sup>, 10<sup>-2</sup>, 10<sup>-3</sup>, 10<sup>-4</sup>
* **X-axis:** Node [nm] (linear scale)
* Markers: 180, 130, 90, 65, 45, 32, 20, 16, 14, 10, 7
* **Legend (top-right):**
* Yellow: DAC (Digital-to-Analog Converter)
* Blue: ADC (Analog-to-Digital Converter)
* Green: SRAM (Static Random-Access Memory)
* Red: Laser
**Right Chart (YOLOv3):**
* **Title:** YOLOv3
* **Y-axis:** Energy per MAC [pJ] (logarithmic scale)
* Markers: 10<sup>-1</sup>, 10<sup>-2</sup>, 10<sup>-3</sup>, 10<sup>-4</sup>
* **X-axis:** Node [nm] (linear scale)
* Markers: 180, 130, 90, 65, 45, 32, 20, 16, 14, 10, 7
* **Legend (top-right):**
* Yellow: DAC (Digital-to-Analog Converter)
* Blue: ADC (Analog-to-Digital Converter)
* Green: SRAM (Static Random-Access Memory)
* Red: Laser
### Detailed Analysis
**VGG19 Chart:**
* **SRAM (Green):** The energy consumption of SRAM decreases as the node size decreases.
* 180nm: ~0.04 pJ
* 7nm: ~0.008 pJ
* **ADC (Blue):** The energy consumption of ADC decreases as the node size decreases.
* 180nm: ~0.015 pJ
* 7nm: ~0.002 pJ
* **Laser (Red):** The energy consumption of Laser decreases as the node size decreases.
* 180nm: ~0.005 pJ
* 7nm: ~0.001 pJ
* **DAC (Yellow):** The energy consumption of DAC decreases as the node size decreases.
* 180nm: ~0.001 pJ
* 7nm: ~0.0005 pJ
**YOLOv3 Chart:**
* **SRAM (Green):** The energy consumption of SRAM is relatively constant across different node sizes.
* Ranges between ~0.004 pJ and ~0.008 pJ
* **ADC (Blue):** The energy consumption of ADC is relatively constant across different node sizes.
* Ranges between ~0.0005 pJ and ~0.002 pJ
* **Laser (Red):** The energy consumption of Laser is relatively constant across different node sizes.
* Ranges between ~0.004 pJ and ~0.008 pJ
* **DAC (Yellow):** The energy consumption of DAC is relatively constant across different node sizes.
* Ranges between ~0.001 pJ and ~0.002 pJ
### Key Observations
* For VGG19, energy consumption for all components decreases as node size decreases. SRAM consistently has the highest energy consumption, followed by ADC, Laser, and DAC.
* For YOLOv3, energy consumption for all components remains relatively constant across different node sizes. DAC consistently has the lowest energy consumption, while SRAM and Laser have the highest.
* The energy consumption of SRAM is significantly higher than other components in VGG19, especially at larger node sizes.
* The energy consumption of ADC is significantly lower than other components in YOLOv3, especially at larger node sizes.
### Interpretation
The charts illustrate the impact of node size on the energy efficiency of different components within VGG19 and YOLOv3 neural network architectures. The VGG19 chart suggests that reducing node size can significantly improve the energy efficiency of all components, particularly SRAM. In contrast, the YOLOv3 chart indicates that energy consumption is less dependent on node size for this architecture. This could be due to differences in the architecture, memory access patterns, or other factors. The data suggests that optimizing memory components like SRAM is crucial for improving the overall energy efficiency of neural networks, especially for architectures like VGG19. The relatively constant energy consumption in YOLOv3 might indicate that other factors, such as data transfer or computational complexity, are more dominant in determining energy consumption for this architecture.
</details>
FIG. 10: Energy cost distribution for the cycle-accurate model of the 4F optical system running VGG19 (left) and YOLOv3 (right).
To provide some intuition for how much energy efficiency can be improved using one or both of these techniques, we have provided simple analytic formulas estimating the efficiency of a range of processor types, including digital in-memory processors like a systolic array, analog in-memory compute processors, and optical 4F systems, which are a class of analog in-memory processor specialized to convolutions. These analytic formulas, when applied to the average neural network parameters provided in tables I and II, show good agreement with a cycle-accurate model of the TPU architecture in figs. 8 and 9.
As shown in fig. 6, these approaches perform orders of magnitude better than CPUs at any modern technology node. The largest improvement is due to the reduction in memory access energy per operation for in-memory compute processors, as shown in fig. 7. However, this technique is so effective that computational energy becomes the dominant contribution, which is then improved by analog computing. Since analog computing's energy advantage is proportional to the scale of the analog processor, optical 4F systems have a particular advantage: they can be scaled large enough to reduce the computational energy per operation below the minimum memory energy required for an in-memory compute processor evaluating a modern CNN algorithm.
## Appendix A: Processor Energy Model Parameters
In this section, we provide derivations for the typical energies per operation associated with memory access, digital MAC, ADC, and DAC, which are all necessary in order to properly compare the various computing schemes discussed in this paper.
The energy consumed by a digital MAC operation scales with the number of gates in the logical unit: indeed, the lower bound of a digital MAC is set by the Landauer limit, which is proportional to the number of gates. For a serial-parallel multiplier, the number of gates is G = 6B², and for other multiplier implementations the area or gate count is still proportional to B² [23], where B is the number of bits of the operand. A full adder adds another nine gates per bit, so we can write
$$e_{mac} = \gamma_{mac} \left( 6 B^2 + 9 B \right) k T \quad (A1)$$
where k is Boltzmann's constant, T is temperature, and γ_mac is a dimensionless constant. Landauer's limit specifies that the energy per MAC is bounded on the lower end by γ_mac > ln(2).[24] Typically, γ_mac ≈ 122,500 for a 45-nm process [3], so current digital multipliers sit several orders of magnitude above what could in theory be achieved.
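As a sanity check on eq. (A1), the quoted γ_mac reproduces the ≈0.23 pJ per 8-bit MAC reference value for a 45-nm process used in section VII. A minimal sketch (the function name is ours):

```python
k_B = 1.380649e-23  # Boltzmann constant [J/K]
T = 300.0           # assumed room temperature [K]

def e_mac(B, gamma_mac):
    """Energy per digital MAC from eq. (A1): gamma_mac * (6*B^2 + 9*B) * kT."""
    return gamma_mac * (6 * B**2 + 9 * B) * k_B * T

# gamma_mac ~ 122,500 for a 45-nm process (from the text)
print(f"{e_mac(8, 122_500) * 1e12:.2f} pJ per 8-bit MAC")  # 0.23 pJ per 8-bit MAC
```

Setting γ_mac = ln(2) instead gives the Landauer floor, roughly six orders of magnitude lower.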
All accelerators will need internal memory for both neural network parameters and intermediate variables (unless an analog processor is built with a large enough capacity to store the entire network which is currently impractical). Digital electronic SRAM banks have an energy per operation that scales as the length of the bit and word lines used to address and write data to the SRAM, since most of the power is consumed in charging and discharging the effective capacitors formed by these lines. In general then, the energy per memory access can be written as,[3]
$$e_{m} = e_{m0} \sqrt{N_m} \quad (A2)$$
where N_m is the size of the memory bank, and e_m0 is a constant with units of energy. The scaling presented here does not reflect the lower bound of energy consumption according to the Landauer limit, since currently it is the charging and discharging of capacitive lines that drives the energy consumption of SRAM rather than the switching of gates, which is why it scales according to the root of the array size. In the limit of a single bit cell, one might compare e_m0 with the Landauer limit by setting e_m0 = γ_m kT. The resulting γ_m is many more orders of magnitude away from the Landauer limit than even digital MACs: γ_m ≈ 3 × 10^6 for a 45-nm CMOS process,
which corresponds to e_m0 ≈ 5 fJ. It can be argued that the sheer value of e_m0 relative to the Landauer limit, together with the fact that power consumed in the capacitance of the addressing lines causes SRAM energy to scale as the root of the array size, is broadly the source of computing's most severe energy problem.[3] Fortunately, in the case of specialized processors implementing operations with high arithmetic intensity like convolutions, we are able to significantly mitigate this problem.[5]
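Eq. (A2) with e_m0 ≈ 5 fJ ties back to the per-byte figures used in section VII. This sketch assumes N_m counts bits, which is the reading that reproduces the ≈1.25 pJ/byte quoted for an 8-KB bank; the helper name is ours.

```python
import math

e_m0 = 5e-15  # ~5 fJ constant for a 45-nm CMOS process (from the text)

def e_mem(bank_bytes):
    """Energy per access from eq. (A2), assuming N_m is the bank size in bits."""
    return e_m0 * math.sqrt(bank_bytes * 8)

print(f"8-KB bank:  {e_mem(8 * 1024) * 1e12:.2f} pJ")   # ~1.28 pJ, close to the 1.25 pJ/byte reference
print(f"96-KB bank: {e_mem(96 * 1024) * 1e12:.2f} pJ")  # ~4.43 pJ, close to the scaled 4.33 pJ/byte
```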
For analog computation, ADC energy depends exponentially on bit precision, most fundamentally because it requires sufficient signal to noise ratio to distinguish the levels. When these levels are defined in terms of linear voltage steps, the ADC energy per sample is[20, 25]
$$e_{adc} = \gamma_{adc} k T \, 2^{2B} \quad (A3)$$
where k is Boltzmann's constant, T is the temperature, and γ_adc is a dimensionless constant. It has been argued[20] that γ_adc is bounded on the lower end at γ_adc > 3 by thermal noise, and an empirical survey in the same work shows that the state-of-the-art value for on-chip ADCs is γ_adc ≈ 1404 for a 65-nm process, which scales to about 927 at 45 nm.
DACs scale in the same manner as ADCs:
$$e_{dac} = \gamma_{dac} k T \, 2^{2B} \quad (A4)$$
and for similar reasons. However, the state-of-the-art value is γ_dac ≈ 39.[21]
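Since eqs. (A3) and (A4) differ only in the dimensionless constant, one helper covers both. A sketch using the state-of-the-art constants quoted above (`e_converter` is our name):

```python
k_B = 1.380649e-23  # Boltzmann constant [J/K]
T = 300.0           # assumed room temperature [K]

def e_converter(B, gamma):
    """Energy per sample for an ADC or DAC, eqs. (A3)-(A4): gamma * kT * 2^(2B)."""
    return gamma * k_B * T * 2 ** (2 * B)

print(f"8-bit ADC (gamma ~ 927): {e_converter(8, 927) * 1e12:.2f} pJ")  # 0.25 pJ
print(f"8-bit DAC (gamma ~ 39):  {e_converter(8, 39) * 1e15:.1f} fJ")   # 10.6 fJ
```

Note the exponential dependence: each extra bit of precision quadruples the converter energy.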
However, the expression in eq. (A4) only takes into account the power burned in the DAC circuitry itself, and not the power consumed driving the analog processor load. For example, the load of the bitline associated with a ReRAM processor in fig. 3(b) will be very different from the load associated with the variable optical attenuator (VOA) in the optical analog processor in fig. 3(b). An optical processor will have an additional energy contribution from the optical laser power itself, which can be considered effectively part of the load energy. Therefore we can write,
$$e_{dac,i} = \gamma_{dac} k T \, 2^{2B} + e_{load,i} \quad (A5)$$
for both e_dac,1 and e_dac,2. In the following subsections these quantities are estimated for both analog memristive processors and silicon photonic processors.
However, we note that for physically large arrays the load can often be dominated by the capacitance of the row and column addressing lines. The formula for the energy dissipation due to the capacitance of the bitlines and wordlines is,
$$e_{load,i} = \frac{1}{2} \, \mathcal{C} L V^2 \quad (A6)$$
where C is the capacitance per unit length of the line, and L is the line length. For reference, a typical CMOS copper trace has a capacitance of around 0.2 fF/µm [26], so for a 0.9-V process the lines typically consume 0.08 fJ/µm per operation.
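The per-micron figure follows directly from eq. (A6) with the quoted trace capacitance; a minimal sketch (function name ours):

```python
def e_load(cap_ff_per_um, length_um, volts):
    """Addressing-line energy from eq. (A6): (1/2) * C * L * V^2.
    Capacitance per unit length is given in fF/um."""
    return 0.5 * (cap_ff_per_um * 1e-15) * length_um * volts**2

# Typical CMOS copper trace: ~0.2 fF/um at 0.9 V (from the text)
print(f"{e_load(0.2, 1.0, 0.9) * 1e15:.3f} fJ per um")  # 0.081 fJ per um
```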
## 1. Silicon Photonics Analog Processors
For an optical computer, there is both an optical and an electrical component to the load for the driving of the inputs:
$$e_{load,1} = e_{elec} + e_{opt} \quad (A7)$$
The electrical component will involve driving some kind of electro-optic modulator, and the energy per operation will depend on the capacitance of that component in the usual way. In the context of silicon photonics, this might be a variable optical attenuator (VOA) on the data input, while a Mach-Zehnder interferometer, MEMS modulator, or phase-change modulator is often used to store the weight data in the array. Some of the lower-energy approaches to electro-optic modulators are plasmonic resonators; the lowest energy per modulation recorded to date for plasmonic modulators is around e_elec ≈ 9 pJ.[27, 28] These are comparable to electro-optic modulators made of doped silicon micro-ring resonators tuned via carrier plasma dispersion, which have been demonstrated with roughly 0.9 pJ/bit, or 7 pJ/B.[29] It may be possible to design optical modulators in the future with lower energy per sample than these figures.[30]
The optical contribution to the load itself will depend exponentially on bit precision since the dominant source of optical noise is shot noise. Therefore for the optical component we can write,
$$e_{opt} = \frac{\hbar \omega}{\eta_{opt}} 2^{2B} \equiv \gamma_{opt} k T \, 2^{2B} \quad (A8)$$
where ℏ is the reduced Planck constant, ω is the angular frequency of the light, and η_opt is the efficiency of the optical system and photodetector. To provide numbers for context, for 1550-nm light and an optical efficiency of 80%, we have γ_opt ≈ 39, which corresponds to e_opt ≈ 10 fJ. In light of the energy per sample associated with current electro-optic technology, the optical contribution to the energy is negligible.
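Both constants quoted here follow directly from eq. (A8); the sketch below (function name ours) recovers γ_opt and e_opt for 1550-nm light, 80% efficiency, and B = 8, which we take to be the implied bit precision.

```python
import math

h_bar = 1.054571817e-34  # reduced Planck constant [J*s]
k_B = 1.380649e-23       # Boltzmann constant [J/K]
T = 300.0                # assumed room temperature [K]
c = 299_792_458.0        # speed of light [m/s]

def e_opt(wavelength_m, eta_opt, B):
    """Shot-noise-limited optical energy per sample, eq. (A8)."""
    omega = 2 * math.pi * c / wavelength_m
    return (h_bar * omega / eta_opt) * 2 ** (2 * B)

e = e_opt(1550e-9, 0.8, 8)
gamma_opt = e / (k_B * T * 2 ** 16)  # divide out kT * 2^(2B) to recover gamma
print(f"gamma_opt ~ {gamma_opt:.0f}, e_opt ~ {e * 1e15:.0f} fJ")  # gamma_opt ~ 39, e_opt ~ 10 fJ
```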
The load associated with the reconfiguration of the weights, e_dac,2, only has an electrical component, which involves both the electro-optic modulators and the electrical bitlines used to address the array. Ultra-low-energy electro-optic modulators typically also have small dimensions, on the order of a few microns, in order to minimize capacitance, which leads to an additional energy consumption of a few femtojoules per element along the length of the array. This is negligible compared to the energy associated with the electro-optic modulator itself. Therefore, both e_dac,1 and e_dac,2 are dominated by the electro-optic modulator energy.
## 2. Memristive Analog Processors
In the ReRAM processor, the load has two contributions: the capacitance associated with the conductive lines in the array, and the dissipation of charge in the memristors. The pitch of ReRAM arrays tends to be limited by the size of the transistor placed at each node, which means the array bitlines and wordlines are relatively short and have low capacitance. Nonetheless, energy consumption in large arrays can still be dominated by this capacitance, which is given by eq. (A6).
On the other hand, in a ReRAM array the energy per operation consumed by the memristors themselves can also be quite high, since it is proportional both to the size of the array and to their average conductance, and the conductance is bounded below by the quantum conductance G 0 = 2 e 2 /h , where e is the charge of an electron and h is Planck's constant. The conductance of these elements is therefore limited to the range G = G 0 to G = 2 B G 0 for B -bit precision elements.
Memristors are highly nonlinear elements, so the input data is usually supplied via pulse-width modulation rather than by varying the voltage. The energy consumed by the entire array can therefore be written as a sum over all the memristors:
$$\langle E_{ReRAM} \rangle = \delta t \sum_{i=1}^{M} \sum_{j=1}^{N} \langle G_{ij} \rangle \langle V_j^2 \rangle . \quad (A9)$$
where δt is the sampling period. Using the nominal value of the conductances and voltages for each memristor, we can simplify the equation to
$$\langle E_{ReRAM} \rangle = \delta t \, M N \langle G \rangle V_{rms}^2 . \quad (A10)$$
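The reduction from eq. (A9) to eq. (A10) amounts to replacing each conductance and voltage by its nominal value. A small Monte-Carlo sketch (all parameter choices here are illustrative, not from the paper) confirms the two forms agree on average:

```python
# Monte-Carlo check that the expected array energy of eq. (A9) reduces to
# the product form of eq. (A10). Parameter values are illustrative only.
import random

random.seed(0)
M, N = 32, 32
dt = 1e-9                    # assumed 1-ns sampling period, s
G0 = 7.748e-5                # quantum conductance 2e^2/h, S
B = 8
G_avg = 2 ** (B - 1) * G0    # nominal (mean) conductance, S
V_rms = 0.07                 # assumed 70-mV rms drive, V

trials = 500
total = 0.0
for _ in range(trials):
    V = [random.gauss(0.0, V_rms) for _ in range(N)]
    # Eq. (A9): sum of G_ij * V_j^2 over the whole array for one action
    total += dt * sum(random.uniform(G0, 2 ** B * G0) * V[j] ** 2
                      for _ in range(M) for j in range(N))
E_avg = total / trials       # Monte-Carlo estimate of <E_ReRAM>

E_pred = dt * M * N * G_avg * V_rms ** 2   # eq. (A10)
print(E_avg / E_pred)        # close to 1
```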
In one action of the array, the number of MAC operations is MN , so the average energy per operation consumed by the memristors is a constant and, unlike the capacitive contribution, is not reduced by scaling up the array:
$$e_{ReRAM} \equiv \frac{\langle E_{ReRAM} \rangle}{MN} = \langle G \rangle V_{rms}^2 \, \delta t \quad (A11)$$
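Plugging representative values into eq. (A11) (the 70-mV minimum drive voltage and 1-ns sampling period discussed later in this appendix, with an assumed B = 8) reproduces the headline efficiency bound:

```python
# Back-of-the-envelope evaluation of eq. (A11) with representative values.
# B = 8 is an assumed precision; V_rms and dt follow the text's estimates.
G0    = 7.748091729e-5    # quantum conductance 2e^2/h, S
B     = 8
G_avg = 2 ** (B - 1) * G0 # average conductance <G> = 2^(B-1) G0, S
V_rms = 0.07              # minimum practical drive voltage, V
dt    = 1e-9              # sampling period, s

e_reram = G_avg * V_rms ** 2 * dt    # J per MAC, eq. (A11)
eff_tops_per_w = 1e-12 / e_reram     # 1 TOPS/W corresponds to 1 op per pJ

print(f"e_ReRAM ~ {e_reram * 1e12:.3f} pJ")          # ~0.05 pJ
print(f"upper bound ~ {eff_tops_per_w:.0f} TOPS/W")  # ~20 TOPS/W
```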
As noted above, the conductance of memristors is only well behaved above the quantum conductance. Assuming a uniform distribution, the average value will be half the dynamic range, and therefore 〈 G 〉 = 2 B -1 G 0 .

The energy is proportional to the square of the voltage, so we assume the voltage is limited to maintain B effective bits of precision relative to the Johnson-Nyquist thermal noise limit V noise in the memristors. For a sampling period of δt , the thermal noise is

$$V_{noise}^2 = \frac{4kT}{G_0 \, \delta t} , \quad (A12)$$

since the maximal noise is given by the minimum conductance. Setting V 2 rms = (3 / 2)2 2 B V 2 noise as the minimal required voltage to maintain B bits of accuracy, the minimum energy absorbed by the memristor array per operation is

$$e_{ReRAM} = 3 k T \, 2^{3B} . \quad (A13)$$

While this is the ideal limit, in practice there is a minimum voltage that can be applied, typically much higher than the thermal noise limit and on the order of V rms ≈ 70 mV. Using this estimate and a sampling period of δt = 1 ns, the energy per operation due to the memristors is e ReRAM ≈ 0 . 05 pJ, which is about five times lower than the energy per operation in commercial memristor arrays, but nonetheless places an upper bound on the efficiency at η ≈ 20 TOPS/W.

TABLE VI: Typical pitches for various analog compute modulation technologies.

| Active ReRAM [8, 31] | 1-4 µm |
|------------------------------------------------|--------|
| Optical phase-change material [10] | 250 µm |
| Optical Mach-Zehnder interferometer (MZI) [13] | 100 µm |

TABLE VII: Values of dimensionless constants for various operations. These assume a technology node of 45 nm and a voltage of 0.9 V. Optical efficiency is assumed to be 50% for γ opt .

| γ m | 3 × 10 6 |
|-------|--------------|
| γ mac | 1 . 2 × 10 5 |
| γ adc | 583 |
| γ dac | 39 |
| γ opt | 105 |

- [1] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. , In-datacenter performance analysis of a tensor processing unit, in Proceedings of the 44th annual international symposium on computer architecture (2017) pp. 1-12.
- [2] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim, in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (IEEE, 2020) pp. 58-68.
- [3] M. Horowitz, 1.1 computing's energy problem (and what we can do about it), in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC) (IEEE, 2014) pp. 10-14.
- [4] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, Survey and benchmarking of
machine learning accelerators, in 2019 IEEE high performance extreme computing conference (HPEC) (IEEE, 2019) pp. 1-9.
- [5] A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, and E. Eleftheriou, Memory devices and applications for in-memory computing, Nature nanotechnology 15 , 529 (2020).
- [6] S. Williams, A. Waterman, and D. Patterson, Roofline: an insightful visual performance model for multicore architectures, Communications of the ACM 52 , 65 (2009).
- [7] M. Demler, Mythic multiplies in a flash, Microprocessor Report (2018).
- [8] Q. Xia and J. J. Yang, Memristive crossbar arrays for brain-inspired computing, Nature materials 18 , 309 (2019).
- [9] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars, ACM SIGARCH Computer Architecture News 44 , 14 (2016).
- [10] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, et al. , Parallel convolutional processing using an integrated photonic tensor core, Nature 589 , 52 (2021).
- [11] J. R. Ong, C. C. Ooi, T. Y. Ang, S. T. Lim, and C. E. Png, Photonic convolutional neural networks using integrated diffractive optics, IEEE journal of selected topics in quantum electronics 26 , 1 (2020).
- [12] M. Y.-S. Fang, S. Manipatruni, C. Wierzynski, A. Khosrowshahi, and M. R. DeWeese, Design of optical neural networks with component imprecisions, Optics Express 27 , 14009 (2019).
- [13] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. BaehrJones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, et al. , Deep learning with coherent nanophotonic circuits, Nature photonics 11 , 441 (2017).
- [14] P. Ambs, Optical computing: A 60-year adventure, Advances in Optical Technologies (2010).
- [15] J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification, Scientific reports 8 , 1 (2018).
- [16] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, All-optical machine learning using diffractive deep neural networks, Science 361 , 1004 (2018).
- [17] C. Qian, X. Lin, X. Lin, J. Xu, Y. Sun, E. Li, B. Zhang, and H. Chen, Performing optical logic operations by a diffractive neural network, Light: Science & Applications 9 , 1 (2020).
- [18] T. Zhou, X. Lin, J. Wu, Y. Chen, H. Xie, Y. Li, J. Fan, H. Wu, L. Fang, and Q. Dai, Large-scale neuromorphic optoelectronic computing with a reconfigurable diffrac-
tive processing unit, Nature Photonics 15 , 367 (2021).
- [19] R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, Large-scale optical neural networks based on photoelectric multiplication, Physical Review X 9 , 021032 (2019).
- [20] B. E. Jonsson, An empirical approach to finding energy efficient ADC architectures, in Proc. of 2011 IMEKO IWADC & IEEE ADC Forum (2011) pp. 1-6.
- [21] P. Palmers and M. S. Steyaert, A 10-bit 1.6-GS/s 27-mW current-steering D/A converter with 550-MHz 54-dB SFDR bandwidth in 130-nm CMOS, IEEE Transactions on Circuits and Systems I: Regular Papers 57 , 2870 (2010).
- [22] A. Stillmaker and B. Baas, Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm, Integration 58 , 74 (2017).
- [23] I. Koren, Computer Arithmetic Algorithms (PRENTICE HALL, 1993) Chap. 6. High Speed Multiplication.
- [24] R. Landauer, Irreversibility and heat generation in the computing process, IBM journal of research and development 5 , 183 (1961).
- [25] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs, IEEE Transactions on Circuits and Systems I: Regular Papers 58 , 1736 (2011).
- [26] N. H. Weste and D. Harris, CMOS VLSI design: a circuits and systems perspective (Pearson Education India, 2015).
- [27] G. Dabos, D. V. Bellas, R. Stabile, M. Moralis-Pegios, G. Giamougiannis, A. Tsakyridis, A. Totovic, E. Lidorikis, and N. Pleros, Neuromorphic photonic technologies and architectures: scaling opportunities and performance frontiers, Optical Materials Express 12 , 2343 (2022).
- [28] C. Haffner, D. Chelladurai, Y. Fedoryshyn, A. Josten, B. Baeuerle, W. Heni, T. Watanabe, T. Cui, B. Cheng, S. Saha, et al. , Low-loss plasmon-assisted electro-optic modulator, Nature 556 , 483 (2018).
- [29] C. Sun, M. T. Wade, Y. Lee, J. S. Orcutt, L. Alloatti, M. S. Georgas, A. S. Waterman, J. M. Shainline, R. R. Avizienis, S. Lin, et al. , Single-chip microprocessor that communicates directly using light, Nature 528 , 534 (2015).
- [30] D. A. Miller, Attojoule optoelectronics for low-energy information processing and communications, Journal of Lightwave Technology 35 , 346 (2017).
- [31] A. Khakifirooz, K. Cheng, Q. Liu, T. Nagumo, N. Loubet, A. Reznicek, J. Kuss, J. Gimbert, R. Sreenivasan, M. Vinet, et al. , Extremely thin SOI for system-on-chip applications, in Proceedings of the IEEE 2012 Custom Integrated Circuits Conference (IEEE, 2012) pp. 1-4.