## An Implementation of Back-Propagation Learning on GF11, a Large SIMD Parallel Computer
Michael Witbrock and Marco Zagha December 1989 CMU-CS-89-208
School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213
## Abstract
Current connectionist simulations require huge computational resources. We describe a neural network simulator for the IBM GF11, an experimental SIMD machine with 566 processors and a peak arithmetic performance of 11 Gigaflops. We present our parallel implementation of the backpropagation learning algorithm, techniques for increasing efficiency, performance measurements on the NetTalk text-to-speech benchmark, and a performance model for the simulator. Our simulator currently runs the back-propagation learning algorithm at 900 million connections per second, where each 'connection per second' includes both a forward and backward pass. This figure was obtained on the machine when only 356 processors were working; with all 566 processors operational, our simulation will run at over one billion connections per second. We conclude that the GF11 is well-suited to neural network simulation, and we analyze our use of the machine to determine which features are the most important for high performance.
This research was performed at and supported by the IBM T.J. Watson Research Center, Yorktown Heights, NY 10598. The production of this Report was supported in part by Hughes Aircraft Corporation and National Science Foundation grant ECS-8716324.
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the International Business Machines Corporation, the Hughes Aircraft Corporation, the National Science Foundation, or the U.S. Government.
## 1. Introduction
The recent development of several new and effective learning algorithms has inspired interest in applying neural networks to practical problems such as road following by autonomous vehicles[14], speech recognition[9], story understanding[12] and sonar target identification[7]. Many of these applications have been approached using variants of the Backpropagation learning algorithm[16]. Although this learning algorithm has been used to attack small problems with considerable success, the huge computational resources required have hindered attempts on large scale tasks.
One of the authors of this paper has recently been involved in an attempt, at Carnegie Mellon University, to apply backpropagation learning to the problem of speaker-independent continuous speech recognition [6]. Even for the relatively small digit recognition task initially selected[11], it has been necessary to train rather large recurrent nets[4] by making around 10,000 passes through very large amounts of data. 1 On the Convex C-1 used for these experiments, a typical training run took about a week. It became clear that the connectionist simulation tools available to us would make it very difficult to approach the ultimate goal of learning to transcribe continuously spoken general English. A simulator several orders of magnitude faster might permit a worthy attack.
The construction of such a simulator became feasible when the authors were offered the opportunity to write connectionist software for IBM's GF11 parallel supercomputer.
## 2. GF11 Architecture and Microcode Generation
GF11 [1] is an experimental parallel computer located at IBM's T.J. Watson Research Center at Yorktown Heights, New York. It is a SIMD (Single Instruction Multiple Data) machine composed of 566 processors interconnected through a Beneš network[2] (see Figure 1). Each processor is capable of 20 million floating or fixed point operations per second, a rate which can be sustained during many kinds of calculations since intermediate results can be stored in a relatively large (256 word) register file. The registers can do up to four operations on every clock: write each of two operands to an ALU, read a result from an ALU, and write to or read from the interconnection network or static RAM (SRAM). Each processor has 16K words of static RAM, which, since it can be written or read only once per clock, is effectively 4 times slower than the register file, and 512K words of dynamic RAM (DRAM), which is 4 times slower still. The Beneš network is capable of connecting the processors in arbitrary 1 to 1 permutations; 1024 such permutation patterns can be set up in the machine at once 2 . The whole machine has a peak arithmetic performance of 11.4 Gigaflops and contains a total of 1.14 Gbytes of semiconductor memory.
Programs for GF11 consist of subroutines of sequential (non-branching) microcode residing on
1 This long training period was necessary, even using the quickprop[5] variation of backprop which usually converges considerably faster than ordinary backprop.
2 A variety of other connection topologies, including broadcasts and multicasts, are also possible, but these were relatively difficult to set up with the available software and were not needed for our purposes.
Figure 1: The Architecture of the GF11 Computer (after Beetem[1]).
a single, system wide controller. All of GF11's 566 processors receive exactly the same instruction at exactly the same time. Typically each processor will apply the instructions to different data, and, because the processors have a table lookup facility, this data can reside at different local memory addresses on different processors. The controller is connected to an IBM RT Workstation, which schedules the execution of microcode subroutines residing in the controller, and which can read results from and write data to GF11's processors via the controller. The controller is capable of storing 512K microcode instructions, a length which corresponds to about 1/40th of a second of GF11 run-time. Only very limited data dependent computation is possible in this code: table lookup, and selection of the source (data path) from which an operand is taken. Since neither branching nor looping is possible in the microcode, program flow control must be implemented by having the RT choose which stored microcode subroutine to run next. Since memory addresses contained in the microcode may be mapped onto different physical addresses (relocated) by the controller, one can loop through large arrays by repeatedly applying a single section of sequential
microcode to different portions of processor memory.
Programming GF11 is straightforward. One can regard it as a vast floating point coprocessor attached to the RT. GF11 programs are written in a high level language 3 and run on the RT. At appropriate points in the program, functions are called to write data to, read data from, or execute a particular microcode subroutine on GF11. Microcode subroutines are, in turn, written as a series of calls to high-level language functions representing GF11 processor instructions such as floating add, write to memory, read from switch, etc. During a preliminary 'generation' execution of the program, these high level language subroutines are executed, and the GF11 operations required to perform their function are recorded. These operations are then passed through a scheduler and turned into a block of sequential microcode, which will execute on GF11 when the routine is called during a 'run' execution. Two advantages accrue from generating microcode by actually executing a high level program. The first advantage is that high level language constructs, such as loops and branches, can be interspersed with the GF11 instructions, allowing the same block of high level language to generate many different microcode subroutines, depending on the state of program variables during 'generation'. This flexibility allowed our program to handle arbitrary network topologies very simply: efficient, sequential microcode is generated by the program according to a topology file that it reads during 'generation'. In effect, our program compiles arbitrary feed-forward networks into GF11 microcode. The second advantage is that simulation of GF11 operation is virtually free. The routines representing GF11 operations are capable, depending on a mode switch, of outputting a GF11 microcode instruction, or of simulating execution of that instruction. We made a great deal of use of this simulation facility when developing and debugging our program.
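The sketch below (Python, with invented names throughout; the real simulator used the proprietary PL1-like language) illustrates the two-mode idea: the same high level control flow either records an unrolled, branch-free sequence of processor operations for later scheduling, or simulates those operations immediately for debugging.

```python
# Hypothetical illustration of 'generation' versus simulation of microcode.
# Ordinary host-language control flow (loops, branches) runs at generation
# time; only the recorded operations end up in the straight-line microcode.

class Microcode:
    def __init__(self, mode="generate"):
        self.mode = mode          # "generate" records ops, "simulate" executes them
        self.ops = []             # unrolled, branch-free instruction list
        self.regs = [0.0] * 256   # simulated register file

    def fmul(self, dst, a, b):
        if self.mode == "generate":
            self.ops.append(("fmul", dst, a, b))
        else:
            self.regs[dst] = self.regs[a] * self.regs[b]

    def fadd(self, dst, a, b):
        if self.mode == "generate":
            self.ops.append(("fadd", dst, a, b))
        else:
            self.regs[dst] = self.regs[a] + self.regs[b]

def forward_bundle(mc, weight_regs, in_regs, sum_reg, tmp_reg):
    # The Python loop is unrolled at generation time: one fmul/fadd pair
    # per weight appears in the emitted microcode, with no branches.
    for w, x in zip(weight_regs, in_regs):
        mc.fmul(tmp_reg, w, x)
        mc.fadd(sum_reg, sum_reg, tmp_reg)

mc = Microcode(mode="generate")
forward_bundle(mc, weight_regs=range(0, 8), in_regs=range(8, 16),
               sum_reg=16, tmp_reg=17)
print(len(mc.ops), "straight-line operations recorded")
```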
## 3. The Backpropagation Learning Algorithm
Backpropagation is a technique for training networks of simple neuron-like units connected by adjustable weights to perform arbitrary input/output mappings. Patterns are presented by setting the activations of a set of 'input layer' units. Activation flows forward to units in subsequent layers via the adjustable weights, eventually reaching the output layer. The activation of a unit is calculated by computing the weighted sum of the activations of the units connected to it and then applying a squashing, or logistic, function 4 to it. The object of learning is to minimize the value of some error metric 5 between the actual activations of the output units and the values required by the desired input/output mapping. This is done by computing the effect of each weight on the error metric, and adjusting the weight in the direction of reduced error. Since both the weighted sum and the logistic function are differentiable, this can be done by computing partial derivatives of the error measure with respect to each weight, starting with the weights to the output units and working backwards.
3 In the case of this simulator, the language used was a proprietary IBM language similar to PL1.
4 Usually the sigmoid function, $\frac{1}{1+e^{-x}}$.
5 Usually summed squared difference.
While in the 'online' version of backprop the required weight changes are applied as they are computed, in our implementation the changes required to reduce the error measure for each input/output pattern are accumulated across all patterns and used to compute a net (or 'pooled') weight update after all cases have been presented.
Pooled update differs from the online version of the backpropagation algorithm in the following way: if there are np input-output pairs in the training sequence, online update allows np weight changes to be applied, whereas the most extreme form of pooled update allows only 1. For some data-sets (the NetTalk training set among them), this is a decided disadvantage. Using online update, NetTalk can be learned in 10 complete passes through the training set[3], a performance unlikely to be matched by the 10 updates allowed by strict pooled update. It is, however, an open question whether pooled update is worse in general. For some tasks, it appears to work better; in some recent experiments performed by one of the authors (Witbrock) - training recurrent networks for speech recognition - online backprop, even with a very small learning rate, had a tendency to reach a plateau beyond which it could not successfully reduce the error. With pooled update, this effect was not noticed. Pooled update also allows one to store successive approximations to the slope of the weight space with respect to the training set, permitting the use of the Quickprop weight update rule. This rule converges considerably more quickly than the usual update rule for many problems [5, 10]. Finally, the apparent disadvantage of pooled update is reduced when one observes that the update does not have to be pooled over all np patterns; in fact, weights can be updated after one case has been processed on each processor.
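As a concrete, if toy, illustration of the difference, the sketch below trains a single layer of weights two ways: applying a weight change after every pattern (online), and accumulating the changes and applying their sum once per epoch (pooled). The network, data, and learning rate are invented for the example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))           # 16 toy input patterns, 4 inputs each
T = (X.sum(axis=1) > 0).astype(float)  # toy targets
w = np.zeros(4)
lr = 0.1

def grad(w, x, t):
    # Gradient of squared error for a single sigmoid unit.
    y = 1.0 / (1.0 + np.exp(-x @ w))
    return (y - t) * y * (1.0 - y) * x

# Online update: one weight change per pattern presentation.
w_online = w.copy()
for x, t in zip(X, T):
    w_online -= lr * grad(w_online, x, t)

# Pooled update: accumulate changes over the whole set, apply once.
w_pooled = w.copy()
delta = np.zeros_like(w_pooled)
for x, t in zip(X, T):
    delta += grad(w_pooled, x, t)      # weights held fixed during the epoch
w_pooled -= lr * delta                 # single net update per epoch

print("online:", w_online, "\npooled:", w_pooled)
```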
## 4. Simulator Implementation
## 4.1. Parallelizing Backprop for GF11
There are two obvious approaches to parallelizing backprop. In one approach, one divides the network, distributing 'neural processing units' - weights, units, or layers of units - across physical processors, and communicates activation levels between processors. In the other approach, one can parallelize across training cases, having each processor simulate an identical network but apply it to a different subset of training examples, communicating collected weight changes at the end of each training epoch (i.e. after a single presentation of each training example). Both approaches have been used in previous simulators. Blelloch and Rosenberg[3], in their simulator for the Connection Machine, mapped both individual weights and units to processors. The simulator on Warp initially mapped subsets of units with their corresponding weights to physical processors, but was later changed to the more efficient (on the Warp) case-based parallelism. Kevin Lang's [personal communication] simulator for the Convex C-1 vector processor, and its descendants 6 , all succeed in making backprop vectorizable by parallelizing across training cases.
Dividing units across processors is a technique which appears to be best suited to MIMD machines with very fast communications - MIMD, because different units in a neural network
6 Written by Franzini and Witbrock.
Figure 2: Parallelizing backprop for GF11
have different patterns of connection, and very fast communication because activation levels must be passed between processors for each pattern presentation. If both units and weights are divided across processors, SIMD suffices, since all weights (and all units) are essentially the same, but even more communication is necessary 7 . For SIMD machines with moderate numbers of processors and sufficient memory on each processor, the technique of having each processor run the same network (and hence the same code) over a different subset of cases maps more neatly onto the architecture. This latter technique is the one that we used when designing our simulator.
The ability to parallelize backprop across cases assumes that the changes to weights due to each input/output pattern are independent. That is, we must be able to satisfy the condition that the weight changes from each of the patterns in the training set can be applied in any order and yield the same result. This condition is clearly not satisfied by the canonical version of backprop, in which weight changes are calculated and applied after each pattern presentation. Instead, we use the 'pooled update' technique, where weight changes are summed across all input/output cases, and the net change is applied after the entire training set 8 has been presented.
When the algorithm is parallelized this way on GF11, each processor stores the input/output cases it is responsible for (See figure 2). It also has its own copy of the entire network (both units and weights), and storage for accumulating the total weight change due to its cases. After all the processors have finished calculating weight changes for all their cases, they share their accumulated weight changes with the other processors over the interconnection switch. They then update their local weights with the resulting total weight change over all cases.
7 In fact, Blelloch and Rosenberg report that communication speed was the primary limitation on the speed of their Connection Machine simulator.
8 Or a large enough subset of it to parallelize over.
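The sketch below (plain Python, with invented sizes; a linear layer stands in for the full forward/backward pass) mimics this organization on a single machine: each 'processor' holds its own subset of cases and a full copy of the weights, accumulates weight changes locally, and the accumulated changes are then summed and applied identically everywhere.

```python
import numpy as np

rng = np.random.default_rng(1)
NPROCS, N_IN, N_OUT = 4, 5, 3
cases = [(rng.normal(size=N_IN), rng.normal(size=N_OUT)) for _ in range(32)]

# Every processor holds an identical copy of the weights...
W = rng.normal(scale=0.1, size=(N_OUT, N_IN))
weights = [W.copy() for _ in range(NPROCS)]
# ...and is responsible for a different subset of the training cases.
my_cases = [cases[p::NPROCS] for p in range(NPROCS)]

def local_delta(W, cases, lr=0.05):
    # Accumulate weight changes for a linear layer with squared error,
    # standing in for the forward and backward sweeps on one processor.
    dW = np.zeros_like(W)
    for x, t in cases:
        y = W @ x
        dW += lr * np.outer(t - y, x)
    return dW

deltas = [local_delta(weights[p], my_cases[p]) for p in range(NPROCS)]
total = sum(deltas)                          # shared over the interconnection switch
weights = [W_p + total for W_p in weights]   # identical update on every copy
```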
## 4.2. Simulator Structure
Dividing the functionality of the simulator into blocks of microcode is straightforward, although there are a few software and hardware imposed constraints. The current software restricts the size of microcode blocks to 20K lines. The hardware design forces the programmer to work within the memory limits of the machine - in particular, the 256 word register file and the total size of microcode memory (512K lines).
We process weights by subsets of layers, or bundles, typically on the order of 1000 weights. This enables us to keep all the inputs and outputs of units in registers.
To save microcode, we use the same microcode for all training examples and simply set a pointer to the current example in the controller. Similarly, when updating weights, we repeatedly set pointers to the weights and weight changes and call a routine to update a small group of weights.
## 4.3. Processor Dependent Computation
Even when parallelization over cases is used to fit the algorithm to its SIMD architecture, there is still a problem with implementing backprop on the GF11. The problem arises because of the necessity of computing the sigmoid function $\frac{1}{1+e^{-x}}$ on processors which cannot perform division, let alone exponentiation, of floating point numbers. They can, however, do addition, multiplication, and a few other IEEE floating point operations including, perhaps most importantly, extracting the integer part of a floating point number ($\lfloor x \rfloor$). They can also do table lookup within SRAM, and can select the source of operands depending on condition codes set as a result of previous operations (the 'select' operation). Two methods of computing the sigmoid function were tried, using both forms of processor dependent computation available.
The first approach involved replacing the sigmoid function with $\frac{1}{1+2^{-x}}$, or equivalently, $\frac{2^x}{2^x+1}$. To help compute $2^x$, we use the identity $2^x = \left(2^{x/2}\right)^2$ and a polynomial approximation for $2^x$ accurate in the range $[0.0, 0.5]$. More precisely, we compute $2^{\lfloor x \rfloor}\left(2^{(x-\lfloor x \rfloor)/2}\right)^2$, where $2^{\lfloor x \rfloor}$ is computed by table lookup. This reformulation reduces the required operations to addition, multiplication and round down, all of which are provided by the hardware, and taking the reciprocal of a number, which was available as an accurate library routine.
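A minimal sketch of this first method in Python follows. The clipping range and the particular polynomial coefficients (a truncated Taylor series for $2^z$) are illustrative choices, not the values used on GF11.

```python
import math

# Table of 2**k for integer k; the clipping range here is invented.
POW2_INT = {k: 2.0 ** k for k in range(-16, 17)}

def pow2_frac_half(z):
    # Polynomial approximation of 2**z for z in [0.0, 0.5]
    # (illustrative Taylor coefficients, not the paper's polynomial).
    return 1.0 + z * (0.693147 + z * (0.240227 + z * 0.055504))

def fast_squash(x):
    # f(x) = 1 / (1 + 2**(-x)) = 2**x / (2**x + 1), with 2**x built from
    # a table lookup on floor(x) and a squared polynomial on the fraction,
    # so only add, multiply, round-down, and one reciprocal are needed.
    x = max(-16.0, min(16.0, x))
    i = math.floor(x)                       # round down: available in hardware
    f = x - i                               # fractional part in [0, 1)
    p = pow2_frac_half(f / 2.0)             # argument kept in [0.0, 0.5]
    two_to_x = POW2_INT[i] * p * p          # 2**i * (2**(f/2))**2
    return two_to_x / (two_to_x + 1.0)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, fast_squash(x), 1.0 / (1.0 + 2.0 ** (-x)))
```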
The second approach to computing the sigmoid function involved using the 'select' operation (data dependent data path selection) to ensure that values of $x$ were in the range $[-15, 15]$; values outside of this range were mapped to the extreme values of the sigmoid function. Numbers within the range were remapped, by addition and multiplication, to a value $x_1$ in the range $[0, 255]$. The integer part $\lfloor x_1 \rfloor$ of these numbers was used to do table lookup into a precomputed table of $\frac{1}{1+e^{-x}}$; the fractional part of the number, $x_1 - \lfloor x_1 \rfloor$, was used to perform linear interpolation between successive entries in the lookup table.
Figure 3: Computing sigmoid by table lookup. Actual table contained 256 points.
Using this combination of table lookup and interpolation (which works, in fact, for any differentiable monotonic function over the reals), the sigmoid function could be computed with approximately five decimal places of accuracy over its entire domain. See figure 3.
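A minimal sketch of the table lookup and interpolation method, using the 256-point table and [-15, 15] range mentioned above; the uniform remapping of the input onto the table index is an assumption of this sketch.

```python
import numpy as np

LO, HI, N = -15.0, 15.0, 256                 # range and table size from the text
XS = np.linspace(LO, HI, N)
TABLE = 1.0 / (1.0 + np.exp(-XS))            # precomputed sigmoid values

def sigmoid_lookup(x):
    # Out-of-range inputs map to the extreme values of the sigmoid; in-range
    # inputs are remapped onto [0, 255] with one add and one multiply.
    if x <= LO:
        return TABLE[0]
    if x >= HI:
        return TABLE[-1]
    x1 = (x - LO) * (N - 1) / (HI - LO)
    i = int(np.floor(x1))                    # integer part: table index
    f = x1 - i                               # fractional part: interpolation weight
    return (1.0 - f) * TABLE[i] + f * TABLE[i + 1]

for x in (-20.0, -2.0, 0.0, 2.0, 20.0):
    print(x, sigmoid_lookup(x), 1.0 / (1.0 + np.exp(-x)))
```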
There were two other, less important, pieces of code where the 'select' operation was used. One use was to replace outlandish weight changes by zero, so that the hardware errors which were frequent while we were developing our code would not render the algorithm unworkable (see section 8 below).
The other use of data dependent data path selection was in an (unfinished) implementation of Fahlman's quickprop weight update rule[5], which includes a number of calculation steps which are dependent on the local curvature of the error surface in weight space.
## 4.4. Processor Communication
## 4.4.1. Summing Weight Changes in a Ring
In our initial approach, we used one switch configuration to connect P processors in a ring and sum each weight change in P steps.
On the first step, each processor initializes Sum to zero and loads its weight change into the variable Neighbor. Then, for the remaining P - 1 steps, each processor sends Neighbor to the processor on its right, receives a new value of Neighbor from the processor on its left, and adds Neighbor to Sum. At the end of P steps each processor has the same value for the total weight change in Sum (see figure 4).
This algorithm has a time complexity of $O(NPROCS)$ per weight.
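A minimal Python sketch of the ring sum, simulating the P processors as array slots; in this sketch the accumulator starts from each processor's own weight change so that all P contributions end up in Sum after P - 1 communication steps.

```python
# Ring summation of one weight change across P simulated processors.
P = 6
delta = [float(p + 1) for p in range(P)]   # each processor's local weight change

sums = delta[:]                            # running total on each processor
neighbor = delta[:]                        # value currently circulating

for _ in range(P - 1):                     # P - 1 communication steps
    # Every processor sends Neighbor to the processor on its right and
    # receives a new Neighbor from the processor on its left, then adds it.
    neighbor = [neighbor[(p - 1) % P] for p in range(P)]
    sums = [s + n for s, n in zip(sums, neighbor)]

print(sums)        # every entry equals sum(delta) == 21.0
```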
Figure 4: Summing Weight Changes with Processors Configured in a Ring
## 4.4.2. Summing Weight Changes in a Tree
GF11's powerful communication facility allows a more efficient approach to summing weight changes. We use several switch configurations to sum weight changes in a binary tree using a standard data-parallel algorithm[8].
On step $i$ (starting at zero), processor $P$ sends its weight change to processor $(P + 2^i) \bmod NPROCS$ and adds the weight change it receives to its own weight change. After $\log_2 NPROCS$ steps, each processor holds the sum of all the individual weight changes. See figure 5.
This algorithm has a time complexity of $O(\log_2 NPROCS)$ per weight.
If the number of processors is not a power of 2, we generate two additional communication instructions. Call each processor $P_n$ with $n \geq 2^{\lfloor \log_2 NPROCS \rfloor}$ a 'leftover' processor. Each leftover processor $P_n$ sends its weight change to its 'buddy', processor number $n - 2^{\lfloor \log_2 NPROCS \rfloor}$, which adds the value to its weight change. Then logarithmic summing proceeds as described above. Finally, each leftover processor receives the final total weight change from its buddy. During the two extra steps, processors not involved in the communication mask out their operations using condition codes.
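A sketch of the logarithmic sum in plain Python, including the leftover-processor handling for a processor count that is not a power of two; the function names and sizes are invented for illustration.

```python
import math

def tree_sum(delta):
    # delta[p] is the local weight change on processor p; returns the list
    # after summation, with every processor holding the same total.
    nprocs = len(delta)
    vals = list(delta)
    core = 2 ** int(math.floor(math.log2(nprocs)))   # largest power of 2 <= nprocs

    # Extra step 1: each 'leftover' processor n >= core folds its value
    # into its buddy, processor n - core.
    for n in range(core, nprocs):
        vals[n - core] += vals[n]

    # Logarithmic summing among the first 'core' processors: on step i,
    # processor p sends its value to (p + 2**i) mod core, and every
    # processor adds the value it receives to its own.
    step = 1
    while step < core:
        received = [vals[(p - step) % core] for p in range(core)]
        for p in range(core):
            vals[p] += received[p]
        step *= 2

    # Extra step 2: leftover processors read the total back from their buddies.
    for n in range(core, nprocs):
        vals[n] = vals[n - core]
    return vals

print(tree_sum([1.0, 2.0, 3.0, 4.0, 5.0]))   # every entry is 15.0
```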
Figure 5: Summing Weight Changes with Processors Configured in a Tree
## 5. Simulating Larger Networks
As described so far, our implementation could only accommodate networks with a few thousand weights within the 16K word SRAM. Training examples can be kept in DRAM with trivial changes to the code and a negligible performance penalty. However, without further changes to the structure of the program, keeping weights in DRAM creates a fundamental efficiency problem: transfers to and from DRAM would take over twice as long as the floating point computation. A transfer to or from DRAM can be started at most once per 4 cycles, but in the forward pass only two floating point operations per weight are required, yielding a maximum efficiency of 50%. In the backward pass, a weight and a weight change must be loaded and the updated weight change stored, requiring 3 transfers per 4 floating point operations, yielding only 33% efficiency.
We adopt the following approach to obtain locality of reference to the weights and weight changes. Instead of doing independent forward and backward passes for each case, we move a bundle of weights and weight changes into an SRAM cache, process several training examples on those weights, and then move the weight values back to DRAM.
The original structure of the simulator was the following:
```
For each training example
    For each bundle of weights
        Sweep forward on bundle
    For each bundle of weights (in reverse order)
        Sweep backward on bundle
```
The new structure, which uses DRAM, is the following:
```
For each group of training examples
    For each bundle of weights
        Move weights to a cache in SRAM
        For each training example in group
            Sweep forward on bundle
    For each bundle of weights (in reverse order)
        Move weights and weight changes to a cache in SRAM
        For each training example in group
            Sweep backward on bundle
        Move weight changes from cache to DRAM
```
Figure 6: Processing all the Weights for a Single Case.
Figure 7: Processing a Subset of the Weights for Several Cases at Once.
As another way of understanding this change, the loop structure can be viewed as a traversal of points on a 3-dimensional rectangular grid. The points on the grid represent weights (or weight
changes) and the axes of the grid are input unit index, output unit index, and training example index. In our original approach (see figure 6), we traverse one slice of the grid at once - that is we process one case at a time. We were forced to modify this algorithm because the memory required for this slice exceeds the capacity of free memory in SRAM. In the improved approach we traverse a sub-grid (See figure 7) which corresponds to several training examples processed on a subset of the weights. Since the weights are the same for each training example, we reduce transfers to and from DRAM.
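A back-of-the-envelope illustration of the saving, with invented sizes: in the per-case traversal every case reloads every weight from DRAM, while in the blocked traversal each group of cases loads the weights only once, so weight traffic shrinks by roughly the group size.

```python
# Rough count of weight words moved between DRAM and SRAM per epoch,
# comparing the two traversal orders (illustrative numbers, not measurements).
W = 15_000            # weights, assumed larger than the SRAM cache
C_per_proc = 64       # training cases held by one processor
group = 8             # cases processed per SRAM-resident weight bundle

per_case_traversal = W * C_per_proc              # each case touches every weight in DRAM
blocked_traversal = W * (C_per_proc // group)    # each group loads the weights once

print(per_case_traversal, blocked_traversal,
      per_case_traversal // blocked_traversal)   # traffic drops by the group size
```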
## 6. Simulator Performance
## 6.1. Performance Model
Developing a performance model yields several important benefits. Execution times for different topologies and training set sizes can be estimated without having to execute a program on GF11. The performance model can also approximate the optimal number of processors to use for a given problem. More importantly, the model makes analyzing the role of the simulator components much easier. Bottlenecks are revealed, and the importance of specific machine features can be determined.
Several assumptions are implicit in the performance model. We neglect computation on units, such as the sigmoid calculation, since there are typically many more weights than units. We also assume that computation during the forward and backward passes fills the processor pipeline; this amounts to assuming more than 12 units per layer.
Let:

| Symbol | Definition |
|--------|------------|
| $G$ | number of groups of training cases |
| $W$ | total number of weights |
| $W_i$ | number of weights in the input layer |
| $U$ | number of units |
| $B$ | maximum number of weights in a bundle |
| $M$ | number of free words in SRAM |
| $C$ | number of training cases |
| $P$ | number of processors |
In the forward pass, one add and one multiply per weight are executed:
$$\text{Cycles forward/case} = 2W$$
In the backward pass, two adds and two multiplies per weight are executed, except for connections from input units, which execute one add and one multiply:
$$\text{Cycles backward/case} = 2W_i + 4(W - W_i)$$
Transfers to or from DRAM can only be started once every 4 cycles. In the forward pass, each weight is transferred to SRAM once per group of training examples. In the backward pass, a weight change is loaded and stored once per group of examples, and weights not in the input layer are loaded once:

$$\text{Cycles forward transferring weights/group} = 4W$$

$$\text{Cycles backward transferring weight changes/group} = (4)(2)W$$

$$\text{Cycles backward transferring weights/group} = 4(W - W_i)$$

Communications over the switch can also only be started once every 4 cycles. When summing in a tree: 10

$$\text{Cycles per update} = 4W\lceil \log_2 P \rceil$$

When summing in a ring:

$$\text{Cycles per update} = 4WP$$

After allocating space for the sigmoid lookup tables and a few other constants, the amount of free SRAM is approximately:

$$M = 15K$$
The weight value and weight change caches use $2B$ words of memory, and two words per unit are used to hold unit activation and error levels. Thus, the number of cases in a group that can be processed entirely in SRAM is:

$$\frac{M - 2B}{2U}$$

The number of groups of cases is therefore:

$$G = \left\lceil \frac{C}{P} \cdot \frac{2U}{M - 2B} \right\rceil$$

The total time per epoch in cycles (when summing in a tree) is:

$$T = 2W\left\lceil \frac{C}{P} \right\rceil + \left(2W_i + 4(W - W_i)\right)\left\lceil \frac{C}{P} \right\rceil + \left(12W + 4(W - W_i)\right)G + 4W\lceil \log_2 P \rceil$$

Because each processor executes 20 million instructions per second, and $WC$ connections are processed per epoch, millions of connections per second (MCPS) is simply:

$$\text{MCPS} = \frac{20\,WC}{T}$$

To find the optimal number of processors when summing in a tree, we differentiate time with respect to the number of processors, equate the result to zero, solve, and simplify:

$$P = \frac{C}{4W}\left((6W - 2W_i) + \frac{2U}{M - 2B}(16W - 4W_i)\right)$$

When summing in a ring, the optimal number of processors is simply the square root of the above quantity.
10 If the number of processors is not a power of 2, there is one extra communication not counted in the model.
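The sketch below transcribes the model into a Python function. The bundle size, free SRAM, and the NetTalk-like input sizes in the example call are illustrative assumptions, not reported measurements.

```python
import math

def mcps(W, W_i, U, C, P, B=1000, M=15_000, tree=True):
    # Cycles per epoch for the pooled-update simulator, following the model above.
    cases_per_proc = math.ceil(C / P)
    group_size = max(1, (M - 2 * B) // (2 * U))     # cases per SRAM-resident group
    G = math.ceil(cases_per_proc / group_size)      # groups of training cases

    forward = 2 * W * cases_per_proc
    backward = (2 * W_i + 4 * (W - W_i)) * cases_per_proc
    dram = (12 * W + 4 * (W - W_i)) * G
    update = 4 * W * math.ceil(math.log2(P)) if tree else 4 * W * P

    T = forward + backward + dram + update          # cycles per epoch
    return 20.0 * W * C / T                         # millions of connections per second

# Roughly NetTalk-sized example: 204*60 + 61*26 weights; the case count is invented.
print(mcps(W=13_826, W_i=12_240, U=290, C=5_000, P=256))
```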
## 6.2. Optimizing Microcode
## 6.2.1. Optimization Techniques
Typically, the first attempt at implementing a microcode routine resulted in code running at less than 50% of optimal performance. In this section, we describe techniques for improving microcode efficiency. These techniques follow from two general principles of code optimization: breaking computation into carefully sized pieces to accommodate multi-level memory hierarchies, and reordering independent instructions to improve pipeline scheduling.
Producing efficient microcode is difficult for a number of reasons. The processor pipeline is 25 deep - that is, the result of a floating point operation does not appear for 25 cycles. In addition, several operations may execute in one cycle. Although pipeline scheduling is not the programmer's responsibility, the scheduling process is not completely invisible. The programmer often has to know what performance constraints the machine imposes and a few details of the scheduler implementation.
The microcode scheduler rearranges instruction order but is constrained by the programmer's approach to data movement. Thus, managing the register/SRAM/DRAM hierarchy becomes the key to producing efficient microcode. In section 5 we discussed changes to the program structure to minimize DRAM traffic. To minimize SRAM traffic, we keep net inputs in registers during the forward pass rather than reloading the values to increment them. Similarly, during the backward pass, error into a unit is kept in registers. Using weight bundles instead of entire weight layers reduces the number of units used in a block of microcode and enables this strategy to work within the 256 register limit.
Analyzing the output of the microcode scheduler is the next step after developing a reasonable approach to memory management. This phase can be tedious, but it often reveals problems that can be fixed by changing the order of instructions or by inserting scheduler directives. For example, the routines that sum weight changes must loop over both communication steps and indices of weight changes. By simply interchanging the loops to sum all weight changes simultaneously, pipeline scheduling improves significantly. As another example, we improved the routine that sums weight changes in a ring by having it circulate weight increments rather than partial sums of weight increments. This change did not reduce the number of source instructions, but it did cut the number of dependent instructions in the critical path in half, since additions could be done in parallel with switch communication.
Inefficient scheduling in the backward pass motivated one of the most interesting optimizations. In both the forward and backward passes, we ordered computation on weights by the unit on the input side of the weight. This turns out to work well for the forward pass, but not in the backward pass, because for an M by N bundle of weights the last M operations all increment the same register (the error from the last source unit). Since successive operations on the same register must be separated by 25 cycles, this led to sub-optimal code. To improve efficiency, at microcode generation time we simply create an ordering of weight indices for the backward pass sorted by output unit index. Then accesses to each of the M input units are separated by at least 2 N operations, which produces near optimal
code for N greater than half the pipeline depth (i.e., N ≥ 12).
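A sketch of the reordering idea, counting weight visits rather than individual machine operations (the bundle dimensions are invented): visiting weights in destination-unit order on the backward pass spreads out the increments to any one source unit's error accumulator.

```python
# Illustrative bundle: M source (input-side) units by N destination units.
M, N = 4, 16

# Forward-pass order: weights grouped by source unit.
forward_order = [(src, dst) for src in range(M) for dst in range(N)]

# Backward-pass order, generated once at microcode-generation time:
# the same weights sorted by destination (output) unit index.
backward_order = [(src, dst) for dst in range(N) for src in range(M)]

def min_spacing(order):
    # Smallest number of weight visits between two visits that would
    # increment the same source unit's error accumulator.
    last, spacing = {}, len(order)
    for pos, (src, _) in enumerate(order):
        if src in last:
            spacing = min(spacing, pos - last[src])
        last[src] = pos
    return spacing

print(min_spacing(forward_order), min_spacing(backward_order))  # 1 versus M
```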
Interleaving code that is logically independent is the most difficult form of optimization, but it is necessary for operations with low throughput, such as switch communication and transfers to and from DRAM. For example, the routine responsible for updating weights performs several functions: it moves weights, weight changes, and weight changes from the previous epoch from DRAM to a cache in SRAM, sums weight deltas over the switch, calculates weight changes using the current value of the learning rate and momentum, updates the weights in the SRAM cache, moves the new weights back to DRAM, and zeroes weight changes in DRAM for the next epoch.
Finally, because the DRAM is interleaved, accesses should be sequential whenever possible.
## 6.2.2. Microcode Efficiency
In this section we report the efficiency of microcode routines generated for networks with at least 12 units per layer. We defer discussion of small networks until section 10.3.
In the forward pass, 97-99% of all cycles execute an add or multiply. Memory transfers execute in parallel with floating point computation.
In the backward pass, 85-99% of all cycles execute an add or multiply. This routine is slightly less efficient than the forward pass because of additional memory traffic - each weight requires both a load and store instead of just a load.
Routines calculating the sigmoid and back error for units execute an add or multiply on about 40-70% of all cycles, which is fast enough to make the execution time negligible.
Analyzing the performance of the weight updating code is more complicated because we overlap DRAM transfers and computation with communication, as described in the previous section. For simplicity, we adopt a pessimistic efficiency rating based on the number of switch communications only. Since each switch communication operation takes four cycles, we define the optimal number of cycles as four times the number of communications. When summing in a ring, we achieve 98% of optimal performance: DRAM transfers, weight update with momentum, and zeroing of weight changes for the next epoch come for free in empty slots during communication. When summing in a tree, efficiency is approximately 60% of optimal performance because computation cannot be fully overlapped with communication. Despite its lower execution efficiency, tree update takes considerably less time than ring update when more than 4 processors are used.
## 6.3. Performance Measurements
We measured the performance of our simulator using the NetTalk [17] text-to-phoneme benchmark (as did Pomerleau et al.[15] and Blelloch & Rosenberg [3]). The network consists of an input layer with 203 units and a 'true' unit, a hidden layer with 60 units, and an output layer of 26 units. The input layer is fully connected to the hidden layer, and the hidden layer is fully connected
Figure 8: MCPS with Processors Configured in a Ring
Figure 9: Communication and Computation with Processors Configured in a Ring
Figure 10: MCPS with Processors Configured in a Tree
Figure 11: Communication and Computation with Processors Configured in a Tree
Table 1: Millions of Connections Per Second, Summing in a Ring and in a Tree
| Processors | Tree MCPS | Ring MCPS |
|------------|-----------|-----------|
| 8          | 26        | 26        |
| 16         | 55        | 53        |
| 32         | 112       | 107       |
| 64         | 216       | 170       |
| 128        | 415       | 222       |
| 256        | 753       | 180       |
| 356        | 901       | -         |
| 512        | 1231      | 84        |
to the output layer. The true unit is fully connected to both the hidden and output layers. The total number of connections is 13,826. Our training set consisted of 12,022 patterns. We updated weights once per epoch.
As of August 1989, there were 356 fully operational processors for which the switch could be configured to provide, without errors, the interprocessor communication required by our application. The disk system was not operational. The timings with more processors are actual measurements, although the results generated by the program were not correct. Measurements do not include the time to initialize DRAM with training patterns. However, the measurements are real-time and therefore include all host and controller overheads.
Figures 8 and 10 show the predicted and measured performance of both types of summing. Figures 9 and 11 break down the execution time into components for communication and computation.
## 7. Comparison to Other Backprop Simulators
## 7.1. Difficulties with Making Comparisons
In this section, we will attempt to compare the performance of our backprop simulator on GF11 with the performance of simulators running on other machines. Before doing so, however, we believe it is important to note that while the simulators all implement essentially the same algorithm, there are variations between implementations, and between sets of test data which can make performance metrics somewhat misleading.
The most obvious source of difficulty in making these comparisons is that of variations on the algorithm caused by variations in hardware. Our simulator, for example, must pool updates over at least a number of I/O patterns equal to the number of operating processors to work at all, and
four times that many to work efficiently. This hardware 'limitation' (which is the source of our code's speed) means that we can't update weights frequently during a presentation of the NetTalk training set. Since the NetTalk training set is suited to frequent updates, our simulator would probably take more pattern presentations to learn this task than the Warp (with fewer processors) or the Connection Machine simulators (with a different form of parallelism). It is even conceivable that our simulator might take longer to learn NetTalk than one of these other simulators.
On the other hand, for the task we originally envisioned applying the simulator to, learning to do speech transcription by training recurrent networks over huge amounts of training data, the advantages of such frequent updates have not been demonstrated.
In short, using variations of an algorithm is sometimes useful, either for a particular task, or to extract a certain form of parallelism, but people reading performance comparisons should not be tempted to gloss over these variations; they may have significant effects on the utility of the simulator for some tasks.
One problem with measuring performance in connections per second is that the definition of this unit is not standard. Our definition [and Pomerleau's] is simply the number of connections (including connections to the true unit) times the number of patterns presented divided by the total time. Notice that this measure accounts for neither the frequency of weight updates nor the fact that connections from the input require less computation.
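As a worked instance of this definition, using the NetTalk figures from section 6.3 (the epoch time below is implied by the definition rather than being an independent measurement):

$$\mathrm{CPS} = \frac{\text{connections} \times \text{patterns presented}}{\text{total time}} = \frac{13{,}826 \times 12{,}022}{t_{\text{epoch}}}$$

so a sustained 901 MCPS corresponds to $t_{\text{epoch}} \approx 1.66 \times 10^{8} / 9.01 \times 10^{8} \approx 0.18$ seconds per pass through the training set.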
There is no clear solution to the problems of measuring performance. However, defining the performance metrics used and reporting the exact parameters of a benchmark make fair comparisons feasible. We have also found that our performance model elucidates performance differences between our simulator and other simulators.
## 7.2. Performance Data
Before we present our performance measurements, we must emphasize that our measurements do not include the I/O time to load training data onto processors. At the time these measurements were made, the GF11 disk system was not operational. See section 10.1 for a discussion of incorporating disk I/O into our simulator.
We should also emphasize that we updated weights once every epoch, while measurements for other machines include more frequent updates. We estimate that if we updated weights after every single pattern per processor (i.e., every 512 patterns), our peak rate would be dominated by communication time and DRAM transfer time. Using the performance model as a guide, the total time per update (on the NetTalk training set, using 512 processors) would be roughly (log₂ 512) communication steps, each taking four cycles per weight, plus 20 cycles/weight of computation and memory transfers, yielding approximately 180 MCPS on 512 processors. This provides a lower bound on performance when more frequent updates are required. As an upper bound of theoretical interest, we estimate that as communication time approaches zero (i.e., when updating once per epoch with huge data sets), the simulator would perform at approximately 1900 MCPS. We provide this figure only as an indication of the trade-off between communication and computation; it should not be considered a benchmark rating.
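As a check, the 180 MCPS estimate can be reproduced from the quantities above, treating the per-update communication and transfer cost as dominant and using the 50 ns (20 MHz) processor cycle:

$$\left(\log_2 512\right) \times 4 + 20 = 56\ \text{cycles per weight per update}, \qquad \frac{512\ \text{patterns} \times W}{56\,W \times 50\ \text{ns}} \approx 1.8 \times 10^{8}\ \text{CPS}.$$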
Table 2: Comparison of Learning Speeds for Various Machines
| Machine       | MCPS  |
|---------------|-------|
| Vax           | 0.008 |
| Sun 3/75      | 0.01  |
| Vax 780       | 0.027 |
| Sun 3/160/FPA | 0.034 |
| Ridge 32      | 0.05  |
| DEC 8600      | 0.06  |
| Convex C-1    | 1.8   |
| 16K CM-1      | 2.6   |
| Cray-2        | 7     |
| 64K CM        | 13    |
| CM-2          | 40    |
| Warp          | 20    |
| GF11          | 901   |
To convert MCPS to Megaflops, we consider only the computation on weights in the forward and backward passes, ignoring weight updates and the sigmoid calculation. Total computation time on weights depends on the fraction of weights that are in the input layer, i.e.,
$$\mathrm{Megaflops} = \mathrm{MCPS} \times \left( 4 + 2\,\frac{W - W_i}{W} \right)$$
where $W$ is the total number of weights and $W_i$ is the number of weights on connections from input units.
For purposes of comparison, Table 2 gives performance figures for a number of other implementations of backpropagation learning. The Warp figure is from George Gusciora [personal communication]. The CM-2 figure is from Zhang et al.[18]. Other figures are from Blelloch and Rosenberg [3] or Pomerleau et al. [15]. Alexander Singer [personal communication] has an implementation for the CM-2, based on an algorithm suggested by Robert Farber, which is an order of magnitude faster than that of Zhang et al. The method used for reporting the performance of this simulation is not directly comparable with that used in this paper, but we estimate its performance at between 400 and 650 MCPS based on the figures we were given.
For NetTalk, this factor is 4.23. Using our measurement for 356 processors, 901 MCPS is equivalent to 3.8 Gigaflops (of a possible 7.1 Gigaflops).
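The 4.23 factor follows from the NetTalk topology of section 6.3, where $W = 13{,}826$ and the input layer (203 units plus the true unit) contributes $W_i = 204 \times 60 = 12{,}240$ weights to the hidden layer:

$$4 + 2\,\frac{13{,}826 - 12{,}240}{13{,}826} \approx 4.23, \qquad 901\ \text{MCPS} \times 4.23 \approx 3.8\ \text{Gigaflops}.$$

(Whether or not the true unit's 26 connections to the output layer are also counted as input-layer weights changes the factor only in the third decimal place.)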
## 8. Fault Diagnosis and Fault Tolerance
The instability of the GF11 hardware during the code development period made fault diagnosis an essential part of our simulator.
To test the interconnection network, at program startup each processor receives initial values from the host and sends these values over each switch configuration which the program will use during a run. The values arriving at each destination are collected by the host and verified.
To test the processors, the simulator can be run in a diagnostic mode during which each processor executes backprop with the same training set and initial weights. Each processor acts as if it were the only processor; there is no inter-processor communication. The host then gets the total squared error for each processor, compares the values, and approves all processors that produce the most common total error.
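A host-side check of this kind might look like the following sketch (illustrative only; the names and host interface are not the actual ones). Because every healthy processor performs identical arithmetic on identical inputs, the comparison can use exact equality:

```
/* Approve every processor whose reported total squared error matches the
 * most commonly reported value; return how many were approved. */
int approve_processors(const float err[], int ok[], int nproc)
{
    int best = 0;
    float majority = 0.0f;

    for (int i = 0; i < nproc; i++) {          /* find the most common value */
        int count = 0;
        for (int j = 0; j < nproc; j++)
            if (err[j] == err[i]) count++;
        if (count > best) { best = count; majority = err[i]; }
    }

    int approved = 0;                          /* approve only the majority  */
    for (int i = 0; i < nproc; i++) {
        ok[i] = (err[i] == majority);
        approved += ok[i];
    }
    return approved;
}
```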
Although separate microcode for computation and error detection is cleaner, it comes with a performance cost. We chose to integrate error detection with computation when there was a significant performance gain available by doing so. In the microcode routine that communicates weight changes, for example, each total weight change is tested for being in a 'reasonable' range. If a value is out of range, it is set to zero and a fault count is incremented. This simple measure allows the program to work (i.e. reduce error) in the presence of processor and bus errors which would otherwise cause huge weights which saturate units and prevent further learning. Most of this error checking can be performed in the unused cycles during inter-processor communications. 11
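The range test folded into the weight-change communication amounts to something like the sketch below (the bound and the counter name are illustrative assumptions; on GF11 the test runs in otherwise unused cycles):

```
#define DW_MAX 100.0f        /* assumed 'reasonable' bound, not the real value */

static long fault_count = 0; /* out-of-range weight changes seen so far       */

/* Clamp a summed weight change: an out-of-range value is replaced by zero so
 * that a processor or bus fault cannot saturate units and halt learning. */
static float check_weight_change(float dw)
{
    if (dw > DW_MAX || dw < -DW_MAX) {
        fault_count++;
        return 0.0f;
    }
    return dw;
}
```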
## 9. Limitations of the Simulator
Our approach to memory management restricts the size of the networks which may be simulated. The 512K words of DRAM hold training examples, weights, weight changes, and weight changes from the previous epoch. The maximum number of weights allowed is one-third of the available memory (512K minus the storage for training data). The maximum number of weights is also limited by microcode memory (512K lines). This latter limitation could be overcome if there are sets of groups of units with identical connectivity patterns. Then the address relocation mechanism can be used to map them onto the same section of microcode.
In SRAM, each unit requires one word for the net input to the unit (or the error on the backward pass) and one word for the output of the unit. The lookup tables for the sigmoid calculation and other constants require approximately 1K words. The maximum number of units allowed is therefore roughly (16K - 1K - 2 × maxbundle) / 2. This number is typically around 6000 units.
## 10. Extensions
## 10.1. Disk I/O
The GF11 disk system consists of 10 independent units which communicate with the processors over the switch (see Figure 1). Each disk can deliver 8 Mbytes/sec, for an aggregate peak rate of 80 Mbytes/sec. At this rate, the entire DRAM of each processor can be filled in less than 15 seconds, and our NetTalk data set could be loaded in a fraction of a second.

11 However, timings presented in this paper were taken with this part of the error checking turned off.
Although we designed the simulator to load all training data at program startup, the disk system is fast enough to load patterns in parallel with training. The total capacity of the disk system, 4.5 Gbytes, is large enough to support very large applications, such as speech or image processing.
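As a rough check of the 15-second figure from the memory and disk numbers above (the 4-byte word size is an assumption of this estimate):

$$\frac{566 \times 512\mathrm{K}\ \text{words} \times 4\ \text{bytes/word}}{80\ \text{Mbytes/sec}} \approx \frac{1.19\ \text{Gbytes}}{80\ \text{Mbytes/sec}} \approx 14.8\ \text{seconds}.$$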
## 10.2. Switch Communication
Although it is not possible to sum N numbers on N processors in less than O(log₂ N) time, it is possible to perform N such sums on N processors in less than O(N log₂ N) time (see Figure 12). Intuitively, it takes O(N) communications to sum N numbers, so it takes O(N²) communications to sum N sets of numbers; if all communication steps on all N processors can be fully utilized in parallel, this requires only O(N) time. Processors can pipeline the summing by sending partial sums of each weight change around a ring. On step i, processor P receives a partial sum of weightchange[(P + i + 1) mod N] from processor (P + 1) mod N and adds its own value for that element to the partial sum. After N − 1 steps, processor P has the sum of weightchange[(P + N − 1) mod N]. The sums are then circulated using another N − 1 steps.
Figure 12: Pipelined Summing with Processors Configured in a Ring
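The following sequential sketch renders the data movement of Figure 12 with an array dimension playing the role of the processors. It is purely illustrative code, not GF11 microcode; the indexing follows the description above, and the final circulation phase is omitted:

```
#include <stdio.h>

#define N 4   /* number of processors in the ring (illustrative) */

/* wc[p][b] is processor p's local value for weight-change bundle b.  After
 * the loop, wc[p][(p + N - 1) % N] holds the complete sum of that bundle
 * over all processors; a second, symmetric phase (omitted) would circulate
 * the finished sums around the ring. */
void pipelined_ring_sum(float wc[N][N])
{
    for (int step = 0; step < N - 1; step++) {
        float received[N];
        /* each processor p receives its neighbour's partial sum for
           bundle (p + step + 1) mod N ... */
        for (int p = 0; p < N; p++)
            received[p] = wc[(p + 1) % N][(p + step + 1) % N];
        /* ... and adds its own value for that bundle */
        for (int p = 0; p < N; p++)
            wc[p][(p + step + 1) % N] += received[p];
    }
}

int main(void)
{
    float wc[N][N];
    for (int p = 0; p < N; p++)
        for (int b = 0; b < N; b++)
            wc[p][b] = 1.0f;                 /* every processor contributes 1 */
    pipelined_ring_sum(wc);
    for (int p = 0; p < N; p++)              /* each complete sum should be N */
        printf("bundle %d summed on processor %d: %g\n",
               (p + N - 1) % N, p, wc[p][(p + N - 1) % N]);
    return 0;
}
```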
In practice, the constant factor for this algorithm on GF11 is not significantly better than the log of the number of processors because of the processor-dependent computation. But this extension is important in theory because it scales better with the number of processors; it indicates that a very simple processor interconnection, a ring, is adequate for our approach to parallelizing backprop, provided that processor-dependent addressing is possible.
## 10.3. Improving Performance for Small Networks
In general, our code is optimized for networks with layers large enough to fill up the 25-cycle pipeline. However, some neural network architectures for large applications have small layers, e.g., classification tasks with a few output units[10].
Speeding up computation on smaller layers requires minimizing the longest chain of dependent operations so that more operations can be interleaved. In the forward pass, 2 floating point operations per weight are executed. Multiplies are independent of each other, but all the additions to the net input of a destination form a linear dependency chain. This chain can be shortened by rearranging the operations so that the dependency graph forms a binary tree. To sum an array x of size N (assuming N is a power of 2 for simplicity),
```
for i = 0 step 1 to lg(n)-1           /* lg(n) = log2(n) levels of the tree */
    for j = 0 step 2^(i+1) to n-1     /* combine elements 2^i apart         */
        x[j] = x[j] + x[j+2^i];
```
The result is left in x[0].
## 10.4. Fault Tolerance
While we were developing the simulator described in this report, an effort was being made to rapidly increase the number of available processors in GF11. This meant that the hardware was relatively unstable, a situation which led us to consider ways in which the program could be made more robust. Although the machine is far more stable at the time of writing [David George, personal communication] 12 than it was during program development, there are a number of interesting measures which could make the program more fault tolerant.
Since we contemplate storing huge amounts of training data in RAM (i.e., up to 1.1 GBytes), there is some possibility of changes in the stored data due to transient memory errors. In order to detect such occurrences, we would like to add checksums to the training data. Such checksums could be computed very rapidly at infrequent intervals during training. As well as facilitating fault detection, checksums would provide additional utility by allowing the program to decide whether training patterns need to be reloaded following runs of other users' programs. 13 Pending a fully functioning disk I/O system, such a facility could be very useful.
The large amount of memory on each processor enables us to contemplate another form of fault tolerance: weight checkpointing to memory. Every few epochs, a copy of the network's weights could be copied to one of a number of checkpoint spaces in each processor's RAM (or, trading time against space, spread across the RAM of all processors over the switch). In the event that an error in learning is detected (e.g. by a large increase in total pattern error between epochs), one of the checkpoints in RAM could be copied back as active weights and learning restarted. This technique should make learning virtually immune to transient or intermittent hardware failures.

12 David George is the manager of the GF11 project at IBM.

13 This is not always necessary; the main switch diagnostic program for GF11, for example, only clobbers memory below that used for storing training cases by our program.
One of the problems with distributing weight changes, rather than weights, to all processors during the update cycle is that the weights may begin to take on different values on different processors. There are two ways in which this could occur. First, an arithmetic or memory failure on a processor could change a weight. Secondly, and less obviously, the order in which weight changes are added together is different on different processors. Since the numbers are floating point with a fixed representation, this can result in different totals on different processors, i.e., floating point operations are not necessarily associative. Small differences between weights on different processors can probably be tolerated over a few training epochs, but they need to be prevented from accumulating. We propose to ensure that all processors work on more or less the same network by occasionally averaging and rebroadcasting the 'reasonable' values for each weight across processors.
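A minimal sketch of this proposed resynchronization is given below (illustrative only; sum_over_switch stands in for the same switch reduction used for weight changes):

```
/* Stand-in for the switch reduction; a no-op here so the sketch compiles. */
static void sum_over_switch(float *v, int n) { (void)v; (void)n; }

/* Every few epochs, replace each processor's weights by the average of all
 * processors' copies, so that rounding differences and transient faults
 * cannot accumulate. */
void resynchronize_weights(float w[], int nweights, int nprocs)
{
    sum_over_switch(w, nweights);      /* w[i] becomes the sum over processors */
    for (int i = 0; i < nweights; i++)
        w[i] /= (float)nprocs;         /* every processor adopts the average   */
}
```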
## 10.5. Recurrent Networks
The final goal of this work was applying backpropagation training to the problem of speech recognition using recurrent (Elman) nets. The important difference between these networks and those with a conventional, feed forward structure is that unit activation levels are copied to units in earlier layers between presentations of successive patterns in a sequence. Since a unit's previous value contributes to its input, the network can learn weights that allow it to retain useful information over time.
The copying of activations between pattern presentations presents a slight difficulty for parallelization. Activations must be copied from units in the copy of the network which saw the previous Input/Output pattern in a sequence. This means that if we spread cases from sequences across processors, we would have to do the costly operation of communicating activation levels across the switch. The trick for avoiding this is to repeat the method used in Witbrock's previous vectorized Elman net simulator for the Convex C-1: parallelizing across sequences instead of across cases. This means that all copying of activation happens within the processor responsible for a particular sequence. While the actual copying of activations is trivial - a small piece of microcode is constructed from a list of (from unit, to unit) pairs at microcode generation time - parallelizing across sequences involves a couple of extra complications. First, a processor must know when a new sequence starts within its patterns, so that it can zero out the units which receive copies of activations. This is done by keeping a flag with each pattern - 1.0 if it is the continuation of an old sequence and 0.0 if it is the start of a new sequence - and multiplying the copied activations by this flag before copying into the destination units. The second change is that one cannot rely on all processors having the same number of patterns, since sequences may differ in length. Any patterns beyond the end of the last sequence on each processor must be marked with a flag indicating that no error should be back-propagated for this pattern. Again, this is done by multiplying the error by the value of this flag.
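Both flag mechanisms reduce to a few multiplies per pattern, roughly as in the sketch below (the pair list and function names are illustrative assumptions; on GF11 the copy loop is emitted as microcode from the (from, to) list at generation time):

```
typedef struct { int from, to; } CopyPair;

/* Before presenting the next pattern, copy activations into the context
 * units, or zero them when the pattern starts a new sequence
 * (continue_flag is 1.0 within a sequence, 0.0 at a sequence start). */
void copy_context(float act[], const CopyPair pairs[], int npairs,
                  float continue_flag)
{
    for (int i = 0; i < npairs; i++)
        act[pairs[i].to] = continue_flag * act[pairs[i].from];
}

/* Suppress back-propagated error for padding patterns beyond the end of a
 * processor's last sequence (valid_flag is 0.0 for padding, 1.0 otherwise). */
void gate_error(float err[], int nunits, float valid_flag)
{
    for (int i = 0; i < nunits; i++)
        err[i] *= valid_flag;
}
```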
## 11. Evaluation of Architectural Features
In building our simulator and writing this report, we have had two equally important goals. The first goal was to produce the fastest simulator possible on the hardware while maintaining enough flexibility to make it useful for our other research purposes. For this goal the chief measure of success is raw speed - millions of connections per second. The other goal was to evaluate the features of an architecture which make it suitable or unsuitable for simulating connectionist networks, not just the performance of the whole package.
We begin by analyzing alternative strategies for communication. The performance of the ring summing algorithm indicates the limitations of a single ring of processors with no processor-dependent addressing. Clearly, Figure 8 indicates that this approach does not scale well with the number of processors. The Beneš network allows more sophisticated communication which significantly improves performance. However, the pipelined ring summing algorithm presented in section 10.2 has the best asymptotic complexity. Given enough weights, the total time is independent of the number of processors. We conclude that for our approach to parallelizing backprop, a sophisticated communications network is not necessary if processor-dependent addressing is available.
Although SIMD control appears at first glance to be a limitation, we did not find any need for more complex control. In fact, we found that the single thread of control made code easy to design and debug. Moreover, the GF11 processor architecture matched our needs quite well. The balance of adds and multiplies inherent in the backprop algorithm allowed us to keep the floating point units busy, and the table lookup facility proved invaluable for the sigmoid calculation.
However, the three-level, software-controlled memory hierarchy added more to the program's complexity than any other feature of the machine. 14 How could this complexity be avoided? It could not be prevented by a hardware-controlled cache, because of the pattern of sequential accesses to large arrays. It is uncertain that compiler support would help either. Simply reducing the memory capacity in order to provide a uniform access time would place unacceptable constraints on simulation size. The best solution would be to increase the number of interleaved memory banks. It is relatively easy to arrange for weights to be accessed strictly sequentially. The bandwidth would have to be increased, but the latency would not have to be decreased.
In fact, we found a general principle for simulating large networks: bandwidth is much more important than latency. This principle applies to computation on weights, memory accesses to weights, and inter-processor communication. Even though the processor pipeline is 25 deep, our code was naturally amenable to pipeline scheduling without any deliberate restructuring.
14 We were not alone in finding this; Yurij Baransky [personal communication] reports that the memory hierarchy has presented difficult design decisions when mapping other applications to GF11.
## 12. Conclusions
This paper reports our experience in mapping the backprop algorithm onto the GF11. We feel that there are several contributions that this work makes to the enterprise of neural network simulation:
We formalized the structure of the storage access patterns during the algorithm and used this model to design an efficient mapping onto a software-controlled 3-level memory hierarchy. Although the fine details of this mapping are specific to GF11, the technique of accessing a subset of weights over several training examples is directly applicable to other machines, even machines with a hardware-controlled cache.
We measured the simulator performance empirically, and more importantly produced an analytical model of its performance. This allowed us to independently evaluate the contribution to performance on this task of the various architectural features of GF11.
Finally, we showed that although GF11 was conceived with a particular application in Quantum Chromodynamics[1] in mind, the architecture is also well-suited to other applications.
## 13. Acknowledgements
We would like to thank Scott Fahlman (CMU), Danny Sabbah (IBM), and Dave George (IBM) for initiating this project. We are very grateful to Dave George, Yurij Baransky, Jim Sexton, and Micky Tsao at IBM for their technical assistance and helpful advice, and to Michael Portuesi, Micky Tsao, and Michael Franzini for help shipping data and equipment between CMU and T.J. Watson. And we also thank the CMU Boltzmann group, Yurij Baransky, Guy Blelloch, Scott Fahlman, Alan Fisher, and Dean Pomerleau for their helpful comments on the presentation of this work.
## References
- [1] Beetem, J., Denneau, M. and Weingarten, D., 'The GF11 Parallel Computer', IBM Technical Report 12364, December, 1986.
- [2] Benes, V. 'Optimal Rearrangeable Multistage Connecting Networks', Bell System Technical Journal, vol. 43, 1641, 1964.
- [3] Blelloch, G. and Rosenberg, C.R., 'Network learning on the Connection Machine', Proceedings of the Tenth International Joint Conference on Artificial Intelligence, Milan, Italy, 1987.
- [4] Elman, J.L. Finding Structure in Time , Technical Report, Center for Research in Language, University of California, San Diego, April, 1988.
- [5] Fahlman, S.E. Faster-Learning Variations on Back-Propagation: An Empirical Study , Proceedings of the 1988 Connectionist Models Summer School , Morgan Kaufman, San Mateo. 1988.
- [6] Franzini, M.A., Witbrock, M.J., and Lee, K.-F., 'A Connectionist Approach to Continuous Speech Recognition', ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing, pp. 425-8, vol. 1, IEEE, Glasgow, U.K., May, 1989.
- [7] Gorman, P.R. and Sejnowski, T.J. 'Learned Classification of Sonar Targets Using a Massively Parallel Network', IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 36, No. 7, 1988.
- [8] Hillis, W.D. and Steele, G.L., 'Data Parallel Algorithms', Communications of the ACM, December, 1986.
- [9] Lang, K.J. and Hinton, G.E., A Time-Delay Neural Network Architecture for Speech Recognition, Technical Report CMU-CS-88-152, Carnegie Mellon University, December, 1988.
- [10] Lang, K.J. and Witbrock, M.J., 'Learning to Tell Two Spirals Apart', Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufman, San Mateo, 1988.
- [11] Leonard, R.G. 'A Database for Speaker-Independent Digit Recognition', Proceedings of the ICASSP, March, 1984.
- [12] Miikkulainen, R. and Dyer, M.G., A Modular Neural Network Architecture for Sequential Paraphrasing of Script-Based Stories , Technical Report UCLA-AI-8902, University of California, Los Angeles, February, 1989.
- [13] Plaut, D.C., Nowlan, S.J., and Hinton, G.E. Experiments on learning by back propagation , Technical Report, Carnegie Mellon University, June, 1986.
- [14] Pomerleau, D.A., 'ALVINN: An Autonomous Land Vehicle In a Neural Network', Advances in Neural Information Processing Systems I, D.S. Touretzky, ed., Morgan Kaufman, 1989.
- [15] Pomerleau, D.A., Gusciora, G.L., Touretzky, D.S., and Kung, H.T., 'Neural Network Simulation at Warp Speed: How We Got 17 Million Connections Per Second', Proceedings of IEEE International Conference on Neural Networks , 1988.
- [16] Rumelhart, D.E., Hinton, G.E., and Williams, R.J. 'Learning Internal Representations by Error Propagation', in Rumelhart, D.E., McClelland, J.L., eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I, The MIT Press, Cambridge, 1986.
- [17] Sejnowski, T.J. and Rosenberg, C. 'Parallel Networks that Learn to Pronounce English Text', Complex Systems, 1, 145-168, 1987.
- [18] Zhang, X., McKenna, M., Mesirov, J., and Waltz, D. 'An Implementation of the Backpropagation Algorithm on the Connection Machine CM-2', Proceedings of Neural Information Processing Systems 1989, in press.