## QCD on the Cell Broadband Engine
F. Belletti^a, G. Bilardi^b, M. Drochner^c, N. Eicker^{d,e}, Z. Fodor^{e,f}, D. Hierl^g, H. Kaldass^{h,i},
T. Lippert^{d,e}, T. Maurer^g, N. Meyer^{∗g}, A. Nobile^{j,k}, D. Pleiter^i, A. Schäfer^g,
F. Schifano^a, H. Simma^{i,k}, S. Solbrig^g, T. Streuer^l, R. Tripiccione^a, T. Wettig^g
Email: nils.meyer@physik.uni-regensburg.de
^a Department of Physics, University of Ferrara, 44100 Ferrara, Italy
^b Department of Information Engineering, University of Padova, 35131 Padova, Italy
^c ZEL, Research Center Jülich, 52425 Jülich, Germany
^d ZAM, Research Center Jülich, 52425 Jülich, Germany
^e Department of Physics, University of Wuppertal, 42119 Wuppertal, Germany
^f Institute for Theoretical Physics, Eötvös University, Budapest, Pázmány 1, H-1117, Hungary
^g Department of Physics, University of Regensburg, 93040 Regensburg, Germany
^h Arab Academy of Science and Technology, P.O. Box 2033, Cairo, Egypt
^i Deutsches Elektronen-Synchrotron DESY, 15738 Zeuthen, Germany
^j European Centre for Theoretical Studies ECT*, 13050 Villazzano, Italy
^k Department of Physics, University of Milano-Bicocca, 20126 Milano, Italy
^l Department of Physics and Astronomy, University of Kentucky, Lexington, KY 40506-0055, USA
We evaluate IBM's Enhanced Cell Broadband Engine (BE) as a possible building block of a new generation of lattice QCD machines. The Enhanced Cell BE will provide full support of double-precision floating-point arithmetic, including IEEE-compliant rounding. We have developed a performance model and applied it to relevant lattice QCD kernels. The performance estimates are supported by micro- and application-benchmarks that have been obtained on currently available Cell BE-based computers, such as IBM QS20 blades and the PlayStation 3. The results are encouraging and show that this processor is an interesting option for lattice QCD applications. For a massively parallel machine based on the Cell BE, an application-optimized network needs to be developed.
The XXV International Symposium on Lattice Field Theory
July 30 - August 4 2007
Regensburg, Germany
∗ Speaker.
http://pos.sissa.it/
## 1. Introduction
The initial target platform of the Cell BE was the PlayStation 3, but the processor is currently also under investigation for scientific purposes [1, 2]. It delivers extremely high floating-point (FP) performance as well as high memory and I/O bandwidths, at an outstanding price-performance ratio and with low power consumption.
We have investigated the Cell BE as a potential compute node of a next-generation lattice QCD machine. Although the double precision (DP) performance of the current version of the Cell BE is rather poor, the announced Enhanced Cell BE version (2008) will have a DP performance of ∼ 100 GFlop/s and also implement IEEE-compliant rounding. We have developed a performance model of a relevant lattice QCD kernel on the Enhanced Cell BE and investigated several possible data layouts. The applicability of our model is supported by a variety of benchmarks performed on commercially available platforms. We also discuss requirements for a network coprocessor that would enable scalable parallel computing using the Cell BE.
## 2. The Cell Broadband Engine
An introduction to the processor can be found in Ref. [3], and a schematic diagram is shown in Fig. 1. The architecture is described in detail in Ref. [4], and we only give a brief overview here.
The Cell BE comprises one PowerPC Processor Element (PPE) and 8 Synergistic Processor Elements (SPEs). In the following we will assume that performance-critical kernels are executed on the SPEs and that the PPE executes control threads. Therefore, we only consider the performance of the SPEs. Each of the dual-issue, in-order SPEs runs a single thread and has a dedicated 256 kB on-chip memory (local store = LS), which is accessible by direct memory access (DMA) or by local load/store operations to/from the 128 general-purpose 128-bit registers. An SPE can execute two instructions per cycle, performing up to 8 single-precision (SP) operations per cycle. Thus, the aggregate SP peak performance of all 8 SPEs on a single Cell BE is 204.8 GFlop/s at 3.2 GHz. 1
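This peak number follows if one assumes that each SPE sustains one 4-way SIMD fused multiply-add per cycle, i.e., 8 SP Flop per SPE and cycle:

$$8 \ \text{SPEs} \times 8 \ \frac{\text{Flop}}{\text{SPE} \cdot \text{cycle}} \times 3.2 \ \text{GHz} = 204.8 \ \text{GFlop/s} \,.$$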
Figure 1: Main functional units of the Cell BE (see Ref. [4] for details). Bandwidth values are given for a 3.2 GHz system clock.
1 Available systems use clock frequencies of 2.8 or 3.2 GHz. In our estimates we assume 3.2 GHz.
Figure 2: Data-flow paths and associated execution times T_i. For simplicity, only a single SPE is shown.
The current version of the Cell BE has an on-chip memory controller supporting dual-channel access to the Rambus XDR main memory (MM), which will be replaced by DDR2 for the Enhanced Cell BE. The configurable I/O interface supports a coherent as well as a non-coherent protocol on the Rambus FlexIO channels. 2 Internally, all units of the Cell BE are connected to the coherent element interconnect bus (EIB) by DMA controllers.
## 3. Performance model
To theoretically investigate the performance of the Cell BE, we use a refined performance model along the lines of Refs. [5, 6]. Our abstract model of the hardware architecture considers two classes of devices: (i) *Storage devices*: these store data and/or instructions (e.g., registers or LS) and are characterized by their storage size. (ii) *Processing devices*: these act on data (e.g., FP units) or transfer data/instructions from one storage device to another (e.g., DMA controllers, buses, etc.) and are characterized by their bandwidths β_i and startup latencies λ_i.
An application algorithm, implemented on a specific machine, can be broken down into different computational micro-tasks which are performed by the processing devices of the machine model described above. The execution time T_i of each task i is estimated by a linear ansatz

$$T_i = \lambda_i + \frac{I_i}{\beta_i} \,, \quad\quad (3.1)$$

where I_i quantifies the information exchange, i.e., the processed data in bytes.
Assuming that all tasks are running concurrently at maximal throughput and that all dependencies (and latencies) are hidden by suitable scheduling, the total execution time is
$$T_{\rm exe} \approx \max_i T_i \,. \quad\quad (3.2)$$
We denote by T_peak the minimal compute time for the FP operations of an application that could be achieved with an ideal implementation (i.e., saturating the peak FP throughput of the machine, assuming also perfect matching between its instruction set architecture and the computation). The floating-point efficiency ε_FP for a given application is then defined as ε_FP = T_peak / T_exe.
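To illustrate how such estimates are combined, here is a minimal sketch in C of the model of Eqs. (3.1) and (3.2); all device parameters and task sizes below are hypothetical placeholders, not the values used in this work:

```c
#include <stdio.h>

/* One micro-task in the linear ansatz of Eq. (3.1):
   T_i = lambda_i + I_i / beta_i                      */
typedef struct {
    const char *name;
    double lambda;  /* startup latency in cycles      */
    double beta;    /* bandwidth in bytes per cycle   */
    double bytes;   /* information exchange I_i       */
} task_t;

static double task_time(const task_t *t)
{
    return t->lambda + t->bytes / t->beta;
}

int main(void)
{
    /* hypothetical numbers, for illustration only */
    const task_t tasks[] = {
        { "T_RF",  0.0, 16.0, 4096.0 },
        { "T_mem", 0.0,  8.0, 4096.0 },
        { "T_ext", 0.0,  2.0,  512.0 },
    };
    const double t_peak = 330.0;  /* e.g. cycles per site, cf. Sec. 5 */
    double t_exe = 0.0;

    /* Eq. (3.2): with perfect overlap, T_exe is the largest T_i */
    for (int i = 0; i < 3; i++) {
        double ti = task_time(&tasks[i]);
        if (ti > t_exe)
            t_exe = ti;
        printf("%-6s %8.1f cycles\n", tasks[i].name, ti);
    }
    printf("eps_FP = %.1f%%\n", 100.0 * t_peak / t_exe);
    return 0;
}
```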
In our analysis, we have estimated the execution times T_i for data processing and transport along all data paths indicated in Fig. 2, in particular:
2 In- and outbound bandwidths will be symmetric on the Enhanced Cell BE, namely 25.6 GB/s each.
- floating-point operations, T_FP
- load/store operations between the register file (RF) and the LS, T_RF
- off-chip memory access, T_mem
- internal communications between SPEs on the same Cell BE, T_int
- external communications between different Cell BEs, T_ext
- transfers via the EIB (memory access, internal and external communications), T_EIB
Unless stated otherwise, all hardware parameters β_i are taken from the Cell BE manuals [4].
## 4. Linear algebra kernels
As a simple application of our performance model and to verify our methodology, we analyzed various linear algebra computations. As an example, we discuss here only a caxpy operation, ψ′ → c · ψ + ψ′, with complex c and complex spin-color vectors ψ and ψ′. If the vectors are stored in main memory (MM), the memory bandwidth dominates the execution time, T_exe ≈ T_mem, and limits the FP performance of the caxpy kernel to ε_FP ≤ 4.1%. On the other hand, if the vectors are held in the LS, arithmetic operations and LS access are almost balanced (T_peak / T_LS = 2/3). In this case, a more precise estimate of T_FP also takes into account constraints from the instruction set architecture of the Cell BE for complex arithmetic and yields a theoretical limit of ε_FP ≤ 50%.
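For reference, a plain scalar C version of this kernel is sketched below (an actual SPE implementation would use SIMD intrinsics and software pipelining, which we do not show here):

```c
#include <complex.h>
#include <stddef.h>

/* A spin-color vector has 4 x 3 = 12 complex components per site
   (24 real words, as counted in Sec. 5).                         */
enum { SPIN_COLOR = 12 };

/* caxpy: psi_out <- c * psi + psi_out, component by component */
void caxpy(double complex c,
           const double complex *psi,
           double complex *psi_out,
           size_t sites)
{
    for (size_t s = 0; s < sites * SPIN_COLOR; s++)
        psi_out[s] = c * psi[s] + psi_out[s];
}
```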
We have verified the predictions of our theoretical model by benchmarks on several hardware systems (Sony PlayStation 3, IBM QS20 Blade Server, and Mercury Cell Accelerator Board). In both cases (data in MM and data in LS) the theoretical time estimates are well reproduced by the measurements. Careful optimization of the arithmetic operations 3 is required only when all data are kept in the LS (or, in general, if T_exe ≈ T_FP).
## 5. Lattice QCD kernel
The Wilson-Dirac operator is the kernel most relevant for the performance of lattice QCD codes. We considered the computation of the 4-d hopping term
$$\psi'_x = \sum_{\mu=1}^{4} \left\{ U_{x,\mu} \, (1+\gamma_\mu) \, \psi_{x+\hat\mu} + U^{\dagger}_{x-\hat\mu,\mu} \, (1-\gamma_\mu) \, \psi_{x-\hat\mu} \right\} \,, \quad\quad (5.1)$$
where x = (x_1, x_2, x_3, x_4) is a 4-tuple of space-time coordinates labeling the lattice sites, ψ′_x and ψ_x are complex spin-color vectors assigned to the lattice site x, and U_{x,μ} is an SU(3) color matrix assigned to the link from site x in direction μ̂.
The computation of Eq. (5.1) on a single lattice site amounts to 1320 floating-point operations. 4 On the Enhanced Cell BE this yields T_peak = 330 cycles per site (in DP). However, the implementation of Eq. (5.1) requires at least 840 multiply-add operations and T_FP ≥ 420 cycles per lattice site to execute. Thus, any implementation of Eq. (5.1) cannot exceed 78% of the peak performance of the Cell BE.
3 We implemented our benchmarks of arithmetic operations in single precision. However, the theoretical analysis presented here refers to double precision on the Enhanced Cell BE.
4 We do not include sign flips and complex conjugation in the FLOP counting.
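These cycle counts follow if one assumes that each SPE of the Enhanced Cell BE performs one 2-way SIMD DP multiply-add per cycle, i.e., 4 DP Flop per SPE and cycle (consistent with the ∼ 100 GFlop/s quoted in Sec. 1):

$$T_{\rm peak} = \frac{1320 \ \text{Flop}}{4 \ \text{Flop/cycle}} = 330 \ \text{cycles} \,, \qquad T_{\rm FP} \geq \frac{840 \ \text{madds}}{2 \ \text{madds/cycle}} = 420 \ \text{cycles} \,, \qquad \frac{T_{\rm peak}}{T_{\rm FP}} \approx 78\% \,.$$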
The time spent on possible remote communications and on load/store operations for the operands (9 × 12 + 8 × 9 complex numbers) of the hopping term (5.1) strongly depends on the details of the lattice data layout. We assign to each Cell BE a local lattice with V_Cell = L_1 × L_2 × L_3 × L_4 sites, and the 8 SPEs are logically arranged as s_1 × s_2 × s_3 × s_4 = 8. Thus, each single SPE holds a subvolume of V_SPE = (L_1/s_1) × (L_2/s_2) × (L_3/s_3) × (L_4/s_4) = V_Cell / 8 sites. Each SPE on average has A_int neighboring sites on other SPEs within the same Cell BE and A_ext neighboring sites outside a Cell BE.
We consider a communication network with the topology of a 3-d torus. We assume that the 6 inbound and the 6 outbound links can transfer data simultaneously, each at a bandwidth of β_link = 1 GB/s, and that a bidirectional bandwidth of β_ext = 6 GB/s is available between each Cell BE and the network. This could be realized by attaching an efficient network controller via the FlexIO interface. We have investigated different strategies for the lattice and data layout: either all data are kept in the on-chip local store of the SPEs, or the data reside in off-chip main memory.
## Data in on-chip memory (LS)
We require that all data for a compute task can be kept in the LS of the SPEs. Since loading all data into the LS at startup is time-consuming, the compute task should comprise a sizable fraction of the application code. In QCD this can be achieved, e.g., by implementing an entire iterative solver with repeated computation of Eq. (5.1). Apart from data, the LS must also hold a minimal program kernel, the run-time environment, and intermediate results. Therefore, the storage requirements strongly constrain the local lattice volumes V_SPE and V_Cell.
The storage requirement of a spinor field ψ_x is 24 real words (192 Byte in double precision) per site, while a gauge field U_{x,μ} needs 18 words (144 Byte) per link. Assuming that for a solver we need storage corresponding to 8 spinors and 3 × 4 links per site, the subvolume carried by a single SPE cannot be larger than about V_SPE = 79 lattice sites. Moreover, one lattice dimension, say the 4-direction, must be distributed locally within the same Cell BE across the SPEs (logically arranged as a 1³ × 8 grid). Then, L_4 corresponds to a global lattice extension and, as a pessimistic assumption, may be as large as L_4 = 64. This yields a very asymmetric local lattice 5 with V_Cell = 2³ × 64 and V_SPE = 2³ × 8.
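As a rough consistency check of this bound (ignoring the LS space needed for code, run-time environment, and intermediate results):

$$\frac{256 \ \text{kB}}{8 \times 192 \ \text{B} + 12 \times 144 \ \text{B}} = \frac{262144 \ \text{B}}{3264 \ \text{B/site}} \approx 80 \ \text{sites per SPE} \,,$$

so once a few kB of LS are set aside for the kernel and buffers, about 79 sites remain.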
## Data in off-chip main memory (MM)
When all data are stored in MM, there are no a priori restrictions on V_Cell. On the other hand, we need to minimize redundant memory accesses to reload the operands of Eq. (5.1) into the LS when sweeping through the lattice. To also allow for concurrent FP computation and data transfers (to/from MM or remote SPEs), we consider a multiple buffering scheme 6 (sketched in code below). A possible implementation of such a scheme is to compute the hopping term (5.1) on a 3-d slice of the local lattice and then move the slice along the 4-direction. Each SPE stores all sites along the 4-direction, and the SPEs are logically arranged as a 2³ × 1 grid to minimize internal and to balance external communications between SPEs. If the U- and ψ-fields associated with all sites of three 3-d slices can be kept in the LS at the same time, all operands in Eq. (5.1) are available in the LS. This optimization requirement again constrains the local lattice size, now to V_Cell ≈ 800 × L_4 sites.
5 When distributed over 4096 Cell BEs, this corresponds to a global lattice size of 32³ × 64.
6 In multiple buffering schemes several buffers are used in an alternating fashion to either process or load/store data. This requires additional storage (here in the LS) but allows for concurrent computation and data transfer.
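A minimal sketch of such a multiple-buffering sweep along the 4-direction is given below. The helpers `dma_get`, `dma_put`, `dma_wait`, and `hopping_term` are hypothetical placeholders (a real SPE code would use the MFC DMA primitives and would also pass the neighboring slices needed by Eq. (5.1)); the point is only to show how loads, stores, and computation can overlap:

```c
#define NBUF 3  /* three 3-d slices kept in the LS at the same time */

/* stand-in for the U and psi fields of one 3-d slice (hypothetical) */
typedef struct { double data[1]; } slice_t;

/* hypothetical DMA and compute helpers, declared only for this sketch */
extern void dma_get(slice_t *dst, int t, int tag);        /* MM -> LS */
extern void dma_put(const slice_t *src, int t, int tag);  /* LS -> MM */
extern void dma_wait(int tag);
extern void hopping_term(slice_t *out, const slice_t *in, int t);

void sweep_4_direction(int L4)
{
    static slice_t in[NBUF], out[NBUF];

    dma_get(&in[0], 0, 0);                 /* prefetch the first slice */
    for (int t = 0; t < L4; t++) {
        int cur = t % NBUF, nxt = (t + 1) % NBUF;
        if (t + 1 < L4)
            dma_get(&in[nxt], t + 1, nxt); /* start loading slice t+1  */
        dma_wait(cur);                     /* wait only for slice t    */
        hopping_term(&out[cur], &in[cur], t);
        dma_put(&out[cur], t, cur);        /* write result back to MM  */
    }
    for (int b = 0; b < NBUF; b++)
        dma_wait(b);                       /* drain outstanding DMAs   */
}
```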
Table 1: Comparison of the theoretical time estimates T_i (in 1000 SPE cycles) for some micro-tasks arising in the computation of Eq. (5.1) for different lattice data layouts: keeping data either in the on-chip LS (left part) or in the off-chip MM (right part). The first rows indicate the corresponding numbers of neighbor sites A_int and A_ext. Estimated efficiencies, ε_FP = T_peak / max_i T_i, are shown in the last row.
| data in on-chip LS | | data in off-chip MM | | | |
|---|---|---|---|---|---|
| V_Cell | 2 × 2 × 2 × 64 | L_1 × L_2 × L_3 | 8 × 8 × 8 | 4 × 4 × 4 | 2 × 2 × 2 |
| A_int | 16 | A_int / L_4 | 48 | 12 | 3 |
| A_ext | 192 | A_ext / L_4 | 48 | 12 | 3 |
| T_peak | 21 | T_peak / L_4 | 21 | 2.6 | 0.33 |
| T_FP | 27 | T_FP / L_4 | 27 | 3.4 | 0.42 |
| T_RF | 12 | T_RF / L_4 | 12 | 1.5 | 0.19 |
| T_mem | – | T_mem / L_4 | 61 | 7.7 | 0.96 |
| T_int | 2 | T_int / L_4 | 5 | 1.2 | 0.29 |
| T_ext | 79 | T_ext / L_4 | 20 | 4.9 | 1.23 |
| T_EIB | 20 | T_EIB / L_4 | 40 | 6.1 | 1.06 |
| ε_FP | 27% | ε_FP | 34% | 34% | 27% |
The predicted execution times for some of the micro-tasks considered in our model are given in Table 1 for both data layouts and for reasonable choices of the local lattice size. If all data are kept in the LS, the theoretical efficiency of about 27% is limited by the communication bandwidth (T_exe ≈ T_ext). This is also the limiting factor for the smallest local lattice with data kept in MM, while for larger local lattices the memory bandwidth becomes the limiting factor (T_exe ≈ T_mem).
We have performed hardware benchmarks with the same memory access pattern as Eq. (5.1), using the above multiple buffering scheme for data from MM. We found that the execution times were at most 20% higher than the theoretical predictions for T_mem.
## 6. Performance model and benchmarks for DMA transfers
DMA transfers determine T_mem, T_int, and T_ext, and their optimization is crucial to exploit the performance of the Cell BE. Our analysis of detailed micro-benchmarks, e.g., for LS-to-LS transfers, shows that the linear model of Eq. (3.1) does not accurately describe the execution time of DMA operations with arbitrary size I and address alignment. We therefore refined our model to take into account the fragmentation of data transfers, as well as the source and destination addresses, A_s and A_d, of the buffers:
$$T_{\rm DMA}(I, A_s, A_d) = \lambda^0 + \lambda^a \cdot N_a(I, A_s, A_d) + N_b(I, A_s) \cdot \frac{128 \ \text{Byte}}{\beta} \quad\quad (6.1)$$
Each LS-to-LS DMA transfer has a latency of λ⁰ ≈ 200 cycles (from startup and wait for completion). The DMA controllers fragment all transfers into N_b 128-byte blocks aligned at LS lines (each corresponding to a single EIB transaction). When δA = A_s − A_d is a multiple of 128, the source LS lines can be mapped directly onto the destination LS lines. Then we have N_a = 0, and the effective bandwidth β_eff = I / (T_DMA − λ⁰) is approximately the peak value. Otherwise, if the alignments do not match (δA not a multiple of 128), an additional latency of λ^a ≈ 16 cycles is introduced for each transferred 128-byte block, reducing β_eff by about a factor of two.
Figure 3: Execution time of LS-to-LS copy operations as a function of the transfer size. In the left panel source and destination addresses are aligned, while in the right panel they are misaligned. Filled diamonds show the measured values on an IBM QS20 system. Dashed and full lines correspond to the theoretical prediction from Eq. (3.1) and Eq. (6.1), respectively.
Fig. 3 illustrates how clearly these effects are observed in our benchmarks and how accurately they are described by Eq. (6.1).
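Written out as code, one plausible reading of Eq. (6.1) looks as follows. The precise definitions of N_a and N_b, as well as the peak bandwidth β, are our assumptions here: we take N_b as the number of 128-byte LS lines touched by the source buffer, N_a = N_b whenever δA is not a multiple of 128, and β = 8 bytes/cycle (25.6 GB/s at 3.2 GHz), which reproduces the factor-of-two loss quoted above:

```c
/* Estimated DMA transfer time in cycles, following our reading of Eq. (6.1). */
double t_dma(unsigned long bytes, unsigned long a_src, unsigned long a_dst)
{
    const double lambda0  = 200.0; /* startup + wait for completion         */
    const double lambda_a =  16.0; /* extra latency per misaligned block    */
    const double beta     =   8.0; /* assumed peak bandwidth, bytes/cycle   */

    /* N_b: number of 128-byte LS lines touched by the source buffer */
    unsigned long n_b = (a_src % 128 + bytes + 127) / 128;

    /* N_a: alignment penalty for every block if dA = A_s - A_d is not
       a multiple of 128 (our assumption on how N_a is counted)       */
    unsigned long n_a = ((a_src - a_dst) % 128 == 0) ? 0 : n_b;

    return lambda0 + lambda_a * (double)n_a + (double)n_b * 128.0 / beta;
}
```

With aligned buffers this gives 16 cycles per 128-byte block on top of λ⁰; with misaligned buffers it gives 32 cycles per block, i.e., roughly half the effective bandwidth.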
## 7. Conclusion and outlook
Our performance model and hardware benchmarks indicate that the Enhanced Cell BE is a promising option for lattice QCD. We expect that a sustained efficiency above 20% of peak performance can be obtained on large machines. A refined theoretical analysis, e.g., taking into account latencies, as well as benchmarks with complete application codes are desirable to confirm our estimate. Strategies to optimize code and data layout can be studied rather easily, but require some effort to implement.
Since currently there is no suitable southbridge for the Cell BE to enable scalable parallel computing, we plan to develop a network coprocessor that allows us to connect Cell BE nodes in a 3-d torus with nearest-neighbor links. This network coprocessor should provide a bidirectional bandwidth of 1 GB/s per link, for a total bidirectional network bandwidth of 6 GB/s, and perform remote LS-to-LS copy operations with a latency of order 1 µs. Pending funding approval, this development will be pursued in collaboration with the IBM Development Lab in Böblingen, Germany.
## References
- [1] S. Williams et al., The Potential of the Cell Processor for Scientific Computing , Proceedings of the 3rd conference on Computing frontiers (2006) 9, DOI 10.1145/1128022.1128027
- [2] A. Nakamura, Development of QCD-code on a Cell machine , PoS(LAT2007)040
- [3] H.P. Hofstee et al., Cell Broadband Engine technology and systems , IBM J. Res. & Dev. 51 (2007) 501
- [4] http://www.ibm.com/developerworks/power/cell
- [5] G. Bilardi et al., The Potential of On-Chip Multiprocessing for QCD Machines , Springer Lecture Notes in Computer Science 3769 (2005) 386
- [6] N. Meyer, A. Nobile and H. Simma, Performance Estimates on Cell, internal reports and talk at the Cell Cluster Meeting, Jülich 2007, http://www.fz-juelich.de/zam/datapool/cell/Lattice_QCD_on_Cell.pdf