## FeCaffe : FPGA-enabled Caffe with OpenCL for Deep Learning Training and Inference on Intel Stratix 10
Ke He, Bo Liu^, Yu Zhang, Andrew Ling* and Dian Gu
IoTG Vision Market Channel PRC, ^Flex Services and *Programmable Solution Group of Intel Corporation {harvey.he | bo.a.liu | richard.yu.zhang | andrew.ling | penny.gu}@intel.com
## ABSTRACT
Deep learning and Convolutional Neural Network (CNN) have becoming increasingly more popular and important in both academic and industrial areas in recent years cause they are able to provide better accuracy and result in classification, detection and recognition areas, compared to traditional approaches. Currently, there are many popular frameworks in the market for deep learning development, such as Caffe, TensorFlow, Pytorch, and most of frameworks natively support CPU and consider GPU as the mainline accelerator by default. FPGA device, viewed as a potential heterogeneous platform, still cannot provide a comprehensive support for CNN development in popular frameworks, in particular to the training phase. In this paper, we firstly propose the FeCaffe , i.e. FPGA-enabled Caffe, a hierarchical software and hardware design methodology based on the Caffe to enable FPGA to support mainline deep learning development features, e.g. training and inference with Caffe. Furthermore, we provide some benchmarks with FeCaffe by taking some classical CNN networks as examples, and further analysis of kernel execution time in details accordingly. Finally, some optimization directions including FPGA kernel design, system pipeline, network architecture, user case application and heterogeneous platform levels, have been proposed gradually to improve FeCaffe performance and efficiency. The result demonstrates the proposed FeCaffe is capable of supporting almost full features during CNN network training and inference respectively with high degree of design flexibility, expansibility and reusability for deep learning development. Compared to prior studies, our architecture can support more network and training settings, and current configuration can achieve 6.4x and 8.4x average execution time improvement for forward and backward respectively for LeNet.
## KEYWORDS
FeCaffe, FPGA, Deep Learning, CNN, Training, Inference, Caffe, OpenCL, Heterogeneous Platform, Stratix 10
## 1 Introduction
Deep learning has becoming increasingly more popular and drawn huge attention in both academic and industrial areas in recent years. The Convolutional Neural Network (CNN), as the subset of deep learning, has already demonstrated the capability for higher accuracy in classification, detection and recognition areas, compared to traditional computer vision methods, and thus it has been widely applied to commercial markets, e.g. digital secure surveillance, retail, industrial areas etc.
With rapid development of deep learning and CNN technology, the framework also has gained sufficient attention and investments to improve and develop as well. The development of CNN network is a sophisticated and systematic process, and it usually contains dataset preparation, pre/post processing, training, validation, and acceleration with heterogeneous platforms etc. All of these actions are required by using the framework so that deep learning algorithm developers can focus on algorithm development only with ease.
Currently, most of popular deep learning frameworks natively support Central Processing Unit (CPU) and consider the Graphic Processing Unit (GPU) as the accelerator by default. Field Programmable Gate Array (FPGA) viewed as another potential device by nature for heterogeneous platform acceleration, the development approach is still comparatively sophisticated, and thus cannot be comprehensively supported by popular frameworks for CNN development, especially in terms of training.
In this paper, in order to improve such a situation to some extent, we propose FeCaffe , i.e. FPGA-enabled Caffe framework with OpenCL and provide some contributions as follows:
- Seamlessly integrate FPGA into Caffe framework to perform CNN network training. To our best knowledge, it is the first time to enable FPGA to provide training features for popular networks and support entire training process and various training settings with Caffe.
- Introduce hierarchical software and hardware architectures in details, and the proposed approach has potential to expand to other OpenCL-backend frameworks due to OpenCL portability and generality.
- The proposed FeCaffe has high degree of design flexibility in terms of novel function development and integration, achieving same fine-grained level with GPU. Following the proposed approach, users can develop new kernels and integrate them into the FeCaffe conveniently. Moreover, various FPGA processing architectures can also be integrated to improve computation efficiency as required.
- Compared to prior work, the proposed approach is able to support more CNN network topologies, training-related settings, and provide better expansibility and ease of use[8] [9] . With regard to the performance, current configuration can achieve 6.4x and 8.4x average improvement for forward and backward respectively for LeNet under same testing conditions [8] .
- First time to support SqueezeNet and GoogLeNet training process with default or customized training settings on FPGA, in supporting multiple loss function definitions. In addition, we also firstly provide the benchmark of training an epoch based on ImageNet 2012 training and validation dataset for SqueezeNet and GoogLeNet respectively.
The rest of this paper is organized as follows. In Section 2, Caffe framework, FPGA OpenCL development, and deep learning with FPGA are introduced respectively. Section 3 describes the design methodology, including hierarchical software and hardware architecture, memory synchronization mechanism. Section 4 presents the result and Section 5 provides the analysis and optimization directions accordingly. Eventually, this paper is concluded in Section 6.
## 2 Related Work
## 2.1 Caffe Framework
Among deep learning frameworks in the market, Caffe, standing for Convolutional Architecture for Fast Feature Embedding, has been viewed as one of the most popular and important deep learning frameworks [1] . Original Caffe natively supports operations on CPU with a number of libraries, e.g. Basic Linear Algebra Subprograms (BLAS) and Math Kernel Library (MKL), and also NVidia GPU as the default accelerator with Compute Unified Device Architecture (CUDA) programming or CUDA Deep Neural Network (CuDNN) library. Some classical and well-known CNN networks, e.g. AlexNet, VGG, GoogLeNet, SqueezeNet, etc., were developed and further widely applied in many applications and scenarios by using Caffe[4] [5] [6] [7] .
## 2.2 OpenCL and FPGA Development
Register Transfer Level (RTL) coding, e.g. Verilog and VHSIC Hardware Description Language (VHDL), has been considered as the conventional FPGA development languages for a long history. It is a hardware-oriented and efficient approach, but requires massive engineering development efforts and comprehensive underlying details of FPGA circuit and design flow skills, e.g. synthesis, placement and routing to achieve a good result in terms of performance and timing. In addition, conventional FPGA development flow does not have a friendly simulation environment, especially for algorithm development. With the increment of size and complexity, in particular to the deep learning and CNN applications, those disadvantages of RTL designs are becoming increasingly more obvious. In order to address these pain points, FPGA vendors provide high-level language design methodology and tools for FPGA development, such as High-level Language Synthesis (HLS) and OpenCL [22] [22] [24] [25] .
OpenCL is public standard with data and task parallel programming models, initially proposed by Khronos Group , especially for parallel acceleration on heterogonous platforms, e.g. GPUs, CPUs and FPGAs. The OpenCL design flow has two design stages: kernel and host development. Host part development is mainly used for device initialization, setup, managing memory allocation, and coordinating kernel behaviors. In this work, we refer to the host code as runtime functions and further divide the runtime into two groups: kernel-related and common runtime. The purpose of common runtime is to create context, command queue, program, and memory allocation while the kernel-related runtime focuses on kernel argument configuration, debug, profiling, launch and release. Kernel development means to develop offload functions and tune performance on various devices, and has two approaches: NDRange and single work-item . NDRange is the default execution model for OpenCL kernel development, and employs a number of build-in functions to complete mapping of algorithm to massive work-items execution concurrently. Single work-item is another design philosophy that is hardware-oriented methodology, achieving maximum throughput and optimization by using more flexible optimization directives, FPGA native components and deeper processing pipeline. Optimizing the system and performance with NDRange design approach is hardware-agnostic, by tuning parameters of group size or compute units, and thus it is general and universal for various devices because the compilers can manage resource and adjustment for each device automatically. On the contrary, single work-item optimization heavily relies on complier tools and specific hardware architectures provided by various device vendors, and users' skillset as well.
## 2.3 CNN Inference with FPGA
Due to rapid growth in wide application areas, there have been a number of research studies based on FPGA for deep learning and CNN applications[11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] . Authors in these papers have demonstrated FPGAs are able to achieve impressive benchmarks for some popular CNN networks on Intel and Xilinx devices with HLS, OpenCL and RTL design methodologies respectively. In general, they concentrate on the network inference efficiency, and thus defined their own processing architectures and pipelines with FPGA dedicated DSP blocks, distributed and BRAMs, to realize key CNN processing operations, e.g. convolution, pooling, in parallel with tiles simultaneously. Besides, some optimization technologies, for example, fixed-point quantization, low precision, and data transformation, e.g. Winograd and Fast Fourier Transform (FFT) have also been considered as the preconditions to realize such significantly competitive benchmarks on FPGAs, compared to GPUs and CPUs[14] [19] [20] [21] . Those low precision data type, e.g. Int8, is able to increase DSP efficiency, decrease the weight, intermediate and final result storage based on FPGA limited onchip memory, and DDR bandwidth. The core of processing architectures is to utilize a number of cascaded DSP blocks to perform convolution or matrix multiplication in parallel in several dimensions. In general, the data movement path and processing architecture mechanisms are well-optimized and fixed down to achieve impressive results in terms of throughput, performance per power and energy efficiency, compared to CPUs and GPUs respectively. In addition, there are a few of studies on FPGA CNN inference with Caffe framework. Authors in [10] and [16] presented both hardware architectures and software approaches for Caffe invocation and provided benchmarks on FPGA inference with Caffe for AlexNet, VGG, GoogLeNet and YOLO-v2 respectively.
## 2.4 CNN Training with FPGA
Compared to the inference with FPGA, the research of FPGA training is relative limited, and there are only two studies providing the implementation details and benchmarks. Authors in [8] proposed the pipeline structure of convolution and pooling layers and benchmarked the training process with two FPGA boards on LeNet. A much more aggressive approach based on FPGA clusters for CNN training has been proposed in [9] , and 15 FPGAs, scaling up to 83 FPGAs at most, are used for training AlexNet, VGG-16 and VGG-19 respectively. They both employ multiple FPGA boards for CNN training, and create dedicated processing pipeline with fixed weight update mechanism. In addition, they both utilize customized runtime with low-level network configuration parameters and hardware constraints as the software control during the training, resulting in further limitation on CNN training usage.
## 2.5 Motivation
Most of previous studies only focus on CNN inference with FPGA while the training process of deep learning on FPGA has gained little attention. Among the FPGA-enable inference designs, the trend is to design the most efficient and dedicated processing architectures with fine-tuned and well-designed data buffer and reuse mechanism, e.g. data sharing, weight sharing or even hybrid, for one or some types of classical network topologies. This kind of design philosophy leads to the maximum FPGA throughput and efficiency for inference, but have to suffer from flexibility and adaptation problems in some practical CNN scenarios. For the inference structure, it is usually difficult and time consuming to insert new developed functions or primitives into the welloptimized pipelines, resulting in slow time to market and many development efforts for FPGA-based CNN solutions. CNN training with FPGA is more challenging than inference in terms of not only hardware designs and utilizations, but also software development. Due to high development barrier and a large amount of engineering efforts, only a few of studies are able to provide FPGA approaches with Caffe for inference and thus they are not complete approaches because training parts are excluded. There are still some gaps providing more functions and flexibility with FPGA for deep learning development, compared to GPUs. In summary, FPGAbased architectures have obvious limitations and gaps in terms of flexibility, customization and convenience for CNN training and inference development
Considering all of factors discussed above, we propose the FeCaffe in this paper, and make the contributions as discussed previously. This study is a more comprehensive approach and constitutes an extension to conventional CNN development, which often considers GPUs and CPUs, and also creates more feasibilities and choices for deep learning development based on FPGA-related heterogeneous platforms.
## 3 Design Methodology
## 3.1 Caffe Architecture
Conventional Caffe framework structure is illustrated in Figure 1, we divide the whole hierarchy into six layers: from network application level to hardware device layer. Note that we only describe some hierarchical functions that are related to hardware devices and CNN operation layers because Caffe framework also has a larger number of components on debug and logging, database I/O processing, and protobuf parsing, etc. Those components can often be reused with almost no change for various Caffe variants.
Figure 1 Hierarchical Architecture of Caffe Framework
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Diagram: Caffe Architecture
### Overview
The image is a block diagram illustrating the architecture of the Caffe deep learning framework. It shows the flow of data and operations through different layers, from network models to hardware devices. The diagram is structured in a hierarchical manner, with distinct layers representing different components of the system.
### Components/Axes
The diagram is organized into the following layers, from top to bottom:
1. **Network:** Contains network models.
2. **Frontend Interface:** Provides interfaces for interacting with the Caffe library.
3. **Caffe Library:** Core components of the Caffe framework.
4. **Backend Interface:** Connects the Caffe library to different backend implementations.
5. **Kernel and Library:** Provides optimized kernels and libraries for computation.
6. **Device:** Represents the hardware devices used for computation.
The diagram uses colored blocks to represent different components:
* Purple: Network models
* Green: Software components
* Orange: Hardware components
* Yellow: Caffe library components
* Blue dashed lines: Indicate the flow of data and dependencies between components.
### Detailed Analysis
**Network Layer:**
* AlexNet (Purple block, top-left)
* VGG (Purple block, top-center)
* GoogLeNet (Purple block, top-right)
* Ellipsis (...) indicates that there are more network models.
**Frontend Interface Layer:**
* C++ (Green block, left)
* PyCaffe (Green block, right)
**Caffe Library Layer:**
* Layer/Solver (Yellow block, left)
* Device (Yellow block, center)
* SyncedMem (Yellow block, right)
* Ellipsis (...) indicates that there are more components.
**Backend Interface Layer:**
* Greentea (OpenCL) (Orange block, left)
* AMD (OpenCL) (Orange block, center-left)
* Caffe (Native) (Green block, center)
* Caffe (CUDA) (Green block, center-right)
* FPGA (OpenCL) (Blue block, right)
**Kernel and Library Layer:**
* ViennaCL BLAS (Orange block, left)
* CLBLAS (Orange block, center-left)
* MKL (Green block, center)
* cuBLAS (Green block, center-right)
* cuDNN (Green block, right-center)
* Customized Kernels (Blue block, right)
**Device Layer:**
* Intel GPU (Orange block, left)
* AMD GPU (Orange block, center-left)
* CPU (Green block, center)
* NV GPU (Green block, center-right)
* FPGA (Blue block, right)
### Key Observations
* The diagram shows a clear hierarchical structure, with data flowing from high-level network models down to low-level hardware devices.
* The Caffe framework supports multiple frontends (C++ and PyCaffe), backends (OpenCL, Native, CUDA, FPGA), and hardware devices (Intel GPU, AMD GPU, CPU, NV GPU, FPGA).
* The diagram highlights the modularity of the Caffe framework, with different components that can be combined and configured to support a wide range of deep learning applications.
### Interpretation
The diagram illustrates the architecture of the Caffe deep learning framework, emphasizing its modularity and support for diverse hardware and software configurations. The flow of data and operations is clearly depicted, showing how network models are processed through different layers of the framework and ultimately executed on hardware devices. The diagram demonstrates the flexibility of Caffe, allowing users to choose the most appropriate backend and hardware for their specific needs. The presence of multiple options at each layer (e.g., different network models, frontends, backends, and hardware devices) indicates that Caffe is designed to be adaptable and customizable.
</details>
(Green: native Caffe provided; Red: GPU-based OpenCL variants provided; Orange: Hardware-related functions; Blue: Our work)
Training and inference for various networks can be performed by calling for C++ and Python interface with native Caffe support. Either C++ or Python can call for the Caffe libraries consisting of a large number of defined classes , e.g. layer , device , syncedmem etc. For the CPU path, it can start from either C++ or Python interface and invoke some math functions or perform some operations directly defined by the layer functions with C++. Those math functions further calls for MKL or BLAS libraries, and finally maps to CPU device. Similarly, GPU approach goes CUDA interface and invokes the math functions optimized by cuBLAS or cuDNN, and some CUDA functions for layer operations. Some hardware-related classes, e.g. device and syncedmem, are mainly used for GPU device and memory management.
Due to open source of Caffe framework and community contributions, a number of OpenCL-based variants, stemmed from native Caffe, have been proposed and maintained, and thus they are capable of supporting more heterogeneous platforms, e.g. AMD
and Intel integrated GPUs, among which, two well-known representives are analyzed in this paper [2] [3] . Author in [2] proposed an OpenCL-based interface mechanism named greentea, and, provided the CNN acceleration library with OpenCL by leveraging official CLBLAS and Vienna CLBLAS libraries. With good compatibility of those libraries, the greentea is able to support CNN activities within Caffe on Intel integrated GPUs and AMD GPUs, i.e. the Greentea path in Figure 1. Similarly, another branch maintained by AMD proposed their backend interface hierarchy and kernel designs to support CNN operations deployed and optimized on AMD GPU devices, i.e. AMD path in Figure 1. It is important to note that some hardware-sensitive classes or functions, e.g. operation layers, device, syncedmem, highlighted by orange in Figure 1, also require significant modifications to support new devices even following the same OpenCL development flow.
## 3.2 Kernel-related Layers
Following the similar structure, we proposed a novel hierarchical backend interface based on OpenCL flow to support CNN operations on FPGA, i.e. labeled with blue in Figure 1. More details of our path for the kernel development and backend interface can be referred in Figure 2. Here we divide three layers from FPGA kernel development to deep learning operations within Caffe framework. The L1 , i.e. kernel layer , includes all of kernel design files to support necessary operations and all of kernel files are compiled by Intel FPGA SDK for OpenCL and generate a FPGA configuration file. In order to support deep learning training and inference features, we group all of kernels required into three types: layer-related , BLAS-related and solver-related . The layerrelated kernels define the functions to support some layer classes directly, e.g. pooling, activation functions, including both forward and backward operations. BLAS-related group contains some general and common functions from BLAS library, e.g. General Matrix Multiplication (GEMM), General Matrix-Vector Multiplication (GEMV), etc. Solver is employed to update the weights according to various approaches or policies during the training iterations, and thus plays a significant role during network training process. Some common weight update approaches, e.g. Stochastic Gradient Descent (SGD), Adam , AdaDelta , Nesterov , etc., are supported on FPGA. In this study, note that all of kernel files mentioned utilize NDRange design style from the opensourced OpenCL Caffe versions and CLBLAS library for simplicity and saving validation energy.
For the upper layer on top of the kernel, this layer is referred to as the L2 wrapper layer , constituting a number of runtime functions corresponding each kernel from various groups in kernel layer. The runtime functions in this layer are all kernel-related runtimes as described previously, containing the kernel creation, argument setting, kernel launch, release and debug information. The main purpose of wrapper layer is further to encapsulate the OpenCL kernel designs for invocation with ease by high-level functions. With regard to the L3 , class layer , important classes have been defined by native Caffe, but need to extend based on the underlying L1 and L2 layers accordingly. Some straightforward layers, e.g. pooling, can call for the corresponding runtime directly
Figure 2 Hierarchical FeCaffe Structure in Details
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Layered System Diagram: Abstraction Layers in a System
### Overview
The image is a diagram illustrating a layered system architecture, likely related to software or hardware design. It depicts three distinct layers (L1, L2, L3) representing different levels of abstraction. Each layer contains components grouped under categories like "Layer," "Blob," "Solver," "Device," and "BLAS." The diagram shows the relationships and dependencies between these components across the layers.
### Components/Axes
* **Layers:** The diagram is structured into three horizontal layers, labeled L1, L2, and L3.
* L1: Kernel Layer
* L2: Wrapper Layer
* L3: Class Layer
* **Categories:** Within each layer, components are grouped under the following categories:
* Layer
* Blob
* Solver
* Device
* BLAS
* AOCX (appears only in L1)
* **Components:** Each category contains specific components, represented as rectangular boxes. The components vary across the layers.
* **Connections:** Arrows indicate the flow or dependency between components in different layers.
### Detailed Analysis or ### Content Details
**L3: Class Layer (Top Layer)**
* **Layer:**
* Forward_fpga (yellow fill)
* Backward_fpga (yellow fill)
* **Blob:**
* Update (yellow fill)
* **Solver:**
* ApplyUpdate (yellow fill)
* **Device:**
* Device_Init (yellow fill)
* Common Runtime (blue fill)
* **SyncedMem:**
* to_cpu (yellow fill)
* to_fpga (yellow fill)
* Common Runtime (blue fill)
**L2: Wrapper Layer (Middle Layer)**
* **Layer:**
* Pool_Forward_Runtime (blue fill)
* Pool_Backward_Runtime (blue fill)
* Ellipsis (...) indicating more components
* **BLAS:**
* GEMM_Runtime (blue fill)
* GEMV_Runtime (blue fill)
* Ellipsis (...) indicating more components
* **Solver:**
* SGD_Runtime (blue fill)
* ADAM_Runtime (blue fill)
**L1: Kernel Layer (Bottom Layer)**
* **Layer:**
* Pool_Forward (blue fill)
* Pool_Backward (blue fill)
* Ellipsis (...) indicating more components
* **BLAS:**
* GEMM (blue fill)
* GEMV (blue fill)
* **Solver:**
* SGD (blue fill)
* ADAM (blue fill)
* **AOCX**
### Key Observations
* The diagram illustrates a hierarchical structure, with each layer providing a different level of abstraction.
* The "Layer," "Blob," and "Solver" categories are present in all three layers, but the specific components within them change.
* The arrows indicate dependencies between components in different layers, suggesting a flow of data or control.
* The use of "Runtime" in the L2 components suggests that this layer provides runtime implementations of the components in L1.
* The yellow fill in L3 may indicate configuration or initialization components.
### Interpretation
The diagram represents a layered system architecture, where each layer abstracts away the complexities of the underlying layers. The Kernel Layer (L1) likely contains the core functionalities, while the Wrapper Layer (L2) provides runtime implementations or interfaces for these functionalities. The Class Layer (L3) provides a higher-level interface for interacting with the system.
The connections between the layers suggest a flow of data or control from the Class Layer down to the Kernel Layer and potentially back up. The "Runtime" components in the Wrapper Layer likely handle the execution and management of the Kernel Layer components.
The use of different categories (Layer, Blob, Solver, Device, BLAS) suggests that the system is composed of different types of components, each with its own specific role. The presence of "fpga" in some of the L3 components suggests that the system may involve hardware acceleration using FPGAs.
The diagram provides a high-level overview of the system architecture and can be used to understand the relationships and dependencies between different components.
</details>
to realize acceleration on FPGA. Some functions and layers might need a combination of BLAS library and kernels, and thus leverage various runtime configurations accordingly. Following the proposed architecture and partitions, the proposed design methodology has the potential to be applied in other deep learning frameworks, e.g. TensorFlow or Pytorch. L1 and L2 structures can remain the same due to OpenCL common standard while only L3 is required to update according to high-level functions defined by various frameworks.
## 3.3 Memory Synchronization and Fallback
Memory management is a great feature of Caffe framework, and is capable of allocating memory on-demand for efficient usage at both host and device side, and performing synchronization as required. Following this design idea, FeCaffe makes an extension to the scenario of memory management on FPGA. The memory status topography is shown in the top part of Figure 3.
Figure 3 Top: Memory Status Topography; Bottom: Workload Partition Configurations
<details>
<summary>Image 3 Details</summary>

### Visual Description
## State Diagram and Memory Allocation
### Overview
The image presents a state diagram illustrating data handling between GPU, CPU, and FPGA, coupled with a memory allocation scheme for each. The state diagram shows transitions between states like UNINITIALIZED, HEAD_AT_GPU, HEAD_AT_CPU, HEAD_AT_FPGA, and SYNCED, driven by data transfers. The memory allocation section depicts how functions are allocated in GPU, CPU, and FPGA memory, with synchronization mechanisms between them.
### Components/Axes
**State Diagram Components:**
* **States:** UNINITIALIZED, HEAD_AT_GPU, HEAD_AT_CPU, HEAD_AT_FPGA, SYNCED (represented as green or blue ovals).
* **Transitions:** Arrows indicating data flow between states, labeled with data types (e.g., "(mutable) gpu\_data", "cpu\_data").
**Memory Allocation Components:**
* **Memory Types:** GPU Memory Allocation, CPU Memory Allocation, FPGA Memory Allocation.
* **Functions:** Func 1, Func 2, Func 3 (represented as colored blocks).
* **Synchronization:** Arrows indicating "Sync up Memory Copy" between CPU and GPU, and CPU and FPGA.
### Detailed Analysis
**State Diagram:**
* **UNINITIALIZED:** The initial state. Transitions to HEAD_AT_GPU, HEAD_AT_CPU, and HEAD_AT_FPGA via "(mutable) gpu\_data", "(mutable) cpu\_data", and "(mutable) fpga\_data" respectively.
* **HEAD_AT_GPU:** Transitions to HEAD_AT_CPU via "mutable\_cpu\_data", to SYNCED via "cpu\_data", and back to itself.
* **HEAD_AT_CPU:** Transitions to HEAD_AT_GPU via "mutable\_gpu\_data", to SYNCED via "fpga\_data", and to HEAD_AT_FPGA via "mutable\_fpga\_data".
* **HEAD_AT_FPGA:** Transitions to SYNCED via "cpu\_data".
* **SYNCED:** Transitions to HEAD_AT_GPU via "gpu\_data", to HEAD_AT_CPU via "cpu\_data", and to HEAD_AT_FPGA via "fpga\_data".
**Memory Allocation:**
* **GPU Memory Allocation:**
* Func 1: Solid green block.
* Func 2: Green block with diagonal lines.
* Func 3: Green block with diagonal lines.
* **CPU Memory Allocation:**
* Func 1: Blue block with diagonal lines.
* Func 2: Solid blue block.
* Func 3: Blue block with diagonal lines.
* **FPGA Memory Allocation:**
* Func 1: Yellow block with diagonal lines.
* Func 2: Yellow block with diagonal lines.
* Func 3: Solid yellow block.
* **Synchronization:**
* "Sync up Memory Copy" occurs between Func 2 of GPU and CPU.
* "Sync up Memory Copy" occurs between Func 3 of CPU and FPGA.
### Key Observations
* The state diagram illustrates a data management workflow between different processing units.
* The memory allocation section shows a function distribution across GPU, CPU, and FPGA memory spaces.
* Synchronization is explicitly shown between GPU and CPU, and CPU and FPGA, suggesting data consistency requirements.
* HEAD_AT_FPGA is colored blue, while the other states are colored green.
### Interpretation
The diagram describes a heterogeneous computing system where data is processed across GPU, CPU, and FPGA. The state diagram defines the flow of data and the conditions under which data is transferred between these units. The "mutable" keyword suggests that data can be modified during these transitions. The memory allocation scheme indicates that different functions are assigned to different processing units, likely based on their suitability for specific tasks. The synchronization steps are crucial for maintaining data integrity across the different memory spaces, ensuring that the CPU, GPU, and FPGA have consistent views of the data when needed. The color difference of HEAD_AT_FPGA may indicate a different state or configuration.
</details>
Table 1: Performance Benchmark with Native Caffe Time Measurement (Batch Size = 1)
| AlexNet (ms) | AlexNet (ms) | AlexNet (ms) | VGG_16 (ms) | VGG_16 (ms) | VGG_16 (ms) | SqueezeNet_v1.0 (ms) | SqueezeNet_v1.0 (ms) | SqueezeNet_v1.0 (ms) | GoogLeNet_v1 (ms) | GoogLeNet_v1 (ms) | GoogLeNet_v1 (ms) |
|----------------|----------------|----------------|---------------|---------------|---------------|------------------------|------------------------|------------------------|---------------------|---------------------|---------------------|
| Layer | Forward | Backward | Layer | Forward | Backward | Layer | Forward | Backward | Layer | Forward | Backward |
| Data | 0.001 | 0.001 | data | 0.002 | 0.002 | data | 0.001 | 0.001 | data | 0.635 | 0.003 |
| conv1 | 20.269 | 23.144 | conv1 | 498.268 | 1022.364 | conv1 | 46.025 | 43.506 | conv1 | 43.404 | 43.577 |
| conv2 | 26.661 | 54.883 | conv2 | 304.876 | 659.105 | fire2 | 18.646 | 26.165 | conv2 | 48.861 | 82.239 |
| conv3 | 6.359 | 13.395 | conv3 | 247.751 | 535.662 | fire3 | 18.119 | 26.313 | incep_3a | 34.198 | 53.154 |
| conv4 | 8.420 | 18.624 | conv4 | 132.813 | 281.132 | fire4 | 38.098 | 53.110 | incep_3b | 46.743 | 74.526 |
| conv5 | 8.487 | 19.019 | conv5 | 44.783 | 90.830 | fire5 | 11.464 | 16.732 | incep_4a | 22.026 | 34.657 |
| fc6 | 12.651 | 28.165 | fc6 | 31.676 | 74.651 | fire6 | 14.034 | 20.692 | loss1 | 6.931 | 11.618 |
| fc7 | 6.419 | 13.580 | fc7 | 6.291 | 14.291 | fire7 | 13.851 | 20.989 | incep_4b | 22.307 | 36.201 |
| fc8 | 1.976 | 5.603 | fc8 | 1.906 | 5.601 | fire8 | 20.640 | 30.215 | incep_4c | 23.094 | 36.396 |
| Loss | 1.883 | 0.994 | loss | 1.755 | 0.930 | fire9 | 8.842 | 13.703 | incep_4d | 25.049 | 38.848 |
| | | | | | | conv10 | 8.076 | 10.592 | loss2 | 7.312 | 11.626 |
| | | | | | | loss | 1.448 | 0.785 | incep_4e | 26.757 | 41.106 |
| | | | | | | | | | incep_5a | 15.832 | 23.933 |
| | | | | | | | | | incep_5b | 15.236 | 24.669 |
| | | | | | | | | | loss3 | 2.389 | 3.445 |
| Ave. | 93.230 | 177.527 | | 1270.420 | 2684.860 | | 199.525 | 263.047 | | 341.288 | 516.490 |
| Ave. F->B | 270.79 | 270.79 | | 3955.400 | 3955.400 | | 462.600 | 462.600 | | 857.810 | 857.810 |
The syncedmem class originally defines four status: uninitialized, CPU, GPU and synchronized, highlighted by green. Memory status can be switched by invoking corresponding highlevel functions, e.g. to\_cpu/gpu , and perform data copy as required. Here we create a new status for FPGA, highlighted in blue, and it means the data is at the FPGA DDR memory at current moment. This status can be added into the original pattern by using extended runtime functions, e.g. to\_fpga/cpu , resulting in a larger topography. Under such a flexible memory management for heterogeneous platforms, we can achieve a function-level or finegrained synchronization on various platforms with ease and safety within FeCaffe. The size of memory allocation can be calculated firstly, and then the data can be assigned to any device with the memory size required and management flow mentioned above, given any specific functions or operations. Therefore, the proposed architecture is able to support flexible workload partition on various platforms in theory, taking the bottom part of Figure 3 as the example. A number of functions can be performed straightforward on any specific devices respectively. In the meanwhile, it is also able to partition the workload on GPU, CPU and FPGA respectively with memory synchronization. For simplicity, we only test the combination of CPU and FPGA device, e.g. the fallback mechanism on CPU. Note that OpenCL standard utilizes the master-slave model to synchronize between host and devices, but cannot support synchronization between devices directly. Therefore, there has to be twice synchronization if the FPGA data would like to communicate with GPU.
## 4 Result and Analysis
## 4.1 Performance Benchmark
The result is summarized in Table 1. In this study, we take some classical CNN networks, e.g. AlexNet, VGG-16, SqueezeNet and GoogLeNet, as examples, and perform benchmarks including forward, backward, and for-backward on Intel Stratix 10 FPGA development board with Intel OpenCL SDK version of 19.2. The hardware and software of host platform is Intel Core i7-7700K with 4 Cores and 8 Threads and Ubuntu 16.04 respectively. In order for the accuracy during the training, all of data type is FP32 and native floating-point DSP is implemented for the multiplication and addition operations on FPGA. It is important to note that all of processing kernels for network topologies during training and inference are implemented on FPGA in this test. For the Table 1, the convolution also involves a couple of operations associated, e.g. pooling, Rectified Linear Unit (ReLU), Local Response Normalization (LRN), followed by the convolution layers. The fire layer consists of squeeze, expand, ReLU and concat operations accordingly defined in SqueezeNet and inception layer contains convolution layers of 1x1, 3x3 and 5x5, ReLU, pooling and concat operations in GoogLeNet. Due to limited space, we use convolution, fire and inception to represent those layers. With respect to the time measurement, we utilize Caffe native time function to measure the iterations of 100 times with batch size of 1. The forward and backward flow is the normal approach defined by the native Caffe, computing the result from beginning to the last layer, and gradient from the last to the beginning layer respectively. In this work, we use the train\_val as the model for each network during the performance measurement so that all of layers are required to perform backward calculations, demonstrating great FPGA adaptation but longer process path and time, compared to the deploy model. The performance results are listed in terms of forward, backward and for/backward of each network respectively.
## 4.2 Kernel Breakdowns
In order for the further analysis of FPGA and host behaviors during network forward and backward process within FeCaffe, we choose the deepest network, GoogLeNet, and employ profiling tools to provide workload breakdowns in details, as listed in Table 2. Table 2 elaborates all of execution details, e.g. kernels required, and total instance times for each kernel, including memory write
Table 2: Kernel Statistics within F->B for GoogLeNet
| Kernels | Instance Count | Total Time (ms) | Efficiency |
|---------------|------------------|-------------------|--------------|
| Ave_pool_B | 3 | 3.184 | 36% (DDR) |
| Ave_pool_F | 3 | 2.902 | 39% (DDR) |
| Col2im | 19 | 31.197 | 54% (DDR) |
| Concat | 72 | 18.015 | 10% (DDR) |
| Bias | 59 | 20.315 | 12% (DDR) |
| Dropout_B | 3 | 0.113 | 10% (DDR) |
| Dropout_F | 3 | 0.104 | 10% (DDR) |
| Gemm | 186 | 58.407 | 77% (DDR) |
| Gemv | 69 | 7.067 | 81%(DDR) |
| Im2col | 98 | 187.418 | 42% (DDR) |
| LRN_Diff | 2 | 18.39 | 43% (DDR) |
| LRN_Output | 2 | 4.699 | 16% (DDR) |
| LRN_Scale | 2 | 4.645 | 34% (DDR) |
| Max_pool_B | 13 | 66.337 | 62% (DDR) |
| Max_pool_F | 13 | 62.989 | 60% (DDR) |
| ReLU_B | 61 | 20.707 | 17% (DDR) |
| ReLU_F | 61 | 21.313 | 10% (DDR) |
| Softmax | 3 | 0.776 | 0% (DDR) |
| SoftmaxLoss_B | 3 | 0.063 | 0% (DDR) |
| SoftmaxLoss_F | 3 | 0.089 | 0% (DDR) |
| Split | 41 | 22.943 | 11% (DDR) |
| Add | 9 | 5.632 | 17% (DDR) |
| Asum | 3 | 0.124 | 0% (DDR) |
| Axpy | 25 | 12.695 | 20% (DDR) |
| Scale | 3 | 0.07 | 11% (DDR) |
| Write_Buffer | 198 | 28.168 | 12%(PCIe) |
| Read_Buffer | 3 | 0.091 | 0% (PCIe) |
| Total | 960 | 598.453 | 70%(F->B) |
and read. The Efficiency column has three meanings in Table 2: one is for FPGA DDR bandwidth efficiency during kernel execution, which is an average ratio and has been dynamically measured by FPGA profiling tool; one is for PCIe data transfer efficiency during memory movement between host and FPGA, which is an average ratio and has been dynamically measured by Intel Vtune Amplifier; the last is the ratio of total kernel execution to total for/backward process time. Given the process with batch size of 1, there are 25 kernels used and 960 times of kernel invocations in total, including 198 times for writing data buffer and 3 times for reading data buffer from FPGA to host. The gemm kernel is the most frequent operation with 186 times of invocation. Its total kernel execution time is 58.407ms with 77% FPGA DDR efficiency, and thus the average execution time is about 0.31ms for each invocation. Similarly, the gemv kernel is used by 69 times with 7.067ms for total execution time and 81% DDR efficiency, resulting in average 0.1ms for each time. Kernels of gemm and gemv have extra optimization with local memory buffer and Single Instruction and Multiple Data (SIMD) directives for vectorization. Using local memory buffer can dramatically decrease the times of DDR memory access required. Here we use maximum DDR bandwidth of S10 board, i.e. 14928MB/s with FPGA logic running at 300MHz, as the reference, to compare the DDR efficiency for each kernel. With respect to the data transfer, i.e. write buffer from host to FPGA and read buffer from FPGA to host, many writing buffer events are trigged by loading convolution and bias weights to FPGA to perform convolution layer by layer. The average PCIe data transfer speed is measured at 1.906GB/s, resulting in efficiency of 10% by taking PCIe Gen3 x16 lanes as the reference, i.e. 15.75GB/s. Finally, all of kernel and data transfer time are summed up, achieving
598.453ms and accounting for 70% of total average forward and backward time measured in Table 1, which implies that there are some software runtime overhead for current CPU and FPGA collaborations, leaving the room to further optimize in the future.
Table 3 shows the detail of configuration file in terms of hardware utilization and FPGA frequency after placement for the measurement mentioned in Table 1 and Table 2. Current configuration only occupies 47% and 31% for total BRAM and DSP resources, given more than half of these resource for further optimization. The gemm and gemv kernels are highlighted cause they both are significant kernels and optimized with higher utilization in terms of BRAMs and DSPs so that convolution and full connection layers can be performed with high efficiency.
Table 3: Hardware Utilization on S10
| | ALMs | Regs | M20K | DSPs | Fmax |
|-------|-----------|--------|------------|------------|---------|
| Gemm | 107K(12%) | 326K | 2338 (20%) | 1037 (18%) | 253 MHz |
| Gemv | 49K(5%) | 116K | 756 (6%) | 130 (2%) | 253 MHz |
| Total | 616K(66%) | 1415K | 5419 (47%) | 1796 (31%) | 253 MHz |
## 4.3 Training Process on FPGA
Forward and backward are necessary parts, in addition to that, the training process also needs weight update mechanism after forward and backward processing. In Caffe framework, the solver class is used to optimize and update the weights so that the weights can be trained gradually to reach the loss target as we defined during the training iterations. There are three main computationrelated phases during the weight update process: normalization , regularization and compute\_update. In this study, FeCaffe has considered operations mentioned above with different approaches. Normalization and regularization can be supported via combinations of BLAS-based kernels while the computer\_update is enabled by kernel designs directly, e.g. SGD, Adam and other common policies. Therefore, it is clearly seen that the most of computation burden during weight update has been deployed on FPGA, and thus the proposed FeCaffe is able to provide sufficient features to support CNN training for target networks. We use OpenCL native profiling tool and Intel Vtune Amplifier to capture the entire training process, by using GoogLeNet with batch size and iterations of 16 and 10, and results are shown in Figure 5 and Figure 4 respectively.
Figure 5 illustrates all of kernels required and their execution time for each kernel dynamically during the entire training process by performance registers and counters on FPGA. Figure 4 demonstrates the system profiling by VTune with the view zoomed in during training process. The CPU running time is highlighted by green and FPGA behavior is colored with pink. More details of kernel tasks can be checked via the task line, with different colors. It is clearly seen that the CPU and FPGA interactivity during the CNN training process, and CPU usage can be reduced when FPGA is executing kernel task. Host memory bandwidth can also be monitored for each kernel invocation. For the training with FeCaffe, users can reuse the traditional Caffe format, e.g. solver settings, prototxt, and commands to experience the training on FeCaffe. Moreover, snapshot function can also be supported, and thus the proposed FeCaffe can be reviewed as a comprehensive
Figure 5 Kernel Details for GoogLeNet Training Process
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Timeline Chart: Kernel Execution Timeline
### Overview
The image is a timeline chart visualizing the execution of different kernels on a device. The chart displays the activity of various kernels over time, showing when each kernel is actively computing. The x-axis represents time, and the y-axis lists the kernels. Each row corresponds to a specific kernel, and the blue bars indicate periods of activity.
### Components/Axes
* **Title:** s10gx
* **Subtitle:** Global Memory BW (DDR) 14928 MB/s
* **X-Axis:** Time, labeled with markers at 0.00ms, 22.29s, 44.57s, 1.11min, 1.49min, 1.86min, 2.23min, and 2.60min.
* **Y-Axis:** Kernel names and Device ID. The Device ID is consistently "Device 0". The kernels listed are:
* Xadd
* split
* concat
* Xasum
* Xscal
* Xaxpy
* adam\_update
* gemv\_opt
* conv\_bias
* col2im\_opt
* im2col\_opt
* gemm\_opt
* ReLUBackward
* ReLUForward
* SoftmaxLossBackward
* SoftmaxLossForward
* softmax\_optimization
* AvePoolBackward
* MaxPoolBackward
* AvePoolForward
* MaxPoolForward
* LRNComputeDiff
* LRNFillScale
* LRNComputeOutput
* DropoutBackward
* DropoutForward
* **Labels:**
* Source Code
* Kernel Execution
* Device id
* Kernel
* Compute Unit
### Detailed Analysis or Content Details
The chart displays the execution timeline for each kernel. The blue bars indicate when the kernel is actively computing. The length and density of the bars provide information about the duration and frequency of the kernel's execution.
* **Xadd:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **split:** Similar to Xadd, with activity concentrated at the beginning.
* **concat:** Similar to Xadd and split.
* **Xasum:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **Xscal:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **Xaxpy:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **adam\_update:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **gemv\_opt:** Shows a high density of activity throughout the entire timeline.
* **conv\_bias:** Shows a high density of activity throughout the entire timeline.
* **col2im\_opt:** Shows a high density of activity throughout the entire timeline.
* **im2col\_opt:** Shows a high density of activity throughout the entire timeline.
* **gemm\_opt:** Shows a high density of activity throughout the entire timeline.
* **ReLUBackward:** Shows a high density of activity throughout the entire timeline.
* **ReLUForward:** Shows a high density of activity throughout the entire timeline.
* **SoftmaxLossBackward:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **SoftmaxLossForward:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **softmax\_optimization:** Shows a high density of activity throughout the entire timeline.
* **AvePoolBackward:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **MaxPoolBackward:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **AvePoolForward:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **MaxPoolForward:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **LRNComputeDiff:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **LRNFillScale:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **LRNComputeOutput:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **DropoutBackward:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
* **DropoutForward:** Shows activity primarily at the beginning (0.00ms) and then sparsely throughout the timeline.
### Key Observations
* Some kernels (gemv\_opt, conv\_bias, col2im\_opt, im2col\_opt, gemm\_opt, ReLUBackward, ReLUForward, softmax\_optimization) exhibit significantly higher activity throughout the entire timeline compared to others.
* Kernels like Xadd, split, concat, Xasum, Xscal, Xaxpy, adam\_update, SoftmaxLossBackward, SoftmaxLossForward, AvePoolBackward, MaxPoolBackward, AvePoolForward, MaxPoolForward, LRNComputeDiff, LRNFillScale, LRNComputeOutput, DropoutBackward, and DropoutForward show sparse activity, mostly concentrated at the beginning of the timeline.
### Interpretation
The timeline chart provides insights into the computational workload distribution across different kernels. The kernels with high activity likely represent the most computationally intensive operations in the model. The sparse activity of other kernels suggests they are either less frequently used or have shorter execution times. This information can be used to optimize the model by focusing on improving the performance of the most active kernels. The consistent "Device 0" ID suggests all kernels are running on the same device. The Global Memory BW (DDR) value of 14928 MB/s provides context about the memory bandwidth available during the execution.
</details>
Figure 4 CPU and FPGA Behaviors duing GoogLeNet Training Process (Best Viewed Zoomed in)
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Timeline Chart: Performance Analysis
### Overview
The image is a timeline chart displaying performance analysis data for a system, likely related to FPGA utilization and DRAM bandwidth. The chart visualizes the execution of different tasks and modules over time, with color-coded bars representing various activities and states. The timeline spans from approximately 5329ms to 5340ms. A legend on the right side explains the color codes for different metrics like Computing Queue, Thread/Module states, CPU Time, FPGA Utilization, and DRAM Bandwidth.
### Components/Axes
* **Horizontal Axis (Time):** The horizontal axis represents time, spanning from 5329ms to 5340ms, with tick marks at each millisecond.
* **Vertical Axis (Modules/Metrics):** The vertical axis represents different modules and metrics being tracked:
* s10gx: Stratix 10 Reference...
* caffe.bin (TID: 30090)
* FPGA Utilization
* DRAM Bandwidth: packa...
* **Legend (Right Side):** The legend on the right side explains the color codes used in the timeline. It is organized into categories:
* **Computing Queue:**
* Computing Task (color not explicitly visible in the legend, but present in the timeline)
* Transfer (light blue)
* Computing Queue (white, unchecked)
* **Thread/Module:**
* Running (green)
* Context Switches (orange)
* Preemption (white, unchecked)
* Synchronization (light green)
* **CPU Time:**
* User Tasks (white, unchecked)
* Slow (white, unchecked)
* Computing Task (white, unchecked)
* **FPGA Utilization:**
* Computing Task Count
* Compute (red)
* Transfer (purple)
* **DRAM Bandwidth:**
* Average Bandwidth (light blue)
* Read (dark blue)
* Write (brown)
* Total, GB/sec (teal)
### Detailed Analysis or ### Content Details
* **s10gx: Stratix 10 Reference...:**
* Contains blocks of "m2col_opt" (purple), "gem" (light blue), "m2col_opt" (salmon), "Rlu" (red), "S" (purple), "Rlu" (light blue), "m2col_opt" (purple), "ReLuBackwar" (red), "UG" (light blue), "S" (purple).
* **caffe.bin (TID: 30090):**
* Shows a mostly continuous "Write" activity (brown) with some gaps.
* **FPGA Utilization:**
* Shows intermittent "Compute" (red) and "Transfer" (purple) activities.
* **DRAM Bandwidth: packa...:**
* Shows a mostly continuous "Total, GB/sec" activity (teal) with a value of approximately 5.665 at the left edge. There are also "Read" activities (dark blue) interspersed.
### Key Observations
* The "caffe.bin" module shows consistent "Write" activity, suggesting continuous data output.
* FPGA Utilization is sporadic, indicating that the FPGA is not constantly active.
* DRAM Bandwidth appears relatively stable, with the "Total, GB/sec" metric dominating.
* The "s10gx: Stratix 10 Reference..." module shows a sequence of different tasks being executed.
### Interpretation
The timeline chart provides a snapshot of the system's performance during a specific time interval. The data suggests that the "caffe.bin" module is continuously writing data, while the FPGA is utilized intermittently. The DRAM bandwidth remains relatively stable. The sequence of tasks in "s10gx: Stratix 10 Reference..." indicates the execution flow of a particular algorithm or process. The chart can be used to identify performance bottlenecks, optimize task scheduling, and analyze the impact of different modules on overall system performance. The gaps in the "caffe.bin" write activity and the sporadic FPGA utilization could be areas for further investigation to improve efficiency.
</details>
FPGA-based solution to conveniently provide common deep learning development in particular to training.
## 4.4 Comparison to the State-of-art
This study also presents some comparison with prior work on CNN training with FPGAs in terms of functionality and performance, as listed in Table 4. It is clear that we can provide higher flexibility in terms of CNN network topologies, solver types, training hyperparameter settings, expansibility and ease of use. Compared to FCNN solution [8] , we can achieve average execution time of 1102.162 ms and 1710.090 ms for forward and backward respectively, given LeNet with batch size of 384 and 150 minibaches after 200 iterations, resulting in 6.4x and 8.4x average execution time improvement under same testing conditions. Please note that some performance improvement comes from FPGA device difference cause S10 device has native floating-point DSP blocks and more advanced technology node, compared to the device in [8] . Due to DDR memory size limitation of S10
development board, training of VGG-16 and VGG-19 cannot be performed, and thus we provide the training time consumed for one epoch of ImageNet 2012 with 1.2 million training and 50 thousand validation images for AlexNet, SqueezeNet and GoogLeNet, respectively. Compared to FPDeep [9] , current training performance is much less competitive. Fitting all of weights, feature data and gradient within on-chip memory over FPGA cluster can significantly change the FPGA pipeline design structure, and maximize FPGA on-chip memory bandwidth and DSP resources, at the cost of 43,200 DSPs and several hundreds of Mbits of BRAMs in total. In addition, fixed-point of 16 is another key factor to provide such an incredible training result. Small batch size is another factor to impact our training speed as total training iterations and data communication times between FPGA and host can be reduced during training and inference phases with the increment of batch size, leading to higher FPGA computation efficiency.
Table 4: Comparison with FPGA Prior Work
| | Our Work | FCNN | [8] | FPDeep [9] |
|----------------------------------|--------------------------------------------------------------------------------|---------------------------------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------------------------------------|
| Framework | Caffe | Customized | Customized | Customized |
| Develop Tool | OpenCL with AOC | MaxCompiler Tool | MaxCompiler Tool | RTL Generator |
| CNN Feature | Training and Inference | Training and Inference | Training and Inference | Training and Inference |
| Network Topologies Supported | AlexNet, VGG, SqueezeNet, GoogLeNet, and the Networks with Same Primitives | LeNet | LeNet | AlexNet, VGG-16 and VGG-19 |
| Solver Supported | SGD, Adam, RMS_Prop, Nesterov, Ada_Grad and Ada_Delta | SGD Only | SGD Only | SGD Only |
| Training Hyperparameter Settings | Same with GPUs and CPU, e.g. base_lr, lr_policy, gamma, momentum, weight_decay | Unknown | Unknown | Unknown |
| FPGA Optimization Mechanism | Gemm: NDRange and 2D Local Memory Gemv: NDRange and 1D Local Memory | Systolic-like: Customized Processing Pipeline for | Systolic-like: Customized Processing Pipeline for | All Layers Processing Pipeline Distributed over FPGA Cluster Store All Weights, Feature and Gradients |
| FPGA Optimization Mechanism | Gemm: NDRange and 2D Local Memory Gemv: NDRange and 1D Local Memory | Convolution and Pooling | Convolution and Pooling | with on-chip BRAMs Forward and Backward Processing Pipeline in Parallel |
| Expansibility | Small Efforts to Enable New Functions | More Efforts (Pipeline Need to | More Efforts (Pipeline Need to | More Efforts (Pipeline Need to Update |
| Expansibility | No Inter-FPGA Dependency | Update for New Functions) | Update for New Functions) | for New Functions) |
| Ease of Use | Same with Conventional Caffe, e.g. Prototxt, Commands and Snapshot | Customized Network Config. Parameters and HWConstraints | Customized Network Config. Parameters and HWConstraints | Customized Network Config. Parameters and HWConstraints |
| Device and Board | Stratix 10 Development Kit | Stratix V GSD8 | Stratix V GSD8 | VC709 Board (V7690T) |
| Number | 1 | 2 | 2 | 15 |
| DDR Storage and Bandwidth | 2 GB and 14.578GB/s | 6 GB and 2 * 9.6GB/s | 6 GB and 2 * 9.6GB/s | On-chip Memory Bandwidth |
| Fmax | 253 MHz | 150 MHz | 150 MHz | Unknown |
| Data Type | FP32 | FP32 | FP32 | Fixed-point 16 |
| Total DSP Utilization | 1796 | Unknown | Unknown | 15 * 2880 = 43,200 |
| LeNet (L1-L6) | Forward (ms) Backward | (ms) Forward (ms) | Backward (ms) | |
| L1 (Conv) | 524.293 514.197 | 590 | 1210 | |
| L2 (Pool) | 22.330 23.895 | 530 | 570 | |
| L3 (Conv) | 547.651 1156.870 | 4670 | 10320 | |
| L4 (Pool) | 6.539 | 7.010 170 | 180 | N/A |
| L5 (FC) | 1.345 6.003 | 920 | 1820 | |
| L6 (FC) | 0.004 2.115 | 180 | 200 | |
| Total | 1102.162 (6.4x) 1710.090 | (8.4x) 7060 | 14300 | |
| AlexNet per Epoch | 86.41 Hours (BS:32 and Default Solver) | | N/A | 0.17 Hour |
| SqueezeNet v1.0 per | | | | |
| Epoch | 159.62 Hours (BS:16 and Default Solver) | 159.62 Hours (BS:16 and Default Solver) | N/A N/A | N/A N/A |
| GoogLeNet per Epoch | 291.08 Hours (BS:16, Default Solver with Adam) | 291.08 Hours (BS:16, Default Solver with Adam) | N/A N/A | N/A N/A |
## 5 Analysis and Optimization
Based on the result analysis and comparison, the proposed FeCaffe utilizes the fine-grained and kernel-wise FPGA implementation to achieve the same granularity with GPU acceleration, and is capable of providing sufficient and flexible offload functions for deep learning development. It is a new path for deep learning and thus the overall performance is less competitive compared to the mature and well-developed GPU solutions. Therefore, a number of optimization directions mainly focusing on the performance improvement, from FPGA kernel, software runtime, CNN architectures etc., are introduced as follows:
## 5.1 FPGA-level
In this work, we currently choose the OpenCL flow with NDRange format to develop necessary kernels for CNN operations. Due to good adaptation of compiler tool, users are able to deploy most of NDRange kernel files on FPGA conveniently with minor or even no modifications. However, this implementation approach can cause performance issue and resource usage overhead especially for large scale and complicated designs. Therefore, it is recommended by compiler vendors to develop single work-item designs to achieve the best performance with resource optimization. Compared to NDRange style, the single work-item style is very similar with the traditional FPGA design flow, and provides more choices and flexibility to design and optimize kernels. Users can develop more flexible and sophisticated pipeline structures and utilize more optimization directives to fully unleash FPGA massive on-chip memory storage and bandwidth for better throughput performance
Another optimization approach is to improve FPGA logic clock frequency. Stratix 10 FPGA chip has the Hyperflex technology, which inserts some registers on routing resources during placement and routing phase, and thus is able to dramatically increase FPGA design timing frequency [26] . Current implementation approach cannot enable Hyperflex optimization cause this feature only allows single work-item design with stringent conditions for 19.2 version. Therefore, rewriting kernels can increase clock frequency significantly as well. Enlarging DDR storage size and bandwidth for the FPGA board can also improve performance. Currently DDR bandwidth is still a limitation, compared to GPU and CPUs, and thus multiple banks of DDR can mitigate this situation. In addition to these factors, lower bitwidth for training and inference is another important factor to consider for the performance optimization, with the development of retraining and quantization approaches. Int8 and even Int4 can significantly improve DSP efficiency, intermediate data storage and DDR bandwidth and lead to several times of overall CNN processing capability, compared to single floating-point. This enables FPGA solutions to become more competitive compared to GPUs and dedicated ASICs in terms of Int8 and Int4 computation capability.
## 5.2 System Pipeline-level
Currently, FeCaffe chooses synchronous interface to manage communication for higher-level function invocation, and that means the CPU launches FPGA kernels in sequence, and does not start to process the next kernel until current kernel has been completely executed. Therefore, data transfer between CPU and FPGA cannot be overlapped, but is viewed as kernel overhead for the performance measurement. In the meanwhile, FPGA cannot continue to operate all the time cause it has to wait during data transfer, resulting in lower acceleration efficiency. We can notice the phenomenon that kernels are executed discontinuously by Intel Vtune Amplifier in Figure 4. An optimization approach to address this issue is to utilize asynchronous mechanism for CPU and FPGA. By using asynchronous interface, host can put several kernel launches into the invocation queue and thus data transfer through PCIe for next kernels can be prefetched in advanced while FPGA is executing on the current kernel, realizing FPGA continuous operations and higher efficiency. Therefore, overhead of data transfer can be overlapped for the frame throughput calculation and FPGA continuous operation maximizes the throughput performance in terms of system pipeline level.
Fallback on CPU is also viewed as an alternative to improve performance with reasonable workload partition from system level. For example, based on the statistics of kernel execution time for GoogLeNet, total kernel time of im2col is the longest, achieving 187.418ms, and the sum of im2col and col2im kernels can reach 218.615ms, accounting for 37% of total kernel time. By nature, the purpose of im2col is to reshape the data and change data address, without data content processing, but requires DDR bandwidth significantly. Therefore, it is wiser to deploy such memorybounded and small functions on CPU in system workload partition, leaving more burden of data computation at FPGA side.
## 5.3 CNN Network-level
With respect to the current FeCaffe architecture, kernel designs are very fine-grained and fragmented and thus there are enormously frequent communications between CPU and FPGA, e.g. 201 times of memory writing and reading, and 759 times of kernel control to achieve one time of forward to backward for GoogLeNet topology. Therefore, how to reduce interaction times between host and FPGA can significantly improve performance for the given network topology, especially for inference phase. Building large pipeline by merging multiple kernels is a natural and effective approach to mitigate this issue for performance improvement. For this topic, many research studies have been done, as discussed in previous section. For example, graph-based architecture, i.e. pipeline structure to support all of CNN layers on FPGA at one time without CPU runtime interactions, and subgraphbased architecture, i.e. smaller pipeline scale to support layers of convolution, bias, pooling and activation, demonstrate significant advantages in terms of throughput performance and efficiency. Based on the larger pipeline structure, loading weights of entire network can be viewed as the offline initialization stage, and thus interactions of loading weights during runtime can be eliminated. In addition, large pipeline can dramatically shorten the processing latency by leveraging FPGA on-chip memory resource to connect each kernel, compared to normal multiple kernel designs which utilize global DDR memory for data connection between kernels. In order to support and leverage such advanced FPGA hardware architectures within FeCaffe, we can continue to follow the proposed hierarchical approach previously, as illustrated in Figure 6. Hardware designs can be imported and grouped into kernel layer, and runtime functions are required accordingly to manage the kernel operations. Finally, those subgraph-based or graph-based functions can be extended as the customized layers within the operation layer class in FeCaffe to ensure that the underlying optimized architectures can be invoked correctly.
Figure 6 Integration Approach for Various Architectures
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Diagram: Layered Architecture
### Overview
The image depicts a layered architecture diagram, showing the flow of data and processes through three distinct layers: Kernel Layer (L1), Wrapper Layer (L2), and Class Layer (L3). Each layer contains "Subgraph-based" and "Graph-based" components. The diagram illustrates how customized layers are built upon lower-level kernels and runtime environments.
### Components/Axes
* **Layers:**
* L1: Kernel Layer
* L2: Wrapper Layer
* L3: Class Layer
* **Components within each layer:**
* Subgraph-based
* Graph-based
* **Additional Elements:**
* "Forward_fpga" (appears in Class Layer)
* "AOCX" (appears in Kernel Layer)
* **Arrows:** Blue arrows indicate the flow of data/processes from the Class Layer to the Wrapper Layer and then to the Kernel Layer.
### Detailed Analysis
* **L1: Kernel Layer:**
* Contains "Subgraph-based Kernels" and "Graph-based Kernels."
* Labeled as "AOCX" at the bottom.
* **L2: Wrapper Layer:**
* Contains "Subgraph-based Runtime" and "Graph-based Runtime."
* **L3: Class Layer:**
* Contains "Customized Layer Subgraph-based" and "Customized Layer Graph-based."
* Each of these components has a yellow box labeled "Forward_fpga."
* **Flow:**
* The flow is top-down, from L3 to L1, indicated by blue arrows.
### Key Observations
* The architecture is structured in a layered fashion, with each layer building upon the previous one.
* Both "Subgraph-based" and "Graph-based" components are present in each layer, suggesting parallel processing or different approaches to the same task.
* The "Forward_fpga" component in the Class Layer likely represents a forward pass operation on an FPGA.
* "AOCX" at the Kernel Layer might indicate a specific hardware or software platform.
### Interpretation
The diagram illustrates a hierarchical system where high-level customized layers (L3) rely on lower-level runtime environments (L2) and fundamental kernels (L1). The presence of both "Subgraph-based" and "Graph-based" components suggests a hybrid approach to processing, possibly leveraging the strengths of both methods. The "Forward_fpga" component indicates hardware acceleration using FPGAs. The "AOCX" label at the Kernel Layer likely refers to a specific hardware or software platform upon which the entire architecture is built. The diagram suggests a modular and flexible design, allowing for customization and optimization at different levels of the system.
</details>
## 5.4 User Case Application-level
For CNN inference, the throughput performance is a general and straightforward criterion for the quality of various platforms and solutions. Therefore, the ultimate goal of FPGA-enabled CNN inference is to achieve the maximum inference per second with the fixed and optimized pipeline structures. On the contrary, the target of training is to develop new network structures or improve accuracy, forcing algorithm developers to explore and experience novel operations and definitions. Therefore, the flexibility, creativeness and extension capability with ease should be primary factors to consider for the training process. For some scenarios and applications on deep learning-based edge computing and nodes, training is also required from time to time. It is obvious that there is a dilemma by taking into account inference and training designs at the same time on one device. However, for the FPGA, with advanced and systematical reconfiguration mechanism, it has the potential to maintain their own design characteristics of inference and training at the same time .
Partial reconfiguration is an interesting reconfiguration technology for FPGA, and it has high degree of flexibility to allow some functional modification to rapidly update by downloading partial bitstreams while other parts can continue to operate without any interruption [27] . The idea of partial reconfiguration was proposed more than a decade ago, and it has been improving continuously and gradually by FPGA vendors, and becomes mature to work with OpenCL flow in design methodology. Therefore, it can also be considered as one of further optimization direction to the proposed FeCaffe architecture. Based on the discussion above, different targets drive various hardware architectures and thus at least two kinds of architectures, i.e. inference-driven architecture and flexibility-driven architecture for training, or even some variants with tradeoff between performance and flexibility, need to be supported by the system design. In addition, multiple hardware architectures are required to swap on the fly at millisecond level according to use case requirement. Taking into account of these considerations, partial reconfiguration is the most promising choice. The coarse-grained partial reconfiguration-based system for various use case applications is described in Figure 7. Some design modules on memory mechanism can be shared by various reconfigurable designs are placed in static region. Various architectures from inference-driven to flexibility-driven can be complied into a variety of partial bitstreams, which can be managed by host accordingly to various requirements so that FPGA-based deep learning solution can provide efficiency for inference and flexibility for training at the same time.
Figure 7 PR-based FPGA System for User Case Applications
<details>
<summary>Image 7 Details</summary>

### Visual Description
## System Diagram: FPGA Architecture for Deep Learning
### Overview
The image presents a system diagram illustrating an FPGA architecture designed for deep learning applications. It highlights the interaction between different components, including reconfiguration management, a host processor, memory control, and a deep learning architecture library. The diagram distinguishes between static and dynamic/PR (Partial Reconfiguration) regions within the FPGA.
### Components/Axes
* **Top-Left:** "Reconfiguration Management" (purple block) connected to "FeCaffe" (gray block) via a data flow arrow.
* **Left:** "Deep Learning Arch. Library" consisting of stacked blocks:
* "DLA Arch. Bit." (green)
* "Graph-based Arch. Bit." (blue)
* "Subgraph-based Arch. Bit." (light blue)
* "Training Arch. Bit." (yellow)
* **Center:** "Host" (orange block) connected to the "Deep Learning Arch. Library" via data flow arrows.
* **Top-Right:** "DDR" (orange block).
* **Center-Right:** "Static Region" (enclosed in a dashed blue box) containing:
* "PCIe" (gray block)
* "Memory Control" (gray block)
* "Global Memory Interconnection" (gray block)
* **Bottom-Center:** "Dynamic/PR Region" (enclosed in a dashed blue box) containing stacked blocks:
* "DLA Arch." (green)
* "Graph-based Arch." (blue)
* "Subgraph-based Arch." (light blue)
* "Training Arch." (yellow)
* **Bottom:** "FPGA" label.
* **Data Flow Arrows:**
* "Data" (solid line with arrowhead)
* "Conf. Files" (dashed line with arrowhead)
* "Control" (dotted line with arrowhead)
### Detailed Analysis
* **Reconfiguration Management and FeCaffe:** The "Reconfiguration Management" block feeds into the "FeCaffe" block, suggesting a process of managing and configuring the FPGA.
* **Deep Learning Arch. Library:** The library contains different architectural bitstreams for deep learning tasks, including DLA (Deep Learning Accelerator), graph-based, subgraph-based, and training architectures.
* **Host Interaction:** The "Host" processor interacts with both the "Deep Learning Arch. Library" and the "Static Region" of the FPGA.
* **Static Region:** The "Static Region" includes PCIe, Memory Control, and Global Memory Interconnection, indicating the fixed infrastructure for communication and memory access. "CvP" is a small label and arrow connecting PCIe to Global Memory Interconnection.
* **Dynamic/PR Region:** The "Dynamic/PR Region" contains the actual deep learning architectures that can be dynamically reconfigured.
* **Data Flow:** The arrows indicate the flow of data, configuration files, and control signals between the different components.
### Key Observations
* The architecture separates static and dynamic regions within the FPGA, allowing for partial reconfiguration of the deep learning accelerators.
* The "Deep Learning Arch. Library" provides a collection of pre-designed architectures for different deep learning tasks.
* The "Host" processor acts as a central control point for managing the FPGA and interacting with the deep learning accelerators.
### Interpretation
The diagram illustrates a flexible FPGA-based architecture for deep learning. The separation of static and dynamic regions enables partial reconfiguration, allowing for efficient utilization of FPGA resources and adaptation to different deep learning workloads. The "Deep Learning Arch. Library" provides a modular approach to deep learning acceleration, enabling developers to select and deploy the most appropriate architecture for their specific needs. The "Host" processor plays a crucial role in managing the FPGA and coordinating the execution of deep learning tasks. The presence of "FeCaffe" suggests integration with the Caffe deep learning framework.
</details>
On the other hand, partial reconfiguration has to bring in more FPGA design complexity and challenges significantly to achieve the benefits discussed above. Firstly, users have to develop and maintain several FPGA architectures for different purposes. Then, they need to plan out some dynamic regions by area constraints on device, and perform FPGA development flow several times to generate various partial bitstreams corresponding to various hardware architectures. Furthermore, how to find out the suitable granularity to fit by taking into account all of kernels from various architectures and build the partial reconfiguration-based pipeline accordingly is the most difficult and challenging for the design. However, it is a positive trend and direction for FPGA designs in CNN applications. Locking to the dedicated and efficient architecture limits FPGA capability, and cannot fully unleash its reconfigurability. This has to force FPGA to become less competitive than deep learning ASIC chips in terms of efficiency, peak performance, power consumption and cost, and thus FPGAbased CNN products have to be displaced automatically and naturally by such emerging and advanced deep learning ASIC chips in the market. However, with the extension of partial reconfiguration-based designs and software management mechanism accordingly, FPGA adaptation and flexibility has been dramatically enhanced. Therefore, FPGA-based CNN products have the potential to support more advanced reconfiguration mechanism to adapt a variety of use cases and scenarios, which is totally and naturally differentiated from current dedicated ASIC chips and conventional GPUs.
## 5.5 Heterogeneous Platform-level
Based on the memory synchronization mechanism between platforms of FeCaffe, creating a hybrid heterogeneous cluster with CPUs, FPGAs and GPUs might be an interesting direction to explore in the future. As discussed in the previous section, the rule of thumb for the cluster is to wisely partition the workload according to various workload characteristics and device features by nature. For instance, latency-sensitive modules are supposed to be deployed on FPGA with single work-item format to build flexible and deep pipeline so that dataflow path can be fully accelerated with FPGA hardware components, achieving the minimum latency to meet the requirements. Memory-hungry or DSP-hungry operations are suitable to offload on GPUs as modern GPUs usually have the highest memory bandwidth in terms of DDR or High Bandwidth Memory (HBM), and have several thousands of processing cores as well. CPU, as the host, is mainly required to manage the synchronization and balance the workload between platforms, and even to perform some small and fragmented functions from time to time, assuming the total execution time of those functions is close to the overhead of kernel launch for GPUs or FPGAs. Therefore, the proposed FeCaffe architecture has the potential to create hybrid platform cluster with more flexibility and higher performance, and to solve some current issues for GPU-only or CPU-only platforms.
## 6 Conclusion
In this paper, we propose FeCaffe framework, an extension of conventional Caffe, with fine-grained and fragmented kernel design on FPGA and OpenCL development flow for deep learning training, and introduce the hierarchical hardware and software design methodology accordingly in details. A number of benchmark results and performance analysis in detail have been provided accordingly. Compared to some prior studies, the proposed architecture demonstrates obvious advantages in supporting CNN network, solver types, training hyperparameter settings, expansibility, flexibility and ease of use for deep learning training development. In addition, current result can achieve 6.4x and 8.4x performance improvement for forward and backward respectively for LeNet. Based on the current performance analysis, we further propose a number of improvement and optimization directions in the future from FPGA-level, system pipeline-level, CNN network-level, use case application-level and heterogeneous platform-level respectively. Taking into account all of these optimizations, the proposed architecture has great potential to provide better system performance, efficiency and higher degree of flexibility for deep learning and CNN development. Therefore, FeCaffe leads to a new horizon of FPGA-based heterogeneous platform for deep learning development by building a bridge between FPGA low-level kernel design and high-level framework directly, and will create more feasibility and choices with gradual optimization and improvement in the future.
## REFERENCES
- [1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell, 'Caffe: Convolutional Architecture for Fast Feature Embedding'. In Proceedings of 2014 22 nd ACM International Conference on Multimedia , pages 675-678, Orlando, USA, November 03-07, 2014
- [2] F. Tschopp, Efficient Convolutional Neural Networks for Pixelwise Classification on Heterogeneous Hardware Systems , Technical Report , [online], Edgenossische Technische Hochschule Zurich, September 2015. Available: arxiv.org/abs/1509.03371
- [3] J. Gu, Y. Liu, Y. Gao and M. Zhu, 'OpenCL Caffe: Accelerating and Enabling a Cross Platform Machine Learning Framework'. In Proceedings of 14 th Internation Workshop on OpenCL , Vienna, Austria, April 19-21, 2016
- [4] A. Krizhevsky, I. Sutskever and G. Hinton, 'ImageNet Classification with Deep Convolutional Neural Networks'. In Proceedings of 2012 Conference on Neural Information Processing System , pages 1097-1105, Lake Tahoe, USA, December 3-8, 2012
- [5] K. Simonyan and A. Zisserman, 'Very Deep Convolutional Networks for Large-scale Image Recognition'. In Proceedings of 2015 International Conference on Learning Representations (ICLR) , San Diego, USA, May 7-9, 2015
- [6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, 'Going Deeper with Convolutions'. In Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , Boston, USA, June 7-12, 2015
- [7] F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. Dally and K. Keutzer, SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5MB Model Size , [online]. Available: arxiv.org/abs/1602.07360
- [8] W. Zhao, H. Fu, T. Yu, S. Wang, B. Feng, Y. Ma and G. Yang, 'F-CNN: An FPGA-based Framework for Training Convolutional Neural Networks'. In Proceedings of 2016 IEEE 27 th International Conference on Applicationspecific System, Architectures and Processors (ASAP) , London, UK, July 6-8 2016
- [9] T. Geng, T. Wang, A. Li, X. Jin and M. Herbordt, ' A Scalable Framework for Acceleration of CNN Training on Deeply-pipeline FPGA Clusters with Weight and Workload Balancing '. [oneline]. Available: arxiv.org/abs/1901.01007
- [10] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor and S. Areibi, 'Caffeinated FPGAs: FPGA Framework for Convolutional Neural Networks'. In Proceedings of 2016 International Conference on Field-Programmable Technology (FPT) , Xi'an, China, December 7-9 2016
- [11] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang and H. Yang, 'Going Deeper with Embedded FPGA Platform for Convolutional Neural Network'. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays 16 , pages 26-35, New York, NY, USA 2016.ACM
- [12] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J. Seo, and Y. Cao, 'Throughput-optimized OpenCL-based FPGA Accelerator for Largescale Convolutional Neural Networks'. In Proceedings of 2016 ACM/SIGDA International Symposium on Field-programmable Gate Arrays , FPGA'16, pages 16-25, New York, NY, USA, 2016
- [13] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, 'Optimizing FPGAbased Accelerator Design for Deep Convolution Neural Networks'. In Proceedings of 2015 ACM/SIGDA International Symposium on Fieldprogrammable Gate Arrays , FPGA15, pages 161-170, New York, NY, USA, 2015.
- [14] U. Aydonat, S. O'Connell, D. Capalija, A. Ling, and G. Chiu, 'An OpenCL Deep Learning Accelerator on Arria 10'. In Proceedings of 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays , pages 55-64, Monterey, California, USA, February 22-24 2017
- [15] M. Abdelfattah, D. Han, A. Bitar, R. DiCecco, S. O'Connell, N. Shanker, J. Chu, I. Prins, J. Fender, A. Ling and G. Chiu, 'DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration'. In Proceedings of 2018 International Conference on Field-Programmable Logic and Applications (FPL) , Dublin, Ireland, August 27-31 2018
- [16] B. Liu. and X. Xu, 'FCLNN: A Flexible Framework for Fast CNN Prototyping on FPGA with OpenCL and Caffe'. In Proceedings of 2018 International Conference on Field-Programmable Technology (FPT) , Naha, Japan, December 10-14, 2018
- [17] D. Wang, K. Xu and D. Jiang, 'PipeCNN: An OpenCL-based Open-Source FPGA Accelerator for Convolution Neural Networks'. In Proceedings of 2017 International Conference on Field Programmable Technology (ICFPT) , Melbourne, Australia, December 11-13, 2017
- [18] J. Zhang and J. Li, 'Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network'. In Proceedings of 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays , pages 25-34, Monterey, California, USA, February 22-24, 2017
- [19] K. Guo, S. Zeng, J. Yu, Y. Wang and H. Yang, '[DL] A Survey of FPGA-based Neural Network Inference Accelerator'. ACM Transactions on Reconfigurable Technology and Systems , Vol 9, No. 4, article 11, 26 pages, December 2017
- [20] Y. Fu, E. Wu, A. Sirasao, S. Attia, K. Khan and R. Wittig, Deep Learning with INT8 Optimization on Xilinx Devices .[online], Xilinx Inc, April, 2017.
- [21] S. Han, H. Mao and W. Dally, 'Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding'. In Proceedings of 2016 International Conference on Learning Representations (ICLR) , San Juan, Peurto Rico, May 2-4, 2016
- [22] Xilinx, Vivado Design Suite User Guide: High-level Synthesis UG902 . [online], Xilinx Inc, December 2018.
- [23] Xilinx, SDAccel Profiling and Optimization Guide, UG1207 . [online], Xilinx Inc, June 2018.
- [24] Intel, Intel High Level Synthesis Compiler User Guide , 19.2. [online] Intel Inc, July, 2019.
- [25] Intel, Intel FPGA SDK for OpenCL Pro Edition Programming Guide , 19.2. [online], Intel Inc, July 2019.
- [26] M. Hutton, Understanding How the New Intel HyperFlex FPGA Architecture Enables Next-generation High-performance Systems . [online] Intel, Inc,
- [27] K. He, L. Crockett and R. Stewart, 'Dynamic Reconfiguration Technologies based on FPGA in Software Defined Radio System'. Springer Journal of Signal Processing Systems , Vol. 69, Issue1, October, 2012, pp75-85