1. Introduction
Deep neural networks (DNNs) have achieved remarkable success in various demanding applications including image classification[1, 2], object detection[3, 4] and semantic segmentation[5, 6]. In resource-limited settings, the development of real-time and low-power hardware accelerators is especially critical, and hence various hardware devices including FPGAs and ASICs have been utilized to implement embedded DNN applications. In particular, FPGAs are gaining popularity because of their ability to provide superior energy efficiency and low-latency processing while supporting high reconfigurability, making them suitable for accelerating rapidly evolving deep neural networks[7-9].
However, most existing FPGA accelerators are designed for inference with low-precision DNN models, which are derived from high-precision models (e.g., 32/64-bit floating-point models) trained separately on GPUs or CPUs. Since DNNs employ different precision formats for training and inference, they often need further fine-tuning to achieve acceptable accuracy. These separate training/inference processes make it difficult for existing FPGA accelerators to support, for example, systems requiring continual learning[10]. Various low-precision training techniques, including mixed-precision[11, 12], fixed-point[13, 14] and ternary[15, 16] parameters, have been proposed to reduce the fine-tuning overhead of low-precision models.
In this paper, we explore the benefits and drawbacks of employing CPU, GPU and FPGA platforms for low-precision training. A novel FPGA framework is developed to support DNN training on a single FPGA with a low-precision 8-bit integer (int8) format. Our objective is to determine whether the fine-grained customizability and flexibility offered by FPGAs can be exploited to outperform cutting-edge GPUs in low-precision training in terms of speed and power consumption.
To meet our objective, the following challenges should be addressed.
(1) Compared with the inference process, the training process brings additional computations and different operations performed in backward propagation[17]. This leads to different requirements for hardware architecture and computational resources.
(2) Existing FPGA accelerators for inference usually exploit image-level and layer-level parallelism for efficient computing. In contrast, FPGA accelerators for training need to process batches of training examples in parallel. Therefore, effective exploitation of batch-level parallelism can contribute significant acceleration.
(3) Throughput is the primary performance metric of concern for training, while inference is latency sensitive. This causes batch-level parallelism to be neglected in inference accelerators.
To solve these problems, this paper proposes a novel FPGA architecture for DNN training by introducing a batch-oriented data pattern which we refer to as the channel-height-width-batch (CHWB) pattern. The CHWB pattern allocates training samples from different batches at adjacent memory addresses, which enables parallel data transfer and processing within one cycle. Our architecture supports the entire training process inside a single FPGA and accelerates it with batch-level parallelism. A thorough exploration of the design space with different levels of parallelism and their corresponding architectures, with respect to resource consumption and performance, is also presented in this paper.
Moreover, we propose DarkFPGA, an FPGA-based deep learning framework with a dataflow architecture. Our approach is built on the Darknet framework[18], a neural network framework written in C and CUDA, with the FPGA implementation written in MaxJ[19]. This paper is an extended version of an earlier conference paper[20]. The contributions of the previous version were as follows:
(1) A novel accelerator for the complete DNN training process. A dataflow architecture that exploits batch-level parallelism for efficient FPGA acceleration of DNN training is developed, providing a power-efficient and high-performance solution for training.
(2) A deep learning framework for low-precision training and inference on FPGAs called DarkFPGA. We perform extensive performance evaluations for our framework on the MAX5 platform for the training of several well-known networks.
(3) An automatic optimization tool for the framework to explore the design space to determine the optimal parameters for a given network specification.
Additionally, this extended version makes the following contributions:
(1) To address the timing problems caused by batch-level parallelism, pipelining registers are inserted to reduce fan-out, and super-logic region allocation is proposed to avoid long-wire interconnections.
(2) Training with int8 weights, instead of ternary weights, is proposed to maintain stable training performance for the low-precision model.
The rest of this paper is organized as follows. Section 2 reviews the training and inference processes and some existing FPGA-based accelerators. Section 3 introduces the low-precision training algorithm based on a low-bit number system. Section 4 describes the CHWB data pattern and the tiling technique. Section 5 proposes the dataflow accelerator designed for GEMM operations. Section 6 discusses the design space exploration for optimizing the accelerator design. Section 7 presents the DarkFPGA framework. Section 8 shows the experimental results, and Section 9 concludes the paper.
2. Background
This section provides background information on DNN training, emphasizing its differences from inference. It also introduces cutting-edge FPGA accelerators for deep neural networks.
2.1 Training versus Inference
Training consists of forward propagation, which computes the loss of the cost function, and backward propagation, which computes the gradients of the cost function; the gradients are then used to update the model weights to learn the desired behavior. Unlike inference, which involves only forward propagation, training is more computationally expensive and introduces additional operations in backward propagation.
Fig. 1 illustrates the overview of the inference and training processes of a convolutional layer.

Figure 1. An overview of the inference and training processes of a convolutional layer.
For better understanding, the pseudocode for training a convolutional layer is presented in Algorithm 1, which provides a precise description of the training process. The meaning of the notation can be found in Table 1; the same notation is used throughout the rest of this paper.
Table 1. Parameters for FPGA training.

| Parameter | Description |
| --- | --- |
| $B$ | batch size of training examples |
| $C$ | number of channels |
| $F$ | number of filters |
| $K$ | kernel size of the weights |
| $H$ | height of the frames |
| $W$ | width of the frames |
2.2 Related works
Most FPGA accelerators focus on DNN inference acceleration[22-27]. They[24, 28] usually exploit image-level and layer-level parallelism extensively for efficient inference speedup. Among training accelerators[29-35], Geng et al.[31] explore layer-level parallelism for training a model on multiple FPGAs in a pipelined manner. Li et al.[32] study different reconfigurable communication patterns on a multi-FPGA cluster. DiCecco et al.[33] study low-precision training with a reduced-precision floating-point library. However, these accelerators usually deploy the inference architecture naively for training without considering batch-level parallelism and the additional backward operations, which may lead to undesirable performance.
Recently, some researchers[29, 36] have attempted to tackle the problem by distributing the DNN computations across a heterogeneous FPGA-CPU system. Moss et al.[36] propose to perform the core GEMM operations on FPGAs and leave the remaining jobs to the CPU. This solution works well on an FPGA-CPU heterogeneous system but requires effective load balancing across the heterogeneous devices, since unpredictable communication cost between CPUs and FPGAs can make the FPGA-CPU communication a new bottleneck of the design. Based on our profiling in Section 8, the operations executed on the CPU may require more computational time than the FPGA acceleration of matrix multiplication.
With the objective of speeding up training, this paper studies the acceleration of the entire training process on a single FPGA, explores the parallelism across training batches, and provides an architecture suitable for bidirectional propagation. We propose a low-precision DNN training framework accelerated on a single FPGA platform. Compared with other frameworks, our customizable FPGA design achieves about 10 times speedup over a CPU-based implementation and is about 2.5 times more energy efficient than a GPU-based implementation.
Algorithm 1: Pseudocode for training convolutional layers

Forward propagation:
for b = 1 to B do
    for c = 1 to C × K do
        for f = 1 to F do
            for im = 1 to H × W do
                A_{l+1}[b][f][im] += W_l[f][c] × A_l[b][c][im]

Backward propagation:
for b = 1 to B do
    for c = 1 to C × K do
        for f = 1 to F do
            for im = 1 to H × W do
                E_l[b][c][im] += W_l[f][c] × E_{l+1}[b][f][im]

Gradient generation:
for b = 1 to B do
    for c = 1 to C × K do
        for f = 1 to F do
            for im = 1 to H × W do
                G_l[b][f][c] += A_l[b][c][im] × E_{l+1}[b][f][im]
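For readers who prefer a software reference, the loop nests of Algorithm 1 can also be written as plain C. The sketch below is ours rather than part of the Darknet or DarkFPGA code; it assumes the convolution has already been lowered to matrix form by im2col, that all arrays are stored densely in row-major order, and that the accumulated output buffers are zero-initialized by the caller.

```c
#include <stdint.h>
#include <stddef.h>

/* One training step of a convolutional layer in im2col'd (matrix) form.
 * Shapes (see Table 1):
 *   A_l  : [B][C*K][H*W]  input activations     W_l : [F][C*K]  weights
 *   A_l1 : [B][F][H*W]    output activations    E_l1: [B][F][H*W] next-layer errors
 *   E_l  : [B][C*K][H*W]  propagated errors     G_l : [B][F][C*K] per-example gradients */
static void conv_layer_training_step(
    const int8_t *A_l, const int8_t *W_l, const int8_t *E_l1,
    int32_t *A_l1, int32_t *E_l, int32_t *G_l,
    size_t B, size_t CK, size_t F, size_t HW)
{
    /* Forward propagation: A_{l+1}[b][f][im] += W_l[f][c] * A_l[b][c][im] */
    for (size_t b = 0; b < B; b++)
        for (size_t c = 0; c < CK; c++)
            for (size_t f = 0; f < F; f++)
                for (size_t im = 0; im < HW; im++)
                    A_l1[(b*F + f)*HW + im] +=
                        W_l[f*CK + c] * A_l[(b*CK + c)*HW + im];

    /* Backward propagation: E_l[b][c][im] += W_l[f][c] * E_{l+1}[b][f][im] */
    for (size_t b = 0; b < B; b++)
        for (size_t c = 0; c < CK; c++)
            for (size_t f = 0; f < F; f++)
                for (size_t im = 0; im < HW; im++)
                    E_l[(b*CK + c)*HW + im] +=
                        W_l[f*CK + c] * E_l1[(b*F + f)*HW + im];

    /* Gradient generation: G_l[b][f][c] += A_l[b][c][im] * E_{l+1}[b][f][im] */
    for (size_t b = 0; b < B; b++)
        for (size_t c = 0; c < CK; c++)
            for (size_t f = 0; f < F; f++)
                for (size_t im = 0; im < HW; im++)
                    G_l[(b*F + f)*CK + c] +=
                        A_l[(b*CK + c)*HW + im] * E_l1[(b*F + f)*HW + im];
}
```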
3. Low-precision DNN training algorithm
Our low-precision training algorithm is developed based on WAGE[15], which we modify for better FPGA implementation using shift-based linear mapping and hardware-friendly quantization. Our optimizations are described below.

The basic idea of WAGE[15] is to constrain four operands to low-bitwidth integers: the weights W, activations A, errors E and gradients G.
3.1 Shift-based linear mapping
In order to quantize floating-point numbers to fixed-point numbers, k-bit linear mapping is adopted in WAGE[15], where continuous and unbounded values are discretized with uniform distance $\sigma(k)$:
$$\sigma(k) = 2^{1-k}, \quad k \in \mathbb{N}_+,$$
$$Q(x,k) = \mathrm{Clip}\left\{ \sigma(k) \times \mathrm{round}\!\left[\frac{x}{\sigma(k)}\right],\; -1+\sigma(k),\; 1-\sigma(k) \right\}.$$
Here the round function maps a floating-point number to the nearest fixed-point number, and Clip is the saturation function that clips unbounded values to the range $[-1+\sigma(k),\, 1-\sigma(k)]$.
Considering the large hardware overhead of floating-point operations, mathematically equivalent integer operations are introduced in our implementation, where the linear mapping is transformed into shifting from a large data format (32-bit integers) down to small integers (e.g., 8-bit integers):
$$\sigma(k) = 2^{k-1}, \quad k \in \mathbb{N}_+,$$
$$Q(x,k,\mathrm{shift}) = \mathrm{Clip}\left\{ (x + \mathrm{round\_value}) \gg \mathrm{shift},\; -1+\sigma(k),\; 1-\sigma(k) \right\},$$
$$\mathrm{round\_value} = 1 \ll (\mathrm{shift}-1).$$
Here we replace the division operations of the floating-point equations with shift operations, introducing an additional scaling factor shift that moves the value distribution to an appropriate order of magnitude. The scaling factor shift is obtained in WAGE[15] by the following equation:
$$\mathrm{shift}(x) = \mathrm{round}(\log_2 x).$$
With its complex logarithm and exponential operations, shift(x) requires extensive resources to implement on an FPGA. To handle this problem, we fine-tune this formula, which is used to obtain the nearest power-of-two value of the input $x$:
$$\mathrm{shift}(x) = \mathrm{ceil}(\log_2 x).$$
After fine-tuning, the shift factor corresponds to the smallest power of two greater than the input $x$, which can be obtained with a simple leading-one detector:
$$\mathrm{shift}(x) = \mathrm{leading1}(x) + 1.$$
Here the leading1 function returns the index of the most significant non-zero bit. Detailed experiments show that this fine-tuning has no effect on the convergence of network training while being more hardware-friendly for FPGA implementation.
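To make the hardware-friendly quantization concrete, the following C sketch models the integer form of $Q(x, k, \mathrm{shift})$ together with the leading1-based shift factor. It is a minimal software model under the definitions above (it assumes $\mathrm{shift} \geqslant 1$ and $x > 0$ for the shift-factor helpers); the actual kernels are written in MaxJ, and the helper names are ours.

```c
#include <stdint.h>

/* Index of the most significant set bit of x (x > 0), i.e. leading1(x). */
static int leading1(uint32_t x)
{
    int pos = 0;
    while (x >>= 1)
        pos++;
    return pos;
}

/* shift(x) = leading1(x) + 1: the smallest s with 2^s > x (for x > 0). */
static int shift_factor(uint32_t x)
{
    return leading1(x) + 1;
}

/* Q(x, k, shift): shift a 32-bit intermediate value down to a k-bit integer,
 * rounding to nearest via round_value and saturating to the k-bit range.
 * Assumes shift >= 1. */
static int32_t quantize(int32_t x, int k, int shift)
{
    int32_t sigma = 1 << (k - 1);            /* sigma(k) = 2^(k-1)          */
    int32_t round_value = 1 << (shift - 1);  /* round to nearest            */
    int32_t q = (x + round_value) >> shift;  /* scale down by 2^shift       */
    if (q >  sigma - 1) q =  sigma - 1;      /* saturate to the symmetric   */
    if (q < -sigma + 1) q = -sigma + 1;      /* k-bit range                 */
    return q;
}
```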
3.2 Quantization details
The quantization scheme consists of four operations: weight quantization $Q_W$, activation quantization $Q_A$, error quantization $Q_E$ and gradient quantization $Q_G$, which are detailed below.
3.2.1 Weight QW
Weights are initialized on the software platform based on the initialization method of He et al.[37], which can be formulated as:
$$W \sim U(-L, +L), \quad L = \max\left\{\sqrt{6/n_{\mathrm{in}}},\, L_{\min}\right\}, \quad L_{\min} = 1,$$
where $n_{\mathrm{in}}$ is the fan-in (number of input connections) of the layer and $L_{\min}$ is a lower bound on the initialization range.
3.2.2 Activation QA
For activations, the bit width would increase after computation. A filter-wise scaling factor $a_{\mathrm{shift}}$ is therefore introduced to scale the activations back down:
$$a_{\mathrm{shift}} = \log_2\left(\max\{\mathrm{shift}(L_{\min}/L),\, 0\}\right).$$
This factor is a pre-defined constant for each layer, determined by the network structure. Using this factor, the quantized activation is obtained as:
$$a_q = Q(a, k_A, a_{\mathrm{shift}}).$$
3.2.3 Error QE
Experiments from WAGE[15] demonstrate that the orientation of the errors plays an important role in convergence during training. Therefore, orientation-based quantization scales the errors into [–1, 1] by dividing by a shift factor:
$$e_q = Q(e, k_E, \mathrm{shift}(\max|e|)),$$
where $\max|e|$ is the maximum absolute value of all errors in the layer. Computing the exact maximum requires a comparator tree, so we replace $\max|e|$ with a bitwise OR reduction over $|e|$, which shares the same leading-one position and therefore yields the same shift factor at lower hardware cost:
$$e_q = Q(e, k_E, \mathrm{shift}(\mathrm{or}|e|)),$$
where $\mathrm{or}|e|$ denotes the bitwise OR of the absolute values of all errors.
3.2.4 Gradient QG
Since only the relative values of the errors are preserved after shifting, the gradients are shifted accordingly. We first rescale the gradient with the learning rate $\eta$ and then quantize it with stochastic rounding:
$$g_q = \mathrm{Bernoulli}\left\{ (\eta \times g) \gg g_{\mathrm{shift}} \right\}, \quad g_{\mathrm{shift}} = \mathrm{shift}(\mathrm{or}|g|),$$
where $\eta$ is the learning rate and $\mathrm{or}|g|$ is the bitwise OR of the absolute gradient values, used in place of $\max|g|$ as in the error quantization.
The Bernoulli[38] function was originally designed in the floating-point number system to stochastically sample fractional parts to either 0 or 1: larger fractional parts have a higher probability of being rounded to 1, and smaller ones to 0. In the integer system, in contrast, the Bernoulli function stochastically rounds the shifted-out part of the quantized value. This is realized with MersenneTwister, a widely-used general-purpose pseudorandom number generator[39], which generates random numbers uniformly distributed within a range determined by the shift amount. The Bernoulli property is added through the following addition:
$$g_q = \mathrm{Clip}\left\{ (\eta \times g + \mathrm{round\_value}) \gg g_{\mathrm{shift}},\; -1+\sigma(k),\; 1-\sigma(k) \right\},$$
$$g_{\mathrm{shift}} = \mathrm{shift}(\mathrm{or}|g|), \quad \mathrm{round\_value} = \mathrm{random\_int} \bmod (1 \ll g_{\mathrm{shift}}),$$
where $\mathrm{random\_int}$ is a random integer produced by the MersenneTwister generator.
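The following C sketch models this integer stochastic rounding. It assumes the scaled gradient $\eta \times g$ is already available as a 32-bit integer, that $0 < g_{\mathrm{shift}} < 32$, and that the random integer is supplied by an external generator such as MersenneTwister; the function name is illustrative only.

```c
#include <stdint.h>

/* Stochastically round a scaled gradient down to a k-bit integer. */
static int32_t quantize_gradient(int32_t g_scaled,   /* eta * g, integer domain */
                                 int g_shift,        /* shift(or|g|)            */
                                 int k,              /* target bit width        */
                                 uint32_t random_int /* from MersenneTwister    */)
{
    int32_t sigma = 1 << (k - 1);
    /* A random offset uniform in [0, 2^g_shift) makes the probability of
     * rounding up proportional to the value of the discarded low bits. */
    int32_t round_value = (int32_t)(random_int % (1u << g_shift));
    int32_t q = (g_scaled + round_value) >> g_shift;
    if (q >  sigma - 1) q =  sigma - 1;   /* saturate to the k-bit range */
    if (q < -sigma + 1) q = -sigma + 1;
    return q;
}
```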
4. Data pattern and tiling technique
4.1 CHWB pattern
For DNN training, the weights, activations, errors and gradients are too large to be stored completely in the on-chip memory, so only a portion of the data can be cached on-chip while the remainder is kept off-chip. As the bandwidth between the on-chip and off-chip memory is limited, exploring an optimal data access pattern for efficient bandwidth utilization is necessary for training.
Currently, the most widely-used data pattern for training on GPUs is referred to as batch-channel-height-width (BCHW), which describes the order of the data dimensions in the memory space[40], where the elements along the lowest dimension W are stored consecutively. An example of data represented in the BCHW pattern is shown in Fig. 2(a), and its corresponding layout in DRAM is illustrated in Fig. 2(c). With this pattern, however, it is difficult to fetch elements from different batches in burst mode, because they are usually not stored consecutively in memory. Therefore, the BCHW data pattern may under-utilize the bandwidth when exploring batch-level parallelism.
Figure 2. (Color online) Comparison of BCHW and CHWB patterns.
To address this problem, we develop the channel-height-width-batch (CHWB) pattern to explore batch-level parallelism without compromising bandwidth utilization on FPGAs. As shown in Fig. 2(b), elements from adjacent batches are allocated consecutively, which allows the memory interface to read multiple training examples simultaneously. In this manner, the CHWB data pattern enables our accelerator to acquire all necessary input data with a single DRAM burst access and greatly improves bandwidth utilization for the FPGA accelerator.
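The difference between the two patterns can be summarized by the flat DRAM offsets they assign to an element $(b, c, h, w)$ of a $B \times C \times H \times W$ tensor, as in the following C sketch (the helper names are ours):

```c
#include <stddef.h>

/* BCHW: batch is the slowest-varying dimension. */
static size_t offset_bchw(size_t b, size_t c, size_t h, size_t w,
                          size_t C, size_t H, size_t W)
{
    return ((b * C + c) * H + h) * W + w;
}

/* CHWB: batch is the fastest-varying dimension. */
static size_t offset_chwb(size_t b, size_t c, size_t h, size_t w,
                          size_t B, size_t H, size_t W)
{
    return ((c * H + h) * W + w) * B + b;
}
```

With `offset_chwb`, the $T_B$ elements that share the same $(c, h, w)$ coordinates sit at consecutive addresses, so a single burst of length $T_B$ can feed $T_B$ processing elements in one cycle.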
4.2 Tiling
Tiling is a common optimization technique to improve bandwidth utilization for DNN acceleration on resource-limited FPGA devices[41]. The strategy partitions large input frames into smaller tiles of data, where each tile fits into the on-chip memory of an FPGA. For training, which involves some resource-intensive tasks such as matrix transposition, the tiling strategy is necessary for FPGA implementation.
For the CHWB pattern, we consider tiling along four data dimensions: the batch tile $T_B$, the channel tile $T_C$, the filter tile $T_F$ and the image tile $T_I$. To keep the unified GEMM kernel simple, the channel, filter and image dimensions are all tiled with the same size $T_I$, so the design space is characterized by the two parameters $T_B$ and $T_I$.
Take convolution as an example of how the tiling technique works. In Fig. 3, the input matrix stored in the CHWB pattern is partitioned into 3-dimensional $T_B \times T_I \times T_I$ tiled blocks, while the weight matrix is partitioned into $T_I \times T_I$ tiled blocks; the GEMM kernel processes one pair of tiles at a time and accumulates the partial results.
Figure 3. (Color online) The tiling flow for convolution.
The tiling parameters $T_B$ and $T_I$ determine the level of parallelism as well as the resource and bandwidth requirements of the accelerator; their selection is discussed in Section 6.
5. FPGA accelerators for DNN training
In this section, we build on the CHWB pattern and the tiling technique to develop the architecture of our training accelerator.
5.1 System overview
A system overview of our FPGA-based training accelerator is presented in Fig. 4; it consists of a computation kernel, a global controller and a DDR controller for off-chip memory transfer. The computation kernel consists of a batch splitter, a set of processing elements (PEs) and a batch merger. When a stream of training batches arrives at the kernel, the splitter divides the stream into multiple parallel streams via shift registers to facilitate batch-level parallelism. The streams are then processed by the PEs in parallel. Each PE contains a general matrix multiplication (GEMM) kernel or an auxiliary kernel to perform training operations. After processing, the streams are merged into a single output stream and sent to the DDR controller. The global controller is responsible for controlling the behaviour of each computation kernel, including assigning memory addresses for loading/writing data through the DDR controller, enabling special operations required by particular layers, and controlling the direction of the data flow. The CPU sets the network configuration in the global controller before starting training.
Figure 4. (Color online) System overview.
5.2 Unified GEMM kernel
Fig. 5 presents the architecture of the GEMM kernel, which provides a unified datapath to support the convolutional and fully-connected computations of the forward and backward propagation, as well as the gradient generation. This unified approach employs matrix multiplication to implement all of these computations, where only the input/output matrices to/from the kernel need to be changed. Therefore, we can avoid time-consuming dynamic reconfiguration[29] or the use of separate kernels for different operations[31].
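The following C sketch illustrates this unification for one training example of a convolutional layer (after im2col): a single matrix-multiplication routine serves the forward pass, the backward pass and the gradient generation, and only the operands and transposition flags change, mirroring how the hardware only reconfigures its input buffers and data flow. The routine and its signature are ours and are not part of the MaxJ implementation; outputs are assumed zero-initialised by the caller.

```c
/* C[M][N] += op(A)[M][K] * op(B)[K][N]; transA/transB read A or B transposed. */
static void gemm(const int *A, const int *B, int *C,
                 int M, int N, int K, int transA, int transB)
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            int acc = 0;
            for (int k = 0; k < K; k++) {
                int a = transA ? A[k * M + m] : A[m * K + k];
                int b = transB ? B[n * K + k] : B[k * N + n];
                acc += a * b;
            }
            C[m * N + n] += acc;
        }
}

/* For one example of a convolutional layer (after im2col):
 *   forward:  A_{l+1} (F x HW)  = W_l (F x CK)     * A_l (CK x HW)
 *   backward: E_l     (CK x HW) = W_l^T (CK x F)   * E_{l+1} (F x HW)
 *   gradient: G_l     (F x CK)  = E_{l+1} (F x HW) * A_l^T (HW x CK)          */
static void conv_passes(const int *W_l, const int *A_l, const int *E_l1,
                        int *A_l1, int *E_l, int *G_l,
                        int F, int CK, int HW)
{
    gemm(W_l,  A_l,  A_l1, F,  HW, CK, 0, 0);  /* forward propagation  */
    gemm(W_l,  E_l1, E_l,  CK, HW, F,  1, 0);  /* backward propagation */
    gemm(E_l1, A_l,  G_l,  F,  CK, HW, 0, 1);  /* gradient generation  */
}
```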
Figure 5. (Color online) Hardware architecture of the GEMM kernel.
Before any computation, the input data streams are stored in the input buffers, which are organized as double buffers in order to overlap the data transfer and matrix transposition with the computation. As shown in Fig. 6, while the $i$-th tile is being read by the GEMM kernel from one buffer, the $(i+1)$-th tile is loaded from DRAM, and transposed on the fly when required, into the other buffer.
Figure 6. (Color online) Input double buffer supporting matrix transposition.
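A software model of this double buffer is sketched below in C: while the GEMM kernel reads from the active tile, the incoming stream fills the other tile either directly or transposed, so that the backward pass can consume $W^T$ (or $A^T$) without a separate transposition step. The structure and names are illustrative, assuming a tile size of $T_I = 32$.

```c
#define TI 32   /* image tile size (assumed) */

struct double_buffer {
    int tile[2][TI][TI];
    int active;              /* tile currently read by the GEMM kernel */
};

/* Fill the inactive tile from a stream of TI*TI values. */
static void load_tile(struct double_buffer *db, const int *stream, int transpose)
{
    int fill = 1 - db->active;
    for (int i = 0; i < TI; i++)
        for (int j = 0; j < TI; j++) {
            int v = stream[i * TI + j];
            if (transpose)
                db->tile[fill][j][i] = v;   /* write transposed              */
            else
                db->tile[fill][i][j] = v;   /* write in the original layout  */
        }
}

/* Swap roles once the compute side has finished with the active tile. */
static void swap_buffers(struct double_buffer *db)
{
    db->active = 1 - db->active;
}
```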
The GEMM kernel fetches data from the input buffers to perform tiled matrix multiplication. The intermediate values of each iteration are stored in the output buffers for the next iteration. The final results are post-processed by the batch merger and then transferred back to DRAM. The details of the tiled matrix multiplication are shown in Algorithm 2. Note that the shift factor of the forward propagation is pre-defined, while the shift factor of the backward propagation is obtained from the output results. Therefore, the quantization function can be attached after the forward propagation to reduce the output bandwidth. In contrast, the output results remain in int32/int16 format during the backward propagation and the gradient generation, and require quantization with the help of the auxiliary kernels.
In order to support the different modes of operation of the forward propagation, the backward propagation and the gradient generation, the global controller dynamically re-configures the buffers and the data flow on the datapath. Under the control of the global controller, the input buffer can be configured to perform on-the-fly matrix transposition for the computation of backward propagation. Furthermore, the multiplexer can be switched to feed the different input streams to their corresponding processing elements, and the demultiplexer can be switched to direct the output stream to the appropriate post-processing unit.
5.3 Auxiliary kernels
The auxiliary kernels accelerate supplementary operations with batch-level parallelism, including im2col, col2im, max-pooling, reshape, summation, nonlinear functions and quantization, as well as their backward counterparts where necessary. These supplementary operations are processed independently since they have no learnable weights and occupy only a small fraction of the total computation.
Separate processing units are implemented to support the various supplementary operations. In particular, the max-pooling units compute the maximum value and the corresponding index over a number of neighboring pixels, whereas the backward max-pooling units propagate the errors to the chosen indices of the subgraphs. The im2col units expand the input feature map into column vectors, and the col2im units accumulate the column vectors back into the input feature map. The quantization units cast the intermediate errors and gradients generated during backward propagation and gradient generation to low bit-width quantized data (8-bit integers in our design). The summation units simply add two flows together, and the reshape units change the location of data in the buffers.
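As an example of such a unit, the following C sketch models the max-pooling operation and its backward counterpart for a single channel with a 2×2 window and stride 2; the forward pass records the winning index so that the backward pass can route each error to the position that produced the maximum. The code is a software illustration only, with names of our choosing.

```c
#include <stdint.h>

/* Forward 2x2 max pooling over one H x W channel (H and W even). */
static void maxpool2x2_forward(const int8_t *in, int8_t *out, int *argmax,
                               int H, int W)
{
    for (int y = 0; y < H / 2; y++)
        for (int x = 0; x < W / 2; x++) {
            int best = (2 * y) * W + (2 * x);
            for (int dy = 0; dy < 2; dy++)
                for (int dx = 0; dx < 2; dx++) {
                    int idx = (2 * y + dy) * W + (2 * x + dx);
                    if (in[idx] > in[best])
                        best = idx;
                }
            out[y * (W / 2) + x] = in[best];
            argmax[y * (W / 2) + x] = best;   /* remember the winning index */
        }
}

/* Backward max pooling: route each error to the position that won the
 * forward comparison; all other positions receive zero. */
static void maxpool2x2_backward(const int32_t *err_out, const int *argmax,
                                int32_t *err_in, int H, int W)
{
    for (int i = 0; i < H * W; i++)
        err_in[i] = 0;
    for (int i = 0; i < (H / 2) * (W / 2); i++)
        err_in[argmax[i]] += err_out[i];
}
```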
6. Design space exploration
This section presents the design space exploration for optimizing the proposed DNN training accelerator. The performance of the FPGA implementation is affected by factors including the batch tile size $T_B$ and the image tile size $T_I$, whose effects we capture with the resource, bandwidth and performance models below.
6.1 Resource modeling
There are three kinds of hardware resources in FPGAs: LUTs, block RAMs (BRAMs) and DSPs, which form the resource constraints of our design space. We present equations to estimate the utilization of each of them.
First, the resource consumption of the global controller and the DRAM controller is independent of the design parameters; we denote it as $\mathrm{LUT_{fix}}$, $\mathrm{DSP_{fix}}$ and $\mathrm{BRAM_{fix}}$.
Second, the resource consumption of the computational kernels is affected significantly by different design parameters. For example, BRAMs are utilized in the input buffers of GEMM kernels and their usage is given by:
$$\mathrm{BRAM} = \frac{4 \times T_B \times T_I^2 \times (2 \times L_I) + 4 \times T_I^2 \times L_W}{\mathrm{BRAM_{SIZE}}},$$
where the constants 4 are contributed by the double buffers holding both the normal and the transposed matrices, $L_I$ and $L_W$ are the data widths of the input and weight values, and $\mathrm{BRAM_{SIZE}}$ is the capacity of a single BRAM block.
The multiply-and-add units utilize the DSPs as
$$\mathrm{DSP} = T_B \times T_I \times D_{\mathrm{mul}} + T_B \times A_l \times D_{\mathrm{add}} + T_B \times D_{\mathrm{add}},$$
where $D_{\mathrm{mul}}$ and $D_{\mathrm{add}}$ are the numbers of DSPs consumed by a single multiplier and adder, respectively, and $A_l$ denotes the depth of the adder tree, $\log_2(T_I)$.
Finally, an approximate regression model is used to estimate the LUT consumption, as it is difficult to predict statically:
$$\mathrm{LUT} = T_B \times T_I \times \beta + T_B \times \delta,$$
where $\beta$ and $\delta$ are coefficients fitted from the synthesis results of sample designs.
6.2 Bandwidth modeling
There are three streams flowing from DRAM to the GEMM kernels. In each cycle of convolution, one weight is read from the weight stream while $T_B$ input values are read from the input stream, and on average $T_B \times T_I / N$ results are written to the output stream; for fully-connected layers, $T_I$ weights are read per cycle. The bandwidth requirements are therefore:
$$\mathrm{BW_{CONV}} = \left( T_B \times L_I + \frac{T_B \times T_I}{N} \times L_O + L_W \right) \times f,$$
$$\mathrm{BW_{FC}} = \left( T_B \times L_I + \frac{T_B \times T_I}{N} \times L_O + T_I \times L_W \right) \times f,$$
where $L_I$, $L_O$ and $L_W$ are the data widths of the input, output and weight streams, $N$ is the number of accumulation iterations over which an output tile is reused before being written back, and $f$ is the clock frequency.
For the auxiliary kernels, the bandwidth requirements are relatively large compared with the small amount of computation performed. In general, one or two input values are consumed to generate one or two output values, i.e., up to 4 values are handled in each cycle. As these operations also benefit from batch-level parallelism, their bandwidth requirements are likewise multiplied by $T_B$:
$$\mathrm{BW_{auxiliary}} = (2 \times T_B \times L_I + 2 \times T_B \times L_O) \times f.$$
6.3 Performance modeling
In each clock cycle, a GEMM kernel can accomplish $T_B \times T_I$ multiply-accumulate operations. Ideally, the computational times of the convolutional and fully-connected layers are therefore:
$$T_{\mathrm{CONV}} = \frac{B \times C \times K \times F \times H \times W}{T_I \times T_B \times f}, \quad T_{\mathrm{FC}} = \frac{B \times C \times F}{T_I \times T_B \times f}.$$
However, the above formulae only hold in the ideal case. In practice, in order to support parallel computing, the tiled matrices are padded with zero values, which increases the actual computational time. Therefore, the computational time is estimated as:
$$T_{\mathrm{CONV}} = \frac{\lceil B \rceil^{T_B} \times \lceil C \times K \rceil^{T_I} \times \lceil F \rceil^{T_I} \times \lceil H \times W \rceil^{T_I}}{T_B \times T_I \times f},$$
$$T_{\mathrm{FC}} = \frac{\lceil B \rceil^{T_B} \times \lceil C \rceil^{T_I} \times \lceil F \rceil^{T_I}}{T_B \times T_I \times f}, \quad \lceil X \rceil^{T} = \mathrm{ceil}(X / T) \times T,$$
where the function $\lceil X \rceil^{T}$ rounds $X$ up to the nearest multiple of $T$.
On the other hand, in each cycle of the auxiliary kernels, frames from $T_B$ different batch examples are handled simultaneously, so the computational time of the auxiliary kernels is:
$$T_{\mathrm{auxiliary}} = \frac{\lceil B \rceil^{T_B} \times C \times H \times W}{T_B \times f}.$$
Benefiting from our dataflow architecture, the data transfer time can be overlapped with the computation time of the kernels.
By evaluating every parameter combination with the above models, a single-objective optimization tool can be built to minimize the execution time:
$$\mathrm{Minimize\;Time} = T'_{\mathrm{CONV}} + T'_{\mathrm{FC}} + T'_{\mathrm{auxiliary}},$$
$$\mathrm{where} \left\{ \begin{array}{l}
\mathrm{LUT} + \mathrm{LUT_{fix}} \leqslant \mathrm{LUT_{limit}}, \\
\mathrm{BRAM} + \mathrm{BRAM_{fix}} \leqslant \mathrm{BRAM_{limit}}, \\
\mathrm{DSP} + \mathrm{DSP_{fix}} \leqslant \mathrm{DSP_{limit}}, \\
\mathrm{BW} \leqslant \mathrm{BW_{limit}},
\end{array} \right.$$
where $\mathrm{LUT_{limit}}$, $\mathrm{BRAM_{limit}}$, $\mathrm{DSP_{limit}}$ and $\mathrm{BW_{limit}}$ are the available resources and bandwidth of the target FPGA platform, and $T'_{\mathrm{CONV}}$, $T'_{\mathrm{FC}}$ and $T'_{\mathrm{auxiliary}}$ are the total computational times of the convolutional, fully-connected and auxiliary operations summed over all layers of the network.
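A minimal software sketch of such an optimization tool is shown below: it enumerates power-of-two $(T_B, T_I)$ pairs, estimates resources and execution time with simplified versions of the models above, and keeps the fastest design that fits. The resource limits, cost coefficients and the example layer are placeholders rather than measured values, and ties are broken in favour of the larger $T_B$ to prioritize batch-level parallelism.

```c
#include <math.h>
#include <stdio.h>

static double ceil_to(double x, double t) { return ceil(x / t) * t; }

int main(void)
{
    const double f = 200e6;            /* clock frequency (Hz)               */
    const double DSP_LIMIT = 6840;     /* illustrative VU9P DSP budget       */
    const double BRAM_LIMIT = 4320;    /* illustrative VU9P BRAM budget      */
    double best_time = 1e30;
    int best_tb = 0, best_ti = 0;

    for (int tb = 1; tb <= 128; tb *= 2)
        for (int ti = 2; ti <= 128; ti *= 2) {
            /* Simplified resource model: one DSP per multiplier plus an
             * adder tree per batch lane; BRAM for the double buffers,
             * assuming 8-bit data and 36-kb BRAM blocks. */
            double dsp  = (double)tb * ti + (double)tb * log2((double)ti);
            double bram = (4.0 * tb * ti * ti * 2 * 8 + 4.0 * ti * ti * 8)
                          / 36864.0;
            if (dsp > DSP_LIMIT || bram > BRAM_LIMIT)
                continue;

            /* Simplified performance model for one conv layer of Table 2
             * (CONV2: B = 128, C*K = 128*9, F = 128, H*W = 1024), with
             * zero padding to tile multiples. */
            double t = ceil_to(128, tb) * ceil_to(128 * 9, ti)
                     * ceil_to(128, ti) * ceil_to(1024, ti)
                     / ((double)tb * ti * f);

            /* Prefer batch-level parallelism when estimates tie. */
            if (t < best_time || (t == best_time && tb > best_tb)) {
                best_time = t;
                best_tb = tb;
                best_ti = ti;
            }
        }

    printf("best (T_B, T_I) = (%d, %d), estimated time = %.3f ms\n",
           best_tb, best_ti, best_time * 1e3);
    return 0;
}
```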
7. The DarkFPGA framework
Based on our proposed dataflow architecture, we present DarkFPGA, a hardware/software co-designed FPGA framework for effective training. The DarkFPGA framework is built on a scalable accelerator architecture that is software-definable, supporting various DNN networks and different levels of parallelism through the deployment of different FPGA bitstreams. Moreover, an optimization tool is included to produce optimized designs based on user constraints.
We automate the process of exploring design parameters for the DarkFPGA framework, which accelerates the entire training process with a unified module on the FPGA. Our tool receives a network description and a training dataset and produces the most suitable parameters for the accelerator. The overview of the DarkFPGA framework and its optimization tool is illustrated in Fig. 7, comprising six stages:
Figure 7. (Color online) The DarkFPGA framework.
(1) Parse the network description. The tool predicts optimized parameter values and selects a suitable FPGA bitstream to configure the hardware.
(2) Allocate device DRAM space for the activations, weights, errors and gradients.
(3) Initialize weights and transfer them to DRAM.
(4) Fetch and transfer the training samples to DRAM. Data reorganization is used to convert training samples into the CHWB sequence.
(5) Launch FPGA acceleration.
(6) Train neural network iteratively. Transfer loss and accuracy information back to the host for each complete training batch.
8. Experimental results
We evaluate our framework on the Maxeler MAX5 platform, which hosts a Xilinx UltraScale+ VU9P FPGA. Three 16 GB DDR4 DIMMs are installed on the platform with a maximum bandwidth of 63.9 GB/s. Our hardware accelerator works at 200 MHz. Maxcompiler 2019.1 and Vivado 2018.3 are used for synthesis and implementation. A VGG-like network[43] trained on the Cifar10[1] dataset is evaluated in the following training experiments. Differing from the conference version of this paper[20], we use int8 weights instead of ternary weights during training. Although the use of int8 weights causes an inevitable degradation in training performance, it obtains more stable accuracy during training.
Note that while our implementation is able to achieve massive parallelization with the dataflow architecture, it may have difficulty meeting timing closure at a high clock frequency, or even passing place and route. This is partly because the direct interconnects become long wires when the DSP blocks are distributed across the whole FPGA chip, and large-scale data reuse between DSP blocks introduces large fan-out[44]. Moreover, timing closure becomes harder on the Xilinx UltraScale+ VU9P FPGA, which is composed of multiple super-logic regions (SLRs): signal paths crossing two SLRs may introduce large delays and significantly impact timing. DarkFPGA tackles the timing issue with two major optimizations. First, multiple levels of pipelining registers are inserted after high fan-out data streams, forming a balanced tree that limits the fan-out of each node. Second, the GEMM kernel of DarkFPGA is divided into two parallel submodules, each assigned to one super-logic region, so that the signal interconnects of the GEMM kernel are kept within their corresponding SLR.
8.1 Exploration of DarkFPGA performance
Based on the discussions in Section 5 and Section 6, the performance of DarkFPGA is significantly determined by the tile sizes ($T_B$, $T_I$). In this experiment, we evaluate different tile-size combinations for training the network described in Table 2.
Table 2. The network architecture used in the experiments.

| Layer | B | C | F | H × W | K |
| --- | --- | --- | --- | --- | --- |
| CONV1 | 128 | 3 | 128 | 32 × 32 | 3 × 3 |
| CONV2 | 128 | 128 | 128 | 32 × 32 | 3 × 3 |
| MAXPOOLING | 128 | 128 | 128 | 16 × 16 | 2 × 2 |
| CONV3 | 128 | 128 | 256 | 16 × 16 | 3 × 3 |
| CONV4 | 128 | 256 | 256 | 16 × 16 | 3 × 3 |
| MAXPOOLING | 128 | 256 | 256 | 8 × 8 | 2 × 2 |
| CONV5 | 128 | 256 | 512 | 8 × 8 | 3 × 3 |
| CONV6 | 128 | 512 | 512 | 8 × 8 | 3 × 3 |
| MAXPOOLING | 128 | 512 | 512 | 4 × 4 | 2 × 2 |
| FC | 128 | 8096 | 1024 | – | – |
| FC | 128 | 1024 | 10 | – | – |
| SSE | 128 | 10 | 10 | – | – |
The batch tile $T_B$ determines the degree of batch-level parallelism, while the image tile $T_I$ determines the degree of image-level parallelism inside the GEMM kernel and therefore directly affects $T_{\mathrm{CONV}}$ and $T_{\mathrm{FC}}$. Since the product $T_B \times T_I$ is bounded by the available hardware resources, the design space exploration essentially balances $T_B$ against $T_I$.

The corresponding performance and resource consumption under different design parameters ($T_B$, $T_I$) are shown in Fig. 8. Designs with the same product $T_B \times T_I$ consume similar amounts of computational resources but deliver different performance: within the resource budget, configurations with a larger $T_B$ exploit batch-level parallelism more effectively and lead to shorter training time than those favouring a larger $T_I$.
Figure 8. (Color online) Performance and resource consumption under different design parameters using int8 weights: (a) computational time; (b) performance evaluation; (c) resource consumption.
Therefore, we customize the DarkFPGA design with $T_B = 128$ and $T_I = 32$, which is determined to be the optimal implementation of the training accelerator.
8.2 Heterogeneous versus homogeneous computing
Some existing FPGA accelerators rely on heterogeneous computing to handle auxiliary operations[36, 45]. To quantitatively compare the performance discrepancy between heterogeneous and homogeneous computing, our DarkFPGA framework is revised to implement heterogeneous computation across an FPGA-CPU system, which is achieved by delegating the auxiliary operations to the CPU and removing the auxiliary kernels.
In this experiment, the tile sizes are set to ($T_B = 128$, $T_I = 32$) for both systems; the resulting performance comparison is shown in Fig. 9.
Figure 9. (Color online) Performance comparison between the homogeneous system and the heterogeneous system.
Note that using a multi-threaded or high-performance CPU can significantly improve heterogeneous computing performance. However, its high power consumption poses a tough challenge for embedded DNN applications.
8.3 Performance comparison with GPU and CPU
Here we compare the performance of DarkFPGA with that of other platforms, namely a GPU and a CPU. All software results are obtained on an Intel Xeon X5690 CPU (6 cores, 3.47 GHz) and an NVIDIA GeForce GTX 1080 Ti GPU. After finishing the same number of batches, all platforms achieve similar accuracy. Since the GeForce GTX 1080 Ti does not have native int8 support, we evaluate the GPU performance by limiting the range of the float32 number system instead of performing actual low-precision training on the GPU.
Table 3 shows the performance and power consumption, as well as other important metrics, of the different platforms. DarkFPGA achieves over 200 times speedup over the CPU-based implementation of Darknet and is about 2 times slower than the GPU-based implementation of Darknet in overall performance. The average power consumption of the FPGA (13.5 W) is obtained from the Maxeler performance monitoring tools. Multiplying time by power consumption, our FPGA-based design is about 5 times more energy efficient than the GPU implementation of Darknet.
Table 3. Performance comparison among FPGA, CPU and GPU.

| Parameter | CPU | GPU | DarkFPGA |
| --- | --- | --- | --- |
| Platform | Intel Xeon X5690 | GTX 1080 Ti | MAX5 platform |
| No. of cores | 6 | 3584 | ? |
| Compiler | GCC 5.4.0 | CUDA 9.0 | Maxcompiler 2019.2 |
| Flag | -Ofast | ? | ? |
| Frequency (GHz) | 3.47 | 1.58 | 0.2 |
| Precision | 32-bit floating point | 32-bit floating point | 8-bit fixed point |
| Technology (nm) | 32 | 28 | 16 |
| Processing time per batch (ms) | 66439 (3270) | 126 (53.4) | 331 |
| Threads | 1 (24) | ? | ? |
| Power (W) | 131 (204) | 187 (217) | 13.5 |
| Energy (J) | 8712 (667.1) | 23.6 (11.6) | 4.5 |
| Energy efficiency | 1x (13x) | 369x (751x) | 1936x |
Note that Darknet is a lightweight neural network framework intended for fast iterative design, which limits the overall performance of the GPU and CPU. For a fair comparison, we also evaluate the training performance on TensorFlow[46], using multi-threaded acceleration for the CPU and cuDNN[47] acceleration for the GPU, as shown in brackets. The results show that DarkFPGA achieves about 10 times speedup over the CPU-based TensorFlow implementation and is about 2.5 times more energy efficient than the GPU implementation on TensorFlow.
8.4 Performance comparison with other FPGA-based training systems
Finally, we compare DarkFPGA with other FPGA-based training accelerators in terms of resource utilization, performance and throughput in Table 4. Since many important metrics are not reported and the models being trained differ, a fair comparison between different FPGA-based training accelerators is extremely difficult. Even so, our DarkFPGA accelerator shows desirable performance. Our design outperforms FPDeep[31] in terms of performance per FPGA. The accelerator from DiCecco et al.[33] exceeds ours in throughput, but it trains a small network similar to LeNet-5[1]. Moreover, in comparison with Ref. [35], a lightweight FPGA training accelerator for VGG-16[43], our DarkFPGA maintains a much higher throughput. The work in Ref. [34] achieves much higher throughput than our implementation, reportedly 4 times faster than a GPU, by pruning over 90% of the weights and 50% of the activations during the training process.
Table 4. Performance comparison of different FPGA-based training accelerators.

| Accelerator | Platform | Config | Model / dataset | LUTs (k) | DSPs | Performance (GOPS) | Throughput (images/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| F-CNN[29] FCCM 16 | Altera Stratix V | 8 FPGAs | LeNet-5 / MNIST | ? | ? | ? | 7 |
| FPDeep[31] FCCM 18 | Virtex7 VC709 | 10 FPGAs | AlexNet / ImageNet | ≈ 460 per FPGA | ≈ 2880 per FPGA | ≈ 1022 per FPGA | ? |
| DiCecco et al.[33] FPGA 18 | Xilinx XCKU115 | 1 FPGA | LeNet-like / CIFAR10 | ≈ 530 | ≈ 883 | ? | 522 |
| Nakahara et al.[34] FPL 19 | UltraScale+ XCVU9P | 1 FPGA | VGG16 / CIFAR10 | 934 | 1106 | ? | 4878 |
| Sean et al.[35] FPT 19 | Zynq ZCU111 | 1 FPGA | VGG-16 / CIFAR10 | 73.1 | 1037 | ? | 3.3 |
| DarkFPGA | UltraScale+ XCVU9P | 1 FPGA | VGG-like / CIFAR10 | 480 | 4202 | 1417 | 386.7 |

(1) '?' means the metric is not provided in the corresponding paper; '≈' indicates that the value is an approximate estimate. (2) The accelerator in Ref. [29] does not compute the gradients for training. (3) The power consumption of Ref. [29] is measured from the entire development board, while ours is measured from a single FPGA chip.
These comparisons show that DarkFPGA is capable of training deep neural networks effectively. Our analysis shows that the performance improvement comes from FPGA-based batch-level parallelism. In particular, the dataflow architecture allows us to fully exploit the advantages of batch-level parallelism and maximize throughput with the help of a dataflow programming language[19].
Algorithm 2: Pseudocode of tiled matrix multiplication

The weight and gradient matrices are transferred as $T_I \times T_I$ tiled blocks, while the input, output and error frames are transferred as 3-dimensional $T_B \times T_I \times T_I$ tiled blocks. In particular, the input and output frames of the fully-connected layer are transferred as 2-dimensional $T_B \times T_I$ tiled blocks.

Convolutional forward propagation:
for f = 1 to F/T_I do
    for im = 1 to (H × W)/T_I do
        for b = 1 to B/T_B do
            for c = 1 to (C × K)/T_I do
                A_{l+1}(b)(f, im)_c = W_l(f, c) × A_l(b)(c, im)
                A_{l+1}(b)(f, im) += A_{l+1}(b)(f, im)_c
            Quantize(A_{l+1}(b)(f, im))
            Output A_{l+1}(b)(f, im)

Convolutional backward propagation:
for c = 1 to (C × K)/T_I do
    for im = 1 to (H × W)/T_I do
        for b = 1 to B/T_B do
            for f = 1 to F/T_I do
                E_{l-1}(b)(c, im)_f = W_l(f, c) × E_l(b)(f, im)
                E_{l-1}(b)(c, im) += E_{l-1}(b)(c, im)_f
            Output E_{l-1}(b)(c, im)

Convolutional gradient generation:
for f = 1 to F/T_I do
    for c = 1 to (C × K)/T_I do
        for b = 1 to B/T_B do
            for im = 1 to (H × W)/T_I do
                G_l(b)(f, c)_im = E_{l+1}(b)(f, im) × A_l(b)(c, im)^T
                G_l(b)(f, c) += G_l(b)(f, c)_im
            G_l(f, c) += Σ_{b=1}^{T_B} G_l(b)(f, c)_im
        Output G_l(f, c)

Fully-connected forward propagation:
for f = 1 to F/T_I do
    for b = 1 to B/T_B do
        for c = 1 to C/T_I do
            A_{l+1}(b)(f)_c = A_l(b)(c) × W_l(f, c)^T
            A_{l+1}(b)(f) += A_{l+1}(b)(f)_c
        Output A_{l+1}(b)(f)
9. Conclusion
This work proposes DarkFPGA, a novel FPGA framework for efficiently training deep neural networks with a customized low-precision DNN training algorithm. The DarkFPGA accelerator exploits batch-level parallelism, providing efficient training acceleration for both forward and backward propagation on a homogeneous FPGA system. Optimization strategies such as the batch-oriented CHWB data pattern and tiling are employed to improve overall performance. Furthermore, an optimization tool is developed to determine the optimal design parameters for a specific network description. Future work includes applying DarkFPGA to multi-FPGA clusters, exploring mixed-precision and binarised training, and supporting cutting-edge network functions such as group normalization and depthwise convolution.