1. Introduction
Interconnect and logic resources are the two major parts of an FPGA. Logic blocks implement the user design, while interconnect resources provide the connections among logic blocks. In FPGAs, logic resources are typically organized as arrays of blocks, and interconnect resources are routing switches and wires grouped into channels. The FPGA interconnect (or interconnection network) can be thought of as a programmable network of signal paths among the ports of the FPGA logic resources.
Both measurements and analyses indicate that the programmable interconnect contributes the most to FPGA area, latency, and power consumption. To achieve high routability, FPGA vendors devote substantial on-chip area to programmable switches and wires[1], which is why the interconnect outweighs logic in both area and power budget. FPGA interconnects are also slow, because a signal must traverse a series of tracks connected by switches to travel between logic blocks. Optimizing the interconnect network is therefore critical, as it has a profound impact on FPGA performance.
Previous work by Trimberger et al.[2] showed that the utilization of FPGA logic resources can be improved by time-multiplexing, which inspired us to apply time-multiplexing to FPGA interconnection networks and led to the new architecture and routing algorithm introduced in this paper. We observe that most wires are used only for a short period within a clock cycle: the delay a signal spends on a wire is only a small fraction of the clock cycle. We can therefore better utilize the interconnect resources by time-multiplexing signals, which improves FPGA performance at a small cost in multiplexing circuitry complexity.
FPGAs with time-multiplexed interconnects require specialized electronic design automation (EDA) algorithms and tools, because existing algorithms and tools are not multiplexing-aware and cannot be reused in their present form. In this paper, we propose a time-multiplexing-aware routing algorithm. It is similar to the VPR routing algorithm[3], but extends it with time-multiplexing support. To enable this feature, we also design TM switches, which are described in an architecture file. We define an occupation bitmap for signals, a new technique that adds scheduling capability to the VPR algorithm. The algorithm is timing-driven because its multiplexing-aware congestion cost function contains a delay-sensitive term.
We validate the proposed routing algorithm experimentally. We evaluate TM-ARCH FPGAs by replacing the conventional VPR router with our TM-router in a standard FPGA EDA flow, and compare against the VPR 5 router[4] on the same set of benchmark circuits with a conventional architecture. Experimental results show that, on average, we need only 65% of the minimum channel width of conventional architectures and achieve a 3.8% smaller circuit critical path delay.
The rest of this paper is organized as follows. Section 2 introduces existing FPGA architectures and their EDA algorithms. Section 3 briefly introduces the TM-ARCH architecture. Section 4 presents our time-multiplexing-aware routing algorithm. Section 5 describes how we validate this algorithm and presents experimental results. Finally, Section 6 concludes this paper.
2. Related work
Previous work on time-multiplexed FPGAs exists[2]. Trimberger et al. propose both time-multiplexed interconnects and configurable logic blocks (CLBs) based on the Xilinx XC4000E FPGA family. To facilitate mapping user designs onto a time-multiplexed architecture, a scheduling algorithm based on list scheduling is presented[5]. Lin et al. also apply time-multiplexing to the Xilinx 4000 FPGA architecture[6], but they time-multiplex only the routing resources; even so, they achieve a 30% reduction in channel density. Francis et al.[7] apply time-multiplexed interconnects to an Altera FPGA, which is similar to the architecture proposed in this paper. Only interconnects are time-multiplexed in our, Lin's, and Francis' work, while both logic blocks and interconnects are time-multiplexed in Trimberger's architecture. In terms of the supporting EDA flow, Trimberger's work schedules time-multiplexing after technology mapping and before placement, whereas our work schedules it during routing and Francis' work does it after routing.
Shen et al. present a serial-equivalent static and dynamic parallel router[8]. This router provides a significant speedup in routing while achieving the same result as the serial router. They also try box expansion and grid partitioning to speed up parallel FPGA routing[9], and use parallel recursive partitioning to accelerate FPGA routing, achieving a 7.06× speedup on 32 processors[10]. They further propose a GPU-based parallel routing algorithm[11] that achieves a 1.57× speedup over the VPR router. Based on the PathFinder kernel, they develop Raparo, an angle-based region partitioning method[12]; results show that it provides a 16× speedup when scaling to 32 processor cores. They also exploit strictly-ordered partitioning to parallelize FPGA routing in Megrez, achieving a 15.13× speedup on GPU without affecting quality[13].
Vercruyce et al. propose CRoute[14], a connection-based timing-driven router that reduces total wire-length and increases the maximum clock frequency at the same time. Wang et al. propose a new approach[15] based on the PathFinder algorithm; they reroute only illegal paths in each routing iteration and reduce routing runtime by 68.5% on average. Patil et al. use a heuristic routing method for hybrid FPGA networks[16]; by scheduling data streams efficiently to increase bandwidth, they achieve an 11% stream bandwidth improvement on five benchmark circuits. Chaplygin et al. propose an FPGA routing block optimization for a given number of trace signals[17]. Farooq et al. apply time-multiplexing to multi-FPGA prototyping routing for more complicated designs[18]. Omam et al. use RRAM-based switches and decrease path delay by 56% compared with CMOS-based switches[19]. Chen et al. survey state-of-the-art research directions for FPGA placement and routing[20]. Huriaux et al. evaluate the routing of I/O signals of large applications through column interfaces in embedded FPGA fabrics[21]. Kashif et al. compare completely connected graphs and time-multiplexed nets on multi-FPGA systems[22]; results show that completely connected graphs achieve higher performance, while time-multiplexing provides a better cost/performance ratio.
3. Target FPGA architecture
Fig. 1 illustrates an island-style FPGA, which is the base of our target time-multiplexed FPGA architecture. The only difference between this island-style architecture and ours is that all wires in our design can be time-multiplexed. In conventional island-style FPGAs, vertical and horizontal routing channels, each containing multiple tracks, surround the logic blocks on all four sides. We define the channel width as the total number of tracks in a channel. Every track consists of multiple wire segments. A switch block is placed at every intersection of a vertical channel and a horizontal channel. Programmable switches are placed in switch blocks and connection blocks to implement configurable routing, and logic block pins are connected to the routing channels through connection blocks.
Figure 1. Island-style architecture, which is the base of TM-ARCH with the time-multiplexed interconnects.
3.1 Clock cycle and microcycle
In our new architecture, we divide a clock cycle into multiple microcycles. Different signals can occupy the same wire, if that wire is multiplexable, by using it in different microcycles of a clock cycle. We use a similar definition of microcycle and clock cycle as in Trimberger et al.'s work[2]. The clock cycle is constrained by the critical path delay. We define $M_{\rm c}$ as the number of microcycles in one clock cycle.
3.2 Time-multiplexed wires
A route of a net is usually made up of many wire segments. One wire segment is only occupied between the time the signal arrives and the time it leaves within one clock cycle. In TM-ARCH FPGAs, different signals can occupy the same wire segment at different times if that segment can be multiplexed, as Fig. 2(a) illustrates. In this example, one clock cycle of the TM-ARCH device consists of two microcycles.
Figure 2. (a) Signals …
3.3 TM switch
A time-multiplexing switch (TM switch) is the key to implementing time-multiplexing at the hardware level. Compared with a conventional FPGA switch, a TM switch has two additional features: it can latch data, and it is associated with multiple contexts.
3.3.1 Multiple contexts
In TM-ARCH, a TM switch can have at most $M_{\rm c}$ contexts, one for each microcycle of a clock cycle. The context selected in a given microcycle determines whether the switch is on or off during that microcycle.
3.3.2 Latching capability
A TM switch can latch the current logic value when it switches from the on state to the off state. Fig. 2 illustrates the necessity of latching data: in the first microcycle, the state of TM switches S…
Francis' architecture[7] also provides the latching capability, but it is provided by the wires, which differs from our approach. Our architecture does not need latches at the LUT inputs, because the TM switches themselves provide the latching function.
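To make these two features concrete, the following is a minimal behavioral sketch of a TM switch in Python; the class name, field names, and the two-microcycle example are our own illustration under the assumptions above, not the paper's implementation.

```python
class TMSwitch:
    """Behavioral model of a time-multiplexed (TM) routing switch.

    The switch holds one on/off context bit per microcycle and latches
    the last driven logic value when it turns off, as described in
    Sections 3.3.1 and 3.3.2.
    """

    def __init__(self, contexts):
        # contexts[j] is True if the switch is ON in microcycle j (0-based).
        self.contexts = list(contexts)
        self.latched_value = None  # value held while the switch is OFF

    def evaluate(self, microcycle, input_value):
        """Return the value seen downstream of the switch in `microcycle`."""
        if self.contexts[microcycle]:
            # Switch is ON: pass the input through and remember it,
            # so it can be held when the switch turns off.
            self.latched_value = input_value
            return input_value
        # Switch is OFF: hold the previously latched value.
        return self.latched_value


# Example with M_c = 2: ON in the first microcycle, OFF (latching) in the second.
sw = TMSwitch(contexts=[True, False])
print(sw.evaluate(0, 1))  # drives 1 in microcycle 0
print(sw.evaluate(1, 0))  # still outputs 1, latched from microcycle 0
```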
4. Time-multiplexing-aware timing-driven routing algorithm
4.1 Problem formulation
As in the VPR router, we formulate routing on a routing-resource graph, in which each routing-resource node has a capacity $C_{\rm ap}$. Fig. 3 shows the routing resource graph of the TM-ARCH architecture for $M_{\rm c} = 2$.
Figure 3. Routing resource graph of TM-ARCH architecture.
For a signal, the router must determine not only its route but also the microcycles in which it occupies each wire on that route. The router of TM-ARCH is designed to optimize the circuit delay while routing all the nets. Because wires in the TM-ARCH architecture are time-multiplexed, the router must identify qualified nets and schedule them onto the same time-multiplexed wire.
4.2 Occupation bitmap
Our algorithm identifies nets (or signals) that can be multiplexed onto a wire with other signals. We record the time at which a signal arrives at and leaves a wire, which gives the signal's occupation bitmap, showing in which microcycles the signal may occupy the wire.
4.2.1 Arrival and leave time
A maze router[23] is the core of our algorithm, as it is of the Pathfinder algorithm[24] and the VPR 5 router. The maze router routes a net by wave expansion from the source through the routing-resource graph until the wavefront reaches the sink of the net; it then back-traces and records the path from source to sink.
We compute two timing values for every routing-resource node $n$ on a path: the arrival time $t_{\rm arrival}$ and the leave time $t_{\rm leave}$:

$$ t_{\rm arrival}(n) = \sum\limits_{m\,\in\,source(i)\,\leadsto\, n} d_{m} + T_{\rm arrival}(source(i)), \tag{1} $$

$$ t_{\rm leave}(n) = t_{\rm arrival}(n) + d_{n}, \tag{2} $$
In Eq. (1), the first term on the right-hand side is the delay accumulated along the nodes $m$ between the source of net $i$ and node $n$ on the expanded path, and the second term $T_{\rm arrival}(source(i))$ is the time at which the signal arrives at the source of net $i$. In Eq. (2), $d_{n}$ is the delay of node $n$ itself.
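As an illustration of Eqs. (1) and (2), the following sketch accumulates node delays along a back-traced path; the `RRNode` structure and the delay values are hypothetical, chosen only to show the recurrence.

```python
from collections import namedtuple

RRNode = namedtuple("RRNode", ["name", "d"])  # routing-resource node with delay d

def arrival_and_leave_times(path, source_arrival_time):
    """Compute t_arrival and t_leave for each node on a routed path."""
    timings = {}
    accumulated = source_arrival_time
    for node in path:
        t_arrival = accumulated           # Eq. (1): source arrival + upstream delays
        t_leave = t_arrival + node.d      # Eq. (2): add the node's own delay
        timings[node.name] = (t_arrival, t_leave)
        accumulated = t_leave             # downstream nodes see the signal after d
    return timings

# Example: three wire segments with 0.1 ns delay each, source arrives at t = 0.
path = [RRNode("w1", 0.1), RRNode("w2", 0.1), RRNode("w3", 0.1)]
print(arrival_and_leave_times(path, 0.0))
```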
4.2.2 Occupation bitmap
We compute the occupation bitmap of a wire from the $t_{\rm arrival}$ and $t_{\rm leave}$ values of the signal on that wire. The bitmap is an array of $M_{\rm c}$ elements, one per microcycle. Each element is 0 or 1: "1" means that the net uses the wire in the corresponding microcycle, and "0" means that it does not.
We use the following function to calculate each element in the array,
$$ Bitmap[i] = \begin{cases} 0, & {\rm if}\ t_{\rm arrival} > i\,T_{\rm ucycle} + T_{\rm gb}, \\ 0, & {\rm else\ if}\ t_{\rm leave} < (i-1)\,T_{\rm ucycle} - T_{\rm gb}, \\ 1, & {\rm else}, \end{cases} \tag{3} $$

$$ T_{\rm ucycle} = T_{\rm crit}/M_{\rm c}. \tag{4} $$
We also use a guard band $T_{\rm gb}$ around the microcycle boundaries, so that a signal arriving or leaving close to a boundary is conservatively regarded as occupying both adjacent microcycles.
Fig. 4 gives the pseudo code for computing bitmaps. We first set all bitmap array elements to zero. Next, we compute the microcycle in which the signal arrives and assign it to a variable; the corresponding bitmap elements are then set to one.
Figure 4. Pseudo code for computing the occupation bitmaps.
We also record the bitmap array for every wire used by a net.
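A minimal Python sketch of the bitmap computation of Eqs. (3) and (4) and Fig. 4; the function and parameter names are ours, not the paper's.

```python
def occupation_bitmap(t_arrival, t_leave, t_crit, m_c, t_gb=0.0):
    """Return the M_c-element occupation bitmap of a wire for one signal.

    Microcycle i (1-based, as in Eq. (3)) spans
    [(i - 1) * T_ucycle, i * T_ucycle], widened by the guard band T_gb.
    """
    t_ucycle = t_crit / m_c                      # Eq. (4)
    bitmap = [0] * m_c
    for i in range(1, m_c + 1):
        if t_arrival > i * t_ucycle + t_gb:      # signal arrives after microcycle i
            continue
        if t_leave < (i - 1) * t_ucycle - t_gb:  # signal leaves before microcycle i
            continue
        bitmap[i - 1] = 1                        # otherwise the wire is occupied
    return bitmap

# Example: T_crit = 8, M_c = 4. A signal present on the wire from t = 1 to t = 3
# occupies only the first two microcycles.
print(occupation_bitmap(1.0, 3.0, 8.0, 4))  # [1, 1, 0, 0]
```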
4.3 Congestion penalties at microcycles
Because wires in our architecture can be multiplexed among nets, our algorithm records how many nets currently use a wire in every microcycle; we call this the micro occupancy. The micro occupancy indicates the degree of congestion of a wire in each microcycle. When the wavefront reaches a wire, the algorithm computes the wire's congestion penalty in each microcycle from its micro occupancy values: a larger micro occupancy results in a larger congestion penalty.
4.3.1 Micro occupancy
The micro occupancy of a wire consists of $M_{\rm c}$ elements, one per microcycle; element $uOcc[j]$ records the number of nets occupying the wire in microcycle $j$.
First, we set all micro occupancy elements to zero. When a wire is added to the route of a net, its micro occupancy is updated as
$$ uOcc[j] = \begin{cases} uOcc[j] + 1, & {\rm if}\ Bitmap[j] = 1, \\ uOcc[j], & {\rm else}, \end{cases} \tag{5} $$

where $j \in [1, M_{\rm c}]$.
When a net is ripped up and removed from a wire, the micro occupancy is updated as
$$ uOcc[j] = \begin{cases} uOcc[j] - 1, & {\rm if}\ Bitmap[j] = 1, \\ uOcc[j], & {\rm else}, \end{cases} \tag{6} $$

where $j \in [1, M_{\rm c}]$.
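A sketch of the micro occupancy updates of Eqs. (5) and (6), reusing the bitmap representation above; the function names are illustrative.

```python
def add_net_to_wire(u_occ, bitmap):
    """Eq. (5): increment the wire's micro occupancy where the net's bitmap is 1."""
    for j, bit in enumerate(bitmap):
        if bit == 1:
            u_occ[j] += 1

def remove_net_from_wire(u_occ, bitmap):
    """Eq. (6): decrement the micro occupancy when the net is ripped up."""
    for j, bit in enumerate(bitmap):
        if bit == 1:
            u_occ[j] -= 1

# Example with M_c = 4: two nets share one wire in disjoint microcycles.
u_occ = [0, 0, 0, 0]
add_net_to_wire(u_occ, [1, 1, 0, 0])
add_net_to_wire(u_occ, [0, 0, 1, 0])
print(u_occ)  # [1, 1, 1, 0] -- no microcycle is overused (cf. Eq. (11))
```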
4.3.2 Historical and present congestion penalty
With the records of micro occupancy, we can calculate the historical and present congestion penalties of a wire in each microcycle. The functions we use are shown as follows:
$$ p[j] = 1 + \max(0,\ p_{f_{\rm ac}}\,[uOcc[j] + 1 - C_{\rm ap}]), \tag{7} $$

$$ h[j]^{i} = \begin{cases} 1, & i = 1, \\ h[j]^{i-1} + \max(0,\ h_{f_{\rm ac}}\,[uOcc[j] - C_{\rm ap}]), & i > 1, \end{cases} \tag{8} $$
where $j \in [1, M_{\rm c}]$, $C_{\rm ap}$ is the capacity of the wire, and $i$ is the routing iteration index. The routing schedule in Table 1 gives the values of $p_{f_{\rm ac}}$ and $h_{f_{\rm ac}}$ used in each iteration.
Table 1. Routing schedule of $p_{f_{\rm ac}}$ and $h_{f_{\rm ac}}$.
Routing schedule | Value
$ p_{f_{\rm ac}} $ | 0.5 in the first and the second routing iterations; 1.3 times its previous value from the third iteration onwards.
$ h_{f_{\rm ac}} $ | 1.0 for all iterations.
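A sketch of the penalty updates of Eqs. (7) and (8) together with the Table 1 schedule; the standalone-function interfaces are our own illustration.

```python
def present_penalty(u_occ, c_ap, p_fac):
    """Eq. (7): present congestion penalty per microcycle."""
    return [1.0 + max(0.0, p_fac * (occ + 1 - c_ap)) for occ in u_occ]

def update_historical_penalty(h_prev, u_occ, c_ap, h_fac, iteration):
    """Eq. (8): historical penalty accumulated across routing iterations."""
    if iteration == 1:
        return [1.0] * len(u_occ)
    return [h + max(0.0, h_fac * (occ - c_ap)) for h, occ in zip(h_prev, u_occ)]

def schedule_p_fac(iteration, prev_p_fac=None):
    """Routing schedule of Table 1: 0.5 for iterations 1-2, then x1.3 per iteration."""
    if iteration <= 2:
        return 0.5
    return prev_p_fac * 1.3

# Example: a unit-capacity wire occupied by two nets in the first microcycle.
u_occ, c_ap = [2, 1], 1
print(present_penalty(u_occ, c_ap, schedule_p_fac(1)))             # [2.0, 1.5]
print(update_historical_penalty([1.0, 1.0], u_occ, c_ap, 1.0, 2))  # [2.0, 1.0]
```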
4.4 Multiplexing-aware congestion cost
The calculation of the multiplexing-aware congestion cost is an innovation of this algorithm. We do not consider a wire congested if it is used by two or more signals that occupy the wire at different times.
Fig. 5 shows the pseudo code for computing the congestion cost of a wire during wave expansion. At this point, the bitmap array and the latest historical and present congestion penalties for every microcycle have already been computed.
Figure 5. Pseudo code for computing the congestion cost.
Lines 1–4 find the largest historical congestion penalty. Lines 5–10 find the position of the first "1" in the bitmap and save it as the begin index, i.e., the microcycle in which the wire starts to be occupied. Lines 11–16 save the end index in a similar way, i.e., the microcycle in which the occupation ends. Lines 17–20 record the largest present congestion penalty between the begin and end microcycles. Line 21 computes the overall congestion cost by multiplying the largest historical penalty, the largest present penalty, and the base cost of the wire.
$$ cCost = p_{n} \times h_{n} \times b_{n}. \tag{9} $$
Eq. (9) gives the new congestion cost used in place of the VPR router's. We use the largest historical congestion penalty over all microcycles as $h_{n}$, the largest present congestion penalty between the begin and end microcycles as $p_{n}$, and the base cost of the wire as $b_{n}$.
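The steps of Fig. 5 and Eq. (9) can be sketched as follows; the argument layout (per-microcycle penalty lists and a base cost) is an assumed representation, not the paper's data structure.

```python
def congestion_cost(bitmap, p, h, base_cost):
    """Multiplexing-aware congestion cost of a wire (cf. Fig. 5 and Eq. (9)).

    `bitmap` is the candidate signal's occupation bitmap on this wire,
    `p` and `h` are the per-microcycle present and historical penalties.
    """
    h_n = max(h)                            # largest historical penalty
    occupied = [j for j, bit in enumerate(bitmap) if bit == 1]
    if not occupied:                        # signal never occupies this wire
        return h_n * base_cost
    begin, end = occupied[0], occupied[-1]  # first/last occupied microcycle
    p_n = max(p[begin:end + 1])             # largest present penalty in that window
    return p_n * h_n * base_cost            # Eq. (9)

# Example: the signal occupies only the lightly congested second microcycle,
# so the heavy congestion of the first microcycle does not penalize it.
print(congestion_cost([0, 1], p=[3.0, 1.0], h=[1.5, 1.0], base_cost=1.0))  # 1.5
```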
4.5 Overall cost function
The overall cost function is the sum of two factors, as in the VPR router. The first factor is the congestion-sensitive cost computed in Section 4.4. The second factor is the delay-sensitive cost, for which we use the wire's intrinsic delay; the two parts are weighted by the timing criticality. Eq. (10) shows the overall cost function of our algorithm.
$$ c(n) = Crit(i,j) \times d(n) + [1 - Crit(i,j)] \times cCost(n). \tag{10} $$
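As a small sketch of Eq. (10), the criticality value would come from the router's timing analysis of connection $(i, j)$; the function and argument names are ours.

```python
def total_cost(crit, delay, c_cost):
    """Eq. (10): criticality-weighted sum of delay cost and congestion cost."""
    return crit * delay + (1.0 - crit) * c_cost

# A timing-critical connection (crit close to 1) is driven mostly by delay,
# a non-critical one (crit close to 0) mostly by congestion.
print(total_cost(0.9, delay=0.8, c_cost=3.0))  # ~1.02
print(total_cost(0.1, delay=0.8, c_cost=3.0))  # ~2.78
```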
4.6 Legal routing solution
Our routing algorithm checks whether the current routing is legal after each iteration. If it is, the router stops and keeps this solution; if not, it starts a new iteration.
A valid routing solution must not contain any overused routing resources. Although a wire can be multiplexed among multiple nets, it is not overused as long as it is occupied by at most one net in every microcycle. Our router therefore checks whether Eq. (11) is met in every microcycle.
$$ uOcc[i] \leqslant 1, \quad i \in [1, M_{\rm c}]. \tag{11} $$
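A self-contained sketch of the legality check in Eq. (11); the nested-list representation of per-wire micro occupancies is an assumption for illustration.

```python
def routing_is_legal(u_occ_per_wire):
    """Check Eq. (11): every wire is used by at most one net in every microcycle."""
    return all(occ <= 1 for u_occ in u_occ_per_wire for occ in u_occ)

# Two wires, M_c = 2: the first is time-shared legally, the second is overused
# in its first microcycle and forces another routing iteration.
print(routing_is_legal([[1, 1], [1, 0]]))  # True
print(routing_is_legal([[2, 0], [1, 0]]))  # False
```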
4.7 Pseudo code
Fig. 6 shows the pseudo code of the time-multiplexing-aware routing algorithm. For each signal in a net, we first use the maze router to route the signal from its source to its sink over the routing-resource graph and record the expanded path. For each wire being expanded, we also record the arrival and leave times to calculate the occupation bitmap. We use the bitmap to compute and update the congestion penalties on the wire and then evaluate the wire's overall cost. We aim to decrease the criticality of the connection while doing the wave expansion. After a net has been completely routed, we update the occupied time and the Elmore delay of all its nodes. After all nets have been routed, we propagate timing through the circuit timing graph and calculate the critical path delay. If there are no more overused resources, or the maximum number of iterations has been reached, the program stops and returns the final result.
Figure 6. Pseudo code of the time-multiplexing-aware routing algorithm.
Lines 6 and 20 show that the algorithm computes the present congestion penalties of wires in each microcycle. Lines 14, 23, 27, and 32 show that the algorithm computes and updates the occupation bitmaps of wires. Lines 13, 22, and 26 are where the algorithm computes and updates the signal arrival and leave times. Line 15 is where the time-multiplexing-aware congestion cost of a wire is computed and its overall cost is evaluated. Line 33 computes the historical congestion penalties at each microcycle. Finally, line 4 checks whether any congested routing resources remain, as detailed in Section 4.6; in addition, our router checks whether Eq. (11) holds in every microcycle.
4.8 Algorithm analysis
In this part, we mainly focus on its unroutability detection, time complexity and memory requirement.
4.8.1 Unroutability detection
Like the Pathfinder algorithm, our algorithm declares a circuit unroutable on the given FPGA only after 50 iterations. Because this process takes a long time, we plan to enhance the algorithm with quicker unroutability detection.
4.8.2 Time complexity
The algorithm is iterative, and in practice the number of iterations is capped, so it suffices to analyze the complexity of a single iteration.
The pseudo-code in Fig. 6 shows that every iteration consists of two parts: netlist routing (lines 5–29) and post-processing (lines 30–34). Netlist routing runs the net routing algorithm for every net in the netlist; previous work has shown that the complexity of the per-net routing algorithm in Pathfinder includes a $\log k$ factor.
From Section 4.2, the time spent on each routing-resource node grows by a factor of $M_{\rm c}$, because the occupation bitmap and the micro occupancy, each with $M_{\rm c}$ elements, must be computed and updated. The netlist routing part of an iteration therefore scales linearly with $M_{\rm c}$.
Post-processing follows. Its main computation is obtaining $T_{\rm crit}$ by propagating timing through the circuit timing graph. Back-annotating the Elmore delay (line 30) takes time proportional to the number of routing-resource nodes used by the routed nets.
4.8.3 Memory requirement
In our algorithm, the FPGA routing-resource graph and the circuit timing graph require a substantial amount of main memory.
For every node in the routing-resource graph, we record its connectivity, physical, congestion, and timing information, as well as some data used for wave expansion.
For every vertex and every edge in the timing graph, we record timing and connectivity information. Typically, the routing-resource graph is much larger than the circuit timing graph and dominates the memory usage.
5. Algorithm validation
We validate our time-multiplexing-aware routing algorithm through experiments. Using this algorithm, we implement benchmark circuits on the TM-ARCH architecture; for comparison, we implement the same set of benchmark circuits on the conventional architecture. To make the comparison simple and fair, we use the 20 MCNC benchmark circuits with a standard EDA flow.
5.1 Experimental setup
5.1.1 EDA flow
Fig. 7 shows our EDA flow. First, LUTs are packed into clustered logic blocks using the T-VPack tool of the VPR 5 package. Next, the circuit is placed using VPR 5. Routing is then performed twice from the same placement result: the VPR 5 timing-driven router routes the circuits onto conventional island-style FPGAs, while our time-multiplexing-aware router targets TM-ARCH FPGAs with time-multiplexed interconnects. We call these two branches conventional routing and time-multiplexing routing, and from here on we refer to the VPR timing-driven router as VPR-ROUTER and to our time-multiplexing-aware router as TM-ROUTER.
Figure 7. (Color online) The TM-ARCH and TM-ROUTER evaluation framework.
For TM-ARCH, we sweep the number of microcycles per clock cycle, $M_{\rm c}$, over 2, 4, 6, and 8; the conventional architecture corresponds to $M_{\rm c} = 1$.
5.1.2 Assumptions of FPGA architecture
In this work, we only consider similar FPGA architectures. Even so, the FPGA architecture design space is still quite large, and we cannot study time-multiplexed interconnects exhaustively. To keep the study tractable, we first fix a representative baseline FPGA architecture, shown in Fig. 7 as "arch.xml", which uses conventional interconnects. We then replace all wires with our multiplexable wires to obtain the TM-ARCH FPGA architecture, shown in Fig. 7 as "tm_arch.xml".
We choose the XML file used in iFAR[26] as the baseline FPGA architecture. Table 2 shows main features of this FPGA architecture.
Table 2. Main features of the baseline FPGA architecture.
Feature parameter | Value/specification
LUT size | 4
Logic block size | 10
Logic block inputs | 22
Amount of bias between horizontal and vertical channels | No bias
Uniformity of routing channels in the same direction | Uniform
Aspect ratio | 1 : 1 (assuming square logic blocks)
Segmentation distribution | 100% length 4 wires
Switch types used | Uni-directional single driver switches
Switch block topology | Wilton
Switch block internal population | 100%
Connection block internal population | 100%
5.2 Results
5.2.1 Minimum channel width
In this part of the experiments, both the conventional routing flow and the time-multiplexing routing flow perform a binary search to find the minimum channel width for each circuit: VPR-ROUTER routes the circuits onto the FPGA with the conventional architecture, and TM-ROUTER does the same on the TM-ARCH FPGA. We set the maximum number of iterations to 50 for both routing algorithms.
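The binary search can be sketched as follows; `route_circuit` is a hypothetical stand-in for invoking VPR-ROUTER or TM-ROUTER at a given channel width and reporting whether routing succeeded, not an actual tool interface.

```python
def minimum_channel_width(route_circuit, w_low=2, w_high=256):
    """Binary-search the smallest channel width at which routing succeeds.

    `route_circuit(w)` should return True if the circuit routes legally
    within the iteration limit (50 here) at channel width w.
    """
    best = None
    while w_low <= w_high:
        w = (w_low + w_high) // 2
        w -= w % 2                  # keep the channel width even, as in Table 3
        w = max(w, 2)
        if route_circuit(w):
            best = w                # feasible: try a narrower channel
            w_high = w - 2
        else:
            w_low = w + 2           # infeasible: widen the channel
    return best

# Toy example: pretend the circuit routes whenever the channel has >= 38 tracks.
print(minimum_channel_width(lambda w: w >= 38))  # 38
```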
Table 3 shows the minimum channel width results. The numbers listed under "$M_{\rm c}=1$, $W_{\rm min}$" are the minimum channel widths found by VPR-ROUTER on the conventional architecture, and the numbers listed under "$W'_{\rm min}$" are the minimum channel widths found by TM-ROUTER on TM-ARCH for $M_{\rm c} = 2$, 4, 6, and 8.
Table 3. Minimum channel width for different $M_{\rm c}$.
Circuit | $M_{\rm c}=1$, $W_{\rm min}$ | $M_{\rm c}=2$, $W'_{\rm min}$ | $M_{\rm c}=4$, $W'_{\rm min}$ | $M_{\rm c}=6$, $W'_{\rm min}$ | $M_{\rm c}=8$, $W'_{\rm min}$
Alu4 | 48 | 48 | 30 | 36 | 26
Apex2 | 62 | 62 | 36 | 34 | 22
Apex4 | 64 | 64 | 50 | 36 | 28
Bigkey | 44 | 44 | 32 | 32 | 30
Clma | 78 | 76 | 74 | 48 | 34
Des | 44 | 42 | 34 | 32 | 32
Diffeq | 38 | 36 | 34 | 32 | 30
Dsip | 38 | 36 | 30 | 30 | 30
Elliptic | 62 | 58 | 52 | N.A. | 34
Ex1010 | 74 | 74 | 60 | 40 | 34
Ex5p | 68 | 68 | 48 | 34 | 22
Frisc | 74 | 74 | 44 | 40 | 40
Misex3 | 54 | 54 | 42 | 26 | 22
Pdc | 90 | 90 | 128 | 60 | 70
S298 | 34 | 34 | 28 | 28 | 20
S38417 | 48 | 48 | 50 | 26 | 22
S38584.1 | 50 | 50 | 48 | 26 | 22
Seq | 60 | 60 | 48 | 26 | 22
Spla | 74 | 72 | N.A. | 52 | 34
Tseng | 46 | 46 | 40 | 34 | 32
Geo.Mean | 56 | 55 | 44 | 34 | 29
Reduction | – | –1.78% | –21.43% | –39.28% | –48.21%
We believe that having only two microcycles in a clock cycle gives TM-ROUTER little opportunity to time-multiplex wires. Table 4 lists, under "$M_{\rm c}=2$", the percentage of used wires that are occupied in each individual microcycle when the circuits are routed at the minimum channel width $W'_{\rm min}$.
Table 4. Percentages (%) of wires used in each individual microcycle for the 20 benchmark circuits with $M_{\rm c} = 2$ and 4.
Circuit | $M_{\rm c}=2$, 1st | $M_{\rm c}=2$, 2nd | $M_{\rm c}=4$, 1st | $M_{\rm c}=4$, 2nd | $M_{\rm c}=4$, 3rd | $M_{\rm c}=4$, 4th
Alu4 | 94.59 | 4.67 | 56.04 | 29.64 | 4.19 | 0.45
Apex2 | 97.09 | 2.42 | 67.82 | 25.32 | 2.17 | 0.21
Apex4 | 87.35 | 10.56 | 28.13 | 56.96 | 9.45 | 1.00
Bigkey | 93.65 | 4.77 | 61.57 | 28.13 | 4.77 | 0.00
Clma | 96.50 | 2.97 | 68.86 | 25.69 | 2.83 | 0.12
Des | 90.08 | 9.15 | 63.16 | 24.61 | 6.50 | 1.96
Diffeq | 95.97 | 3.80 | 74.31 | 18.47 | 3.71 | 0.10
Dsip | 94.70 | 4.28 | 59.70 | 31.89 | 4.04 | 0.18
Elliptic | 95.35 | 4.46 | 87.51 | 7.36 | 3.65 | 0.73
Ex1010 | 90.04 | 8.67 | 23.65 | 64.19 | 7.98 | 0.51
Ex5p | 82.63 | 14.85 | 21.70 | 56.98 | 12.71 | 1.99
Frisc | 86.87 | 12.06 | 68.23 | 18.50 | 10.58 | 1.41
Misex3 | 93.09 | 5.60 | 52.19 | 33.60 | 4.64 | 0.80
Pdc | 95.09 | 4.36 | 46.93 | 42.92 | 3.88 | 0.42
S298 | 85.82 | 13.48 | 63.21 | 21.00 | 9.76 | 3.16
S38417 | 92.87 | 6.60 | 65.09 | 24.77 | 5.83 | 0.67
S38584.1 | 97.41 | 1.88 | 73.30 | 19.91 | 1.59 | 0.26
Seq | 97.50 | 2.20 | 69.24 | 24.66 | 1.99 | 0.17
Spla | 95.98 | 3.60 | 49.20 | 41.06 | 3.31 | 0.25
Tseng | 96.48 | 3.34 | 88.49 | 6.82 | 2.35 | 0.73
Geo.Mean | 92.85 | 5.17 | 55.83 | 26.19 | 4.49 | 0.51
Table 4 shows that, with $M_{\rm c}=2$, 92.85% of the used wires are occupied in the first microcycle, while only 5.17% are occupied in the second microcycle. This means that TM-ROUTER has very limited opportunities for time-multiplexing in this condition: to time-multiplex a wire, our routing algorithm must pair a net that uses the wire in the first microcycle with another net that uses it in the second microcycle, and Table 4 shows that used wire segments are mostly occupied in the first microcycle and only rarely in the second. Recall from Section 4 that our routing algorithm does not actively delay signals, whereas the scheduling algorithm of Francis et al. does so to achieve more time-multiplexing[7]. When a clock cycle is divided into four microcycles, the "$M_{\rm c}=4$" columns of Table 4 show that wire usage is spread more evenly over the microcycles, which gives TM-ROUTER more opportunities for time-multiplexing than $M_{\rm c}=2$.
TM-ARCH generally needs a smaller channel width as $M_{\rm c}$ increases: compared with the conventional architecture ($M_{\rm c}=1$), the geometric mean minimum channel width is reduced by 1.78% for $M_{\rm c}=2$, 21.43% for $M_{\rm c}=4$, 39.28% for $M_{\rm c}=6$, and 48.21% for $M_{\rm c}=8$.
5.2.2 Circuit critical path delay
In real applications, FPGA routing resources are often not heavily utilized, in order to reduce latency. We therefore route with a channel width 20% larger than the minimum channel widths in Table 3. In this section, we also assume that a TM switch has the same delay as a conventional switch.
Table 5 lists the critical path delays with low-stress routing for the 20 benchmark circuits. "$W_{\rm ls}$" and "$T_{\rm crit}$" are the low-stress channel width and the critical path delay obtained by VPR-ROUTER on the conventional architecture ($M_{\rm c}=1$), while "$W'_{\rm ls}$" and "$T'_{\rm crit}$" are the corresponding values obtained by TM-ROUTER on TM-ARCH for $M_{\rm c} = 2$, 4, 6, and 8.
Table 5. Low-stress channel width and circuit critical path delay for different $M_{\rm c}$.
Circuit | $M_{\rm c}=1$, $W_{\rm ls}$ | $M_{\rm c}=1$, $T_{\rm crit}$ | $M_{\rm c}=2$, $W'_{\rm ls}$ | $M_{\rm c}=2$, $T'_{\rm crit}$ | $M_{\rm c}=4$, $W'_{\rm ls}$ | $M_{\rm c}=4$, $T'_{\rm crit}$ | $M_{\rm c}=6$, $W'_{\rm ls}$ | $M_{\rm c}=6$, $T'_{\rm crit}$ | $M_{\rm c}=8$, $W'_{\rm ls}$ | $M_{\rm c}=8$, $T'_{\rm crit}$
Alu4 | 58 | 3.31 | 58 | 3.31 | 36 | 3.31 | 44 | 3.23 | 32 | 3.45
Apex2 | 74 | 3.90 | 74 | 3.90 | 46 | 3.69 | 44 | 3.69 | 26 | 3.65
Apex4 | 76 | 3.80 | 76 | 3.80 | 58 | 3.13 | 40 | 3.20 | 32 | 3.20
Bigkey | 52 | 1.80 | 52 | 1.87 | 38 | 1.80 | 38 | 1.80 | 38 | 1.80
Clma | 94 | 6.74 | 94 | 6.74 | 88 | 6.63 | 58 | 6.70 | 44 | 6.70
Des | 52 | 2.86 | 48 | N.A. | 40 | 2.78 | 38 | 2.85 | 38 | 2.85
Diffeq | 46 | 4.44 | 44 | 4.51 | 40 | 4.37 | 38 | 4.44 | 36 | 4.37
Dsip | 46 | 1.73 | 44 | 1.73 | 36 | 1.80 | 36 | 1.80 | 36 | 1.80
Elliptic | 74 | 6.22 | 70 | 5.52 | 62 | 5.66 | N.A. | N.A. | 40 | 5.37
Ex1010 | 88 | 4.42 | 88 | 4.42 | 72 | 4.49 | 48 | N.A. | 40 | 4.42
Ex5p | 82 | 3.55 | 82 | 3.55 | 58 | 3.34 | 40 | 3.23 | 26 | 3.37
Frisc | 88 | 7.63 | 88 | 7.63 | 52 | 7.42 | 48 | 7.42 | 48 | 7.49
Misex3 | 64 | 3.13 | 64 | 3.13 | 50 | 3.06 | 32 | 3.20 | 26 | 3.27
Pdc | 108 | 5.26 | 108 | 5.26 | 154 | 4.49 | 72 | 4.49 | 84 | 4.49
S298 | 40 | N.A. | 40 | N.A. | 34 | 6.14 | 34 | 6.17 | 24 | 6.07
S38417 | 58 | 4.68 | 58 | 4.47 | 60 | 4.47 | 32 | 4.61 | 26 | 4.40
S38584.1 | 60 | 3.71 | 60 | 3.71 | 58 | N.A. | 32 | 3.64 | 26 | 3.65
Seq | 72 | 3.13 | 72 | 3.13 | 58 | 3.06 | 32 | 3.20 | 26 | 3.06
Spla | 88 | 4.46 | 88 | 4.46 | N.A. | N.A. | 62 | 4.14 | 40 | 4.07
Tseng | 56 | 4.43 | 52 | 4.43 | 48 | 4.43 | 40 | 4.43 | 38 | 4.43
Geo.Mean | 66 | 3.90 | 65 | 3.95 | 53 | 3.83 | 41 | 3.75 | 35 | 3.85
Reduction | N.A. | N.A. | –1.51% | 1.28% | –19.69% | –1.79% | –37.87% | –3.84% | –46.97% | –1.28%
In conventional FPGAs, the circuit critical path may not be routed along the shortest path because of limited routing resources, even when low-stress routing is performed. Unlike conventional FPGAs, the TM-ARCH architecture allows multiple signals to multiplex a wire in different microcycles, which effectively mitigates routing congestion. As a result, the circuit critical path is more likely to be routed along a shorter path.
6. Conclusion
In this paper we propose a time-multiplexed FPGA interconnect architecture and its routing algorithm. Our algorithm actively identifies the qualified interconnects that can be multiplexed on the new FPGA architecture. We have validated the architecture and algorithm by implementing the algorithm as a multiplexing-aware router and using this router to map benchmark circuits onto FPGAs with time-multiplexed interconnects. The results show that, compared with an existing router targeting a conventional island-style architecture, a 38% smaller minimum channel width and a 3.8% smaller circuit critical path delay can be achieved.