High speed true random number generator with a new structure of coarse-tuning PDL in FPGA

删除或更新信息，请邮件至freekaoyan#163.com(#换成@)

本站小编 Free考研考试/2022-01-01

1.
Introduction

True random number generators (TRNGs) have become an essential block of modern cryptographic systems for various applications. They are used to generate cryptographic keys^[1], initialization vectors, nonces and implement countermeasures against side-channel attacks (SCA)^[2]. TRNGs extract randomness from non-deterministic random sources such as thermal noise, phase noise and chaos algorithm^[3] in order to assure unpredictability and good statistical quality^[4]. However, analog components, high-gain amplifier and ADC for example, are always employed in traditional TRNGs (as described in Ref. [5]) to amplify thermal noise and digitize it into discrete random sequences^[6]. Such approaches are rather area-costing, have high power-consumption and flexibility is limited on implementation.

Due to the digital nature of metastability, TRNGs can further simplify the design with standard libraries, be cost-saving and well-suited to FPGA design flow^{[7, 8]}. So metastability-based TRNGs can easily be utilized in digital devices. There are several proposed metastability-based TRNGs implemented in FPGA. Piotr^[9] proposes a novel method that instead of a deep-meta state, harvests resolve time through a nearly-meta state. The robustness of its system is improved with the local and global feedbacks, but the complexity of its system design is increased too. It offers 12.6 Mps throughput after post-processing in Xilinx XC6SLX16 and passes all NIST tests. Hata and Ichikawa^[10] proposes a solution that can be integrated in FPGA with LUT (look up table) based RS flip-flop. The biggest advantage of this approach is that no precise adjustment is required. But manual placement is very necessary in its circuit design, or the quality and the throughput of its system might be affected. Meanwhile, its robustness is vulnerable since the system is very sensitive to supply voltage and temperature fluctuation. Its final throughput is 12.5 Mps with 256 latches utilized by the core in Xilinx XC4VFX20. Majzoobi et al.^[11] proposes another novel metastability-based TRNG implemented in Xilinx Virtex5. A pair of high resolution programmable delay lines (PDLs) is employed to adjust the propagation difference Δ between the input signals of DFF precisely and flexibly. The system is PVT (process, voltage, temperature) tolerated by the proportional-integral controlling of Δ. However, the results of NIST show the pass rates of Non-Overlapping and Linear Complexity are only 90%, which might not meet the requirement in some applications. The 2.5 Mps throughput seems not enough when compared with other previous TRNGs with the same mechanism as well.

A metastability-based TRNG is presented in this paper. The designed TRNG intends to be cost-saving, operating stable and high throughput in FPGA with a simple circuit structure. The rest of this paper is organized as follows: Section 2 introduces the concept of metastability of the bi-state circuit; Section 3 presents the TRNG system design and the operation principle; the experiment results are discussed in Section 4; and Section 5 is the final conclusion.

2.
Metastability of bi-state circuit

For a bi-state circuit element, the D flip-flop for example, the output cannot resolve to a certain state when the input signal and the sampling edge of the clock arrive almost simultaneously, i.e. setup/hold time violated^[12]. This state is called the metastable state. In the presence of thermal noise, the circuit resolves to stable state and toggles in either 0 or 1 randomly. Ideally, as a random case, the probability of ‘0’ and ‘1’ of the circuit output should both be 50%. However, non-random factors, such as voltage supply noise in the circuit or temperature fluctuation, might affect one polarity of the output, causing the generated sequences to have less randomness. Fig. 1 illustrates the output waveform of a D flip-flop when the metastable state happens. It is seen that the metastable state takes place if the propagation delay difference Δ between input data and clock is small enough, and the range of this delay difference is also called the metastable window.

onerror="this.onerror=null;this.src='http://www.jos.ac.cn/fileBDTXB/journal/article/jos/2018/3/PIC/17070017-1.jpg'"
class="figure_img" id="Figure1"/>

Download

Larger image

PowerPoint slide

Figure1.
Metastability of a D flip-flop.

3.
TRNG system design

The system design and the operating principle of the designed TRNG are described in this section. Fig. 2 depicts the framework of the system design. The black triangles are the proposed coarse-tuning PDL stages, which improve the automation level of the system and make the system cost-saving at a higher frequency in FPGA. The white triangles are the fine-tuning PDL stages, which can offer a very precise resolution in adjusting the value of Δ. As described in Section 2, the output of a bi-state circuit in the metastable state can easily be biased for various non-random factors. Here a feedback control system is adopted, which can monitor the bias of the output raw sequences. The value of the binary counter is increased or decreased based on the bias information of the raw sequences. The counter value is then transformed into control bits through the thermometer-decode to reconfigure the PDL.

onerror="this.onerror=null;this.src='http://www.jos.ac.cn/fileBDTXB/journal/article/jos/2018/3/PIC/17070017-2.jpg'"
class="figure_img" id="Figure2"/>

Download

Larger image

PowerPoint slide

Figure2.
Framework of the designed TRNG.

3.1
Programmable delay line (PDL)

The propagation characteristic inside a look up table (LUT) as described in Ref. [11] is adopted to realize the programmable delay line. Altera Stratix IV FPGAs provide 6-input LUT. Fig. 3(a) shows the structure of a 6-input LUT. In this paper, the output of an LUT is configured as an inverter with A1 as its input. The rest [A₆ : A₂] act as address terminals. So there are 2⁵ = 32 tuning levels of propagation delay with five address bits. By configuring the address terminals with different values, one single LUT can generate delay differences with picosecond resolution. All 32 tuning levels of propagation delay are tested in Altera Stratix IV, and the comparisons between 32 propagation delay and the delay caused by [A₆ : A₂] = 00000 are shown in Fig. 3(b). The address is expressed in decimals. It is seen that the delay differences are nonlinear with the increment of address. But if only the most significant bit (MSB) is different between two address, [A₆ : A₂] = 00000 and [A₆ : A₂] = 10000 for example, the delay difference is always less than 1 ps. With this characteristic, the PDL can force the bi-state circuit into the metastable state precisely. In our design, only one delay difference between two fixed addresses, i.e., [A₆ : A₂] = 00000 and [A₆ : A₂] = 11111, is adopted to make sure that the tuning operation is linear.

onerror="this.onerror=null;this.src='http://www.jos.ac.cn/fileBDTXB/journal/article/jos/2018/3/PIC/17070017-3.jpg'"
class="figure_img" id="Figure3"/>

Download

Larger image

PowerPoint slide

Figure3.
(a) Structure of a 6-input LUT. (b) Delay difference of 32 tuning levels compared with [A₆ : A₂] = 00000.

3.2
Coarse-tuning PDL and Fine-tuning PDL

Eq. (1)^[11] expresses the total delay difference between the input signals of DFF:

$$varDelta = {varDelta _{
m b}} + {varDelta _{
m p}} - {varDelta _{
m f}},$$

(1)

where Δ_b is the constant delay caused by routing asymmetries when the system is implemented in FPGA, and Δ_p is caused by the change of operation condition and adversarial attack. The calibration feedback delay difference generated by reconfiguring the PDL is marked as Δ_f. The resolution of one coarse PDL unit is Δ_cs and Δ_fn is the resolution of one fine PDL unit. Marking T as the tuning level of the PDL, then Δ_f can be expressed by Eq. (2):

$${varDelta _{
m f}} = {T_{
m cs}}{varDelta _{
m cs}} + {T_{
m fn}}{varDelta _{
m fn}}.$$

(2)

In order to adjust the delay difference Δ flexibly and improve the automation level of the system, a coarse-tuning PDL structure is proposed. As shown in Fig. 4(a), the LUT merely acts as a buffer and the 2-1 MUX decides the propagation path. The test result shows the delay difference is about 250 ps caused by one signal coarse PDL unit in Altera Stratix IV. Since the DFF is designed to sample the clock itself, this structure is aimed at narrowing the value of Δ and making it approach the metastable window in one clock cycle regardless of the delay situations caused by any non-ideal factors. So the tuning range of the coarse delay line needs to be wider than the period of the clock, i.e, MΔ_cs ≥ T_clk, where M is the total number of coarse PDL stages. In Ref. [11], two PDLs are employed and operate symmetrically to tune the state of the TRNG core. But placement and routing are necessary to ensure its symmetry, otherwise the tuning granularity of its PDL is too fine to offset the delay caused by compilation and synthesis in FPGA. With the proposed coarse-tuning PDL structure, the designed system does not require extra placement and routing. Moreover, the number of M can be decreased at a higher operating frequency, and thus saves more resources in FPGA. For example, at least 240 coarse PDL stages are needed for 25 MHz but only 60 are needed for 100 MHz. As mentioned above, the fine PDL is designed to utilize the delay difference between address [A₆ : A₂] = 00000 and [A₆ : A₂] = 11111 as its tuning granularity of 4.1 ps as shown in Fig. 3(b).

onerror="this.onerror=null;this.src='http://www.jos.ac.cn/fileBDTXB/journal/article/jos/2018/3/PIC/17070017-4.jpg'"
class="figure_img" id="Figure4"/>

Download

Larger image

PowerPoint slide

Figure4.
(a) Coarse-tuning PDL unit. (b) Fine-tuning PDL unit.

3.3
Feedback system operation principle

The feedback system is composed of the binary counter, the sampling module and the Finite State Machine (FSM). For the binary counter, high m bits are translated through thermometer-decode to control the coarse delay line and low n bits are translated to control the fine delay line. The counter value is initialized at the beginning, so there is a propagation delay difference Δ_i generated by the initial configuration of the PDL. So the raw sequences are biased. The sampling module samples the biased sequences with a sampling window and calculates the proportion of ‘1’s of the raw sequences it samples. The FSM increases or decreases the value of the counter in different tuning levels depending on the proportion of ‘1’s, as shown in Table 1. Then the counter value is decoded into control bits to reconfigure the PDL. Because the low n bits change when the system becomes stable and the counter value stays around a constant value with small fluctuation, only the fine PDL is tuned.

Proportion of ‘1’s	Tuning level of counter value
≥80%	+3
52%–80%	+1
48%–52%	+0
20%–48%	?1
≤20%	?3

Table1.
Tuning levels of the counter value with different proportions of ‘1’s.

Table options
-->

Download as CSV

Proportion of ‘1’s	Tuning level of counter value
≥80%	+3
52%–80%	+1
48%–52%	+0
20%–48%	?1
≤20%	?3

4.
Experiment results and analysis

The designed TRNG is implemented in Altera Stratix IV as shown in Fig. 5, and operates @ 100, 50, and 25 MHz. The number of the coarse PDL stages depends on the operating frequency as described in Section 3.2, but the number of the fine PDL stages is fixed at 50. During the experiment, the TRNG core consumes 170 LUTs, 290 LUTs and 530 LUTs @ 100, 50, and 25 MHz respectively. Notice that the configuration for higher frequency is not suitable for lower frequency, since the period of the latter is wider and needs more coarse PDL stages. In order to decide on the proper sampling window for the sampling module, the bias situations of 250 Mb raw sequences with 50, 100, 200, 400, 800, and 1600, six different sampling windows are tested and shown in Fig. 6(a). It is observed from the histogram that there are no obvious differences among the six bias situations. The feedback system works well in offsetting the bias caused by fluctuation parameters during operation, and the proportion of ‘1’s of the raw sequences basically maintains at around 0.5. Considering that the feedback time might be longer with a larger sampling window, a 200 sampling window is chosen as the operating parameter. Fig. 6(b) shows the variation of the counter value @ 100 MHz. The counter value stays around a constant value after certain clock cycles with only 1 or 2 low bits changing, which means the system works stably around a metastable window. The raw output waveform @ 100 MHz is shown in Fig. 7. In order to observe the waveform clearly, the output is divided into four for each of the four channels.

onerror="this.onerror=null;this.src='http://www.jos.ac.cn/fileBDTXB/journal/article/jos/2018/3/PIC/17070017-5.jpg'"
class="figure_img" id="Figure5"/>

Download

Larger image

PowerPoint slide

Figure5.
Operating platform of the designed TRNG.

onerror="this.onerror=null;this.src='http://www.jos.ac.cn/fileBDTXB/journal/article/jos/2018/3/PIC/17070017-6.jpg'"
class="figure_img" id="Figure6"/>

Download

Larger image

PowerPoint slide

Figure6.
(a) Probability of ‘1’s with different sampling windows. (b) The variation of counter value after certain clock cycles @ 100 MHz.

onerror="this.onerror=null;this.src='http://www.jos.ac.cn/fileBDTXB/journal/article/jos/2018/3/PIC/17070017-7.jpg'"
class="figure_img" id="Figure7"/>

Download

Larger image

PowerPoint slide

Figure7.
Raw output waveform of the designed TRNG @ 100 MHz.

Despite the offsetting of the feedback mechanism, the raw sequences still suffer from problems such as autocorrelation and non-uniform distribution that might affect the randomness of the raw sequences. So proper post-processing is very necessary. An efficiency post-processing method is proposed in Ref. [13]. As shown in Fig. 8, the 4-stage XOR line can adjust the independence and uniform distribution of the output raw sequences. Then XOR with a periodic sequence, which is a uniform distribution and uncorrelated, can generate random sequences that are also uniformly distributed and uncorrelated^[14]. An m sequence generated by the LFSR (linear feedback shift register) can be the ideal periodic sequence described above. This method can balance both the throughput and the randomness of the output sequences. The throughput of the designed TRNG drops four times after post-processing. Fig. 9 depicts the post-processing output waveform @ 100 MHz, the input clock is also divided four times in order to be observed clearly.

onerror="this.onerror=null;this.src='http://www.jos.ac.cn/fileBDTXB/journal/article/jos/2018/3/PIC/17070017-8.jpg'"
class="figure_img" id="Figure8"/>

Download

Larger image

PowerPoint slide

Figure8.
Post-processing circuit with 4-stage XOR line and m sequence.

onerror="this.onerror=null;this.src='http://www.jos.ac.cn/fileBDTXB/journal/article/jos/2018/3/PIC/17070017-9.jpg'"
class="figure_img" id="Figure9"/>

Download

Larger image

PowerPoint slide

Figure9.
Post-processing output waveform of the designed TRNG @ 25 MHz divided operating frequency.

The post-processing sequences are tested by NIST and the final reports are shown in Table 2. The statistic quality of the post-processing sequences generated by the designed TRNG is verified by NIST. AIS-20/31 divides TRNGs into two classes: class P1 if the post-processing sequences pass test T0–T5, and class P2 if the raw sequences pass test T6–T8. The raw sequences do not pass T6 and T8 with the non-uniform distribution and weak entropy source inTable 3. It is because the randomness of the entire system only comes from the intrinsic thermal noise in the circuit, and no analog amplifier is employed to amplify the magnitude of the thermal noise. The comparison of the designed TRNG with prior works is shown in Table 4. The throughput @ 100 MHz is the highest among the works which are at the same or even higher operating frequency. The TRNG core might not be the most cost-saving among Table 4, but it is acceptable given the trade-off between throughput and resource consumption. For example, the TRNG core in Ref. [11] consumes 128 LUTs, which are 42 less than the designed TRNG core in this paper; however, it needs 10 cores (1280 LUTs + 10 DFFs) to operate in parallel to achieve the same throughput as the designed TRNG does.

NIST ↓ TRNG →	100 MHz (170 LUTs)		50 MHz (290 LUTs)		25 MHz (530 LUTs)
TEST SUITE	p-value	Proportion	p-value	Proportion	p-value	Proportion
Frequency	0.1045	0.9733	0.3986	0.9904	0.8916	0.9904
Block frequency	0.0451	0.9809	0.8301	0.9952	0.0451	0.9809
Cumulative sums	0.2881	0.9833	0.2715	0.9880	0.4685	0.9952
Runs	0.1094	0.9667	0.6519	1	0.2474	0.9952
Longest run of ones	0.1257	0.9857	0.2856	0.9857	0.4777	0.9952
Rank	0.9372	0.9867	0.7541	1	0.2053	0.9733
Discrete Fourier transform	0.3661	1	0.2992	1	0.0896	0.9809
Nonperiodic template matchings	0.4840	0.9900	0.4992	0.9890	0.5017	0.9890
Overlapping template matchings	0.2597	0.9904	0.7887	0.9583	0.8165	0.9602
Universal statistical	0.3621	0.9733	0.4915	1	0.3621	0.9733
Approximate entropy	0.0649	0.9952	0.2597	0.9809	0.8216	0.9857
Random excursions	0.3922	0.9950	0.1798	0.9957	0.2806	0.9950
Random excursions variant	0.2848	1	0.1267	0.9981	0.2378	0.9978
Serial	0.3539	0.9928	0.6321	0.9880	0.4353	0.9833
Linear complexity	0.8773	0.9714	0.0575	0.9761	0.7773	0.9809

Table2.
Test results of NIST for post-processing sequences of the three employed frequencies.

Table options
-->

Download as CSV

NIST ↓ TRNG →	100 MHz (170 LUTs)		50 MHz (290 LUTs)		25 MHz (530 LUTs)
TEST SUITE	p-value	Proportion	p-value	Proportion	p-value	Proportion
Frequency	0.1045	0.9733	0.3986	0.9904	0.8916	0.9904
Block frequency	0.0451	0.9809	0.8301	0.9952	0.0451	0.9809
Cumulative sums	0.2881	0.9833	0.2715	0.9880	0.4685	0.9952
Runs	0.1094	0.9667	0.6519	1	0.2474	0.9952
Longest run of ones	0.1257	0.9857	0.2856	0.9857	0.4777	0.9952
Rank	0.9372	0.9867	0.7541	1	0.2053	0.9733
Discrete Fourier transform	0.3661	1	0.2992	1	0.0896	0.9809
Nonperiodic template matchings	0.4840	0.9900	0.4992	0.9890	0.5017	0.9890
Overlapping template matchings	0.2597	0.9904	0.7887	0.9583	0.8165	0.9602
Universal statistical	0.3621	0.9733	0.4915	1	0.3621	0.9733
Approximate entropy	0.0649	0.9952	0.2597	0.9809	0.8216	0.9857
Random excursions	0.3922	0.9950	0.1798	0.9957	0.2806	0.9950
Random excursions variant	0.2848	1	0.1267	0.9981	0.2378	0.9978
Serial	0.3539	0.9928	0.6321	0.9880	0.4353	0.9833
Linear complexity	0.8773	0.9714	0.0575	0.9761	0.7773	0.9809

Parameter	100 MHz (170 LUTs)	50 MHz (290 LUTs)	25 MHz (530 LUTs)
T0-Disjointness	Pass	Pass	Pass
T1-Monobit	Pass	Pass	Pass
T2-Poker	Pass	Pass	Pass
T3-Runs	Pass	Pass	Pass
T4-LongRuns	Pass	Pass	Pass
T5-Autocorrelation	Pass	Pass	Pass
T6-UniformA	Fail	Fail	Fail
T6-UniformB	Fail	Fail	Fail
T7-HomogeneityA	Pass	Pass	Pass
T7-HomogeneityB	Pass	Pass	Pass
T8-Entropy	Fail	Fail	Fail

Table3.
Test results of AIS-20/31 for post-processing sequences (T0–T5) and raw sequences (T6–T8) of the three employed frequencies.

Table options
-->

Download as CSV

Parameter	100 MHz (170 LUTs)	50 MHz (290 LUTs)	25 MHz (530 LUTs)
T0-Disjointness	Pass	Pass	Pass
T1-Monobit	Pass	Pass	Pass
T2-Poker	Pass	Pass	Pass
T3-Runs	Pass	Pass	Pass
T4-LongRuns	Pass	Pass	Pass
T5-Autocorrelation	Pass	Pass	Pass
T6-UniformA	Fail	Fail	Fail
T6-UniformB	Fail	Fail	Fail
T7-HomogeneityA	Pass	Pass	Pass
T7-HomogeneityB	Pass	Pass	Pass
T8-Entropy	Fail	Fail	Fail

References	Methodology	Device	Resources used by TRNG core	Bit rate (Mbps)	Operating frequency (MHz)
This work	Deep meta state	Altera Stratix IV	170 LUTs + DFF	25	100
Ref. [9]	Nearly meta state	Xilinx XC6SLX16	36 LUTs + 24 DFFs	12.6	–
Ref. [10]	Deep meta state	Xilinx XC4VFX 20	256 latches	12.5	–
Ref. [11]	Deep meta state	Xilinx Virtex 5	128 LUTs + DFF	2.5	–
Ref. [15]	Deep meta state	Altera EPIS 25	100 latches	20	20
Ref. [16]	TERO	Xilinx Spartan 3E	2 TREOs	0.25	0.25
Ref. [17]	Clock jitter	Xilinx Virtex-II	DCM + DFF	6.05	270
Ref. [18]	Ring oscillator	Xilinx Spartan 6	3 slices	1.53	100
Ref. [19]	Deep meta state	65 nm	1609 μm²	3000	3000
Ref. [20]	Deep meta state	130 nm	9488 μm²	500	500

Table4.
Comparison of the designed TRNG with other methodologies in FPGA.

Table options
-->

Download as CSV

References	Methodology	Device	Resources used by TRNG core	Bit rate (Mbps)	Operating frequency (MHz)
This work	Deep meta state	Altera Stratix IV	170 LUTs + DFF	25	100
Ref. [9]	Nearly meta state	Xilinx XC6SLX16	36 LUTs + 24 DFFs	12.6	–
Ref. [10]	Deep meta state	Xilinx XC4VFX 20	256 latches	12.5	–
Ref. [11]	Deep meta state	Xilinx Virtex 5	128 LUTs + DFF	2.5	–
Ref. [15]	Deep meta state	Altera EPIS 25	100 latches	20	20
Ref. [16]	TERO	Xilinx Spartan 3E	2 TREOs	0.25	0.25
Ref. [17]	Clock jitter	Xilinx Virtex-II	DCM + DFF	6.05	270
Ref. [18]	Ring oscillator	Xilinx Spartan 6	3 slices	1.53	100
Ref. [19]	Deep meta state	65 nm	1609 μm²	3000	3000
Ref. [20]	Deep meta state	130 nm	9488 μm²	500	500

5.
Conclusion

A metastability-based TRNG that is implemented in Altera Stratix IV is presented in this paper. By real time monitoring and feedback calibrating, the PDL is reconfigured to force the circuit to stay around the metastable window precisely. The designed TRNG can save more resources at higher frequency and does not require extra placement and routing due to the proposed coarse-tuning PDL. The core only occupies 170 LUTs @ 100 MHz, and achieves 25 Mbps throughput after post-processing. The random quality of the designed TRNG is verified by NIST and accepted by class P1 of AIS-20/31 test suite. So the designed TRNG is proved to be efficient and reliable.