1.
Introduction
True random number generators (TRNGs) have become an essential block of modern cryptographic systems for various applications. They are used to generate cryptographic keys[1], initialization vectors, nonces and implement countermeasures against side-channel attacks (SCA)[2]. TRNGs extract randomness from non-deterministic random sources such as thermal noise, phase noise and chaos algorithm[3] in order to assure unpredictability and good statistical quality[4]. However, analog components, high-gain amplifier and ADC for example, are always employed in traditional TRNGs (as described in Ref. [5]) to amplify thermal noise and digitize it into discrete random sequences[6]. Such approaches are rather area-costing, have high power-consumption and flexibility is limited on implementation.
Due to the digital nature of metastability, TRNGs can further simplify the design with standard libraries, be cost-saving and well-suited to FPGA design flow[7, 8]. So metastability-based TRNGs can easily be utilized in digital devices. There are several proposed metastability-based TRNGs implemented in FPGA. Piotr[9] proposes a novel method that instead of a deep-meta state, harvests resolve time through a nearly-meta state. The robustness of its system is improved with the local and global feedbacks, but the complexity of its system design is increased too. It offers 12.6 Mps throughput after post-processing in Xilinx XC6SLX16 and passes all NIST tests. Hata and Ichikawa[10] proposes a solution that can be integrated in FPGA with LUT (look up table) based RS flip-flop. The biggest advantage of this approach is that no precise adjustment is required. But manual placement is very necessary in its circuit design, or the quality and the throughput of its system might be affected. Meanwhile, its robustness is vulnerable since the system is very sensitive to supply voltage and temperature fluctuation. Its final throughput is 12.5 Mps with 256 latches utilized by the core in Xilinx XC4VFX20. Majzoobi et al.[11] proposes another novel metastability-based TRNG implemented in Xilinx Virtex5. A pair of high resolution programmable delay lines (PDLs) is employed to adjust the propagation difference Δ between the input signals of DFF precisely and flexibly. The system is PVT (process, voltage, temperature) tolerated by the proportional-integral controlling of Δ. However, the results of NIST show the pass rates of Non-Overlapping and Linear Complexity are only 90%, which might not meet the requirement in some applications. The 2.5 Mps throughput seems not enough when compared with other previous TRNGs with the same mechanism as well.
A metastability-based TRNG is presented in this paper. The designed TRNG intends to be cost-saving, operating stable and high throughput in FPGA with a simple circuit structure. The rest of this paper is organized as follows: Section 2 introduces the concept of metastability of the bi-state circuit; Section 3 presents the TRNG system design and the operation principle; the experiment results are discussed in Section 4; and Section 5 is the final conclusion.
2.
Metastability of bi-state circuit
For a bi-state circuit element, the D flip-flop for example, the output cannot resolve to a certain state when the input signal and the sampling edge of the clock arrive almost simultaneously, i.e. setup/hold time violated[12]. This state is called the metastable state. In the presence of thermal noise, the circuit resolves to stable state and toggles in either 0 or 1 randomly. Ideally, as a random case, the probability of ‘0’ and ‘1’ of the circuit output should both be 50%. However, non-random factors, such as voltage supply noise in the circuit or temperature fluctuation, might affect one polarity of the output, causing the generated sequences to have less randomness. Fig. 1 illustrates the output waveform of a D flip-flop when the metastable state happens. It is seen that the metastable state takes place if the propagation delay difference Δ between input data and clock is small enough, and the range of this delay difference is also called the metastable window.

class="figure_img" id="Figure1"/>
Download

Larger image

PowerPoint slide
Figure1.
Metastability of a D flip-flop.
3.
TRNG system design
The system design and the operating principle of the designed TRNG are described in this section. Fig. 2 depicts the framework of the system design. The black triangles are the proposed coarse-tuning PDL stages, which improve the automation level of the system and make the system cost-saving at a higher frequency in FPGA. The white triangles are the fine-tuning PDL stages, which can offer a very precise resolution in adjusting the value of Δ. As described in Section 2, the output of a bi-state circuit in the metastable state can easily be biased for various non-random factors. Here a feedback control system is adopted, which can monitor the bias of the output raw sequences. The value of the binary counter is increased or decreased based on the bias information of the raw sequences. The counter value is then transformed into control bits through the thermometer-decode to reconfigure the PDL.

class="figure_img" id="Figure2"/>
Download

Larger image

PowerPoint slide
Figure2.
Framework of the designed TRNG.
3.1
Programmable delay line (PDL)
The propagation characteristic inside a look up table (LUT) as described in Ref. [11] is adopted to realize the programmable delay line. Altera Stratix IV FPGAs provide 6-input LUT. Fig. 3(a) shows the structure of a 6-input LUT. In this paper, the output of an LUT is configured as an inverter with A1 as its input. The rest [A6 : A2] act as address terminals. So there are 25 = 32 tuning levels of propagation delay with five address bits. By configuring the address terminals with different values, one single LUT can generate delay differences with picosecond resolution. All 32 tuning levels of propagation delay are tested in Altera Stratix IV, and the comparisons between 32 propagation delay and the delay caused by [A6 : A2] = 00000 are shown in Fig. 3(b). The address is expressed in decimals. It is seen that the delay differences are nonlinear with the increment of address. But if only the most significant bit (MSB) is different between two address, [A6 : A2] = 00000 and [A6 : A2] = 10000 for example, the delay difference is always less than 1 ps. With this characteristic, the PDL can force the bi-state circuit into the metastable state precisely. In our design, only one delay difference between two fixed addresses, i.e., [A6 : A2] = 00000 and [A6 : A2] = 11111, is adopted to make sure that the tuning operation is linear.

class="figure_img" id="Figure3"/>
Download

Larger image

PowerPoint slide
Figure3.
(a) Structure of a 6-input LUT. (b) Delay difference of 32 tuning levels compared with [A6 : A2] = 00000.
3.2
Coarse-tuning PDL and Fine-tuning PDL
Eq. (1)[11] expresses the total delay difference between the input signals of DFF:
$$varDelta = {varDelta _{ m b}} + {varDelta _{ m p}} - {varDelta _{ m f}},$$ ![]() | (1) |
where Δb is the constant delay caused by routing asymmetries when the system is implemented in FPGA, and Δp is caused by the change of operation condition and adversarial attack. The calibration feedback delay difference generated by reconfiguring the PDL is marked as Δf. The resolution of one coarse PDL unit is Δcs and Δfn is the resolution of one fine PDL unit. Marking T as the tuning level of the PDL, then Δf can be expressed by Eq. (2):
$${varDelta _{ m f}} = {T_{ m cs}}{varDelta _{ m cs}} + {T_{ m fn}}{varDelta _{ m fn}}.$$ ![]() | (2) |
In order to adjust the delay difference Δ flexibly and improve the automation level of the system, a coarse-tuning PDL structure is proposed. As shown in Fig. 4(a), the LUT merely acts as a buffer and the 2-1 MUX decides the propagation path. The test result shows the delay difference is about 250 ps caused by one signal coarse PDL unit in Altera Stratix IV. Since the DFF is designed to sample the clock itself, this structure is aimed at narrowing the value of Δ and making it approach the metastable window in one clock cycle regardless of the delay situations caused by any non-ideal factors. So the tuning range of the coarse delay line needs to be wider than the period of the clock, i.e, MΔcs ≥ Tclk, where M is the total number of coarse PDL stages. In Ref. [11], two PDLs are employed and operate symmetrically to tune the state of the TRNG core. But placement and routing are necessary to ensure its symmetry, otherwise the tuning granularity of its PDL is too fine to offset the delay caused by compilation and synthesis in FPGA. With the proposed coarse-tuning PDL structure, the designed system does not require extra placement and routing. Moreover, the number of M can be decreased at a higher operating frequency, and thus saves more resources in FPGA. For example, at least 240 coarse PDL stages are needed for 25 MHz but only 60 are needed for 100 MHz. As mentioned above, the fine PDL is designed to utilize the delay difference between address [A6 : A2] = 00000 and [A6 : A2] = 11111 as its tuning granularity of 4.1 ps as shown in Fig. 3(b).

class="figure_img" id="Figure4"/>
Download

Larger image

PowerPoint slide
Figure4.
(a) Coarse-tuning PDL unit. (b) Fine-tuning PDL unit.
3.3
Feedback system operation principle
The feedback system is composed of the binary counter, the sampling module and the Finite State Machine (FSM). For the binary counter, high m bits are translated through thermometer-decode to control the coarse delay line and low n bits are translated to control the fine delay line. The counter value is initialized at the beginning, so there is a propagation delay difference Δi generated by the initial configuration of the PDL. So the raw sequences are biased. The sampling module samples the biased sequences with a sampling window and calculates the proportion of ‘1’s of the raw sequences it samples. The FSM increases or decreases the value of the counter in different tuning levels depending on the proportion of ‘1’s, as shown in Table 1. Then the counter value is decoded into control bits to reconfigure the PDL. Because the low n bits change when the system becomes stable and the counter value stays around a constant value with small fluctuation, only the fine PDL is tuned.
Proportion of ‘1’s | Tuning level of counter value |
≥80% | +3 |
52%–80% | +1 |
48%–52% | +0 |
20%–48% | ?1 |
≤20% | ?3 |
Table1.
Tuning levels of the counter value with different proportions of ‘1’s.
Table options
-->

Download as CSV
Proportion of ‘1’s | Tuning level of counter value |
≥80% | +3 |
52%–80% | +1 |
48%–52% | +0 |
20%–48% | ?1 |
≤20% | ?3 |
4.
Experiment results and analysis
The designed TRNG is implemented in Altera Stratix IV as shown in Fig. 5, and operates @ 100, 50, and 25 MHz. The number of the coarse PDL stages depends on the operating frequency as described in Section 3.2, but the number of the fine PDL stages is fixed at 50. During the experiment, the TRNG core consumes 170 LUTs, 290 LUTs and 530 LUTs @ 100, 50, and 25 MHz respectively. Notice that the configuration for higher frequency is not suitable for lower frequency, since the period of the latter is wider and needs more coarse PDL stages. In order to decide on the proper sampling window for the sampling module, the bias situations of 250 Mb raw sequences with 50, 100, 200, 400, 800, and 1600, six different sampling windows are tested and shown in Fig. 6(a). It is observed from the histogram that there are no obvious differences among the six bias situations. The feedback system works well in offsetting the bias caused by fluctuation parameters during operation, and the proportion of ‘1’s of the raw sequences basically maintains at around 0.5. Considering that the feedback time might be longer with a larger sampling window, a 200 sampling window is chosen as the operating parameter. Fig. 6(b) shows the variation of the counter value @ 100 MHz. The counter value stays around a constant value after certain clock cycles with only 1 or 2 low bits changing, which means the system works stably around a metastable window. The raw output waveform @ 100 MHz is shown in Fig. 7. In order to observe the waveform clearly, the output is divided into four for each of the four channels.

class="figure_img" id="Figure5"/>
Download

Larger image

PowerPoint slide
Figure5.
Operating platform of the designed TRNG.

class="figure_img" id="Figure6"/>
Download

Larger image

PowerPoint slide
Figure6.
(a) Probability of ‘1’s with different sampling windows. (b) The variation of counter value after certain clock cycles @ 100 MHz.

class="figure_img" id="Figure7"/>
Download

Larger image

PowerPoint slide
Figure7.
Raw output waveform of the designed TRNG @ 100 MHz.
Despite the offsetting of the feedback mechanism, the raw sequences still suffer from problems such as autocorrelation and non-uniform distribution that might affect the randomness of the raw sequences. So proper post-processing is very necessary. An efficiency post-processing method is proposed in Ref. [13]. As shown in Fig. 8, the 4-stage XOR line can adjust the independence and uniform distribution of the output raw sequences. Then XOR with a periodic sequence, which is a uniform distribution and uncorrelated, can generate random sequences that are also uniformly distributed and uncorrelated[14]. An m sequence generated by the LFSR (linear feedback shift register) can be the ideal periodic sequence described above. This method can balance both the throughput and the randomness of the output sequences. The throughput of the designed TRNG drops four times after post-processing. Fig. 9 depicts the post-processing output waveform @ 100 MHz, the input clock is also divided four times in order to be observed clearly.

class="figure_img" id="Figure8"/>
Download

Larger image

PowerPoint slide
Figure8.
Post-processing circuit with 4-stage XOR line and m sequence.

class="figure_img" id="Figure9"/>
Download

Larger image

PowerPoint slide
Figure9.
Post-processing output waveform of the designed TRNG @ 25 MHz divided operating frequency.
The post-processing sequences are tested by NIST and the final reports are shown in Table 2. The statistic quality of the post-processing sequences generated by the designed TRNG is verified by NIST. AIS-20/31 divides TRNGs into two classes: class P1 if the post-processing sequences pass test T0–T5, and class P2 if the raw sequences pass test T6–T8. The raw sequences do not pass T6 and T8 with the non-uniform distribution and weak entropy source inTable 3. It is because the randomness of the entire system only comes from the intrinsic thermal noise in the circuit, and no analog amplifier is employed to amplify the magnitude of the thermal noise. The comparison of the designed TRNG with prior works is shown in Table 4. The throughput @ 100 MHz is the highest among the works which are at the same or even higher operating frequency. The TRNG core might not be the most cost-saving among Table 4, but it is acceptable given the trade-off between throughput and resource consumption. For example, the TRNG core in Ref. [11] consumes 128 LUTs, which are 42 less than the designed TRNG core in this paper; however, it needs 10 cores (1280 LUTs + 10 DFFs) to operate in parallel to achieve the same throughput as the designed TRNG does.
NIST ↓ TRNG → | 100 MHz (170 LUTs) | 50 MHz (290 LUTs) | 25 MHz (530 LUTs) | ||||||
TEST SUITE | p-value | Proportion | p-value | Proportion | p-value | Proportion | |||
Frequency | 0.1045 | 0.9733 | 0.3986 | 0.9904 | 0.8916 | 0.9904 | |||
Block frequency | 0.0451 | 0.9809 | 0.8301 | 0.9952 | 0.0451 | 0.9809 | |||
Cumulative sums | 0.2881 | 0.9833 | 0.2715 | 0.9880 | 0.4685 | 0.9952 | |||
Runs | 0.1094 | 0.9667 | 0.6519 | 1 | 0.2474 | 0.9952 | |||
Longest run of ones | 0.1257 | 0.9857 | 0.2856 | 0.9857 | 0.4777 | 0.9952 | |||
Rank | 0.9372 | 0.9867 | 0.7541 | 1 | 0.2053 | 0.9733 | |||
Discrete Fourier transform | 0.3661 | 1 | 0.2992 | 1 | 0.0896 | 0.9809 | |||
Nonperiodic template matchings | 0.4840 | 0.9900 | 0.4992 | 0.9890 | 0.5017 | 0.9890 | |||
Overlapping template matchings | 0.2597 | 0.9904 | 0.7887 | 0.9583 | 0.8165 | 0.9602 | |||
Universal statistical | 0.3621 | 0.9733 | 0.4915 | 1 | 0.3621 | 0.9733 | |||
Approximate entropy | 0.0649 | 0.9952 | 0.2597 | 0.9809 | 0.8216 | 0.9857 | |||
Random excursions | 0.3922 | 0.9950 | 0.1798 | 0.9957 | 0.2806 | 0.9950 | |||
Random excursions variant | 0.2848 | 1 | 0.1267 | 0.9981 | 0.2378 | 0.9978 | |||
Serial | 0.3539 | 0.9928 | 0.6321 | 0.9880 | 0.4353 | 0.9833 | |||
Linear complexity | 0.8773 | 0.9714 | 0.0575 | 0.9761 | 0.7773 | 0.9809 |
Table2.
Test results of NIST for post-processing sequences of the three employed frequencies.
Table options
-->

Download as CSV
NIST ↓ TRNG → | 100 MHz (170 LUTs) | 50 MHz (290 LUTs) | 25 MHz (530 LUTs) | ||||||
TEST SUITE | p-value | Proportion | p-value | Proportion | p-value | Proportion | |||
Frequency | 0.1045 | 0.9733 | 0.3986 | 0.9904 | 0.8916 | 0.9904 | |||
Block frequency | 0.0451 | 0.9809 | 0.8301 | 0.9952 | 0.0451 | 0.9809 | |||
Cumulative sums | 0.2881 | 0.9833 | 0.2715 | 0.9880 | 0.4685 | 0.9952 | |||
Runs | 0.1094 | 0.9667 | 0.6519 | 1 | 0.2474 | 0.9952 | |||
Longest run of ones | 0.1257 | 0.9857 | 0.2856 | 0.9857 | 0.4777 | 0.9952 | |||
Rank | 0.9372 | 0.9867 | 0.7541 | 1 | 0.2053 | 0.9733 | |||
Discrete Fourier transform | 0.3661 | 1 | 0.2992 | 1 | 0.0896 | 0.9809 | |||
Nonperiodic template matchings | 0.4840 | 0.9900 | 0.4992 | 0.9890 | 0.5017 | 0.9890 | |||
Overlapping template matchings | 0.2597 | 0.9904 | 0.7887 | 0.9583 | 0.8165 | 0.9602 | |||
Universal statistical | 0.3621 | 0.9733 | 0.4915 | 1 | 0.3621 | 0.9733 | |||
Approximate entropy | 0.0649 | 0.9952 | 0.2597 | 0.9809 | 0.8216 | 0.9857 | |||
Random excursions | 0.3922 | 0.9950 | 0.1798 | 0.9957 | 0.2806 | 0.9950 | |||
Random excursions variant | 0.2848 | 1 | 0.1267 | 0.9981 | 0.2378 | 0.9978 | |||
Serial | 0.3539 | 0.9928 | 0.6321 | 0.9880 | 0.4353 | 0.9833 | |||
Linear complexity | 0.8773 | 0.9714 | 0.0575 | 0.9761 | 0.7773 | 0.9809 |
Parameter | 100 MHz (170 LUTs) | 50 MHz (290 LUTs) | 25 MHz (530 LUTs) |
T0-Disjointness | Pass | Pass | Pass |
T1-Monobit | Pass | Pass | Pass |
T2-Poker | Pass | Pass | Pass |
T3-Runs | Pass | Pass | Pass |
T4-LongRuns | Pass | Pass | Pass |
T5-Autocorrelation | Pass | Pass | Pass |
T6-UniformA | Fail | Fail | Fail |
T6-UniformB | Fail | Fail | Fail |
T7-HomogeneityA | Pass | Pass | Pass |
T7-HomogeneityB | Pass | Pass | Pass |
T8-Entropy | Fail | Fail | Fail |
Table3.
Test results of AIS-20/31 for post-processing sequences (T0–T5) and raw sequences (T6–T8) of the three employed frequencies.
Table options
-->

Download as CSV
Parameter | 100 MHz (170 LUTs) | 50 MHz (290 LUTs) | 25 MHz (530 LUTs) |
T0-Disjointness | Pass | Pass | Pass |
T1-Monobit | Pass | Pass | Pass |
T2-Poker | Pass | Pass | Pass |
T3-Runs | Pass | Pass | Pass |
T4-LongRuns | Pass | Pass | Pass |
T5-Autocorrelation | Pass | Pass | Pass |
T6-UniformA | Fail | Fail | Fail |
T6-UniformB | Fail | Fail | Fail |
T7-HomogeneityA | Pass | Pass | Pass |
T7-HomogeneityB | Pass | Pass | Pass |
T8-Entropy | Fail | Fail | Fail |
References | Methodology | Device | Resources used by TRNG core | Bit rate (Mbps) | Operating frequency (MHz) |
This work | Deep meta state | Altera Stratix IV | 170 LUTs + DFF | 25 | 100 |
Ref. [9] | Nearly meta state | Xilinx XC6SLX16 | 36 LUTs + 24 DFFs | 12.6 | – |
Ref. [10] | Deep meta state | Xilinx XC4VFX 20 | 256 latches | 12.5 | – |
Ref. [11] | Deep meta state | Xilinx Virtex 5 | 128 LUTs + DFF | 2.5 | – |
Ref. [15] | Deep meta state | Altera EPIS 25 | 100 latches | 20 | 20 |
Ref. [16] | TERO | Xilinx Spartan 3E | 2 TREOs | 0.25 | 0.25 |
Ref. [17] | Clock jitter | Xilinx Virtex-II | DCM + DFF | 6.05 | 270 |
Ref. [18] | Ring oscillator | Xilinx Spartan 6 | 3 slices | 1.53 | 100 |
Ref. [19] | Deep meta state | 65 nm | 1609 μm2 | 3000 | 3000 |
Ref. [20] | Deep meta state | 130 nm | 9488 μm2 | 500 | 500 |
Table4.
Comparison of the designed TRNG with other methodologies in FPGA.
Table options
-->

Download as CSV
References | Methodology | Device | Resources used by TRNG core | Bit rate (Mbps) | Operating frequency (MHz) |
This work | Deep meta state | Altera Stratix IV | 170 LUTs + DFF | 25 | 100 |
Ref. [9] | Nearly meta state | Xilinx XC6SLX16 | 36 LUTs + 24 DFFs | 12.6 | – |
Ref. [10] | Deep meta state | Xilinx XC4VFX 20 | 256 latches | 12.5 | – |
Ref. [11] | Deep meta state | Xilinx Virtex 5 | 128 LUTs + DFF | 2.5 | – |
Ref. [15] | Deep meta state | Altera EPIS 25 | 100 latches | 20 | 20 |
Ref. [16] | TERO | Xilinx Spartan 3E | 2 TREOs | 0.25 | 0.25 |
Ref. [17] | Clock jitter | Xilinx Virtex-II | DCM + DFF | 6.05 | 270 |
Ref. [18] | Ring oscillator | Xilinx Spartan 6 | 3 slices | 1.53 | 100 |
Ref. [19] | Deep meta state | 65 nm | 1609 μm2 | 3000 | 3000 |
Ref. [20] | Deep meta state | 130 nm | 9488 μm2 | 500 | 500 |
5.
Conclusion
A metastability-based TRNG that is implemented in Altera Stratix IV is presented in this paper. By real time monitoring and feedback calibrating, the PDL is reconfigured to force the circuit to stay around the metastable window precisely. The designed TRNG can save more resources at higher frequency and does not require extra placement and routing due to the proposed coarse-tuning PDL. The core only occupies 170 LUTs @ 100 MHz, and achieves 25 Mbps throughput after post-processing. The random quality of the designed TRNG is verified by NIST and accepted by class P1 of AIS-20/31 test suite. So the designed TRNG is proved to be efficient and reliable.