Side-channel attack-resistant AES S-box with hidden subfield inversion and glitch-free masking


Posted by the site editor, Free考研考试, 2022-01-01

1. Introduction

Side-channel attack (SCA) resistance is an important issue in the design of cryptographic integrated circuits for many applications, including Internet of Things (IoT) nodes^[1]. Accordingly, SCA countermeasures must be employed in most IoT devices to ensure security and privacy. However, IoT devices are usually lightweight systems, so their countermeasures must be secure, energy-efficient, and low-cost at the same time.

The advanced encryption standard (AES) is widely used in IoT systems^[2], and the key to addressing these challenges for AES is usually a low-power, low-cost, and secure S-box.

The state of the art uses key updating^[3], which is a leakage-resilience method. Although its overhead is low, it requires both sides of a communication to use the same key update mechanism, which is not always realizable. The masking method randomizes the intermediates of a calculation using random masks, but it becomes complex when applied to nonlinear operations. Even though it can achieve a small area (e.g., the threshold implementation method^[4-6]), its energy efficiency is still relatively low: threshold implementations of the AES S-box usually take tens of cycles for one computation. Among the hiding methods, most equalization methods^[7] and noising methods^[8] must consume extra energy as the compensation or noise component. Random voltage or clock dithering^[9] has less overhead, but its effectiveness decreases when an enhanced alignment technique (e.g., elastic alignment^[10]) is used. Ref. [11] does not consider the internal power difference of the dual-rail flush logic (DRFL) gates or early propagation issues. Moreover, like most dual-rail logic styles that resist differential power analysis (DPA), it requires a new library of full-custom gates. The power-aware hiding (PAH) method^[12, 13] is a new kind of power equalization technique that minimizes the compensation power and addresses the early propagation issue. Moreover, it supports a semi-custom design flow. However, its area is too large for cost-sensitive IoT applications (e.g., RFID tags).

We proposed a novel AES S-box implementation method in Ref. [14], which shrinks the scope of the PAH method in the design to dramatically decrease both the circuit area and the mismatch in the PAH block. As an extension of that conference paper, in addition to a more detailed discussion and evaluation of the PAH part, this paper presents a glitch-free implementation of the masked part of the proposed structure, which enhances the security of the whole design. Overall, this new solution achieves a higher level of security with a low area and high energy efficiency.

2. Related countermeasures

2.1 Full-masking structure

The scheme developed by Canright^[15] is a classic tower-field masking architecture. The AES S-box consists of the following two substeps: inversion in GF(2⁸) and an affine transformation. The inversion in GF(2⁸) can be calculated in the subfield GF(2⁴), which in turn involves operations in GF(2²), to minimize the area. The masked inversion calculation in the Canright approach can be depicted as in Eqs. (1)–(5), where $\tilde A$ is the input data masked (XORed) by a random byte (mask), M, i.e., $\tilde A = A \oplus M$; and subscript 1 or 0 attached to a symbol indicates its high or low 4 bits, respectively. The symbol Inv^*() in Eq. (2) denotes the masked inversion operation in GF(2⁴). This generates ${B^{-1}}$ masked by M₁, denoted by $\widetilde{B^{-1}}$. Q is another 4-bit mask independent of M. In addition, a 2-bit mask (r) joins the calculation over GF(2²). The final output is the masked output $\widetilde{A^{-1}}$ and its mask M. This approach requires all intermediates to be masked and is termed the “full-masking implementation.”

$$\begin{array}{l}\widetilde B = Q \oplus N \otimes {(\widetilde{A_1} \oplus \widetilde{A_0})^2} \oplus N \otimes {({M_1} \oplus {M_0})^2} \oplus \widetilde{A_1} \otimes \widetilde{A_0} \\ \quad\;\;\; \oplus \widetilde{A_1} \otimes {M_0} \oplus \widetilde{A_0} \otimes {M_1} \oplus {M_1} \otimes {M_0},\end{array}$$

(1)

$$\widetilde{B^{-1}} = \mathrm{Inv}^*(\tilde B, Q, r),$$

(2)

$$\widetilde{B_2^{-1}} = \widetilde{B^{-1}} \oplus \left( {M_0} \oplus {M_1} \right),$$

(3)

$$\widetilde{A_1^{-1}} = Q \oplus \widetilde{A_0} \otimes \widetilde{B^{-1}} \oplus \widetilde{A_0} \otimes {M_1} \oplus {M_0} \otimes \widetilde{B^{-1}} \oplus {M_0} \otimes {M_1} \oplus ({M_1} \oplus Q),$$

(4)

$$\widetilde{A_0^{-1}} = Q \oplus \widetilde{A_1} \otimes \widetilde{B_2^{-1}} \oplus \widetilde{A_1} \otimes {M_0} \oplus {M_1} \otimes \widetilde{B_2^{-1}} \oplus {M_1} \otimes {M_0} \oplus ({M_0} \oplus Q).$$

(5)
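As an illustrative check of Eqs. (1) and (3)–(5), the following Python sketch verifies exhaustively that the masked expressions equal the unmasked values XORed with the stated masks. It is not taken from the paper: the field representation (polynomial basis with x⁴ + x + 1), the value chosen for the constant N, and the helper names are assumptions, and the masked inverter of Eq. (2) is modelled simply by its stated result (B⁻¹ masked by M₁). The identities rely only on the bilinearity of the multiplication and on the linearity of squaring in characteristic 2, so they should hold for any choice of N.

```python
import itertools

def gf16_mul(a, b):
    """Multiply in GF(2^4); the polynomial basis with x^4 + x + 1 is an assumption."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13  # reduce modulo x^4 + x + 1
    return r

MUL = [[gf16_mul(a, b) for b in range(16)] for a in range(16)]
INV = [0] + [MUL[x].index(1) for x in range(1, 16)]  # inverse table, 0 maps to 0
N = 0x9  # hypothetical stand-in for the constant N of Eq. (1)

def sq(x):
    return MUL[x][x]

def check_masking_identities(q=0x5):
    """Return True if Eqs. (1), (4) and (5) unmask correctly for all inputs and masks."""
    for a1, a0, m1, m0 in itertools.product(range(16), repeat=4):
        at1, at0 = a1 ^ m1, a0 ^ m0                      # masked nibbles of A
        b = MUL[N][sq(a1 ^ a0)] ^ MUL[a1][a0]            # unmasked B
        bt = (q ^ MUL[N][sq(at1 ^ at0)] ^ MUL[N][sq(m1 ^ m0)]
              ^ MUL[at1][at0] ^ MUL[at1][m0] ^ MUL[at0][m1] ^ MUL[m1][m0])  # Eq. (1)
        if bt != b ^ q:
            return False
        bt_inv = INV[b] ^ m1                             # stated result of Eq. (2)
        bt2_inv = bt_inv ^ (m0 ^ m1)                     # Eq. (3)
        a1_inv_t = (q ^ MUL[at0][bt_inv] ^ MUL[at0][m1] ^ MUL[m0][bt_inv]
                    ^ MUL[m0][m1] ^ (m1 ^ q))            # Eq. (4)
        a0_inv_t = (q ^ MUL[at1][bt2_inv] ^ MUL[at1][m0] ^ MUL[m1][bt2_inv]
                    ^ MUL[m1][m0] ^ (m0 ^ q))            # Eq. (5)
        if a1_inv_t != MUL[a0][INV[b]] ^ m1:             # high nibble masked by M1
            return False
        if a0_inv_t != MUL[a1][INV[b]] ^ m0:             # low nibble masked by M0
            return False
    return True
```

Calling `check_masking_identities()` with different values of `q` exercises every combination of data and mask nibbles; under the assumptions above, the output nibbles come out masked by M₁ and M₀, which matches the statement that the final mask is M.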

2.2 Power-aware hiding technique

The power-aware hiding technique implements an N-bit substitution function as a lookup table (LUT). The function output is expanded by a flag bit and N/2 compensation bits. The LUT is generated according to the following rules: when the Hamming weight (HW) of the original result is greater than N/2, the result word in the table is its inverted value and the flag bit is 1; otherwise, the result word in the table is the original value and the flag bit is 0. The compensation bits contain the number of ones needed to make the HW of the whole output word equal to N/2. The method presented in Refs. [12, 13] implements the entire design as a single PAH function (termed the “full-PAH” implementation in this paper). This implementation has advantages in terms of power-delay product (PDP) and performance, but its area overhead is too high for cost-sensitive applications. More importantly, it has many 4-input AND gates, so its stack effect is relatively severe, which is a main source of mismatch in the circuit. Additionally, a larger block usually has larger clock skews between the gates. In dynamic logic, clock skew causes data-dependent differences in timing and, consequently, in power. For these two reasons, a larger PAH array has a more significant power difference.
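The encoding rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming an even N; the function names are hypothetical, and the choice of which compensation bits are set (here, filled from the least significant bit) is an assumption because the rules only fix how many of them are ones.

```python
def hamming_weight(x):
    return bin(x).count("1")

def pah_encode(value, n):
    """Return (word, flag, comp) for an n-bit value, following the stated PAH rules."""
    half = n // 2
    all_ones = (1 << n) - 1
    if hamming_weight(value) > half:
        word, flag = value ^ all_ones, 1   # store the inverted value
    else:
        word, flag = value, 0              # store the original value
    ones_needed = half - hamming_weight(word) - flag
    comp = (1 << ones_needed) - 1          # assumption: fill compensation bits from the LSB
    return word, flag, comp

def pah_decode(word, flag, n):
    """Recover the original value; the compensation bits carry no data."""
    return word ^ ((1 << n) - 1) if flag else word
```

For N = 4, every encoded output then has exactly two ones among its seven bits, for example `pah_encode(0b1110, 4)` gives the word `0b0001` with the flag set and no compensation bits, while `pah_encode(0, 4)` needs both compensation bits.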

3. Masked AES S-box with PAH GF(2⁴) inverter

3.1 PAH-masking mixed architecture

In the proposed architecture, the inverse calculation in GF(2⁴) is implemented by means of PAH rather than the masked subfield GF(2²) calculation. The other operations still use the masking approach. Hence, the proposed implementation is given by Eqs. (4)–(10), where Inv^h() represents the inversion in GF(2⁴) implemented in the PAH manner. The input of the PAH inversion unit (PIU) must be unmasked before entering the PIU. The output of the PIU (${B^{-1}}$) is remasked immediately after output; see Eqs. (9) and (10).

$${h_1} = Q \oplus N \otimes {(\widetilde{A_1} \oplus \widetilde{A_0})^2} \oplus N \otimes {({M_1} \oplus {M_0})^2} \oplus \widetilde{A_1} \otimes \widetilde{A_0} \oplus \widetilde{A_1} \otimes {M_0} \oplus \widetilde{A_0} \otimes {M_1},$$

(6)

$${h_2} = Q \oplus {M_1} \otimes {M_0},$$

(7)

$$B^{-1} = \mathrm{Inv}^h({h_1} \oplus {h_2}),$$

(8)

$$\widetilde{B^{-1}} = B^{-1} \oplus {M_1},$$

(9)

$$\widetilde{B_2^{-1}} = B^{-1} \oplus {M_0}.$$

(10)
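As a worked check (not part of the original derivation), XORing Eqs. (6) and (7) cancels Q and leaves the unmasked input of the PIU. Here B is taken to be the unmasked value $N \otimes (A_1 \oplus A_0)^2 \oplus A_1 \otimes A_0$, which is the quantity that Eq. (1) masks with Q:

$$\begin{array}{l}{h_1} \oplus {h_2} = N \otimes {(\widetilde{A_1} \oplus \widetilde{A_0})^2} \oplus N \otimes {({M_1} \oplus {M_0})^2} \oplus \widetilde{A_1} \otimes \widetilde{A_0} \oplus \widetilde{A_1} \otimes {M_0} \oplus \widetilde{A_0} \otimes {M_1} \oplus {M_1} \otimes {M_0} \\ \quad\;\;\; = N \otimes {({A_1} \oplus {A_0})^2} \oplus {A_1} \otimes {A_0} = B.\end{array}$$

The first two terms combine because squaring is linear in characteristic 2, so $(\widetilde{A_1} \oplus \widetilde{A_0})^2 = (A_1 \oplus A_0)^2 \oplus (M_1 \oplus M_0)^2$. Expanding $\widetilde{A_1} \otimes \widetilde{A_0}$ with $\widetilde{A_i} = A_i \oplus M_i$ produces the cross terms $A_1 \otimes M_0$, $A_0 \otimes M_1$, and $M_1 \otimes M_0$, each of which appears an even number of times in the sum and therefore cancels.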

The PIU operates in two phases: precharge and evaluation (see Section 3.2). Therefore, the proposed S-box module is implemented as a two-phase pipeline. As shown in Fig. 1, there are two register stages, R1 and R2, which divide the datapath into three stages. Fig. 2 gives its timing diagram. The phase of the pipeline is controlled by a pair of complementary signals that switch at every rising clock edge: $\phi$ and $\overline{\phi}$. The first stage calculates h₁ and h₂. The second stage completes the inversion. In addition to the PIU, it contains an unmasking unit (UMU) that adds h₁ and h₂ together to unmask B. The masks of $\widetilde{B^{-1}}$ and $\widetilde{B_2^{-1}}$ are also prepared in this stage. The third stage calculates the final output ($\widetilde S$) and its mask (T). In this stage, $B^{-1}$ is masked by M₁ and M₀, respectively, at a remasking unit (RMU) and becomes $\widetilde{B^{-1}}$ and $\widetilde{B_2^{-1}}$. The remaining logic of the third stage implements Eqs. (4) and (5), in addition to the basis change and the affine transformation of $\widetilde{A^{-1}}$ and its mask. These stages output fixed values in their respective precharge phases. The variables that appear in both the first and the third stages, such as $\widetilde{A_1} \otimes {M_0}$ (denoted by am₁₀), $\widetilde{A_0} \otimes {M_1}$ (denoted by am₀₁), ${M_1} \otimes {M_0}$ (denoted by m₀₁), and so on, are latched by a group of registers named Rc, which is updated only once for each computation.

Figure 1. (Color online) PAH-masking mixed S-box structure (modified from Ref. [14]).

Figure 2. (Color online) PAH-masking mixed S-box timing diagram (modified from Ref. [14]).

3.2 PAH GF(2⁴) inversion unit

The circuit of the PIU is a 16-entry LUT implemented as a domino logic array, clocked by the signal $\overline{\phi}$. According to the principle of PAH^[12], its outputs include a 4-bit result word, a flag bit, and two compensation bits. The logic of each output bit is implemented as a sum of products. Each bit of the result word, as well as the flag bit, has 5 product terms; the two compensation bits have 3 and 4 product terms, respectively. As shown in Fig. 3, the whole PAH unit contains an AND-plane, which produces all 16 product terms, and an OR-plane, which is composed of seven dynamic OR gates. The PIU input is encoded in 1-of-4 code; thus, the input data are represented by two groups of wires. Each input of an AND gate comes from one of the two groups and corresponds to one possible pattern of its group. Hence, the number of input transitions is kept constant (equal to 2). For any input pattern, there is exactly one product signal transition and exactly two output transitions in the array. Some additional transistors are used to precharge the internal nodes of the AND gates to restrain charge sharing and to make their initial states before evaluation consistent across input patterns. By using domino logic, the OR gates avoid a series connection of five PMOS transistors. In addition, the static power consumption is close to that of static complementary CMOS logic. Furthermore, the PIU's operation is naturally divided into precharge and evaluation phases, which is necessary for making the switching activity constant. Such a regular array can be implemented automatically by a compiler, as presented in Ref. [12]. The noise and mismatch of each gate are predictable. Hence, the robustness and design-effort problems usually associated with domino logic are not challenges in this case. For a new process, only a small number of basic layout cells need to be redesigned, which is much easier than designing special power-balanced logic gates.

Figure 3. (Color online) PAH inversion in GF(2⁴) unit. (a) Circuit^[14]. (b) Layout (62 × 56 μm²).

To limit internal power differences, all gates in the same plane have identical fan-in, fan-out, and size. For the OR gates whose actual fan-in is fewer than five, the unused input pins are connected to ground (dotted-line transistors in Fig. 3(a)). In addition, the flip-flops of R2 all use the same kind of cell, so that the power consumption of the output data transition is constant.

Consequently, the PIU has only 16 2-input AND gates and 7 5-input OR gates, whereas the full-PAH S-box has 256 4-input AND gates and 13 93-input OR gates^[12]. A smaller number of entries not only saves area but also helps to minimize clock skew because the overall routing length of the clock network is shorter. Clock skew causes a mismatch in the evaluation or precharge time points of the gates and hence results in differences in the power traces. In addition, the fan-in of the AND gates in the PIU is smaller than that in the full-PAH S-box, so the mismatch due to the stack effect is much smaller. These advantages decrease the power difference of the PAH array. Another benefit of the smaller AND-gate fan-in is a lower minimum operating supply voltage. Lowering the supply voltage is an important technique for energy-efficient circuits. The minimum supply voltage of this circuit is 0.4 V, a near-threshold value in the range typically used for high-energy-efficiency operation.

3.3 Un/Remasking circuits

Unmasking and remasking must be implemented with power-balanced circuits because unmasked intermediates appear in them. Therefore, the inputs of the UMU, h₁ and h₂, are converted into 1-of-4 code in advance to facilitate power equalization, because the XOR logic of 1-of-4 data has a symmetric structure and is glitch-free. In the RMU, $B^{-1}$ is masked with M₁ and M₀ in parallel. This is completed in two steps. First, $B^{-1}$ (in the 2-of-7 PAH code) is XORed with ${M_1}'$ and ${M_0}'$, respectively, which are 7-bit masks derived from M₁ and M₀, respectively, by introducing three additional random bits. Second, the masked data are decoded into 4-bit normal binary data. This two-stage structure provides uniform fan-outs for all 7 flip-flops of $B^{-1}$.
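The text does not spell out how the 7-bit masks are derived, so the following Python sketch shows only one possible construction that is consistent with the two steps described above. Everything in it is an assumption for illustration: the three extra random bits are used as masks for the flag bit and the two compensation bits, and the 4-bit word mask is pre-inverted whenever the flag mask is 1 so that the decoder output equals B⁻¹ XORed with the 4-bit mask. The function names are hypothetical.

```python
import secrets

def pah_encode4(value):
    """2-of-7 PAH code of a 4-bit value as (word, flag, comp); comp is filled from the LSB."""
    inverted = bin(value).count("1") > 2
    word, flag = (value ^ 0xF, 1) if inverted else (value, 0)
    comp = (1 << (2 - bin(word).count("1") - flag)) - 1
    return word, flag, comp

def expand_mask(m, rand3):
    """Hypothetical derivation of a 7-bit mask from a 4-bit mask m and 3 random bits."""
    flag_mask = (rand3 >> 2) & 1
    comp_mask = rand3 & 0b11
    word_mask = m ^ (0xF if flag_mask else 0)   # compensate for a flipped flag
    return word_mask, flag_mask, comp_mask

def remask(word, flag, comp, m, rand3):
    """Step 1: XOR all 7 bits with the expanded mask. Step 2: decode to 4-bit binary."""
    word_mask, flag_mask, comp_mask = expand_mask(m, rand3)
    w, f, c = word ^ word_mask, flag ^ flag_mask, comp ^ comp_mask
    # The masked compensation bits c are toggled for uniformity but carry no data.
    return w ^ (0xF if f else 0)

# Example use with fresh randomness (values are illustrative only)
b_inv, m1 = 0b1011, 0b0110
masked = remask(*pah_encode4(b_inv), m1, secrets.randbits(3))
```

With this construction the result is always `b_inv ^ m1` regardless of the three random bits, so the unmasked value never has to be decoded before the mask is applied.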

To power-balance the RMU circuit, all of its XOR gates use the same cell. However, the XOR cells in the standard cell library do not meet this requirement. Their structure is shown in Fig. 4(a). The loads of the two input pins are different; therefore, the transition power consumption of the gate depends on which input pin switches. To eliminate this power difference, a symmetric XOR gate is used instead, as shown in Fig. 4(b). In the symmetric gate, two identical XOR gates (whose input pins are cross-connected) are combined in parallel. Thus, the capacitance at the two inputs is symmetric.

Figure 4. Modification of XOR gates in RMU.

3.4 Glitch elimination in the masking domain

Glitches in combinational logic pose a non-negligible threat to masked implementations, especially to XOR-chain structures^[16]. Therefore, the terms in Eqs. (4)–(6) must be added individually in a proper sequence^[15]; thus, each equation forms an XOR-chain. To suppress glitches in the RMU and UMU, their inputs are synchronized by registers, as shown in Fig. 1. Meanwhile, an enable signal is introduced into the XOR-chains to eliminate the dangerous glitches in them.

The glitch-free structure of the chain that calculates $\widetilde{A_0^{-1}}$ is shown in Fig. 5. Its XOR-chain (including the XOR in the $\widetilde{A_1} \otimes \widetilde{B_2^{-1}}$ multiplier, which is not drawn) is composed of six synchronized-XOR (Syn-XOR) gates and an “enable chain” (colored in grey). A Syn-XOR gate has an additional enable pin. The original input signal can affect the XOR gate only when enable = 1. The enable chain is a duplicate of the critical path that stems from the $\phi$ signal and is used to match the path delay. It is shared by all slices of the block. When $\phi$ is activated, the enable signal is launched and propagates along the enable chain, activating each Syn-XOR gate once its critical signals have arrived. Hence, at most one transition can occur at each XOR gate during one evaluation. The enable signals are reset by the fall of $\phi$. The AND gates driving the enable pins speed up this reset procedure, so that early-arriving signals, such as am₀₁, cannot cause transitions before enable has fallen.
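The effect of the enable chain can be illustrated with a small unit-delay simulation. This is a simplified behavioural model under stated assumptions, not the circuit of Fig. 5: every gate has a delay of one time step, all signals start at 0, each input changes at most once, and the enable time of each gate is computed directly as the moment both of its inputs are stable (standing in for the delay-matched enable chain). The function and parameter names are hypothetical.

```python
def simulate_chain(values, arrivals, gated):
    """Simulate y[0] = x[0] and y[i] = y[i-1] XOR x[i] with unit gate delays.

    values[i]   : final logic value of input x[i] (it is 0 before it arrives)
    arrivals[i] : time step at which x[i] takes its final value
    gated       : if True, gate i is held at 0 until its enable time
    Returns (final outputs, number of output transitions per gate).
    """
    n = len(values)
    # Enable times: gate i is released once y[i-1] and x[i] are both stable.
    stable = [arrivals[0] + 1]              # first step at which y[0] is seen as final
    enable_at = [0]
    for i in range(1, n):
        en = max(stable[i - 1], arrivals[i])
        enable_at.append(en)
        stable.append(en + 1)
    y = [0] * n
    transitions = [0] * n
    for t in range(max(arrivals) + n + 2):
        x = [values[i] if t >= arrivals[i] else 0 for i in range(n)]
        new_y = [x[0]]
        for i in range(1, n):
            if gated and t < enable_at[i]:
                new_y.append(0)                  # Syn-XOR held in its reset state
            else:
                new_y.append(y[i - 1] ^ x[i])    # previous-step y models the gate delay
        for i in range(n):
            transitions[i] += int(new_y[i] != y[i])
        y = new_y
    return y, transitions

# Inputs that arrive out of order produce glitches in the plain chain
plain = simulate_chain([1, 1, 1, 1], [0, 3, 1, 6], gated=False)
synced = simulate_chain([1, 1, 1, 1], [0, 3, 1, 6], gated=True)
```

In this model both chains settle to the same outputs, but the plain chain toggles several times at the later gates, whereas each gated output changes at most once per evaluation, which is the property the Syn-XOR structure is meant to provide.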

Figure 5. Structure of $\widetilde{A_0^{-1}}$ calculation logic with enable chain.

4. Implementation results

The proposed S-box has been implemented in a 180 nm technology. The PIU (including circuit and layout) was generated with a macro compiler. The RTL description of the entire S-box was first synthesized to obtain a standard-cell-based initial gate-level netlist. Next, the netlist was modified manually as follows: (1) the XOR-chains were replaced with the glitch-free structure shown in Fig. 5, where the critical path in the initial netlist was cloned as the delay-matching logic; (2) the cells of the UMU and RMU were modified so that all bit slices are identical. Then, after validation, the modified gate-level netlist was transformed into a SPICE netlist, in which the GF(2⁴) inverter made of standard cells was replaced with the transistor-level PIU netlist. Finally, the transistor-level design was verified, evaluated, and analyzed through SPICE simulation. For comparison, the unprotected Canright S-box^[17], the full-masking design, and the full-PAH design^[13] were also implemented in the same technology.

Concerning the security of the implemented S-box, the maximum $\left| t \right|$ of the proposed design, as well as of the full-masking, full-PAH, and unprotected S-boxes, was obtained by a 10 000-sample nonspecific t-test^[18], in which two datasets, one random and one fixed, were used as the inputs of the S-box to obtain power traces. The power traces of all the circuits were collected through SPICE simulation at a sampling rate of 10 GSa/s. A t-test was then performed on the traces of the two datasets. The detailed t-test results of the designs are shown in Fig. 6, where the red lines indicate the general threshold for rejecting the null hypothesis ($\left| t \right| = 4.5$^[18]). The proposed S-box is the only one that reveals no significant difference. Hence, the leakage of the PIU is shown to be less than that of the 256-entry PAH module.
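The evaluation procedure described above can be sketched as follows. This is a minimal illustration, assuming Welch's form of the t statistic (commonly used for fixed-versus-random leakage assessment) and traces stored as plain Python lists of power samples; the function names and the toy data are hypothetical and unrelated to the simulated traces of the paper.

```python
import math
import statistics

THRESHOLD = 4.5  # |t| threshold quoted in the text

def welch_t(a, b):
    """Welch's t statistic between two groups of samples taken at one time point.
    Assumes the two groups do not both have zero variance."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b))

def max_abs_t(fixed_traces, random_traces):
    """Largest |t| over all sample points of two sets of equal-length traces."""
    points = len(fixed_traces[0])
    return max(
        abs(welch_t([tr[i] for tr in fixed_traces], [tr[i] for tr in random_traces]))
        for i in range(points))

def leaks(fixed_traces, random_traces):
    """True if the null hypothesis (no difference) is rejected at any sample point."""
    return max_abs_t(fixed_traces, random_traces) > THRESHOLD

# Toy data: the second sample point differs in mean between the two groups
fixed_set = [[1.0 + 0.1 * (k % 2), 5.0 + 0.1 * (k % 2)] for k in range(100)]
leaky_set = [[1.0 + 0.1 * ((k + 1) % 2), 6.0 + 0.1 * (k % 2)] for k in range(100)]
clean_set = [[1.0 + 0.1 * ((k + 1) % 2), 5.0 + 0.1 * ((k + 1) % 2)] for k in range(100)]
```

Here `leaks(fixed_set, leaky_set)` flags the mean shift at the second sample point, while `leaks(fixed_set, clean_set)` does not, because both groups have the same distribution at every point.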

Figure 6. (Color online) Nonspecific t-test results (with 10 000 traces). (a) Unprotected. (b) Full-masking. (c) Full-PAH. (d) Proposed.

The maximum $\left| t \right|$ values of the different S-boxes are compared in Table 1. Besides the designs in Fig. 6, we also include the leakage metric of the random fast voltage dithering (RFVD) method^[9] and of the proposed S-box with a normal masking part (with glitches). The proposed design has the lowest maximum $\left| t \right|$ (3.9). The comparison also illustrates the security improvement due to the glitch-free design (14.4 versus 3.9) and shows that, even with glitch leakage, the solution based on the PIU and masking is still more secure than the full-PAH solution (14.4 versus 37.2).

Design	Ref. [17]	Full-masking	Ref. [13]	Ref. [9]	This work (with glitches)	This work
Countermeasure	Unprotected	Full-masking	Full-PAH	RFVD	PAH-masking	PAH-glitch-free masking
$\left| t \right|_{\max}$	69.8	25.2	37.2	4.2	14.4	3.9

Table 1. Comparison with other DPA-resistant S-boxes in terms of security.


Additionally, the measurements to disclosure (MTD) values of this work and the unprotected S-box were evaluated by first-order moments-correlating profiled DPA (MCP-DPA)^[19], in which the mean traces for different data values are estimated from one random dataset, and a correlation power analysis based on Pearson's correlation between the samples and the profiled mean values is then performed on another random dataset. The result of the MCP-DPA on the proposed implementation using 665 000 traces is shown in Fig. 7. The correct key is not revealed using these traces, so the MTD is greater than 665 000. Note that these traces contain no noise because they are collected by simulation. For a real chip, the MTD value is expected to be much larger.

Figure 7. (Color online) MCP-DPA attack result (332 500 traces for profiling and 332 500 traces for correlation).

With respect to the performance and cost of the proposed method, Table 2 compares the delay (the latency of a lookup), the energy per operation, and the area of different DPA-resistant AES S-boxes. The latency of the proposed design is the sum of the delays of the three logic stages plus the overhead time (setup and propagation time) of the registers^[14]. Its highest clock frequency is 333 MHz. The following points can be observed. With the proposed method, the area shrinks to 61.2% of the full-PAH area, and the energy is approximately 17.6% of that of the improved masking solution proposed in Ref. [20]. The energy efficiency of the proposed S-box is worse than that of the full-PAH design, and its area is larger than that of the masking design. Compared with these candidates, this work provides a more balanced tradeoff between cost and energy efficiency, while achieving a higher level of security than both of them.

Parameter	Ref. [17]	Ref. [20]	Ref. [13]	This work (with glitches)	This work
Countermeasure	No	Masking	Full-PAH	PIU+masking	Proposed
Technology (nm)	180	180	180	180	180
Delay (ns)	4	9.44	1.56	5.55	5.71
Area (GE)	373	635	3865	1558	2365
Energy (pJ)	24.56	152.93	5.6	24.08	26.87

Table 2. Comparison with other DPA-resistant S-boxes in terms of delay, energy, and cost.
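The percentages quoted in the text can be reproduced from the Table 2 entries with a short calculation. The dictionary below simply restates the table values; the variable and function names are illustrative.

```python
# Values restated from Table 2 (180 nm technology)
TABLE2 = {
    "Ref. [17] (unprotected)": {"delay_ns": 4.0, "area_ge": 373, "energy_pj": 24.56},
    "Ref. [20] (masking)": {"delay_ns": 9.44, "area_ge": 635, "energy_pj": 152.93},
    "Ref. [13] (full-PAH)": {"delay_ns": 1.56, "area_ge": 3865, "energy_pj": 5.6},
    "This work (with glitches)": {"delay_ns": 5.55, "area_ge": 1558, "energy_pj": 24.08},
    "This work": {"delay_ns": 5.71, "area_ge": 2365, "energy_pj": 26.87},
}

def percent_of(design, reference, key):
    """Value of `key` for `design` as a percentage of the same value for `reference`."""
    return round(100 * TABLE2[design][key] / TABLE2[reference][key], 1)

area_vs_full_pah = percent_of("This work", "Ref. [13] (full-PAH)", "area_ge")
energy_vs_masking = percent_of("This work", "Ref. [20] (masking)", "energy_pj")
area_vs_masking = percent_of("This work", "Ref. [20] (masking)", "area_ge")
energy_vs_full_pah = percent_of("This work", "Ref. [13] (full-PAH)", "energy_pj")
```

The first two ratios correspond to the 61.2% area and 17.6% energy figures in the text. The last two quantify the stated drawbacks: the proposed design is several times larger than the masking design and consumes several times more energy than the full-PAH design.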


5. Conclusion

Applying the PAH technique to the inversion in GF(2⁴) in a masked tower-field implementation of the AES S-box achieves higher security and a good tradeoff between energy efficiency and cost. Based on the wave pipeline structure, an enable-based glitch-eliminating method can be used to further improve the security of the masked part. Implemented in a 180 nm process, the design achieves an energy of 26.87 pJ per operation, an area of 2365 gate equivalents, and no leakage detected by the 10 000-trace t-test. It provides a high-security and overhead-balanced option for AES S-box implementation. In the future, we will study its performance at low supply voltages to explore energy-efficiency optimization through voltage scaling.

Acknowledgements

This work was supported by the National Science and Technology Major Project of China (2017ZX01030301).