1.
Introduction
Side-channel attack (SCA) resistance is an important issue in the design of cryptographic integrated circuits for many applications, including Internet of Things (IoT) nodes[1]. Accordingly, SCA countermeasures must be employed in most IoT devices for security and privacy. However, IoT devices are usually lightweight systems. Therefore, the countermeasures for IoT devices must be secure, energy-efficient, and low-cost at the same time.
The advanced encryption standard (AES) is widely used in IoT systems[2] and the key point for addressing these challenges for AES is usually to realize a low-power, low-cost, and secure S-box.
The state of the art uses key updating[3], which is a leakage-resilience method. Although its overhead is low, it requires both sides of the communication to use the same key update mechanism, which is not always realizable. The masking method randomizes the intermediates of a calculation by means of random masks, but it becomes complex when applied to nonlinear operations. Even though it can achieve a small area (e.g. the threshold implementation method[4-6]), its energy efficiency is still relatively low: threshold implementations of the AES S-box usually take tens of cycles for one computation. Concerning the hiding methods, most equalization methods[7] and noising methods[8] must consume extra energy as the compensation or noise component. Random voltage or clock dithering[9] has less overhead, but its effectiveness decreases when an enhanced alignment technique (e.g. elastic alignment[10]) is used. Ref. [11] does not address the internal power difference of the dual-rail flush logic (DRFL) gates or early propagation issues. Moreover, like most dual-rail DPA-resistant logic styles, it requires a new library of full-custom gates. The power-aware hiding (PAH) method[12, 13] is a new kind of power equalization technique that minimizes the compensation power and addresses the early propagation issue. Moreover, it supports a semi-custom design flow. However, its area is too large for IoT applications that are sensitive to cost (e.g. RFID tags).
We proposed a novel AES S-box implementation method in Ref. [14], which shrinks the scope of the PAH method in the design to dramatically decrease both the circuit area and the mismatch in the PAH block. As an extension of that conference paper, in addition to a more detailed discussion and evaluation of the PAH part, this paper presents a glitch-free implementation of the masked part of the proposed structure, which enhances the security of the whole design. On the whole, the new solution achieves a higher level of security with a low area and a high energy efficiency.
2.
Related countermeasures
2.1
Full-masking structure
The scheme developed by Canright[15] is a classic tower-field masking architecture. The AES S-box consists of two substeps: inversion in GF($2^8$) and an affine transformation. The inversion in GF($2^8$) can be calculated in the subfield GF($2^4$), which in turn involves operations in GF($2^2$), to minimize the area. The masked inversion calculation in the Canright approach can be depicted as in Eqs. (1)–(5), where
$$\widetilde{B} = Q \oplus N \otimes (\widetilde{A_1} \oplus \widetilde{A_0})^2 \oplus N \otimes (M_1 \oplus M_0)^2 \oplus \widetilde{A_1} \otimes \widetilde{A_0} \oplus \widetilde{A_1} \otimes M_0 \oplus \widetilde{A_0} \otimes M_1 \oplus M_1 \otimes M_0,$$ | (1) |
$$\widetilde{B^{-1}} = {\rm{Inv}}^*(\tilde B, Q, r),$$ | (2) |
$$\widetilde{B_2^{-1}} = \widetilde{B^{-1}} \oplus \left( M_0 \oplus M_1 \right),$$ | (3) |
$$\widetilde{A_1^{-1}} = Q \oplus \widetilde{A_0} \otimes \widetilde{B^{-1}} \oplus \widetilde{A_0} \otimes M_1 \oplus M_0 \otimes \widetilde{B^{-1}} \oplus M_0 \otimes M_1 \oplus (M_1 \oplus Q),$$ | (4) |
$$\widetilde{A_0^{-1}} = Q \oplus \widetilde{A_1} \otimes \widetilde{B_2^{-1}} \oplus \widetilde{A_1} \otimes M_0 \oplus M_1 \otimes \widetilde{B_2^{-1}} \oplus M_1 \otimes M_0 \oplus (M_0 \oplus Q).$$ | (5) |
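As a consistency check on Eq. (1), the masked terms can be expanded term by term. The expansion below assumes that $\widetilde{A_i} = A_i \oplus M_i$ for $i \in \{0, 1\}$ (this relation is an assumption here, since the symbol definitions are not given above), and it uses the fact that squaring is linear in a field of characteristic 2.

$$\begin{aligned}
N \otimes (\widetilde{A_1} \oplus \widetilde{A_0})^2 \oplus N \otimes (M_1 \oplus M_0)^2 &= N \otimes (A_1 \oplus A_0)^2, \\
\widetilde{A_1} \otimes \widetilde{A_0} \oplus \widetilde{A_1} \otimes M_0 \oplus \widetilde{A_0} \otimes M_1 \oplus M_1 \otimes M_0 &= A_1 \otimes A_0, \\
\widetilde{B} &= Q \oplus N \otimes (A_1 \oplus A_0)^2 \oplus A_1 \otimes A_0 .
\end{aligned}$$

Under this assumption, $\widetilde{B}$ is the value $N \otimes (A_1 \oplus A_0)^2 \oplus A_1 \otimes A_0$ masked by $Q$, and all cross terms involving $M_1$ and $M_0$ cancel.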
2.2
Power-aware hiding technique
The power-aware hiding technique implements an N-bit substitution function as a lookup table (LUT). The function output is extended by a flag bit and N/2 compensation bits. The LUT is generated according to the following rules: when the Hamming weight (HW) of the original result is greater than N/2, the result word in the table is its inverted value and the flag bit is 1; otherwise, the result word in the table is the original value and the flag bit is 0. The compensation bits contain as many ones as needed for the HW of the whole output word to be N/2. The method presented in Refs. [12, 13] implements the entire design as a single PAH function (termed the “full-PAH” implementation in this paper). This implementation has advantages in terms of power-delay product (PDP) and performance, but its area overhead is too high for cost-sensitive applications. More importantly, it has many 4-input AND gates, so its stack effect is relatively severe, which is a main source of mismatch in the circuit. Additionally, a larger block usually has larger clock skews between the gates. In dynamic logic, clock skew causes data-dependent differences in timing and, consequently, in power. For these two reasons, a larger PAH array has a more significant power difference.
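The LUT generation rules above can be illustrated with a minimal Python sketch. It assumes that N is even and that the flag bit counts toward the total Hamming weight (an interpretation that makes N/2 compensation bits sufficient in every case). The function names `pah_encode` and `pah_decode`, and the choice of which compensation bit positions carry the ones, are hypothetical and not taken from Refs. [12, 13].

```python
def popcount(x: int) -> int:
    """Hamming weight of a non-negative integer."""
    return bin(x).count("1")


def pah_encode(value: int, n: int):
    """Return (word, flag, comp) for one LUT entry of an n-bit function output.

    Assumption: n is even and the flag bit is included in the constant
    total Hamming weight of n/2.
    """
    half = n // 2
    all_ones = (1 << n) - 1
    if popcount(value) > half:
        word, flag = value ^ all_ones, 1   # store the inverted value
    else:
        word, flag = value, 0              # store the original value
    ones = half - popcount(word) - flag    # ones still missing to reach n/2
    comp = (1 << ones) - 1                 # bit positions are an arbitrary choice
    return word, flag, comp


def pah_decode(word: int, flag: int, n: int) -> int:
    """Recover the original function output from the stored word and flag."""
    return word ^ ((1 << n) - 1) if flag else word


if __name__ == "__main__":
    for v in (0b0000, 0b0011, 0b0111, 0b1111):
        print(format(v, "04b"), pah_encode(v, 4))
```

With these rules every table entry has the same total Hamming weight regardless of the data, which is the property the hiding technique relies on.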
3.
Masked AES S-box with PAH GF($2^4$) inverter
3.1
PAH-masking mixed architecture
In the proposed architecture, the inversion in GF($2^4$) is implemented by means of PAH rather than by the masked subfield GF($2^2$) calculation. The other operations still use the masking approach. Hence, the proposed implementation is given by Eqs. (4)–(10), where ${\rm{Inv}}^h()$ represents the inversion in GF($2^4$) implemented in the PAH manner. The input of the PAH inversion unit (PIU) must be unmasked before entering the PIU. The output of PIU (
$$h_1 = Q \oplus N \otimes (\widetilde{A_1} \oplus \widetilde{A_0})^2 \oplus N \otimes (M_1 \oplus M_0)^2 \oplus \widetilde{A_1} \otimes \widetilde{A_0} \oplus \widetilde{A_1} \otimes M_0 \oplus \widetilde{A_0} \otimes M_1,$$ | (6) |
$$h_2 = Q \oplus M_1 \otimes M_0,$$ | (7) |
$$B^{-1} = {\rm{Inv}}^h(h_1 \oplus h_2),$$ | (8) |
$$\widetilde{B^{-1}} = B^{-1} \oplus M_1,$$ | (9) |
$$\widetilde{B_2^{-1}} = B^{-1} \oplus M_0.$$ | (10) |
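The unmasking step implied by Eqs. (6)–(8) can be checked exhaustively with a short Python sketch. It assumes that the masked inputs are $\widetilde{A_i} = A_i \oplus M_i$ and that GF($2^4$) is represented in a polynomial basis with $x^4 + x + 1$. Both are assumptions made for illustration (the tower-field design uses its own basis), but the identity tested here depends only on the field having characteristic 2. All function names are hypothetical.

```python
from itertools import product


def gf16_mul(a: int, b: int) -> int:
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed representation)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13  # reduce by x^4 + x + 1
    return r


# Precomputed multiplication table to keep the exhaustive check fast.
MUL = [[gf16_mul(a, b) for b in range(16)] for a in range(16)]


def unmasked_b(a1, a0, n):
    """B = N*(A1 + A0)^2 + A1*A0, the value fed to the inverter."""
    s = a1 ^ a0
    return MUL[n][MUL[s][s]] ^ MUL[a1][a0]


def h1(a1m, a0m, m1, m0, q, n):
    """Eq. (6), computed on masked inputs a1m, a0m."""
    s, t = a1m ^ a0m, m1 ^ m0
    return (q ^ MUL[n][MUL[s][s]] ^ MUL[n][MUL[t][t]]
            ^ MUL[a1m][a0m] ^ MUL[a1m][m0] ^ MUL[a0m][m1])


def h2(m1, m0, q):
    """Eq. (7)."""
    return q ^ MUL[m1][m0]


def identity_holds(n: int, q: int) -> bool:
    """Check h1 + h2 == B for every A1, A0, M1, M0 in GF(2^4)."""
    for a1, a0, m1, m0 in product(range(16), repeat=4):
        a1m, a0m = a1 ^ m1, a0 ^ m0  # assumed masking relation
        if h1(a1m, a0m, m1, m0, q, n) ^ h2(m1, m0, q) != unmasked_b(a1, a0, n):
            return False
    return True


if __name__ == "__main__":
    print(identity_holds(n=0x9, q=0x5))
```

Under these assumptions, $h_1 \oplus h_2$ equals the unmasked inverter input for any constant $N$ and any mask $Q$, which is why the PIU input is unmasked and must be handled by power-balanced logic.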
PIU operates in two phases: precharge and evaluation (see Section 3.2). Therefore, the proposed S-box module is implemented as a two-phase pipeline. As shown in Fig. 1, there are two register stages, R1 and R2, which divide the datapath into three stages. Fig. 2 gives its timing diagram. The phase of the pipeline is controlled by a pair of inverse signals switching at every rising clock edge:
Figure 1. (Color online) PAH-masking mixed S-box structure (modified from Ref. [14]).
Figure 2. (Color online) PAH-masking mixed S-box timing diagram (modified from Ref. [14]).
3.2
PAH GF($2^4$) inversion unit
The circuit of the PIU is a 16-entry LUT implemented as a domino logic array. The signal
Figure 3. (Color online) PAH inversion in GF($2^4$) unit. (a) Circuit[14]. (b) Layout (62 × 56 μm²).
To control for the difference of internal power, all gates in the same plane have identical fan-in, fan-out and size. For the OR gates, of which actual fan-ins are fewer than five, the unused input pins are connected to ground (dotted line transistors in Fig. 3(a)). In addition, the flip-flops of R2 employ the same kind of cells, so that the power consumption of the output data transition is constant.
Consequently, the PIU has only 16 2-input AND gates and 7 5-input OR gates, whereas the full-PAH S-box has 256 4-input AND gates and 13 93-input OR gates[12]. A smaller number of entries not only saves area but also benefits clock skew minimization because the overall routing length of the clock network is shorter. Clock skew causes a mismatch of the evaluation or precharge time points of the gates, and hence results in differences in the power traces. In addition, the fan-in of the AND gates in the PIU is smaller than that in the full-PAH S-box, and thus the mismatch due to the stack effect is much smaller. These advantages decrease the power difference of the PAH array. Another benefit of the smaller fan-in of the AND gates is a lower minimum operating supply voltage. Lowering the supply voltage is an important technique for energy-efficient circuits. This circuit operates down to a supply voltage of 0.4 V, a near-threshold value in the range where energy efficiency is typically highest.
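The OR-gate counts quoted above can be related to the PAH output format with a small Python sketch. It assumes that the LUT is an AND-OR array in which each table entry drives one OR input for every output bit that is 1 in that entry, and that every entry has a total Hamming weight of N/2 (flag included). Both points are interpretations, and the function names are hypothetical.

```python
def pah_output_bits(n: int) -> int:
    """Output width: n data bits, one flag bit, n/2 compensation bits."""
    return n + 1 + n // 2


def min_largest_or_fan_in(n: int) -> int:
    """Lower bound on the fan-in of the widest OR gate (pigeonhole argument).

    Assumption: each of the 2**n entries contributes exactly n/2 ones,
    and each one is an input of the OR gate of its output column.
    """
    total_ones = (1 << n) * (n // 2)
    width = pah_output_bits(n)
    return -(-total_ones // width)  # ceiling division


if __name__ == "__main__":
    for n in (4, 8):
        print(n, pah_output_bits(n), min_largest_or_fan_in(n))
    # prints: 4 7 5
    #         8 13 79
```

The output widths of 7 and 13 match the numbers of OR gates reported for the PIU and the full-PAH S-box, and the bounds of 5 and 79 are consistent with the reported 5-input and 93-input OR gates.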
3.3
Un/Remasking circuits
Unmasking and remasking must be implemented with power-balanced circuits because unmasked intermediates appear in them. Therefore, the inputs of UMU, h1 and h2, are converted into 1-of-4 code in advance to facilitate power equalization because the XOR logic of the 1-of-4 data has a symmetric structure and is glitch-free. In RMU,
To power-balance the RMU circuit, all of its XOR gates use the same cell. However, the XOR cells in the standard cell library do not meet this requirement. Their structure is shown in Fig. 4(a). The loads of the two input pins are different, so the transition power consumption of the gate depends on which input pin switches. To eliminate this power difference, a symmetric XOR gate is used instead, as shown in Fig. 4(b). In the symmetric gate, two identical XOR gates (whose input pins are cross-connected) are combined in parallel. Thus, the capacitance at the two inputs is symmetric.
Figure 4. Modification of XOR gates in RMU.
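The 1-of-4 code used at the UMU inputs can be illustrated with a minimal Python sketch. It assumes a one-hot mapping in which wire i is high for the 2-bit value i; the actual wire assignment in the design is not given in the text, and the function names are hypothetical.

```python
def to_1of4(v: int):
    """Encode a 2-bit value as a one-hot symbol on four wires."""
    return tuple(1 if i == v else 0 for i in range(4))


def from_1of4(s) -> int:
    """Decode a one-hot symbol back to its 2-bit value."""
    return s.index(1)


def xor_1of4(a, b):
    """XOR of two 1-of-4 symbols.

    Output wire k is high when wires a[i] and b[i ^ k] are both high for
    some i, so each output wire is built from the same AND-OR structure.
    """
    return tuple(
        int(any(a[i] and b[i ^ k] for i in range(4)))
        for k in range(4)
    )


if __name__ == "__main__":
    print(xor_1of4(to_1of4(0b01), to_1of4(0b11)))  # (0, 0, 1, 0), i.e. value 2
```

Every codeword has exactly one high wire, and all four output wires have the same logic structure, which is what makes this representation convenient for power equalization.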
3.4
Glitch elimination in the masking domain
Glitches in the combinational logic pose a non-negligible threat to masked implementations, especially to XOR-chain structures[16]. Therefore, the terms in Eqs. (4)–(6) must be added individually in a proper sequence[15], so that each equation forms an XOR-chain. To suppress glitches in the RMU and UMU, their inputs are synchronized by registers, as shown in Fig. 1. Meanwhile, an enable signal is introduced into the XOR-chains to eliminate the dangerous glitches in them.
The glitch-free structure of the chain that calculates
Figure 5. Structure of
4.
Implementation results
The proposed S-box has been implemented in a 180 nm technology. The PIU (including circuit and layout) was generated with a macro compiler. The RTL description of the entire S-box was first synthesized to obtain a standard-cell-based initial gate-level netlist. Next, the obtained netlist was modified manually, as follows: (1) the XOR-chains were replaced with the glitch-free structure shown in Fig. 5, where the critical path in the initial netlist was cloned as the delay matching logic; (2) the cells of the UMU and RMU were modified so that all bit slices are identical. Then, after validation, the modified gate-level netlist was transformed into a SPICE netlist, in which the GF($2^4$) inverter made of standard cells was replaced with the transistor-level PIU netlist. Finally, the transistor-level design was verified, evaluated, and analyzed through SPICE simulation. For comparison, the unprotected Canright S-box[17], the full-masking design, and the full-PAH[13] design were also implemented in the same technology.
Concerning the security of the implemented S-box, the maximum $\left| t \right|$ … $\left| t \right| = 4.5$
Figure 6. (Color online) Nonspecific t-test results (with 10 000 traces). (a) Unprotected. (b) Full-masking. (c) Full-PAH. (d) Proposed.
The maximum $\left| t \right|$
Design | Ref. [17] | Full-masking | Ref. [13] | Ref. [9] | This work (with glitches) | This work |
Countermeasure | Unprotected | Full-masking | Full-PAH | RFVD | PAH-masking | PAH-glitch-free masking |
${\left| t \right|_{\rm{max}}}$ | 69.8 | 25.2 | 37.2 | 4.2 | 14.4 | 3.9 |
Table 1. Comparison with other DPA-resistant S-boxes in terms of security.
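For readers unfamiliar with the metric in Table 1, the following Python sketch shows one common way to compute a maximum $\left| t \right|$ value: Welch's t-statistic between a fixed-input and a random-input trace set, evaluated at every sample point. Whether the evaluation above used exactly this formulation is an assumption; the function names and the trace layout (a list of equal-length sample lists) are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

T_THRESHOLD = 4.5  # detection threshold used with the values in Table 1


def welch_t(a, b) -> float:
    """Welch's t-statistic between two sample groups at one time point.

    Note: raises ZeroDivisionError if both groups have zero variance.
    """
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def max_abs_t(fixed_traces, random_traces) -> float:
    """Largest |t| over all sample points of the two trace sets."""
    n_points = len(fixed_traces[0])
    return max(
        abs(welch_t([tr[i] for tr in fixed_traces],
                    [tr[i] for tr in random_traces]))
        for i in range(n_points)
    )


if __name__ == "__main__":
    # Toy data: the second sample point differs in mean between the sets.
    fixed = [[0.0, 0.0], [1.0, 1.0]] * 50
    rand = [[0.0, 1.0], [1.0, 2.0]] * 50
    t_max = max_abs_t(fixed, rand)
    print(t_max > T_THRESHOLD)  # True: this toy data would count as leaking
```

A design is usually regarded as showing no detectable first-order leakage at the tested number of traces when the maximum stays below the threshold.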
Additionally, the measurements to disclosure (MTD) values of this work and the unprotected S-box were measured by the first order moments-correlating profiled DPA (MCP-DPA)[19], in which the mean traces of different data were estimated through a random dataset and then the correlation power analysis based on Pearson's correlation of samples with the profiled mean values was performed on another random dataset. The result of the MCP-DPA on the proposed implementation using 665 000 traces is shown in Fig. 7. The correct key has not been revealed using these traces, so its MTD is greater than 665 000. Note that these traces have no noise because they are collected by simulation. For a real chip, the MTD value will be much larger.
Figure 7. (Color online) MCP-DPA attack result (332 500 traces for profiling and 332 500 traces for correlation).
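The two-step procedure of a first-order MCP-DPA (profile the mean leakage per data value, then correlate a second trace set with the profiled means) can be sketched in Python on toy data. Everything in the sketch is hypothetical and chosen for illustration only: a 4-bit S-box (the PRESENT S-box) stands in for the AES S-box, each trace is a single sample, the leakage is a Hamming weight plus Gaussian noise, and the profiling key is assumed to be known.

```python
import random

# Toy stand-in for the target: the 4-bit PRESENT S-box (assumption).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def simulate(key, n, rng, sigma=0.5):
    """Hypothetical leakage: HW of SBOX[p ^ key] plus Gaussian noise."""
    pts = [rng.randrange(16) for _ in range(n)]
    leak = [bin(SBOX[p ^ key]).count("1") + rng.gauss(0, sigma) for p in pts]
    return pts, leak


def profile_means(pts, leak, key):
    """Step 1: mean leakage for each S-box input x = p ^ key (key known)."""
    sums, counts = [0.0] * 16, [0] * 16
    for p, value in zip(pts, leak):
        sums[p ^ key] += value
        counts[p ^ key] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def mcp_dpa(pts, leak, means):
    """Step 2: correlate the samples with the profiled mean per key guess."""
    scores = [pearson([means[p ^ g] for p in pts], leak) for g in range(16)]
    best = max(range(16), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    rng = random.Random(1)
    means = profile_means(*simulate(0x3, 2000, rng), key=0x3)
    guess, _ = mcp_dpa(*simulate(0xA, 2000, rng), means)
    print(hex(guess))
```

On this unprotected toy target the correct key guess yields the highest correlation; for a protected design, the absence of such a distinguishing peak after all available traces is what the MTD bound above expresses.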
With respect to the performance and cost of the proposed method, Table 2 compares the delay (the latency of a lookup), the energy per operation, and the area of different DPA-resistant AES S-boxes. The latency of the proposed design is the sum of the delays of the three logic stages plus the overhead time (setup and propagation time) of the registers[14]. Its maximum clock frequency is 333 MHz. The following points can be observed: with the proposed method, the area shrinks to 61.2% of the full-PAH area, and the energy is approximately 17.6% of that of the improved masking solution proposed in Ref. [20]. The energy efficiency of the proposed S-box is worse than that of the full-PAH design, and its area is larger than that of the masking design. Nevertheless, compared with the other candidates, this work provides a more balanced tradeoff between cost and energy efficiency. Meanwhile, it achieves a higher level of security than both of them.
Parameter | Ref. [17] | Ref. [20] | Ref. [13] | This work (with glitches) | This work |
Countermeasure | No | Masking | Full-PAH | PIU+masking | Proposed |
Technology (nm) | 180 | 180 | 180 | 180 | 180 |
Delay (ns) | 4 | 9.44 | 1.56 | 5.55 | 5.71 |
Area (GE) | 373 | 635 | 3865 | 1558 | 2365 |
Energy (pJ) | 24.56 | 152.93 | 5.6 | 24.08 | 26.87 |
Table 2. Comparison with other DPA-resistant S-boxes in terms of delay, energy, and cost.
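The area and energy ratios quoted in the text follow directly from the entries of Table 2 (area of this work relative to Ref. [13], and energy of this work relative to Ref. [20]):

$$\frac{2365\ {\rm{GE}}}{3865\ {\rm{GE}}} \approx 0.612, \qquad \frac{26.87\ {\rm{pJ}}}{152.93\ {\rm{pJ}}} \approx 0.176 .$$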
5.
Conclusion
Applying the PAH technique to the inversion in GF($2^4$) in a masked tower-field implementation of the AES S-box realizes higher security and a good tradeoff between energy efficiency and cost. Based on the wave pipeline structure, an enable-based glitch-eliminating method can be used to further improve the security of the masked part. Implemented in a 180 nm process, the design achieves an energy of 26.87 pJ per operation, an area of 2365 gate equivalents, and no detectable leakage. It provides a high-security, overhead-balanced option for AES S-box implementation. In the future, we will study its performance under a low supply voltage to explore energy-efficiency optimization through voltage scaling.
Acknowledgements
This work was supported by the National Science and Technology Major Project of China (2017ZX01030301).