


# Microelectronics Journal




# A power scalable 2–10 Gb/s PI-based clock data recovery for multilane applications



Fangxu Lv<sup>a</sup>, Xuqiang Zheng<sup>b,\*</sup>, Feng Zhao<sup>c</sup>, Jianye Wang<sup>a</sup>, Shigang Yue<sup>d</sup>, Ziqiang Wang<sup>e</sup>, Weidong Cao<sup>f</sup>, Yajun He<sup>e</sup>, Chun Zhang<sup>e</sup>, Hanjun Jiang<sup>e</sup>, Zhihua Wang<sup>e</sup>

<sup>a</sup> Air and Missile Defense College, Air Force Engineering University, Xi'an 710051, China

<sup>b</sup> Institute of Microelectronics of Chinese Academy of Sciences, Beijing 100029, China

<sup>c</sup> Department of Computer Science, Liverpool John Moores University, Byrom Street, Liverpool L3 3AF, United Kingdom

<sup>d</sup> School of Computer Science, University of Lincoln, Lincoln LN6 7TS, United Kingdom

<sup>e</sup> Institute of Microelectronics, Tsinghua University, Beijing 100084, China

<sup>f</sup> Department of Electrical and System Engineering, Washington University, St. Louis, MO 63130, United States

ARTICLE INFO

Keywords: Clock data recovery (CDR); Global biasing strategy; Local clock conditioner; Phase interpolator (PI); Power scaling; Timing averaging (TA); Voltage-controlled delay line (VCDL)

#### ABSTRACT

This paper presents a power scalable clock data recovery (CDR) suitable for multilane and multirate applications. To make the power consumption scale with the data rate and guarantee appropriate edge overlaps for the phase interpolation, a delay-locked loop-based global biasing strategy is proposed to automatically adjust the bandwidth of the current-mode logic buffers and phase interpolator (PI). The I, Q clocks are generated by a local clock conditioner, which employs an open-loop voltage-controlled delay line to produce evenly spaced multiple phases and adopts a two-stage timing averaging to correct the duty cycle distortion and I, Q mismatch. Additionally, a phase-compensating technique is adopted in the PI to optimize its linearity. Implemented in a 65-nm CMOS process with an area occupation of 0.12 mm<sup>2</sup>, the presented CDR can operate from 2 to 10 Gb/s with a scalable power consumption from 11 to 42 mW. When it operates at 10 Gb/s, the maximum tolerable amplitude of the sinusoidal jitter at 50 MHz is 0.52 UIpp, and the total jitter of the recovered clock is 16.6 ps at a BER of 1e-12.

# 1. Introduction

High-density multirate serial links are playing increasingly important roles in modern communication networks, because they can satisfy the aggregate bandwidth demand by integrating a large number of parallel lanes, and accommodate different protocols by adjusting the operation rates [1-5]. As one of the most important components in these serial links, the clock data recovery (CDR) is quite challenging to design since it not only involves the highest-speed data slicing but also needs to produce high-quality sampling clocks [6-8]. Compared to other topologies, the phase interpolator (PI)-based CDR is more suitable for such applications because of the following advantages. First, it exhibits high power efficiency and area efficiency owing to the clock sharing capability and the compact implementation of the digital loop filter [9]. Second, the inter-channel crosstalk can be effectively reduced since the multiple voltage-controlled oscillators (VCOs) required in VCO-based CDRs are excluded [10]. Third, the operation rate can be flexibly adjusted by changing the frequency of the shared clock [9,11].

However, it is a nontrivial task to design a wide-range PI-based CDR with both high power efficiency and small area occupation. The main difficulty is maintaining high power efficiency over a variety of data rates, which means the CDR should automatically adjust its power dissipation according to the operation rate. Wei et al. [12] employed an adaptively regulated supply to optimize the power efficiency of the digital CMOS circuits. Nevertheless, the reduced supply decreases the voltage margin, which degrades the operation robustness or even causes bit errors in harsh environments. In Ref. [13], replica-biased symmetric load-based current mode logic (CML) circuits are utilized to obtain the desired gain/bandwidth scaling. Since the power scaling is realized by manually adjusting the bias currents, it is difficult to achieve the optimal power efficiency. Another difficulty is generating high-quality clocks for the phase interpolation. For the traditional quadrature PI (the preferred choice due to its compact implementation and robust operation), the linearity of the phase interpolation can be deteriorated by the I, Q mismatch, clock duty cycle distortion (DCD), and inadequate edge overlap of the input clocks [14-16]. To mitigate these effects, local duty cycle correction (DCC), I, Q phase correction, and slew-rate calibration are usually utilized to produce qualified I, Q clocks [17–19]. Nonetheless, these existing techniques are limited to a narrow operation range. Finally, the CDR performance is also constrained by the phase-step non-uniformity [14]. Although the octagonal PI exhibits a higher linearity than the traditional quadrature PIs, it needs a complex phase rotator circuit to generate the octagonal phase constellation.

<sup>\*</sup> Corresponding author. *E-mail address:* zhengxuqiang@ime.ac.cn (X. Zheng).

https://doi.org/10.1016/j.mejo.2018.10.007

Received 1 March 2018; Received in revised form 16 September 2018; Accepted 19 October 2018; Available online 27 October 2018. 0026-2692/© 2018 Elsevier Ltd. All rights reserved.

To address the above-mentioned issues, a delay-locked loop (DLL)-based global biasing strategy is proposed in this work to provide an adaptively adjusted bias that supports a wide operation range with a scalable power consumption. Besides, a voltage-controlled delay line (VCDL)-based local clock conditioner is developed to perform the desired functions of DCC, I, Q mismatch adjustment, and slew-rate calibration. In addition, a novel compensating PI is employed to optimize the linearity of the phase interpolation.

The remainder of this paper is organized as follows. Section 2 presents the devised CDR architecture. Crucial block designs are described in Section 3, focusing on the clock pre-amplifier, the local clock conditioner, and the compensating PI. The measurement results are given in Section 4 and Section 5 concludes this paper.

# 2. Proposed CDR architecture

Fig. 1 depicts the block diagram of the proposed CDR within a multilane receiver. The input clock (CLK\_P/CLK\_N) and the input data (RX0\_P/RX0\_N ... RXn\_P/RXn\_N) are received from the transmitter. A separate common lane containing a self-biased DLL and a clock pre-amplifier with DCC is utilized to produce the global bias and pre-rectify the received half-rate clocks. A CDR loop consisting of a local clock conditioner, a compensating PI, two buffers, four samplers, a demux, and a clock recovery unit (CRU) logic is integrated into each data lane to retime and demux the input data. The local clock conditioner, which involves an open-loop VCDL and a two-stage time averaging (TA), is used to generate qualified I, Q clocks for the phase interpolation. The gray blocks in Fig. 1 are implemented in PMOS symmetric load-based CMLs to support a wide operation range with scalable power consumption. The symmetric loads consist of a diode-connected PMOS device in shunt with an equally sized biased PMOS device. In this design, the swing of the delay cell is adjusted to 0.89(VDD-VBP) to mitigate the asymmetry caused by the short channel effect.

Compared to traditional receiver architecture, the main feature of this design is the shared DLL-based biasing strategy. As described in Fig. 1, the global bias is acquired by filtering the VCDL bias generated by the shared DLL. This strategy brings in several benefits.



Fig. 1. Proposed CDR architecture for multi-channel applications.

- The elimination of local filter capacitors in the data lane remarkably improves the area efficiency, which is very important for multichannel applications.
- The jitter amplification in the VCDL is reduced by inserting a low-pass filter (LPF) between the local VCDL and self-biased DLL (see Fig. 1).
- The power consumption is scalable with respect to the data rate since the bias is adaptively adjusted by the shared DLL according to the operation rate.
- The slew rate of the PI's input clocks is also adjusted automatically in accordance with the data rate, so proper I, Q overlap and PI bandwidth can always be achieved to support a wide operation range.

# 3. Crucial block designs

#### 3.1. Clock pre-amplifier

Fig. 2 shows the block diagram of the clock pre-amplifier. It is made up of an AC coupler, four cascaded buffers, an LPF, and an amplifier (AMP). The circuit details are given in Fig. 3. In principle, the AC coupler is used to reject the input clock DCD, where the DC voltages of the differential output are adjusted by the feedback loop. The four cascaded buffers are utilized to sharpen the edges of the clock and provide the driving ability. The LPF consisting of large resistors and small capacitors is applied to extract the DC voltages of clocks CP/CN. The difference between the two DC voltages is amplified by the high-gain AMP and then



Fig. 2. Architecture of the clock pre-amplifier.





**Fig. 3.** Circuit details of the clock pre-amplifier. (a) AC coupler, (b) LPF, and (c) AMP.

fed into the AC coupler to adjust the DC voltages of CKON/CKOP. By forcing the DC voltages of the output clocks CP/CN to be equal, a 50% duty cycle can be attained. To set a proper common-mode voltage on CKON/CKOP in the AC coupler in Fig. 3(a), a common-mode feedback path is also integrated in the AMP [see Fig. 3(c)]. VCMREF is set manually through an external pin and is fixed to 900 mV in this design. By making the voltage VCMFB extracted by the AC coupler equal to VCMREF, appropriate common-mode voltages of CKON/CKOP can be attained.

There are two obvious poles in this design, which are contributed by the LPF and the AMP. To produce stable bias voltages for the differential input clock, we place the dominant pole at the AMP output by adopting a second order loop filter with a large C1, where the zero created by R1 and C1 is used to neutralize the secondary pole contributed by the LPF [see Fig. 3(b)]. This scheme allows a large bandwidth, which enables the DCC loop to track the fast fluctuation on the DC voltage. Moreover, a large bandwidth means a small filter capacitance, thus helping to reduce the area occupation.

#### 3.2. Local clock conditioner

Fig. 4 displays the implementation details of the local clock conditioner. Its main function is to generate qualified I, Q clocks for different data rates. In order to guarantee a highly linear phase interpolation, the I, Q clocks should possess several properties, including accurate I, Q matching, low DCD, and well-conditioned slew rate. In this work, the VCDL generates a set of equally spaced clocks, which are directly fed into the first stage of the time-averaging buffers, where the DCD is reduced. These output clocks are then applied to the second time-averaging stage, which improves the phase separation uniformity. As a result, the I, Q mismatch can be corrected. Besides, the delay cell and TA are both implemented in the PMOS symmetric load-based CML (see the sub-block diagram in Fig. 1) and driven by the self-biased DLL. Thus, their bandwidth can be automatically adjusted to obtain proper slew rates for different data rates. Fig. 5 describes the details of the time-averaging implementation, which averages the arrival times of the two adjacent input signals. Fig. 6 displays the delay of the Dcell with respect to the bias (VBP) of the PMOS symmetric load. As the bias changes from 310 mV to 610 mV, the Dcell delay varies from 21.6 ps to 71.4 ps.

Compared to [18], the local VCDL in this design is biased by the global biasing voltage generated through the shared DLL, thus the local feedback loop that requires a large area (mainly occupied by the filter capacitor) is removed. This is critical to multi-channel applications, because a large area occupation means a long-distance clock distribution, which deteriorates the clock jitter and needs more power to drive the heavier load. Note that the potential fabrication mismatch, inconsistent



Fig. 4. Local clock conditioner based on the open-loop VCDL.



Fig. 5. Circuit implementation of the TA.



Fig. 6. Relationship of bias-delay of the Dcell.

voltage drop, and different ambient temperatures may make the delay of the open-loop VCDL deviate from its ideal value. Meanwhile, the duty cycle of the clock may drift away from 50% after a long distance distribution. In this design, these two problems are solved by the following two-stage TA, where the DCD is corrected in the first stage and the I, Q mismatch is calibrated in the second stage.

Fig. 7. Simulation waveforms of the first-stage TA.

To demonstrate the working principles and calibration effects of the DCC, Fig. 7 presents the simulation waveforms of the first-stage TA, which is driven by a 5 GHz input clock with a 46% duty cycle and biased by a specified voltage introducing a 10.5% delay error. The top and middle rows display the waveforms of the input clocks CK1N/CK1P and CK5N/CK5P, while the bottom row depicts the output clocks CK0/CK180. Note that CK5N/CK5P are the delayed version of CK1N/CK1P and the delay is supposed to be T/2. To clearly illustrate the DCC operation of the first-stage TA, a time-domain coordinate system is adopted in Fig. 7. As a consequence, we can get the following equations,

$$T = t_1(c) - t_1(a) = t_2(c) - t_2(a),$$
(1)

$$t_1(b) - t_1(a) = t_2(c) - t_2(b) = T/2 + \Delta \tau_d,$$
 (2)

$$t_1(c) - t_1(b) = t_2(b) - t_2(a) = T/2 - \Delta \tau_d,$$
(3)

where *T* is the clock period and  $\Delta \tau_d$  is the duty cycle error existing in the input clock. Considering the fact that the function of the TA is to time-average the crossing point of the two input clocks, the coordinates of  $t_3(i)$ , (i = a, b, c) can be separately obtained by,

$$t_3(i) = [t_2(i) + t_1(i)]/2 + t_{d1}, (i = a, b, c),$$
(4)

where  $t_{d1}$  is the delay of the first-stage TA. Then, the time spacing between the crossing points  $t_3(a)$  and  $t_3(b)$  can be attained by,

$$t_{3}(b) - t_{3}(a) = \{ [t_{1}(b) - t_{1}(a)] + [t_{2}(b) - t_{2}(a)] \}/2 = \{ [T/2 + \Delta\tau_d] + [T/2 - \Delta\tau_d] \}/2 = T/2.$$
(5)

Clearly, the delay between the two crossing points is theoretically equal to half of the clock period, independent of $\Delta \tau_d$ and $t_{d1}$. Consequently, the DCD can be corrected. Simulation results show that the duty cycle is improved from 46% to 50.1%.
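The derivation in (1)-(5) can be checked numerically with a minimal sketch. It is an idealized model of the averaging step in (4), not the circuit itself. The function name `first_stage_ta`, the TA delay of 20 ps, and the way the CK5 crossing points are built from a delayed complementary copy of CK1 are assumptions made for illustration; the 5 GHz clock, 46% duty cycle, and 10.5% delay error are taken from the text.

```python
def first_stage_ta(t1, t2, t_d1):
    """Average matching crossing times of two clocks, following Eq. (4)."""
    return [(a + b) / 2 + t_d1 for a, b in zip(t1, t2)]

T = 200.0          # clock period in ps (5 GHz input clock)
duty = 0.46        # input duty cycle quoted in the text
delay_err = 10.5   # assumed: 10.5% error on the nominal T/2 = 100 ps delay
t_d1 = 20.0        # assumed TA propagation delay in ps (hypothetical value)

d_tau_d = duty * T - T / 2   # duty cycle error, Eq. (2); -8 ps here

# Crossing points a, b, c of CK1.
t1 = [0.0, T / 2 + d_tau_d, T]
# Crossing points of CK5, modeled as the complementary edges of CK1
# delayed by T/2 + delay_err, so that Eq. (3) holds for t2.
t2 = [delay_err + d_tau_d, delay_err + T / 2, delay_err + d_tau_d + T]

t3 = first_stage_ta(t1, t2, t_d1)
out_period = t3[2] - t3[0]
out_duty = (t3[1] - t3[0]) / out_period
print(f"output duty cycle = {out_duty:.3f}")
```

In this ideal model the output duty cycle is exactly 50% regardless of `delay_err` and `t_d1`, whereas the simulated circuit reaches 50.1%, presumably because of non-ideal edge shapes.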

To illustrate the calibration process of the I, Q mismatch, Fig. 8 gives the simulation waveforms of the input and output of the second-stage TA. The second row of Fig. 8 depicts the four first-stage TA outputs (second-stage TA inputs), and the selected part is enlarged in the first row. The time spacings between adjacent clocks (CK0, CK45, CK90, and CK135) are actually the transport delays of the corresponding delay cells, which can be denoted by $\Delta \tau_1$, $\Delta \tau_2$, and $\Delta \tau_3$. Applying the time-domain coordinate system to Fig. 8, we can obtain,

$$t_2(a) - t_1(a) = \Delta \tau_1, \quad t_3(a) - t_2(a) = \Delta \tau_2, \quad t_4(a) - t_3(a) = \Delta \tau_3, \quad t_1(b) - t_1(a) = T/2.$$
(6)

Fig. 8. I, Q mismatch correction effect of the second-stage TA.

It is worth noting that the last formula in (6) is based on the fact that the clock duty cycle has been corrected by the first-stage TA. The third and fourth rows of Fig. 8 describe the second-stage TA outputs, where the coordinates of $t_5(a)$ and $t_6(a)$ can be respectively obtained by,

$$t_5(a) = [t_2(a) + t_3(a)]/2 + t_{d2},$$
(7)

$$t_6(a) = [t_4(a) + t_1(b)]/2 + t_{d2},$$
(8)

where  $t_{d2}$  is the delay of the second-stage TA. The I, Q time spacing can be calculated by,

$$\begin{aligned} t_{6}(a) - t_{5}(a) &= [t_{4}(a) + t_{1}(b)]/2 - [t_{2}(a) + t_{3}(a)]/2 \\ &= [t_{4}(a) - t_{3}(a)]/2 + [t_{1}(b) - t_{1}(a)]/2 - [t_{2}(a) - t_{1}(a)]/2 \\ &= \Delta \tau_{3}/2 + T/4 - \Delta \tau_{1}/2. \end{aligned}$$
(9)

Obviously, the sufficient condition to obtain perfectly matched quadrature I, Q clocks (i.e., a time spacing of T/4 between the I and Q clocks) is $\Delta \tau_3 = \Delta \tau_1$. Referring to Fig. 4, the input clocks (CK0/CK180, CK45/CK225, CK90/CK270, and CK135/CK315) are produced by averaging pairs of the VCDL output clocks (CK1N/CK1P & CK5N/CK5P, CK2N/CK2P & CK6N/CK6P, CK3N/CK3P & CK7N/CK7P, and CK4N/CK4P & CK8N/CK8P). If the delays of the delay cells in the VCDL are identical, the condition $\Delta \tau_3 = \Delta \tau_1$ holds. In theory, this is the case because the delay cells share identical circuit implementations and bias conditions. In practice, special attention is paid to the layout design to make sure that each delay cell has the same driving load. Additionally, a two-stage pre-driver (see Fig. 4) is adopted to pre-shape the input clock waveform so that each delay cell sees similar input and output waveforms, thus mitigating the delay variations. Post-layout simulations indicate that the delay variations can be controlled under 3%. Therefore, it is reasonable to assume that the delays between the adjacent outputs of the four TA1s in Fig. 4 are the same. In consequence, a pair of well-matched I, Q clocks can be obtained. Simulation results demonstrate that a good I, Q time interval of 50.2 ps (ideally 50 ps, i.e., one quarter of a clock period) is achieved, correcting the 8.4% spacing error, which is consistent with the theoretical analysis. Fig. 9 further shows the Monte Carlo simulation results of the I, Q spacing, which reveal that the standard deviation is only 1.27 ps under fabrication mismatch. In general, the VCDL with a two-stage TA can provide accurate I, Q clocks with well-calibrated duty cycles.
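The result in (9) can be illustrated with a short numerical sketch of (6)-(8). The function name `iq_spacing` and the example delay values are hypothetical. In particular, modeling the 8.4% spacing error as a uniform 27.1 ps cell delay (instead of the ideal 25 ps at 5 GHz) is an assumption made only to exercise the formula.

```python
def iq_spacing(d_tau1, d_tau2, d_tau3, T, t_d2=0.0):
    """I, Q spacing after the second-stage TA, built from Eqs. (6)-(8)."""
    t1a = 0.0
    t2a = t1a + d_tau1
    t3a = t2a + d_tau2
    t4a = t3a + d_tau3
    t1b = t1a + T / 2               # duty cycle already corrected by stage 1
    t5a = (t2a + t3a) / 2 + t_d2    # Eq. (7)
    t6a = (t4a + t1b) / 2 + t_d2    # Eq. (8)
    return t6a - t5a

T = 200.0  # clock period in ps (5 GHz)

# Uniformly slow cells: the raw CK0-to-CK90 spacing is 54.2 ps (8.4% off),
# yet the averaged I, Q spacing returns to T/4.
matched = iq_spacing(27.1, 27.1, 27.1, T)

# Mismatched first and third cells leave a residual of (d_tau3 - d_tau1)/2.
mismatched = iq_spacing(27.1, 27.1, 27.9, T)

print(f"matched: {matched:.2f} ps, mismatched: {mismatched:.2f} ps")
```

The sketch shows why only the matching of the delay cells matters: a common delay error cancels, while a difference between $\Delta \tau_1$ and $\Delta \tau_3$ appears at half its value in the I, Q spacing.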

It is worth noting that the time-averaging effect can be deteriorated by inadequate edge overlap, large input clock DCD, and big delay error. In this design, a DCC is integrated into the pre-amplifier to pre-calibrate the DCD. Meanwhile, the global biasing strategy is applied to make the VCDL delay close to the targeted delay and guarantee proper edge transitions for different operation rates.



Fig. 9. Monte Carlo simulation results of the I, Q spacing.

#### 3.3. Compensating phase interpolator

A highly linear PI is desired to optimize the uniformity of the phase steps and reduce the deterministic jitter, which in turn increases the maximum tolerable amplitude of the input jitter. However, prominent differential nonlinearity (DNL) and integral nonlinearity (INL) exist in traditional PIs [1,14]. Therefore, a compensating PI with high linearity becomes a preferred solution [20]. Fig. 10(a) shows its topology, where two identical quadrature PIs are used to generate the 45°-spaced CK1 and CK2 by shifting eight steps (half of the total quadrant steps). These clocks are fed into the rectifying buffer to produce clocks with similar amplitudes, which are then phase-averaged by the succeeding TA. To demonstrate the linearity optimization, a mathematical analysis based on trigonometric function approximation is discussed as follows.

According to previous studies [1,14], the I, Q clocks can be approximated by,

$$CK_I = A \sin(2\pi f t), CK_Q = A \cos(2\pi f t).$$
(10)

The output of conventional PI1 can be computed by,

$$CK_{PI1}^{T} = (1 - \alpha)A\sin(2\pi ft) + \alpha A\cos(2\pi ft) = A_{1}^{T}\sin\left(2\pi ft + \theta_{1}^{T}\right),$$
(11)

$$\theta_1^T = \arctan\left(\frac{\alpha}{1-\alpha}\right),$$
 (12)

$$A_{1}^{T} = A\sqrt{\alpha^{2} + (1 - \alpha)^{2}},$$
(13)

where  $\alpha$  ( $0 \le \alpha \le 1$ ) is the ratio of the current phase code to the total phase steps in each quadrant. Similarly, the equation for PI2 can also be obtained. As shown in Fig. 11, the red dashed line is plotted according to (12). The curve presents an S-shape phase transfer characteristic, which results in a maximum DNL of  $1.81^{\circ}$ . This phase error will cause additional deterministic jitter to the recovered clock. Simultaneously, the prominent INL of up to  $4.06^{\circ}$  will be converted into recovered clock jitter for continuous phase tracking applications.
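The quoted nonlinearity figures can be reproduced from (12) with a small sketch. It assumes an endpoint-fit definition of INL and DNL over one quadrant with 16 steps; the helper names `conventional_phase_deg` and `inl_dnl` are hypothetical.

```python
import math

STEPS = 16  # phase steps per quadrant

def conventional_phase_deg(k, steps=STEPS):
    """Output phase of a conventional quadrature PI at code k, Eq. (12)."""
    alpha = k / steps
    return math.degrees(math.atan2(alpha, 1.0 - alpha))

def inl_dnl(phases):
    """Endpoint-fit INL and DNL (in degrees) of a phase transfer curve."""
    n = len(phases) - 1
    lsb = (phases[-1] - phases[0]) / n
    inl = [p - (phases[0] + i * lsb) for i, p in enumerate(phases)]
    dnl = [phases[i + 1] - phases[i] - lsb for i in range(n)]
    return inl, dnl

phases = [conventional_phase_deg(k) for k in range(STEPS + 1)]
inl, dnl = inl_dnl(phases)
max_inl = max(abs(x) for x in inl)
max_dnl = max(abs(x) for x in dnl)
print(f"max |INL| = {max_inl:.3f} deg, max |DNL| = {max_dnl:.3f} deg")
```

Under these assumptions, the largest INL occurs at a quarter of the quadrant and the largest DNL magnitude at the first and last steps, with values close to the 4.06° and 1.81° quoted above.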

For the compensating PI, when the decimal value of the phase code changes from 0 to 16 (the total steps in each quadrant), the relative phase relation of CK1 and CK2 can be respectively represented by the red dashed line and blue dotted line in Fig. 11, where CK2 is eight steps (half of the total quadrant steps) ahead of CK1. Therefore, the final output of the compensating PI can be expressed as,



Fig. 10. Compensating PI. (a) Functional implementation and (b) clock amplitude relationship.



Fig. 11. Transfer characteristics of the compensating PI.

$$\begin{aligned} CK_{PI}^{T} &= \frac{1}{2} \left[ A \sin(2\pi f t + \theta_{1}^{T}) + A \sin(2\pi f t + \theta_{2}^{T}) \right] \\ &= A^{P} \sin\left(2\pi f t + \frac{\theta_{1}^{T} + \theta_{2}^{T}}{2}\right) \\ &= A^{P} \sin(2\pi f t + \theta^{P}), \end{aligned}$$
(14)

where the clock phase  $\theta^{P}$  can be given by,

$$\theta^{P} = \frac{1}{2} \begin{cases} \arctan\left(\frac{\alpha}{1-\alpha}\right) + \arctan\left(\frac{\alpha+1/2}{1/2-\alpha}\right), & 0 \le \alpha \le 1/2, \\ \arctan\left(\frac{\alpha}{1-\alpha}\right) + \arctan\left(\frac{\alpha-1/2}{3/2-\alpha}\right) + \frac{\pi}{2}, & 1/2 \le \alpha \le 1. \end{cases}$$
(15)

The black solid line (CKO) in Fig. 11 illustrates the phase transfer characteristic based on (15), which indicates that a much more linear phase transfer line can be obtained by averaging the two conventional S-shape phase transfer curves. Fig. 12 gives the theoretical INL/DNL comparison between a traditional PI and the compensating one, where the INL and DNL are reduced from $4.06^{\circ}$ and $1.81^{\circ}$, respectively, to under $0.2^{\circ}$ for the 64-phase interpolation.
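A minimal sketch of (15) shows the same trend numerically. It assumes ideal sinusoidal clocks with equal amplitudes, an endpoint-fit INL/DNL definition, and 16 steps per quadrant; the helper names are hypothetical, and `atan2` is used in place of the arctangent of a ratio to avoid a division by zero at $\alpha = 1/2$.

```python
import math

STEPS = 16  # phase steps per quadrant

def quadrature_phase_deg(beta):
    """Phase of one quadrature PI for weight beta in [0, 2] (two quadrants)."""
    if beta <= 1.0:
        return math.degrees(math.atan2(beta, 1.0 - beta))
    return 90.0 + math.degrees(math.atan2(beta - 1.0, 2.0 - beta))

def compensating_phase_deg(k, steps=STEPS):
    """Averaged phase of CK1 and CK2, Eq. (15); CK2 leads by half a quadrant."""
    alpha = k / steps
    theta1 = quadrature_phase_deg(alpha)
    theta2 = quadrature_phase_deg(alpha + 0.5)
    return 0.5 * (theta1 + theta2)

def inl_dnl(phases):
    """Endpoint-fit INL and DNL (in degrees) of a phase transfer curve."""
    n = len(phases) - 1
    lsb = (phases[-1] - phases[0]) / n
    inl = [p - (phases[0] + i * lsb) for i, p in enumerate(phases)]
    dnl = [phases[i + 1] - phases[i] - lsb for i in range(n)]
    return inl, dnl

phases = [compensating_phase_deg(k) for k in range(STEPS + 1)]
inl, dnl = inl_dnl(phases)
max_inl = max(abs(x) for x in inl)
max_dnl = max(abs(x) for x in dnl)
print(f"max |INL| = {max_inl:.3f} deg, max |DNL| = {max_dnl:.3f} deg")
```

In this idealized model the averaged curve starts at 22.5° (a fixed offset that does not affect linearity) and both error metrics stay below 0.2°, in line with the theoretical comparison in Fig. 12.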

In addition, the output amplitude of the traditional PI varies with the control weight according to (13). As discussed in Ref. [21], a dynamic buffer can convert an amplitude fluctuation into a delay variation through amplitude modulation (AM) to phase modulation (PM) conversion, and the delay variation is approximately proportional to the square of the input-signal swing. Therefore, the delay variation of the buffer following the PI is not negligible. Theoretically, the maximum amplitude reduction of a conventional PI can reach 29.3%, occurring at half of the total steps in each quadrant. The DNL also reaches its maximum value under the same condition, so any extra delay caused by the AM-PM conversion directly increases the maximum DNL. Fortunately, the introduced phase-averaging technique can mitigate this delay variation since the amplitude variations of CK1 and CK2 complement each other, as shown in Fig. 10(b).
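The 29.3% figure and the complementary behavior of CK1 and CK2 follow directly from (13), as the sketch below illustrates. It normalizes A to 1 and assumes that every quadrant has the same amplitude profile; `pi_amplitude` is a hypothetical helper name.

```python
import math

STEPS = 16  # phase steps per quadrant

def pi_amplitude(k, steps=STEPS):
    """Normalized amplitude of a conventional quadrature PI, Eq. (13), A = 1."""
    alpha = (k % steps) / steps   # assumed: same profile in every quadrant
    return math.hypot(alpha, 1.0 - alpha)

worst = min(pi_amplitude(k) for k in range(STEPS + 1))
reduction = 1.0 - worst           # largest amplitude drop, at mid-quadrant

# CK2 is eight steps ahead of CK1, so one clock is near its maximum
# amplitude whenever the other is near its minimum.
for k in range(0, STEPS + 1, 4):
    a1 = pi_amplitude(k)
    a2 = pi_amplitude(k + STEPS // 2)
    print(f"code {k:2d}: CK1 = {a1:.3f}, CK2 = {a2:.3f}")
print(f"maximum amplitude reduction = {reduction:.1%}")
```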

Fig. 13 depicts the circuit implementation of the quadrature PI utilized in PI1 and PI2 (see Fig. 10). In comparison to traditional designs, the PMOS symmetric loads are employed as the PI loads and their equivalent resistance can be adjusted by changing the bias voltage VBP. Combining with the VBN adjustment, this design can realize the desired features of power scaling and slew-rate controlling. Meanwhile, the phase rotation is implemented by the current selection among 16 identical cells that are controlled by the thermometer code generated by the decoder using the quadrant selection code Ph<5:4> and the fine phase adjustment code Ph<3:0>. This arrangement eliminates the nonmonotonic issues. Additionally, a fixed current (half of the current cell) is added to prevent the clock paths from completely turning off, which may bring glitches into the output clock. It is worth noting that both the



Fig. 12. Theoretical INL/DNL comparison between (a) traditional PI and (b) compensating PI.



Fig. 13. Circuit details of the quadrature PI.

amplitude limiters and the time-averaging buffer utilize the PMOS symmetric loads to accomplish the power scaling.
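The code mapping described above can be sketched as follows. This is only an illustration of how a 6-bit phase code might split into a quadrant selection and a 16-cell thermometer code; the function name `decode_phase_code` is hypothetical, and the actual decoder, including how the weights are handled at quadrant boundaries, is not specified in the text.

```python
def decode_phase_code(ph):
    """Split a 6-bit phase code into (quadrant, 16-cell thermometer code)."""
    if not 0 <= ph < 64:
        raise ValueError("phase code must be in 0..63")
    quadrant = (ph >> 4) & 0x3    # Ph<5:4>: quadrant selection
    fine = ph & 0xF               # Ph<3:0>: fine phase adjustment
    # Thermometer code: 'fine' cells steer current to one clock path and the
    # remaining cells to the other (assumed mapping, for illustration).
    thermometer = [1] * fine + [0] * (16 - fine)
    return quadrant, thermometer

quadrant, cells = decode_phase_code(0b100101)
print(quadrant, cells)
```

Within a quadrant, consecutive codes differ in exactly one cell, which is the property that keeps the interpolation monotonic.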

Fig. 14 displays the simulated INL/DNL of the conventional PI and the compensating PI. It can be seen that the maximum INL and DNL of the compensating PI are $1.48^{\circ}$ and $1.04^{\circ}$, respectively, a dramatic improvement over the $9.91^{\circ}$ and $5.87^{\circ}$ of the traditional PI. Note that the simulation results are worse than the calculated ones in Fig. 12, mainly because of the inherent nonlinearity of the transistors and the non-sinusoidal waveform shapes of the clocks.

# 4. Measurement results

The prototype is fabricated in a 65-nm CMOS process. It contains a common lane, a data lane, and a test lane (see Fig. 15). The CDR along with the proposed techniques is integrated into the data lane, occupying an area of $0.12 \text{ mm}^2$. In this work, the Tektronix MSO73304DX scope is used to perform the clock jitter analysis, and the Agilent N4903B J-BERT is used to measure the CDR performance with a PRBS7 pattern.

Fig. 16 shows the power consumption in terms of the operation rate. It can be seen that the power consumption is roughly linear with the data rate from 2 to 10 Gb/s. In addition, the energy efficiency of the chip is 4.2



**Fig. 14.** Simulated INL/DNL performance comparison between (a) traditional PI and (b) compensating PI.



Fig. 15. Test chip micrograph.



Fig. 16. Power consumption versus operation rate.

pJ/bit at 10 Gb/s. The I, Q phase errors at different rates are plotted in Fig. 17. The measured results show that all the I, Q mismatches are under  $2^{\circ}$  from 1 to 6 GHz, providing a good I, Q accuracy for the CDR.

The recovered clock eyes utilizing the traditional PI and the compensating PI are compared in Fig. 18. As annotated in the diagrams, the total jitter (TJ) at BER = 1e-12 is reduced from 23.4 to 16.6 ps. Fig. 19 depicts that the optimized PI improves the JTOL of the CDR. Specifically,



Fig. 17. I, Q phase errors at different operation rates.



Fig. 18. Eye-diagram of the recovered clock. (a) Traditional PI and (b) compensating PI.



Fig. 19. Measured JTOL.

the sinusoidal jitter (SJ) amplitude tolerance of the compensating PI demonstrates a more than 10% improvement (from 0.47 to 0.52 UIpp).
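The headline figures can be cross-checked with simple arithmetic. The sketch below only converts numbers already reported in the paper (42 mW at 10 Gb/s, 11 mW at 2 Gb/s, 16.6 ps total jitter, and the 0.47 to 0.52 UIpp JTOL change); the helper names are hypothetical.

```python
def energy_per_bit_pj(power_mw, rate_gbps):
    """Energy efficiency in pJ/bit; mW divided by Gb/s equals pJ/bit."""
    return power_mw / rate_gbps

def ps_to_ui(t_ps, rate_gbps):
    """Convert a time in ps to unit intervals; one UI is 1000 / rate ps."""
    return t_ps * rate_gbps / 1000.0

e_10g = energy_per_bit_pj(42, 10)    # efficiency at 10 Gb/s
e_2g = energy_per_bit_pj(11, 2)      # efficiency at 2 Gb/s
tj_ui = ps_to_ui(16.6, 10)           # total jitter in UI at 10 Gb/s
jtol_gain = (0.52 - 0.47) / 0.47     # relative JTOL improvement

print(f"{e_10g:.1f} pJ/bit at 10 Gb/s, {e_2g:.1f} pJ/bit at 2 Gb/s")
print(f"TJ = {tj_ui:.3f} UI, JTOL improvement = {jtol_gain:.1%}")
```

The total jitter of about 0.17 UI matches the Table 1 entry, and the JTOL gain of roughly 10.6% is consistent with the "more than 10%" statement.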

Table 1 gives the comparison results between this work and two state-of-the-art designs, indicating that our design outperforms the other two in most metrics, especially the area occupation and JTOL at high frequency.

Table 1. Performance summary and comparison.

| Metric                      | This work | [1]  | [9]     |
|-----------------------------|-----------|------|---------|
| Data Rate (Gb/s)            | 2–10      | 7    | 4–10.5  |
| Power Efficiency (mW/Gb/s)  | 4.2       | 4.21 | 2.25    |
| Area (mm<sup>2</sup>)       | 0.12      | 0.23 | 1.63    |
| TJ of Recovered Clock (UI)  | 0.17      | N/A  | 0.24    |
| JTOL at High Frequency (UI) | 0.52      | 0.47 | 0.4     |
| Technology (nm)             | 65        | 130  | 65      |
| Supply (V)                  | 1.2       | 1.2  | 1.2/1.0 |

# 5. Conclusion

In this paper, a power scalable 2–10 Gb/s PI-based CDR with linearity optimization is developed for multilane and multirate serial links. The main features of this design include a DLL-based global biasing strategy to obtain a wide operation range, an open-loop VCDL-based clock conditioner to perform the desired functions of DCC, I, Q mismatch adjustment, and slew-rate calibration, and a compensating PI to optimize the phase-interpolation linearity. The prototype is fabricated in a 65-nm CMOS process and occupies an area of 0.12 mm<sup>2</sup>. The measured phase spacing mismatches between I, Q clocks are below $2^{\circ}$ within a range of 1–6 GHz. The power consumption scales approximately linearly with the operation rate from 2 to 10 Gb/s and reaches 42 mW at 10 Gb/s. The maximum tolerable high-frequency amplitude of the sinusoidal jitter is 0.52 UIpp and the total jitter of the recovered clock is reduced to 16.6 ps when operating at 10 Gb/s.

### Acknowledgment

This work was supported in part by the China National Science and Technology Major Project under Grant 2016ZX01012101 and in part by the EU H2020 project STEP2DYNA under Grant 691154. The authors would like to thank Tektronix Open Lab in Beijing for the measurement support.

# Appendix A. Supplementary data

Supplementary data related to this article can be found at https://doi.org/10.1016/j.mejo.2018.10.007.

#### References

- [1] N. Kalantari, J.F. Buckwalter, A multichannel serial link receiver with dual-loop clock-and-data recovery and channel equalization, IEEE Trans. Circuits Syst. I Reg. Pap. 60 (11) (Nov. 2013) 2920–2931.
- [2] B. Casper, F. O'Mahony, Clocking analysis, implementation and measurement techniques for high-speed data links – A tutorial, IEEE Trans. Circuits Syst. I Reg. Pap. 56 (1) (Jan. 2009) 17–39.
- [3] J. Kenney, et al., A 6.5 Mb/s to 11.3 Gb/s continuous-rate clock and data recovery, in: Proc. IEEE Custom Integrated Circuits Conf, Sep. 2014, pp. 1–4.
- [4] E. Mammei, et al., A power-scalable 7-Tap FIR equalizer with tunable active delay line for 10-to-25Gb/s multi-mode fiber EDC in 28nm LP-CMOS, in: Proc. IEEE Int. Solid-state Circuits Conf. Dig. Tech. Papers, Feb. 2014, pp. 142–143.
- [5] K. Kim, et al., A 2.6mW 370MHz-to-2.5GHz open-loop quadrature clock generator, in: Proc. IEEE Int. Solid-state Circuits Conf. Dig. Tech. Papers, Feb. 2008, pp. 458–459.
- [6] Y.-H. Kwak, et al., A 20 Gb/s clock and data recovery with a ping-pong delay line for unlimited phase shifting in 65 nm CMOS process, IEEE Trans. Circuits Syst. I Reg. Pap. 60 (2) (Feb. 2013) 303–313.
- [7] H. Takauchi, et al., A CMOS multichannel 10-Gb/s transceiver, IEEE J. Solid State Circ. 38 (12) (Dec. 2003) 2094–2100.
- [8] S. Huang, J. Cao, M.M. Green, An 8.2 Gb/s-to-10.3 Gb/s full-rate linear referenceless CDR without frequency detector in 0.18 μm CMOS, IEEE J. Solid State Circ. 50 (9) (Sep. 2015) 2048–2060.
- [9] G. Shu, et al., A 4-to-10.5Gb/s 2.2mW/Gb/s continuous-rate digital CDR with automatic frequency acquisition in 65nm CMOS, in: Proc. IEEE Int. Solid-state Circuits Conf. Dig. Tech. Papers, Feb. 2014, pp. 150–151.
- [10] R. Kreienkamp, et al., A 10-Gb/s CMOS clock and data recovery circuit with an analog phase interpolator, IEEE J. Solid State Circ. 40 (3) (Mar. 2005) 736–743.
- [11] T. Musah, et al., A 4-32 Gb/s bidirectional link with 3-tap FFE/6-tap DFE and collaborative CDR in 22 nm CMOS, IEEE J. Solid State Circ. 49 (12) (Dec. 2014) 3079–3090.
- [12] G.-Y. Wei, et al., A variable-frequency parallel I/O interface with adaptive power-supply regulation, IEEE J. Solid State Circ. 35 (11) (Nov. 2000) 1600–1610.
- [13] G. Balamurugan, et al., A scalable 5-15 Gbps, 14-75 mW low-power I/O transceiver in 65 nm CMOS, IEEE J. Solid State Circ. 43 (4) (Apr. 2008) 1010–1019.
- [14] G.R. Gangasani, et al., A 16-Gb/s backplane transceiver with 12-tap current integrating DFE and dynamic adaptation of voltage offset and timing drifts in 45-nm SOI CMOS technology, IEEE J. Solid State Circ. 47 (8) (Aug. 2012) 1828–1841.
- [15] P.K. Hanumolu, G.-Y. Wei, U.-K. Moon, A wide-tracking range clock and data recovery circuit, IEEE J. Solid State Circ. 43 (2) (Feb. 2008) 425–439.
- [16] P.K. Hanumolu, et al., A sub-picosecond resolution 0.5-1.5 GHz digital-to-phase converter, IEEE J. Solid State Circ. 43 (2) (Feb. 2008) 414–424.
- [17] G. Wu, et al., A 1-16-Gb/s all-digital clock and data recovery with a wideband, high-linearity phase interpolator, IEEE Trans. VLSI Syst. (2017), https://doi.org/10.1109/TVLSI.2015.2508045.
- [18] N. Kurd, et al., Next generation Intel<sup>®</sup> Core<sup>™</sup> micro-architecture (Nehalem) clocking, IEEE J. Solid State Circ. 44 (4) (Apr. 2009) 1121–1129.

#### Microelectronics Journal 82 (2018) 36-45

- [19] M.Y. He, J. Poulton, A CMOS mixed-signal clock and data recovery circuit for OIF CEI-6G+ backplane transceiver, IEEE J. Solid State Circ. 41 (3) (Mar. 2006) 597–606.
- [20] X. Zheng, et al., A 40-Gb/s quarter-rate SerDes transmitter and receiver chipset in 65-nm CMOS, IEEE J. Solid State Circ. (2017), https://doi.org/10.1109/JSSC.2017.2746672.
- [21] B. Razavi, Design of Integrated Circuits for Optical Communications, second ed., John Wiley & Sons, Inc., Hoboken, 2012.



Fangxu Lv received the B.S. and M.S. degrees from Air Force Engineering University, Xi'an, China, in 2011 and 2014, respectively. He is currently pursuing the Ph.D. degree at Tsinghua University, Beijing, China. His current research interests include high-speed wireline system design.



**Xuqiang Zheng** received the B.S. and M.S. degrees, both in physics, from Central South University, Hunan, China, in 2006 and 2009, respectively, and the Ph.D. degree in computer science from the University of Lincoln, Lincoln, U.K., in 2018.

From 2010 to 2015, he was a Mixed Signal Engineer with the Institute of Microelectronics, Tsinghua University, Beijing, China. Since 2018, he has been with the Institute of Microelectronics of Chinese Academy of Sciences, Beijing, China, where he is currently an Associate Professor. His current research interests include high-performance A/D converters and high-speed wireline communication systems.



Feng Zhao received the B.Eng. degree in electronic engineering from the University of Science and Technology of China, Hefei, China, in 2000, and the M.Phil. and Ph.D. degrees in computer vision from The Chinese University of Hong Kong, Hong Kong, in 2002 and 2006, respectively. From 2006 to 2007, he was a Post-Doctoral Fellow with the Department of Information Engineering, The Chinese University of Hong Kong.

From 2007 to 2010, he was a Research Fellow with the School of Computer Engineering, Nanyang Technological University, Singapore. He was then a Post-Doctoral Research Associate with the Intelligent Systems Research Centre, University of Ulster, Londonderry, U.K. From 2011 to 2015, he was a Workshop Developer and a Post-Doctoral Research Fellow with the Department of Computer Science, Swansea University, Swansea, U.K. From 2015 to 2017, he was a Post-Doctoral Research Fellow with the School of Computer Science, University of Lincoln, Lincoln, U.K. Since 2017, he has been with the Department of Computer Science, Liverpool John Moores University, Liverpool, U.K., where he is currently a Senior Lecturer. His research interests include image processing, biomedical image analysis, computer vision, pattern recognition, machine learning, artificial intelligence, and robotics.




Jianye Wang received the B.S. and M.S. degrees from Nanjing University of Science and Technology, Nanjing, China. He received the Ph.D. degree from Air Force Engineering University, Xi'an, China.



Weidong Cao received the B.S. degree from the Northwestern Polytechnical University, Xi'an, China, in 2013, and the M.S. degree from Tsinghua University, Beijing, China, in 2016, both in electrical engineering. He is currently pursuing the Ph.D. degree at Washington University in St. Louis, MO, USA.





Shigang Yue (M'05–SM'17) received the B.Eng. degree from Qingdao Technological University, Shandong, China, in 1988, and the M.Sc. and Ph.D. degrees from the Beijing University of Technology (BJUT), Beijing, China, in 1993 and 1996, respectively.

He was with BJUT as a Lecturer from 1996 to 1998 and an Associate Professor from 1998 to 1999. He was an Alexander von Humboldt Research Fellow at the University of Kaiserslautern, Kaiserslautern, Germany, from 2000 to 2001. He is currently a Professor of computer science with the School of Computer Science, University of Lincoln, Lincoln, U.K. He joined the University of Lincoln as a Senior Lecturer in 2007 and was promoted to Reader in 2010 and Professor in 2012. Before that, he held research positions with the University of Cambridge, Cambridge, U.K., Newcastle University, Newcastle upon Tyne, U.K., and University College London, London, U.K. His current research interests include artificial intelligence, computer vision, robotics, brains and neuroscience, biological visual neural systems, evolution of neuronal subsystems, and their applications, e.g., in collision detection for vehicles, interactive systems, and robotics. Dr. Yue is a member of the International Neural Network Society, International Society of Artificial Life, and International Symposium on Biomedical Engineering. He is the Founding Director of the Computational Intelligence Laboratory, Lincoln. He is the coordinator for several EU FP7 projects.



Ziqiang Wang received the B.S. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 1999 and 2006, respectively.

After the Ph.D. degree, he was a Research Assistant with the Institute of Microelectronics, Tsinghua University, where he has been an Associate Professor since 2015. His current research interests include analog circuit design.



Yajun He received the B.S. degree from the School of Microelectronics and Solid-State Electronics, University of Electronic Science and Technology of China, Chengdu, China, in 2015. She is now working toward the M.S. degree at Tsinghua University, Beijing, China. Her research interests include high-speed wireline transmitters and PLLs.



**Chun Zhang** (M'03) received the B.S. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 1995 and 2000, respectively.

Since 2000, he has been with Tsinghua University, where he was with the Department of Electronic Engineering from 2000 to 2004 and he has been an Associate Professor with the Institute of Microelectronics since 2005. His current research interests include mixed signal integrated circuits and systems, embedded microprocessor design, digital signal processing, and radio frequency identification.



Hanjun Jiang (S'01–M'07) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2001, and the Ph.D. degree in electrical engineering from Iowa State University, Ames, IA, USA, in 2005.

From 2005 to 2006, he was with Texas Instruments, Dallas, TX, USA. After that, he was with Tsinghua University, where he is currently an Associate Professor. He has authored over 80 peer-reviewed journal and conference papers. His current research interests include analog and RF circuit design, and system technologies for wireless medical and healthcare applications. Dr. Jiang has been the IEEE Solid-State Circuits Society Beijing Chapter Chair since 2015. He is currently the Associate Editor of the IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS.



Zhihua Wang (SM'04–F'17) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1983, 1985, and 1990, respectively.

In 1983, he joined the faculty at Tsinghua University, where he has been a Full Professor since 1997 and the Deputy Director of the Institute of Microelectronics since 2000. From 1992 to 1993, he was a Visiting Scholar with Carnegie Mellon University, Pittsburgh, USA. From 1993 to 1994, he was a Visiting Researcher with KU Leuven, Leuven, Belgium. He is the co-author of ten books and book chapters, over 90 papers in international journals, and over 300 papers in international conferences. He holds 58 Chinese patents and four U.S. patents. His current research interests include CMOS radio frequency integrated circuit (RFIC), biomedical applications, radio frequency identification, phase locked loop, low-power wireless transceivers, and smart clinic equipment with a combination of leading-edge CMOS RFIC and digital image processing techniques.

Prof. Wang was an Official Member of the China Committee for the Union Radio-Scientifique Internationale from 2000 to 2010. He served as a Technologies Program Committee Member of the IEEE International Solid-State Circuit Conference from 2005 to 2011. He has been a Steering Committee Member of the IEEE Asian Solid-State Circuit Conference since 2005. He has served as the Deputy Chairman of the Beijing Semiconductor Industries Association and the ASIC Society of Chinese Institute of Communication, as well as the Deputy Secretary General of the Integrated Circuit Society in the China Semiconductor Industries Association. He was one of the chief scientists of the China Ministry of Science and Technology, serving on the Expert Committee of the National High Technology Research and Development Program of China (863 Program) in the area of information science and technologies from 2007 to 2011. He was the Chairman of the IEEE Solid-State Circuit Society Beijing Chapter from 1999 to 2009. He has served as the Technical Program Chair of the 2013 A-SSCC. He served as the Guest Editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS Special Issue in 2006 and 2009. He is an Associate Editor of the IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-PART II: EXPRESS BRIEFS.