# Circuit Techniques for Enhancing the Clock Data Compensation Effect under Resonant Supply Noise

Dong Jiao, Jie Gu, and Chris H. Kim

University of Minnesota, Minneapolis, MN 55455 USA

*Abstract*- Recent publications have shown that clock jitter can improve timing margin through the compensation effect between the clock cycle and the datapath delay under the influence of resonant supply noise. In this paper, novel phase-shifted clock buffer designs are proposed to enhance this "beneficial jitter effect". Compared with existing designs, our design saves 85% of the clock buffer area while achieving a similar 10% increase in maximum operating frequency for typical pipeline circuits. Measurement results are presented from a test chip implemented in a 1.2V, 65nm process.

## I. INTRODUCTION

Power supply noise is considered one of the major causes of performance degradation. Recently, supply noise at the resonant frequency has drawn considerable attention, as it is recognized as the dominant component of supply noise in high performance designs. Resonant noise, which is caused by the resonance between the package/bonding inductance and the die capacitance, typically resides in the 50-300MHz frequency band [1, 2]. Fig. 1 shows the measured power supply network impedance profile of an Intel Nehalem microprocessor [3], which exhibits a large impedance peak at around 150MHz. A sudden current spike, possibly caused by a clock signal or a processor wakeup, can excite the resonant noise [4]. Due to its large magnitude and long duration, resonant noise leads to the worst-case supply noise scenario and as a result has spurred numerous research activities [5-7].



Recent publications have revealed an intriguing timing compensation effect between the clock cycle and the datapath delay under the influence of resonant supply noise [8-10]. This phenomenon, which we will refer to as the *beneficial jitter effect*, is illustrated in Fig. 2. The simple pipeline circuit consists of a phase-locked loop (PLL), a clock path, and a datapath. In traditional analyses, only the delay variation of the datapath under supply noise is considered, i.e., a constant-period clock is assumed. Fig. 2(b) gives an example of the waveforms corresponding to the traditional analyses, showing that the circuit fails to sample the correct output when the supply voltage falls. In reality, however, the clock cycle is also modulated by the supply noise. It is therefore possible for the clock cycle to be stretched when the datapath delay increases, so that timing violations are partially avoided. Fig. 2(c) shows an example of the waveforms for this scenario, in which the output is always sampled correctly by the stretched clock cycle. In fact, adaptive clocking schemes have been deployed in microprocessor products to enhance the beneficial jitter effect [3, 11]. There, the clock generator, a PLL, is intentionally designed to be sensitive to the supply noise so that the period of the generated clock varies systematically as the supply voltage fluctuates. It has been experimentally validated that the adaptive clock period helps compensate for the increased datapath delay in the presence of supply noise.



Instead of modulating the clock cycle in the clock generator, shifting the phase of the supply noise seen by the clock path [8-10] can also enhance the beneficial jitter effect. For example, the clock path can use an RC filtered supply voltage so that the noise phase is shifted by a desired amount. The beneficial jitter effect can then be dramatically enhanced by carefully selecting the phase-shift value. However, existing designs based on this principle require a large area overhead due to the large capacitance and small resistance requirements.

In this work, we propose novel phase-shifted clock buffer designs that save 85% of the clock buffer area while achieving a similar timing improvement, enabling a 10% increase in the maximum operating frequency. The proposed designs can be used in conjunction with adaptive clocking schemes to further improve chip performance.

## II. PROPOSED PHASE-SHIFTED CLOCK BUFFER DESIGN

Fig. 4. (left) Schematic of a conventional buffer, an *RC* filtered buffer, and proposed stacked high $V_t$ and low $V_t$ buffers. (right) Layout of clock buffers.

The intrinsic beneficial jitter effect provides only limited timing margin relief for pipeline circuits. This is because the clock period is stretched out the most when the noise slope is the sharpest (point "A" in Fig. 3) while the worst-case datapath delay occurs when the supply voltage is the smallest (point "B" in Fig. 3). These two time points unfortunately do not coincide with each other, hence providing little timing compensation between the clock cycle and the datapath delay. This observation indicates that a larger compensation effect can be achieved if the phase of the supply noise seen by the clock path is shifted such that points A and B are aligned.
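The misalignment between points A and B can be illustrated with a deliberately simplified first-order model. The sketch below assumes a sinusoidal supply ripple, a datapath delay increase proportional to the supply droop, and a clock period stretch proportional to the falling slope of the supply seen by the clock path. It ignores the finite clock path delay and the attenuation of any filter, so the 90 degree value is a property of this toy model rather than the optimum for a real design. All function names are hypothetical.

```python
import math

def peak_phase_deg(fn, samples=3600):
    """Return the noise phase (degrees, 0-360) at which fn is largest."""
    best = max(range(samples), key=lambda k: fn(2 * math.pi * k / samples))
    return 360.0 * best / samples

def datapath_delay_increase(theta):
    # Assumption: supply ripple is sin(theta); delay grows as the supply
    # droops, so the increase is proportional to -sin(theta).
    return -math.sin(theta)

def clock_period_stretch(theta, shift_deg=0.0):
    # Assumption: the period stretch follows the falling slope of the
    # supply seen by the clock path, i.e. -d/dtheta sin(theta - shift).
    return -math.cos(theta - math.radians(shift_deg))

# Point B: phase of the worst-case datapath delay (supply minimum).
point_b = peak_phase_deg(datapath_delay_increase)
# Point A: phase of the largest clock period stretch with no phase shift.
point_a = peak_phase_deg(clock_period_stretch)
# Point A after lagging the clock path supply noise by 90 degrees.
point_a_shifted = peak_phase_deg(lambda th: clock_period_stretch(th, 90.0))

print(f"B at {point_b:.1f} deg, A at {point_a:.1f} deg, "
      f"shifted A at {point_a_shifted:.1f} deg")
```

In this model the intrinsic stretch peaks a quarter of a noise cycle before the supply minimum, and lagging the noise seen by the clock path moves the peak onto the worst-case datapath delay.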

Fig. 4(left) shows the schematic of a conventional buffer and various types of phase-shifted clock buffers, including the previous *RC* filtered buffer [8-10]. The *RC* filtered buffer contains a PMOS pull-up device and an NMOS capacitor to generate a phase-shifted supply. The major drawback of this design is its large area. The resistance of the *RC* filter must be very small to minimize the IR drop across the resistor, which in turn requires a large capacitance to provide the desired phase shift. As seen from Fig. 4(right), the *RC* filtered buffer consumes 10X larger area than the conventional one when designed for an IR drop less than 50mV.
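The resistance/capacitance trade-off can be made concrete with the standard first-order *RC* low-pass relations (phase lag $\arctan(2\pi f R C)$, IR drop $I \cdot R$). The sketch below treats the PMOS pull-up as an ideal resistor, which is a simplification. The 50mV IR drop limit and the 118MHz noise frequency appear elsewhere in this paper, while the 1mA average buffer current and the 45 degree target lag are purely illustrative assumptions.

```python
import math

def rc_for_phase_lag(f_noise_hz, lag_deg, i_avg_a, ir_drop_max_v):
    """Largest R allowed by the IR drop budget, and the C then needed
    to reach the target phase lag at the noise frequency."""
    r_max = ir_drop_max_v / i_avg_a
    c_min = math.tan(math.radians(lag_deg)) / (2 * math.pi * f_noise_hz * r_max)
    return r_max, c_min

def rc_response(f_hz, r_ohm, c_farad):
    """Phase lag (degrees) and amplitude gain of a first-order RC low-pass."""
    x = 2 * math.pi * f_hz * r_ohm * c_farad
    return math.degrees(math.atan(x)), 1.0 / math.sqrt(1.0 + x * x)

# Illustrative numbers: 1 mA and 45 degrees are assumptions, not measurements.
r, c = rc_for_phase_lag(f_noise_hz=118e6, lag_deg=45.0,
                        i_avg_a=1e-3, ir_drop_max_v=0.05)
lag, gain = rc_response(118e6, r, c)
print(f"R <= {r:.0f} ohm, C >= {c * 1e12:.1f} pF, "
      f"lag = {lag:.1f} deg, gain = {gain:.2f}")
```

Under these assumptions the resistor is capped at 50 ohms, so tens of picofarads are needed per buffer supply, which is consistent with the area penalty described above. Doubling the buffer current halves the allowed resistance and doubles the required capacitance.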

In this work, we propose a phase-shifted clock buffer using stacked devices to significantly reduce the buffer area while achieving a similar timing improvement. As shown in Fig. 4, a header and a footer device are added to the conventional buffer. Instead of using an *RC* filter to provide the supply voltage, we use *RC* filters to control the header and the footer gate voltages. We refer to this design as the stacked buffer. The clock cycle modulation effect can be further enhanced by using high $V_t$ devices, which increase the sensitivity of the clock buffer delay to the filtered supply noise. Thus, the proposed stacked buffer design was evaluated for both low $V_t$ (LVT) and high $V_t$ (HVT) header and footer switches. In both the stacked LVT and HVT buffer designs, the switching current no longer flows through the resistor, so a large resistance can safely be used to reduce the corresponding capacitor area. As shown in Fig. 4(right), the two proposed buffer designs consume only 10% of the area of the *RC* filtered buffer.



Fig. 5. High level block diagram of the 65nm test chip.

Considering that the proposed stacked buffer is 50% larger than the conventional non-stacked buffer for the same drive current, the proposed stacked buffer saves $\sim$ 85% of the buffer area relative to the *RC* filtered buffer.
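One way to read the $\sim$ 85% figure is as the following arithmetic, normalized to the conventional buffer area. The symbols $A_{conv}$, $A_{RC}$ and $A_{stacked}$ are introduced here only for illustration, and the calculation assumes that the 10X and 50% figures quoted above can be combined directly.

$$A_{RC} \approx 10\,A_{conv}, \qquad A_{stacked} \approx 1.5\,A_{conv} \quad\Rightarrow\quad 1 - \frac{A_{stacked}}{A_{RC}} \approx 1 - \frac{1.5}{10} = 0.85$$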

## III. TEST CHIP ORGANIZATION

Fig. 5 shows the block diagram of the test chip, which contains two VCOs, a clock path block, a core logic block, two 13-bit counters, a noise injection block, a supply noise sensor, and a read-out block. The VCOs used for generating the clock signal and the supply noise each consist of 5 inverting stages as well as a header and a footer device. By controlling the external bias voltage VBIAS, the output frequency can be varied from 0Hz to 3.4GHz. The two VCO frequencies can be monitored by measuring the output signal from a 10-bit frequency divider circuit. Five clock paths are implemented with different clock buffers: the conventional buffer, the *RC* filtered buffer, the stacked LVT buffer, the stacked HVT buffer, and a "no buffer" design in which the output of the clock VCO is directly connected to the local registers. Each path contains 9 buffer stages and long *RC* wires, giving a 1.0ns delay from the clock VCO to the local registers. One of the five clock paths is selected at a time to test each clock buffer design separately. The datapath circuit consists of two standard D flip-flops with a ten-stage FO4 inverter chain in between, representing a chip critical path with a delay of 0.6ns. The input to the datapath is toggled between 1 and 0 in each cycle. Additional control logic increments the "data counter" only when the sampled output and the corresponding input are identical (during input '1' cycles only). A "reference counter" increments every other cycle and is used for counting the total number of sampled outputs. By scanning out the number stored in the data counter when the reference counter overflows, the percentage of correct samples can be conveniently measured.

The noise injection block has 32 NMOS devices, clocked by the noise VCO, that can trigger the supply noise. By adjusting the noise VCO frequency and turning on a different number of noise injection devices, noise current can be injected into the supply network at a specific frequency with different amplitudes. A supply noise sensor is also included in the chip for in-situ measurements. It takes the noisy supply and ground signals as differential inputs, and its output indicates the on-chip supply noise frequency and amplitude [7]. The frequency response of the noise sensor can be characterized using external high frequency inputs and a spectrum analyzer.

The read-out block consists of a 10-bit parallel-to-serial shift register and some control logic gates. In COUNT mode, the shift register captures the upper 10 bits of the data counter when the reference counter overflows. In READ mode, an external clock is provided to scan out the stored data serially. Fig. 6 shows the read-out waveforms including a mode selection signal, an external clock, and a read-out scan value. The read-out value we record is the average of 512 scan values to eliminate transient noise effects.
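The conversion from scanned codes to a percentage of correct samples can be sketched as a small behavioral model. The counter and read-out widths and the 512-value averaging come from the description above. How the hardware handles the all-correct case (where a 13-bit data counter would itself wrap) is not described, so the saturation used here, the full-scale value of $2^{10}$, and all function names are assumptions.

```python
COUNTER_BITS = 13   # width of the data and reference counters
READOUT_BITS = 10   # upper bits captured by the shift register

def scan_value(correct_samples):
    """10-bit code captured when the reference counter overflows."""
    max_count = (1 << COUNTER_BITS) - 1
    # Assumption: the data counter saturates instead of wrapping.
    data_counter = min(correct_samples, max_count)
    return data_counter >> (COUNTER_BITS - READOUT_BITS)

def percent_correct(scan_values):
    """Average a batch of scan values and express it as a percentage."""
    full_scale = 1 << READOUT_BITS   # assumed full-scale code
    mean_code = sum(scan_values) / len(scan_values)
    return 100.0 * mean_code / full_scale

# Example: half of the 8192 reference events sampled correctly,
# repeated for the 512 scans that are averaged into one reading.
codes = [scan_value(4096)] * 512
print(percent_correct(codes))
```

Dropping the lower three bits limits the resolution of a single scan to about 0.1%, and averaging 512 scans suppresses run-to-run variation on top of that.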

## IV. TEST CHIP MEASUREMENTS

The test chip was fabricated in a 1.2V, 65nm process and its die photo is shown in Fig. 7. In the first test, eight noise injection devices were turned on and the noise VCO bias was adjusted to provide a 118MHz noise frequency. Fig. 8 shows the percentage of correct samples measured from different clock buffer designs. The percentage of correct samples for each design is 100% at low frequencies and it drops quickly as the clock frequency is raised beyond a certain value. This value is denoted as  $F_{max}$ , the maximum clock frequency for the pipeline circuit to operate correctly.  $F_{max}$  of the conventional buffer design decreases from 1.64GHz to 1.2GHz for the given noise condition. The measured  $F_{max}$  of the RC filtered buffer, the stacked LVT buffer and the stacked HVT buffer were 1.33GHz, 1.31GHz and 1.34GHz, respectively, which translate into around a 10% performance improvement compared to that of the conventional clock path.
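The "around 10%" figure follows directly from the measured values quoted above, as the short calculation below shows. It takes 1.64GHz as the conventional design's $F_{max}$ without injected noise, which is our reading of the sentence above. The dictionary keys and function name are illustrative.

```python
F_MAX_NO_NOISE_GHZ = 1.64   # conventional buffer, read as the noise-free value
F_MAX_GHZ = {               # measured with 118 MHz noise, 8 injection devices
    "conventional": 1.20,
    "RC filtered": 1.33,
    "stacked LVT": 1.31,
    "stacked HVT": 1.34,
}

def improvement_pct(f_ghz, ref_ghz):
    """Relative F_max gain over a reference design, in percent."""
    return 100.0 * (f_ghz / ref_ghz - 1.0)

baseline = F_MAX_GHZ["conventional"]
improvements = {name: round(improvement_pct(f, baseline), 1)
                for name, f in F_MAX_GHZ.items()}
noise_loss_pct = round(100.0 * (1.0 - baseline / F_MAX_NO_NOISE_GHZ), 1)

for name, gain in improvements.items():
    print(f"{name:>12}: {gain:+.1f}%")
print(f"loss due to noise (conventional): {noise_loss_pct:.1f}%")
```

The three phase-shifted designs come out at roughly 9% to 12% above the conventional clock path, recovering part of the roughly 27% loss that the injected noise causes.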

Fig. 9 shows the measured $F_{max}$ for the different clock buffer designs when incrementing the number of noise injection devices while maintaining the resonant supply noise at 118MHz. As expected, the performance degrades linearly as more noise injection devices are turned on. The proposed designs improve the $F_{max}$ by 8%-15% when more than 8 noise injection devices are turned on. This is similar to what the *RC* filtered buffer design achieves under the same condition.

Fig. 10. Measured $F_{max}$ normalized to the conventional buffer case for different noise frequencies.

The measured $F_{max}$ values (normalized to the conventional design) for the different designs are shown in Fig. 10 for a range of noise frequencies. The number of noise injection devices is carefully adjusted so that the $F_{max}$ of the conventional buffer design is fixed around 1.2GHz. The figure clearly shows that the $F_{max}$ of the phase-shifted clock buffer designs is 8%-27% higher than that of the conventional design in the typical resonant frequency range of 100MHz to 300MHz. For noise frequencies higher than 400MHz or lower than 50MHz, the $F_{max}$ of the phase-shifted clock buffer designs and the conventional design are similar. This is because the clock cycle modulation effect is very weak in both extreme frequency cases: when the noise frequency is high, the strong averaging effect makes consecutive clock edges see almost the same average supply voltages; on the other hand, when the noise frequency is low, consecutive clock edges again see almost the same supply voltages since the supply fluctuates very slowly. At some high frequencies, the phase-shifted buffer designs exhibit some performance degradation. However, this does not undermine the effectiveness of the phase-shifted buffer designs, because the resonant impedance is much larger than the impedances at other frequencies, as illustrated in Fig. 1, and the worst-case noise scenario always occurs in the resonant frequency band rather than at high frequencies [3]. In summary, the $F_{max}$ of the phase-shifted designs is improved by 8%-27% for typical resonant noise frequencies, and it approaches that of the conventional design for noise frequencies outside of the resonant band.

## V. COMPARISONS WITH THE ADAPTIVE CLOCKING SCHEME

The effectiveness of the proposed stacked buffer design has been verified from our test chip measurements. Recall that the beneficial jitter effect can also be enhanced by modulating the clock cycle at the clock source (e.g., a PLL or a clock VCO) rather than in the clock buffers. Such adaptive clocking schemes have been recently deployed in microprocessor products [3, 11]. In this section, we show that the proposed buffer design can be used in tandem with existing adaptive clocking schemes to further improve chip performance. Fig. 11 shows the schematic of the test circuit, in which the clock VCO block and the datapath block are identical to those used in Fig. 5. A noisy power supply is applied to all the blocks and the noise amplitude of the supply noise is set to be 10% of the nominal value. In the adaptive clocking design, the clock path is implemented with conventional buffers, and thus the clock modulation effect is mainly from the clock VCO. By replacing the conventional clock buffers with the proposed phase-shifted buffers, both the clock VCO and the clock path modulate the clock cycle with respect to the supply noise.

Timing slack, which is defined as the difference in the arrival times of CLK and DOUT, is simulated and plotted in Fig. 12 for noise frequencies from 10MHz to 1.2GHz. A negative slack means that the CLK signal arrives at the sampling register before the DOUT does, indicating a sampling failure. In other words, the slack must be kept positive for a datapath to function correctly. It is shown that the adaptive clocking scheme alone provides 17-39ps worst-case slack improvement for typical resonant noise frequencies of 100MHz to 300MHz. By combining the two clock cycle modulation schemes, an additional 30-62ps improvement in worst-case slack can be achieved.
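The slack definition used here can be written down in a few lines. In the sketch below, the sign convention (CLK arrival minus DOUT arrival, so that a negative value is a sampling failure) follows the text, while the function names and the arrival times in picoseconds are hypothetical values chosen only to exercise the definition. They are not simulation results.

```python
def timing_slack(t_clk, t_dout):
    """Slack for one cycle: positive when DOUT arrives before CLK."""
    return t_clk - t_dout

def worst_case_slack(clk_arrivals, dout_arrivals):
    """Smallest slack over a set of cycles spanning the noise period."""
    return min(timing_slack(c, d) for c, d in zip(clk_arrivals, dout_arrivals))

# Hypothetical arrival times (ps) for three cycles under supply noise.
clk_ps = [1000.0, 1012.0, 995.0]
dout_ps = [960.0, 990.0, 1001.0]

worst = worst_case_slack(clk_ps, dout_ps)
passes = worst > 0
print(f"worst-case slack = {worst:.1f} ps, datapath passes: {passes}")
```

A clock cycle modulation scheme improves the worst-case slack when it delays the CLK arrival in exactly those cycles where DOUT is late.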



Fig. 12. Simulated worst-case slack for different clock cycle modulation schemes.

## VI. CONCLUSION

We have presented two novel phase-shifted clock buffer designs that enhance the beneficial jitter effect in the presence of resonant supply noise. A 1.2V, 65nm test chip demonstrates an 8%-27% performance improvement in  $F_{max}$  for typical resonant noise frequencies from 100MHz to 300MHz. We have also shown that the proposed buffer designs can be combined with the adaptive clocking scheme to further improve chip performance. Compared with the prior *RC* filtered phase-shifted buffer, our design saves 85% of the clock buffer area while achieving similar performance improvement.

## REFERENCES

- [1] S. Pant and E. Chiprout, "Power Grid Physics and Implications for CAD," in *Proc. Design Autom. Conf.*, pp. 199-204, July 2007.
- [2] E. Hailu, D. Boerstler, K. Miki, et al., "A Circuit for Reducing Large Transient Current Effects on Processor Power Grids," in *Proc. Int. Solid-State Circuits Conf.*, pp. 2238-2245, February 2006.
- [3] N. Kurd, P. Mosalikanti, M. Neidengard, J. Douglas, and R. Kumar, "Next Generation Intel® Core™ Micro-Architecture (Nehalem) Clocking," *IEEE J. Solid-State Circuits*, vol. 44, no. 4, pp. 1121-1129, April 2009.
- [4] M. D. Pant, P. Pant, and D. S. Wills, "On-Chip Decoupling Capacitor Optimization Using Architectural Level Prediction," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 10, no. 3, pp. 319-326, June 2002.
- [5] J. Gu, R. Harjani, and C. H. Kim, "Distributed Active Decoupling Capacitors for On-Chip Supply Noise Cancellation in Digital VLSI Circuits," in *Proc. IEEE Symp. VLSI Circuits*, pp. 216-217, June 2006.
- [6] J. Xu, P. Hazucha, M. Huang, et al., "On-Die Supply-Resonance Suppression Using Band-Limited Active Damping," in *Proc. Int. Solid-State Circuits Conf.*, pp. 2238-2245, February 2007.
- [7] J. Gu, H. Eom, and C. H. Kim, "A Switched Decoupling Capacitor Circuit for On-Chip Supply Resonance Damping," in *Proc. IEEE Symp. VLSI Circuits*, pp. 126-127, June 2007.
- [8] T. Rahal-Arabi, G. Taylor, J. Barkatullah, et al., "Enhancing Microprocessor Immunity to Power Supply Noise with Clock/Data Compensation," in *Proc. IEEE Symp. VLSI Circuits*, pp. 16-19, June 2005.
- [9] K. L. Wong, T. Rahal-Arabi, M. Ma, et al., "Enhancing Microprocessor Immunity to Power Supply Noise With Clock-Data Compensation," *IEEE J. Solid-State Circuits*, vol. 41, no. 4, pp. 749-758, April 2006.
- [10] D. Jiao, J. Gu, P. Jain, and C. H. Kim, "Enhancing Beneficial Jitter Using Phase-Shifted Clock Distribution," in *Proc. Int. Symp. on Low Power Electronics and Design (ISLPED)*, August 2008.
- [11] N. Kurd, J. Barkatullah, and P. Madland, "Adaptive frequency clock generation system," US Patent 7,042,259 B2, May 9, 2006.