Abstract—As the clock frequency and physical address space of 64b microprocessors continue to grow, one major critical path is the access to the on-die cache memory that includes a tag comparator, a tag SRAM and a data SRAM. To improve the delay of the tag comparator, a Diode Partitioned (DP) domino circuit is proposed. DP domino reduces the parasitic capacitance and enables a smaller keeper in high fan-in gates. The diode circuit is also improved by an enhanced diode that boosts up the gate voltage of the NMOS diode. Delay of a 40b tag comparator using the proposed scheme is 33% faster than an optimized complex domino circuit in 1.8V, 180nm CMOS technology.

Index Terms—High-speed domino circuit, keeper design, high-speed cache memory, tag comparator

I. INTRODUCTION

Demands for high performance computing have boosted the clock frequency over 1 GHz and physical address space has reached up to 50b for 64b microprocessors. Access to the on-die cache memory consisting of a tag comparator, a tag SRAM and a data SRAM is one of the major critical paths. Since a tag comparator provides the hit/miss information to the cache controller, it cannot be executed in parallel with accessing a tag SRAM. A 64b microprocessor requires a 40b tag comparator due to the 50b physical address, which has been increasing every generation. Domino circuit style is widely used in conventional tag comparator designs. Innovative keeper and multiple-stage designs have been proposed to improve performance of such high fan-in domino circuits [1–4].

In this paper, we propose a Diode Partitioned (DP) domino for fast tag comparators. After discussing the basic operation of the circuit, we present the implementation of a 40b tag comparator using the proposed DP domino in a 1.8V, 180nm, 4-metal CMOS technology. Simulation results on delay, power and noise robustness are compared to those of conventional domino circuits. Scaling implications of the proposed technique are explored using predictive 130nm, 100nm and 70nm technologies [5].

II. CONVENTIONAL TAG COMPARATOR DESIGN

Fig. 1 shows a 40b tag comparator that is composed of 2-input XOR gates and a 40b OR gate. Inputs $A[39:0]$ come from the tag field of the address register and $D[39:0]$ come from the tag SRAM. Since all the output signals from the SRAM are pre-charged signals, the tag comparator is suitable for a footless domino design. Fig. 2 shows a 4b tag comparator using a conventional footless domino circuit. Because each 2-input exclusive OR consists of 2 legs, the 4b comparator is composed of 8 legs. The large number of legs significantly increases the parasitic capacitance on the dynamic domino node $E[0]$. In the worst-case input pattern, only one out of eight NMOS paths discharges the domino node $E[0]$. Capacitance on $E[0]$ is mainly due to the drain capacitance of the parallel NMOS transistors. In general, domino circuits are suitable for wide OR implementation. However, if the fan-in is very high, such as the 80 parallel inputs of a 40b tag comparator, a multiple-stage design and a strong keeper sized for the target noise robustness are needed to cope with the increased parasitic capacitance on $E[0]$ and the DC noise from the high fan-in, wide parallel NMOS network.
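The logic function described above can be illustrated with a small behavioral sketch. It is not a circuit model: the function names are hypothetical, and it assumes that each XOR is built from two dual-rail legs so that exactly one leg conducts per mismatching bit.

```python
TAG_BITS = 40  # tag width quoted in the text


def tag_miss(a: int, d: int, bits: int = TAG_BITS) -> bool:
    """Wide OR of per-bit XORs: True when any tag bit differs (a miss)."""
    mask = (1 << bits) - 1
    return ((a ^ d) & mask) != 0


def pulldown_legs(bits: int = TAG_BITS) -> int:
    """Each 2-input XOR contributes 2 NMOS legs to the dynamic node."""
    return 2 * bits


def conducting_legs(a: int, d: int, bits: int = TAG_BITS) -> int:
    """Assumed model: one leg conducts for every mismatching bit."""
    mask = (1 << bits) - 1
    return bin((a ^ d) & mask).count("1")


if __name__ == "__main__":
    # Worst case for delay: a single-bit mismatch, so one leg must
    # discharge the capacitance contributed by all the other legs.
    print(pulldown_legs(4), pulldown_legs(40))
    print(conducting_legs(0b1010, 0b1011, bits=4))
```

A single-bit mismatch is the slowest case because only one leg pulls down the node while all the others merely load it.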

Manuscript received August 7, 2006. This work was funded in part by Semiconductor Research Corporation under contract 1078.001.

H. Suzuki was from Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA. He is now with Renesas Technology Corporation, Itami, Hyogo 664-0005 Japan (phone: +81-72-787-2338; fax:+81-72-789-3011; e-mail: suzuki.hiroaki@renesas.com).

C. H. Kim was from Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA. He is now with Electrical and Computer Engineering Department, University of Minnesota, Minneapolis, MN 55455-0154 USA (e-mail: chriskim@umn.edu).

Kaushik Roy is with Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: kaushik@ecn.purdue.edu).

III. DIODE PARTITIONED (DP) DOMINO CIRCUIT

In the proposed DP domino, the enhanced diode divides and reduces the parasitic capacitance on the domino node as shown in Fig. 3. The diodes separate the $E[0]$ node into $E'[0]$ and the two partitions $E'[1]$ and $E'[2]$. With the worst-case input, only the $A[0] - D[0]$ path turns on and the $E'[1]$ partition becomes active. The other NMOS paths are turned off. The $D[0]$ path discharges $E'[1]$ and $E'[0]$ but does not affect $E'[2]$ due to the reverse connection of the diode. The parasitic capacitance of $E[0]$ is divided into $n$ ways via $n$ diodes. The DP domino divides not only the parasitic capacitance but also the keeper transistors. The $1/n$-sized keepers are distributed, one on each partition. That is, the total size of the $K1$ keepers is equal to that of $K0$. For an NMOS driver in a partition, the contention current of the DP domino is therefore $1/n$ of that of the conventional domino. This $1/n$-sized keeper can meet the same input-noise robustness because the fan-in of parallel NMOS transistors at each partition is also $1/n$ of that of the original domino circuit. The DP domino requires an additional keeper $Kw$. However, its size can be very small because $K1$ blocks the major noise current from the NMOS networks.
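The $1/n$ argument can be made concrete with a short bookkeeping sketch. The names are hypothetical and sizes are normalized so that the conventional keeper $K0$ equals 1. The sketch only tracks proportions; it does not model device currents.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    legs: int               # parallel NMOS legs on one partition node
    keeper_fraction: float  # local keeper K1 size relative to K0 (K0 = 1)


def make_partition(total_legs: int, n_ways: int) -> Partition:
    """Split total_legs evenly into n_ways partitions with 1/n keepers."""
    if n_ways <= 0 or total_legs % n_ways != 0:
        raise ValueError("n_ways must evenly divide total_legs")
    return Partition(legs=total_legs // n_ways, keeper_fraction=1.0 / n_ways)


def keeper_per_leg(p: Partition) -> float:
    """Keeper strength available per parallel leg in a partition."""
    return p.keeper_fraction / p.legs


if __name__ == "__main__":
    conventional = make_partition(80, 1)   # one node, full keeper
    dp_20_way = make_partition(80, 20)     # 20 partitions of 4 legs
    # The keeper-to-leg ratio is unchanged, which is why the same
    # input-noise robustness can be met ...
    print(keeper_per_leg(conventional), keeper_per_leg(dp_20_way))
    # ... while the keeper fighting a single active leg is 1/n as large.
    print(dp_20_way.keeper_fraction)
```

The keeper strength per leg is independent of $n$, while the keeper seen by a single discharging leg shrinks to $1/n$.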

In CMOS design, a diode is usually implemented with an NMOS transistor. However, its small forward-bias current per transistor width $W$ makes it unsuitable for the proposed circuit, because a large $W$ would add extra parasitic capacitance on $E'[0]$. To improve the forward-bias diode current, we propose an enhanced diode circuit that boosts the gate voltage of the diode NMOS in forward-bias mode. Fig. 4 shows the schematic of the conventional diode and the proposed enhanced diode circuit. The gate node of the conventional NMOS diode is connected to the drain as shown in Fig. 4(a). To increase the forward-bias current, an extra NMOS diode is inserted between the gate (Vp) and drain (Va) nodes in the enhanced diode circuit as shown in Fig. 4(b). The additional diode is also implemented using an NMOS with gate and drain connected. Node Vp must be pre-charged using the clock signal of the domino. Since Vp is a pre-charged dynamic node, it should be protected by a keeper or an additional capacitance. Since adding capacitance does not change the DC characteristics of the DP domino, we recommend using a capacitance.

Fig. 5(a) shows the enhanced diode implemented in the proposed DP domino circuit. The low-to-high transition of the CLK signal causes Vc to discharge through the NMOS footer. Nodes Va and Va' follow Vc as the diodes turn on. Because of the pre-charged NMOS gate node (Vp), node Va in the enhanced diode discharges quickly, while the discharge of node Va' in the conventional diode slows down as Va' approaches Vthn, which is approximately 0.4V. During the Va transition between 0.22 and 0.35ns, the gate-to-drain voltage Vp-Va reaches a maximum of 1.17V, while it is 0V (shorted) in the conventional NMOS diode. This gate overdrive increases the drain current and produces the sharp, fast transition of the proposed enhanced diode. In addition, it enables a full-swing pull-down on node Va, while node Va' on the conventional diode remains above 0.4V.

![Fig. 2. Conventional footless domino circuit.](image)

![Fig. 3. Proposed Diode Partitioned (DP) domino circuit.](image)

![Fig. 4. Symbol and circuit schematics of (a) conventional diode and (b) enhanced diode.](image)
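The effect of the boosted gate can be illustrated with a textbook long-channel square-law model. This is only a qualitative sketch under stated assumptions: the function names, the transconductance factor `k`, and the `boost` value are hypothetical, body effect and velocity saturation are ignored, and the model is not expected to reproduce the simulated figures in this paper. Only the 0.4V threshold is taken from the text.

```python
VTHN = 0.4  # approximate NMOS threshold voltage quoted in the text (V)


def nmos_current(vgs: float, vds: float, vth: float = VTHN, k: float = 1.0) -> float:
    """Square-law drain current in arbitrary units (k is hypothetical)."""
    vov = vgs - vth  # gate overdrive
    if vov <= 0.0 or vds <= 0.0:
        return 0.0  # cut-off
    if vds >= vov:
        return 0.5 * k * vov ** 2  # saturation
    return k * (vov * vds - 0.5 * vds ** 2)  # triode


def conventional_diode(v: float) -> float:
    """Gate tied to drain, so the gate sees the same voltage as the drain."""
    return nmos_current(vgs=v, vds=v)


def enhanced_diode(v: float, boost: float) -> float:
    """Gate held `boost` volts above the drain (boost is an assumed value)."""
    return nmos_current(vgs=v + boost, vds=v)


if __name__ == "__main__":
    for v in (0.3, 0.6, 0.9):
        print(v, conventional_diode(v), enhanced_diode(v, boost=0.5))
```

In this simplified model the conventional diode stops conducting once its terminal voltage falls below the threshold, whereas the boosted gate keeps the device on, which matches the full-swing behavior described above.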

![Fig. 5. Enhanced diode in the DP domino circuit, (a) individual extra diode and (b) shared extra diode.](image)

![Fig. 6. Transient characteristic of the pre-charged enhanced diode.](image)

![Fig. 7. I-V curve characteristics of the enhanced and conventional diodes. Current of the enhanced diode is 2.28 times higher than that of the NMOS diode at 0.9V.](image)

IV. SIMULATION RESULTS AND IMPLEMENTATION

The DP domino and conventional domino circuits are simulated in a 1.8V, 180nm CMOS technology with an FO4 output load. A keeper ratio, defined as $(W/L)_{\text{KEEPER}}/(W/L)_{\text{PULLDOWN}}$, of 5% is used to meet a target DC noise robustness of VDD/4 [2]. Fig. 9 shows the waveforms of each signal in Fig. 2 and Fig. 3 for 10b tag comparators. The trip point of the output inverter is skewed to 1.11V (=0.62*VDD) for fast sensing of E[0]. Due to the divided capacitance and keeper, E'[2] in the active partition falls sharply. On the other hand, E'[1] in the inactive partition does not change. This isolation reduces delay and power by 22% and 40%, respectively. Here, power is evaluated for the same worst-case input pattern.

Fig. 10 compares the delay of DP and conventional domino circuits for different fan-in sizes. For example, a 12b tag comparator, which is composed of 24 legs, can be divided into 2 ways of 12-NMOS groups, 4 ways of 6-NMOS groups, and so on. Here, the solid line of the DP domino traces the smallest delay among the n-way partitioned designs at each fan-in. The DP domino can operate at a very large fan-in such as 48b, while the conventional domino starts to fail at 20b. We simulated the DP domino up to 120b, where the circuit still operates with a delay of 387ps.

The proposed DP domino can have several configurations in terms of the number of partitions. For example, the 40b DP domino exclusive OR gate consisting of 80 legs can be designed as 10-way with 8 legs per partition or 20-way with 4 legs per partition. For the 40b tag comparator, the 20-way design is the fastest, as shown in Fig. 10. Having many ways makes each distributed keeper small, which reduces the contention current of the keeper and improves the delay. On the other hand, having too many ways increases the parasitic capacitance on E'[0]. Although 4 or 8 legs per partition gave the fastest circuit configuration in our design, the best number of ways depends on the balance between the parasitic capacitance and the keeper's contention current. That is, the optimal number of ways changes depending on the process technology and the target noise tolerance. For example, under a lower criterion for input DC noise tolerance, the required contention current of the keeper becomes smaller. In other words, one partition can contain more legs with the same size of local keeper. Hence, a smaller number of ways can reduce the parasitic capacitance more effectively. In an SOI process with smaller source/drain capacitance, an increased number of ways can effectively reduce the contention current from the keepers while keeping the junction capacitance minimal.

In addition, the impact on layout area must be considered when deciding the optimal number of partitions. The delay of a 16b comparator with a 16-way partition, for example, is very close to that of an 8-way partition. Hence, the 8-way design would be preferable for compact area.

![Fig. 8. AC simulation of the conventional and enhanced diode, (a) test circuit and (b) waveforms.](image)

![Fig. 9. Waveforms of 10b tag comparator.](image)

![Fig. 10. Delay Comparisons.](image)
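The candidate configurations discussed above can be enumerated mechanically. The following sketch uses a hypothetical helper name and assumes that a partition always contains whole 2-leg XOR cells, so the number of ways must divide the tag width. It lists the candidates only; choosing among them still requires circuit simulation.

```python
def partition_options(tag_bits: int, legs_per_bit: int = 2) -> list[tuple[int, int]]:
    """Return (n_ways, legs_per_partition) pairs for an n-way DP domino.

    Assumes each partition holds whole XOR cells, so n_ways divides tag_bits.
    """
    return [
        (n_ways, legs_per_bit * tag_bits // n_ways)
        for n_ways in range(1, tag_bits + 1)
        if tag_bits % n_ways == 0
    ]


if __name__ == "__main__":
    # 12b comparator: 24 legs, e.g. 2 ways of 12 legs or 4 ways of 6 legs.
    print(partition_options(12))
    # 40b comparator: 80 legs, e.g. 10 ways of 8 legs or 20 ways of 4 legs.
    print(partition_options(40))
```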

A 40b tag comparator with a 256-entry tag memory array using the proposed DP domino technique is implemented in a 1.8V, 180nm, 4-metal CMOS technology. Fig. 11 shows the tag memory layout, which has an area of 363.0 µm × 696.0 µm. The tag comparator, “COMP”, is placed between the “I/F” circuits composed of data latches and selectors connecting to an instruction-fetch or a data-fetch unit. The tag comparator part is enlarged in Fig. 12. In order to minimize the parasitic capacitance at the dynamic node $E'[0]$ inside the DP domino, we laid out all transistors of the NMOS network in a compact area of 35.5 µm × 97.0 µm. The metal lines that connect the tag comparator and the sense amps consume an additional 4.6% area due to the dual-rail-input circuit. This extra area can be significantly reduced when implemented in an advanced technology with more metal layers. The delay and power consumption of the proposed 40b tag comparator are 273 ps and 3.51 µW/MHz, respectively. Fig. 13 shows the layout of one partition cell. Four legs of the NMOS network, or 2 sets of 2-input exclusive OR, are laid out on the left half. Inverters for the address signals are located at the top and bottom of the right half. The enhanced diode and local keeper are laid out at the center of the right half. Although the area penalty due to the enhanced diode and local keeper circuit is 30% of each partition cell, the impact on total tag memory area is less than 2% as shown in Fig. 11.
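For orientation, the quoted dimensions and the power figure can be turned into a few derived numbers. The sketch below uses hypothetical names. The 1 GHz operating point is an assumption chosen only for illustration, and the power estimate further assumes that the comparator evaluates the worst-case vector on every cycle. The area share computed here is that of the NMOS network alone, which is a different quantity from the sub-2% penalty of the diode and keeper circuits.

```python
TAG_MEMORY_UM = (363.0, 696.0)   # tag memory layout (µm), from the text
NMOS_NETWORK_UM = (35.5, 97.0)   # comparator NMOS network (µm), from the text
POWER_UW_PER_MHZ = 3.51          # comparator power, from the text


def area_um2(dims: tuple[float, float]) -> float:
    """Rectangle area in square micrometers."""
    return dims[0] * dims[1]


def nmos_network_share() -> float:
    """Fraction of the tag memory area taken by the NMOS network."""
    return area_um2(NMOS_NETWORK_UM) / area_um2(TAG_MEMORY_UM)


def power_mw(freq_mhz: float) -> float:
    """Power in mW at freq_mhz, assuming evaluation on every cycle."""
    return POWER_UW_PER_MHZ * freq_mhz / 1000.0


if __name__ == "__main__":
    print(area_um2(TAG_MEMORY_UM), area_um2(NMOS_NETWORK_UM))
    print(f"{100 * nmos_network_share():.2f}% of the tag memory area")
    print(f"{power_mw(1000.0):.2f} mW at an assumed 1 GHz")
```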

We also compared the delay and the power consumption with those of conventional circuits. Conventionally, a 40b tag comparator is designed using (i) a 2-stage domino with 4b tag comparators and a 10b OR, or (ii) a complex domino with 10b tag comparators and a 4-input NAND output driver, because conventional domino circuit technologies make a multiple-stage structure faster than one large fan-in gate [3-4]. Table 1 compares the 40b DP-domino comparator with the two conventional designs in terms of delay, power and DC noise tolerance. For fair comparison, we designed all three circuits to meet a target DC noise tolerance of $\frac{V_{DD}}{4}$. The DC noise tolerance is defined as the minimum input DC noise voltage that causes the output voltage to flip [2]. The DC noise tolerance was simulated using SPICE by applying a slow 1µs ramp to the input voltage and measuring the change in output voltage. This slow transient simulation determined the keeper ratio such that each circuit in Table 1 meets the target DC noise tolerance of $\frac{V_{DD}}{4}$. As described in Section III, the $\frac{1}{n}$-sized small keeper distributed in $n$ partitions meets this noise robustness for the proposed DP domino circuit. Among the conventional designs, complex domino turned out to be the fastest under the iso-robustness condition. The proposed DP domino is 33% faster than the optimized complex domino design. The power consumption was simulated using the worst-case delay input vector. Power of the DP domino is 23% smaller than that of the complex domino and 5% larger than that of an optimized 2-stage domino. In general, the power consumption of a wide-input gate is not significant in a chip. Although such a gate lies on the critical path for speed, it is not used frequently enough to change the power consumption of the chip significantly.

The benefits of the proposed technique will be significant enough to overcome the lower supply voltage, the larger leakage current due to lower Vth, and the increasing local variation in future technologies. Fig. 14 shows the dependence of the delay of domino circuits on supply voltage (VDD). The DP domino has a speed advantage in the practical range between 0.9 and 1.8 V. The weak contention current of the divided keeper helps to improve the VDD scalability even though the DP domino has a stacked NMOS structure. Fig. 15 shows the delay advantage of the DP domino over the complex domino estimated for future technologies [5]. Under the assumption of a constant keeper ratio, the delay advantage decreases from 33% at 180nm to 23% at 70nm. However, future technologies require a larger keeper ratio due to the aggravated pull-down leakage of lower-threshold NMOS transistors and the increasing local variation. Under the assumption of a keeper ratio that increases by 1% per generation, the DP domino offers a 58% improvement in speed in a 70nm technology. Under the assumption of a keeper ratio that increases by 2% per generation, the DP domino gives an even greater speed advantage, as shown in Fig. 15. The 10b tag comparator part of the complex domino at 70nm cannot operate due to the strong keeper competing against the NMOS evaluation current, while the small contention current from the small distributed keepers of the DP domino offers the same DC noise tolerance with improved speed. Hence, the proposed technique can be a viable solution to the large contention current problem caused by increasing keeper sizes in future technologies [1-2].
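The keeper-ratio scenarios can be tabulated with a short sketch. It rests on two assumptions that the text does not state explicitly: that the starting point is the 5% keeper ratio used at 180nm in the simulations above, and that a "1% increase per generation" means one percentage point per technology node. The function name is hypothetical.

```python
NODES_NM = (180, 130, 100, 70)  # technology generations considered in the text


def keeper_ratio_schedule(base_pct: float = 5.0, step_pct: float = 1.0) -> dict[int, float]:
    """Keeper ratio (%) per node, adding step_pct points each generation."""
    return {node: base_pct + i * step_pct for i, node in enumerate(NODES_NM)}


if __name__ == "__main__":
    for label, step in (("constant", 0.0), ("+1%/gen", 1.0), ("+2%/gen", 2.0)):
        print(label, keeper_ratio_schedule(step_pct=step))
```

Under these assumptions the keeper ratio at 70nm would be 8% in the 1% scenario and 11% in the 2% scenario.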

### Table 1

<table>
<thead>
<tr>
<th>Circuit Structure</th>
<th>Delay (ps)</th>
<th>Delay (normalized to complex domino)</th>
<th>Power (µW/MHz)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Proposed DP domino TC</td>
<td>273</td>
<td>0.67</td>
<td>3.51</td>
</tr>
<tr>
<td>Complex domino TC w/ 4NAND</td>
<td>405</td>
<td>1.00</td>
<td>4.54</td>
</tr>
<tr>
<td>2-stage domino w/ 4b TC and 10b OR</td>
<td>414</td>
<td>1.02</td>
<td>3.34</td>
</tr>
</tbody>
</table>

Delay, power @1.8V, 25C, nominal
Noise Robustness @1.8V, 110C, fast, normalized to VDD
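
The percentages quoted in the text can be cross-checked against Table 1. The sketch below assumes that the values 273, 405 and 414 are delays in ps and that 3.51, 4.54 and 3.34 are powers in µW/MHz, which is consistent with the 3.51 µW/MHz quoted for the proposed comparator. The dictionary keys and function name are hypothetical.

```python
# (delay in ps, power in µW/MHz), read from Table 1 under the stated assumption
TABLE_1 = {
    "dp_domino": (273.0, 3.51),
    "complex_domino": (405.0, 4.54),
    "two_stage_domino": (414.0, 3.34),
}


def percent_change(value: float, reference: float) -> float:
    """Signed change of value relative to reference, in percent."""
    return 100.0 * (value - reference) / reference


if __name__ == "__main__":
    dp_delay, dp_power = TABLE_1["dp_domino"]
    cx_delay, cx_power = TABLE_1["complex_domino"]
    _, ts_power = TABLE_1["two_stage_domino"]
    print(f"delay vs complex domino: {percent_change(dp_delay, cx_delay):+.1f}%")
    print(f"power vs complex domino: {percent_change(dp_power, cx_power):+.1f}%")
    print(f"power vs 2-stage domino: {percent_change(dp_power, ts_power):+.1f}%")
```

Rounded to whole percent, these reproduce the 33% delay reduction, the 23% power reduction, and the 5% power increase stated in the text.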

![Fig. 14. Delay time dependencies of domino circuits on supply-voltage.](image-url)

![Fig. 15. Delay advantage in future technologies. KP is assumed to increase 1% in every generation for “Increasing KP”.](image-url)

V. CONCLUSIONS

A fast 40b tag comparator for 64b microprocessors using DP domino has been proposed. DP domino reduces the parasitic capacitance and keeper size of a high fan-in gate. DP domino also offers a 33% delay improvement over a conventional complex domino circuit for fast tag comparison in a 1.8V, 180nm CMOS technology. Scaling implications of the proposed technique are also simulated in predictive 130nm, 100nm and 70nm technologies. Even in future technologies, the proposed DP domino circuit has significant advantages over conventional domino techniques.

REFERENCES


Hiroaki Suzuki received the B.S. and M.S. degrees in electrical engineering from Osaka Institute of Technology, Osaka, in 1989 and 1991, respectively. He received the MSEE degree from Purdue University, West Lafayette, Indiana, USA, in 2003. In 1991 he joined the LSI Laboratory, Mitsubishi Electric Corporation, Hyogo, Japan. From 1991 to 1997, he worked on the research and development of high-speed CMOS/BiCMOS logic LSIs and high-speed cores of floating-point arithmetic units. In 1997, he worked on a research project on low-power technology and SOI circuits. In 1998 he transferred to the System LSI Development Center of Mitsubishi Electric Corporation. From 1998 to 2001 he developed application-specific processors and general-purpose microcontrollers. In 2002 and 2003, he studied at the graduate school of Purdue University, West Lafayette, Indiana, USA, under a scholarship program of Mitsubishi Electric Corp. In 2004 he transferred to Renesas Technology Corp. Since then, he has been engaged in research and development of a high-speed and low-power DSP core.

Chris H. Kim (S’98) received the B.S. degree in electrical engineering and the M.S. degree in biomedical engineering from Seoul National University, Seoul, Korea, in 1998 and 2000, respectively. He received the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, Indiana, USA. He spent a year at Intel Corporation, where he performed research on variation-tolerant circuits, on-die leakage sensor design and crosstalk noise analysis. He joined the electrical and computer engineering faculty at the University of Minnesota, Minneapolis, MN, in 2004.

Mr. Kim is the recipient of the 2006 IBM Faculty Partnership Award, 2005 IEEE Circuits and Systems Society Outstanding Young Author Award, 2005 ISLPED Low Power Design Contest Award, 2003 Intel Ph.D. Fellowship Award, 2001 Magoon’s Award for Excellence in Teaching, and the best paper award in 1999 IEEE-EMBS APBME. He is a co-author of 30+ journal and conference papers and serves as a technical program committee member for ISLPED, ASSCC, ICCAD, ISQED, and ICICDT. His current research interests include theoretical and experimental aspects of VLSI circuit design in nanoscale technologies.

Kaushik Roy received the B.Tech. degree in electronics and electrical communications engineering from the Indian Institute of Technology, Kharagpur, India, and the Ph.D. degree from the Electrical and Computer Engineering department of the University of Illinois at Urbana-Champaign in 1990. He was with the Semiconductor Process and Design Center of Texas Instruments, Dallas, where he worked on FPGA architecture development and low-power circuit design. He joined the Electrical and Computer Engineering faculty at Purdue University, West Lafayette, IN, in 1993, where he is currently a Professor and holds the Roscoe H. George Professorship of Electrical and Computer Engineering. His research interests include VLSI design/CAD for nano-scale Silicon and non-Silicon technologies, low-power electronics for portable computing and wireless communications, VLSI testing and verification, and reconfigurable computing. Dr. Roy has published more than 300 papers in refereed journals and conferences, holds 8 patents, and is a co-author of two books on Low Power CMOS VLSI Design (John Wiley & McGraw Hill).

Dr. Roy received the National Science Foundation Career Development Award in 1995, the IBM faculty partnership award, the ATT/Lucent Foundation award, and best paper awards at the 1997 International Test Conference, the IEEE 2000 International Symposium on Quality of IC Design, the 2003 IEEE Latin American Test Workshop, 2003 IEEE Nano, and the 2004 IEEE International Conference on Computer Design. Dr. Roy is currently a Purdue University Faculty Scholar. He is the Chief Technical Advisor of Zenasis Inc. and a Research Visionary Board Member of Motorola Labs (2002). He has been on the editorial boards of IEEE Design and Test, IEEE Transactions on Circuits and Systems, and IEEE Transactions on VLSI Systems. He was Guest Editor for the Special Issue on Low-Power VLSI in IEEE Design and Test (1994), IEEE Transactions on VLSI Systems (June 2000), and IEE Proceedings -- Computers and Digital Techniques (July 2002). Dr. Roy is a fellow of IEEE.