# Clock Tree Synthesis For Multi-Chip Modules<sup>1</sup>

Daksh Lehther and Sachin S. Sapatnekar Department of Electrical & Computer Engineering Iowa State University, Ames, IA 50011

## Abstract

While designing interconnect for MCM's, one must take into consideration the distributed RLC effects, due to which signals may display nonmonotonic behavior and substantial ringing. This paper considers the problem of designing clock trees for MCM's. A fully distributed RLC model is utilized for AWE-based analysis and synthesis, and appropriate measures are taken to ensure adequate signal damping and for buffer insertion to satisfy constraints on the clock signal slew rate. Experimental results, verified by SPICE simulations, show that this method can be used to build clock trees with near-zero skews.

## 1 Introduction

Multi-Chip Modules (MCM's) provide a medium for integration of several bare dies on a multi-layer substrate. By virtue of a faster interconnect MCM's aim at alleviating, to a large extent, the bottleneck offered by conventional packaging. However the interconnection medium of a typical MCM substrate is characterized by a significant inductance and routing distances of several centimeters. Accurate analysis of MCM interconnect requires the use of lossy transmission line models. In comparison, the onchip interconnect design has traditionally been performed by modeling lines as lumped resistance-capacitance (RC) networks. The need for new methods which take into account the requirements of the MCM scenario while being comparable in efficiency to traditional IC design techniques is of prime importance. This paper addresses the problem of designing the clocking network for an MCM.

The design of the clock distribution network of an MCM is critical from the point of view of achieving desired system speed along with reliable operation. The clock distribution network on a MCM substrate can be considered to be a tree of lossy transmission-lines delivering the clock signal to the various dies placed on the substrate, with buffers inserted into the network to maintain performance constraints. The problem of constructing a clock tree for MCM has three primary considerations:

a) that the lines are either critically damped, or the overshoot is acceptably constrained.

b) that the clock skew is minimized.

c) that constraints on the slew rate are met.

The design of zero-skew clock-trees for IC's using an Elmore delay equalization algorithm was proposed by Tsay in [6]. This algorithm hierarchically merges zero-skew subtrees by selecting a tapping point on the line interconnecting the two trees such that the delay to the leaf nodes of the trees is equal. In this paper, we generalize the zero-skew design methodology in [6] to a higher order approximation of the voltage waveform, so as to facilitate the design of zero-skew clock-trees for an MCM scenario, while meeting the above mentioned objectives. We also use a *completely* distributed model for the interconnect here, rather than a cascade of lumped sections, as has been done previously, and we find that this gives more accurate estimates of delay values.

There has been relatively little work in using higher order models for interconnect optimization. Related research in [7, 8] present methods for designing interconnect exhibiting transmission line behavior.

This paper uses a second-order distributed parameter transmission-line model [5] to construct a zero-skew tree. The computational expense of the approach presented here is very low. The work presented in this paper aims at minimizing both the skew as well as ensuring an appropriate damping for clock trees. It is important to control ringing in the clocking network because this could lead to undesirable cross-talk, and because high overshoots/undershoots could also cause devices to switch incorrectly. However, we consider the possibility of allowing a small overshoot, as this can ensure an improved signal slew rate.

#### 2 Evaluation of Interconnect Response

The concept of asymptotic waveform evaluation (AWE) [3] has been used widely in recent years to simulate and design the interconnect. AWE involves approximating the response of a circuit by a lower order transfer function.

#### 2.1 Approximation of Transfer Function

A reduced order model of the interconnect is obtained here to evaluate the response of the RLC lines that compose the clock-tree by an AWE technique called reciprocal expansion (REX) [5]. The computational efficiency of a distributed-parameter formulation of REX is utilized to model the RLC lines. The response at the output of any leaf-node with respect to the root of a subtree in the following sections, is approximated by a transfer function

$$H(s) = \frac{V_{leaf}(s)}{V_{root}(s)} = \frac{1}{1 + b_1 s + b_2 s^2 + \dots + b_n s^n}$$
(1)

<sup>&</sup>lt;sup>1</sup>This work was supported in part by the National Science Foundation under award MIP-9502556.

REX differs from conventional AWE in respect that, it matches the inverse of the transfer function, i.e.,  $\frac{1}{H(s)}$  to the Maclaurin series. This proves useful in the computation of admittance as described below. It is also advantageous due to the fact that for a transfer function of the type (1) the Padé approximation step reduces to taking the reciprocal of the Maclaurin series.<sup>1</sup>

#### 2.2 Distributed Parameter REX

A second-order approximation to the transfer-function of RLC lines, presented in [5], is used in the design of clocktree in the following sections. The conductance to ground is assumed to be zero. The second-order distributed parameter REX is briefly introduced here for completeness.

Let the per unit resistance, capacitance, inductance parameters be represented as  $\alpha, \beta, \gamma$  respectively. Consider a line of length l, terminated in an admittance at the far end (x = 0). The admittance and the transfer function  $\frac{V_{x=l}(s)}{V_{x=0}(s)}$ , at the near end (x = l) are to be determined.

## 2.2.1 Admittance Computation

Assuming that Y(0) at x = 0, is the following polynomial in s ( $Y_0(x) = 0$ , because conductance to ground is assumed to be zero),

$$Y(0) = Y_1(0)s + Y_2(0)s^2$$
(2)

The second-order approximation of the input admittance  $Y(x) = Y_1(x)s + Y_2(x)s^2$  at x = l can be determined from equations (3) and (4).

$$Y_1(x) = Y_1(0) + \beta x$$
 (3)

$$Y_2(x) = Y_2(0) - \alpha x \left(\frac{(\beta x)^2}{3} + \beta x Y_1(0) + (Y_1(0))^2\right) \quad (4)$$

## 2.2.2 Voltage Computation

Given a second-order approximation to the voltage at x = 0 and  $V_{out}$  is the voltage at the leaf node of interest,

$$V(0) = (1 + b_1(0)s + b_2(0)s^2) V_{out}$$
(5)

We need to find the voltage at x = l

$$V(x) = (1 + b_1(x)s + b_2(x)s^2) V_{out}$$
(6)

The second-order approximation of V(x) can be computed using the Equations (7) and (8).

$$b_1(x) = b_1(0) + \frac{\alpha \beta x^2}{2} + \alpha Y_1(0)x \tag{7}$$

$$b_{2}(x) = b_{2}(0) + x(\alpha(Y_{2}(0) + Y_{1}(0)b_{1}(0)) + \gamma Y_{1}(0)) + \frac{x^{2}}{2}(\alpha\beta b_{1}(0) + \gamma\beta) + \frac{x^{3}}{6}\alpha^{2}\beta Y_{1}(0) + \frac{x^{4}}{24}\alpha^{2}\beta^{2}$$
(8)



3 Zero-Skew Clock Tree Construction

A bottom up recursive procedure as in [6], is used here to construct a zero-skew clock tree. The input consists of a description of the location and loading capacitance of each clock pin. We begin by assigning each clock pin to a separate subtree; initially, each subtree consists of just one node corresponding to the clock pin, and this node is considered to be the root of the subtree. At each step, we merge two subtrees. The merging process requires the determination of a point on the line interconnecting the trees such that the delay to all the leaf nodes is the same; this point is taken to be the root of the merged subtrees, and the process continues recursively. We describe one step of the recursive process here, involving the determination of a zero-skew merging point for two subtrees. The procedure comprises of three steps:

- Admittance computation at the root of a subtree
- Voltage computation at the root of a subtree
- Merging of a pair of subtrees

In general, consider the merging of two zero-skew subtrees. Let  $Y^X(0)$  and  $V^X(0)$  be the polynomials describing the input admittance and voltage at the root of what will be henceforth be referred to as "left" and "right" subtrees.

$$Y^{X}(0) = Y^{X}_{1}(0)s + Y^{X}_{2}(0)s^{2}$$
(9)  
$$V^{X}(0) = 1 + b^{X}_{1}(0)s + b^{X}_{2}(0)s^{2}$$
(10)

where, X = L represents left subtree and X = R the right subtree. This merging is depicted in Figure 1. Consider the two subtrees to be connected by a distributed RLC line of length l (i.e., the Manhattan distance between the root of the subtrees). Consider a point on the line which divides it in a ratio r such that the wire interconnecting the right and left subtrees now consists of two distributed RLC lines, i.e., from x = 0 to  $x = l_1$  and x = 0 to  $x = l_2$ . The length of the line from the merging-point to the root of right subtree is

$$l_1 = r \cdot l \tag{11}$$

and the length of the line to the root of left subtree is

$$l_2 = (1 - r) \cdot l \tag{12}$$

The process of zero-skew merging requires the determination of r such that the skew is zero. Since the admittance and voltage at the far end of the lines are known from equations (9),(10), the voltage at the merging point, i.e.,

$$V^{R}(l_{1}) = 1 + b_{1}^{R}(l_{1})s + b_{2}^{R}(l_{1})s^{2}$$
(13)

$$V^{L}(l_{2}) = 1 + b_{1}^{L}(l_{2})s + b_{2}^{L}(l_{2})s^{2}$$
(14)

<sup>&</sup>lt;sup>1</sup>It should also be pointed out that while Equation 1 is a [0/n]Padé approximant, it is possible to perform moment-matching to obtain a [m/n] Padé approximant using REX.



Figure 2: Damping of zero-skew tree

can be determined using equations (7) and (8).

We will consider the case where the first order moments (of  $\frac{1}{H(s)}$ ) is matched, i.e.,  $b_1^R(l_1) - b_1^L(l_2) = 0$ . Substituting equations (11) and (12) and solving for r,

$$r = \frac{\frac{\alpha\beta}{2}l^2 + \alpha l Y_1^L(0) + (b_1^L(0) - b_1^R(0))}{\alpha\beta l^2 + \alpha l (Y_1^L(0) + Y_1^R(0))}$$
(15)

The admittance of the tree formed as a result of merging the left and right subtrees can be calculated as a sum of their admittances given by equations (3) and (4). The voltage at the root of this tree is given by averaging the coefficients corresponding to the left and right subtrees. Hence, at the end of each stage of merging, there exist a set of subtrees whose second-order admittance and voltage estimates at the root are know.

For typical parameter values (from [2]) it was found that the first-order moments dominate the delay characteristics of the waveform in the final subtree. However it is essential to perform voltage computation up to at least the second order to determine a suitable termination.

## 4 Damping Of Zero-Skew Trees

Due to non-negligible inductance, the routing constructed on the MCM substrate results in under-damped behavior. The zero skew trees are damped appropriately here by series-termination at the source end. The secondorder estimate of voltage computed during the construction of the clock-tree is used to calculate the value of the series-termination for this tree. The series termination could be implemented either by sizing the clock driver for a specific output resistance or by including a discrete resistance in series with the driver.

To determine the termination resistance  $R_b$  consider a zero-skew tree (as shown in Figure 2), whose root node is labeled as u. The admittance and voltage at u are,

$$Y(u) = Y_1(u)s + Y_2(u)s^2$$
(16)

$$V(u) = 1 + b_1(u)s + b_2(u)s^2$$
(17)

Given that the voltage at the driver (node v) is

$$V(v) = 1 + b_1(v)s + b_2(v)s^2$$
(18)

further,

$$V(v) = (1 + Y(u)R_b)V(u)$$
(19)

It can be shown that  $R_b$ ,

$$R_b = \frac{\eta \pm \sqrt{\eta^2 - 4Y_1(u)^2(b_1(u)^2 - 4\zeta^2 b_2(u))}}{2Y_1(u)^2}$$
(20)

where,  $\eta = (4\zeta^2(Y_2(u) + b_1(u)Y_1(u)) - 2b_1(u)Y_1(u))$  and  $\zeta$  is the damping factor. The value of  $R_b$  for critical damping is obtained by setting  $\zeta = 1$  in the Equation (20).

Similarly any under-damped condition can be obtained for  $0 < \zeta < 1$ , and the corresponding maximum overshoot is given as,

$$V_{peak} = 1 + \exp^{\left(-\pi \frac{\zeta}{\sqrt{1-\zeta^2}}\right)}$$
(21)

Alternatively, substituting the value of  $\zeta$  in Equation (20), the clock-tree can be designed for any acceptable overshoot.

$$\zeta = \frac{ln(V_{peak} - 1)}{\pi\sqrt{1 + (ln(V_{peak} - 1)/\pi)^2}}$$
(22)

## 5 Insertion Of Buffer Stages

The phase delay of large clock-trees can be significant and may limit the system clock-speed. In order to reduce the phase delay, buffer stages are introduced as part of the clock-tree construction. Buffers also perform the role of regenerating the signals, by supplying the current necessary to drive the subtrees. In addition to achieving these advantages, we consider sizing of buffers to hierarchically damp the clock-tree. It should be pointed out that in the MCM scenario the buffers can be fabricated only on the dies placed on the substrate. We assume the availability of two buffers per die and include the detour to a buffer as part of our delay computation. A linear model for a buffer is assumed. The computation of buffer resistance is as shown in the previous section. The load seen by the buffers is balanced to minimize skew introduced due to the nonlinearity of the buffers. The concept is similar to buffer relocation [1] or delay equalization [4] and hence is not elaborated here.

#### 6 Summary of the Algorithm

The following pseudo-code summarizes the procedure for clock-tree synthesis.

```
MCM_Clock-Tree() {
/* Initialization */
/* Each pin is a subtree */
/* Voltage is a unit source */
/* Admittance is the capacitance at each pin */
Level_number = 0;
Number_of_subtrees = Number_of_pins;
    While (Number_of_subtrees >= 1){
        Zero-Skew Merge();
        Compute_Admittance();
        Compute_Voltage();
          If (Delay > Delay_constraint){
              Equalize_Load();
              Insert_Buffers();
          }
   Level_number ++;
    Number_of_subtrees = Number_of_subtrees/2;
    }
Series_Termination();
}
```

The routine Zero-Skew\_Merge finds a merging point which minimizes the skew for two given subtrees. The routines Compute\_Admittance, Compute\_Voltage calculate the input-admittance and the voltage at the root of the new subtree. Buffers are inserted at the root of subtrees at appropriate levels by the Insert\_Buffers routine based on the delay requirements. The Equalize\_Load routine balances the load seen by the buffers. The recursive application of these routines on a set of pins results in the desired zeroskew clock-tree. The termination at the root of the tree is computed by the *Series\_Termination* routine.

#### 7 Experimental Results

The procedure described above was tested on a set of examples which portray a typical MCM routing scenario. A substrate with area 10 cm x 10 cm was assumed. The distribution of the clock pins and the loading capacitances at each pin was generated randomly. The number of clock pins varies from 8 to 128 in the examples MCM-1 through MCM-5. The clock-trees were designed assuming MCM-L process parameters obtained from [2]  $(\alpha = 0.24\Omega/cm, \beta = 7.2nH/cm, \gamma = 0.76pF/cm)$ . A constant width of 10  $\mu$ m and a Manhattan geometry is assumed for the routing. The clock-trees were simulated with SPICE to verify accuracy.

#### 7.1 Clock Tree Construction

The characteristic parameters of the clock-trees constructed using our approach are presented in Table 1, corresponding to termination condition  $\zeta = 1.0$ . These clock-trees have a driver at the root and no additional buffers inserted. The Figure 3, 5, 6 show waveforms corresponding to the undamped (large, low resistance driver), critically damped, and overshoot controlled case  $(\zeta = \frac{1}{\sqrt{2}})$  for MCM-5. The clock-skews were measured at the 50% delay point. The maximum observed skew is less than 1% of a clock-period of 10nS (100 MHz).

#### 7.2 Damping Of Clock Trees

The damping of the clock-tree was performed with series termination at the source end. The effect of a range of damping conditions ( $\zeta = 0.5$  to 1.0) was studied. Figure 4 shows the step response at a particular sink node of MCM-2 for the range of  $\zeta$ . Table 2 summarizes the variation of rise-time ( $t_r$ , 10%  $V_{dd}$  to 90%  $V_{dd}$ ), delay ( $t_d$ , 50%  $V_{dd}$ ), peak overshoot ( $V_{peak}$ ), driver resistance  $R_b$ ) with the damping factor ( $\zeta$ ).



Figure 3: Undamped

Figure 4:  $\zeta = 0.5$  to 1.0

Table 1: Termination with critical damping

| Example |       | $\zeta = 1.0$   |           |           |        |  |
|---------|-------|-----------------|-----------|-----------|--------|--|
| Name    | Sinks | $R_{h}(\Omega)$ | $t_d(nS)$ | $t_r(nS)$ | % Skew |  |
| MC M-1  | 8     | 4.326           | 0.9156    | 1.028     | 0.0128 |  |
| MC M-2  | 16    | 4.042           | 1.402     | 1.832     | 0.0627 |  |
| MC M-3  | 32    | 3.283           | 1.296     | 1.674     | 0.0052 |  |
| MC M-4  | 64    | 1.647           | 1.902     | 2.842     | 0.3218 |  |
| MC M-5  | 128   | 1.390           | 2.419     | 3.874     | 0.1624 |  |

| Table 2: Termination varying damping factor |           |            |               |                           |  |  |  |
|---------------------------------------------|-----------|------------|---------------|---------------------------|--|--|--|
| ς                                           | $t_d(nS)$ | $t_r (nS)$ | $R_b(\Omega)$ | V <sub>peak</sub> (Volts) |  |  |  |
| 0.5                                         | 1.193     | 0.950      | 2.001         | 1.153                     |  |  |  |
| 0.6                                         | 1.233     | 1.033      | 2.408         | 1.090                     |  |  |  |
| 0.7                                         | 1.273     | 1.202      | 2.816         | 1.042                     |  |  |  |
| 0.8                                         | 1.311     | 1.401      | 3.225         | 1.014                     |  |  |  |
| 0.9                                         | 1.358     | 1.612      | 3.633         | 1.009                     |  |  |  |
| 1.0                                         | 1.402     | 1.832      | 4.042         | 1.000                     |  |  |  |

## 7.3 Buffer Insertion

The effect of improvement in delay and slew rate was studied by inserting buffers as described in Section 6. The sharpest slew rate is observed for the level closest to the leaf nodes and as the level of insertion is moved toward the root the slew rate and delay become worse. A choice of one or more buffer levels for a tree could be made depending on the timing constraints and buffer availability.



Figure 5: Critical damped Figure 6: Over-shoot control 8 Conclusion

A distributed-parameter AWE-based technique for MCM clock tree construction has been developed. A second-order approximation is found to be adequate for practical circuits, is used to recursively build a clock tree and to select series terminations with either critical damping or controlled overshoot. The algorithms are computationally very efficient and experimental results exhibit low values of skew and appropriate waveform damping.

#### References

- J. Chung and C-K. Cheng, "Skew Sensitivity Minimization of Buffered Clock Tree," Proc. ICCAD, pp. 280-283, 1994.
- [2] R. C. Frye, "Physical Scaling and Interconnection Delays in Multichip Modules," *IEEE Transactions on Components*, *Packaging, and Manufacturing Technology*, vol. 17, pp. 30-37, 1994.
- [3] L. T. Pillage and R. A. Rohrer, "Asymptotic Waveform Evaluation for Timing Analysis," *IEEE Trans. CAD*, vol. 9, pp. 352-365, 1990.
- [4] S. Pullela, N. Menezes, J. Omar, and L. T. Pillage, "Skew and Delay Optimization for Reliable Buffered Clock Trees," *Proc. ICCAD*, pp. 556-562, 1993.
- [5] M. Sriram and S. M. Kang, *Physical Design Of Multi-Chip Modules*, Kluwer Academic Publishers, Boston, MA, 1994.
- [6] R.-S. Tsay, "An Exact Zero Skew Clock Routing Algorithm," *IEEE Transactions on Computer-Aided Design of Integrated Circuit and Systems*, vol. 12, pp. 242-249, 1993.
- [7] J. S. H. Wang and W. W-M. Dai, "Optimal Design of Self Damped Transmission Lines for Multi-chip Modules," Proc. ICCAD, pp. 594-598, 1994.
- [8] Q. Zhu, W. W-M. Dai, and J. G. Xi, "Optimal Sizing of High-Speed Clock Networks Based on Distributed RC and Lossy Transmission Line Models," *Proc. ICCAD*, pp. 628-633, 1993.