# Moment-based techniques for RLC clock tree construction

Daksh Lehther and Sachin S. Sapatnekar Department of Electrical & Computer Engineering Iowa State University

## 1 Introduction

Multi-Chip Modules (MCM's) provide a medium for integration of several bare dies on a multi-layer substrate. MCM technology has gained popularity in recent years with the promise of orders of magnitude reduction in inter-chip delay and power dissipation over single chip packaging [2]. By virtue of a faster interconnect MCM's aim at alleviating, to a large extent, the bottleneck offered by conventional packaging. However the interconnection medium of a typical MCM substrate is characterized by a significant inductance and routing distances of several centimeters. Accurate analysis of MCM interconnect design has been traditionally been performed by modeling lines as lumped resistance-capacitance (RC) networks. The need for new methods which take into account the requirements of the MCM scenario while being comparable in efficiency to traditional IC design techniques is of prime importance. This paper addresses the problem of designing the clocking network for an MCM.

The design of the clock distribution network of an MCM is critical from the point of view of achieving desired system speed along with reliable operation. The clock distribution network on a MCM substrate can be considered to be a tree of lossy transmission-lines delivering the clock signal to the various dies placed on the substrate, with buffers inserted into the network to maintain performance constraints. The interconnect typically shows considerable inductive effects, and it is not enough to model its behavior using Elmore time constants, as has been done for IC's (for example, in [17]). Therefore, the problem of constructing a clock tree for MCM has three primary considerations:

- a) that the lines are either critically damped, or the overshoot is acceptably constrained.
- b) that the clock skew is minimized.
- c) that constraints on the slew rate of the clocking waveform are met at each clock pin.

The design of zero-skew clock-trees for IC's using an Elmore delay equalization algorithm was proposed by Tsay in [17]. This algorithm hierarchically merges zero-skew subtrees by selecting a tapping point on the line interconnecting the two trees such that the delay to the leaf nodes of the trees is equal. Subsequent refinements such as [3, 5, 6, 8, 9, 13], have attempted to improve the results by adding criteria for selecting the zero-skew subtrees to be merged in each step, using wire width optimization, finding planar routings, etc. However, all of these have restricted themselves to the IC environment, and have not considered inductive effects. In this paper, we generalize the zero-skew design methodology in [17] to a higher order approximation of the voltage waveform, so as to facilitate the design of zero-skew clock-trees for an MCM scenario, while meeting the above mentioned objectives for clocktree design for MCM. We also use a *completely* distributed model for the interconnect here, rather than a cascade of lumped sections, as has been done previously, and we find that this gives more accurate estimates of delay values. Further refinements to this work, such as those in references [3, 5, 6, 8, 9, 13], are possible under the same framework but are not addressed here.

Some related research in [18] presents a method for the design of interconnect exhibiting transmission line behavior using the S-parameter model. The method involves the selection of appropriate widths for various branches of the given network such that the delay to the various sinks is minimized and the waveforms exhibit a specified damped behavior. This wiresizing strategy involves minimization of the maximum delay and the damping ratio error of all the sinks nodes as a nonlinear programming problem. The approach presented in [18] was aimed at the design of general signal networks and did not encompass the objective of clock-tree design, where the minimization of clock-skew requires that the delays to all sinks be exactly equal. The design of a clock-tree under lossy-transmission model was considered in [20]. This involves a strategy for minimizing the clock-skew by wire-sizing using a modified Gauss-Marquardt's least-square estimation technique. General RLC and transmission-lines were modeled as a part of this method, but it did not address the problem of ensuring a specific damping condition for the RLC lines of the clock-tree.

This paper uses a second-order distributed parameter transmission-line model [15] to construct a zero-skew tree. The computational expense of the approach presented here is very low. In comparison to the methods mentioned in the previous paragraph, the work presented in this paper aims at minimizing both the skew as well as ensuring an appropriate damping for clock trees. It is important to control ringing in the clocking network because this could lead to undesirable cross-talk, and because high overshoots/undershoots could also cause devices to switch incorrectly. However, we consider the possibility of allowing a small overshoot of controlled magnitude, as this can ensure an improved signal slew rate [1, 10, 19]. The specific contributions of this work are the use of a higher-order moment matching to reduce the skew, the design of the clock-tree with controlled damping criterion to ensure signal integrity while improving the timing performance, and a strategy for the use of buffered clock-trees in the MCM scenario. Our method has the advantage of low computation while being able to meet the design objectives. The accuracy of the models used has been validated by SPICE simulations of the clocking networks obtained from the algorithm.

A brief introduction to the second-order distributed model and its use to design zero-skew trees is presented in Section 2 and Section 3 respectively. Section 4 presents the concept of extending the first order delay moment matching to higher order moments. A method for damping clock trees is presented in Section 5, and involves the selection of a suitable series-termination. The procedure of selecting a suitable termination resistance to control the damping of the waveforms at the sinks of the clock-tree is also presented here. Next, the construction of buffered clock-trees is considered with the goal of meeting delay requirements and satisfying the damping criterion in Section 6. Section 7 provides an overview of the method and the results of the procedure on typical MCM examples are presented in Section 8.

## 2 Evaluation of Interconnect Response

The concept of asymptotic waveform evaluation (AWE) [12] has been used widely in recent years to simulate and design the interconnect. AWE involves approximation of the exacttransfer function of a circuit by a lower order transfer function. This process consists of two steps:

a) moment computation to the represent the original transfer-function as a Maclaurin series.b) moment matching the series to a lower order transfer-function by Padé approximation.

#### 2.1 Approximation of Transfer Function

A reduced order model of the interconnect is obtained here to evaluate the response of the RLC lines that compose the clock-tree by an AWE technique called reciprocal expansion (REX) [15]. The computational efficiency of a distributed-parameter formulation of REX is utilized to model the RLC lines. The response at the output of any leaf-node with respect to the root of a subtree in the following sections, is approximated by a transfer function of the form

$$H(s) = \frac{V_{leaf}(s)}{V_{root}(s)} = \frac{1}{1 + b_1 s + b_2 s^2 + b_3 s^3 + \dots + b_n s^n}$$
(1)

REX differs from conventional AWE in respect that, it matches the inverse of the transfer function, i.e.,  $\frac{1}{H(s)}$  to the Maclaurin series. This proves useful in the computation of admittance as described below. It is also advantageous due to the fact that for a transfer function of the type (1) the Padé approximation step reduces to taking the reciprocal of the Maclaurin series.<sup>1</sup> REX is a good choice for the kind of higher order approximation we use for the purpose of clock tree construction.

#### 2.2 Distributed Parameter REX

A second-order approximation to the transfer-function of RLC lines, presented in [15], is used in the design of clock-tree in the following sections. The conductance to ground is assumed to be zero. It should be pointed out that the frequency of operation of contemporary MCMs is

<sup>&</sup>lt;sup>1</sup>It should also be pointed out that while Equation 1 is a [0/n] Padé approximant, it is possible to perform moment-matching to obtain a [m/n] Padé approximant using REX.



Figure 1: Distributed-parameter transmission Line Model

well below the resonance frequency of the RLC lines for current technology parameters. Hence the second order approximation used here is accurate for the design in the frequency range that is of interest. A brief description of this procedure is presented here for completeness.

Let the per unit resistance, capacitance, inductance parameters be represented as  $\alpha, \beta, \gamma$ respectively. Consider a line of length l, terminated in an admittance at the far end (x = 0). The admittance and the transfer function  $\frac{V_{x=l}(s)}{V_{x=0}(s)}$ , at the near end (x = l) are to be determined. This is illustrated in Figure 1.

#### 2.2.1 Admittance Computation

Considering an infinitesimal section dx of the line at position x, the admittance at (x + dx) is as follows:

$$Y(x+dx) = \frac{Y(x) + s\beta dx}{1 + dx(\alpha + s\gamma)(Y(x) + s\beta dx)}$$
(2)

Assuming that Y(x) is known to be the following polynomial in s ( $Y_0(x) = 0$ , because conductance to ground is assumed to be zero),

$$Y(x) = Y_1(x)s + Y_2(x)s^2$$
(3)

The admittance at (x + dx) is

$$Y(x+dx) = Y_1(x+dx)s + Y_2(x+dx)s^2$$
(4)

The second-order approximation to the admittance at (x + dx) is given by expanding Equation (2) as a Maclaurin series,

$$Y(x + dx) = (Y_1(x) + \beta dx)s + (Y_2(x) - \alpha dx(Y_1(x) + \beta dx)^2)s^2$$
(5)

Comparing (4) and (5) gives the following differential equations,

$$\frac{dY_1(x)}{dx} = \beta \tag{6}$$

$$\frac{dY_2(x)}{dx} = -\alpha (Y_1(x))^2$$
(7)

Solving Equation (6) yields

$$Y_1(x) = Y_1(0) + \beta x$$
(8)

Substituting this equation in Equation (7) and integrating we have

$$Y_2(x) = Y_2(0) - \alpha x \left(\frac{(\beta x)^2}{3} + \beta x Y_1(0) + (Y_1(0))^2\right)$$
(9)

Hence given Y(0) at the end of a line of length l, the second order approximation of the looking-in admittance Y(l) can be determined from equations (8) and (9).

#### 2.2.2 Coefficient Computation

Given a second order approximation to the voltage at x = 0 and  $V_{out}$  is the voltage at the leaf node of interest,

$$V(0) = (1 + b_1(0)s + b_2(0)s^2) V_{out}$$
(10)

We need to find the voltage at x = l

$$V(l) = (1 + b_1(l)s + b_2(l)s^2) V_{out}$$
(11)

If second order approximation to the voltage at x is known, the voltage at (x + dx) is given as

$$V(x + dx) = V(x)(1 + dx(\alpha + s\gamma)((Y_1(x) + \beta dx)s + Y_2(x)s^2))$$
(12)

$$\frac{dV(x)}{dx} = (1 + b_1(x)s + b_2(x)s^2)(\alpha Y_1(x)s + (\alpha Y_2(x) + \gamma Y_1(x))s^2)$$
(13)

$$= \alpha Y_1(x)s + (\alpha Y_1(x)b_1(x) + \alpha Y_2(x) + \gamma Y_1(x))s^2$$
(14)

$$\frac{db_1(x)}{dx} = \alpha Y_1(x) = \alpha Y_1(0) + \alpha \beta x \tag{15}$$

$$\frac{db_2(x)}{dx} = \alpha Y_1(x)b_1(x) + \alpha Y_2(x) + \gamma Y_1(x)$$
(16)

Solving (15) we get,

$$b_1(x) = b_1(0) + \frac{\alpha\beta x^2}{2} + \alpha Y_1(0)x$$
(17)

Substituting Equations (8), (9), and (17) in (16) and integrating we get

$$b_{2}(x) = b_{2}(0) + x(\alpha(Y_{2}(0) + Y_{1}(0)b_{1}(0)) + \gamma Y_{1}(0)) + \frac{x^{2}}{2}(\alpha\beta b_{1}(0) + \gamma\beta) + \frac{x^{3}}{6}\alpha^{2}\beta Y_{1}(0) + \frac{x^{4}}{24}\alpha^{2}\beta^{2}$$
(18)

Hence the given voltage V(0), the second-order approximation to V(l) can be computed using the Equations (17) and (18).

## 3 Zero-Skew Clock Tree Construction for RLC Lines

The process used here to construct a zero-skew clock tree involves a bottom-up recursive process, as in [17]. The input consists of a description of the location and loading capacitance of each clock pin. We begin by assigning each clock pin to a separate subtree; initially, each subtree consists of just one node corresponding to the clock pin, and this node is considered to be the root of the subtree. At each step, we merge two subtrees. The merging process requires the determination of a point on the line interconnecting the trees such that the delay to all the leaf nodes is the same; this point is taken to be the root of the merged subtrees, and the process continues recursively. We describe one step of the recursive process here, involving the determination of a zero-skew merging point for two subtrees. The procedure comprises of three steps:

- Admittance Computation at the root of a subtree
- Voltage Computation at the root of a subtree
- Merging of a pair of subtrees

In general, consider the merging of two zero-skew subtrees. Let  $Y^{L}(0)$  and  $Y^{R}(0)$  be the polynomials describing the looking-in admittance of what will be referred to as "left" and "right" subtrees in the following sections.

$$Y^{L}(0) = Y_{1}^{L}(0)s + Y_{2}^{L}(0)s^{2}$$
(19)



Figure 2: Zero-skew merging under distributed parameter transmission line model

$$Y^{R}(0) = Y_{1}^{R}(0)s + Y_{2}^{R}(0)s^{2}$$
(20)

Let the voltage at the root of left (x = 0) and right  $(\dot{x} = 0)$  subtrees (see Figure 2) be described by polynomials of the form

$$V^{L}(0) = b_{0}^{L}(0) + b_{1}^{L}(0)s + b_{2}^{L}(0)s^{2}$$
(21)

$$V^{R}(0) = b_{0}^{R}(0) + b_{1}^{R}(0)s + b_{2}^{R}(0)s^{2}$$
(22)

Let  $\alpha, \beta, \gamma$  be the per unit resistance, capacitance and inductance of the wire respectively. The process of zero-skew merging involves determination of a ratio r which divides the the line connecting the subtrees such that the the length of the line from the tapping-point to the root of right subtree is

$$l_1 = r \cdot l \tag{23}$$

and the length of the line to the root of left subtree is

$$l_2 = (1-r) \cdot l \tag{24}$$

The merging is depicted in Figure 2. The positions  $\dot{x} = 0$  and x = 0, shown in this figure represents the far ends of the lines with lengths  $l_1$  and  $l_2$  respectively. Consider the wire interconnecting the right and left subtrees to consist of two distributed RLC lines, i.e., from  $\dot{x} = 0$  to  $\dot{x} = l_1$  and x = 0 to  $x = l_2$ . Since the admittance and voltage at the far end of the lines are known from equations (19), (20) and (21), (22) respectively, the voltage at the merging point, i.e.,

$$V^{R}(l_{1}) = b_{0}^{R}(l_{1}) + b_{1}^{R}(l_{1})s + b_{2}^{R}(l_{1})s^{2}$$

$$(25)$$

$$V^{L}(l_{2}) = b_{0}^{L}(l_{2}) + b_{1}^{L}(l_{2})s + b_{2}^{L}(l_{2})s^{2}$$

$$(26)$$

can be determined using the distributed-parameter REX. The first and second-order coefficients of the voltages at the other end of the lines is given by using equations (17) and (18).

We will first consider the case where the first order moments (of  $\frac{1}{H(s)}$ ) is matched; the generalized case is considered in the next section

$$b_1^R(l_1) - b_1^L(l_2) = 0 (27)$$

i.e.,

$$(b_1^R(0) - b_1^L(0)) + \frac{\alpha\beta}{2}(l_1^2 - l_2^2) + \alpha(l_1Y_1^R(0) - l_2Y_1^L(0)) = 0$$
(28)

substituting equations (24) and (23) and solving for r,

$$r = \frac{\frac{\alpha\beta}{2}l^2 + \alpha l Y_1^L(0) + (b_1^L(0) - b_1^R(0))}{\alpha\beta l^2 + \alpha l (Y_1^L(0) + Y_1^R(0))}$$
(29)

The admittance of the left and right branches of the tree formed by this merging can be calculated by using equations (8) and (9). The looking-in admittance of the new tree formed after the merge is the sum of the admittance of the left and right branches of the tree. The voltage at the root of this tree is given by averaging the coefficients corresponding to the left and right subtrees. Hence starting from the leaf nodes (clock pins), at the end of each stage of merging a set of subtrees with their second-order admittance and voltage estimates are available for the next stage of the merging process.

## 4 Selection of Merging Point based on Higher Order Moments

In the previous section the selection of a merging point was based on matching the first order moments. If a tapping point could be selected matching the higher moments up to some order n, the waveforms at the leaf nodes would be identical to a higher degree of accuracy. This in principle should minimize the skew. Consider the waveforms at the root of the left and right subtrees which are to be merged.

$$V^{R} = m_{0}^{R} + m_{1}^{R}s + m_{2}^{R}s^{2} + \dots + m_{o}^{R}s^{n}$$
(30)

$$V^{L} = m_{0}^{L} + m_{1}^{L}s + m_{2}^{L}s^{2} + \dots + m_{o}^{L}s^{n}$$
(31)

The moments in the equation below are represented in terms of r.

$$f = w_0 (m_0^R - m_0^L)^2 + w_1 (m_1^R - m_1^L)^2 + w_2 (m_2^R - m_2^L)^2 + \dots + w_n (m_n^R - m_n^L)^2$$
(32)

Minimum skew requires that the above function be minimized. The value of r which minimizes f is given by

$$\frac{df}{dr} = 0 \tag{33}$$

The solution for r is computationally inexpensive. For example in case of matching moments up to the second-order, Equation (32) is a sixth order polynomial and Equation (33) is a fifth order polynomial. Hence determining r for minimum skew involves finding the roots of a low order polynomial. It is to be noted that the moments display a large shift of order between one moment and the next. In the case of a typical interconnect network, one moment could be  $10^9$  larger than the next. Weights  $w_0 \cdots w_n$  are introduced so that the contribution from each moment is reflected equally or in a desired proportion in the objective function.

For typical parameter values (from [7]) it was found that the first-order moments dominate the delay characteristics of the waveform in the final subtree. Increasing the weight of the higher order moments with the intention of matching higher order moments resulted in the first order moments to differ slightly. This, however, manifests itself as a noticeable skew. Therefore, we prefer to use the first order moment for matching purposes.

An explanation for this can be found in the fact that our procedure suitably damps the final subtree using a terminating resistance, so that the response is close to monotone. For a monotone response, it is known that the first-order moments dominate the response, and hence matching these moments contributes to ensuring near-zero skews. However, this does not mean that our computations of the second-order moments has no purpose; we use this information in ensuring that the network is suitably damped.

# 5 Damping Of Zero-Skew Trees

Due to non-negligible inductance, the routing constructed on the MCM substrate results in under-damped behavior. The zero skew trees are damped appropriately here by series-



Figure 3: Damping of a zero-skew tree

termination at the source end. The clock tree construction as described in the previous section provides a second-order approximation of the waveform at all the sink nodes. This second-order estimate is used to calculate the value of the series-termination for this tree of transmission lines. The series termination could implemented either by sizing the clock driver for a specific output resistance or by including a resistance in series with the driver. This process is illustrated in Figure 3.

Consider a driver with output resistance  $R_b$  driving a tree whose root node is labeled as u. The downstream admittance at u is Y(u) and the voltage is V(u). The secondorder polynomials corresponding to these quantities computed as a part of the zero-skew-tree construction are given by (conductance to ground is assumed zero)

$$Y(u) = Y_1(u)s + Y_2(u)s^2$$
(34)

$$V(u) = 1 + b_1(u)s + b_2(u)s^2$$
(35)

Let the voltage at the input of the driver (node labeled v) be

$$V(v) = 1 + b_1(v)s + b_2(v)s^2$$
(36)

$$V(v) = (1 + Y(u)R_b)V(u)$$
(37)

Substituting equations (34) and (35) in (37),

$$V(v) = \frac{1}{H(s)} = (1 + (Y_1(u)s + Y_2(u)s^2)R_b)(1 + b_1(u)s + b_2(u)s^2)$$
(38)

$$= (1 + (R_b Y_1(u) + b_1(u))s + (b_2(u) + R_b Y_2(u) + b_1(u)Y_1(u)R_b)s^2)$$
(39)

The characteristic equation of a second-order system is given by [11],

$$\Delta = s^2 + 2\zeta \omega_n s + \omega_n^2 = 0 \tag{40}$$

where  $\zeta$  is the damping factor. Comparing equations (39) and (40)

$$2\zeta\omega_n = \frac{R_b Y_1(u) + b_1(u)}{(b_2(u) + R_b Y_2(u) + b_1(u)Y_1(u)R_b)}$$
(41)

and

$$\omega_n^2 = \frac{1}{(b_2(u) + R_b Y_2(u) + b_1(u) Y_1(u) R_b)}$$
(42)

Substituting Equation (42) in (41)

$$\frac{2\zeta}{\omega_n} = R_b Y_1(u) + b_1(u) \tag{43}$$

Squaring on both sides of the above equation and substituting Equation (42).

$$4\zeta^2(b_2(u) + R_b Y_2(u) + b_1(u)Y_1(u)R_b) = (R_b Y_1(u) + b_1(u))^2$$
(44)

Let

$$\eta = (4\zeta^2(Y_2(u) + b_1(u)Y_1(u)) - 2b_1(u)Y_1(u))$$
(45)

Solving for  $R_b$ ,

$$R_b = \frac{\eta \pm \sqrt{\eta^2 - 4Y_1(u)^2(b_1(u)^2 - 4\zeta^2 b_2(u))}}{2Y_1(u)^2}$$
(46)

The value of  $R_b$  for critical damping is obtained by setting  $\zeta = 1$  in the equation above. Similarly an under-damped condition can be obtained with  $0 < \zeta < 1$ , the maximum overshoot corresponding to this is given in [11] as

$$V_{peak} = 1 + \exp^{\left(-\pi \frac{\zeta}{\sqrt{1-\zeta^2}}\right)} \tag{47}$$

or,

$$\zeta = \frac{\ln(V_{peak} - 1)}{\pi\sqrt{1 + (\ln(V_{peak} - 1)/\pi)^2}}$$
(48)

Substituting the value for  $\zeta$  in Equation (46), the clock-tree can be designed for any acceptable overshoot. In general, under-damped condition implies a lower driver resistance which, in turn can, be interpreted as either the fact that longer lines can be driven, or that wires of lower widths can be used to increase the packing density. It can been seen that the delay  $t_d$ , defined as the time to reach 50% amplitude, is not affected significantly by  $\zeta$ . However the rise-time  $t_r$ , defined as the time taken for the signal to rise from 10% to 90% of the final amplitude, is strongly dependent on the damping ratio, as illustrated in the section on



Figure 4: Linear model of a buffer

results. This parameter  $(t_r)$  has a significant impact on the circuit timing as these voltage levels determine the switching of CMOS devices. It has been illustrated in [1] that choosing  $\zeta = \frac{1}{\sqrt{2}}$  results in a substantial decrease in  $t_r$  while the overshoot is minimal (4%). Hence the choice of driver using this method can improve the rise time characteristics of the tree while maintaining an acceptable waveform at the clock-tree sink nodes.

### 6 Insertion of Buffer Stages

The phase delay of large clock-trees can be significant and may limit the system clock-speed. In order to reduce the phase delay, buffer stages are introduced as part of the clock-tree construction. Buffers also perform the role of regenerating the signals, by supplying the current necessary to drive the sub-trees. In addition to achieving these advantages, we consider sizing of buffers to hierarchically damp the clock-tree. It should be pointed out that in the MCM scenario the buffers can be fabricated only on the dies placed on the substrate. We assume the availability of two buffers per die and include the detour to a buffer as part of our delay computation.

A linear model for a buffer (as shown in Figure 4) is assumed. The computation of buffer resistance is as shown in the previous section. The appropriately sized buffers are introduced at all nodes at a particular level of the clock-tree (as shown in Figure 5). Even though a level by level insertion scheme may not be globally optimal, it is known to work well for full binary trees [21] (zero-skew merging will always result in full binary trees). Further techniques such as shown in [3] can be adopted to reduce imbalance in subtrees being merged. Hence a level



Figure 5: Insertion of buffers



Figure 6: Buffer load equalization

by level buffer-insertion solution will not be far from optimality in reality. The effect of level at which buffers are inserted is presented in Section 8.

At any stage of the clock-tree construction of the zero skew tree construction, there are a set of n subtrees which are zero-skew-merged to construct  $\frac{n}{2}$  trees. A difference in the input admittance of two sub-trees is expected as part of the merging process. This difference in the admittance of the subtrees implies a unequal loading of the buffers. This will lead to skew due to the non-linearity of the buffers. We attempt to minimize the effects of this by ensuring that the admittance seen by all buffers at a stage is the same thereby ensuring similar switching behavior for all buffers. The concept presented below is similar to buffer relocation [4] or delay equalization [13].

Consider the case of inserting buffers at a level i of the tree. Let the original admittance of the tree be  $Y(u_K)$  (as shown in Figure 6).

$$Y(u_K) = Y_1(u_K)s + Y_2(u_K)s^2$$
(49)

Consider a target admittance  $Y(u_T)$ , taken to be the admittance of the subtree at the  $i^{\text{th}}$  level with the largest delay.

$$Y(u_T) = Y_1(u_T)s + Y_2(u_T)s^2$$
(50)

Where  $Y_1(u_T), Y_2(u_T)$  are the maximum of the first and second-order coefficients of the admittance polynomial among all the sub-trees at a specific level of the clock-tree. The admittance of all the sub-trees are equalized to  $Y(u_T)$  by introducing a wire of appropriate length and width between the root of the tree and the buffer. A lumped model for this wire segment is assumed here, since the lengths of the wire-segments involved are small and the computation is simplified as a result. The dimensions of a wire segment is calculated as follows,

We require,

$$Y(u_T) = \frac{1}{R_w + 1/(Y(u_K) + C_w s)}$$
(51)

Where  $R_w$  and  $C_w$  are the resistance and capacitance of the wire segment to be added to a tree rooted at node k. Substituting equations (50) and (49) in (51) and matching the powers of s we get

$$C_w = Y_1(u_T) - Y_1(u_K)$$
(52)

and

$$R_w = \frac{Y_2(u_T) - Y_2(u_K)}{(Y_1(u_K) + C_{wk})Y_1(u_T)}$$
(53)

The length l and width w of the wire segment is

$$l = \sqrt{\frac{R_w C_w}{\alpha \beta}} \tag{54}$$

$$w = \sqrt{\frac{\alpha C_w}{\beta R_w}} \tag{55}$$

This procedure was experimentally seen to be very effective when the subtree admittances were very different from  $Y(u_T)$ , and was not applied when the admittance value was close to  $Y(u_T)$ .

# 7 Summary of the Algorithm

The following pseudo-code summarizes the procedure for clock-tree synthesis.

```
MCM_Clock-Tree() {
/* Initialization */
Level_number = 0;
Number_of_subtrees = Number_of_pins;
/* Assume a unit voltage source */
/* Input admittance equals the capacitance at each pin */
    While (Number_of_subtrees >= 1)
    {
        Zero-Skew Merge();
        Compute_Admittance();
        Compute_Voltage();
          If (Delay > Delay_constraint)
          {
              Equalize_Load();
              Insert_Buffers();
          }
   Level_number ++;
   Number_of_subtrees = Number_of_subtrees/2;
    }
Series_Termination();
}
```

The routine Zero-Skew\_Merge finds a tapping point which minimizes the skew for two given sub-trees. The routines Compute\_Admittance, Compute\_Voltage calculate the input-admittance and the voltage at the root of the new sub-tree formed as a result of the Zero-Skew\_Merge procedure. To meet delay specifications for the clock-tree, buffers are inserted at the root of sub-trees at appropriate levels by the Insert\_Buffers routine. This routine calculates the size of the buffer for the specified damping criterion. The Equalize\_Load routine calculates the appropriate length and width of the wire feeding the root of a sub-tree so that the load of the buffers inserted at a level are nearly equal. The recursive application of these routines on a set of pins results in the desired zero-skew clock-tree. The termination at the root of the tree is calculated by the *Series\_Termination* routine. In the case where the number of pins is not a power of two, this procedure is applied recursively.

### 8 Experimental Results

The procedure described above was tested on a set of examples which portray a typical MCM routing scenario. A substrate with area 10 cm x 10 cm was assumed. The distribution of the clock pins and the loading capacitances at each pin was generated randomly. The number of clock pins varies from 8 to 128 in the examples MCM-1 through MCM-5. The examples were tested by constructing the clock-trees using MCM-L process parameters obtained from [7]  $(\alpha = 0.24\Omega/cm, \beta = 7.2nH/cm, \gamma = 0.76pF/cm)$ . The quarter wavelength for a line with these parameters is about 2 meters, and therefore our assumption about electrically short lines is a valid one.

A constant width of 10  $\mu$ m and a Manhattan geometry is assumed for the routing. The clock-trees constructed, were studied with SPICE to verify the accuracy of the routing procedure. The wires were represented by multiple RLC segments to model the distributed nature of the lines. Some of the SPICE simulations are presented in the figures in following section.

#### 8.1 Clock Tree Construction

The characteristic parameters of the clock-trees constructed using our approach are presented in Table 1, Table 2, corresponding to termination conditions  $\zeta = 1.0$ ,  $\zeta = \frac{1}{\sqrt{2}}$ , respectively. These clock-trees have a driver at the root and no additional buffers inserted. The Figures 7, 8 show the undamped waveforms at specific clock pins in the case of a large (low resistance) driver. Figures 9, 11 show the critically damped waveforms for the three examples. The overshoot controlled case ( $\zeta = \frac{1}{\sqrt{2}}$ ) is shown in Figures 10, 12 for MCM-1, MCM-5 respectively. The study of the effect of damping on the waveforms is presented in the next section. The reduction in the driver resistance ( $R_b$ ) and the corresponding improvement in rise-time  $(t_r, 10\% V_{dd}$  to 90%  $V_{dd})$  are as tabulated. The clock-skews were measured at the 50% point of the step response from the SPICE simulation of the waveforms at the sinks of the trees. The examples exhibit a low skew values (as measured with SPICE) under both critically damped and under-damped conditions. The maximum observed skew is less than 1% of a clock-period of 10nS (100 MHz).

#### 8.2 Damping of Clock Trees

The damping of the clock-tree was performed with series termination at the source end. The effect of a range of damping conditions ( $\zeta = 0.5$  to 1.0) was studied. Figure 13 shows the step response at a particular sink node of MCM-2 for the range of  $\zeta$ . Table 3 summarizes the study of the variation of rise-time  $(t_r)$ , delay  $(t_d, 50\% V_{dd})$ , peak overshoot  $(V_{peak})$ , driver resistance  $R_b$ ) with the damping factor ( $\zeta$ ). The variation of these parameters are depicted in Figure 14. The figure also shows the strong dependence of  $t_r$  on  $\zeta$  as compared to  $t_d$ . A significant improvement in delay and rise-time is observed under under-damped conditions.



Figure 7: Undamped waveforms

Figure 8: Undamped waveforms

| Example |       | $\zeta = 1.0$ |           |           |        |  |
|---------|-------|---------------|-----------|-----------|--------|--|
| Name    | Sinks | $R_b(\Omega)$ | $t_d(nS)$ | $t_r(nS)$ | % Skew |  |
| MCM-1   | 8     | 4.326         | 0.9156    | 1.028     | 0.0128 |  |
| MCM-2   | 16    | 4.042         | 1.402     | 1.832     | 0.0627 |  |
| MCM-3   | 32    | 3.283         | 1.296     | 1.674     | 0.0052 |  |
| MCM-4   | 64    | 1.647         | 1.902     | 2.842     | 0.3218 |  |
| MCM-5   | 128   | 1.390         | 2.419     | 3.874     | 0.1624 |  |

Table 1: Termination with critical damping

Table 2: Termination with overshoot of 4%

| Example |       | $\zeta = \frac{1}{\sqrt{2}}$ |           |           |        |  |
|---------|-------|------------------------------|-----------|-----------|--------|--|
| Name    | Sinks | $R_b(\Omega)$                | $t_d(nS)$ | $t_r(nS)$ | % Skew |  |
| MCM-1   | 8     | 3.022                        | 0.823     | 0.817     | 0.0261 |  |
| MCM-2   | 16    | 2.816                        | 1.273     | 1.202     | 0.1048 |  |
| MCM-3   | 32    | 2.281                        | 1.176     | 1.169     | 0.2559 |  |
| MCM-4   | 64    | 1.145                        | 1.693     | 1.767     | 0.3684 |  |
| MCM-5   | 128   | 0.965                        | 2.125     | 2.449     | 0.1898 |  |

Table 3: Termination of MCM-2 under different damping conditions

|         |                 |                          |                | 1 0                |
|---------|-----------------|--------------------------|----------------|--------------------|
| $\zeta$ | $Delay t_d(nS)$ | $Rise-time \ t_r \ (nS)$ | $R_b~(\Omega)$ | $V_{peak}$ (Volts) |
| 0.5     | 1.193           | 0.950                    | 2.001          | 1.153              |
| 0.6     | 1.233           | 1.033                    | 2.408          | 1.090              |
| 0.7     | 1.273           | 1.202                    | 2.816          | 1.042              |
| 0.8     | 1.311           | 1.401                    | 3.225          | 1.014              |
| 0.9     | 1.358           | 1.612                    | 3.633          | 1.009              |
| 1.0     | 1.402           | 1.832                    | 4.042          | 1.000              |









Figure 9: Critically damped waveforms

Figure 10: Over-shoot controlled waveforms



MCM-5 v(1) v(128) 1.05 1.00 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 0.45 0.40 0.35 0.30 0.25 0.20 0.15 0.10 0.05 0.00 -0.05 10-9 0.00 5.00 10.00 15.00

Figure 11: Critically damped waveforms



Figure 12: Over-shoot controlled waveforms



Figure 13: Response with  $\zeta = 0.5$  to 1.0

Figure 14: Variation of  $t_d$ ,  $t_r$ ,  $V_{peak}$  with  $\zeta$ 

#### 8.3 Buffer Insertion

The effect of improvement in delay and slew rate was studied by inserting buffers as described in Section 6. The effect of level at which the buffers are inserted is shown in Figure 15 for the example MCM-3. The sharpest slew rate is observed for the level closest to the leaf nodes and as the level of insertion is moved toward the root the slew rate and delay become worse. There is a trade off between the delay characteristics and the number of buffers required. For example lowest level in this case requires 16 buffers where as the highest level requires only two. A choice of one or more buffer levels for a tree could be made depending on the timing constraints and criterion such as buffer availability.



Figure 15: Rise-time with buffer stages

## 8.4 $Pentium^{TM}$ based MCM - Study of an "industrial" example



Figure 16: Layout of MCM-Pentium



The design of the clock-distribution for a MCM with a  $Pentium^{TM}$  Central processing unit (CPU) and a 512K byte secondary cache was presented in [14]. The MCM consists of a  $Pentium^{TM}$  CPU chip, a cache controller chip (CC), and 18 static RAM (SRAM) chips. The module has dimensions of 7.52 cm x 3.68 cm. The general placement of the dies is as shown in Figure 16. Assumptions about the exact pin locations and loading capacitances have been made since these data were not available in [14]. The module has a supply voltage rated at 3.3 V. The clock distribution network specifications are as specified in [14]. Figure 17 symbolically illustrates the process of clock-tree construction for this example, however it should be noted that this not the final physical routing.



Figure 18: Critically damped waveforms

Figure 19: Overshoot controlled waveforms

 Table 4: Summary of MCM-Pentium under different damping conditions

| $\zeta$ | Skew (pS) | Propagation Delay(nS) | $Rise-time t_r (nS)$ | $R_b (\Omega)$ | $V_{peak}$ (Volts) |
|---------|-----------|-----------------------|----------------------|----------------|--------------------|
| 1.0     | 44.444    | 0.6829                | 1.071                | 20.984         | 3.333              |
| 0.6     | 30.031    | 0.4302                | 0.9259               | 12.560         | 3.452              |

| Table 5: Execution times |       |       |       |       |       |             |  |
|--------------------------|-------|-------|-------|-------|-------|-------------|--|
| Example                  | MCM-1 | MCM-2 | MCM-3 | MCM-4 | MCM-5 | MCM-Pentium |  |
| Time (Sec)               | 0.105 | 0.171 | 0.199 | 0.386 | 0.378 | 0.183       |  |

The SPICE simulations are shown in Figures 18, and 19 for critical, under-damped conditions respectively. These figures depict the clock-signal at the output of the clock-tree driver, and the waveforms at the clock-pins of various dies of the MCM. It may be noted at this resolution the skew at the clock-pins is indistinguishable for both cases. The details of these characteristics are shown in Table 4. The skew is the maximum difference in delay to clock pins of the 20 dies. The parameters skew, propagation delay, transition time, were measured as per their definitions described in section 8.1. The simulations show that the specifications are met with sufficient margin. It may be pointed out that the design of the clock-tree has been dealt in [14] by a procedure involving repeated SPICE simulation and analysis.

The execution times for the construction of clock-trees of examples studied here are tabulated in Table 5.

# 9 Conclusion

This work considers the design of clock-tree for MCM environment. A distributed-parameter AWE-based technique was used to estimate the response of clock lines for the construction of zero-skew trees. A second-order approximation was found to provide sufficient accuracy for selecting a series-termination for the clock-tree. The series termination could be selected so that the responses at all sinks were critically damped. An alternative to using critical damping, the idea of designing the clock-tree with controlled overshoot with the aim of improving slew rates, was also investigated. The procedure also incorporates the introduction of buffered clock trees to meet constraints on the slew rate of the clock. The algorithms presented here are computationally very efficient, and the experimental results exhibit low values of skew for clock-trees constructed and the damping of sink waveforms was achieved as desired.

## References

- J. R. Brews, "Overshoot-Controlled RLC Interconnections," *IEEE Transactions on Elec*tron Devices, vol. 38, pp. 76-81, January 1991.
- [2] J. B. Burr, J. R. Burnham, and A. M. Peterson, "System-wide Energy Optimization in the MCM Environment," *Proceedings of the IEEE Multi-Chip Conference*, pp. 66-83, 1991.
- [3] T. H. Chao, Y. C. Hsu and J. M. Ho, "Zero-skew Clock Net Routing," ACM/IEEE Design Automation Conference, pp. 518-523, 1992.
- [4] J. Chung and C-K. Cheng, "Skew Sensitivity Minimization of Buffered Clock Tree," Proceeding of IEEE/ACM International Conference on Computer-Aided Design, pp. 280-283, 1994.
- [5] M. Edahiro, "A Clustering-based Optimization Algorithm in Zero-Skew Routing," ACM/IEEE Design Automation Conference, pp. 612-616, 1993.
- [6] M. Edahiro, "Delay Minimization for Zero-Skew Routing," IEEE/ACM International Conference on Computer-Aided Design, pp. 563-566, 1993.
- [7] R. C. Frye, "Physical Scaling and Interconnection Delays in Multichip Modules," *IEEE Transactions on Components, Packaging, and Manufacturing Technology*, vol. 17, pp. 30-37, February 1994.
- [8] Y.-M. Li and M. A. Jabri, "A Zero-Skew Clock Routing Scheme for VLSI Circuits," IEEE/ACM International Conference on Computer-Aided Design, pp. 458-463, 1992.
- [9] A. B. Kang and C.-W. A. Tsao, "Low-Cost Single-Layer Clock Trees with Exact Zero Elmore Delay Skew," *IEEE/ACM International Conference on Computer-Aided Design*, pp. 213-218, 1994.
- [10] B. Krauter, R. Gupta, J. Willis and L. T. Pileggi, "Transmission Line Synthesis," Proceedings of the ACM/IEEE Design Automation Conference, pp. 358-363, 1995.

- [11] B. C. Kuo, Automatic Control Systems, 6th edition, Prentice Hall, Englewood Cliffs, NJ, 1991.
- [12] L. T. Pillage and R. A. Rohrer, "Asymptotic Waveform Evaluation for Timing Analysis," *IEEE Transactions on Computer-Aided Design*, vol. 9, pp. 352-365, 1990.
- [13] S. Pullela, N. Menezes, J. Omar, and L. T. Pillage, "Skew and Delay Optimization for Reliable Buffered Clock Trees," *IEEE/ACM International Conference on Computer-Aided Design*, pp. 556-562, 1993.
- [14] R. M. Reinschmidt and D. H. Leuthold, "Clocking Considerations for a Pentium-based CPU Module with 512K Byte Secondary Cache," In *Proceedings of IEEE Multi-Chip Module Conference*, pp. 26-30, 1994.
- [15] M. Sriram and S. M. Kang, Physical Design Of Multi-Chip Modules, Kluwer Academic Publishers, Boston, MA, 1994.
- [16] G. E. Téllez and M. Sarrafzadeh, "Clock Period Constrained Minimal Buffer Insertion in Clock Trees," *IEEE/ACM International Conference on Computer-Aided Design*, pp. 219-223, 1994.
- [17] R.-S. Tsay, "An Exact Zero Skew Clock Routing Algorithm," *IEEE Transactions on Computer-Aided Design of Integrated Circuit and Systems*, vol. 12, pp. 242-249, Feb. 1993.
- [18] J. S. H. Wang and W. W-M. Dai, "Optimal Design of Self Damped Transmission Lines for Multi-chip Modules," Proceedings of IEEE/ACM International Conference on Computer-Aided Design, pp. 594-598, 1994.
- [19] D. Zhou, F. Tsui, and J. S. Cong, "A Distributed RLC model for MCM layout," Proceedings of IEEE Multi-Chip Module Conference, pp. 191-197, 1993.
- [20] Q. Zhu, W. W.-M. Dai, and J. G. Xi, "Optimal Sizing of High-Speed Clock Networks Based on Distributed RC and Lossy Transmission Line Models," Proceedings of IEEE/ACM International Conference on Computer-Aided Design, pp. 628-633, 1993.

[21] J. G. Xi, and W. W.-M. Dai, "Buffer Insertion and Sizing Under Process Variations for Low Power Clock Distribution," 32nd ACM/IEEE Design Automation Conference, 1995.