### Chapter 7: Systolic Architecture Design

Keshab K. Parhi

- Systolic architectures are designed by using linear mapping techniques on regular dependence graphs (DG).
- Regular Dependence Graph : The presence of an edge in a certain direction at any node in the DG represents presence of an edge in the same direction at all nodes in the DG.
- DG corresponds to space representation  $\rightarrow$  no time instance is assigned to any computation  $\Rightarrow$  t=0.
- Systolic architectures have a space-time representation where each node is mapped to a certain processing element(PE) and is scheduled at a particular time instance.
- Systolic design methodology maps an N-dimensional DG to a lower dimensional systolic architecture.
- Mapping of N-dimensional DG to (N-1) dimensional systolic array is considered.

• Definitions :

> Projection vector (also called iteration vector),  $d = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix}$ 

Two nodes that are displaced by d or multiples of d are executed by the same processor,

executed by the same processor. Processor space vector,  $p^T = (p_1 \ p_2)$ Any node with index  $I^T = (i,j)$  would be executed by processor;

$$p^{T}I = \begin{pmatrix} p_{1} & p_{2} \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$$

- Scheduling vector,  $s^T = (s_1 s_2)$ . Any node with index I would would be executed at time,  $s^T I$ .
- Hardware Utilization Efficiency, HUE = 1/|S<sup>T</sup>d|. This is because two tasks executed by the same processor are spaced |S<sup>T</sup>d| time units apart.

Processor space vector and projection vector must be orthogonal to each other  $\Rightarrow p^T d = 0$ .

- If A and B are mapped to the same processor, then they cannot be executed at the same time, i.e., S<sup>T</sup>I<sub>A</sub> ≠ S<sup>T</sup>I<sub>B</sub>, i.e., S<sup>T</sup>d ≠ 0.
- Edge mapping : I f an edge e exists in the space representation or DG, then an edge p<sup>T</sup>e is introduced in the systolic array with s<sup>T</sup>e delays.
- A DG can be transformed to a space-time representation by interpreting one of the spatial dimensions as temporal dimension. For a 2-D DG, the general transformation is described by i' = t = 0, j' = p<sup>T</sup>I, and t' = s<sup>T</sup>I, i.e.,

$$\begin{pmatrix} i' \\ j' \\ t' \end{pmatrix} = T \begin{pmatrix} i \\ j \\ t \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ p' & 0 \\ s' & 0 \end{pmatrix} \begin{pmatrix} i \\ j \\ t \end{pmatrix}$$

 $j' \Rightarrow$  processor axis

 $t' \Rightarrow$  scheduling time instance

#### FIR Filter Design B<sub>1</sub>(Broadcast Inputs, Move Results, Weights Stay)

 $d^{\top} = (1 \ 0), \ p^{\top} = (0 \ 1), \ s^{\top} = (1 \ 0)$ 

> Any node with index  $I^{T} = (i, j)$ 

 $\succ$  is mapped to processor  $p^T I = j$ .

 $\succ$  is executed at time s<sup>T</sup>I =i.

- Since  $s^Td=1$  we have HUE =  $1/|s^Td| = 1$ .
- Edge mapping : The 3 fundamental edges corresponding to weight, input, and result can be mapped to corresponding edges in the systolic array as per the following table:

| е            | р⊤е | s⊤e |
|--------------|-----|-----|
| wt(1 0)      | 0   | 1   |
| i/p(0 1)     | 1   | 0   |
| result(1 –1) | -1  | 1   |





Design B<sub>2</sub>(Broadcast Inputs, Move Weights, Results Stay)

 $d^{T} = (1 - 1), p^{T} = (1 1), s^{T} = (1 0)$ 

>Any node with index  $I^{T} = (i, j)$ 

≻ is mapped to processor  $p^T I = i+j$ .

 $\succ$  is executed at time s<sup>T</sup>I = i.

Since  $s^T d=1$  we have HUE =  $1/|s^T d| = 1$ .

| е            | р⊤е | s⊤e |
|--------------|-----|-----|
| wt(1 0)      | 1   | 1   |
| i/p(0 1)     | 1   | 0   |
| result(1 –1) | 0   | 1   |



Low-level implementation of B<sub>2</sub> design

• Applying space time transformation we get :

 $j' = p^{T}(i j)^{T} = i + j$  $t' = s^{T}(i j)^{T} = i$ 



#### Design F(Fan-In Results, Move Inputs, Weights Stay)

 $d^{T} = (1 \ 0), p^{T} = (0 \ 1), s^{T} = (1 \ 1)$ 

Since  $s^Td=1$  we have  $HUE = 1/|s^Td| = 1$ .

| е            | р⊤е | s <sup>⊤</sup> e |
|--------------|-----|------------------|
| wt(1 0)      | 0   | 1                |
| i/p(0 1)     | 1   | 1                |
| result(1 –1) | -1  | 0                |





# Design R<sub>1</sub>(Results Stay, Inputs and Weights Move in Opposite Direction)

$$d^{T} = (1 - 1), p^{T} = (1 - 1), s^{T} = (1 - 1)$$

Since  $s^T d=2$  we have HUE =  $1/|s^T d| = \frac{1}{2}$ .

| е            | р⊤е | s⊤e |
|--------------|-----|-----|
| wt(1 0)      | 1   | 1   |
| i/p(0 -1)    | -1  | 1   |
| result(1 –1) | 0   | 2   |





Low-level implementation of R<sub>1</sub> design

Note :  $R_1$  can be obtained from  $B_2$  by 2-slow transformation and then retiming after changing the direction of signal x.

## <u>Design R<sub>2</sub> and Dual R<sub>2</sub>(Results Stay, Inputs and Weights Move in Same Direction but at Different Speeds)</u>

$$d^{\top} = (1 - 1), p^{\top} = (1 - 1),$$

 $R_2$ :  $s^T = (2 1)$ ; Dual  $R_2$ :  $s^T = (1 2)$ ;

Since  $s^Td=1$  for both of them we have HUE =  $1/|s^Td| = 1$  for

both.

≻Edge mapping :

| R <sub>2</sub> |     | Dual R <sub>2</sub> |               |     |     |
|----------------|-----|---------------------|---------------|-----|-----|
| е              | р⊤е | s⊤e                 | е             | р⊤е | s⊤e |
| wt(1, 0)       | 1   | 2                   | wt(1, 0)      | 1   | 1   |
| i/p(0,1)       | 1   | 1                   | i/p(0,1)      | 1   | 2   |
| result(1, -1)  | 0   | 1                   | result(-1, 1) | 0   | 1   |

Note : The result edge in design dual  $R_2$  has been reversed to Guarantee  $s^T e \ge 0$ .

# Design W<sub>1</sub> (Weights Stay, Inputs and Results Move in Opposite Directions)

 $d^{T} = (1 \ 0), p^{T} = (0 \ 1), s^{T} = (2 \ 1)$ 

➤Since s<sup>T</sup>d=2 for both of them we have HUE = 1/|s<sup>T</sup>d| = ½.
➤Edge mapping :

| е            | р⊤е | s⊤e |
|--------------|-----|-----|
| wt(1 0)      | 0   | 2   |
| i/p(0 -1)    | 1   | 1   |
| result(1 –1) | -1  | 1   |

#### <u>Design W<sub>2</sub></u> and Dual W<sub>2</sub>(Weights Stay, Inputs and Results Move in Same Direction but at Different Speeds)

 $d^{T} = (1 \ 0), p^{T} = (0 \ 1),$ 

 $W_2$ :  $s^T = (1 2)$ ; Dual  $W_2$ :  $s^T = (1 - 1)$ ;

Since  $s^Td=1$  for both of them we have HUE =  $1/|s^Td| = 1$  for both.

| W2            |     |     | Dual W <sub>2</sub> |     |     |
|---------------|-----|-----|---------------------|-----|-----|
| е             | р⊤е | s⊤e | е                   | р⊤е | s⊤e |
| wt(1, 0)      | 0   | 1   | wt(1, 0)            | 0   | 1   |
| i/p(0,1)      | 1   | 2   | i/p(0,-1)           | -1  | 1   |
| result(1, -1) | 1   | 1   | result(1, -1)       | -1  | 2   |

- Relating Systolic Designs Using Transformations :
  - FIR systolic architectures obtained using the same projection vector and processor vector, but different scheduling vectors, can be derived from each other by using transformations like edge reversal, associativity, slow-down, retiming and pipelining.
- Example 1 : R<sub>1</sub> can be obtained from B<sub>2</sub> by slowdown, edge reversal and retiming.

• Example 2:



Derivation of design F from B<sub>1</sub> using cutset retiming

Selection of s<sup>T</sup> based on scheduling inequalities: For a dependence relation X  $\rightarrow$ Y, where  $I_x^T = (i_x, j_x)^T$  and  $I_y^T = (i_y, j_y)^T$  are respectively the indices of the nodes X and Y. The scheduling inequality for this dependence is given by,

$$S_y \ge S_x + T_x$$

where  $T_x$  is the computation time of node X. The scheduling equations can be classified into the following two types :

≻ Linear scheduling, where

$$S_{x} = S^{T} I_{x} = (S_{1} S_{2})(i_{x} j_{x})^{T}$$
$$S_{y} = S^{T} I_{y} = (S_{1} S_{2})(i_{y} j_{y})^{T}$$

➤ Affine Scheduling, where

$$\begin{split} \mathbf{S}_{\mathbf{x}} &= \mathbf{S}^{\mathsf{T}} \mathbf{I}_{\mathbf{x}} + \gamma_{\mathbf{x}} = (\mathbf{S}_{1} \ \mathbf{S}_{2})(\mathbf{i}_{\mathbf{x}} \ \mathbf{j}_{\mathbf{x}})^{\mathsf{T}} + \gamma_{\mathbf{x}} \\ \mathbf{S}_{\mathbf{x}} &= \mathbf{S}^{\mathsf{T}} \mathbf{I}_{\mathbf{x}} + \gamma_{\mathbf{y}} = (\mathbf{S}_{1} \ \mathbf{S}_{2})(\mathbf{i}_{\mathbf{x}} \ \mathbf{j}_{\mathbf{x}})^{\mathsf{T}} + \gamma_{\mathbf{y}} \end{split}$$

So scheduling equation for affine scheduling is as follows:

$$\boldsymbol{S}^{\top} \boldsymbol{\boldsymbol{\mathsf{I}}}_{\boldsymbol{\mathsf{x}}} + \boldsymbol{\boldsymbol{\gamma}}_{\boldsymbol{\mathsf{y}}} \geq \boldsymbol{S}^{\top} \boldsymbol{\boldsymbol{\mathsf{I}}}_{\boldsymbol{\mathsf{x}}} + \boldsymbol{\boldsymbol{\gamma}}_{\boldsymbol{\mathsf{x}}} + \boldsymbol{\boldsymbol{\mathsf{T}}}_{\boldsymbol{\mathsf{x}}}$$

Each edge of a DG leads to an inequality for selection of the scheduling vectors which consists of 2 steps.

- Capture all fundamental edges. The reduced dependence graph (RDG) is used to capture the fundamental edges and the regular iterative algorithm (RLA) description of the corresponding problem is used to construct RDGs.
- Construct the scheduling inequalities according to

$$\mathbf{S}^{\mathsf{T}} \mathbf{I}_{\mathsf{x}} + \gamma_{\mathsf{y}} \geq \mathbf{S}^{\mathsf{T}} \mathbf{I}_{\mathsf{x}} + \gamma_{\mathsf{x}} + \mathbf{T}_{\mathsf{x}}$$

and solve them for feasible  $s^{T}$ .

- RIA Description : The RIA has two forms
  - $\Rightarrow$  The RIA is in standard input RIA form if the index of the inputs are the same for all equations.
  - $\Rightarrow$  The RIA is in standard output RIA form if all the output indices are the same.
- For the FIR filtering example we have,

$$W(i+1, j) = W(i, j)$$
  

$$X(i, j+1) = X(i, j)$$
  

$$Y(i+1, j-1) = Y(i, j) + W(i+1, j-1)X(i+1, j-1)$$

The FIR filtering problem cannot be expressed in standard input RIA form. Expressing it in standard output RIA form we get,

$$W(i, j) = W(i-1, j)$$
  
X(i, j) = X(i, j-1)  
Y(i, j) = Y(i-1, j+1) + W(i, j)X(i, j)

• The reduced DG for FIR filtering is shown below.



Example :

$$\begin{split} T_{mult} &= 5, \ T_{add} = 2, \ T_{com} = 1 \\ \text{Applying the scheduling equations to the five edges of the above figure we get ;} \\ W-->Y : e &= (0 \ 0)^{\top}, \ \gamma_x - \gamma_w \geq 0 \\ X & ->X : e &= (0 \ 1)^{\top}, \ s_2 + \gamma_x - \gamma_x \geq 1 \\ W-->W: e &= (1 \ 0)^{\top}, \ s_1 + \gamma_w - \gamma_w \geq 1 \\ X & ->Y : e &= (0 \ 0)^{\top}, \ \gamma_y - \gamma_x \geq 0 \\ Y & -->Y : e &= (1 \ -1)^{\top}, \ s_1 - s_2 + \gamma_y - \gamma_y \geq 5 + 2 + 1 \\ \text{For linear scheduling } \gamma_x = \gamma_y = \gamma_w = 0. \ \text{Solving we get}, \ s_1 \geq 1, \\ s_2 \geq 1 \ \text{and} \ s_1 - s_2 \geq 8. \end{split}$$

Taking s<sup>T</sup> = (9 1), d = (1 -1) such that s<sup>T</sup>d ≠ 0 and p<sup>T</sup> = (1,1) such that p<sup>T</sup>d = 0 we get HUE = 1/8. The edge mapping is as follows :

| е            | р⊤е | s⊤e |
|--------------|-----|-----|
| wt(1 0)      | 1   | 9   |
| i/p(0 1)     | 1   | 1   |
| result(1 –1) | 0   | 8   |



Systolic architecture for the example

#### Matrix-Matrix multiplication and 2-D Systolic Array Design



- Applying scheduling inequality with  $T_{mult-add} = 1$ , and  $T_{com} = 0$  we get  $s_2 \ge 0$ ,  $s_1 \ge 0$ ,  $s_3 \ge 1$ ,  $\gamma_c \gamma_a \ge 0$ and  $\gamma_c - \gamma_b \ge 0$ . Take  $\gamma_a = \gamma_b = \gamma_c = 0$ for linear scheduling.
- Solution 1 :

$$s^{T} = (1,1,1), d^{T} = (0,0,1), p_{1} = (1,0,0),$$
  
 $p_{2} = (0,1,0), P^{T} = (p_{1},p_{2})^{T}$ 









| Sol. 1     |        | Sol. 2 |            |        |     |
|------------|--------|--------|------------|--------|-----|
| е          | р⊤е    | s⊤e    | е          | р⊤е    | s⊤e |
| a(0, 1, 0) | (0, 1) | 1      | a(0, 1, 0) | (0, 1) | 1   |
| b(1, 0, 0) | (1, 0) | 1      | b(1, 0, 0) | (1, 0) | 1   |
| C(0, 0, 1) | (0, 0) | 1      | C(0, 0, 1) | (1, 1) | 1   |