## The BubbleWrap Many-Core: Popping Cores for Sequential Acceleration

#### Ulya Karpuzcu, Brian Greskamp, Josep Torrellas

University of Illinois

http://iacoma.cs.uiuc.edu/







### **Ideal Scaling**

- Dynamic (switching) power density remains constant

$P_{DYN} = \#\text{Devices} \times \text{Frequency of switching} \times \text{Energy per switching}$, with Energy per switching $\propto V_{dd}^2$

### **Practical Scaling**

- Vdd has been scaling down more slowly than ideal
- Historically: Higher Vdd = Higher performance
- Recently: Vth scaling has practically stagnated to control static power

### **Dynamic Power Density is increasing**
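The scaling argument can be checked with a small numeric sketch. It assumes textbook constant-field scaling per generation (device density grows by k², switching frequency by k, capacitance shrinks by 1/k); these rules, the function name, and the value k = 1.4 are illustrative assumptions, not taken from the slides.

```python
def dynamic_power_density_ratio(k: float, vdd_scale: float) -> float:
    """Per-generation change in dynamic power density.

    Assumed scaling rules (textbook constant-field scaling, not from the slides):
      device density       x k^2
      switching frequency  x k
      capacitance          x 1/k
    Energy per switching is modeled as C * Vdd^2, so it scales by
    (1/k) * vdd_scale^2, where vdd_scale is the factor applied to Vdd.
    """
    devices = k ** 2
    frequency = k
    energy = (1.0 / k) * vdd_scale ** 2
    return devices * frequency * energy


k = 1.4  # illustrative scaling factor per generation
ideal = dynamic_power_density_ratio(k, vdd_scale=1.0 / k)  # Vdd scales ideally
stalled = dynamic_power_density_ratio(k, vdd_scale=1.0)    # Vdd does not scale
print(f"ideal: {ideal:.2f}x, stalled Vdd: {stalled:.2f}x")
```

Under these assumptions the ideal case keeps the density ratio at 1, while a Vdd that does not scale at all lets the density grow by k² per generation.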









[Figures: **ITRS 2008**; plots not recovered]





• Exploit dormant cores to accelerate sequential sections at the cost of a shorter per-core service life



- Base: A homogeneous many-core
- Throughput Cores
  - Most energy-efficient cores
  - Run parallel sections
  - Operate at nominal V/f
- Expendable Cores
  - Run sequential sections
  - Operate at elevated V/f
  - Discarded early due to shorter service life (Popped like BubbleWrap)




### **Core Aging**

- Manifestation: Progressive slowdown of logic as the core is used
- Main contributor: Bias Temperature Instability (BTI)
  - Increases critical path delays $\propto t^{n}$, where $t$ is time and $n < 1$ is a constant
  - Aging rate: Exponential dependence on Vdd and temperature (T)
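The aging behavior listed above can be illustrated with a toy model. This is a minimal sketch, not the authors' model: the functional form (a rate that is exponential in Vdd and temperature, multiplied by time raised to a power below 1) follows the bullets, but the function name and every constant are hypothetical.

```python
import math


def relative_delay_increase(t_years, vdd, temp_c,
                            a=1e-4, b_v=4.0, b_t=0.02, n=0.2):
    """Fractional increase in critical path delay after t_years of use.

    Illustrative form only: rate * t**n with n < 1, where the rate grows
    exponentially with Vdd (volts) and temperature (deg C). The constants
    a, b_v, b_t and n are hypothetical, chosen for readability.
    """
    rate = a * math.exp(b_v * vdd) * math.exp(b_t * temp_c)
    return rate * t_years ** n


for vdd in (1.0, 1.2):
    d1 = relative_delay_increase(1.0, vdd, 80.0)
    d7 = relative_delay_increase(7.0, vdd, 80.0)
    print(f"Vdd={vdd:.1f} V: +{100 * d1:.1f}% after 1 year, "
          f"+{100 * d7:.1f}% after 7 years")
```

Because the exponent is below 1, most of the slowdown accumulates early in the core's life, and raising Vdd or temperature shifts the whole curve upward.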







[Figure: plot against time, with S<sub>NOM</sub> marked on the time axis; plot not recovered]



























- BTI-induced increase in critical path delays $\propto t^{n}$, where $t$ is time and $n < 1$ is a constant
- f<sub>NOM</sub> is set by the critical path delay $\tau_D$ at the end of the service life (S<sub>NOM</sub>)

$$f_{NOM} = 1/\tau_D$$

















- Higher Vdd: Vdd<sub>OP</sub> >> Vdd<sub>NOM</sub>
- Result: Lower critical path delay; higher aging rate
- Run at constant $f_{OP} = 1/\tau_{OP}$ until S<sub>SHORT</sub>; then discard
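The trade-off in these bullets can be sketched numerically: a higher constant Vdd lowers the fresh delay but speeds up aging, and the operating frequency is fixed by the delay reached at the end of the (short) service life. The delay model (an alpha-power law), the aging model, and all constants below are hypothetical stand-ins chosen only to make the example run.

```python
import math

# All constants are hypothetical.
VTH, ALPHA = 0.3, 1.3                 # alpha-power-law delay parameters
VDD_NOM, S_NOM = 1.0, 7.0             # nominal supply (V), service life (years)
R_NOM, B_V, N_EXP = 0.027, 4.0, 0.2   # aging rate, Vdd sensitivity, time exponent


def aged_delay(vdd, t_years):
    """Critical path delay after t_years at constant vdd.

    Normalized so that a fresh core at VDD_NOM has delay 1.0.
    """
    def raw(v):
        return v / (v - VTH) ** ALPHA

    fresh = raw(vdd) / raw(VDD_NOM)
    rate = R_NOM * math.exp(B_V * (vdd - VDD_NOM))
    return fresh * (1.0 + rate * t_years ** N_EXP)


tau_d = aged_delay(VDD_NOM, S_NOM)     # delay at the end of the nominal life
f_nom = 1.0 / tau_d

vdd_op, s_short = 1.2, 0.25            # hypothetical elevated Vdd and short life
tau_op = aged_delay(vdd_op, s_short)   # delay at the end of the short life
f_op = 1.0 / tau_op

print(f"f_OP / f_NOM = {f_op / f_nom:.2f}")
```

In this toy setup the elevated-Vdd core ends its short life with a smaller delay than the nominal core has at S<sub>NOM</sub>, so it can be clocked faster for as long as it lasts.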









Contribution: DVS for Aging Management (DVSAM)

- Change Vdd over time to compensate for critical path degradation
- Apply the minimum Vdd needed for a given f-target
- DVSAM-Pow: Turn the wasted opportunity into power efficiency
- DVSAM-Perf: Turn the wasted opportunity into higher frequency







Idea: Minimize power consumption at  $f_{NOM} = 1/\tau_D$ 













- Critical path delays are kept at $\tau_D$ until S<sub>NOM</sub>: Run at f<sub>NOM</sub>
- Start with low Vdd and increase slowly
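One way to picture the schedule described above is a loop that, at every time step, picks the lowest Vdd whose aged delay still meets $\tau_D$. The sketch below is an assumption-laden illustration, not the authors' controller: the alpha-power delay model, the power-law aging model with an equivalent-age update, the 1 mV search grid, and all constants are hypothetical.

```python
import math

# All constants below are hypothetical, chosen only to make the sketch run.
VTH, ALPHA = 0.3, 1.3                 # alpha-power-law delay parameters
VDD_NOM, S_NOM = 1.0, 7.0             # nominal supply (V), service life (years)
R_NOM, B_V, N_EXP = 0.027, 4.0, 0.2   # aging rate, Vdd sensitivity, time exponent


def fresh_delay(vdd):
    """Unaged critical path delay, normalized to 1.0 at VDD_NOM."""
    def raw(v):
        return v / (v - VTH) ** ALPHA
    return raw(vdd) / raw(VDD_NOM)


def age_step(d, vdd, dt):
    """Advance the fractional delay degradation d by dt years at supply vdd.

    Uses d = rate * t**N_EXP and restarts from the equivalent age that
    would have produced d at the current supply.
    """
    rate = R_NOM * math.exp(B_V * (vdd - VDD_NOM))
    t_eq = (d / rate) ** (1.0 / N_EXP)
    return rate * (t_eq + dt) ** N_EXP


def min_vdd_for(tau_target, d):
    """Lowest supply on a 1 mV grid whose aged delay meets tau_target."""
    for mv in range(800, 1201):
        vdd = mv / 1000.0
        if fresh_delay(vdd) * (1.0 + d) <= tau_target:
            return vdd
    raise ValueError("target delay not reachable on the Vdd grid")


def dvsam_pow_schedule(steps=70):
    """Vdd per time step that holds the delay at tau_D over S_NOM."""
    tau_d = fresh_delay(VDD_NOM) * (1.0 + R_NOM * S_NOM ** N_EXP)
    dt, d, schedule = S_NOM / steps, 0.0, []
    for _ in range(steps):
        vdd = min_vdd_for(tau_d, d)
        schedule.append(vdd)
        d = age_step(d, vdd, dt)
    return schedule


schedule = dvsam_pow_schedule()
print(f"first Vdd: {schedule[0]:.3f} V, last Vdd: {schedule[-1]:.3f} V")
```

In this toy model the schedule starts below the assumed nominal supply, never decreases, and stays at or below nominal for the whole service life, because running at a lower Vdd also slows the aging that has to be compensated.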







#### **DVSAM-Pow**



- Vdd < Vdd<sub>NOM</sub> and  $f = f_{NOM}$  throughout S<sub>NOM</sub>
- Power savings due to Vdd < Vdd<sub>NOM</sub>
  - More cores active for the same P-budget
  - Increased throughput
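The step from power savings to more active cores can be sketched with simple arithmetic. The example assumes that only dynamic power matters and that, at fixed frequency, per-core power scales with Vdd²; static power is ignored, and the 10% Vdd reduction and the function name are hypothetical.

```python
import math


def active_cores(budget_in_nominal_cores, vdd_ratio):
    """Cores that fit in a power budget when each runs at a scaled Vdd.

    budget_in_nominal_cores: power budget, in units of one core at nominal Vdd.
    vdd_ratio: Vdd / Vdd_NOM (frequency unchanged).
    Assumes per-core power scales with vdd_ratio**2 (dynamic power only).
    """
    per_core_power = vdd_ratio ** 2
    return math.floor(budget_in_nominal_cores / per_core_power)


print(active_cores(16, 1.0))   # nominal operation
print(active_cores(16, 0.9))   # hypothetical 10% lower Vdd
```

With a budget sized for 16 nominal cores, a 10% lower Vdd fits 19 cores under these assumptions, which is the mechanism behind the throughput gain.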










#### **DVSAM-Perf**

Idea: Maximize frequency for the same service life

- Shorter critical path delay $\tau_{OP}$ until S<sub>NOM</sub>: Run at higher $f = 1/\tau_{OP}$
- Start with low Vdd and increase rapidly





Idea: Aggressive DVSAM-Perf for a short service life to get even higher performance



• Even higher frequency than DVSAM-Perf for short service life




[Figure labels: Throughput Cores, Expendable Cores; diagram not recovered]

- Two choices for Throughput Cores
  - Nominal operation
  - Use DVSAM-Pow and expand the set of Throughput Cores within the same power budget
- Two choices for Expendable Cores
  - Higher, constant Vdd until S<sub>SHORT</sub>; then discard
  - DVSAM-Perf until S<sub>SHORT</sub>; then discard









- No change in the core architecture
- Need circuits to measure aging
- Need high-precision DVS
- Clock and power distribution
  - Two separate V/f domains: One for Expendable and one for Throughput Cores









### **BubbleWrap Evaluation**

- 32-core chip: $N_T = 16$ Throughput and $N_E = 16$ Expendable cores
- 22 nm high-k metal-gate process
- Multiprogrammed workload synthesized from SPEC2000
- SESC enhanced with a power and thermal model























[Figure: x-axis is Sequential Fraction (L<sub>SEQ</sub>); plot not recovered]

- Large f gains are feasible
- f increases as the sequential fraction shrinks
- For DVSAM-Perf, each Expendable core runs for $(L_{SEQ}/N_E) \times S_{NOM}$
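The per-core run time in the last bullet is easy to evaluate. One reading, offered here as an interpretation, is that sequential sections occupy a fraction L<sub>SEQ</sub> of the chip's service life S<sub>NOM</sub> and that this time is split evenly across the N<sub>E</sub> Expendable cores, which are used up one after another. The 7-year service life below is a hypothetical value; N<sub>E</sub> = 16 matches the evaluated chip.

```python
def expendable_core_runtime(l_seq, n_e, s_nom_years):
    """Time each Expendable core must last: (L_SEQ / N_E) * S_NOM.

    l_seq: fraction of execution that is sequential (0..1).
    n_e: number of Expendable cores.
    s_nom_years: nominal service life of the chip in years (hypothetical here).
    """
    return (l_seq / n_e) * s_nom_years


for l_seq in (1.0, 0.25):
    years = expendable_core_runtime(l_seq, n_e=16, s_nom_years=7.0)
    print(f"L_SEQ={l_seq}: each Expendable core runs for {years} years")
```

A smaller sequential fraction shortens the time each Expendable core has to survive, which is consistent with the bullet stating that f increases as the sequential fraction shrinks.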







- Each Expendable core has a maximum power budget equal to that of two cores
- Tolerable power cost for the frequency gains

The BubbleWrap Many-Core: Exploiting dormant cores for sequential acceleration



- Simple homogeneous design
- No architectural or software changes
- Improves sequential and parallel performance
  - Fully sequential applications at 16% higher f
  - Fully parallel applications at 30% higher throughput




