



## Bandwidth & Latency Challenges for Multi-Core Server Performance

Fayé Briggs, PhD, Chief Performance Architect, Digital Enterprise Group, Intel Corporation





## Legal Disclaimer

#### LEGAL INFORMATION:

- THIS DOCUMENT AND RELATED MATERIALS AND INFORMATION ARE PROVIDED "AS IS" WITH NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION, OR SAMPLE. INTEL ASSUMES NO RESPONSIBILITY FOR ANY ERRORS CONTAINED IN THIS DOCUMENT AND HAS NO LIABILITIES OR OBLIGATIONS FOR ANY DAMAGES ARISING FROM OR IN CONNECTION WITH THE USE OF THIS DOCUMENT.
- Performance & functionality will vary depending on (i) the specific hardware & software you use & (ii) the feature enabling/system configuration by your system vendor. See www.intel.com/ for information on Intel Technology or consult your system vendor for more information.
- All dates provided are subject to change without notice.
- Intel, Pentium, Xeon, Itanium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States & other countries.
- \*Other names & brands may be claimed as the property of others.
- Copyright © 2005, Intel Corporation







- Multi-Core Momentum
- Multi-Core Performance Challenges
- Platform Architectures & Performance
- SPEC CPU2006 Sensitivity to Bandwidth
- Bandwidth & Performance Implications of Increasing Core-Count
  - Workload Based Analysis
- Summary





### **Microprocessor Design Model**



### **OBJECTIVE: Sustained Technology Leadership**



12<sup>th</sup> EMEA Academic Forum June 12-14, Budapest Hungary




## Intel Core Micro-architecture Five Key Innovations

- Intel<sup>®</sup> Wide Dynamic Execution
  - 4-wide issue and retire
  - macrofusion
- Intel<sup>®</sup> Advanced Digital Media Boost
  - 128-bit wide SSE
- Intel<sup>®</sup> Intelligent Power Capability
  - clock gating
  - split buses
- Intel<sup>®</sup> Smart Memory Access
  - enhanced prefetch
  - memory disambiguation
- Intel<sup>®</sup> Advanced Smart Cache
  - large shared cache

## **Multi-Core Products**



More multi-core products expected in the future





### Performance Challenges in Multi-Core Platforms

- Extracting thread level parallelism in most workloads
  - How much?
- Ability to generate code with lots of threads & performance scaling
  - New tools available
- Power limitations
- Platform latencies (idle and loaded)
- On-chip interconnect/cache infrastructure
  - Adequate on-die bandwidths & reduced miss rates
- Memory and I/O bandwidth required






### **Performance Scaling**



#### Parallel software key to Multi-core success







## **DRAM Timing Improvements**

Improvement rate of DRAM core timings from DDR-200 to DDR3-1600

(logarithmic trends based on specification data)



## Increasing gap between DRAM data clock cycle time & memory constraint timings







#### 2S OLTP Average Last Level Cache (LLC) Miss Latency



#### Latency reduction continues but approaching a lower bound






#### Bandwidth Drivers – increased parallelism

Normalized Performance vs. initial Intel® Pentium® 4 Processor



#### Greater parallelism drives abrupt increase in BW requirements







## Memory BW Gap



Buses have become wider to deliver the necessary memory BW (10 to 30 GB/sec)

Yet memory BW is still not enough

Many-core systems will demand 100 GB/sec of memory BW


#### How do you feed the beast?





## **IO** Pins and Power



State of the art: 100 GB/sec ≈ 1 Tb/sec = 1,000 Gb/sec

- Power: 1,000 Gb/sec × 25 mW/Gb/sec = 25 Watts
- Bus width: 1,000/5 = 200, about 400 pins (differential)
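The estimate above can be reproduced with a short calculation. This is a minimal sketch, not the speaker's own tooling: the function name `io_budget` is hypothetical, and the 5 Gb/sec per-lane signaling rate is an assumption inferred from the "1,000/5" term on the slide. The slide also rounds 100 GB/sec (800 Gb/sec) up to 1,000 Gb/sec, and the sketch keeps that rounding.

```python
# Back-of-the-envelope I/O power and pin estimate using the slide's round numbers.
# ASSUMPTION: the divisor 5 on the slide is a per-lane rate of 5 Gb/sec.

def io_budget(bandwidth_gbit_s, mw_per_gbit_s, lane_rate_gbit_s, pins_per_lane=2):
    """Return (power in watts, lane count, signal pin count).

    pins_per_lane=2 models differential signaling (two pins per lane).
    """
    power_w = bandwidth_gbit_s * mw_per_gbit_s / 1000.0  # mW -> W
    lanes = bandwidth_gbit_s / lane_rate_gbit_s
    pins = lanes * pins_per_lane
    return power_w, lanes, pins


power_w, lanes, pins = io_budget(1000, 25, 5)
print(f"{power_w:.0f} W, {lanes:.0f} lanes, {pins:.0f} signal pins")
```

Changing the energy cost per bit or the per-lane rate shows how the two constraints trade off: a faster lane cuts the pin count, but it does not reduce power unless the mW per Gb/sec figure also falls.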

Too many signal pins, too much power









## **2S Platform Architecture**

#### Supports Quad-Core Intel® Xeon® Processors



Enhanced Platform Capabilities Deliver Required Bandwidth for Quad-Core Performance Leadership











#### **2S Clovertown (QC) Platform Performance**

Comparison on a range of workloads



#### FSB-1333 enables Quad-Core Clovertown to deliver excellent gains on Multithreaded workloads

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm. Copyright © 2006, Intel Corporation. \*Other names and brands may be claimed as the property of others.








#### 4S Caneland Platform Overview Quad-Core Intel® Xeon® Processor



#### Enhanced Platform Capabilities Deliver Required Bandwidth for Quad-Core Performance Leadership









## 2S SPEC CPU2006 – Core Sensitivity



- Most BW Sensitive SIR2006 (SPECint_rate2006) Components
  - Xalancbmk, gcc, mcf, omnetpp, libquantum
- BW Sensitive SFR2006 (SPECfp_rate2006) Components Scaling
  - CactusADM, soplex, wrf, sphinx3, GemsFDTD, leslie3d, milc, bwaves

#### Dual-Core to Quad-Core Scaling demands adequate bandwidth







## 2S SPEC CPU2006 – FSB Sensitivity\*



#### \*1600 FSB & CPU frequencies are from a lab experiment environment

- Most Bandwidth Sensitive SIR2006 Components
  - Xalancbmk, gcc, mcf, omnetpp, libquantum
- Most Bandwidth Sensitive SFR2006 Components
  - CactusADM, soplex, wrf, sphinx3, GemsFDTD, leslie3d, milc, bwaves

#### Increasing FSB bandwidth improves performance













#### Multi-core\* based on Traditional & Simple cores

#### Assume Area (1 Large core) = Area (4 Small cores)



8 interconnected computing nodes, z MB Cache blocks, one per node, for 8z MB total on-die LLC

\*This is hypothetical, with no product plans



- Large-core-MC (LcMC): 8 Large cores
- Small-core-MC (ScMC): 8 × 4 = 32 Small cores

#### **1Socket Platform Configuration**



- Rest of uncore not shown
- 130W socket power envelope

- Assume no on-die bottlenecks
  - All queues, trackers etc will be sufficiently sized
  - Adequate on-die interconnect & cache BW, etc.






#### **OLTP Performance: Unconstrained vs. Constrained**



 ScMC-4T constrained to the same platform (pin-count, power & memory-size) as LcMC has ~1.6X the performance of LcMC.

With memory channels near saturation







#### **2S OLTP Performance Sensitivity to LLC Miss Latency**



- As core clock speed increases, the impact of latency on performance increases, but latency-hiding techniques can reduce that impact.
- By 2012, 1.4 ns of latency could have ~1% performance impact. Note that on-die interconnect throughput can increase to over 4 Terabytes/sec.









### Workload-Based Platform Performance

- Execution Time is the product of
  - Path Length
  - Cycles Per Instruction (CPI)
  - Cycle Time
- CPI is the sum of
  - Infinite-cache core CPI
  - Miss rate (misses per instruction) × effective (*loaded*) memory latency
- Effective (loaded) memory latency is the sum of
  - Idle latency
  - Queuing latency: driven by bandwidth
- Bad (good) news is that performance does not scale up (down) linearly with frequency
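The decomposition above can be written as a small model. This is a minimal sketch: the function name and all input values (path length, core CPI, misses per instruction, latencies) are illustrative assumptions, not measurements from the talk. It also assumes memory latency in nanoseconds stays fixed as core frequency changes, which is what makes performance scale sub-linearly with frequency.

```python
# Execution time = path length x CPI x cycle time, with
# CPI = infinite-cache core CPI + misses per instruction x loaded latency (in cycles).
# ASSUMPTION: every number used below is illustrative, not taken from the slides.

def execution_time_s(path_length, core_cpi, mpi,
                     idle_latency_ns, queuing_latency_ns, freq_ghz):
    """Return execution time in seconds for one thread."""
    loaded_latency_ns = idle_latency_ns + queuing_latency_ns
    # A fixed latency in ns costs more core cycles at a higher clock.
    miss_penalty_cycles = loaded_latency_ns * freq_ghz
    cpi = core_cpi + mpi * miss_penalty_cycles
    cycle_time_s = 1e-9 / freq_ghz
    return path_length * cpi * cycle_time_s


base = dict(path_length=1e9, core_cpi=1.0, mpi=0.005,
            idle_latency_ns=80.0, queuing_latency_ns=20.0)
t_2ghz = execution_time_s(freq_ghz=2.0, **base)
t_4ghz = execution_time_s(freq_ghz=4.0, **base)
print(f"2 GHz: {t_2ghz:.2f} s, 4 GHz: {t_4ghz:.2f} s, "
      f"speedup {t_2ghz / t_4ghz:.2f}x")
```

With these assumed inputs, doubling the frequency gives a speedup of roughly 1.33x rather than 2x, because the memory term of CPI doubles in cycles while the core term stays constant. The same effect works in reverse: halving the frequency slows the thread down by less than 2x.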

## Three major components of performance drivers: core-cpi, latencies & bandwidths
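To illustrate why queuing latency is driven by bandwidth, the sketch below uses a textbook M/M/1 queue. This is a swapped-in approximation chosen for simplicity, not the model used in the talk, and the function name and the service time and idle latency values are hypothetical.

```python
# Loaded latency = idle latency + queuing latency.
# ASSUMPTION: queuing delay follows an M/M/1 queue, W_q = S * rho / (1 - rho),
# where S is the per-request service time and rho is channel utilization.

def loaded_latency_ns(idle_latency_ns, service_time_ns, utilization):
    """Return loaded memory latency in ns at a given channel utilization."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    queuing_ns = service_time_ns * utilization / (1.0 - utilization)
    return idle_latency_ns + queuing_ns


for rho in (0.0, 0.5, 0.8, 0.9, 0.95):
    latency = loaded_latency_ns(idle_latency_ns=80.0, service_time_ns=10.0,
                                utilization=rho)
    print(f"utilization {rho:.2f}: loaded latency {latency:.0f} ns")
```

Under this approximation the queuing term is small at moderate utilization and grows sharply as the channel approaches saturation, which is why adding bandwidth can lower effective latency even when idle latency is unchanged.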






### SPECjbb2005 Misses Per Instruction



SPECjbb2005 has very little sharing between threads. The shared code footprint is small (128KB) and most of LLC is partitioned between non-overlapping data sets per thread.

MPI sensitivity to cache size is very high up to 2-3MB. A segment of compulsory misses persists even for very large caches. The performance effect of these compulsory misses is damaging at high latencies.



### Effect of Cache Size and Miss Latency on SPECjbb2005 Performance



Single thread performance assuming identical core CPI and frequency.







### Effect of Cache Size and Miss Latency on SPECjbb2005 Bandwidth Demand



Single thread demand assuming identical core CPI and frequency.
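A rough way to see how cache size and miss latency both shape per-thread bandwidth demand is sketched below. Cache size enters through misses per instruction (MPI), and miss latency enters through CPI. The function name, the 64-byte line size, and all numeric inputs are assumptions for illustration, and write-back traffic is ignored.

```python
# Per-thread memory bandwidth demand = instructions/sec x MPI x line size.
# ASSUMPTIONS: 64-byte cache lines, read misses only (no write-backs),
# illustrative CPI/MPI/latency values that are not taken from the slides.

def bandwidth_demand_gb_s(core_cpi, mpi, latency_ns, freq_ghz, line_bytes=64):
    """Return demand in GB/sec for a single thread."""
    cpi = core_cpi + mpi * latency_ns * freq_ghz  # latency converted to cycles
    instr_per_ns = freq_ghz / cpi
    return instr_per_ns * mpi * line_bytes        # bytes/ns equals GB/sec


slow = bandwidth_demand_gb_s(core_cpi=1.0, mpi=0.005, latency_ns=100.0, freq_ghz=2.0)
fast = bandwidth_demand_gb_s(core_cpi=1.0, mpi=0.005, latency_ns=50.0, freq_ghz=2.0)
big_cache = bandwidth_demand_gb_s(core_cpi=1.0, mpi=0.002, latency_ns=100.0, freq_ghz=2.0)
print(f"100 ns: {slow:.2f} GB/s, 50 ns: {fast:.2f} GB/s, "
      f"lower MPI: {big_cache:.2f} GB/s")
```

In this sketch, lowering latency raises bandwidth demand because the thread retires instructions, and therefore misses, more quickly. A larger cache (lower MPI) reduces demand even though the thread also runs faster.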







### SPECjbb2005 Performance Drivers



#### Assumes a single-thread per cache













### Summary

- Performance growth will be driven by more multi-core products
  - Supported by great software tools to enable better application parallelization.
- Power will continue to be a major challenge for more performance delivery
  - I/O and on-die interconnects may dominate socket power.
  - Research into power reduction techniques is critical.
- On-die interconnects must scale to support the BW growth with minimal latency.
  - Latency sensitivity is critical, as ~1.4 ns of latency will have a ~1% performance impact in 2012.
- Platform bandwidth demand will continue to grow as more cores are added to the platform.
  - Many multi-threaded workloads demand higher bandwidth with multicores
  - May need to increase socket pin counts to mitigate slow BW/pin growth.






## Acknowledgements

Shameem Akhter, Shekhar Borkar, Bruce Christenson, Kathy Debnath, Ravi Iyer, Adrian Moga, Chitra Natarajan





