## Shared Memory Multiprocessors CS 418 Lectures 12-14

The Cache Coherence Problem
Snoopy Coherence Protocols



## A Coherent Memory System: Intuition

Reading a location should return the latest value written (by any process)

Easy in uniprocessors

- Except for I/O: coherence between I/O devices and processors
- But infrequent, so software solutions work
  - uncacheable operations, flush pages, pass I/O data through caches

Would like the same to hold when processes run on different processors

- E.g. as if the processes were interleaved on a uniprocessor

The coherence problem is more pervasive and performance-critical in multiprocessors

- has a much larger impact on hardware design
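To make the problem concrete, here is a minimal sketch of three processors with private write-back caches and no coherence mechanism at all. The `Cache` class, the location name `u`, and the values are hypothetical, chosen only for illustration.

```python
class Cache:
    """A private write-back cache with no coherence (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:       # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]          # hit: may return a stale copy

    def write(self, addr, value):
        self.lines[addr] = value         # write-back: memory is not updated yet


memory = {"u": 5}
p1, p2, p3 = Cache(memory), Cache(memory), Cache(memory)

a = p1.read("u")     # P1 caches u
b = p3.read("u")     # P3 caches u
p3.write("u", 7)     # P3 updates only its own copy
c = p1.read("u")     # P1 hits on its old copy
d = p2.read("u")     # P2 misses and fetches the old value from memory
print(a, b, c, d, p3.read("u"), memory["u"])
```

After the write, P1 and P2 still see the old value while P3 sees the new one, which is exactly the behavior the intuition above rules out.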

## Problems with the Intuition

#### Recall:

- Value returned by read should be last value written

But "last" is not well-defined!

#### Even in sequential case:

- "last" is defined in terms of program order, not time
  - Order of operations in the machine language presented to processor
  - "Subsequent" defined in analogous way, and well defined

#### In parallel case:


- Program order is defined within a process, but need to make sense of orders across processes

CS 418 5'04

#### Must define a meaningful semantics

- The answer involves both "cache coherence" and an appropriate "memory consistency model" (to be discussed in a later lecture)
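One way to picture "orders across processes" is the uniprocessor intuition from earlier: merge the two program orders into a single sequence and run it against one memory. The sketch below enumerates every such merge for two tiny hypothetical processes; the operation tuples and the names `P1` and `P2` are assumptions made for illustration, not a definition of the semantics.

```python
def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that preserves each list's own order."""
    if not p1:
        yield list(p2)
        return
    if not p2:
        yield list(p1)
        return
    for rest in interleavings(p1[1:], p2):
        yield [p1[0]] + rest
    for rest in interleavings(p1, p2[1:]):
        yield [p2[0]] + rest


def run(schedule):
    """Execute a schedule against a single memory; return the values read."""
    memory = {"x": 0}
    seen = []
    for op, addr, value in schedule:
        if op == "W":
            memory[addr] = value
        else:
            seen.append(memory[addr])
    return tuple(seen)


P1 = [("W", "x", 1), ("W", "x", 2)]          # writer, in program order
P2 = [("R", "x", None), ("R", "x", None)]    # reader, in program order

outcomes = {run(s) for s in interleavings(P1, P2)}
print(sorted(outcomes))
```

Every outcome has the second read at least as new as the first. A result such as (2, 1) never appears, because no merge that respects both program orders can produce it.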

































### Dragon Write-Back Update Protocol

4 states

- Exclusive-clean or exclusive (E): I and memory have it
- Shared clean (Sc): I, others, and maybe memory, but I'm not owner
- Shared modified (Sm): I and others but not memory, and I'm the owner
  - Sm and Sc can coexist in different caches, with only one Sm
- Modified or dirty (D): I and nobody else

#### No invalid state


- If in cache, cannot be invalid
- If not present in cache, can view as being in not-present or invalid state

#### New processor events: PrRdMiss, PrWrMiss

- Introduced to specify actions when block not present in cache

#### New bus transaction: BusUpd

- Broadcasts the single word written on the bus; updates other relevant caches
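The states and events above can be sketched as a small state machine for a single block. The slide lists only the states, processor events, and the BusUpd transaction; the transition rules in the code follow the usual textbook description of Dragon and are labeled as assumptions. The class `DragonBlock` is hypothetical, models no data values, and omits replacement.

```python
# States from the slide; None stands for "not present in this cache".
E, SC, SM, D = "E", "Sc", "Sm", "D"


class DragonBlock:
    """Tracks ONE memory block across n caches (hypothetical, no evictions)."""

    def __init__(self, n_caches):
        self.state = [None] * n_caches
        self.bus = []  # log of bus transactions, in order

    def _others(self, me):
        return [i for i, s in enumerate(self.state) if i != me and s is not None]

    def _fetch(self, me):
        """PrRdMiss / PrWrMiss: a BusRd brings the block in."""
        self.bus.append("BusRd")
        others = self._others(me)
        for i in others:
            # Assumed rule: a dirty holder (D or Sm) supplies the data and
            # remains owner as Sm; a clean holder (E or Sc) becomes Sc.
            self.state[i] = SM if self.state[i] in (D, SM) else SC
        self.state[me] = SC if others else E

    def read(self, me):
        if self.state[me] is None:
            self._fetch(me)
        # PrRd hit: no bus transaction and no state change.

    def write(self, me):
        if self.state[me] is None:
            self._fetch(me)
        others = self._others(me)
        if others:
            # Assumed rule: BusUpd updates the other copies, which become Sc;
            # the writer becomes the single owner (Sm).
            self.bus.append("BusUpd")
            for i in others:
                self.state[i] = SC
            self.state[me] = SM
        else:
            self.state[me] = D  # sole copy: write locally


block = DragonBlock(3)
block.read(0)    # only copy: E
block.read(1)    # both copies become Sc
block.write(0)   # BusUpd; cache 0 becomes owner (Sm)
block.read(2)    # owner supplies data and stays Sm
block.write(1)   # BusUpd; ownership moves to cache 1
print(block.state, block.bus)
```

Note that no copy is ever invalidated: a write by one cache leaves the other copies present and up to date, and at most one cache holds the block in Sm.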












## Impact of Block Size on Miss Rate

Results shown only for default problem size: varied behavior

- Need to examine impact of problem size and p as well (see text)




#### Software

- Improve spatial locality by better data structuring
- Compiler techniques


#### Hardware

- Retain granularity of transfer but reduce granularity of coherence
  - use subblocks: same tag but different state bits
  - one subblock may be valid but another invalid or dirty
- Reduce both granularities, but prefetch more blocks on a miss
- Proposals for adjustable cache size
- More subtle: delay propagation of invalidations and perform all at once
  - But can change consistency model: discuss later in course
- Use update instead of invalidate protocols to reduce false sharing effect
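As an illustration of the data-structuring point under Software, the sketch below counts how many cache blocks hold words written by more than one processor under two layouts of the same array. The sizes (`WORD`, `BLOCK`, `PROCS`, `ITEMS`) and the helper names are hypothetical.

```python
def block_of(addr, block_size):
    """Index of the cache block that contains a byte address."""
    return addr // block_size


def multi_writer_blocks(owner_of_addr, block_size):
    """Return the blocks that hold words written by more than one processor."""
    owners = {}
    for addr, proc in owner_of_addr.items():
        owners.setdefault(block_of(addr, block_size), set()).add(proc)
    return {b for b, procs in owners.items() if len(procs) > 1}


WORD = 8     # bytes per element (hypothetical)
BLOCK = 64   # bytes per cache block (hypothetical)
PROCS = 4    # number of processors
ITEMS = 8    # elements written by each processor

# Layout A: element i belongs to processor i % PROCS (interleaved).
interleaved = {i * WORD: i % PROCS for i in range(PROCS * ITEMS)}

# Layout B: each processor's elements are contiguous and fill whole blocks.
contiguous = {i * WORD: i // ITEMS for i in range(PROCS * ITEMS)}

print(len(multi_writer_blocks(interleaved, BLOCK)),
      len(multi_writer_blocks(contiguous, BLOCK)))
```

In the interleaved layout every block contains words from all four processors, so each write can disturb the other three caches even though no word is actually shared. Grouping each processor's data into its own blocks removes that effect without changing the block size.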

## Impact of Block Size on Traffic

Traffic affects performance indirectly through contention

[Figure: bus traffic by block size, split into address bus and data bus]

- Results different than for miss rate: traffic almost always increases
- When working sets fit, overall traffic still small, except for Radix
- Fixed overhead is significant component
  - So total traffic often minimized at 16-32 byte block, not smaller
- Working set doesn't fit: even 128-byte good for Ocean due to capacity
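The role of the fixed overhead can be seen with a little arithmetic. The sketch below assumes a simple model in which every miss moves one block plus a fixed number of overhead bytes; the model, the function names, and the value of `OVERHEAD` are hypothetical and are not taken from the measured results.

```python
def bus_traffic(misses, block_size, overhead):
    """Bytes moved on the bus: each miss carries fixed overhead plus one block."""
    return misses * (overhead + block_size)


def break_even_miss_ratio(small, large, overhead):
    """Factor by which misses may grow with the smaller block before it
    moves as many bytes as the larger block."""
    return (overhead + large) / (overhead + small)


OVERHEAD = 6  # bytes per bus transaction (hypothetical value)

for small, large in [(8, 16), (16, 32), (64, 128)]:
    ratio = break_even_miss_ratio(small, large, OVERHEAD)
    print(f"{small:>3} vs {large:>3} bytes: break-even miss ratio {ratio:.2f}")
```

With no overhead, halving the block size pays off as long as misses less than double. With a fixed overhead the break-even factor drops below 2, and it drops furthest for the smallest blocks, which is consistent with total traffic bottoming out at a moderate block size rather than the smallest one.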





