Computer Architecture Lecture 3: Memory Hierarchy Design

Computer Architecture
Lecture 3: Memory Hierarchy Design (Chapter 2, Appendix B)
Chih‐Wei Liu 劉志尉
National Chiao Tung University
cwliu@twins.ee.nctu.edu.tw
Introduction
Since 1980, CPU performance has outpaced DRAM: CPU performance grew roughly 60% per year (2x in 1.5 years) while DRAM performance grew only about 9% per year (2x in 10 years), so the processor-memory gap grew about 50% per year.
• Programmers want unlimited amounts of memory with low latency
• Fast memory technology is more expensive per bit than slower memory
• Solution: organize memory system into a hierarchy
– Entire addressable memory space available in largest, slowest memory
– Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
• Temporal and spatial locality ensures that nearly all references can be found in smaller memories
– Gives the illusion of a large, fast memory being presented to the processor
Memory Hierarchy Design
• Memory hierarchy design becomes more crucial with recent multi‐core processors:
– Aggregate peak bandwidth grows with # cores:
• Intel Core i7 can generate two references per core per clock
• Four cores and 3.2 GHz clock
– 25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s! (see the arithmetic check after this list)
• DRAM bandwidth is only 6% of this (25 GB/s)
• Requires:
– Multi‐port, pipelined caches
– Two levels of cache per core
– Shared third‐level cache on chip
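To make the arithmetic behind these numbers explicit, here is a quick back-of-the-envelope check in C, using the example parameters above (4 cores, 3.2 GHz, two 64-bit data references per core per clock plus a 128-bit instruction reference); a sketch, not a measurement:

#include <stdio.h>

/* Back-of-the-envelope check of the peak-bandwidth numbers above.
   Parameters are the slide's example values, not measurements. */
int main(void) {
    double cores = 4, clock_hz = 3.2e9;
    double data_refs = cores * clock_hz * 2;   /* two 64-bit data refs per core per clock        */
    double inst_refs = cores * clock_hz * 1;   /* one 128-bit instruction ref per core per clock */
    double bytes_per_sec = data_refs * 8 + inst_refs * 16;

    printf("data refs/s = %.1f billion\n", data_refs / 1e9);   /* 25.6  */
    printf("inst refs/s = %.1f billion\n", inst_refs / 1e9);   /* 12.8  */
    printf("peak demand = %.1f GB/s\n", bytes_per_sec / 1e9);  /* 409.6 */
    return 0;
}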
Memory Hierarchy
• Take advantage of the principle of locality to:
– Present as much memory as in the cheapest technology
– Provide access at speed offered by the fastest technology
[Figure: the memory hierarchy. The processor (control, datapath, registers) is backed by an on-chip cache, a second-level cache (SRAM), main memory (DRAM/Flash/PCM), secondary storage (disk/Flash/PCM), and tertiary storage (tape/cloud storage). Speed and size grow down the hierarchy: registers are ~1 ns and hundreds of bytes; caches tens to hundreds of ns and KBs-MBs; main memory hundreds of ns and MBs; secondary storage tens of ms and GBs; tertiary storage tens of seconds and TBs.]
Multi‐core Architecture
[Figure: a multi-core system as a set of processing nodes, each containing a CPU with its own local memory hierarchy (of an optimally chosen fixed size), connected through an interconnection network.]
The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
• Hardware relies on locality for speed
Memory Hierarchy Basics
• When a word is not found in the cache, a miss occurs:
– Fetch word from lower level in hierarchy, requiring a higher latency reference
– Lower level may be another cache or the main memory
– Also fetch the other words contained within the block
• Takes advantage of spatial locality
– Place block into cache in any location within its set, determined by address
• block address MOD number of sets
Hit and Miss
• Hit: the data appears in some block in the upper level (e.g., Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: Time to access the upper level which consists of
RAM access time + Time to determine hit/miss
• Miss: the data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 − (Hit Rate)
– Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on 21264!)
[Figure: the processor reads from and writes to the upper-level memory (which holds Blk X); on a miss, Blk Y is brought in from the lower-level memory.]
Cache Performance Formulas
(Average memory access time) =
(Hit time) + (Miss rate)×(Miss penalty)
T_acc = T_hit + f_miss × T_+miss
• The times T_acc, T_hit, and T_+miss can all be either:
– Real time (e.g., nanoseconds), or
– A number of clock cycles
• In contexts where the cycle time is known to be a constant
• Important:
– T_+miss means the extra (not total) time for a miss
• in addition to T_hit, which is incurred by all accesses
[Figure: the CPU pays the hit time to access the cache; on a miss it additionally pays the miss penalty to reach the lower levels of the hierarchy.]
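A small sketch of the AMAT formula above as code; the example values in the comment are hypothetical:

/* Average memory access time: T_acc = T_hit + f_miss * T_+miss, where the
   miss penalty is the *extra* time on a miss, on top of the hit time that
   every access pays. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Hypothetical example: amat(1.0, 0.04, 110.0) == 5.4 clock cycles. */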
Four Questions for Memory Hierarchy
• Consider any level in a memory hierarchy.
– Remember a block is the unit of data transfer.
• Between the given level, and the levels below it
• The level design is described by four behaviors:
– Block Placement:
• Where could a new block be placed in the level?
– Block Identification:
• How is a block found if it is in the level?
– Block Replacement:
• Which existing block should be replaced if necessary?
– Write Strategy:
• How are writes to the block handled?
Q1: Where can a block be placed in the upper level?
• Block 12 placed in an 8-block cache (worked through in the sketch after the figure):
– Fully associative, direct mapped, or 2-way set associative
– Set-associative mapping: block number modulo the number of sets
[Figure: placing memory block 12 (memory holds blocks 0-31) in an 8-block cache. Fully associative ("full mapped"): block 12 may go into any of the 8 cache blocks. Direct mapped: it may go only into cache block (12 mod 8) = 4. 2-way set associative: it may go into either block of set (12 mod 4) = 0.]
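A tiny sketch that works through the placement arithmetic of the example above for the three organizations:

#include <stdio.h>

/* Where can memory block 12 go in an 8-block cache? */
int main(void) {
    int block = 12, cache_blocks = 8;

    /* Direct mapped: exactly one candidate frame. */
    printf("direct mapped -> frame %d\n", block % cache_blocks);     /* 4 */

    /* 2-way set associative: 8 blocks / 2 ways = 4 sets. */
    int sets = cache_blocks / 2;
    printf("2-way assoc   -> set %d (either way)\n", block % sets);  /* 0 */

    /* Fully associative: any of the 8 frames may be used. */
    printf("fully assoc   -> any of %d frames\n", cache_blocks);
    return 0;
}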
Q2: How is a block found if it is in the upper level?
• The block address is divided into a Tag and an Index; the remaining low-order bits are the Block offset
• Index: used to look up candidates
– The index identifies the set in the cache (set select)
• Tag: used to identify the actual copy
– If no candidates match, declare a cache miss
• The block is the minimum quantum of caching
– The block-offset (data select) field selects data within the block
– Many caching applications don't have a data select field
• Larger block size has distinct hardware advantages:
– Less tag overhead
– Exploits fast burst transfers from DRAM and over wide busses
• Disadvantages of larger block size?
– Fewer blocks → more conflicts; can waste bandwidth
Review: Direct Mapped Cache
• A direct-mapped 2^N-byte cache:
– The uppermost (32 − N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
• Example: 1 KB direct-mapped cache with 32 B blocks
– The Cache Index chooses the potential block
– The Cache Tag is checked to verify the block
– The Byte Select chooses the byte within the block
[Figure: the 32-bit address is split into Cache Tag (e.g. 0x50), Cache Index (e.g. 0x01), and Byte Select (e.g. 0x00), with field boundaries at bits 9 and 4; each of the 32 cache entries holds a valid bit, a cache tag, and a 32-byte block (Byte 0 ... Byte 1023 across the whole data array).]
Direct‐Mapped Cache Architecture
[Figure: direct-mapped cache architecture. The address is split into Tag, Frame #, and Offset; the frame number is decoded to select one row of the tag array and block-frame array, the stored tag is compared with the address tag to produce Hit, and the offset drives a mux that selects the data word from the block frame.]
Review: Set Associative Cache
• N-way set associative: N entries per Cache Index
– N direct-mapped caches operate in parallel
• Example: two-way set-associative cache
– The Cache Index selects a "set" from the cache
– The two tags in the set are compared to the input tag in parallel
– Data is selected based on the tag-comparison result
[Figure: two-way set-associative cache. The address splits into Cache Tag, Cache Index (field boundaries at bits 8 and 4), and Byte Select; the index reads one entry (valid bit, cache tag, cache data block) from each way, both tags are compared in parallel, the compare results are ORed to form Hit, and Sel1/Sel0 drive a mux that selects the cache block from the matching way.]
Review: Fully Associative Cache
• Fully associative: any cache entry can hold any block
– The address does not include a cache index
– Compare the Cache Tags of all cache entries in parallel
• Example: block size = 32 B
– We need N 27-bit comparators
– We still have a byte select to choose within the block
[Figure: fully associative cache. The address splits into a 27-bit Cache Tag (bits 31-5) and a Byte Select (bits 4-0, e.g. 0x01); every entry's stored tag is compared with the address tag by its own comparator (=), qualified by the valid bit, and the matching entry's data block supplies the bytes.]
Concluding Remarks
• Direct‐mapped cache = 1‐way set‐associative cache
• Fully associative cache: there is only 1 set
Cache Size Equation
• Simple equation for the size of a cache:
(Cache size) = (Block size) × (Number of sets) × (Set associativity)
• Can relate to the size of various address fields:
(Block size) = 2^(# of offset bits)
(Number of sets) = 2^(# of index bits)
(# of tag bits) = (# of memory address bits) − (# of index bits) − (# of offset bits)
[Figure: memory address divided into Tag, Index, and Block-offset fields.]
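Applying these equations to the earlier example (a 1 KB direct-mapped cache with 32 B blocks and 32-bit addresses), a short sketch:

#include <stdio.h>

/* Integer log2 for exact powers of two. */
static unsigned log2u(unsigned x) { unsigned b = 0; while (x >>= 1) b++; return b; }

/* Field widths for the earlier example: 1 KB direct-mapped cache,
   32 B blocks, 32-bit addresses. */
int main(void) {
    unsigned cache_size = 1024, block_size = 32, assoc = 1, addr_bits = 32;

    unsigned num_sets    = cache_size / (block_size * assoc);        /* 32 */
    unsigned offset_bits = log2u(block_size);                        /*  5 */
    unsigned index_bits  = log2u(num_sets);                          /*  5 */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;     /* 22 */

    printf("sets=%u offset=%u index=%u tag=%u\n",
           num_sets, offset_bits, index_bits, tag_bits);
    return 0;
}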
Q3: Which block should be replaced on a miss?
• Easy for direct‐mapped cache
– Only one choice
• Set associative or fully associative
– LRU (least recently used)
• Appealing, but hard to implement for high associativity
– Random
• Easy, but how well does it work?
– First in, first out (FIFO)
Q4: What happens on a write?
Policy:
– Write-through: data written to the cache block is also written to lower-level memory
– Write-back: write data only to the cache; update the lower level when the block falls out of the cache
Debug: write-through is easy; write-back is hard
Do read misses produce writes? Write-through: no; write-back: yes
Do repeated writes make it to the lower level? Write-through: yes; write-back: no
Additional option -- let writes to an un-cached address allocate a new cache line ("write-allocate").
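A minimal sketch of how the two policies differ on a store hit; the cache-line structure and the flat "lower-level memory" array below are invented for illustration, not a real cache controller:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-byte cache line; the lower level is modeled as a flat
   byte array. Illustrative only. */
typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[64]; } line_t;
static uint8_t lower_level_memory[1 << 20];

/* A store that hits in the cache, under the two policies compared above. */
static void store_byte_hit(line_t *l, uint32_t addr, uint8_t val, bool write_through) {
    l->data[addr % 64] = val;                        /* both policies update the cache line */
    if (write_through)
        lower_level_memory[addr % (1 << 20)] = val;  /* write-through: update memory now    */
    else
        l->dirty = true;                             /* write-back: mark dirty, write on eviction */
}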
Write Buffers
[Figure: processor writes go to the cache and to a write buffer that sits between the cache and lower-level memory; the buffer holds data awaiting write-through to lower-level memory.]
• Q: Why a write buffer? A: So the CPU doesn't stall on writes.
• Q: Why a buffer, why not just one register? A: Bursts of writes are common.
• Q: Are read-after-write (RAW) hazards an issue for the write buffer? A: Yes! Drain the buffer before the next read, or check the write buffer for a match on reads.
More on Cache Performance Metrics
• Can split access time into instructions & data:
Avg. mem. acc. time =
(% instruction accesses) × (inst. mem. access time) + (% data accesses) × (data mem. access time)
• Another formula from chapter 1:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) ×
cycle time
– Useful for exploring ISA changes
• Can break stalls into reads and writes:
Memory stall cycles = (Reads × read miss rate × read miss penalty) + (Writes × write miss rate × write miss penalty)
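A short sketch that evaluates these formulas; all of the counts and rates below are made-up values for illustration:

#include <stdio.h>

/* Memory stall cycles split into reads and writes, then folded into the
   Chapter 1 CPU-time formula. All counts below are hypothetical. */
int main(void) {
    double reads = 1e6, writes = 4e5;
    double read_miss_rate = 0.04, write_miss_rate = 0.02;
    double read_penalty = 100, write_penalty = 100;     /* cycles */
    double exec_cycles = 5e6, cycle_time_ns = 0.5;

    double stall_cycles = reads  * read_miss_rate  * read_penalty
                        + writes * write_miss_rate * write_penalty;
    double cpu_time_ns  = (exec_cycles + stall_cycles) * cycle_time_ns;

    printf("stall cycles = %.0f, CPU time = %.0f ns\n", stall_cycles, cpu_time_ns);
    return 0;
}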
Sources of Cache Misses
• Compulsory (cold start or process migration, first reference): first access to a block
– “Cold” fact of life: not a whole lot you can do about it
– Note: if you are going to run "billions" of instructions, compulsory misses are insignificant
• Capacity:
– The cache cannot contain all the blocks accessed by the program
– Solution: increase cache size
• Conflict (collision):
– Multiple memory locations mapped
to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
• Coherence (invalidation): another process (e.g., I/O) updates memory
Memory Hierarchy Basics
• Six basic cache optimizations:
– Larger block size
• Reduces compulsory misses
• Increases capacity and conflict misses, increases miss penalty
– Larger total cache capacity to reduce miss rate
• Increases hit time, increases power consumption
– Higher associativity
• Reduces conflict misses
• Increases hit time, increases power consumption
– Higher number of cache levels
• Reduces overall memory access time
– Giving priority to read misses over writes
• Reduces miss penalty
– Avoiding address translation in cache indexing
• Reduces hit time
1. Larger Block Sizes
• Larger block size → fewer blocks in the cache
• Obvious advantages: reduce compulsory misses
– Reason is due to spatial locality
• Obvious disadvantage
– Higher miss penalty: larger block takes longer to move
– May increase conflict misses and capacity miss if cache is small
• Don’t let increase in miss penalty outweigh the
decrease in miss rate
2. Large Caches
• Cache size ↑ → miss rate ↓, hit time ↑
• Help with both conflict and capacity misses
• May need longer hit time AND/OR higher HW cost
• Popular in off‐chip caches
3. Higher Associativity
• Reduce conflict miss
• 2:1 cache rule of thumb on miss rate
– A 2-way set-associative cache of size N/2 has about the same miss rate as a direct-mapped cache of size N (holds for cache sizes < 128 KB)
• Greater associativity comes at the cost of increased hit time
• Lengthen the clock cycle
4. Multi‐Level Caches
• 2‐level caches example
– AMAT-L1 = Hit-time-L1 + Miss-rate-L1 × Miss-penalty-L1
– AMAT-L2 = Hit-time-L1 + Miss-rate-L1 × (Hit-time-L2 + Miss-rate-L2 × Miss-penalty-L2)
• Probably the best miss‐penalty reduction method
• Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss-rate-L2)
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss-rate-L1 × Miss-rate-L2)
– The global miss rate is what matters
Multi‐Level Caches (Cont.)
• Advantages:
– Capacity misses in L1 end up with a significant penalty reduction
– Conflict misses in L1 similarly get supplied by L2
• Holding size of 1st level cache constant:
– Decreases miss penalty of 1st‐level cache.
– Or, increases average global hit time a bit:
• hit time‐L1 + miss rate‐L1 x hit time‐L2
– but decreases global miss rate
• Holding total cache size constant:
– Global miss rate, miss penalty about the same
– Decreases average global hit time significantly!
• New L1 much smaller than old L1
Miss Rate Example
• Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache
– Miss rate for the first-level cache = 40/1000 (4%)
– Local miss rate for the second-level cache = 20/40 (50%)
– Global miss rate for the second-level cache = 20/1000 (2%)
• Assume miss-penalty-L2 is 200 CC, hit-time-L2 is 10 CC, hit-time-L1 is 1 CC, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.
– AMAT = Hit-time-L1 + Miss-rate-L1 × (Hit-time-L2 + Miss-rate-L2 × Miss-penalty-L2) = 1 + 4% × (10 + 50% × 200) = 5.4 CC
– Average memory stalls per instruction = Misses-per-instruction-L1 × Hit-time-L2 + Misses-per-instruction-L2 × Miss-penalty-L2 = (40 × 1.5/1000) × 10 + (20 × 1.5/1000) × 200 = 6.6 CC
– Or (5.4 − 1.0) × 1.5 = 6.6 CC
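The same example evaluated in code, as a quick check of the 5.4 CC and 6.6 CC results:

#include <stdio.h>

/* Two-level cache example from the slide: 40 L1 misses and 20 L2 misses
   per 1000 memory references, 1.5 references per instruction. */
int main(void) {
    double refs = 1000, l1_misses = 40, l2_misses = 20, refs_per_instr = 1.5;
    double hit_l1 = 1, hit_l2 = 10, penalty_l2 = 200;   /* clock cycles */

    double mr_l1       = l1_misses / refs;              /* 0.04 */
    double mr_l2_local = l2_misses / l1_misses;         /* 0.50 */

    double amat = hit_l1 + mr_l1 * (hit_l2 + mr_l2_local * penalty_l2);
    double stalls_per_instr =
        (l1_misses * refs_per_instr / refs) * hit_l2 +
        (l2_misses * refs_per_instr / refs) * penalty_l2;

    printf("AMAT = %.1f CC, stalls/instr = %.1f CC\n", amat, stalls_per_instr);
    /* Prints: AMAT = 5.4 CC, stalls/instr = 6.6 CC */
    return 0;
}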
5. Giving Priority to Read Misses over Writes

SW R3, 512(R0)    ; cache index 0
LW R1, 1024(R0)   ; cache index 0
LW R2, 512(R0)    ; cache index 0    (is R2 == R3 ?)

• In write-through, write buffers complicate memory access: they might hold the updated value of a location needed on a read miss
– RAW conflicts with main-memory reads on cache misses
• Read miss waits until the write buffer is empty → increases the read miss penalty
• Alternatively, check the write-buffer contents before the read; if there are no conflicts, let the memory access continue (the read gets priority over the write)
• Write back?
– Read miss replacing a dirty block
– Normal: write the dirty block to memory, and then do the read
– Instead: copy the dirty block to a write buffer, then do the read, and then do the write
– The CPU stalls less since it restarts as soon as the read is done
6. Avoiding Address Translation during Indexing of the Cache
• $ means cache
[Figure: three cache organizations with respect to address translation.
– Conventional organization: CPU → VA → TLB → PA → cache ($) → PA → memory (the cache is physically indexed and tagged).
– Virtually addressed cache: CPU → VA → cache with VA tags; translation happens only on a miss, before going to memory; raises the synonym (alias) problem across translation.
– Overlapped organization: cache access is overlapped with VA translation (TLB and cache accessed in parallel, with an L2 cache accessed by PA); requires the cache index to remain invariant across translation.]
Why not Virtual Cache?
• Task switch causes the same VA to refer to different PAs
– Hence, cache must be flushed
• Huge task-switch overhead
• Also creates huge compulsory miss rates for the new process
• The synonym (alias) problem: different VAs map to the same PA
– Two copies of the same data in a virtual cache
• Anti‐aliasing HW mechanism is required (complicated)
• SW can help
• I/O (always uses PA)
– Require mapping to VA to interact with a virtual cache
Advanced Cache Optimizations
• Reducing hit time
1. Small and simple caches
2. Way prediction
• Increasing cache bandwidth
3. Pipelined caches
4. Nonblocking caches
5. Multibanked caches
• Reducing Miss Penalty
6. Critical word first
7. Merging write buffers
• Reducing Miss Rate
8. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism
9. Hardware prefetching
10. Compiler prefetching
1. Small and Simple L1 Cache
• Critical timing path in cache:
– addressing tag memory, then comparing tags, then selecting correct set
– Index tag memory and then compare takes time
• Direct‐mapped caches can overlap tag compare and transmission of data
– Since there is only one choice
• Lower associativity reduces power because fewer cache lines are accessed
L1 Size and Associativity
[Figure: access time vs. cache size and associativity.]
L1 Size and Associativity
[Figure: energy per read vs. cache size and associativity.]
2. Fast Hit times via Way Prediction
• How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
• Way prediction: keep extra bits in the cache to predict the "way" (block within the set) of the next cache access
– The multiplexor is set early to select the desired block, and only one tag comparison is performed that clock cycle, in parallel with reading the cache data
– On a way mis-prediction, check the other blocks for matches in the next clock cycle
[Figure: a correctly predicted access costs the normal hit time; a way-miss adds an extra cycle of hit time before an actual miss pays the miss penalty.]
• Accuracy ≈ 85%
• Drawback: CPU pipelining is hard if a hit takes 1 or 2 cycles
– Used for instruction caches rather than data caches
Way Prediction
• To improve hit time, predict the way to pre‐set mux
– Mis‐prediction gives longer hit time
– Prediction accuracy
• > 90% for two‐way
• > 80% for four‐way
• I‐cache has better accuracy than D‐cache
– First used on MIPS R10000 in mid‐90s
– Used on ARM Cortex‐A8
• Extend to predict block as well
– “Way selection”
– Increases mis‐prediction penalty
3. Increasing Cache Bandwidth by Pipelining
• Pipeline cache access to improve bandwidth
– Examples:
• Pentium: 1 cycle
• Pentium Pro – Pentium III: 2 cycles
• Pentium 4 – Core i7: 4 cycles
• Makes it easier to increase associativity
• But, pipeline cache increases the access latency
– More clock cycles between the issue of the load and the use of the data
• Also Increases branch mis‐prediction penalty
4. Increasing Cache Bandwidth: Non‐Blocking Caches
• Non-blocking (lockup-free) cache: allows the data cache to continue to supply cache hits during a miss
– Requires full/empty (F/E) bits on registers or out-of-order execution
– Requires multi-bank memories
• "Hit under miss" reduces the effective miss penalty by working during a miss instead of ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
– Significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise it cannot be supported)
– The Pentium Pro allows 4 outstanding memory misses
Nonblocking Cache Performance
• L2 must support this
• In general, processors can hide the L1 miss penalty but not the L2 miss penalty
Increasing Cache Bandwidth via Multiple Banks
• Rather than treat the cache as a single monolithic block, divide into independent banks that can support simultaneous accesses
– E.g.,T1 (“Niagara”) L2 has 4 banks
• Banking works best when the accesses naturally spread themselves across the banks → the mapping of addresses to banks affects the behavior of the memory system
• A simple mapping that works well is "sequential interleaving"
– Spread block addresses sequentially across the banks
– E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on
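A small sketch of the sequential-interleaving mapping described above, with 4 banks:

#include <stdio.h>

/* Sequential interleaving: spread block addresses across banks so that
   consecutive blocks land in different banks. */
int main(void) {
    int num_banks = 4;
    for (int block_addr = 0; block_addr < 8; block_addr++)
        printf("block %d -> bank %d\n", block_addr, block_addr % num_banks);
    /* Blocks 0,4 -> bank 0; 1,5 -> bank 1; 2,6 -> bank 2; 3,7 -> bank 3 */
    return 0;
}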
5. Increasing Cache Bandwidth via Multibanked Caches
• Organize cache as independent banks to support simultaneous access (rather than a single monolithic block)
– ARM Cortex‐A8 supports 1‐4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for L2
• Banking works best when accesses naturally spread themselves across banks
– Interleave banks according to block address
6. Reduce Miss Penalty:
Critical Word First and Early Restart
• The processor usually needs just one word of the block at a time
• Do not wait for the full block to be loaded before restarting the processor
– Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
– Early restart: as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution
• The benefits of critical word first and early restart depend on
– Block size: generally useful only with large blocks
– The likelihood of another access to the portion of the block that has not yet been fetched
• Spatial locality problem: programs tend to want the next sequential word, so it is not clear how much the technique benefits
7. Merging Write Buffer to Reduce Miss Penalty
• Write buffer to allow processor to continue while waiting to write to memory
• If buffer contains modified blocks, the addresses can be checked to see if address of new data matches the address of a valid write buffer entry. If so, new data are combined with that entry
• Increases the effective block size of writes for a write-through cache when writes go to sequential words or bytes, since multiword writes are more efficient to memory
Merging Write Buffer
• When storing to a block that is already pending in the write buffer, update write buffer
• Reduces stalls due to full write buffer
[Figure: write-buffer contents without write merging (each word occupies its own buffer entry) and with write merging (sequential words are combined into one entry).]
8. Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% (in software) on an 8 KB direct-mapped cache with 4-byte blocks
• Instructions
– Reorder procedures in memory so as to reduce conflict misses
– Profiling to look at conflicts (using tools they developed)
• Data
– Loop interchange: swap nested loops to access data in the order it is stored in memory (sequential order)
– Loop fusion: combine two independent loops that have the same looping and some variables in common
– Blocking: improve temporal locality by accessing "blocks" of data repeatedly, instead of going down whole columns or rows
• Instead of accessing entire rows or columns, subdivide matrices into blocks
• Requires more memory accesses but improves the locality of the accesses
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
Perform different computations on the common data in two loops → fuse the two loops.
Two misses per access to a and c vs. one miss per access; improves temporal locality.
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  };
• Two Inner Loops:
– Read all NxN elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]
• Capacity misses are a function of N and the cache size:
– 2N³ + N² words accessed (assuming no conflicts; otherwise …)
• Idea: compute on BxB submatrix that fits
[Figure: snapshot of the accesses to x, y, and z when N = 6 and i = 1, before blocking. White: not yet touched; light: older access; dark: newer access.]
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      };
• B is called the blocking factor
• Capacity misses drop from 2N³ + N² to 2N³/B + N²
• Conflict misses too?
[Figure: the age of accesses to x, y, and z when B = 3. Note, in contrast to the previous figure, the smaller number of elements accessed.]
9. Reducing Miss Penalty or Miss Rate by Hardware Prefetching of Instructions & Data
• Prefetching relies on having extra memory bandwidth that can be used without penalty
• Instruction prefetching
– Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
– The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer
• Data prefetching
– The Pentium 4 can prefetch data into the L2 cache from up to 8 streams in 8 different 4 KB pages
– Prefetching is invoked if there are 2 successive L2 cache misses to a page and the distance between those cache blocks is < 256 bytes
[Figure: performance improvement from hardware prefetching on an Intel Pentium 4, for two SPECint2000 benchmarks (gap, mcf) and several SPECfp2000 benchmarks (wupwise, swim, mgrid, applu, galgel, facerec, lucas, fma3d, equake); speedups range from about 1.16 to 1.97.]
10. Reducing Miss Penalty or Miss Rate by Compiler-Controlled Prefetching of Data
• A prefetch instruction is inserted before the data is needed
• Data prefetch
– Register prefetch: load the data into a register (HP PA-RISC loads)
– Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
– Special prefetching instructions cannot cause faults; a form of speculative execution
• Issuing prefetch instructions takes time
– Is the cost of issuing prefetches < the savings from reduced misses?
– Wider superscalar issue reduces the difficulty of finding issue bandwidth
– Combine with software pipelining and loop unrolling (see the sketch below)
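As an illustration of what a compiler (or programmer) might emit, here is a sketch that uses GCC's __builtin_prefetch to fetch array data a few iterations ahead of use; the loop and the prefetch distance of 16 elements are arbitrary choices for illustration, not taken from the lecture:

/* Software prefetching in a simple array sum: fetch data a few iterations
   ahead of use. Uses GCC's __builtin_prefetch; the prefetch distance of
   16 elements is an arbitrary illustrative choice. */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read access, low temporal locality */
        s += a[i];
    }
    return s;
}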
Summary
[Figure: summary table of the advanced cache optimizations.]
Memory Technology
• Performance metrics
– Latency is the concern of caches
– Bandwidth is the concern of multiprocessors and I/O
– Access time
• Time between read request and when desired word arrives
– Cycle time
• Minimum time between unrelated requests to memory
• DRAM used for main memory, SRAM used for cache
• SRAM: static random access memory
– Requires low power to retain bit, since no refresh
– But, requires 6 transistors/bit (vs. 1 transistor/bit)
• DRAM
– One transistor/bit
– Must be re‐written after being read
– Must also be periodically refreshed
• Every ~ 8 ms
• Each row can be refreshed simultaneously
– Address lines are multiplexed:
• Upper half of address: row access strobe (RAS)
• Lower half of address: column access strobe (CAS)
DRAM Technology
• Emphasis on cost per bit and capacity
• Multiplexed address lines → cuts the number of address pins in half
– Row access strobe (RAS) first, then column access strobe (CAS)
– Memory is organized as a 2D matrix; a row access brings a row into a buffer
– A subsequent CAS selects the sub-row (column) from that buffer
• Only a single transistor is used to store a bit
– Reading that bit can destroy the information
– Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
• Keep refresh time to less than 5% of the total time
• DRAM capacity is 4 to 8 times that of SRAM
DRAM Logical Organization (4Mbit)
[Figure: 4 Mbit DRAM logical organization. An 11-bit address (A0-A10) selects a row of the 2,048 × 2,048 memory array; the row is read into the sense amps & I/O, and the column decoder then selects the data bit (D/Q). Each storage cell sits at the intersection of a word line and a bit line.]
• Square root of the bits per RAS/CAS
DRAM Technology (cont.)
• DIMM: Dual inline memory module
– DRAM chips are commonly sold on small boards called DIMMs
– DIMMs typically contain 4 to 16 DRAMs
• Slowing down in DRAM capacity growth
– Four times the capacity every three years, for more than 20 years
– Since 1998, new chips have only doubled capacity every two years
• DRAM performance is growing at a slower rate
– RAS (related to latency): 5% per year
– CAS (related to bandwidth): 10%+ per year
RAS Improvement
[Figure: row-access-strobe (RAS) time improvement across DRAM generations.]
Quest for DRAM Performance
1. Fast page mode
– Add timing signals that allow repeated accesses to the row buffer without another row access time
– Such a buffer comes naturally, as each array buffers 1,024 to 2,048 bits per access
2. Synchronous DRAM (SDRAM)
– Add a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the DRAM controller
3. Double data rate (DDR SDRAM)
– Transfer data on both the rising and falling edges of the DRAM clock signal → doubles the peak data rate
– DDR2 lowers power by dropping the voltage from 2.5 V to 1.8 V and offers higher clock rates, up to 400 MHz
– DDR3 drops to 1.5 V and raises clock rates up to 800 MHz
– DDR4 drops to 1.2 V, with clock rates up to 1600 MHz
• These techniques improve bandwidth, not latency
DRAM names are based on peak chip transfers/sec; DIMM names are based on peak DIMM MBytes/sec. Fastest for sale 4/06 ($125/GB). M transfers/second = 2 × clock rate (data on both edges); MBytes/s/DIMM = 8 × M transfers/second (a 64-bit DIMM moves 8 bytes per transfer).

Standard | Clock Rate (MHz) | M transfers/second | DRAM Name | MBytes/s/DIMM | DIMM Name
DDR      | 133 | 266  | DDR266    | 2128  | PC2100
DDR      | 150 | 300  | DDR300    | 2400  | PC2400
DDR      | 200 | 400  | DDR400    | 3200  | PC3200
DDR2     | 266 | 533  | DDR2-533  | 4264  | PC4300
DDR2     | 333 | 667  | DDR2-667  | 5336  | PC5300
DDR2     | 400 | 800  | DDR2-800  | 6400  | PC6400
DDR3     | 533 | 1066 | DDR3-1066 | 8528  | PC8500
DDR3     | 666 | 1333 | DDR3-1333 | 10664 | PC10700
DDR3     | 800 | 1600 | DDR3-1600 | 12800 | PC12800
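A small sketch of the ×2 and ×8 relationships in the table (DDR transfers data on both clock edges, and a 64-bit DIMM moves 8 bytes per transfer):

#include <stdio.h>

/* Relationship between DDR clock rate, transfer rate, and DIMM bandwidth:
   transfers/s = 2 x clock (both edges), DIMM MB/s = 8 bytes x transfers/s. */
int main(void) {
    int clock_mhz[] = {133, 200, 400, 800};
    for (int i = 0; i < 4; i++) {
        int mtransfers   = 2 * clock_mhz[i];    /* e.g. 133 -> 266  */
        int mbytes_per_s = 8 * mtransfers;      /* e.g. 266 -> 2128 */
        printf("clock %4d MHz -> %4d MT/s -> %5d MB/s per DIMM\n",
               clock_mhz[i], mtransfers, mbytes_per_s);
    }
    return 0;
}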
DRAM Performance
[Figure: DRAM performance trends.]
Graphics Memory
• GDDR5 is graphics memory based on DDR3
• Graphics memory:
– Achieve 2‐5 X bandwidth per DRAM vs. DDR3
• Wider interfaces (32 vs. 16 bit)
• Higher clock rate
– Possible because they are attached via soldering instead of socketed DIMM modules
Memory Power Consumption
[Figure: memory power consumption.]
SRAM Technology
• Cache uses SRAM: Static Random Access Memory
• SRAM uses six transistors per bit to prevent the information from being disturbed when read → no need to refresh
– SRAM needs only minimal power to retain its state in standby mode → good for embedded applications
– There is no difference between access time and cycle time for SRAM
• Emphasis on speed and capacity
– SRAM address lines are not multiplexed
• SRAM speed is 8 to 16× that of DRAM
ROM and Flash
• Embedded processor memory
• Read-only memory (ROM)
– Programmed at the time of manufacture
– Only a single transistor per bit to represent 1 or 0
– Used for the embedded program and for constants
– Nonvolatile and indestructible
• Flash memory
– Must be erased (in blocks) before being overwritten
– Nonvolatile, but allows the memory to be modified
– Reads at almost DRAM speeds, but writes 10 to 100 times slower
– DRAM capacity per chip and MB per dollar is about 4 to 8 times greater than flash
– Cheaper than SDRAM, more expensive than disk
– Slower than SRAM, faster than disk
Memory Dependability
• Memory is susceptible to cosmic rays
• Soft errors: dynamic errors
– Detected and fixed by error correcting codes (ECC)
• Hard errors: permanent errors
– Use spare rows to replace defective rows
• Chipkill: a RAID‐like error recovery technique
Virtual Memory ?
• The limits of physical addressing
– All programs share one physical address space
– Machine language programs must be aware of the machine organization
– No way to prevent a program from accessing any machine resource
• Recall: many processes use only a small portion of address space
• Virtual memory divides physical memory into blocks (called page or segment) and allocates them to different processes
• With virtual memory, the processor produces virtual addresses that are translated by a combination of hardware and software into physical addresses (called memory mapping or address translation).
Virtual Memory: Add a Layer of Indirection
[Figure: the CPU issues "virtual addresses" (A0-A31, D0-D31) that pass through address translation to become "physical addresses" presented to memory.]
• User programs run in a standardized virtual address space
• Address-translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
• The hardware supports "modern" OS features: protection, translation, sharing
Virtual Memory
[Figure: virtual pages mapped onto physical page frames by a page table.]
Virtual Memory (cont.)
• Permits applications to grow bigger than the main memory size
• Helps with multiple-process management
– Each process gets its own chunk of memory
– Permits protection of one process' chunks from another
– Maps multiple chunks onto shared physical memory
– Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
– The application and CPU run in a virtual space (logical memory, 0 - max)
– Mapping onto the physical space is invisible to the application
• Cache vs. virtual memory
– A block becomes a page or segment
– A miss becomes a page fault or address fault
3 Advantages of VM
• Translation:
– A program can be given a consistent view of memory, even though physical memory is scrambled
– Makes multithreading reasonable (now used a lot!)
– Only the most important part of the program (the "working set") must be in physical memory
– Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
• Protection:
– Different threads (or processes) are protected from each other
– Different pages can be given special behavior (read-only, invisible to user programs, etc.)
– Kernel data is protected from user programs
– Very important for protection against malicious programs
• Sharing:
– The same physical page can be mapped to multiple users ("shared memory")
• Protection via virtual memory
– Keeps processes in their own memory space
• Role of architecture:
– Provide user mode and supervisor mode
– Protect certain aspects of CPU state
– Provide mechanisms for switching between user mode and supervisor mode
– Provide mechanisms to limit memory accesses
– Provide TLB to translate addresses
Page Tables Encode Virtual Address Spaces
[Figure: a virtual address space is divided into blocks of memory called pages; a page table, indexed by the virtual address, maps each page to a physical-memory "frame". The OS manages the page table for each address-space ID (ASID). A machine usually supports pages of a few sizes (e.g., the MIPS R4000). A valid page table entry encodes the physical frame address for the page.]
Details of Page Table
[Figure: a virtual address is split into a virtual page number and a 12-bit offset. The Page Table Base Register plus the virtual page number index into the page table, which is located in physical memory; each entry holds a valid bit (V), access rights, and a physical page (frame) number. The frame number is concatenated with the 12-bit offset to form the physical address.]
• The page table maps virtual page numbers to physical frames ("PTE" = Page Table Entry)
• Virtual memory ⇒ treat main memory as a cache for disk
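A minimal sketch of the translation step the figure describes, assuming a hypothetical single-level page table with 4 KB pages; the PTE layout is invented for illustration:

#include <stdint.h>

/* Hypothetical single-level page table with 4 KB pages (12-bit offset). */
#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

typedef struct { uint32_t frame; int valid; int access_rights; } pte_t;

/* Translate a virtual address using the page table; returns 0 on a page
   fault (valid bit clear), which the OS would have to handle. */
int translate(const pte_t *page_table, uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va >> PAGE_SHIFT;      /* virtual page number */
    uint32_t offset = va & PAGE_MASK;        /* offset within page  */
    if (!page_table[vpn].valid)
        return 0;                            /* page fault          */
    *pa = (page_table[vpn].frame << PAGE_SHIFT) | offset;
    return 1;
}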
Page Table Entry (PTE)?
• What is in a Page Table Entry (PTE)?
– A pointer to the next-level page table or to the actual page
– Permission bits: valid, read-only, read-write, write-only
• Example: Intel x86 architecture PTE
– Address format as on the previous slide (10-bit, 10-bit, 12-bit offset)
– Intermediate page tables are called "directories"
[Figure: x86 PTE layout — Page Frame Number (physical page number) in bits 31-12, bits 11-9 free for OS use, then L, D, A, PCD, PWT, U, W, P in the low bits.]
– P: present (same as the "valid" bit in other architectures)
– W: writeable
– U: user accessible
– PWT: page write transparent (external cache write-through)
– PCD: page cache disabled (page cannot be cached)
– A: accessed (page has been accessed recently)
– D: dirty (PTE only; page has been modified recently)
– L: 4 MB page (directory entry only); the bottom 22 bits of the virtual address serve as the offset
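A sketch of how these flag bits could be decoded from a raw 32-bit x86-style PTE; the macro and function names are invented for illustration, not from any real kernel:

#include <stdint.h>

/* Hypothetical helpers for decoding a 32-bit x86-style PTE as laid out
   above. Names are illustrative only. */
#define PTE_P   (1u << 0)   /* present                         */
#define PTE_W   (1u << 1)   /* writeable                       */
#define PTE_U   (1u << 2)   /* user accessible                 */
#define PTE_PWT (1u << 3)   /* page write transparent          */
#define PTE_PCD (1u << 4)   /* page cache disabled             */
#define PTE_A   (1u << 5)   /* accessed                        */
#define PTE_D   (1u << 6)   /* dirty                           */
#define PTE_L   (1u << 7)   /* large (4 MB) page, in directory */

static inline uint32_t pte_frame(uint32_t pte)   { return pte >> 12; }          /* bits 31-12 */
static inline int      pte_present(uint32_t pte) { return (pte & PTE_P) != 0; }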
Cache vs. Virtual Memory
• Replacement
– Cache miss handled by hardware
– Page fault usually handled by OS
• Addresses
– Virtual memory space is determined by the address size of the CPU
– Cache space is independent of the CPU address size
• Lower level memory
– For caches ‐ the main memory is not shared by something else
– For virtual memory ‐ most of the disk contains the file system
• File system addressed differently ‐ usually in I/O space
• Virtual memory lower level is usually called SWAP space
The same 4 questions for Virtual Memory
• Block placement
– Choice: lower miss rate with complex placement, or vice versa
• The miss penalty is huge, so choose a low miss rate → place the block anywhere
• Similar to a fully associative cache model
• Block identification — both approaches use an additional data structure
– Fixed-size pages: use a page table
– Variable-sized segments: use a segment table
• Block replacement — LRU is the best
– However, true LRU is a bit complex, so use an approximation
• The page table contains a use bit, which is set on each access
• The OS checks the use bits every so often, records what it sees in a data structure, and then clears them all
• On a miss, the OS decides which page has been used least and replaces it
• Write strategy — always write back
– Given the access time of the disk, write-through is silly
– Use a dirty bit to write back only pages that have been modified
Techniques for Fast Address Translation
• Page table is kept in main memory (kernel memory)
– Each process has a page table
• Every data/instruction access requires two memory accesses
– One for the page table and one for the data/instruction
– Can be solved by the use of a special fast‐lookup hardware cache
called associative registers or translation look‐aside buffers (TLBs)
• If locality applies then cache the recent translation
– TLB = translation look‐aside buffer
– A TLB entry holds: virtual page number, physical page number, protection bit, use bit, dirty bit (see the sketch below)
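A minimal software model of a TLB lookup falling back to a page-table walk; the tiny fully associative TLB and all structures below are invented for illustration:

#include <stdint.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12

/* Hypothetical TLB entry: virtual page number, physical frame, valid bit
   (protection, use, and dirty bits omitted for brevity). */
typedef struct { uint32_t vpn, frame; int valid; } tlb_entry_t;

uint32_t walk_page_table(uint32_t vpn);   /* slow path: extra memory access(es) */

/* Check the small fully associative TLB first; on a miss, walk the page
   table and cache the translation (naively replacing entry 0). */
uint32_t tlb_translate(tlb_entry_t tlb[TLB_ENTRIES], uint32_t va) {
    uint32_t vpn = va >> PAGE_SHIFT, offset = va & ((1u << PAGE_SHIFT) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)          /* TLB hit  */
            return (tlb[i].frame << PAGE_SHIFT) | offset;

    uint32_t frame = walk_page_table(vpn);              /* TLB miss */
    tlb[0] = (tlb_entry_t){ vpn, frame, 1 };
    return (frame << PAGE_SHIFT) | offset;
}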
Translation Look‐Aside Buffers
• A translation look-aside buffer (TLB) is:
– A cache on translations
– Fully associative, set associative, or direct mapped
[Figure: translation with a TLB — the CPU presents a virtual address to the TLB; on a TLB hit the physical address goes straight to the cache, and a cache hit returns data; on a TLB miss (or a cache miss) the full translation is performed and main memory is accessed.]
• TLBs are:
– Small — typically no more than 128-256 entries
– Fully associative

The TLB Caches Page Table Entries
[Figure: the TLB caches page table entries for the current ASID. A virtual address (page number, offset) is looked up in the TLB to obtain the physical frame, which is combined with the offset to form the physical address; physical and virtual pages must be the same size. Pages with V = 0 either reside on disk or have not yet been allocated; the OS handles V = 0 as a "page fault".]
Caching Applied to Address Translation
[Figure: the CPU issues a virtual address; if the translation is cached in the TLB, the physical address is used directly to access physical memory; if not, the MMU performs the translation first. The data read or write itself then proceeds untranslated to physical memory.]
Virtual Memory and Virtual Machines
Virtual Machines
• Supports isolation and security
• Sharing a computer among many unrelated users
• Enabled by raw speed of processors, making the overhead more acceptable
• Allows different ISAs and operating systems to be presented to user programs
– “System Virtual Machines”
– SVM software is called “virtual machine monitor” or “hypervisor”
– Individual virtual machines run under the monitor are called “guest VMs”
Impact of VMs on Virtual Memory
• Each guest OS maintains its own set of page tables
– VMM adds a level of memory between physical and virtual memory called “real memory”
– VMM maintains shadow page table that maps guest virtual addresses to physical addresses
• Requires VMM to detect guest’s changes to its own page table
• Occurs naturally if accessing the page table pointer is a privileged operation