Memory Hierarchy(2)

Notice

Recent Posts

Recent Comments

Link

Tags more

Archives

관리 메뉴

수정입니다

Memory Hierarchy(2) - Cache_1 본문

전공/컴퓨터구조

Memory Hierarchy(2) - Cache_1

nongdamgom 2023. 12. 28. 12:38

전 포스팅에 메모리에는 계층이 있고, CPU는 언제나 가장 빠른 메모리를 이용한다고 했다.

Cache memory

The level of the 'memory hierarchy' closest to the CPU (fastest)

cache가 가장 빠른 메모리인건 알겠는데,

내가 찾고자 하는 정보가 cache의 대체 어디에 저장 되어있을까?

(simply, word 단위의 데이터, block = 1word = 4bytes 가정)

Where do we look? and put?

Direct-mapped
Fully associative
M-way Set associative

Direct Mapped Cache

Location determined by memory address
Direct mapped : only one choice - 각 메모리의 위치가 cache내 정확히 한 곳에 위치

--> (Block address) % (number of Blocks in cache)

Block address = (memory byte address) / (number of byte in a block)

ex) block address = 01101 , number of block = 8

-> 01101 % 01000 = 00101 (13%8 = 5)

즉, block 수가 2의 거듭제곱 꼴이면, 2^3, 01101에서 하위 3bit만 취하는 방식

원하는 block의 주소가 01101 일 때,

CPU가 알아서 계산을 해서 하위 3bit를 취하고, 그 정보가 cache 내에 존재하는지, 아닌지 확인

하지만, 하위 3 bit가 101인 여러 memory가 cache내에 101 block에 저장될 수 있다.

=> ambiguity

Tags and Vaild Bits

그러면 특정 block이 cache내에 저장 됐는지, 안됐는지 어떻게 알 수 있을까

--> only need the high-order bits : tag

--> + vaild bit : 있으면 1, 없으면 0 -> 0으로 초기화 되어있다

ex) 8-blocks, 1word/block, directed mapped

cache 내로 들어왔을 때, 3가지 상황 발생

vaild bit = 0 (miss)
vaild bit = 1, right tag (hit)
vaild bit = 1, different tag (miss)

vaild bit = 0 (MISS)

맞는 하위 3bit block으로 방문
vaild bit 확인 == 아무것도 없다
main memory 가서 찾는 memory copy
copy한 memory를 cache에 적고
비어있던 Tag bit를 채워줌 (하위 3bit 제외 나머지 bit - 상위 2bit)
vaild bit를 1로 setting
CPU로 얻은 memory를 보내기

vaild bit = 1, right tag (HIT)

맞는 하위 3bit block으로 방문
vaild bit 확인 == 뭔가 있다
내가 찾을 address의 상위 2bit랑 tag bit 를 compare -> 맞으면 hit
cache 에 저장된 memory를 CPU로 보내기

vaild bit = 1, different tag (MISS)

맞는 하위 3bit block으로 방문
vaild bit 확인 == 뭔가 있다
내가 찾을 address의 상위 2bit랑 tag bit 를 compare -> 다르면 miss
main memroy 가서 get data 하고
새롭게 얻은 data를 cache에 copy
기존에 있던 data를 새로 얻은걸로 변경함 (cache는 매우 작기 때문에 여러개 저장하기 힘들다)
tag bit도 새로운걸로 수정
그리고 back to the CPU

** 넘어가기 전 틈새 상식

byte offset : byte 내의 위치 구분

(1 block = 4 bytes라면, 1 block은 4개의 byte(인덱스 0, 1, 2, 3)로 이루어져있고 ,

block 단위로 읽어들이기 때문에 block내에서 몇번 째 byte인지 명시 -> 하위 2bit가 (00, 01, 10,11)로 구분해준다)

==> byte offset ( byte 개수 = 2^n 이면, offset = n bit)

Address Subdivison

ex) 1 block = 4bytes , 32bit memory, cache has 1024 blocks

tag ( 20 bits )

index ( 10 bits )

Byte

offset

(2 bits)

byte offset = 2bits(1block = 4bytes = 2^2 bytes니까)

cache가 1024 block을 가졌다? -> 2^10 = 1024, 10개의 bit로 1024개의 정보를 나타낼 수 있다

tag = memory bit - index bit - offset

= 32 - 10 - 2 = 20 bits

AND gate : Tag bit 1 AND Vaild bit 1 -> HIT

EX: Larger Block Size

지금까지는 1 block = 1 word = 4byte로 가정
사실 이건 바뀔 수 있다.
위 예시랑 비교해서, block size는 늘고, block 개수는 줄어들음

32bits memory, 64blocks, 16bytes/block, memory address 1200

1 block = 4 word = 16bytes
1block = 16bytes
mem addr 1200 = word addr 300 = block addr 75
block number = 75 % 64 = 11

tag ( 22 bits )

index ( 6 bits )

Byte

offset

(4 bits)

byte offset = 2^4bytes/bits = 4bits

index = 2^6 block, 6bits

tag = 32 - 6 - 4 = 22 bits

Block Size Considerations

Larger blocks should reduce miss rate due to spatial locality

--> block 사이즈가 크면, 한번에 크게크게 main memory에서 복사해오기 때문에, miss 일 때 왔다갔다 하는 시간이 단축됨

--> 만약에 block이 세세하게 나뉘어 있으면, 매우 느린 main memory에서 copy를 더 많이 해야함

But in a fixed-sized cache ( 돈 이슈로 cache 크기가 한정되어 있다)

--> block 사이즈가 크면, 한정된 크기 안에서, 저장해야할 정보들의 competition이 심해짐(swap이 잦다)

--> 이 역시 miss rate를 증가 시킴

너무 커도 작아도 miss rate를 증가 시키니... 최적의 수를 찾아야 한다.. -> 어려운 문제

Larger miss penalty

--> block 사이즈가 크면, 한번에 copy할 때 많이 함 ->근데 그게 다 필요 없으면 copy 시간이 낭비된다...

아무튼 너무 커도 안좋고 너무 작아도 안좋으니 중도를 찾자..

Cache (Read) Misses

On cache hit, CPU proceeds "normally"

--> pipelining is moving without ansy stop and stall

On cache miss -> stall the CPU pipeline
Fetch block from next level of hierarchy

--> main memory에서 read 해야하고, 이 접근이 완료될 때까지 기다려야 함

--> 위에서 읽은걸 copy 해서 cache에 write 해야함(data, tag, vaild bit)

Restart cache access (memory 도 inst, data 있듯이, 이 memory도 각각 cache 존재)

--> For instrction cache miss

: Send original PC value(PC - 4) and restart instruction fetch

--> For data cache miss

: Complete data access (그냥 밑에서 데이터 가져오는거 기다리면 됨)

Cache (Write) Misses

read misses 랑 비슷하지만 extra steps 이 필요함
What should happen on a write miss?

--> On data-write hit : could just update the block in cache

--> BUT 만약 'store' 에 뭔가 쓰려고 할 때, cache 에다가만 쓰면 main memory는 다른 값을 가지게 됨(old value)

=> inconsistency

how to solve? (쓰고자 하는 블록이 cache에 있을 때, inconsistency를 해결하는..)

Write-through
Write-back

Write - Through

always update both the cache and the memory

--> 매우 느림

ex) base CPI = 1, 10% of inst : store, write to memory takes 100 cycles

-> read : 1 cycle , write-through : 1 + 100 cycles (write-through 과정에 이미 read가 포

-> effective CPI = 1 + 0.1*100 = 11

solution : write buffer

--> Holds data waiting to be written to memory

--> CPU continues immediately

--> main memory에 안가고 wirte buffer에 쓸거 잔뜩 일단 쌓아놓고 바로 CPU로 돌아감(buffer 가 다 찼을 때만 stall 발생)

Write - Back

Update only the cache and write the modified block to memory when it is replaced

--> hit 일 때는 그냥 cache만 계속 update함 : 이러다보면 그 block 이 dirty 해진다..

--> 이 dirty block을 replace 할 때만 memory에 가서 적음

replace? : cache의 x라는 공간이 dirty해져서 y로 바꾸고 싶을 때, 이 x에 적혀있는 last value를 그때서야 main memory에 write back 하고 main memory 에 있는 y의 값을 cache의 x 공간에 copy 한다.

write buffer 사용도 허용됨
usually faster, but much more complex

Wirte Allocation

On a write miss, should we fetch the memory into cache?
For write-through

--> Allocate on miss : fetch the block (main memory 가서 cache에 복사)

--> Wirte around : just wirte directly to memory

For wirte-back : usually fetch block

Intrinsity FastMATH

그냥 MIPS process의 한 예시
12 stage pipeline, Instruction and data access on each cycle
Split cache : seperate I-cache, D-cache (to avoid structural hazard)

--> Each 16KB: 256 blocks x 16 words/block

: 16 KB = 24*210 B = 28blocks * 24words/block(26bytes/block)

--> D-cache : write-through or write-back

SPEC2000 miss rates

--> I-cache : 0.4%, D-cache: 11.4%, Weighted avg : 3.2%

=> Instruction cache miss가 훨씬 더 적다(명령을 더 sequentially 하게함 -> locality가 훨씬 낫다)

Measuring Cache performance

CPU time = (CPU execution time) + (Memory-stall time)

--> execution time : includes cache hit time

--> stall cycles : cache misses

assuming similar cycels for read and write (ignoring wirte buffer stalls)

Cache performance Example

I-cache miss rate = 2%

D-cache miss rate = 4%

Miss penalty = 100 cycles

Base CPI = 2

Load & stores are 36% of instructions

Miss cycles per instruction

I-cache : 0.02*100 = 2

D-cache : 0.36*0.04*100 = 1.44

Actual avg CPI = 2(base) + 2(I) + 1.44(D) = 5.44

(Ideal CPU is 5.44/2 = 2.72 time faster)

(If base CPI =1, 4.44/1 = 4.44 times faster)

(base CPI가 적을수록 miss penalty가 더 중요해짐)

Summary

Decreasing base CPI : memory stall이 시간 소모에 더 큰 비중
Increasing clock rate : memory stall 이 더 많은 CPU cycle 차지

--> ex) 1GHz -> 2 GHz means penalty 100 cycles -> 200 cycles

(사실 miss penalty는 cycle이라기보다는 access time이다. clock rate를 높인다고해서 miss penalty에 아무런 영향 x 오히려 cycle 개수가 늘어나면 그만큼 접근도 많아질테니 접근 지연시간도 늘어나게 됨)

cache가 performance에 정~~~~말 중요하다

Average memory access time

Hit time is also important for performance
Agerage memory access time(AMAT)

--> AMAT = Hit time + (Miss rate x Misspenalty)

ex) CPU with 1ns clock, hit time = 1cycle, miss penalty = 20cycles, I-cache miss rate = 5%

: AMAT = 1 + 0.05*20 = 2ns ( 2cycles per instruction)

'전공 > 컴퓨터구조' 카테고리의 다른 글

Memory Hierarchy(4) - virtual memory (0)	2023.12.28
Memory Hierarchy(3) - Cache_2, Hamming code (0)	2023.12.28
Memory Hierarchy(1) - Disk access time (0)	2023.12.28
Pipelining(2) (0)	2023.12.28
Pipelining(1) (1)	2023.12.28

'전공/컴퓨터구조' Related Articles

수정입니다

Memory Hierarchy(2) - Cache_1 본문

Memory Hierarchy(2) - Cache_1

'전공 > 컴퓨터구조' 카테고리의 다른 글

티스토리툴바