Multicore Midterm Summary
Multicore Processor
- Multiple cores run multiple instructions at the same time (concurrently)
- Increases overall program speed (performance)
- Speedup is strongly dependent on the software algorithms and their implementation
CUDA
- Compute Unified Device Architecture
- A parallel computing platform created by NVIDIA
GPGPU
- General Purpose Graphics Processing Unit
Parallel Applications
- A huge number of independent tasks makes an application easier to parallelize
Process
- Runs in a separate memory space
Thread
- A part of a process
- Runs in the shared memory space of its process
Multithread Program
- A program run by multiple threads that execute simultaneously
Parallel Computing
- Using multiple processors in parallel to solve problems more quickly than with a single processor
- Goal = performance
- Cluster computer: multiple PCs combined with a high-speed network
- Shared-memory multiprocessor: multiple processors connected to a single memory system
- Chip Multi-Processor (CMP): multiple processors on a single chip
Parallel Programming
- Using multiple processors to solve a single problem quickly
Concurrent Programming
- correctly and efficiently controlling access by multiple threads to shared resources
- Preventing bad interleavings
Parallel Programming Techniques
- Shared memory
- Distributed memory (multiple machines viewed as one computer system)
- Hybrid (shared + distributed)
- GPU parallel programming
Cluster Computing
- a set of loosely connected computers that work together so that in many respects they can be viewed as a single system.
- connected by network, homogeneous, memory not shared
- Good price-to-performance ratio
- Prone to breakdowns (hard to maintain)
Grid Computing
- federation of computer resources from multiple locations to reach a common goal
- a large scale distributed system
- Grids tend to be more loosely coupled, heterogeneous, and geographically dispersed
Cloud Computing
- Wherever we go, any place, any time, we have access to our storage and files
- Shares networked computing resources
- Why use it: building your own servers costs too much
- Cloud: users only pay for what they use
- (the cloud provider already provides everything)
Good Parallel Program
- Correct (results)
- Good performance
- Scalability
- Load balance
- Portability
- Hardware-specific utilization
Distributed System
- Resource Sharing
- Openness
- Concurrency (simply means concurrent execution is possible)
- Scalability
- Fault tolerance
- Transparency (the distribution is invisible to users)
- Heterogeneity
- Network latency
- Distributed coordination
- Data consistency (how to keep data consistent under concurrent execution)
- High start-up cost
The Concurrency Revolution
- Free lunch: programs automatically got faster as clock speeds went up
- Clock speed, execution optimization, cache
- => these improvements are concurrency-agnostic
- Obstacle: clock speed cannot be increased any more
- The performance gains are going to be accomplished in fundamentally different ways, and most current applications will no longer benefit from the free ride without significant redesign.
- new chips : Hyperthreading, multicore, cache
- programming language and systems will increasingly be forced to deal well with concurrency
Moore's Law
- Doubling of the number of transistors on integrated circuits roughly every two years.
Computer Hardware Trend
- Power consumption and heat generation are too high to be tolerated
- No more hidden parallelism to be found
- Transistors number still rising
- Clock speed flattening sharply
- Need multicore programming to utilize the additional transistors
Generic SMP
- Symmetric Multiprocessor
- Single logical memory
- The shared bus is often a bottleneck
Multi-core processors represent an important new trend in computer architecture
- Decreased power consumption and heat generation
- Minimized wire lengths and interconnect latencies.
Why writing parallel programs is hard
- Finding enough parallelism (Amdahl's law)
- granularity
- locality
- load balance
- coordination and synchronization
Amdahl's law
- p = number of processors
- s = fraction of work done sequentially
- speedup(p) = time(1)/time(p) <= 1/(s + (1-s)/p) <= 1/s
- even if the parallel part speeds up perfectly, performance is limited by the sequential part
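A quick sanity check of the bound (Java sketch; the serial fraction s = 0.1 is an assumed example value, not from the lecture):

```java
// Amdahl bound: speedup(p) <= 1 / (s + (1 - s) / p), capped at 1/s.
public class Amdahl {
    static double bound(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    public static void main(String[] args) {
        double s = 0.1;                       // assumed: 10% of the work is sequential
        for (int p : new int[]{2, 4, 8, 16, 1_000_000}) {
            System.out.printf("p=%-8d speedup <= %.2f%n", p, bound(s, p));
        }
        // Even as p grows without bound, the speedup never exceeds 1/s = 10.
    }
}
```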
Overhead of Parallelism
- cost of starting a thread or process
- cost of communicating shared data
- cost of synchronizing
- extra redundant computation
SISD (Single Instruction, Single Data)
SIMD (Single Instruction, Multiple Data) - high degree of regularity, graphics/image processing
MISD (Multiple Instruction, Single Data)
MIMD (Multiple Instruction, Multiple Data)
Creating a Parallel Program
- Decomposition: domain decomposition (block or cyclic partitioning => good locality), functional decomposition (see the sketch after this list)
- => determines the coverage of parallelism in the algorithm
- Assignment: balanced workload, reduced communication cost, reduced synchronization cost
- => determines the granularity of partitioning among processors
- Orchestration: structuring communication and synchronization
- Mapping
- Orchestration & mapping => determine the locality of computation and communication
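A minimal Java sketch of block vs. cyclic domain decomposition of array indices among threads (the array size and thread count are assumed example values):

```java
// Block vs. cyclic decomposition of n array indices among t threads
// (n = 16 and t = 4 are assumed example values).
public class Decomposition {
    public static void main(String[] args) {
        int n = 16, t = 4;

        // Block: thread k gets one contiguous chunk => good spatial locality.
        for (int k = 0; k < t; k++) {
            int lo = k * n / t, hi = (k + 1) * n / t;
            System.out.println("block  thread " + k + ": [" + lo + ", " + hi + ")");
        }

        // Cyclic: thread k gets indices k, k+t, k+2t, ... => better load balance
        // when the work per element varies.
        for (int k = 0; k < t; k++) {
            StringBuilder line = new StringBuilder("cyclic thread " + k + ":");
            for (int i = k; i < n; i += t) line.append(' ').append(i);
            System.out.println(line);
        }
    }
}
```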
Performance Scalability
- the capability of a system to increase total throughput under an increased load when resources are added.
Granularity
- A qualitative measure of the ratio of computation to communication
Fine-grain Parallelism
- low computation to communication ratio
- small amounts of computational work between communication stages
- high communication overhead
Coarse-grain Parallelism
- high computation to communication ratio
- large amount of computational work between communication events
Load Balancing
- distributing approximately equal amounts of work among tasks so that all tasks are kept busy all of the time.
Static load balancing
- low run time overhead
- Works well for homogeneous multicores
Dynamic load balancing
- High run time overhead
- Ideal for codes where the work is uneven or unpredictable, and for heterogeneous multicores
Synchronization
- Coordination of simultaneous events in order to obtain the correct runtime order and avoid unexpected conditions
- Barrier
- lock
- semaphore
Locality
- Placement of data affects performance a lot
- Parallel computation gets serialized due to memory contention and lack of bandwidth
Uniform Memory Access (UMA)
- Centrally located memory (shared)
- Every processor has the same memory access time
Non-Uniform Memory Access (NUMA)
- any processor can access any data
- Physically partitioned memory, logically one memory system
- Placement of data affects performance
Cache Coherence
- the uniformity of shared resource data that ends up stored in multiple local caches
- Problem: when a processor modifies a shared variable in its local cache, different processors may see different values of the variable
- Snooping cache coherence
- => send all requests for data to all processors
- directory-based cache coherence
Shared Memory Architecture
- All processors access all memory as a global address space (UMA, NUMA)
- Advantage: the global address space gives a user-friendly programming perspective; data sharing is fast
- Disadvantage: lack of scalability between memory and CPUs
Distributed Memory Architecture
- Advantage: scalable
- Disadvantage: the programmer is responsible for data communication
UNIX process
- The only way to create a process is fork()
- => the first process is the exception: it is created by the OS
- fork() => the child process is a copy of the parent process, and from that point the two run concurrently
Threads
- Share the code section, data section, and OS resources of the process
- Each thread has its own registers and stack
Multi process vs Multi thread
- Multi-process: no need to worry about concurrent access to variables; expensive to start
- Multi-thread: concurrent access to variables is an issue; cheap to start
join()
- One thread (A) can wait for another thread (B) to end.
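A minimal Java sketch of join(): thread A (here, main) waits for thread B to finish (the names are illustrative):

```java
public class JoinExample {
    public static void main(String[] args) throws InterruptedException {
        // Thread B does some work; thread A (here, main) waits for it to end.
        Thread b = new Thread(() -> System.out.println("B: working"));
        b.start();
        b.join();                                 // A blocks until B terminates
        System.out.println("A: B has finished, continuing");
    }
}
```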
Thread synchronization
- Safety: nothing bad ever happens (no race conditions)
- Liveness: something good eventually happens (no deadlock)
Race condition
- A kind of flaw in hardware or software where the output depends on the timing of execution
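A minimal Java sketch of a race: two threads increment a shared counter without synchronization, so the result depends on the interleaving (the iteration count is an assumed value):

```java
// Two threads increment a shared counter without synchronization.
// count++ is a read-modify-write, so updates can be lost.
public class RaceDemo {
    static int count = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) count++; };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Expected 200000, but the printed value depends on the interleaving.
        System.out.println("count = " + count);
    }
}
```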
Condition variables
- Synchronization primitives that enable threads to wait until a particular condition occurs
- Without condition variables
- => threads need to continually poll to check whether the condition is met
- A condition variable is always used in conjunction with a mutex lock
Producer-Consumer Problem
- The producer must not add data to the buffer when it is full, and the consumer must not remove data from an empty buffer (see the bounded-buffer sketch below)
- wait() : atomically release the lock and block
- notify() : pick one waiting thread and wake it up
- notifyAll() : wakes up all current waiters on the condition variable
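A minimal Java sketch of the bounded buffer described above, using an intrinsic lock with wait()/notifyAll() (the capacity and element type are assumed):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedBuffer {
    private final Deque<Integer> buffer = new ArrayDeque<>();
    private final int capacity = 4;               // assumed buffer size

    public synchronized void put(int item) throws InterruptedException {
        while (buffer.size() == capacity) {
            wait();                               // full: release the lock and block
        }
        buffer.addLast(item);
        notifyAll();                              // wake consumers waiting for data
    }

    public synchronized int take() throws InterruptedException {
        while (buffer.isEmpty()) {
            wait();                               // empty: release the lock and block
        }
        int item = buffer.removeFirst();
        notifyAll();                              // wake producers waiting for space
        return item;
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedBuffer b = new BoundedBuffer();
        new Thread(() -> {
            try {
                for (int i = 0; i < 8; i++) b.put(i);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();
        for (int i = 0; i < 8; i++) System.out.println("consumed " + b.take());
    }
}
```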
Concurrency and Parallelism
- Interleaved Concurrency : logically simultaneous processing
- Parallelism : physically simultaneous processing
Why use Concurrent Programming?
- Natural application structure
- Increased application throughput and responsiveness
- Performance from multiprocessor/multicore hardware
- Distributed system
Synchronization serves two purposes
- Ensure safety for shared updates
- Coordinate the actions of threads
Mutual Exclusion
- Prevents more than one thread from accessing a critical section at a given time
- ensuring that no two processes or threads are in their critical section at the same time
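A minimal Java sketch of mutual exclusion using synchronized, which fixes the lost-update race shown earlier (the class and method names are illustrative):

```java
// The critical section (count++) is guarded by the object's intrinsic lock,
// so at most one thread can execute it at any given time.
public class SafeCounter {
    private int count = 0;

    public synchronized void increment() { count++; }   // critical section
    public synchronized int get()        { return count; }

    public static void main(String[] args) throws InterruptedException {
        SafeCounter c = new SafeCounter();
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) c.increment(); };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(c.get());                     // always 200000
    }
}
```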
Account Transfers
- Nested locks => cycle in the locking graph => deadlock
- Standard solution: a canonical order for acquiring locks (see the sketch below)
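A minimal Java sketch of the canonical-lock-order fix (the Account class and its fields are assumed for illustration):

```java
// Always acquire the two account locks in increasing id order, so no cycle
// can appear in the locking graph and no deadlock is possible.
public class Transfer {
    static class Account {
        final int id;
        int balance;
        Account(int id, int balance) { this.id = id; this.balance = balance; }
    }

    static void transfer(Account from, Account to, int amount) {
        Account first  = from.id < to.id ? from : to;    // lower id is locked first
        Account second = from.id < to.id ? to   : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance   += amount;
            }
        }
    }

    public static void main(String[] args) {
        Account a = new Account(1, 100), b = new Account(2, 100);
        transfer(a, b, 30);
        System.out.println(a.balance + " / " + b.balance);   // 70 / 130
    }
}
```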
Dining Philosophers Problem (deadlock example)
- Solution 1: put a single lock on access to the table itself
- Solution 2: fix the fork pickup order in advance (left=>right vs right=>left) so the acquisitions cannot form a cycle (see the sketch below)
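A minimal Java sketch of the second approach: every philosopher always picks up the lower-numbered fork first, so a circular wait cannot form (the fork representation is assumed):

```java
// Each philosopher always picks up the lower-numbered fork first, which breaks
// the symmetry needed for a circular wait (and hence for deadlock).
public class DiningPhilosophers {
    public static void main(String[] args) {
        int n = 5;                                   // assumed number of philosophers
        Object[] forks = new Object[n];
        for (int i = 0; i < n; i++) forks[i] = new Object();

        for (int i = 0; i < n; i++) {
            final int left = i, right = (i + 1) % n;
            final Object first  = forks[Math.min(left, right)];  // lower-numbered fork
            final Object second = forks[Math.max(left, right)];
            new Thread(() -> {
                synchronized (first) {
                    synchronized (second) {
                        System.out.println("philosopher " + left + " is eating");
                    }
                }
            }).start();
        }
    }
}
```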
Potential Concurrency Problem
- Deadlock : Two or more threads stop and wait for each other
- Livelock : Two or more threads continue to execute, but make no progress toward the ultimate goal
- Starvation : Some thread gets deferred forever
- Lack of fairness: not every thread gets a turn to make progress
- Race condition
Synchronization
- Coordination of simultaneous events in order to obtain the correct runtime order and avoid unexpected race conditions
Semaphore
- a synchronization object that maintains a count between zero and a specified maximum value
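A minimal Java sketch using java.util.concurrent.Semaphore: at most 3 threads (an assumed limit) hold a permit at the same time:

```java
import java.util.concurrent.Semaphore;

public class SemaphoreDemo {
    public static void main(String[] args) {
        Semaphore permits = new Semaphore(3);        // count starts at 3 (assumed limit)
        for (int i = 0; i < 10; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    permits.acquire();               // decrements the count; blocks while it is 0
                    try {
                        System.out.println("thread " + id + " is using the resource");
                        Thread.sleep(100);
                    } finally {
                        permits.release();           // increments the count
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```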
concurrent hash map
- a hash table supporting full concurrency of retrievals and adjustable expected concurrency for updates
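A minimal Java sketch of ConcurrentHashMap: two threads update shared word counts without an external lock (the input data is assumed):

```java
import java.util.concurrent.ConcurrentHashMap;

public class WordCount {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        String[] words = {"a", "b", "a", "c", "a", "b"};   // assumed input

        Runnable work = () -> {
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);          // atomic per-key update
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(counts);                        // a=6, b=4, c=2 (key order may vary)
    }
}
```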
copy on write arrays
- Supports concurrent reads and writes => because every write hands out a fresh copy of the underlying array
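A minimal Java sketch of CopyOnWriteArrayList: a reader iterates over a snapshot while a writer keeps adding elements (the values are assumed):

```java
import java.util.concurrent.CopyOnWriteArrayList;

public class CowDemo {
    public static void main(String[] args) throws InterruptedException {
        CopyOnWriteArrayList<Integer> list = new CopyOnWriteArrayList<>();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 5; i++) list.add(i);       // each add copies the array
        });
        Thread reader = new Thread(() -> {
            for (Integer x : list) {                       // iterates over one snapshot,
                System.out.println("read " + x);           // never a ConcurrentModificationException
            }
        });
        writer.start(); reader.start();
        writer.join();  reader.join();
        System.out.println("final: " + list);
    }
}
```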
Barrier
- a barrier for a group of threads or processes in the source code means any thread/process must stop at this point and cannot proceed until all other threads/processes reach this barrier
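A minimal Java sketch of a barrier using CyclicBarrier: no worker starts phase 2 until all 3 workers (an assumed count) reach the barrier:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        CyclicBarrier barrier = new CyclicBarrier(3,
                () -> System.out.println("-- all threads reached the barrier --"));
        for (int i = 0; i < 3; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    System.out.println("worker " + id + " finished phase 1");
                    barrier.await();                       // block until all 3 arrive
                    System.out.println("worker " + id + " starts phase 2");
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```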
Hah... I'll just read through the last page.