Multicore Midterm Summary
Multicore Processor
- Multiple cores run multiple instructions at the same time (concurrently)
- Increases overall program speed (performance)
- Speedup is strongly dependent on the software algorithms and their implementation
CUDA
- Compute Unified Device Architecture
- A parallel computing platform created by NVIDIA
GPGPU
- General Purpose Graphics Processing Unit
Parallel Applications
- A huge number of independent tasks makes an application easier to parallelize
Process
- Runs in a separate memory space
Thread
- A part of a process
- Runs in the shared memory space of its process
Multithread Program
- A program run by multiple threads that execute simultaneously
Parallel Computing
- Using multiple processors in parallel to solve problems more quickly than with a single processor
- Goal = performance
- Cluster computer: multiple PCs combined with a high-speed network
- Shared-memory multiprocessor: multiple processors connected to a single memory system
- Chip Multi-Processor (CMP): multiple processors on a single chip
Parallel Programming
- Using multiple processors to solve a single problem quickly
Concurrent Programming
- correctly and efficiently controlling access by multiple threads to shared resources
- Preventing bad interleavings
Parallel Programming Techniques
- Shared memory
- Distributed memory (multiple machines viewed as one computer system)
- Hybrid (shared + distributed)
- GPU parallel programming
Cluster Computing
- a set of loosely connected computers that work together so that in many respects they can be viewed as a single system.
- connected by network, homogeneous, memory not shared
- Good price-to-performance ratio
- Prone to breakdowns (hard to maintain)
Grid Computing
- federation of computer resources from multiple locations to reach a common goal
- a large scale distributed system
- Grids tend to be more loosely coupled, heterogeneous, and geographically dispersed
Cloud Computing
- Wherever we go, any place, any time, we have access to our storage and files
- Shares networked computing resources
- Why use it: building your own servers costs too much
- Cloud: users only pay for what they use
- (the cloud provider already provides everything)
Good Parallel Program
- Correct (results)
- Good performance
- Scalability
- Load balance
- Portability
- Hardware-specific utilization
Distributed System
- Resource Sharing
- Openness
- Concurrency (simply means concurrent execution is possible)
- Scalability
- Fault tolerance
- Transparency (the distribution is invisible to users)
- Heterogeneity
- Network latency
- Distributed coordination
- Data consistency (how to keep data consistent under concurrent execution)
- High start-up cost
The Concurrency Revolution
- Free lunch: programs automatically got faster as clock speeds went up
- Clock speed, execution optimization, cache
- => these improvements are concurrency-agnostic
- Obstacle: clock speed cannot be increased any more
- The performance gains are going to be accomplished in fundamentally different ways, and most current applications will no longer benefit from the free ride without significant redesign.
- new chips : Hyperthreading, multicore, cache
- programming language and systems will increasingly be forced to deal well with concurrency
Moore's Law
- Doubling of the number of transistors on integrated circuits roughly every two years.
Computer Hardware Trend
- Power consumption and heat generation are too high to be tolerated
- No more hidden parallelism to be found
- Transistors number still rising
- Clock speed flattening sharply
- Need multicore programming to utilize the additional transistors
Generic SMP
- Symmetric Multiprocessor
- Single logical memory
- The shared bus is often a bottleneck
Multi-core processors represent an important new trend in computer architecture
- Decreased power consumption and heat generation
- Minimized wire lengths and interconnect latencies.
Why writing parallel programs is hard
- Finding enough parallelism (Amdahl's law)
- granularity
- locality
- load balance
- coordination and synchronization
Amdahl's law
- p = number of processors
- s = fraction of work done sequentially
- speedup(p) = time(1)/time(p) <= 1/(s + (1-s)/p) <= 1/s
- even if the parallel part speeds up perfectly, performance is limited by the sequential part
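A quick sanity check of the bound (Java sketch; the serial fraction s = 0.1 is an assumed example value, not from the lecture):

```java
// Amdahl bound: speedup(p) <= 1 / (s + (1 - s) / p), capped at 1/s.
public class Amdahl {
    static double bound(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    public static void main(String[] args) {
        double s = 0.1;                       // assumed: 10% of the work is sequential
        for (int p : new int[]{2, 4, 8, 16, 1_000_000}) {
            System.out.printf("p=%-8d speedup <= %.2f%n", p, bound(s, p));
        }
        // Even as p grows without bound, the speedup never exceeds 1/s = 10.
    }
}
```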
Overhead of Parallelism
- cost of starting a thread or process
- cost of communicating shared data
- cost of synchronizing
- extra redundant computation
SISD (Single Instruction, Single Data)
SIMD (Single Instruction, Multiple Data) - high degree of regularity, graphics/image processing
MISD (Multiple Instruction, Single Data)
MIMD (Multiple Instruction, Multiple Data)
Creating a Parallel Program
- Decomposition: domain decomposition (block or cyclic partitioning => good locality), functional decomposition (see the sketch after this list)
- => determines the coverage of parallelism in the algorithm
- Assignment: balanced workload, reduced communication cost, reduced synchronization cost
- => determines the granularity of partitioning among processors
- Orchestration: structuring communication and synchronization
- Mapping
- Orchestration & mapping => determine the locality of computation and communication
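A minimal Java sketch of block vs. cyclic domain decomposition of array indices among threads (the array size and thread count are assumed example values):

```java
// Block vs. cyclic decomposition of n array indices among t threads
// (n = 16 and t = 4 are assumed example values).
public class Decomposition {
    public static void main(String[] args) {
        int n = 16, t = 4;

        // Block: thread k gets one contiguous chunk => good spatial locality.
        for (int k = 0; k < t; k++) {
            int lo = k * n / t, hi = (k + 1) * n / t;
            System.out.println("block  thread " + k + ": [" + lo + ", " + hi + ")");
        }

        // Cyclic: thread k gets indices k, k+t, k+2t, ... => better load balance
        // when the work per element varies.
        for (int k = 0; k < t; k++) {
            StringBuilder line = new StringBuilder("cyclic thread " + k + ":");
            for (int i = k; i < n; i += t) line.append(' ').append(i);
            System.out.println(line);
        }
    }
}
```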
Performance Scalability
- the capability of a system to increase total throughput under an increased load when resources are added.
Granularity
- A qualitative measure of the ratio of computation to communication
Fine-grain Parallelism
- low computation to communication ratio
- small amounts of computational work between communication stages
- high communication overhead
Coarse-grain Parallelism
- high computation to communication ratio
- large amount of computational work between communication events
Load Balancing
- distributing approximately equal amounts of work among tasks so that all tasks are kept busy all of the time.
Static load balancing
- low run time overhead
- Works well for homogeneous multicores
Dynamic load balancing
- High run time overhead
- Ideal for codes where the work is uneven or unpredictable, and for heterogeneous multicores
Synchronization
- Coordination of simultaneous events in order to obtain the correct runtime order and avoid unexpected conditions
- Barrier
- lock
- semaphore
Locality
- Placement of data affects performance a lot
- Parallel computation gets serialized due to memory contention and lack of bandwidth
Uniform Memory Access (UMA)
- Centrally located memory (shared)
- Every processor has the same memory access time
Non-Uniform Memory Access (NUMA)
- any processor can access any data
- Physically partitioned memory, logically one memory system
- Placement of data affects performance
Cache Coherence
- the uniformity of shared resource data that ends up stored in multiple local caches
- Problem: when a processor modifies a shared variable in its local cache, different processors may see different values of the variable
- Snooping cache coherence
- => send all requests for data to all processors
- directory-based cache coherence
Shared Memory Architecture
- All processors access all memory as a global address space (UMA, NUMA)
- Advantage: the global address space gives a user-friendly programming perspective; data sharing is fast
- Disadvantage: lack of scalability between memory and CPUs
Distributed Memory Architecture
- Advantage: scalable
- Disadvantage: the programmer is responsible for data communication
UNIX process
- The only way to create a process is fork()
- => the first process is the exception: it is created by the OS
- fork() => the child process is a copy of the parent process, and from that point the two run concurrently
Threads
- Share the code section, data section, and OS resources of the process
- Each thread has its own registers and stack
Multi process vs Multi thread
- Multi-process: no need to worry about concurrent access to variables; expensive to start
- Multi-thread: concurrent access to variables is an issue; cheap to start
join()
- One thread (A) can wait for another thread (B) to end.
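A minimal Java sketch of join(): thread A (here, main) waits for thread B to finish (the names are illustrative):

```java
public class JoinExample {
    public static void main(String[] args) throws InterruptedException {
        // Thread B does some work; thread A (here, main) waits for it to end.
        Thread b = new Thread(() -> System.out.println("B: working"));
        b.start();
        b.join();                                 // A blocks until B terminates
        System.out.println("A: B has finished, continuing");
    }
}
```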
Thread synchronization
- Safety: nothing bad ever happens (no race conditions)
- Liveness: something good eventually happens (no deadlock)
Race condition
- A kind of flaw in hardware or software where the output depends on the timing of execution
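A minimal Java sketch of a race: two threads increment a shared counter without synchronization, so the result depends on the interleaving (the iteration count is an assumed value):

```java
// Two threads increment a shared counter without synchronization.
// count++ is a read-modify-write, so updates can be lost.
public class RaceDemo {
    static int count = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) count++; };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Expected 200000, but the printed value depends on the interleaving.
        System.out.println("count = " + count);
    }
}
```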
Condition variables
- Synchronization primitives that enable threads to wait until a particular condition occurs
- Without condition variables
- => threads need to continually poll to check whether the condition is met
- A condition variable is always used in conjunction with a mutex lock
Producer-Consumer Problem
- The producer must not add data to the buffer when it is full, and the consumer must not remove data from an empty buffer (see the bounded-buffer sketch below)
- wait() : atomically release the lock and block
- notify() : pick one waiting thread and wake it up
- notifyAll() : wakes up all current waiters on the condition variable
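A minimal Java sketch of the bounded buffer described above, using an intrinsic lock with wait()/notifyAll() (the capacity and element type are assumed):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedBuffer {
    private final Deque<Integer> buffer = new ArrayDeque<>();
    private final int capacity = 4;               // assumed buffer size

    public synchronized void put(int item) throws InterruptedException {
        while (buffer.size() == capacity) {
            wait();                               // full: release the lock and block
        }
        buffer.addLast(item);
        notifyAll();                              // wake consumers waiting for data
    }

    public synchronized int take() throws InterruptedException {
        while (buffer.isEmpty()) {
            wait();                               // empty: release the lock and block
        }
        int item = buffer.removeFirst();
        notifyAll();                              // wake producers waiting for space
        return item;
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedBuffer b = new BoundedBuffer();
        new Thread(() -> {
            try {
                for (int i = 0; i < 8; i++) b.put(i);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();
        for (int i = 0; i < 8; i++) System.out.println("consumed " + b.take());
    }
}
```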
Concurrency and Parallelism
- Interleaved Concurrency : logically simultaneous processing
- Parallelism : physically simultaneous processing
Why use Concurrent Programming?
- Natural application structure
- Increased application throughput and responsiveness
- Performance from multiprocessor/multicore hardware
- Distributed system
Synchronization serves two purposes
- Ensure safety for shared updates
- Coordinate the actions of threads
Mutual Exclusion
- Prevents more than one thread from accessing a critical section at a given time
- ensuring that no two processes or threads are in their critical section at the same time
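A minimal Java sketch of mutual exclusion using synchronized, which fixes the lost-update race shown earlier (the class and method names are illustrative):

```java
// The critical section (count++) is guarded by the object's intrinsic lock,
// so at most one thread can execute it at any given time.
public class SafeCounter {
    private int count = 0;

    public synchronized void increment() { count++; }   // critical section
    public synchronized int get()        { return count; }

    public static void main(String[] args) throws InterruptedException {
        SafeCounter c = new SafeCounter();
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) c.increment(); };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(c.get());                     // always 200000
    }
}
```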
Account Transfers
- Nested locks => cycle in the locking graph => deadlock
- Standard solution: a canonical order for acquiring locks (see the sketch below)
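A minimal Java sketch of the canonical-lock-order fix (the Account class and its fields are assumed for illustration):

```java
// Always acquire the two account locks in increasing id order, so no cycle
// can appear in the locking graph and no deadlock is possible.
public class Transfer {
    static class Account {
        final int id;
        int balance;
        Account(int id, int balance) { this.id = id; this.balance = balance; }
    }

    static void transfer(Account from, Account to, int amount) {
        Account first  = from.id < to.id ? from : to;    // lower id is locked first
        Account second = from.id < to.id ? to   : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance   += amount;
            }
        }
    }

    public static void main(String[] args) {
        Account a = new Account(1, 100), b = new Account(2, 100);
        transfer(a, b, 30);
        System.out.println(a.balance + " / " + b.balance);   // 70 / 130
    }
}
```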
Dining Philosophers Problem (deadlock example)
- Solution 1: put a single lock on access to the table itself
- Solution 2: fix the fork pickup order in advance (left=>right vs right=>left) so the acquisitions cannot form a cycle (see the sketch below)
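A minimal Java sketch of the second approach: every philosopher always picks up the lower-numbered fork first, so a circular wait cannot form (the fork representation is assumed):

```java
// Each philosopher always picks up the lower-numbered fork first, which breaks
// the symmetry needed for a circular wait (and hence for deadlock).
public class DiningPhilosophers {
    public static void main(String[] args) {
        int n = 5;                                   // assumed number of philosophers
        Object[] forks = new Object[n];
        for (int i = 0; i < n; i++) forks[i] = new Object();

        for (int i = 0; i < n; i++) {
            final int left = i, right = (i + 1) % n;
            final Object first  = forks[Math.min(left, right)];  // lower-numbered fork
            final Object second = forks[Math.max(left, right)];
            new Thread(() -> {
                synchronized (first) {
                    synchronized (second) {
                        System.out.println("philosopher " + left + " is eating");
                    }
                }
            }).start();
        }
    }
}
```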
Potential Concurrency Problem
- Deadlock : Two or more threads stop and wait for each other
- Livelock : Two or more threads continue to execute, but make no progress toward the ultimate goal
- Starvation : Some thread gets deferred forever
- Lack of fairness: not every thread gets a turn to make progress
- Race condition
Synchronization
- Coordination of simultaneous events in order to obtain the correct runtime order and avoid unexpected race conditions
Semaphore
- a synchronization object that maintains a count between zero and a specified maximum value
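A minimal Java sketch using java.util.concurrent.Semaphore: at most 3 threads (an assumed limit) hold a permit at the same time:

```java
import java.util.concurrent.Semaphore;

public class SemaphoreDemo {
    public static void main(String[] args) {
        Semaphore permits = new Semaphore(3);        // count starts at 3 (assumed limit)
        for (int i = 0; i < 10; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    permits.acquire();               // decrements the count; blocks while it is 0
                    try {
                        System.out.println("thread " + id + " is using the resource");
                        Thread.sleep(100);
                    } finally {
                        permits.release();           // increments the count
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```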
concurrent hash map
- a hash table supporting full concurrency of retrievals and adjustable expected concurrency for updates
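A minimal Java sketch of ConcurrentHashMap: two threads update shared word counts without an external lock (the input data is assumed):

```java
import java.util.concurrent.ConcurrentHashMap;

public class WordCount {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        String[] words = {"a", "b", "a", "c", "a", "b"};   // assumed input

        Runnable work = () -> {
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);          // atomic per-key update
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(counts);                        // a=6, b=4, c=2 (key order may vary)
    }
}
```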
copy on write arrays
- Supports concurrent reads and writes => because every write hands out a fresh copy of the underlying array
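A minimal Java sketch of CopyOnWriteArrayList: a reader iterates over a snapshot while a writer keeps adding elements (the values are assumed):

```java
import java.util.concurrent.CopyOnWriteArrayList;

public class CowDemo {
    public static void main(String[] args) throws InterruptedException {
        CopyOnWriteArrayList<Integer> list = new CopyOnWriteArrayList<>();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 5; i++) list.add(i);       // each add copies the array
        });
        Thread reader = new Thread(() -> {
            for (Integer x : list) {                       // iterates over one snapshot,
                System.out.println("read " + x);           // never a ConcurrentModificationException
            }
        });
        writer.start(); reader.start();
        writer.join();  reader.join();
        System.out.println("final: " + list);
    }
}
```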
Barrier
- a barrier for a group of threads or processes in the source code means any thread/process must stop at this point and cannot proceed until all other threads/processes reach this barrier
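A minimal Java sketch of a barrier using CyclicBarrier: no worker starts phase 2 until all 3 workers (an assumed count) reach the barrier:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        CyclicBarrier barrier = new CyclicBarrier(3,
                () -> System.out.println("-- all threads reached the barrier --"));
        for (int i = 0; i < 3; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    System.out.println("worker " + id + " finished phase 1");
                    barrier.await();                       // block until all 3 arrive
                    System.out.println("worker " + id + " starts phase 2");
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```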
Hah... I'll just read through the last page.