
Multicore Computing Midterm Summary

nongdamgom 2024. 4. 21. 20:41

Multicore Processor

  • Multiple cores run multiple instructions at the same time (concurrently)
  • Increases overall program speed (performance)
  • Performance is strongly dependent on the software algorithms and their implementation

CUDA

  • Compute Unified Device Architecture
  • A parallel computing platform made by NVIDIA

GPGPU

  • General Purpose Graphics Processing Unit

Parallel Applications

  • A huge amount of independent tasks makes an application easier to parallelize

Process

  • Runs in a separate memory space

Thread

  • A part of a process
  • Runs in the shared memory space of its process

Multithread Program

  • A program that runs multiple threads executing simultaneously (see the sketch below)
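
A minimal sketch (plain Java, class and variable names are mine for illustration): two threads started from one process share the same heap object, while each thread keeps its own local variables on its own stack.

```java
public class SharedCounterDemo {
    // Heap object shared by all threads in this process
    static int[] shared = new int[1];

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            int local = 0;                 // local variable lives on each thread's own stack
            for (int i = 0; i < 1000; i++) local++;
            synchronized (shared) {        // shared data needs synchronization
                shared[0] += local;
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("shared total = " + shared[0]);  // 2000
    }
}
```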

Parallel Computing

  • using multiple processors in parallel to solve problems more quickly than a single processor
  • goal = performance
  • Cluster computer: multiple PCs combined with a high-speed network
  • Shared memory multiprocessor: multiple processors connected to a single memory system
  • Chip Multi-Processor (CMP): multiple processors on a single chip

Parallel Programming

  • Using multiple processors to solve one single problem quickly

Concurrent Programming

  • correctly and efficiently controlling access by multiple threads to shared resources
  • preventing a bad interleaving

Parallel Programming Techniques

  • Shared memory
  • Distributed memory (multiple machines viewed as one computer system)
  • hybrid
  • GPU parallel programming

Cluster Computing

  • A set of loosely connected computers that work together so that in many respects they can be viewed as a single system
  • Connected by a network, homogeneous, memory not shared
  • Good price-to-performance ratio
  • Breaks down easily (high maintenance burden)

Grid Computing

  • federation of computer resources from multiple locations to reach a common goal 
  • a large scale distributed system
  • grids tend to be more loosely coupled and heterogeneous, geographically dispersed

Cloud Computing

  • Access to our storage and files anywhere, any time
  • Shares networked computing resources
  • Why use it: building your own servers costs too much
  • Cloud: users only pay for what they use
  • (the cloud provider already provides everything)

 

Good Parallel Program

  • Correctness (correct results)
  • Good performance
  • Scalability
  • Load balance
  • Portability
  • Hardware-specific utilization

Distributed System

  • Resource Sharing
  • Openness
  • Concurrency (simply means things can run at the same time)
  • Scalability
  • Fault tolerance
  • Transparency(invisible)
  • Heterogeneity
  • Network latency
  • Distributed coordination
  • Data consistency (how to keep data consistent under concurrent execution)
  • high start up cost

The Concurrency Revolution

  • Free lunch: programs automatically sped up as clock speeds went up
  • Clock speed, execution optimization, cache
  • => concurrency-agnostic
  • Obstacle: clock speed could not be increased any further
  • The performance gains are going to be accomplished in fundamentally different ways, and most current applications will no longer benefit from the free ride without significant redesign.
  • new chips : Hyperthreading, multicore, cache
  • programming language and systems will increasingly be forced to deal well with concurrency

Moore's Law

  • Doubling of the number of transistors on integrated circuits roughly every two years.

Computer Hardware Trend

  • Power consumption and heat generation are too high to be tolerated
  • No more hidden parallelism to be found
  • Transistor count still rising
  • Clock speed flattening sharply
  • Need multicore programming to utilize the additional transistors

Generic SMP

  • Symmetric Multiprocessor
  • Single logical memory
  • Shared bus often bottleneck

Multi-core processors represent an important new trend in computer architecture

  • Decreased power consumption and heat generation
  • Minimized wire lengths and interconnect latencies.

Why writing parallel programs is hard

  • Finding enough parallelism(Amdahl's law)
  • granularity
  • locality
  • load balance
  • coordination and synchronization

Amdahl's law

  • p = number of processors
  • s = fraction of work done sequentially
  • speedup(p) = time(1)/time(p) <= 1/(s + (1-s)/p) <= 1/s
  • even if the parallel part speeds up perfectly, performance is limited by the sequential part
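
A worked example (my numbers, just for illustration): if s = 0.1, i.e. 10% of the work is sequential, and p = 16 processors, then

speedup(16) <= 1 / (0.1 + 0.9/16) ≈ 6.4

and no matter how many processors are added, the speedup can never exceed 1/s = 10.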

 

Overhead of Parallelism

  • cost of starting a thread or process
  • cost of communicating shared data
  • cost of synchronizing
  • extra redundant computation

 

SISD (Single Instruction, Single Data)

SIMD (Single Instruction, Multiple Data) - high degree of regularity, graphics/image processing

MISD (Multiple Instruction, Single Data)

MIMD (Multiple Instruction, Multiple Data)

 

Creating a Parallel Program

  • Decomposition - domain decomposition (block, cyclic => good locality), functional decomposition (see the sketch after this list)
  • => coverage of parallelism in the algorithm
  • Assignment - balanced workload, reduced communication cost, reduced synchronization cost
  • => granularity of partitioning among processors
  • Orchestration - structuring communication and synchronization
  • Mapping
  • Orchestration & mapping => locality of computation and communication
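
A minimal sketch of domain decomposition (hypothetical array-sum example in Java): block decomposition gives each thread one contiguous chunk, while cyclic decomposition strides iterations across threads; each thread writes only its own partial result, so no locking is needed.

```java
import java.util.stream.IntStream;

public class Decomposition {
    public static void main(String[] args) throws InterruptedException {
        double[] data = IntStream.range(0, 1_000_000).mapToDouble(i -> 1.0).toArray();
        int p = 4;                                   // number of worker threads
        double[] partial = new double[p];
        Thread[] workers = new Thread[p];

        for (int t = 0; t < p; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                // Block decomposition: thread `id` sums one contiguous chunk
                int chunk = data.length / p;
                int lo = id * chunk;
                int hi = (id == p - 1) ? data.length : lo + chunk;
                double sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                // Cyclic decomposition would instead be:
                // for (int i = id; i < data.length; i += p) sum += data[i];
                partial[id] = sum;                   // each thread writes its own slot => no lock
            });
            workers[t].start();
        }
        double total = 0;
        for (int t = 0; t < p; t++) { workers[t].join(); total += partial[t]; }
        System.out.println("total = " + total);
    }
}
```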

 

Performance Scalability

  • the capability of a system to increase total throughput under an increased load when resources are added.

Granularity

  • A qualitative measure of the ratio of computation to communication

Fine-grain Parallelism

  • low computation to communication ratio
  • small amounts of computational work between communication stages
  • high communication overhead

Coarse-grain Parallelism

  • high computation to communication ratio
  • large amount of computational work between communication events

Load Balancing

  • distributing approximately equal amounts of work among tasks so that all tasks are kept busy all of the time.

Static load balancing

  • low run time overhead
  • Works well for homogeneous multicores

Dynamic load balancing

  • High run time overhead
  • Ideal for codes where work is uneven or unpredictable, and for heterogeneous multicores

Synchronization 

  • coordination of simultaneous events in order to obtain correct runtime order and avoid unexpected conditions
  • Barrier
  • lock
  • semaphore

Locality

  • Placement of data affects performance a lot
  • Parallel computation gets serialized due to memory contention and lack of bandwidth

 

Uniform Memory Access (UMA)

  • Centrally located memory (shared)
  • All processors have the same memory access time

 

Non-Uniform Memory Access (NUMA)

  • any processor can access any data
  • physically partition memory, logically one memory system
  • placement of data affects performance

Cache Coherence

  • the uniformity of shared resource data that ends up stored in multiple local caches
  • problem : when a processor modifies a shared variable in its local cache, different processors may see different values of the variable
  • Snooping cache coherence
  • => send all requests for data to all processors
  • Directory-based cache coherence => keep per-line sharing information in a directory and send requests only to the caches that hold the line

Shared Memory Architecture

  • All processors access all memory as a global address space (UMA, NUMA)
  • adv : the global address space is user-friendly for programming, and data sharing is fast
  • disadv : lack of scalability between memory and CPUs

Distributed Memory Architecture

  • adv : scalable
  • disadv : the programmer is responsible for data communication

 

UNIX process

  • The only way to create a process is fork()
  • => the very first process is an exception, created by the OS
  • fork() => the child process is a copy of the parent process; from that point on they run concurrently

Threads

  • Threads share the code section, data section, and OS resources
  • Each thread has its own registers and stack

Multi process vs Multi thread

  • multi process : don't have to worry about concurrent access to variables, but expensive to start
  • multi thread : concurrent access to variables is an issue, but cheap to start

join()

  • One thread (A) can wait for another thread (B) to end (see the sketch below)
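
A minimal sketch (Java, names are illustrative): the main thread A calls join() on thread B, so A blocks until B has finished.

```java
public class JoinDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread b = new Thread(() -> System.out.println("B: doing work"));
        b.start();
        b.join();                                  // A (main) waits here until B ends
        System.out.println("A: B has finished, continuing");
    }
}
```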

Thread synchronization

  • safety - nothing bad ever happens, no race condition
  • liveness - no deadlock

Race condition

  • A kind of flaw in hardware or software where the output depends on the timing of execution (see the sketch below)
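
A minimal race-condition sketch (illustrative counter in Java): two threads do count++ without synchronization, so the final value depends on how the threads interleave; wrapping the increment in a synchronized block removes the race.

```java
public class RaceDemo {
    static int count = 0;                          // shared, unsynchronized

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                count++;                           // read-modify-write: not atomic => race
                // Fix: synchronized (RaceDemo.class) { count++; }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(count);                 // often less than 200000
    }
}
```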

Condition variables

  • synchronization primitives that enable threads to wait until a particular condition occurs
  • without condition variables
  • => threads would need to continually poll to check whether the condition is met
  • a condition variable is always used in conjunction with a mutex lock
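
A minimal sketch using Java's explicit condition variables (java.util.concurrent.locks.Condition); the flag and method names are mine for illustration. The condition is always used while holding its associated lock, and await() releases that lock while blocked, so no busy polling is needed.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class ConditionDemo {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition ready = lock.newCondition();
    private boolean done = false;

    void waitUntilDone() throws InterruptedException {
        lock.lock();
        try {
            while (!done) ready.await();   // releases the lock while blocked
        } finally { lock.unlock(); }
    }

    void markDone() {
        lock.lock();
        try {
            done = true;
            ready.signalAll();             // wake up all waiting threads
        } finally { lock.unlock(); }
    }
}
```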

 

Producer-Consumer Problem

  • The producer won't try to add data into the buffer if it's full, and the consumer won't try to remove data from an empty buffer (see the sketch below)
  • wait() : atomically release the lock and block
  • notify() : pick one waiting thread and wake it up
  • notifyAll() : wakes up all current waiters on the condition variable
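
A minimal bounded-buffer sketch using Java's intrinsic monitor (synchronized + wait/notifyAll); the class name and capacity are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedBuffer<T> {
    private final Deque<T> buffer = new ArrayDeque<>();
    private final int capacity;

    public BoundedBuffer(int capacity) { this.capacity = capacity; }

    public synchronized void put(T item) throws InterruptedException {
        while (buffer.size() == capacity) wait();   // producer blocks while the buffer is full
        buffer.addLast(item);
        notifyAll();                                // wake consumers waiting on "not empty"
    }

    public synchronized T take() throws InterruptedException {
        while (buffer.isEmpty()) wait();            // consumer blocks while the buffer is empty
        T item = buffer.removeFirst();
        notifyAll();                                // wake producers waiting on "not full"
        return item;
    }
}
```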

 

Concurrency and Parallelism

  • Interleaved Concurrency : logically simultaneous processing
  • Parallelism : physically simultaneous processing

Why use Concurrent Programming?

  • Natural application structure
  • Increased application throughput and responsiveness
  • Performance from multiprocessor/multicore hardware
  • Distributed system

Synchronization serve two purposes

  • ensure safety for shared updates
  • coordinate actions of threads

Mutual Exclusion

  • prevent more than one thread from accessing a critical section at a given time
  • ensuring that no two processes or threads are in their critical section at the same time

Account Transfers

  • nested locks => cycle in the locking graph => deadlock
  • standard solution : a canonical order for locks (see the sketch below)
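
A minimal sketch of the canonical-order idea (the Account class and its id field are mine for illustration): always acquire the lock of the lower-id account first, so two concurrent transfers can never hold each other's lock and form a cycle.

```java
public class Account {
    private final long id;
    private long balance;

    public Account(long id, long balance) { this.id = id; this.balance = balance; }

    public static void transfer(Account from, Account to, long amount) {
        // Canonical order: always lock the account with the smaller id first
        Account first  = from.id < to.id ? from : to;
        Account second = from.id < to.id ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance   += amount;
            }
        }
    }
}
```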

Dining Philosophers Problem (deadlock example)

  • Solution 1: put a single lock on access to the table itself
  • Solution 2: pre-assign the pickup order (left=>right vs right=>left) so the lock acquisitions don't form a cycle

Potential Concurrency Problem

  • Deadlock : Two or more threads stop and wait for each other
  • Livelock : Two or more threads continue to execute, but make no progress toward the ultimate goal
  • Starvation : Some thread gets deferred forever
  • Lack of fairness : not every thread gets a turn to make progress
  • Race condition

Synchronization

  • coordination of simultaneous events in order to obtain correct runtime order and avoid unexpected race conditions

 

Semaphore

  • a synchronization object that maintains a count between zero and a specified maximum value
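
A minimal sketch using java.util.concurrent.Semaphore (the permit count of 3 is my example): at most 3 threads use the resource at the same time.

```java
import java.util.concurrent.Semaphore;

public class SemaphoreDemo {
    private static final Semaphore permits = new Semaphore(3);   // count between 0 and 3

    static void useResource(int id) throws InterruptedException {
        permits.acquire();                 // decrement; block if the count is already 0
        try {
            System.out.println("thread " + id + " is using the resource");
        } finally {
            permits.release();             // increment; wake one waiting thread
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            final int id = i;
            new Thread(() -> {
                try { useResource(id); } catch (InterruptedException ignored) { }
            }).start();
        }
    }
}
```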

concurrent hash map

  • a hash table supporting full concurrency of retrievals and adjustable expected concurrency for updates
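
A small usage sketch with java.util.concurrent.ConcurrentHashMap (the key and counts are my example): reads don't block, and updates like merge() are atomic per key.

```java
import java.util.concurrent.ConcurrentHashMap;

public class WordCount {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        Runnable work = () -> {
            for (int i = 0; i < 1000; i++) {
                counts.merge("word", 1, Integer::sum);   // atomic per-key update
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counts.get("word"));          // 2000
    }
}
```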

copy on write arrays

  • Supports concurrent reads and writes => because each write creates a fresh copy of the array (see the sketch below)
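
A small usage sketch with java.util.concurrent.CopyOnWriteArrayList: every write copies the underlying array, so an iterator keeps reading the snapshot it started with even while writes happen during iteration.

```java
import java.util.concurrent.CopyOnWriteArrayList;

public class CowDemo {
    public static void main(String[] args) {
        CopyOnWriteArrayList<String> list = new CopyOnWriteArrayList<>();
        list.add("a");
        list.add("b");

        for (String s : list) {            // iterates over the snapshot taken here
            list.add(s + "!");             // write makes a new copy; no ConcurrentModificationException
            System.out.println(s);         // prints only "a" and "b"
        }
        System.out.println(list);          // [a, b, a!, b!]
    }
}
```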

Barrier

  • a barrier for a group of threads or processes in the source code means any thread/process must stop at this point and cannot proceed until all other threads/processes reach this barrier
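
A minimal sketch using java.util.concurrent.CyclicBarrier (the party count of 3 is my example): each thread finishes phase 1 and then waits at the barrier until all 3 threads have arrived before starting phase 2.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        CyclicBarrier barrier = new CyclicBarrier(3, () -> System.out.println("all reached the barrier"));
        for (int i = 0; i < 3; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    System.out.println("thread " + id + " finished phase 1");
                    barrier.await();       // block until all 3 threads reach this point
                    System.out.println("thread " + id + " starts phase 2");
                } catch (InterruptedException | BrokenBarrierException ignored) { }
            }).start();
        }
    }
}
```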


Ugh.... I'll just read the last page as-is............