hpc @ aub day 2

the message passing interface (MPI)

acknowledgements

overview

today's lecture - the message passing interface (MPI)

the free lunch is over

media/free_lunch.png

the multiple forms of parallelism

the rise of manycore

media/concurrency.png

parallel architectures

parallel programming paradigms

  • CUDA assumes large register files, single instruction, multiple thread (SIMT) parallelism, and a mostly flat, structured memory model, matching the underlying GPU hardware

  • OpenMP exposes loop-level parallelism with a fork/join model, and assumes the presence of shared memory and atomics

  • OpenCL tries to generalize CUDA, but still assumes a 'coprocessor' approach, where kernels are shipped from a master processor to worker cores

the message passing model

why MPI?

communicators

datatypes

tags

checkpoint

if tags allow us to screen/separate point-to-point messages, why do we need communicators?

mpi basic (blocking) send

C

int MPI_Send(void* buf, int count, MPI_Datatype type,
int dest, int tag, MPI_Comm comm)

Python (mpi4py)

Comm.Send(self, buf, int dest=0, int tag=0)
Comm.send(self, obj=None, int dest=0, int tag=0)

mpi basic (blocking) recv

C

int MPI_Recv(void* buf, int count, MPI_Datatype type,
int source, int tag, MPI_Comm comm, MPI_Status *status)

Python (mpi4py)

Comm.Recv(self, buf, int source=0, int tag=0,
Status status=None)
Comm.recv(self, obj=None, int source=0,
int tag=0, Status status=None)
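
A minimal sketch of a blocking exchange, using the lowercase (pickle-based) `send`/`recv` calls listed above. The helper name `ping`, the payload, and the tag value 11 are illustrative assumptions, and the `ImportError` fallback exists only so the file can be loaded on a machine without MPI.

```python
try:
    from mpi4py import MPI  # needs an MPI library plus mpi4py
except ImportError:
    MPI = None  # assumption: let the sketch load where MPI is not installed


def ping(comm, payload, tag=11):
    """Rank 0 sends payload to rank 1; rank 1 returns what it received."""
    rank = comm.Get_rank()
    size = comm.Get_size()
    if size < 2:
        return None  # a single process has nobody to talk to
    if rank == 0:
        comm.send(payload, dest=1, tag=tag)  # blocking send
        return None
    if rank == 1:
        return comm.recv(source=0, tag=tag)  # blocks until the message arrives
    return None  # any other ranks do not take part


if MPI is not None:
    comm = MPI.COMM_WORLD
    received = ping(comm, {"step": 1, "values": [1.0, 2.0]})
    if received is not None:
        print(f"rank {comm.Get_rank()} received {received}")
```

Launched with two or more processes (for example `mpirun -n 2 python ping.py`, where the file name is hypothetical), only rank 1 prints. The tag on the receive must match the tag on the send, otherwise the receive keeps waiting.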

synchronization

C

int MPI_Barrier(MPI_Comm comm)

Python (mpi4py)

Comm.Barrier(self)
Comm.barrier(self)

broadcast/reduce

C

int MPI_Bcast(void *buf, int count, MPI_Datatype type,
int root, MPI_Comm comm)

Python (mpi4py)

Comm.Bcast(self, buf, int root=0)
Comm.bcast(self, obj=None, int root=0)
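
As a rough illustration of the two collectives named in the slide title, the sketch below broadcasts a small settings object from the root and then reduces one number per rank back to it. The function name `broadcast_then_sum` and the `scale` setting are made up for the example. It relies on mpi4py's lowercase `reduce` defaulting to a sum, and the `ImportError` fallback is only there so the file loads without MPI.

```python
try:
    from mpi4py import MPI  # needs an MPI library plus mpi4py
except ImportError:
    MPI = None  # assumption: let the sketch load where MPI is not installed


def broadcast_then_sum(comm, root=0):
    """Broadcast settings from root, then sum one contribution per rank."""
    rank = comm.Get_rank()
    # before the broadcast only the root holds the settings
    settings = {"scale": 10} if rank == root else None
    settings = comm.bcast(settings, root=root)  # now every rank has a copy
    local = settings["scale"] * (rank + 1)      # per-rank contribution
    total = comm.reduce(local, root=root)       # default op is SUM; None off-root
    return settings, total


if MPI is not None:
    comm = MPI.COMM_WORLD
    settings, total = broadcast_then_sum(comm)
    if comm.Get_rank() == 0:
        print(f"settings = {settings}, total = {total}")
```

Both calls are collective: every rank in the communicator has to call them, in the same order, or the program hangs.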

collective data movement

media/broadcast_scatter_gather.png

media/allgather_alltoall.png

media/reduce_scan.png

understanding performance

$$T = \frac{T_p}{p} + T_s + T_c $$

  • T_c = communication overhead
  • T_s = serial (non-parallelizable work)
  • T_p = parallel work
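
To make the model concrete, here is a small sketch that evaluates it for a few process counts. The timings (90 s parallel, 10 s serial, 2 s communication) are invented for illustration. The speedup helper assumes the one-process baseline pays no communication cost, which the formula itself does not state.

```python
def run_time(t_p, t_s, t_c, p):
    """Model run time T = T_p / p + T_s + T_c on p processes."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return t_p / p + t_s + t_c


def speedup(t_p, t_s, t_c, p):
    """Speedup over one process.

    Assumption: the single-process baseline is T_p + T_s,
    i.e. it has no communication overhead.
    """
    baseline = t_p + t_s
    return baseline / run_time(t_p, t_s, t_c, p)


# illustrative numbers only
for p in (1, 2, 4, 8, 64):
    t = run_time(90.0, 10.0, 2.0, p)
    s = speedup(90.0, 10.0, 2.0, p)
    print(f"p = {p:3d}  T = {t:7.2f} s  speedup = {s:5.2f}")
```

As p grows, the run time approaches T_s + T_c, so with these numbers the speedup can never exceed 100 / 12, about 8.3, however many processes are added.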

latency and bandwidth

timing and profiling

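the elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime, which returns the current time in seconds; the difference between two readings is the elapsed interval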
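
A minimal sketch of this timing pattern in mpi4py, where `MPI.Wtime()` wraps `MPI_Wtime`. The helper name `timed` and the toy workload are assumptions for illustration. The fallback to `time.perf_counter` is not part of MPI; it only lets the sketch run where mpi4py is not installed.

```python
import time

try:
    from mpi4py import MPI
    wtime = MPI.Wtime          # wall-clock seconds, wraps MPI_Wtime
except ImportError:
    wtime = time.perf_counter  # assumption: stand-in clock when MPI is missing


def timed(work, clock=wtime):
    """Run work() and return (result, elapsed seconds)."""
    start = clock()            # first reading
    result = work()
    elapsed = clock() - start  # second reading minus the first
    return result, elapsed


# toy workload standing in for a real compute or communication phase
result, elapsed = timed(lambda: sum(i * i for i in range(100_000)))
print(f"sum = {result}, elapsed = {elapsed:.6f} s")
```

Each rank measures its own interval. In a real run it is common to place a barrier before the first reading and to report the maximum over ranks, since the slowest rank determines the overall time.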

one-sided communication

why use remote memory access (RMA)?

overview of one-sided api

RMA functions for synchronization

multiple ways to synchronize

mpi i/o

why not use mpi i/o

some performance tuning thoughts

challenges for exascale

Credits