the message passing interface (MPI)
- CUDA assumes large register files, single-instruction, multiple-thread (SIMT) parallelism, and a mostly flat, structured memory model, matching the underlying GPU hardware
- OpenMP exposes loop-level parallelism with a fork/join model and assumes the presence of shared memory and atomics
- OpenCL tries to generalize CUDA, but still assumes a 'coprocessor' approach, where kernels are shipped from a master processor to worker cores
if tags allow us to screen/separate peer-to-peer messages, why do we need communicators?
C
int MPI_Send(void* buf, int count, MPI_Datatype type,
int dest, int tag, MPI_Comm comm)
Python (mpi4py)
Comm.Send(self, buf, int dest=0, int tag=0)
Comm.send(self, obj=None, int dest=0, int tag=0)
C
int MPI_Recv(void* buf, int count, MPI_Datatype type,
int source, int tag, MPI_Comm comm, MPI_Status *status)
Python (mpi4py)
Comm.Recv(self, buf, int source=0, int tag=0,
Status status=None)
Comm.recv(self, obj=None, int source=0,
int tag=0, Status status=None)
C
int MPI_Barrier(MPI_Comm comm)
Python (mpi4py)
Comm.Barrier(self)
Comm.barrier(self)
C
int MPI_Bcast(void *buf, int count, MPI_Datatype type,
int root, MPI_Comm comm)
Python (mpi4py)
Comm.Bcast(self, buf, int root=0)
Comm.bcast(self, obj=None, int root=0)
$$T = \frac{T_p}{p} + T_s + T_c $$
- T_c = communication overhead
- T_s = serial (non-parallelizable) work
- T_p = parallel work
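The model above can be explored numerically. The following is a minimal sketch in plain Python (no MPI required); the function names `predicted_time` and `speedup` and the sample timings are illustrative assumptions, not measurements, and the communication overhead is treated as a constant even though in practice it usually depends on p.

```python
def predicted_time(t_p, t_s, t_c, p):
    """Predicted run time T = T_p / p + T_s + T_c on p processes."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return t_p / p + t_s + t_c


def speedup(t_p, t_s, t_c, p):
    """Speedup relative to one process that pays no communication cost."""
    return predicted_time(t_p, t_s, 0.0, 1) / predicted_time(t_p, t_s, t_c, p)


if __name__ == "__main__":
    # Made-up numbers: 100 s parallel work, 5 s serial work, 1 s communication.
    for p in (1, 2, 4, 8, 16):
        t = predicted_time(100.0, 5.0, 1.0, p)
        s = speedup(100.0, 5.0, 1.0, p)
        print("p=%2d  T=%7.2f s  speedup=%5.2f" % (p, t, s))
```

As p grows, the T_p / p term shrinks while T_s and T_c remain, so the predicted speedup levels off.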
the elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime:
C
double t1, t2;
t1 = MPI_Wtime();
/* ... code to be timed ... */
t2 = MPI_Wtime();
printf("time elapsed is: %e s\n", t2-t1);
Python (mpi4py)
t1 = MPI.Wtime()
# ... code to be timed ...
t2 = MPI.Wtime()
print("time elapsed is: %e s" % (t2-t1))
MPI_Win_create exposes local memory to remote memory access (RMA) operations by other processes in a communicator
- collective operation
- create window object
MPI_Put moves data from local memory to remote memory
MPI_Get retrieves data from remote memory into local memory
MPI_Accumulate updates remote memory using local values
data movement operations are non-blocking
synchronization on the window object is still needed to ensure an operation is complete
multiple ways to synchronize: fence (MPI_Win_fence), post/start/complete/wait, and lock/unlock