
Parallelism



MPI

Here we present some aspects of the MPI-3.1 standard. The newest version is currently 4.0; however, only a few features of the new standard are widely available yet.

Asynchronous communications

Asynchronous communications allow the user to start a communication and then come back later to check for completeness. Doing so is convenient to ‘hide’ the communications behind computations or other tasks.

Requests

An MPI_Request is used to track the status of asynchronous communications.

Request progress

A call to MPI_Wait returns when the operation identified by the request is completed. If the request is a persistent one it becomes inactive, otherwise the request is deallocated and set to MPI_REQUEST_NULL.

A call to MPI_Test returns flag=true if the operation associated with the request is completed. If the request is a persistent one it becomes inactive, otherwise the request is deallocated and the handle set to MPI_REQUEST_NULL.

A persistent request must be freed using MPI_Request_free, as it is not automatically deallocated by MPI_Wait/MPI_Test. It is also permitted to free an ongoing request; however, its completion can then no longer be checked.
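As a minimal sketch (assuming at least two ranks exchanging a single double via the non-blocking calls introduced below), a request can be polled with MPI_Test and then completed with MPI_Wait:

```c
/* Minimal sketch: poll a request with MPI_Test, then complete it with
 * MPI_Wait. Assumes at least two ranks exchanging a single double
 * (illustrative values only). */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 0.0;
    MPI_Request req = MPI_REQUEST_NULL;

    if (rank == 0) {
        value = 3.14;
        MPI_Isend(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* Poll: flag becomes true once the operation tied to the request is
     * done; the (non-persistent) request is then deallocated and the
     * handle is set to MPI_REQUEST_NULL. */
    int flag = 0;
    while (!flag) {
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        /* ... other useful work could be done here ... */
    }

    /* MPI_Wait on MPI_REQUEST_NULL returns immediately, so this call is
     * safe and only illustrates the blocking variant. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```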

Non-blocking calls

Every blocking point-to-point MPI call (and, since MPI-3, every collective) has a corresponding non-blocking version, prefixed with an I: Alltoall becomes Ialltoall, Send becomes Isend, Recv becomes Irecv, etc.

The function takes an additional request argument, which becomes active when the call returns (not when the communication completes). The status of the request and the completion of the communication can then be queried or enforced using the MPI_Test variants or MPI_Wait.
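A sketch of the typical pattern, overlapping a ring exchange with some independent computation (the buffer size and the neighbour layout are made up for illustration):

```c
/* Sketch: overlap non-blocking point-to-point communication with
 * computation. Neighbour ranks and buffer size are illustrative. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double send_buf[N], recv_buf[N];
    for (int i = 0; i < N; ++i) send_buf[i] = (double)rank;

    int right = (rank + 1) % size;        /* neighbours in a ring */
    int left  = (rank - 1 + size) % size;

    MPI_Request reqs[2];

    /* The requests become active when the calls return, not when the
     * communication has completed. */
    MPI_Irecv(recv_buf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Computation that does not touch the buffers can proceed here,
     * 'hiding' the communication. */
    double local = 0.0;
    for (int i = 0; i < N; ++i) local += (double)i;

    /* Enforce completion of both operations before using recv_buf. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```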

Persistent calls

When the communication pattern is fixed, there is an advantage in using persistent calls to avoid the overhead of setting up the communication every time. The persistent calls are MPI_Send_init and MPI_Recv_init, where a request is initialized with the needed information. The communication itself can then be started using the MPI_Start functions, and completion is checked as for the non-blocking calls with the Test or Wait functions.
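A sketch of the persistent variant of the same ring exchange (the iteration count and buffer size are again made up): the requests are created once, restarted every iteration with MPI_Startall, and freed explicitly at the end.

```c
/* Sketch: persistent communication. Requests are set up once with
 * MPI_Send_init/MPI_Recv_init, (re)started each iteration with
 * MPI_Startall, and released with MPI_Request_free. */
#include <mpi.h>

#define N     1024
#define NITER 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double send_buf[N], recv_buf[N];
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    MPI_Request reqs[2];

    /* Create the persistent requests once: no communication happens yet. */
    MPI_Recv_init(recv_buf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Send_init(send_buf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int it = 0; it < NITER; ++it) {
        for (int i = 0; i < N; ++i) send_buf[i] = (double)(rank + it);

        MPI_Startall(2, reqs);                     /* start the exchange */
        /* ... independent computation could go here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* requests become
                                                      inactive, not freed */
    }

    /* Persistent requests must be freed explicitly. */
    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);

    MPI_Finalize();
    return 0;
}
```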


UCX

Here is some information I have found useful about UCX. Be mindful that this might not be the most accurate or up-to-date information.

UCT vs UCP

UCX has two levels of design: UC-P (P = protocols) and UC-T (T = transport).

The UCT level itself relies on various low-level transport APIs, depending on the hardware. It also supports different communication modes, such as RC (Reliable Connection) and UD (Unreliable Datagram). RC is a reliable, connected mode of communication where a send matches a receive. However, it requires $N^2$ memory to connect $N$ endpoints together (every pair has a send and a receive queue). On the other hand, UD can only transfer one MTU at a time and implements only two-sided communication. Its latency is therefore higher and its bandwidth smaller than RC, but it has a very low memory consumption. Broadly speaking, RC is costly yet fast, and UD is cheap yet slow. In an attempt to get the best of both worlds, Mellanox introduced DC (Dynamically Connected), which reproduces RC but dynamically adapts the memory to the communication needs.

To get an overview of all the available UCTs, one can use ucx_info -d. Similarly, the command ucx_info -c will list the associated configuration variables.

The UCP layer of UCX dynamically chooses and handles the best UCT for the communication to perform. The choice of UCT is based on parameter values and is described on the website. In short, RC is preferred at small scale, while DC (or UD if DC is not available) is used at larger scales.

On top of these main directions, specific implementations of each communication strategy can be selected; see the list on the website.

Resources


Performance analysis

Strong scaling - aka Amdahl’s Law

In this case we fix the global workload and increase the computational resources. Taking the reference configuration as $T_o = \left( 1-\alpha_p\right) + \alpha_p$, where $\alpha_p$ is the fraction of the program that can be done in parallel, the speedup when going from $N_o$ computational resources to $N$ is given by (with $r = N/N_o$)

\[S = \dfrac{T_o}{T} = \dfrac{1}{ \left( 1-\alpha_p\right) + \dfrac{\alpha_p}{r}}\]

As $N$ increases, $r \rightarrow \infty$ and the speedup is bounded by $1/\left(1-\alpha_p\right)$.

Measuring the speedup is then valuable to determine $\alpha_p$, the fraction of the program actually done in parallel. We also see that with this analysis the speedup is bounded.
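As an illustration (the numbers are arbitrary), with $\alpha_p = 0.9$ and $r = 16$ we get

\[S = \dfrac{1}{\left(1-0.9\right) + \dfrac{0.9}{16}} = 6.4, \qquad \lim_{r \rightarrow \infty} S = \dfrac{1}{1-0.9} = 10\]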

Weak scaling, aka Gustafson’s law

Here we take the opposite approach: we increase the computational resources proportionally to the size of the problem. This analysis is closer to real-life applications.

Taking the reference configuration as $T_o = \left( 1-\alpha_p\right) + \alpha_p$, when multiplying the computational resources by $r$, the size of the problem is multiplied by $r$ as well. As we scale the domain, the speedup does not make sense anymore and we instead use an efficiency measure:

\[\eta = \dfrac{T_o}{T} = \dfrac{1}{ r \cdot \left[ \left( 1- \alpha_p \right) + \dfrac{\alpha_p}{r} \right]} = \dfrac{1}{r - (r-1) \alpha_p }\]

By analogy, some still like to define the speedup as $S = \dfrac{1}{\eta}$, which now gives \(S = r - (r-1) \alpha_p\)

Another approach is to define $\alpha_s = 1 - \alpha_p$ as the serial percentage of the program. The relations then become

\[\eta =\dfrac{1}{1 + (r-1) \alpha_s } \quad \Leftrightarrow \quad S = 1 + (r-1) \alpha_s\]
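With the same illustrative numbers as before ($\alpha_p = 0.9$, i.e. $\alpha_s = 0.1$) and $r = 16$, this gives

\[\eta = \dfrac{1}{1 + 15 \cdot 0.1} = 0.4 \quad \Leftrightarrow \quad S = 1 + 15 \cdot 0.1 = 2.5\]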

Strong scaling and Gustafson’s law

It is also possible to apply the Gustafson logic to the strong scaling analysis, where one defines the efficiency as

\[\eta = \dfrac{T_o \cdot N_o}{T \cdot N} = \dfrac{1}{ r \cdot \left[ \left( 1-\alpha_p\right) + \dfrac{\alpha_p}{r}\right]} = \dfrac{1}{r - (r-1) \alpha_p}\]
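Note that this is simply $\eta = S/r$, the Amdahl speedup divided by the increase in resources. With the illustrative numbers used above ($\alpha_p = 0.9$, $r = 16$):

\[\eta = \dfrac{S}{r} = \dfrac{6.4}{16} = \dfrac{1}{16 - 15 \cdot 0.9} = 0.4\]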
