
A central pillar of the software stack in the DEEP projects is ParaStation MPI, an implementation of the Message-Passing Interface (MPI) standard enabled for the Modular Supercomputing Architecture (MSA).

Since the Message-Passing Interface is the most commonly used programming standard for parallel applications, the extensions implemented in ParaStation MPI within the DEEP projects can benefit a wide range of applications. ParaStation MPI is fully MPI-3 compliant, and its DEEP-related extensions are designed to stay as close as possible to the current standard while still reflecting the peculiarities of the DEEP prototypes. This way, applications tuned to MSA environments remain portable and essentially compliant with the MPI standard.

 

MSA Extensions

ParaStation MPI makes affinity information available to applications running across different modules of the DEEP prototypes while adhering to the MPI interface. This way, applications may exploit the underlying hardware topology for further optimisations. These are not limited to the program flow but may likewise affect the communication patterns; for example, by using the new split type MPIX_COMM_TYPE_MODULE, applications can create module-specific MPI communicators.
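
For illustration, the following minimal sketch (a generic MPI program; only the split type itself is the ParaStation MPI extension) creates such a module-specific communicator and reports each process's rank within its module:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split MPI_COMM_WORLD into module-local communicators using the
     * ParaStation MPI split type mentioned above. On a non-MSA system or
     * with a different MPI library, MPIX_COMM_TYPE_MODULE is not available. */
    MPI_Comm module_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPIX_COMM_TYPE_MODULE,
                        0 /* key: keep rank order */, MPI_INFO_NULL,
                        &module_comm);

    int module_rank, module_size;
    MPI_Comm_rank(module_comm, &module_rank);
    MPI_Comm_size(module_comm, &module_size);

    printf("World rank %d is rank %d of %d within its module\n",
           world_rank, module_rank, module_size);

    MPI_Comm_free(&module_comm);
    MPI_Finalize();
    return 0;
}
```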


Additionally, ParaStation MPI itself applies application-transparent optimisations for modular systems, in particular regarding collective communication patterns. Based on topology information, collective operations such as Broadcast or Reduce can be performed hierarchically, so that inter-module communication (a potential bottleneck) is reduced.
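
ParaStation MPI performs these optimisations inside the library, so no application changes are required. The following application-level sketch merely illustrates the underlying idea of a two-level scheme; it assumes the broadcast root is global rank 0 and reuses the module split type introduced above:

```c
#include <mpi.h>

/* Illustrative two-level broadcast: the data crosses the (potentially slow)
 * inter-module links only once per module, via one leader process per module,
 * and is then distributed module-locally. Assumes the root is global rank 0. */
static void hierarchical_bcast(void *buf, int count, MPI_Datatype type,
                               MPI_Comm comm)
{
    MPI_Comm module_comm, leader_comm;
    int module_rank;

    /* Module-local communicator (ParaStation MPI split type). */
    MPI_Comm_split_type(comm, MPIX_COMM_TYPE_MODULE, 0, MPI_INFO_NULL,
                        &module_comm);
    MPI_Comm_rank(module_comm, &module_rank);

    /* Communicator containing one leader (module-local rank 0) per module. */
    MPI_Comm_split(comm, module_rank == 0 ? 0 : MPI_UNDEFINED, 0,
                   &leader_comm);

    /* Step 1: inter-module broadcast among the leaders only. */
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Step 2: intra-module broadcast from each leader to its module. */
    MPI_Bcast(buf, count, type, 0, module_comm);
    MPI_Comm_free(&module_comm);
}
```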


NAM Integration

One distinct feature of the DEEP-EST prototype will be the Network Attached Memory (NAM): special memory regions that can be accessed directly via Put/Get operations from every node within the EXTOLL network.


To make programming the NAM more convenient and familiar, the DEEP-EST project also aims at integrating an interface for accessing the NAM via MPI. That way, application programmers can use well-known MPI functions (in particular those of the MPI RMA interface) to access NAM regions much like other remote memory regions, in a standardised (or at least harmonised) fashion under the single roof of an MPI world.
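
The intended programming pattern is that of the standard MPI RMA interface, as in the sketch below. Only plain MPI-3 calls are shown here; the actual mechanism for placing a window in NAM memory is part of the ParaStation MPI integration and is not spelled out in this text:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose one integer per process as a remotely accessible window.
     * For NAM regions, the goal is to reuse this familiar interface. */
    int *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = rank;

    MPI_Win_fence(0, win);

    /* Every process reads the value exposed by rank 0 via a Get operation. */
    int remote_val = -1;
    MPI_Get(&remote_val, 1, MPI_INT, 0 /* target rank */,
            0 /* displacement */, 1, MPI_INT, win);

    MPI_Win_fence(0, win);
    printf("Rank %d read %d from the window of rank 0\n", rank, remote_val);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```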


As a first step, a shared-memory-based implementation of this interface has already been developed in ParaStation MPI. It emulates the MPI-related handling of NAM memory by using persistent memory segments on the compute nodes instead.

 

CUDA Awareness

A CUDA-aware MPI implementation allows mixed CUDA+MPI applications to pass pointers to CUDA buffers located in GPU memory directly to MPI functions, whereas a non-CUDA-aware MPI library would fail in such a case. Furthermore, a CUDA-aware MPI library can detect that a pointer references a GPU buffer and apply appropriate communication optimisations. For example, so-called GPUDirect capabilities can then be used to enable direct RDMA transfers to and from GPU memory.
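
As a minimal sketch (generic CUDA+MPI code, not specific to ParaStation MPI), the device pointer returned by cudaMalloc is handed directly to MPI_Send and MPI_Recv; with a non-CUDA-aware library, an explicit copy to a host buffer would be required first:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Allocate the communication buffer in GPU memory. */
    const int count = 1024;
    double *gpu_buf;
    cudaMalloc((void **)&gpu_buf, count * sizeof(double));

    /* With a CUDA-aware MPI library, the device pointer can be passed
     * directly to the MPI calls; the library detects the GPU buffer and,
     * where available, may use GPUDirect for direct RDMA transfers. */
    if (rank == 0) {
        MPI_Send(gpu_buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(gpu_buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}
```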


ParaStation MPI supports CUDA awareness, e.g., for the DEEP-EST ESB, at different levels. On the one hand, GPU pointers can be passed to MPI functions. On the other hand, if an interconnect technology provides features such as GPUDirect, ParaStation MPI can bypass its own mechanism for handling GPU pointers and forward the required information to the lower software layers so that such hardware capabilities are exploited.


One goal within the DEEP-EST project is the utilisation of GPUDirect together with EXTOLL via ParaStation MPI.


Gateway Support

The MSA concept allows distinct modules to use different network technologies. ParaStation MPI therefore provides means for message forwarding based on so-called gateway daemons. These daemons run on dedicated gateway nodes that are directly connected to the different networks of an MSA system; in the DEEP-EST prototype system, for example, gateway nodes bridge between the InfiniBand and the EXTOLL network.


This gateway mechanism is transparent to the MPI processes, i.e., they see a common MPI_COMM_WORLD communicator spanning the whole MSA system. To achieve this, the mechanism introduces a new connection type, the gateway connection, in addition to the fabric-native transports such as InfiniBand, EXTOLL, and shared memory. These virtual gateway connections map onto the underlying physical connections to and from the gateway daemons.


Transparency towards the MPI layer is achieved by implementing the gateway logic entirely in the lower pscom layer, i.e., the high-performance point-to-point communication layer of ParaStation MPI. This way, more complex communication patterns implemented on top, e.g., collective communication operations, work across different modules out of the box.


ParaStation MPI takes several measures to avoid bandwidth bottlenecks in cross-gateway communication. For one thing, a module interface may comprise multiple gateway nodes, and the MPI bridging framework can use all of them, achieving transparent load balancing among them on the basis of a static routing scheme. For another, the upper MPICH layer of ParaStation MPI can retrieve topology information (cf. MSA Extensions) to optimise complex communication patterns, e.g., to minimise inter-module traffic.


In the DEEP-EST project, several optimisations of the gateway protocol have been implemented. They leverage the RMA capabilities of EXTOLL (the interconnect of the ESB and the DAM) in combination with message forwarding from and to the InfiniBand-equipped Cluster Module (CM). The gateway connections support the fragmentation of MPI messages into smaller chunks, so the gateway daemons can benefit from a pipelining effect: while message parts are still being received on one end of the connection, completely received fragments can already be forwarded to the destination node on the other end. Ideally, the data transfer from the source to the gateway daemon overlaps perfectly with the transfer from the gateway daemon to the destination.


Furthermore, the gateway protocol supports so-called rendezvous semantics: instead of relying on intermediate, pre-allocated communication buffers, an MPI message is first announced by a small control message. The actual data transfer can then be conducted efficiently via the Remote Direct Memory Access (RDMA) capabilities of the hardware, avoiding costly CPU involvement. Moreover, with this approach the message transfer can be delayed until the actual receive buffer is known to the communication layer, i.e., until the receive call has been posted by the application.
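
To make the pipelining idea concrete, the following sketch models it as an ordinary three-process MPI program (rank 0 as source, rank 1 as "gateway", rank 2 as destination) with an arbitrary fragment count and size. The real forwarding logic of ParaStation MPI is implemented in the pscom layer and its gateway daemons, not at the application level; this code only illustrates the overlap of receiving and forwarding fragments:

```c
#include <mpi.h>
#include <string.h>

#define NFRAG   8      /* fragments per message (illustrative) */
#define FRAG_SZ 4096   /* fragment size in bytes (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);           /* run with at least 3 processes */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[2][FRAG_SZ];

    if (rank == 0) {                            /* source */
        memset(buf[0], 1, FRAG_SZ);
        for (int i = 0; i < NFRAG; i++)
            MPI_Send(buf[0], FRAG_SZ, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                     /* "gateway" */
        MPI_Request fwd = MPI_REQUEST_NULL;
        for (int i = 0; i < NFRAG; i++) {
            int cur = i % 2;
            /* Receive the next fragment from the source while the
             * previously forwarded fragment may still be in flight
             * towards the destination (pipelining / overlap). */
            MPI_Recv(buf[cur], FRAG_SZ, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Wait(&fwd, MPI_STATUS_IGNORE);
            MPI_Isend(buf[cur], FRAG_SZ, MPI_BYTE, 2, 0, MPI_COMM_WORLD,
                      &fwd);
        }
        MPI_Wait(&fwd, MPI_STATUS_IGNORE);
    } else if (rank == 2) {                     /* destination */
        for (int i = 0; i < NFRAG; i++)
            MPI_Recv(buf[i % 2], FRAG_SZ, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```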