DEEP-ER implements several techniques to recover from Uncorrected Errors (UC).
OmpSs resiliency for offloaded tasks
In DEEP-ER we have also extended the task-offloading model developed in DEEP with resiliency support. In the DEEP offloading model, tasks can be offloaded from some nodes (masters) to other nodes (slaves). If one or more slave nodes fail due to a UC, the OmpSs runtime, in cooperation with an enhanced version of ParaStation MPI, is able to re-execute the failed task on another slave node, without the need to kill and restart any of the master nodes involved.
At the MPI level, this is facilitated by a new connection guard feature recently added to ParaStation MPI, which detects and reports broken links between processes, since these may indicate failed tasks. In addition, the ParaStation MPI process manager can now distinguish between master and slave nodes when an error occurs. This way, the process manager is able to clean up just the offloaded tasks, while the processes on the master nodes are kept alive to redo the offload.
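The recovery flow described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use OmpSs or ParaStation MPI, and the names `Slave`, `SlaveFailure`, and `offload_with_retry` are hypothetical stand-ins for a slave node, a broken link reported by the connection guard, and the re-offloading logic.

```python
class SlaveFailure(Exception):
    """Hypothetical signal standing in for a broken link to a slave node."""


class Slave:
    """Toy stand-in for a slave node; healthy=False simulates an Uncorrected Error."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, task, arg):
        if not self.healthy:
            raise SlaveFailure(self.name)
        return task(arg)


def offload_with_retry(task, arg, slaves):
    """Try each slave in order until one completes the task.

    Returns (result, name of the slave that succeeded, names of failed slaves).
    The caller, which plays the role of the master, is never restarted.
    """
    failed = []
    for slave in slaves:
        try:
            result = slave.run(task, arg)
            return result, slave.name, failed
        except SlaveFailure:
            # Only the failed offload is discarded; the master keeps running.
            failed.append(slave.name)
    raise RuntimeError("no healthy slave left to run the task")


# Usage: the first slave fails, so the task is re-executed on the second one.
slaves = [Slave("slave0", healthy=False), Slave("slave1")]
result, used, failed = offload_with_retry(lambda x: x * x, 7, slaves)
print(result, used, failed)  # prints: 49 slave1 ['slave0']
```

The point of the sketch is that the failure is contained at the offload boundary: the master only observes that one slave is gone and resubmits the same task elsewhere.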
For any other UC that cannot be handled with the techniques above, DEEP-ER also provides two interfaces for application-based checkpoint/restart.
- The first interface is based on the well-known SCR (Scalable Checkpoint/Restart) library, which has been optimized to make the most of the advanced I/O software and hardware architecture developed in DEEP-ER. It now supports applications that use node-local parallel I/O for maximum scalability in conjunction with buddy checkpointing for redundancy. Besides SCR's intrinsic features, the SIONlib buddy-checkpointing mechanism can be used for this purpose and is handled safely by SCR. Flushes and fetches between local and background storage are carried out using the BeeGFS asynchronous API. An analytical failure model has been integrated to find the optimal checkpoint frequency for a given mean time between failures (MTBF).
- The second interface is an extension of OmpSs that provides support for persistent task-based resiliency. It offers a more user-friendly, pragma-based way to provide application-based checkpoint/restart capabilities while still leveraging the SCR library for scalable and efficient I/O. With this approach, the application developer only has to identify the application state. The OmpSs runtime system then takes care of the application re-execution and data serialization/deserialization.
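To give a feel for how a checkpoint frequency can be derived from the MTBF, the sketch below uses Young's classic first-order approximation, in which the interval between checkpoints is sqrt(2 x checkpoint cost x MTBF). This is an assumption made for illustration: the text does not say which analytical failure model is integrated in DEEP-ER, and the function name `young_interval` is hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the compute time between checkpoints.

    checkpoint_cost_s: time to write one checkpoint, in seconds
    mtbf_s: mean time between failures, in seconds
    The approximation assumes the checkpoint cost is small compared to the MTBF.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Usage: a 60 s checkpoint on a system with a 24 h MTBF.
interval = young_interval(60.0, 24 * 3600.0)
print(f"checkpoint roughly every {interval / 60:.1f} minutes")
```

Under this approximation, a shorter MTBF or a cheaper checkpoint both lead to more frequent checkpoints, which is the trade-off such a model is meant to balance.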