
The second focus topic of the DEEP-ER project is resiliency. The resiliency methods developed in the project are flexible enough to accommodate the heterogeneous nature of systems like the DEEP-ER prototype. The goal is to avoid having to restart the full application after a failure. The figure below shows an overview of the resiliency techniques within the DEEP-ER project.

 

[Figure: overview of the resiliency techniques in the DEEP-ER project]


The first technique shown there is application-based (or user-based) checkpointing: the application writes checkpoints from which it can later be restarted. This is necessary when a fatal hardware failure occurs, because the affected node has to be rebooted and the application cannot continue. Instead of starting from the beginning after the reboot, the application resumes from the last written checkpoint. There are two possible techniques for this approach:

  1. The scalable checkpoint-restart library (SCR)
  2. OmpSs persistent task-based checkpointing.


Option A: SCR
SCR supports multi-level checkpointing and redundancy (buddy checkpointing). SCR also decides whether writing a checkpoint is necessary at a given point. It can be used within MPI codes. The library and its documentation can be found here (scr).


SCR has to be started with SCR_Init(), called after MPI_Init(). This sets up the checkpoint database and looks for checkpoints to restart from. SCR_Need_checkpoint(int *flag) reports whether a condition for writing a checkpoint is met. If it is, the application creates a checkpoint in three steps: it calls SCR_Start_checkpoint(), writes the checkpoint data, and calls SCR_Complete_checkpoint(int valid), where valid states whether the data was written successfully. At the end of the code, before MPI_Finalize(), SCR_Finalize() should be called. SCR has also been integrated with the SIONlib library, so both libraries can be combined on the DEEP-ER architecture.
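The call order described above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not code from the DEEP-ER project: the function names checkpoint_state and write_state and the macros USE_SCR and PATH_LEN are hypothetical. The SCR calls use the names from SCR's public C API (SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint) and are compiled only when USE_SCR is defined, so the sketch also builds and runs on a machine without SCR, where it simply writes to the given file name.

```c
#include <assert.h>
#include <stdio.h>

#ifdef USE_SCR
#include <mpi.h>
#include <scr.h>
#define PATH_LEN SCR_MAX_FILENAME
#else
#define PATH_LEN 1024   /* assumption: large enough for the fallback path */
#endif

/* Write n doubles to path. Returns 1 on success, 0 on failure. */
static int write_state(const char *path, const double *state, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return 0;
    size_t written = fwrite(state, sizeof *state, n, f);
    int closed = (fclose(f) == 0);
    return closed && written == n;
}

/*
 * Hypothetical helper: returns 1 if a checkpoint was written, 0 if SCR
 * reported that none is needed, and -1 if writing failed.
 * With USE_SCR defined, SCR_Init() must already have been called.
 */
int checkpoint_state(const char *name, const double *state, size_t n)
{
    assert(name != NULL && state != NULL);
    char path[PATH_LEN];

#ifdef USE_SCR
    int need = 0;
    SCR_Need_checkpoint(&need);      /* let SCR decide on the frequency */
    if (!need)
        return 0;
    SCR_Start_checkpoint();
    SCR_Route_file(name, path);      /* ask SCR where the file should go */
#else
    snprintf(path, sizeof path, "%s", name);
#endif

    int valid = write_state(path, state, n);

#ifdef USE_SCR
    SCR_Complete_checkpoint(valid);  /* tell SCR whether the write succeeded */
#endif
    return valid ? 1 : -1;
}
```

In an MPI code built with USE_SCR, such a helper would be called from the main loop between SCR_Init() and SCR_Finalize(). The restart side, which reads the state back after SCR_Init() has located a checkpoint, is omitted here.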


Techniques such as SCR focus mainly on providing advanced I/O capabilities to minimize checkpoint/restart time. However, application developers are still responsible for manually serialising and de-serialising the application’s state. For this reason, the OmpSs programming model has been extended with a directive-based approach that performs application-level checkpoint/restart in a simplified way:

  • The checkpoint clause lets the user easily specify the application state that has to be saved and restored.
  • The runtime system, which relies on SCR to perform scalable and efficient I/O operations, carries out the serialization and deserialization.
  • The checkpoint clause needs to be combined with dependency clauses (e.g. in() or inout()).
  • The checkpoint task then stores the data named in these dependencies.


The following OmpSs pragma shows an example where cp_data is checkpointed:
#pragma omp task in (cp_data) checkpoint ( )


In this example the checkpoint frequency is given by the underlying checkpoint/restart library (in our case SCR). The checkpoint clause also accepts an optional condition expression to let the user control the checkpoint frequency.

In the following example a checkpoint of cp_data will be performed every five iterations:
#pragma omp task in (cp_data) checkpoint (iter%5==0)
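To show where such a pragma might sit in a program, here is a small C sketch. The checkpoint pragma is copied from the example above. Everything else is an assumption made for illustration: the function names simulate and step, the size N, the separate inout() task that updates cp_data, and the empty task body that carries the checkpoint clause. A standard C compiler without OmpSs support ignores the pragmas, so the code then runs sequentially and writes no checkpoints. The runtime's behaviour on restart is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define N 4              /* assumption: size of the state to checkpoint */

double cp_data[N];       /* application state named in the checkpoint task */

/* Advance the state by one step (stand-in for real computation). */
static void step(double *state, size_t n)
{
    for (size_t i = 0; i < n; i++)
        state[i] += 1.0;
}

/* Run the given number of iterations and return the first state element. */
double simulate(int iterations)
{
    assert(iterations >= 0);
    for (size_t i = 0; i < N; i++)
        cp_data[i] = 0.0;

    for (int iter = 0; iter < iterations; iter++) {
        /* Compute task: updates the state in place. */
        #pragma omp task inout (cp_data)
        step(cp_data, N);

        /* Checkpoint task: reads the state every fifth iteration.
           The empty body is an assumption of this sketch. */
        #pragma omp task in (cp_data) checkpoint (iter%5==0)
        { }
    }
    #pragma omp taskwait
    return cp_data[0];
}
```

The in() dependency orders the checkpoint task after the compute task of the same iteration and before the compute task of the next one, so the saved state is always a consistent snapshot.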


Option B: task-based resiliency
The second group of resiliency features in the DEEP-ER project is the novel task-based resiliency technique. The key idea is to checkpoint task inputs “in-memory”, so that in case of a task failure the runtime can isolate the error and re-execute the affected task. The first task-based resiliency technique shown in the figure above is lightweight task-based resiliency with OmpSs. This feature checkpoints the task inputs when the task is ready to execute. When an error occurs within a checkpointed task, the runtime is able to restore the inputs and restart the task. If no error occurs during the task execution, the runtime releases the dependencies and removes the checkpoint and the related software data structures. To use lightweight task-based resiliency, the user has to add a recover clause to the task pragma:


#pragma omp task in(in_data) recover


In this example the task has been marked as recoverable, which means that the runtime will checkpoint the input data (in_data) before execution.
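The mechanism behind the recover clause (copy the inputs, run the task, restore and re-run on error) can be imitated in plain C. The sketch below only simulates the idea and is not the OmpSs runtime's implementation. The names task_fn, run_recoverable, flaky_double and failures_left, and the retry limit, are hypothetical. The demo task updates its buffer in place so that the effect of restoring the inputs is visible.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical task type: returns 0 on success, non-zero on a detected error. */
typedef int (*task_fn)(double *data, size_t n);

/*
 * Copy the task's data in memory, run the task, and on failure restore
 * the copy and try again, up to max_retries extra attempts.
 * Returns the number of attempts used, or -1 if every attempt failed.
 */
int run_recoverable(task_fn task, double *data, size_t n, int max_retries)
{
    assert(task != NULL && data != NULL && n > 0);

    double *backup = malloc(n * sizeof *backup);
    if (backup == NULL)
        return -1;
    memcpy(backup, data, n * sizeof *backup);      /* in-memory checkpoint */

    int result = -1;
    for (int attempt = 1; attempt <= max_retries + 1; attempt++) {
        if (task(data, n) == 0) {
            result = attempt;
            break;
        }
        memcpy(data, backup, n * sizeof *backup);  /* restore the inputs */
    }

    free(backup);                                  /* drop the checkpoint */
    return result;
}

/* Demo task: doubles every element, but fails halfway while failures remain. */
int failures_left = 1;

int flaky_double(double *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        data[i] *= 2.0;
        if (failures_left > 0 && i == n / 2) {
            failures_left--;
            return 1;                              /* simulated task error */
        }
    }
    return 0;
}
```

Calling run_recoverable(flaky_double, d, 4, 3) on d = {1, 2, 3, 4} fails once with partly modified data, restores the original values, and succeeds on the second attempt, leaving d = {2, 4, 6, 8}.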


It is also possible to mark OmpSs offload tasks with the recover clause. This is the second technique shown in the figure: resiliency for offloaded tasks with OmpSs.


deep_booster_alloc(MPI_COMM_WORLD, n, ppn, &comm);
#pragma omp task onto (comm, rank) recover


This code fragment allocates ppn offloaded processes on each of n hosts. A task is then offloaded to each offloaded process. Without the resiliency techniques, a crash of one of the offloaded processes would cause the process manager to terminate all processes, including the host processes. Resiliency for offloaded tasks ensures that, upon failure, the process manager limits the clean-up to those processes that share the same MPI_COMM_WORLD communicator as the failed process. If one of the offloaded processes crashes, the affected processes are replaced with a freshly spawned set.