Improving resiliency with the SCR (Scalable Checkpoint/Restart) library

 

The SCR library

Scalable Checkpoint/Restart (SCR) is a library for application-level checkpointing. It supports multi-level checkpointing and redundancy schemes such as buddy checkpointing, in which a copy of each node's checkpoint is also kept on a partner node. The application does not decide on its own when to checkpoint; it asks SCR whether a checkpoint is needed. SCR caches the checkpoint data in fast local storage on the compute nodes, which keeps checkpoint and restart fast and scalable.
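The buddy idea can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not SCR's implementation: the names node_storage, buddy_of, write_checkpoint, fail_node and recover_checkpoint, the ring-shaped buddy assignment, and the sizes NUM_NODES and STATE_LEN are all hypothetical and chosen for illustration. Each node keeps its own checkpoint in simulated local storage and places a copy on the next node, so the checkpoint of a single failed node can still be recovered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NUM_NODES 4   /* hypothetical job size */
#define STATE_LEN 32  /* hypothetical checkpoint size in bytes */

/* Simulated node-local storage: the node's own checkpoint plus a copy of
 * the checkpoint belonging to the node it acts as buddy for. */
typedef struct {
    char own[STATE_LEN];
    char buddy_copy[STATE_LEN];
    bool alive;
} node_storage;

/* Assumed scheme: the buddy of node r is simply the next node in a ring. */
int buddy_of(int rank) {
    assert(rank >= 0 && rank < NUM_NODES);
    return (rank + 1) % NUM_NODES;
}

void init_nodes(node_storage nodes[]) {
    memset(nodes, 0, NUM_NODES * sizeof(node_storage));
    for (int i = 0; i < NUM_NODES; i++) {
        nodes[i].alive = true;
    }
}

/* Store the checkpoint locally and place a redundant copy on the buddy. */
void write_checkpoint(node_storage nodes[], int rank, const char *state) {
    snprintf(nodes[rank].own, STATE_LEN, "%s", state);
    snprintf(nodes[buddy_of(rank)].buddy_copy, STATE_LEN, "%s", state);
}

/* A failed node loses everything in its local storage. */
void fail_node(node_storage nodes[], int rank) {
    memset(&nodes[rank], 0, sizeof(node_storage)); /* also clears alive */
}

/* Read the checkpoint of `rank` from its own storage if the node is alive,
 * otherwise from the copy held by its buddy. Returns false if both are lost. */
bool recover_checkpoint(const node_storage nodes[], int rank, char out[STATE_LEN]) {
    const char *src = NULL;
    if (nodes[rank].alive) {
        src = nodes[rank].own;
    } else if (nodes[buddy_of(rank)].alive) {
        src = nodes[buddy_of(rank)].buddy_copy;
    }
    if (src == NULL) {
        return false;
    }
    memcpy(out, src, STATE_LEN);
    return true;
}
```

In this sketch a checkpoint survives the loss of one node but not the simultaneous loss of a node and its buddy, which is one reason a multi-level scheme would additionally write some checkpoints to slower but more reliable storage.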

 

Increasing resiliency with SCR

The SeisSol application from TU Munich, which LRZ works on within the project, uses SCR to increase the code’s resiliency. Only a few SCR calls have to be added, e.g. SCR_Initialize(…) or SCR_Need_Checkpoint(int *flag). Integrating SCR improves the checkpointing strategy and makes the application robust against hardware failures.
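How few calls such an integration needs can be seen in a toy time loop. This is a hedged sketch, not SeisSol code and not the SCR API: the demo_ functions are hypothetical stand-ins that keep the checkpoint in memory, and the fixed checkpoint interval is an assumed policy. The loop asks the stand-in library after every step whether a checkpoint is needed, saves when told to, and resumes from the last saved step on the next run.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a checkpoint library; NOT the SCR API. */
typedef struct {
    int interval;       /* assumed policy: checkpoint every `interval` steps */
    bool has_saved;     /* true once a checkpoint exists */
    int saved_step;     /* step at which the last checkpoint was taken */
    double saved_value; /* application state stored in that checkpoint */
} demo_ckpt;

/* The library, not the application, decides whether to checkpoint now. */
void demo_need_checkpoint(const demo_ckpt *c, int step, int *flag) {
    *flag = (step % c->interval == 0);
}

void demo_save(demo_ckpt *c, int step, double value) {
    c->has_saved = true;
    c->saved_step = step;
    c->saved_value = value;
}

/* Runs the time loop up to last_step, resuming from the last checkpoint if
 * one exists. A simulated failure aborts the run when step == fail_at
 * (pass -1 for no failure). Returns the number of steps executed in this
 * run; *value_out is only written if the run completes. */
int demo_run(demo_ckpt *c, int last_step, int fail_at, double *value_out) {
    int step = 0;
    double value = 0.0;
    int executed = 0;

    if (c->has_saved) { /* restart path */
        step = c->saved_step;
        value = c->saved_value;
    }

    while (step < last_step) {
        if (step == fail_at) {
            return executed; /* simulated hardware failure */
        }
        value += 1.0; /* placeholder for one simulation time step */
        step++;
        executed++;

        int flag = 0;
        demo_need_checkpoint(c, step, &flag);
        if (flag) {
            demo_save(c, step, value);
        }
    }
    *value_out = value;
    return executed;
}
```

With an interval of 10 and a failure at step 57, a second call to demo_run resumes at step 50 and only has to execute the remaining 50 of 100 steps.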

Measurements show that the overhead introduced by SCR is low. The ability to restart saves a considerable amount of time: after a failure, the application resumes from the last checkpoint instead of starting the run all over again.