To address Uncorrected Recoverable Errors, OmpSs is extended with lightweight task-based checkpoint/restart functionality.
The OmpSs programming model, as developed for the DEEP Cluster-Booster architecture, has been extended to handle Uncorrected Recoverable (UCR) errors. Applications written in the OpenMP task-based style benefit transparently from a low-overhead, in-memory checkpoint/restart mechanism that is driven by the input/output annotations of their tasks.
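The idea of checkpointing at task granularity can be illustrated with a minimal sketch in plain C. It is not the OmpSs runtime code: the names `task_fn`, `run_task_with_restart`, `accumulate` and `demo_sum` are hypothetical, the error is simulated by a return value rather than signalled by the OS kernel, and the task runs serially. The sketch assumes that the runtime saves the data a task may modify (its `inout` region) before the task starts, and restores that copy before re-executing the task if an error is reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical task body: returns 0 on success, nonzero if a
   (simulated) uncorrected recoverable error hit the task. */
typedef int (*task_fn)(const double *in, double *inout, size_t n, int attempt);

/* Run a task with an in-memory checkpoint of its inout region.
   On failure the region is restored and the task is re-executed,
   up to max_retries additional times. Returns the last task status,
   or -1 if the checkpoint buffer could not be allocated. */
static int run_task_with_restart(task_fn fn, const double *in,
                                 double *inout, size_t n, int max_retries)
{
    double *backup = malloc(n * sizeof *backup);
    if (backup == NULL)
        return -1;
    memcpy(backup, inout, n * sizeof *backup);      /* checkpoint */

    int status = -1;
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        status = fn(in, inout, n, attempt);
        if (status == 0)
            break;                                  /* task completed */
        memcpy(inout, backup, n * sizeof *backup);  /* roll back */
    }
    free(backup);
    return status;
}

/* Example task. In OmpSs it would be annotated roughly as
   "#pragma omp task in(in[0;n]) inout(acc[0;n])" (assumed syntax).
   The first attempt fails midway, leaving acc partially updated. */
static int accumulate(const double *in, double *acc, size_t n, int attempt)
{
    for (size_t i = 0; i < n; ++i) {
        if (attempt == 0 && i == n / 2)
            return 1;                               /* simulated UCR error */
        acc[i] += in[i];
    }
    return 0;
}

/* Runs the example task once and returns the sum of the result. */
static double demo_sum(void)
{
    double in[4]  = {1.0, 2.0, 3.0, 4.0};
    double acc[4] = {10.0, 20.0, 30.0, 40.0};

    int status = run_task_with_restart(accumulate, in, acc, 4, 1);
    assert(status == 0);

    double sum = 0.0;
    for (size_t i = 0; i < 4; ++i)
        sum += acc[i];
    return sum;
}
```

Calling `demo_sum()` exercises one failed attempt followed by a successful retry. Because the partially updated `acc` is rolled back before the retry, each element is incremented exactly once. The input region is read-only for the task and therefore needs no copy, which is why the input/output annotations are enough to decide what to checkpoint.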
The figure below shows how the OmpSs runtime has been extended to cooperate with the OS kernel so that it can recover from UCR errors transparently.