This is done by mapping file operations on individual files to file operations on one or a few shared files. At very large scale, parallel I/O to task-local files does not perform well on current parallel file systems. One of the critical bottlenecks is the management of file metadata, because every task-local file has to be created, opened and closed through the file system's metadata service. By replacing I/O operations on individual files with collective I/O operations on a shared file, SIONlib reduces the number of such file-management operations to a minimum.
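The mapping idea can be illustrated with a minimal Python sketch under a simplified layout assumption: every task gets one contiguous chunk in a single shared file, and the chunk size is rounded up to the file-system block size so that no two tasks write into the same block. The names `aligned_chunk_size`, `write_shared` and `read_task` are hypothetical and are not part of the SIONlib API. The tasks are simulated sequentially in one process, and the per-task offset index is returned in memory rather than stored in the file.

```python
import os
import tempfile


def aligned_chunk_size(requested: int, fs_block_size: int) -> int:
    """Round a requested chunk size up to a multiple of the file-system block size."""
    return -(-requested // fs_block_size) * fs_block_size


def write_shared(path: str, task_data: list[bytes], fs_block_size: int) -> list[tuple[int, int]]:
    """Simulate N tasks writing into ONE shared file instead of N task-local files.

    Assumption (simplification): all tasks use the same block-aligned chunk size,
    and task `rank` owns the byte range starting at rank * chunk.
    Returns a hypothetical in-memory index of (offset, length) per rank.
    """
    chunk = aligned_chunk_size(max(len(d) for d in task_data), fs_block_size)
    index = []
    # A single create/open replaces one create/open per task.
    with open(path, "wb") as f:
        for rank, data in enumerate(task_data):
            offset = rank * chunk
            f.seek(offset)  # jump to this task's private chunk
            f.write(data)
            index.append((offset, len(data)))
    return index


def read_task(path: str, index: list[tuple[int, int]], rank: int) -> bytes:
    """Read back the data written by one task, as if it were its own file."""
    offset, length = index[rank]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "shared.bin")
    idx = write_shared(path, [b"task0", b"task-one", b"t2"], fs_block_size=16)
    print(idx)
    print(read_task(path, idx, 1))
```

However many tasks take part, the file system only has to create and track one file, which is exactly the metadata load the paragraph above identifies as the bottleneck. Block alignment trades some unused space for avoiding contention between tasks on shared file-system blocks.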
In the DEEP-ER project, SIONlib will be adapted to the new I/O hardware elements and the software structure of the DEEP-ER platform. Thanks to the mature SIONlib API, application developers need to make only minimal code changes. One major extension of SIONlib is the integration of ‘buddy’ checkpointing, which, when node-local storage is used, saves redundant copies of each checkpoint on other nodes. This enables faster recovery from errors and is an important element of the resiliency strategy within the DEEP-ER project.
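The principle behind buddy checkpointing can be sketched in Python as follows. This is an illustration only: SIONlib's actual partner selection and on-disk layout are not described here, so the ring-offset pairing (node `r` keeps a copy of its checkpoint on node `(r + distance) mod n`) is an assumption, node-local storage is modelled as plain dictionaries, and `buddy_of`, `write_checkpoint` and `recover` are hypothetical names.

```python
def buddy_of(rank: int, n_nodes: int, distance: int = 1) -> int:
    """Hypothetical pairing: node `rank` stores its redundant copy on (rank + distance) mod n."""
    if n_nodes < 2 or distance % n_nodes == 0:
        raise ValueError("a buddy must be a different node")
    return (rank + distance) % n_nodes


def write_checkpoint(stores: list[dict], rank: int, data: bytes, distance: int = 1) -> None:
    """Write the checkpoint to the node's own local storage and a copy to its buddy's."""
    stores[rank][("own", rank)] = data
    stores[buddy_of(rank, len(stores), distance)][("buddy", rank)] = data


def recover(stores: list[dict], rank: int, failed: set[int], distance: int = 1) -> bytes:
    """Restore a node's checkpoint; fall back to the buddy copy if the node's storage was lost."""
    if rank not in failed:
        return stores[rank][("own", rank)]
    buddy = buddy_of(rank, len(stores), distance)
    if buddy in failed:
        raise RuntimeError(f"both copies of rank {rank}'s checkpoint are lost")
    return stores[buddy][("buddy", rank)]


if __name__ == "__main__":
    stores = [{} for _ in range(4)]
    for r in range(4):
        write_checkpoint(stores, r, f"state{r}".encode())
    stores[2].clear()  # node 2 fails and its local storage is lost
    print(recover(stores, 2, failed={2}))  # served from node 3's buddy copy
```

Because both copies live on fast node-local storage, restart avoids reading from the global parallel file system; a checkpoint is lost only if a node and its buddy fail together, which is why a larger `distance` (for example, pairing nodes in different racks) might be chosen in practice.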
The buddy-checkpointing developments have been included in the latest version of SIONlib (as of October 2016). The documentation and the latest version for download are available at: www.fz-juelich.de/jsc/sionlib