Vectorisation and better memory alignment for speed-up
In general, codes running on Xeon Phi should be able to use many cores and scale well when the number of cores is increased. In the case of GERShWIN, the multi-threaded version of the code was taken as the baseline. Because codes need to be vectorised to make efficient use of this CPU, the first step was to rewrite the matrix/vector product operations so as to expose as many vectorisation opportunities as possible to the compiler. In addition, OpenMP SIMD directives were used to enforce vectorisation. These initial modifications already resulted in a speed-up factor of 1.3. In a second step, bad memory alignments were identified and resolved. After these changes, the speed-up factor increased to 2.1. Similar procedures led to speed-up factors of 3.5 for the oil exploration code by BSC and even 3.6 for the TurboRVB code by CINECA.
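The two steps described above can be illustrated with a minimal sketch in C. It is not taken from GERShWIN, whose source is not shown here; the function names alloc_aligned and matvec, the 64-byte alignment, and the row-major layout with a padded row stride are all assumptions made for illustration. The sketch allocates data on 64-byte boundaries and marks the inner loop of a matrix/vector product with an OpenMP SIMD directive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64 bytes matches one 512-bit vector register / cache line. */
#define ALIGNMENT 64

/* Hypothetical helper: allocate n doubles on a 64-byte boundary.
   Returns NULL on failure; release the memory with free(). */
double *alloc_aligned(size_t n)
{
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
       so the byte count is rounded up. */
    size_t bytes = n * sizeof(double);
    size_t padded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    double *p = aligned_alloc(ALIGNMENT, padded);
    assert(p == NULL || (uintptr_t)p % ALIGNMENT == 0);
    return p;
}

/* Hypothetical kernel: y = A * x for a row-major rows x cols matrix.
   ld is the row stride in doubles; it must be a multiple of 8 and a and x
   must be 64-byte aligned, so that every row starts on an aligned address. */
void matvec(size_t rows, size_t cols, size_t ld,
            const double *restrict a,
            const double *restrict x,
            double *restrict y)
{
    for (size_t i = 0; i < rows; ++i) {
        const double *row = a + i * ld;
        double sum = 0.0;
        /* Ask the compiler to vectorise the dot product of one row with x
           and tell it that both operands are aligned. */
        #pragma omp simd reduction(+:sum) aligned(row, x : ALIGNMENT)
        for (size_t j = 0; j < cols; ++j)
            sum += row[j] * x[j];
        y[i] = sum;
    }
}
```

The directive only takes effect when OpenMP SIMD support is enabled at compile time (for example -fopenmp-simd with GCC or -qopenmp-simd with the Intel compiler); otherwise the pragma is ignored and the loop is left to the auto-vectoriser. Padding the row stride to a multiple of 8 doubles costs a little memory but keeps the aligned clause valid for every row.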
Benefiting from the on-package MCDRAM
Finally, the team explored the option of leveraging the Xeon Phi's on-package high-bandwidth memory (MCDRAM). To access this memory, the numactl command was used to place all application data on the MCDRAM (for a use case that requires less than 16 GByte of memory). The threaded and vectorised version of GERShWIN runs a factor of 1.3 to 1.4 faster when using the MCDRAM than when using the DDR4 DRAM. We have measured similar speed-up factors, e.g. for the TurboRVB code by CINECA.