The next step is optimising the code for KNL. This step is essential to make efficient use of the Booster Nodes. There are several things to consider.

  • The first optimisation step is threading (see figure below). Codes intended for KNL must be able to use many cores and scale well as the number of cores increases, so multi-threading is essential.
  • Several factors affect the scaling behaviour across multiple cores, e.g. thread affinity and false sharing. Detailed information about threading can be found in Intel’s Guide for Developing Multithreaded Applications.
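False sharing, mentioned above, occurs when threads repeatedly write to different variables that happen to share a cache line. The following is a minimal sketch of one common remedy, padding per-thread data to a full cache line. It is not taken from any application discussed here; the names padded_double and padded_sum, the 64-byte line size, and the cap of 288 threads are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CACHE_LINE 64    /* assumed cache-line size in bytes */
#define MAX_THREADS 288  /* hypothetical cap on the team size */

/* One accumulator per thread, padded to a full cache line so that
   two threads never write to the same line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

/* Sums x[0..n-1] using one padded partial sum per thread. */
double padded_sum(const double *x, long n)
{
    assert(x != NULL || n == 0);

    padded_double partial[MAX_THREADS] = {{0}};
    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
#endif

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* Each thread only ever touches its own slot. */
        #pragma omp for schedule(static)
        for (long i = 0; i < n; ++i)
            partial[tid].value += x[i];
    }

    /* Combine the partial sums serially. */
    double total = 0.0;
    for (int t = 0; t < nthreads; ++t)
        total += partial[t].value;
    return total;
}
```

Built without OpenMP support the pragmas are ignored and the function runs on a single thread. For a plain sum, an OpenMP reduction clause is shorter in practice; the padded slots are shown here only to make the cache-line layout visible.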


KNL optimisation step 1


Memory Management

The next step is to optimise the memory access (see figure below). The guides available from Intel are very helpful with respect to the memory management.

Another important point is the use of the MCDRAM. KNLs provide 16 GB of on-package memory, called MCDRAM. Frequently used data should be stored there, because the MCDRAM is much faster than the DDR4. If the application needs no more than 16 GB, the whole application can run within this on-package memory.


KNL optimisation step 2

The MCDRAM can be used in 3 different modes: cache mode, flat mode, and hybrid mode:

In cache mode the MCDRAM is treated as a last-level cache and is used automatically.

In flat mode the MCDRAM appears as a NUMA node and its usage is controlled by the application developer. In this mode the application can be bound to the MCDRAM when it is launched:
numactl -m 1 ./example_code

In this case everything will be stored in the MCDRAM. If there is not enough space in the MCDRAM, the application will be terminated. With the following command the MCDRAM is preferred as long as there is enough space, and the rest of the data is stored in the DDR4:
numactl -p 1 ./example_code

To control which data is stored in the MCDRAM, the memkind library has to be used. When running in flat mode there are 2 NUMA nodes: NUMA node 0 is the DDR4 and NUMA node 1 is the MCDRAM. With the following command all data will be allocated on the DDR4 (NUMA node 0), except for the memkind allocations, which go to the MCDRAM (NUMA node 1).
numactl --membind=0 ./example_code
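As an illustration of the memkind approach, the sketch below places one frequently used array in the MCDRAM through memkind's hbwmalloc interface (hbw_check_available, hbw_malloc, hbw_free) and leaves everything else to the default allocator. The helper names alloc_hot and free_hot and the USE_MEMKIND build flag are hypothetical; without that flag the code falls back to plain malloc so that it still builds on machines without memkind.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef USE_MEMKIND   /* hypothetical flag: build with -DUSE_MEMKIND -lmemkind */
#include <hbwmalloc.h>
#endif

/* Allocates n doubles, using MCDRAM when memkind reports high-bandwidth
   memory as available. *in_hbw records which allocator owns the buffer. */
double *alloc_hot(size_t n, int *in_hbw)
{
    assert(in_hbw != NULL);
    *in_hbw = 0;
#ifdef USE_MEMKIND
    if (hbw_check_available() == 0) {   /* 0 means high-bandwidth memory exists */
        double *p = hbw_malloc(n * sizeof(double));
        if (p != NULL) {
            *in_hbw = 1;
            return p;
        }
    }
#endif
    return malloc(n * sizeof(double));  /* ordinary allocation (DDR4) */
}

/* Releases a buffer with the allocator that created it. */
void free_hot(double *p, int in_hbw)
{
#ifdef USE_MEMKIND
    if (in_hbw) {
        hbw_free(p);
        return;
    }
#endif
    (void)in_hbw;
    free(p);
}
```

Only the arrays that dominate memory traffic need to go through alloc_hot; all other allocations stay on NUMA node 0 when the program is started with the numactl command shown above.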


The hybrid mode is a combination of cache and flat mode. The different modes are selected at boot time. This presentation from Intel about MCDRAM provides more detailed information.


KNL optimisation step 3


The third step of KNL optimisation is vectorisation (Figure 5). The easiest way is auto-vectorisation, enabled through compiler options such as -vec or -xMIC-AVX512. With this approach the compiler decides what is vectorised, so no code changes are needed.

In cases where the compiler does not vectorise everything automatically, users can fall back on the guided vectorisation approach. Possible approaches include:

  • In some loops it may be necessary to tell the compiler that there are no vector dependencies by adding a hint, e.g. #pragma ivdep (ignore any assumed vector dependencies).
  • Another option is to force the compiler to ignore all dependencies and vectorise the loop, e.g. with #pragma simd or #pragma vector. This approach needs only minor code changes (adding the pragmas in front of the loops).
  • A third way of guided vectorisation is array notation. This technique requires more code changes, since the for-loops have to be replaced by array notation. When array notation is used, the compiler will use the SIMD instruction set.
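The pragma-based hints can be sketched on a simple loop. In the example below the compiler cannot prove that x and y do not overlap, so the pragma supplies that guarantee. The function name scaled_add is hypothetical. With the Intel compiler the sketch uses #pragma ivdep; for other compilers it swaps in the OpenMP directive #pragma omp simd, which is not mentioned above but serves a similar purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] += a * x[i] for i in [0, n).
   The caller must guarantee that x and y do not overlap; the pragma
   passes that guarantee on to the compiler. */
void scaled_add(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

#if defined(__INTEL_COMPILER)
    #pragma ivdep      /* ignore assumed vector dependencies */
#else
    #pragma omp simd   /* portable counterpart, needs OpenMP SIMD support */
#endif
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

The hint is a promise, not a check: if the arrays do overlap, the vectorised loop may produce wrong results. Compilers that do not recognise the pragma ignore it and compile the loop as usual.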

KNL optimisation step 4 


Investigating different hardware modes
The next thing to investigate is the different hardware modes KNL provides (Figure 6). Besides the MCDRAM modes described above, the KNL mesh interconnect supports 3 different clustering modes:

  • all-to-all
  • quadrant
  • sub-NUMA clustering

The quadrant mode divides the chip into four quadrants, with affinity between the tag directories and the memory in each quadrant. It has a lower latency and higher bandwidth than the all-to-all mode. Neither mode requires code changes.

When using sub-NUMA clustering, the cores appear to the OS as 4 (or 2) NUMA nodes. This is analogous to a 4-socket Xeon. When using OpenMP with multiple MPI ranks per processor, affinity descriptors such as scatter or compact should be used. In OpenMP codes without MPI, the NUMA bindings have to be handled manually. For affinity control, MPI mechanisms such as I_MPI_PIN_MODE (Intel MPI) can be used.
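One common way to handle NUMA placement manually in an OpenMP code is first-touch initialisation. The sketch below assumes the usual Linux behaviour that a memory page is placed on the NUMA node of the thread that first writes to it, and that threads are pinned (for example with OMP_PROC_BIND=true or an affinity descriptor such as scatter or compact). The function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates n doubles and initialises them in parallel, so that each
   page is first touched by the thread that will later work on it. */
double *alloc_first_touch(long n)
{
    double *a = malloc((size_t)n * sizeof *a);
    if (a == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 0.0;
    return a;
}

/* Uses the same static schedule as the initialisation, so each thread
   mostly works on pages located in its own NUMA node. */
void scale(double *a, long n, double s)
{
    assert(a != NULL || n == 0);

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] *= s;
}
```

The key design choice is that initialisation and computation use the same loop bounds and the same static schedule; if the array were initialised serially, all pages would end up on one node.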

The GERShWIN application from Inria is a good example of what can be achieved by optimising an application for KNL. The vectorisation efforts led to a speedup of 1.3 (compared to the threaded version). Identifying and resolving bad memory alignments increased the speedup to 2.1. Finally, using the MCDRAM gave a speedup of 1.4 compared to using the DDR4 (both with the optimised version).
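To illustrate what resolving a memory alignment problem can look like in general, the sketch below allocates an array on a 64-byte boundary, which matches the width of one 512-bit AVX-512 register. It is not GERShWIN's code; the function name alloc_aligned_doubles is hypothetical and the example uses the standard C11 aligned_alloc call.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 64  /* bytes in one 512-bit vector register */

/* Allocates at least n doubles starting on a 64-byte boundary.
   The buffer is released with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);

    /* aligned_alloc expects a size that is a multiple of the alignment,
       so round the request up to the next multiple of VEC_ALIGN. */
    size_t padded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    if (padded == 0)
        padded = VEC_ALIGN;

    double *p = aligned_alloc(VEC_ALIGN, padded);
    assert(p == NULL || (uintptr_t)p % VEC_ALIGN == 0);
    return p;
}
```

Aligned data lets the compiler use aligned vector loads and stores, and avoids accesses that straddle two cache lines.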