Juri Schmidt, Research Associate, University of Heidelberg
Developing new supercomputing systems for the Exascale era poses significant challenges. On the hardware side, one important task is to provide sufficient memory and storage capacity at ever higher bandwidths within a tight energy budget. In the DEEP-ER project, the team experiments with non-volatile memory (NVM) and network attached memory (NAM) devices. We spoke to Juri Schmidt, research associate at the University of Heidelberg, about the latter; we were curious what makes a NAM so special, what lies ahead in his research, and why Heidelberg decided to establish its work on the hybrid memory cube (HMC) as an open source project.
You are working on the Network Attached Memory (NAM) device for the DEEP-ER project at the Computer Architecture Group of the University of Heidelberg. Can you explain to us the whole concept of a NAM?
As the name indicates, a NAM is basically a storage device plugged into the interconnect network of a cluster. That sounds pretty simple and straightforward. But the underlying technology is quite new and exciting, and the NAM concept enables entirely new approaches for using memory as a shared resource.
In the DEEP-ER architecture the NAM will provide high-speed access to a large amount of DDR memory, accessible through remote memory operations in both the Booster and the Cluster node address spaces. But the ultimate goal is that the NAM controller will include some kind of internal intelligence. This enables the NAM to execute certain computing operations itself – which eventually can speed up the parallel computing processes by reducing communication between processors and between processors and memory.
How does the NAM actually work?
First, we have a hybrid memory cube (HMC) controller and a component that implements the so-called “programmable functions and delays” (the NAM “intelligence”). These two components form the NAM controller, which is completed by an HMC memory device connected to it. The special thing about the HMC controller is that it relies on regular DRAM, but uses an entirely different architecture from DDR3 or DDR4, which significantly increases the memory bandwidth and at the same time reduces power consumption.
The NAM controller will handle all communication between the DEEP-ER interconnect network and the HMC. This is actually a very nice feature for the software work in the project: the NAM is adaptable to different network protocols and access functions, and the user does not have to touch the software stack.
Given that it is such a new technology: how long have you been researching it, and what is the current status of your work?
At the moment we’re in the prototyping phase and have been experimenting for roughly the last 12 months. Currently, we have a 16-lane link up and running at 10 Gbit/s per lane. By the end of the year, we aspire to increase the lane speed to 15 Gbit/s. Extrapolated to all four HMC links, this corresponds to a bidirectional bandwidth of 240 GByte/s.
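The quoted figure follows directly from the numbers in the answer; a quick back-of-the-envelope check (counting both directions, which is how the interview states the bandwidth):

```python
# Sanity check of the 240 GByte/s figure from the interview:
# 4 HMC links x 16 lanes x 15 Gbit/s target lane speed, bidirectional.
LINKS = 4
LANES_PER_LINK = 16
GBIT_PER_LANE = 15   # target lane speed in Gbit/s
DIRECTIONS = 2       # bandwidth is quoted bidirectionally

total_gbit_per_s = LINKS * LANES_PER_LINK * GBIT_PER_LANE * DIRECTIONS
total_gbyte_per_s = total_gbit_per_s / 8  # 8 bits per byte

print(total_gbyte_per_s)  # prints 240.0
```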
You mentioned before that the NAM controller will be able to do some computation on its own. Have you already thought about possible use cases for the NAM within the DEEP-ER prototype?
There are a couple of scenarios we are thinking of at the moment. The programmable functions I’ve just talked about offer us many possibilities. We have not yet made a final decision on which path we want to follow; after all, within the time frame of the project we cannot experiment with every possible use case of the NAM. Most probably we will focus on application checkpointing and restart. Here, functions like the calculation of parity information for redundant storage of checkpoints and remote reduction operations come to mind.
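To make the parity idea concrete, here is a minimal sketch (not the NAM implementation, just an illustration of the kind of reduction a programmable controller could run close to memory): XOR parity over checkpoint fragments, RAID-style, so any single lost fragment can be rebuilt from the parity and the remaining fragments.

```python
# Illustrative sketch: byte-wise XOR parity over checkpoint fragments.
# This is the kind of reduction a programmable NAM controller could
# compute itself instead of shipping all data back to the processors.
from functools import reduce

def parity(fragments):
    """XOR all equally-sized fragments byte-wise into one parity block."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), fragments)

checkpoints = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
p = parity(checkpoints)

# A lost fragment is recoverable from the parity plus the survivors:
rebuilt = parity([p, checkpoints[1], checkpoints[2]])
assert rebuilt == checkpoints[0]
```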
What would be the next steps to achieve this?
If we take the checkpointing use case as an example: at the moment, a checkpoint stored on a NAM would reside in volatile HMC DRAM. That means you would need to provide a back-up battery to avoid the risk of losing data during a power outage. Non-volatile memory could be used to avoid batteries. Using special delay functions, the NAM will be able to emulate the characteristics of different (much slower) NVM technologies and products while an HMC is still connected to it. The beauty of this approach is that software does not need to be adapted, as the APIs stay the same.
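The delay-function idea can be pictured as a thin wrapper around the fast DRAM-backed store: accesses are artificially slowed so software observes NVM-like timing through an unchanged API. The class and latency value below are purely hypothetical, a sketch of the concept rather than the actual controller logic:

```python
# Hypothetical sketch of NVM emulation via delay functions: the backing
# store stays fast HMC/DRAM, but an artificial latency is injected so
# software experiences the timing of a slower NVM technology while the
# read/write API is unchanged.
import time

class DelayedMemory:
    def __init__(self, backing, extra_latency_s):
        self.backing = backing                # fast DRAM-backed store
        self.extra_latency_s = extra_latency_s  # emulated NVM penalty

    def read(self, addr):
        time.sleep(self.extra_latency_s)      # inject emulated latency
        return self.backing[addr]

    def write(self, addr, value):
        time.sleep(self.extra_latency_s)      # inject emulated latency
        self.backing[addr] = value

# Illustrative latency value only; a real delay function would be tuned
# to the NVM product being emulated.
mem = DelayedMemory(backing={}, extra_latency_s=1e-6)
mem.write(0x10, b"checkpoint-fragment")
assert mem.read(0x10) == b"checkpoint-fragment"
```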
While working on the NAM for the DEEP-ER project, University of Heidelberg decided to make this an open source project as well – why is that?
We are in the great position of having all the hardware and support we need thanks to the industry partners we cooperate with. But a lot of universities and small research institutions do not have access to this exciting new memory technology. So we decided to make the HMC controller an open source project called openHMC that would benefit the community, and at the same time it will obviously benefit us as well.
What do you offer to the community?
So, first of all, we have a website, where we offer a package containing the RTL sources (written in Verilog) along with the openHMC documentation. It comes with quite a goodie on top: The HMC controller can be fully parameterized to three different configurations depending on speed and area requirements. Of course, we also provide the option to get in touch with us for support. Additionally, we have a LinkedIn group for feedback and discussions.
And what do you expect to gain from it?
Obviously we are very interested in driving the development of the NAM. With the open source project we exchange our knowledge with the community, but we obviously think we will get a lot back as well. So basically the idea is: the more the merrier and hopefully the better the ideas for improvements and new applications for the NAM.
Thanks, Juri, for your time and good luck to you and your colleagues from Heidelberg for your open source project.