How water displaces air in Lenovo supercomputers

Today, there is a steady trend toward extracting the maximum power from each computing unit in the data center. Users want more and more local disk storage inside data center nodes, and interconnects that link cluster complexes at network speeds of 200 Gbit/s no longer seem surprising.

CPUs and graphics accelerators are getting hotter every year. At the same time, it is too risky to install processors with a 240 W thermal design power in dense server form factors because of the high temperatures. Today, due to inefficient heat dissipation, most data centers are less than half full of equipment.

A couple of years ago, the limit for air cooling was thought to be 600 W per rack unit. Today, using the most modern technologies, it is possible to build a small 1U computing "box" that consumes up to 1 kW. Theoretically, air can remove up to 1.2 kW per unit. However, if the power of each individual computing unit keeps growing, cooling of at least 2 kW will be needed.

Due to the increased density of computing resources, next-generation data centers will be severely constrained in power, cooling, space, and maintenance costs. In addition, the need to comply with regulations on energy efficiency and emission reduction will have a big impact on the industry.

About the author:

Andrey Sysoev, leading specialist in high-performance computing technologies at Lenovo in Russia

Already this year, processors drawing up to 300 or even 350 W will appear on the market, requiring huge heat sinks and fans to cool them. By 2022, we can expect products drawing up to 500 W per socket. The new processors will require twice as much cooling power. Technically, this means a four-fold increase in fan speed and an eight-fold increase in volume.

If the air cooling system breaks down, the remaining equipment has to run at extreme levels. For example, in a system with five fans, if one fails, the remaining four must each deliver 25% more power. As a result, the rotation speed has to increase by 50%, and the volume doubles.
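The load-sharing arithmetic in the failure example can be checked with a short sketch. The function names are illustrative. The second function relies on the idealized fan affinity laws (airflow proportional to speed, power proportional to speed cubed), which are an assumption not taken from the article. Real fans in a dense chassis deviate from these laws, so the speed and volume figures quoted above may differ from this estimate.

```python
def airflow_increase_per_fan(total_fans: int, failed_fans: int) -> float:
    """Fractional extra airflow each surviving fan must deliver."""
    surviving = total_fans - failed_fans
    if surviving <= 0:
        raise ValueError("at least one fan must survive")
    return total_fans / surviving - 1.0


def ideal_power_ratio(airflow_ratio: float) -> float:
    """Fan power ratio under the idealized affinity laws (assumption):
    airflow scales with speed, power scales with speed cubed."""
    return airflow_ratio ** 3


extra = airflow_increase_per_fan(total_fans=5, failed_fans=1)
print(f"extra airflow per fan: {extra:.0%}")                          # 25%
print(f"ideal fan power ratio: {ideal_power_ratio(1 + extra):.2f}")   # 1.95
```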

Neptune replacing air

Water cooling works by transferring heat from a hotter object to a colder one. That is, as long as the coolant temperature is below the server's operating temperature, the heat generated is carried away by the liquid. Compared to air, water can transport 4,000 times more heat, so excess heat is easily removed from the server equipment. Afterwards, the hot water can be used to heat the building.
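To put the comparison in numbers, here is a rough sketch based on the sensible heat relation Q = m · c_p · ΔT. The property values are textbook approximations for water and air near room temperature. They are assumptions, not figures from the article. With them, the volumetric ratio comes out in the low thousands, the same order of magnitude as the figure quoted above. The 2 kW load and 10 K temperature rise are likewise illustrative.

```python
# Approximate properties near room temperature (assumed textbook values).
WATER_DENSITY = 998.0   # kg/m^3
WATER_CP = 4186.0       # J/(kg*K)
AIR_DENSITY = 1.2       # kg/m^3
AIR_CP = 1005.0         # J/(kg*K)


def volumetric_heat_capacity(density: float, cp: float) -> float:
    """Heat absorbed per cubic metre per kelvin, in J/(m^3*K)."""
    return density * cp


def flow_for_heat_load(heat_w: float, delta_t_k: float,
                       density: float, cp: float) -> float:
    """Volume flow (m^3/s) needed to carry heat_w watts at a delta_t_k rise."""
    return heat_w / (density * cp * delta_t_k)


ratio = (volumetric_heat_capacity(WATER_DENSITY, WATER_CP)
         / volumetric_heat_capacity(AIR_DENSITY, AIR_CP))
print(f"water carries about {ratio:.0f}x more heat per unit volume than air")

# Illustrative case: 2 kW of heat with a 10 K coolant temperature rise.
water_flow = flow_for_heat_load(2000.0, 10.0, WATER_DENSITY, WATER_CP)
air_flow = flow_for_heat_load(2000.0, 10.0, AIR_DENSITY, AIR_CP)
# 1 m^3/s equals 60,000 L/min.
print(f"water: {water_flow * 60_000:.2f} L/min, air: {air_flow * 60_000:.0f} L/min")
```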

Figure 1 Simplified diagram of a data center using direct water cooling technology.

All chips and modules of modern computing systems are designed to operate at temperatures of 80 °C and above. Because the usable temperature range is wide, more than 50 °C, the liquid cooling system can be tuned precisely to the specifics of a particular computing center. Microchannel heat sinks absorb excess heat directly at the source: the processor, memory module, hard disk, or network adapter.

Lenovo Neptune uses hot water under relatively low pressure, so a smaller volume of liquid passes through the cooling loop at any given moment. These innovations reduce the thermal resistance and the overall energy consumption of the data center: because the liquid is hot, it does not need to be cooled by energy-intensive chillers.

As long as the outside air temperature is below the water temperature, free air cooling will be sufficient for these purposes. The generated heat can be reused to heat nearby homes, swimming pools, and administrative buildings.

SuperMUC-NG – high performance and energy efficiency for the greatest scientific discoveries

The Leibniz Supercomputing Centre (LRZ), the central computing center for Munich's universities, is one of the world's largest academic data centers. LRZ provides the scientific community with world-class HPC services and resources, supporting ground-breaking research from cosmology to medicine.

High-performance computing is the cornerstone of modern science. More and more researchers are using modeling and simulations in their work.

Over time, the capacity of the existing cluster became insufficient, and the Leibniz Supercomputing Centre signed a contract with Lenovo to design and build a new system for processing and visualizing big data. The project was named SuperMUC-NG (NG stands for Next Generation) and was the third phase of the SuperMUC series of supercomputers.

The new cluster is four times more powerful than its predecessor. SuperMUC-NG consists of 6,480 compute nodes based on Intel® Xeon® Scalable processors, with 311,000 cores in total and a peak performance of 26.7 petaflops. The cluster has 700 TB of RAM, 70 PB of storage, and more than 60 km of cables.

Like its predecessors, SuperMUC-NG is an extremely energy-efficient machine. The cluster is built on high-density Lenovo ThinkSystem SD650 servers. The compute nodes are equipped with direct-to-node water cooling, which accepts inlet water temperatures of up to 50 °C.

Thanks to Lenovo Neptune™ liquid cooling technology, SuperMUC-NG uses 30-40% less energy than comparable systems and uses its waste heat to heat all LRZ buildings. Among other things, the Lenovo system allowed the computing center to reduce CO2 emissions by up to 85%, which in absolute terms equals 30 tons per year.

MareNostrum 4 – optimal real-time computing

Every year, more than 10,000 people visit the Torre Girona chapel on the outskirts of Barcelona to see MareNostrum 4, one of the largest and most powerful supercomputers in the world. The cluster consists of 3,456 Lenovo ThinkSystem SD530 nodes with Intel Xeon Platinum processors and has a computing power of 11 petaflops.

Despite being ten times faster than its predecessor, MareNostrum 4 uses only 30% more energy, drawing 1.3 MW. The cluster is ranked among the top ten systems in Europe in the Green500 list of the most energy-efficient computing systems.
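The efficiency gain implied by these figures is simple arithmetic. The sketch below uses only the numbers quoted above (ten times the performance, 30% more energy, 11 petaflops, 1.3 MW). Treating the 11 petaflops as delivered at the full 1.3 MW draw is a simplifying assumption, so the per-watt value is a rough estimate and not an official Green500 result.

```python
def efficiency_gain(speedup: float, energy_ratio: float) -> float:
    """Performance-per-energy improvement relative to the predecessor."""
    return speedup / energy_ratio


def gflops_per_watt(petaflops: float, megawatts: float) -> float:
    """Convert a performance/power pair to GFLOPS per watt."""
    gflops = petaflops * 1e6   # 1 PFLOPS = 1e6 GFLOPS
    watts = megawatts * 1e6    # 1 MW = 1e6 W
    return gflops / watts


print(f"{efficiency_gain(10.0, 1.3):.1f}x more work per unit of energy")  # 7.7x
print(f"{gflops_per_watt(11.0, 1.3):.1f} GFLOPS/W")                        # 8.5
```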

Power and energy have become critical constraints for HPC systems.

The performance and power consumption of parallel applications depend on a number of factors, such as:

  • Architectural parameters of the computer
  • The configuration of the compute node during execution of the code
  • Characteristics of application software
  • Input data

Selecting the optimal parameters is a very difficult task that is usually performed manually. It is a time-consuming process of choosing resources and then power settings, carried out when the supercomputer is put into operation. Over time, the optimal parameters may change, and they can also vary from node to node within the cluster.

Energy Aware Runtime (EAR), jointly developed by Lenovo and the Barcelona Supercomputing Center, selects the optimal operating mode of the equipment automatically, based on an analysis of previous experience with a particular task.

EAR, as part of Lenovo Neptune technology, supports automatic and dynamic processor frequency selection based on various factors. The performance and power consumption of the supercomputer as a whole are then projected. The last step is to configure the thresholds that define user or system policies for selecting the processor frequency. For example, the system can save power by reducing the frequency or, conversely, limit the loss of performance.
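A threshold-based policy of this kind can be illustrated with a small sketch. It is not EAR's actual API or algorithm: the `Projection` type, the `pick_frequency` function, and the sample numbers are all hypothetical. The sketch assumes that a projected run time and average power are already available for each candidate frequency. It then selects the frequency with the lowest projected energy whose slowdown stays within a configured threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Projection:
    """Hypothetical projection for one candidate CPU frequency."""
    freq_ghz: float
    time_s: float    # projected run time
    power_w: float   # projected average node power

    @property
    def energy_j(self) -> float:
        return self.time_s * self.power_w


def pick_frequency(projections: list[Projection],
                   max_slowdown: float) -> Projection:
    """Lowest-energy candidate whose slowdown, relative to the highest
    frequency, does not exceed max_slowdown (e.g. 0.10 for 10%)."""
    reference = max(projections, key=lambda p: p.freq_ghz)
    allowed = [p for p in projections
               if p.time_s <= reference.time_s * (1.0 + max_slowdown)]
    return min(allowed, key=lambda p: p.energy_j)


# Made-up projections for a single job.
candidates = [
    Projection(2.4, 100.0, 400.0),  # 40,000 J
    Projection(2.1, 104.0, 350.0),  # 36,400 J
    Projection(1.8, 115.0, 310.0),  # 35,650 J
]
print(pick_frequency(candidates, max_slowdown=0.10).freq_ghz)  # 2.1
print(pick_frequency(candidates, max_slowdown=0.20).freq_ghz)  # 1.8
```

Tightening the threshold favors performance, and loosening it favors energy savings, which mirrors the two policy directions described above.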

Andrey Sysoev (Lenovo)

13.04.2020

