How to prepare for accidents in the data center

Is it necessary to prepare for disasters in data centers? Definitely yes. If your company processes a significant amount of information, it must protect that information against any potential threat. The main difficulty with data centers is that their equipment must run 24 hours a day to serve customers; otherwise the customers will simply leave. The equipment cannot be switched off or moved elsewhere for a while. Fortunately, enough experience has accumulated over the past 20 years to understand what you need to be ready for.

What to prepare for?

The first step in any disaster prevention and recovery plan is to identify the types of disasters that may occur. Below are examples of the major disasters that organizations should prepare for first.

  • Fire is one of the most common disasters a data center can face. Organizations should prepare for approaching forest fires, for fire spreading from neighboring buildings (especially if the data center stands on the grounds of an industrial facility) and, of course, for a fire inside the building.

The best-known fire with the most devastating consequences occurred on March 10, 2021 at OVHcloud. The entire SBG2 data center in Strasbourg burned down, and the equipment and customer data could not be restored. The cause has not yet been established.

  • Floods – can be caused by heavy precipitation, snowmelt, dam failures or other natural phenomena. Another kind of flooding to watch for is damaged pipes inside the data center.

In 2012, Hurricane Sandy hit the United States, and many small data centers of commercial companies were simply flooded. In general, local floods from burst pipes and roof leaks are more common in data centers than other incidents.

  • Earthquakes can be absolutely devastating and often strike without warning. Even small earthquakes can seriously damage unprotected equipment.

As a rule, large data centers are built in areas with low seismic activity, use earthquake-resistant construction and are fitted with special earthquake-resistant cabinets.

  • Tornadoes and hurricanes – Powerful tornado winds can knock out power, sever data links, topple trees onto a building and break windows. Everyone knows that computing equipment cannot withstand water, but hurricanes can also cause structural damage, fires, power outages and much more.
  • Terrorist attacks and riots – Data centers support a large part of the economy, which makes them a potential target for terrorists. Computing equipment is fragile, easy to damage and almost impossible to restore afterwards, so both large and small data centers can attract not only terrorists but also rioters.
  • Military operations and war - Data centers are critical infrastructure facilities and are therefore likely to be among the first targets in the event of war.

Disasters that do not directly affect the data center, but affect its operation

Even if a disaster does not hit the data center directly, the data center can still suffer from its indirect effects. For example, a natural disaster in a neighboring part of the city can lead to power outages or leave employees unable to reach the workplace because of the traffic situation. Damage at an electrical substation or power plant will inevitably lead to power outages, which can cascade throughout the region. And because events in such cases develop like an avalanche, the data center may be unable to keep running on its own generators once fuel deliveries are disrupted. A typical example is what happened in Manhattan, USA, during Hurricane Sandy in 2012. During the general evacuation of the city, one of the Internap data centers continued to run on backup fuel, which was delivered by huge fuel trucks. Here is how a company representative described it:

"Hospitals and critical care facilities were at the top of the list of local fuel suppliers," Orchard said. "They had an advantage over everyone else, which is understandable." As a result, Internap ordered two fuel trucks and pumps from its supplier in Baltimore. When the fuel level in the data center's tank dropped to a critical level, the servers were shut down and stayed without power for more than 12 hours.

The fuel trucks arrived on October 30 at about 20:00, but getting the fuel to the upper floor was no easy task. Adapters had to be made to connect the fuel trucks' hoses to the fuel tank. "I had to look for the right parts and either weld them together or hastily connect six parts," Orchard said. Eventually the generator was started again, and power returned to the clients' servers at around 23:30 that same day. The servers stayed on generator power for 10 days, until late in the evening on November 10, when the site was finally switched back to electricity from the substation.

In total, Internap burned about 65 tons of fuel to power the data center's servers from the generator and, in such a difficult situation, had about 3-4 hours of downtime. Internap's employees could even be called lucky: unlike their competitors at Peer1, they did not have to carry 20-liter cans of gasoline up the stairs to the 18th floor during the same period.
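The figures in this account allow a rough estimate of generator fuel consumption. The sketch below is only an illustration: it assumes that all 65 tons were burned during the final generator run, takes "late in the evening" to mean 23:00, and uses a hypothetical 10-ton on-site tank. None of these assumptions come from Internap.

```python
from datetime import datetime

# Figures from the account above. Attributing all 65 tons to the final
# generator run is an assumption made for illustration only.
FUEL_USED_TONS = 65.0
RUN_START = datetime(2012, 10, 30, 23, 30)  # power returned to the servers
RUN_END = datetime(2012, 11, 10, 23, 0)     # "late in the evening", assumed 23:00


def average_burn_rate(fuel_tons: float, start: datetime, end: datetime) -> float:
    """Return the average fuel consumption in tons per hour."""
    hours = (end - start).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("end must be later than start")
    return fuel_tons / hours


def runtime_hours(tank_tons: float, burn_rate_tons_per_hour: float) -> float:
    """Return how many hours a tank lasts at the given burn rate."""
    if burn_rate_tons_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return tank_tons / burn_rate_tons_per_hour


rate = average_burn_rate(FUEL_USED_TONS, RUN_START, RUN_END)
# Hypothetical 10-ton tank, chosen only as an example
tank_runtime = runtime_hours(10.0, rate)
print(f"average burn rate: {rate:.3f} t/h")
print(f"a 10 t tank lasts about {tank_runtime:.0f} h")
```

An estimate like this shows how quickly an on-site tank runs dry and therefore how early a refueling contract has to kick in.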

As the experience of data centers in the USA in 2012 shows, in a critical situation data centers face not only staff shortages due to evacuation (which in that case was not mandatory) but also a question of priorities, since all government resources are directed to saving people and critical medical facilities. In such a case, rescuing a drowning data center is left to the company's own employees and is further complicated by the evacuation of the population.

Data center evacuation

History offers no known examples of a data center being evacuated from a disaster zone, but as the number of data centers and their importance grow, developing effective evacuation and recovery plans is becoming an increasingly urgent task. Two aspects of data center rescue need to be separated here: evacuating equipment to protect it from destruction before an event occurs (for example, when fires are approaching or the front line is drawing near), and transferring operations to external resources after a negative event has occurred and the data center can no longer operate in the same place (for example, after destruction by weapons of mass destruction, the eruption of a supervolcano, a tsunami, a disaster at a nuclear power plant, etc.).

A man-made high-altitude electromagnetic pulse (HEMP) can cause serious damage and leave terrestrial information technology systems completely isolated. Because the pulse also propagates upward, it is potentially dangerous for satellites in low orbits. Traditionally, one important use of satellite communications is as a backup connection when terrestrial communications fail or are seriously overloaded. After a HEMP, however, some satellites may become unavailable, which could prevent their use as a backup connection in the post-disaster environment. From the point of view of IT infrastructure, HEMP is therefore one of the most severe catastrophic events.

Usually, on the eve of a disaster or immediately after it, the load on communication lines increases tenfold, which may be because other communication channels are damaged and their traffic is rerouted. You therefore need to understand clearly that, when time is short, the only equipment that can realistically be evacuated is the storage media (HDDs, SSDs, LTO cartridges). Evacuating insured equipment is impractical except in wartime (war is usually not covered by insurance), and physically removing SSDs, HDDs and LTO cartridges is the only way to move large amounts of data quickly. First of all, then, make sure you have such backup data that is not in a "hot" state and can be restored on other equipment, along with cases for transporting the drives and encryption passwords stored separately from the equipment.
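The trade-off between a congested link and physically carrying drives can be sketched with simple arithmetic. Every number below (500 TB of data, a 10 Gbit/s link with only a tenth of its capacity usable, 50 drives, 3 minutes to pull each drive, 6 hours on the road) is a hypothetical value chosen for illustration, not a figure from any real incident.

```python
def network_transfer_hours(data_tb: float, link_gbps: float, usable_share: float = 1.0) -> float:
    """Hours needed to send data_tb terabytes over a link.

    usable_share is the fraction of the link that is actually available
    (0 < usable_share <= 1), e.g. 0.1 when the line is heavily congested.
    """
    if not 0 < usable_share <= 1:
        raise ValueError("usable_share must be in (0, 1]")
    bits = data_tb * 1e12 * 8                        # decimal terabytes to bits
    bits_per_second = link_gbps * 1e9 * usable_share
    return bits / bits_per_second / 3600


def courier_transfer_hours(drive_count: int, minutes_per_drive: float, travel_hours: float) -> float:
    """Hours needed to pull the drives, pack them and drive them to the recovery site."""
    return drive_count * minutes_per_drive / 60 + travel_hours


# Hypothetical scenario, for illustration only
over_link = network_transfer_hours(500, link_gbps=10, usable_share=0.1)
by_road = courier_transfer_hours(drive_count=50, minutes_per_drive=3, travel_hours=6)
print(f"over the congested link: {over_link:.0f} h")
print(f"drives carried by road:  {by_road:.1f} h")
```

Under these assumed numbers the network transfer takes weeks while the drives arrive within a working day, which is why transport cases and separately stored encryption passwords matter so much.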

How to prepare for a disaster in advance

Preparing for a disaster starts at the very beginning of data center design. Whether a new facility is being built from scratch or an existing room is simply being converted, it requires some thought. The following key disaster planning measures can significantly reduce the risk to any data center:

  • Proper fire extinguishing system - Fire extinguishing systems are required in almost all buildings. Choosing a system that will not damage the computer equipment inside the data center is a painstaking process. With a gas fire extinguishing system, you need to choose special low-noise nozzles so that the sound of the discharge does not damage hard drives through vibration. There are many solutions for protecting servers from fire, such as fluoroketone-based fire protection fluids.
  • Enhanced security on the ground and in the air - helps minimize the risk of terrorist attacks, prevents damage by a disgruntled employee and protects the site during riots or civil unrest. Given the realities of recent years, the possibility of a drone attack should also be considered, so the security service should be able to bring drones down.
  • Seismic shelves for server racks - In earthquake-prone areas, specially designed shelves for server racks help hold servers, routers, switches and other equipment in place so that nothing falls to the floor. They also reduce the vibration transmitted from the ground to the equipment, which helps to avoid damage.
  • Raised floors - often used in data centers for routing cables, raised floors can also protect expensive equipment from flooding. In addition, drains and pumps must be provided to remove water from the building quickly.
  • Disaster recovery location - if the data center can no longer operate, the company must have a plan for relocating to another data center or to the cloud.

For a data center, staff training and a detailed action plan for any natural disaster are of paramount importance. You need to understand that as soon as zero hour strikes, nothing will go the way you would like: at best, the staff will not know what to do; at worst, they will have been evacuated or will be unable to get to work. Security will not let a vehicle onto the site because, for the same reason, nobody could issue it a pass. You will need fuel and there will be none; when you find fuel, there will be no tanker truck, no hose (as the people at Internap discovered) or no canister. And if you decide to take out the hard drives with the data, you will find no transport cases, no vehicles and no labels to mark the drives. Even the screwdrivers needed to unscrew the servers or the keys to the drive locks can vanish at exactly the wrong moment, in keeping with Murphy's law.
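One way to catch such gaps before zero hour is to keep the evacuation kit as a checked inventory. The sketch below is a minimal illustration: the item names follow the paragraph above, but the quantities and the `on_hand` stock are hypothetical.

```python
# Required kit, based on the items mentioned above. Quantities are hypothetical.
REQUIRED_KIT = {
    "drive transport cases": 4,
    "drive labels": 200,
    "screwdrivers": 6,
    "drive lock keys": 2,
    "fuel canisters": 10,
    "fuel hose adapters": 2,
    "vehicle passes": 3,
}


def missing_items(required: dict[str, int], on_hand: dict[str, int]) -> dict[str, int]:
    """Return each item that is short, mapped to the quantity still missing."""
    return {
        item: needed - on_hand.get(item, 0)
        for item, needed in required.items()
        if on_hand.get(item, 0) < needed
    }


# Hypothetical result of a quarterly stock check
on_hand = {
    "drive transport cases": 4,
    "drive labels": 50,
    "screwdrivers": 6,
    "fuel canisters": 10,
    "vehicle passes": 1,
}

shortfall = missing_items(REQUIRED_KIT, on_hand)
for item, quantity in shortfall.items():
    print(f"missing {quantity} x {item}")
```

Running a check like this on a schedule, rather than during the emergency, is the point: the list is only useful if someone restocks what it reports.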

Ron Amadeo
31.03.2023

