Critical mistakes, or why data centers fail
Running a data center or a server room is a bit like driving. On an empty road you can take risks and break the rules, and nothing terrible happens. But once the road is busy, any wrong maneuver, unnoticed pothole, or other hazard can cause an accident. The same holds for data centers and server rooms: the higher the workload, the higher the cost of an error.
Today we will cover the design, construction, and operation mistakes that can cause an accident in a data center.
Errors in the design and construction phase
I wrote a separate article on design errors. It mostly lists things that make a data center uncomfortable to operate; here I’ll cover the ones that really hurt.
The project omits entire systems. Some believe a data center can do without a guaranteed power supply system, that is, a diesel generator set (DGU). Once, a customer whose data center project I was auditing asked what fault-tolerance level the facility would get on the Uptime Institute scale without a DGU. I could think of nothing better than to call it Tier 0.
Many treat the DGU as a backup, a spare that can be dropped if necessary. In fact, it should be treated as the primary source, because it is the one power supply we fully control.
Single point of failure. Here are the options:
- No redundancy at all. A breakdown or scheduled maintenance then means losing that system element completely.
- Selective redundancy. This option is reliable only conditionally, because the redundancy level of the whole system is determined by its least redundant element. For example, you have duplicated the power feeds, DGUs, switchboards, and rack PDUs, but not the UPS. If that UPS fails, the redundancy elsewhere will not save anything downstream of it.
Calculation errors. Here are the most painful miscalculations in the power distribution system:
- Incorrect selectivity. Selectivity confines an overload or short circuit to the circuit breaker nearest the fault. To maintain it, breaker ratings should decrease from the power source to the consumer. If a compressor in an air conditioner shorts out, the breaker inside the air conditioner should trip, not the one in the switchboard.
If selectivity is not maintained, the nearest breaker will not do its protective job, and the fault will propagate up the circuit. As a result, an overload or short circuit in the computer room can take out an entire power feed.
- Mismatch between cable cross-section and breaker rating. If the breaker rating is too high for the cable cross-section, the breaker will not trip when the load is exceeded, and the cable will start to overheat or, worse, melt. Select breakers and cables using a table that relates cable cross-section, current, and power.
- No capacity reserve. Designing with no margin is bad practice. Equipment that consumes more than the project assumed, additional equipment that has to be connected, losses on long cable runs: all of this can be absorbed if you add a 30% reserve to the design capacity.
- Starting currents are not taken into account. Equipment with electric motors, pumps, or compressors draws more current at start-up than during normal operation. If the project does not allow for this, you will not be able to start several air conditioners or chillers at once: the system will not cope with the load, and the breakers will trip.
- The battery charging current is not taken into account. The UPS uses about 10% of its capacity to recharge the batteries. If this additional load is ignored, the UPS will not be able to switch from battery power back to utility power: each time it returns to the utility supply and starts recharging the batteries, the breakers will trip.
- Incorrect routing of cables in sleeves between rooms. This is less about calculations and more about construction. There are two points:
1. All phases (L1, L2, L3) must be laid in one sleeve together with the neutral; otherwise the cables start to overheat.
2. When several single-core cables are used per phase, make sure the cables lie on the trays in the correct sequence (see the corresponding section of the PUE, the electrical installation rules). Do not twist or braid them for looks if you do not want them to overheat.
- Incorrect estimate of outdoor temperature conditions. Designs are often based on average temperature statistics for the city, taken from unverified sources and ignoring the specifics of the building. If the roof gets very hot in the sun, the real temperature around the equipment will be several degrees higher.
- Poor air circulation between outdoor units. When units are packed tightly and air cannot flow freely, the outdoor units of the air conditioners start drawing in each other’s hot exhaust air. It may not be that hot outside, but the temperature at the outdoor unit inlet will be high. The same happens if the outdoor units are placed next to the DGU exhaust pipe, above the DGU, or next to transformers. Check whether there are any additional heat sources near the outdoor units.
- Actual power consumption and cooling capacity of the air conditioners calculated incorrectly. The power consumption stated in an air conditioner’s data sheet does not always hold. The manufacturer shows attractive figures? Take the time to read the documentation yourself and find out under what conditions those figures apply, and what the consumption will be at maximum load. If at peak load the air conditioners consume more than the project planned for, you risk being left without cooling. Leave a reserve.
- The same applies to cooling capacity: it varies with the length of the refrigerant lines, the outdoor temperature, and the operating parameters.
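Several of the electrical checks in the list above reduce to simple arithmetic. Below is a minimal Python sketch, assuming the rules of thumb from this article (ratings that decrease from source to consumer, a 30% reserve, about 10% of UPS capacity for battery charging). The function names and example figures are hypothetical, and a real design would rely on breaker trip curves and vendor data rather than rated values alone.

```python
def is_selective(breaker_ratings_a):
    """Return True if ratings strictly decrease from source to consumer.

    Simplification: compares rated currents only, not trip curves.
    """
    return all(up > down
               for up, down in zip(breaker_ratings_a, breaker_ratings_a[1:]))


def with_reserve(design_load_kw, reserve=0.30):
    """Design load plus a capacity reserve (30% by default)."""
    return design_load_kw * (1 + reserve)


def ups_input_load_kw(it_load_kw, ups_capacity_kw, charge_fraction=0.10):
    """Load seen upstream of the UPS while it recharges its batteries.

    Assumption: charging takes about 10% of UPS capacity on top of the IT load.
    """
    return it_load_kw + ups_capacity_kw * charge_fraction


# Hypothetical chain: main switchboard -> distribution board -> PDU -> outlet
print(is_selective([630, 250, 63, 16]))   # ratings fall at every step
print(is_selective([250, 250, 63, 16]))   # two equal ratings in a row
print(with_reserve(100))                  # 100 kW design load plus reserve
print(ups_input_load_kw(it_load_kw=80, ups_capacity_kw=120))
```

The upstream breakers and the DGU would then be sized against the larger figures, not against the bare IT load.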
Errors in operation
A data center built to an exemplary design can still be ruined by improper operation. Below are the errors in managing the engineering infrastructure that can lead to accidents.
Unbalanced load across phases and feeds. Cable and breaker capacity is used efficiently when the load is distributed evenly across the phases. When one or two phases are overloaded and the others are underloaded, a so-called phase imbalance occurs. Because of it, the available power is used inefficiently. In the worst case, it leads to a breaker trip and cable overheating.
With feeds, the story is as follows: in a data center with 2N power redundancy, when one power feed is lost, the second takes over its load. For the remaining feed to withstand the doubled load, each feed must be loaded to no more than half of its rated power, starting currents included. Otherwise the redundant second feed will not save you.
Both conditions must be met at the same time. Monitoring the system at as many points as possible, from the transformers down to the PDUs, helps track the load distribution.
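Both conditions are easy to check from measured values. The sketch below is one possible way to do it, with hypothetical function names and made-up readings; the imbalance metric (largest deviation from the mean phase current) is one common convention, not the only one, and the feed loads should be peak values that already include starting currents.

```python
def phase_imbalance(phase_currents_a):
    """Largest deviation from the mean phase current, as a fraction of the mean."""
    mean = sum(phase_currents_a) / len(phase_currents_a)
    return max(abs(i - mean) for i in phase_currents_a) / mean


def survives_feed_loss(feed_a_kw, feed_b_kw, feed_rating_kw):
    """In a 2N scheme, one feed must be able to carry the load of both."""
    return feed_a_kw + feed_b_kw <= feed_rating_kw


currents = [16.0, 8.0, 6.0]  # hypothetical readings for L1, L2, L3, in amperes
print(f"imbalance: {phase_imbalance(currents):.0%}")
print("2N holds:", survives_feed_loss(40, 45, feed_rating_kw=100))
print("2N holds:", survives_feed_loss(60, 50, feed_rating_kw=100))
```

Checking the sum of both feeds against one feed’s rating is equivalent to the half-load rule, but it also covers the case where the two feeds are loaded unequally.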
Breaker settings:
- To maintain selectivity, a breaker’s rated current is deliberately reduced through its trip settings. Later, during operation, when additional load has to be connected, people forget about the settings and go by the breaker’s nominal rating. As a result, if the connected load exceeds the setpoint, the breaker trips.
Instructions and procedures of the operations team:
- The server room or data center is in a pre-emergency state, and the engineer does not understand what to do or whom to call. Even worse is when the engineer on duty decides to do nothing. Procedures and instructions save you from confusion and lost time in an emergency.
But procedures differ: if one was written just to tick a box, was never updated, and was never tested in drills, you can assume there is no procedure at all.
Even if all scenarios have been rehearsed, procedures and instructions should always be at hand, on paper and in electronic form, so that no time is wasted searching for them during an accident. Hang posters with brief instructions at the engineer’s workstation, where the response to an accident begins. Operating instructions for equipment should be kept right in the equipment cabinet. You can add checklists to the instructions so that the engineer marks off each action; that lowers the chance of skipping a step.
An equipment layout diagram also helps localize a problem in the data center quickly; it must be kept up to date and within the engineers’ reach.
- Marking. It may seem that marking has nothing to do with accidents, but the link is direct. For example, switching a tripped breaker back on takes a couple of minutes. Without diagrams and labels, though, it turns into a real quest with good prospects of long downtime. Or another situation: some equipment has to be disconnected for repair. We open the panel, and all the breakers look identical and carry no labels. Judge for yourself how likely it is that the wrong one gets switched off.
- Monitoring. In a small server room, monitoring of the engineering infrastructure may be absent altogether, or may not cover all systems. Then you face situations like these: the air conditioning shuts down on Sunday evening, but the engineer finds out only on Monday morning, when the server room is already a sauna. Or utility power fails and the diesel generator does not start, and nobody notices until alerts arrive about problems on one of the server power feeds. In either case, a large-scale accident could have been prevented by minimal monitoring with SMS or email alerts.
Data center monitoring has its own nuances: it has to be configured properly, for example with correct threshold values. If the dashboard is permanently red with critical errors, monitoring is misconfigured. Such monitoring quickly stops being informative for the engineer: false alarms pile up, and real accidents go unnoticed among the routine alerts.
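The effect of badly chosen thresholds is easy to see on a handful of readings. This is a minimal sketch with hypothetical function names; the temperatures and thresholds are made-up illustrations, not recommended values.

```python
def alert_level(value, warning, critical):
    """Classify a reading against two thresholds (higher is worse)."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def critical_share(readings, warning, critical):
    """Fraction of readings classified as critical."""
    levels = [alert_level(r, warning, critical) for r in readings]
    return levels.count("critical") / len(levels)


# Hypothetical cold-aisle temperatures in degrees Celsius
temps = [22.5, 23.0, 24.1, 27.5, 31.0]

# Thresholds with headroom: only the real outlier is critical
print(critical_share(temps, warning=27.0, critical=30.0))

# Thresholds set too tight: most readings are "critical", so the panel stays red
print(critical_share(temps, warning=22.0, critical=23.0))
```

A critical share that stays high during normal operation is a sign that the thresholds, not the data center, need fixing.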
What else can lead to an accident?
Let’s see what can go wrong in cooling, in power (the power distribution system, the uninterruptible power system, the DGU), and in the fire suppression system.
- In the cooling system, it can all start with the failure of several air conditioners, for example because the outdoor units are clogged with poplar fluff. If the hall is heavily loaded and there is no longer enough cooling, local hot spots appear. Freon-based air conditioners are very sensitive to inlet temperature, so as it rises, the other air conditioners start shutting down on fault. This domino effect leaves the hall without cooling.
- For chiller systems, the worst case is loss of pressure in the circuit, for example due to leaks. Then the whole system stops, not just a single air conditioner. To catch this in time, monitor the pressure, install more leak sensors, and consider providing make-up for the circuit with storage tanks and additional pumps.
- Uninterruptible power supply. Apart from UPS failure, which maintenance and timely repair can prevent, there is the interesting matter of a discrepancy between the actual battery runtime of the UPS and the estimate on its display. I mean, of course, the case where the display shows more than there actually is. For example, during maintenance of the switchboards between the DGU and the UPS, when the entire load is carried by the batteries, the operations team counts on one figure but actually gets a couple of minutes less.
You can avoid this embarrassment by periodically running a controlled battery discharge and plotting the discharge curve. The actual runtime is calculated from that curve, and the readings on the UPS display are calibrated against it. To be safe, round the resulting time down. It is like a clock: better that it runs fast and you arrive at the meeting early than that you arrive late.
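The rounding-down rule can be written out explicitly. A small sketch, assuming several controlled discharge tests have been run and that the shortest result is taken as the planning figure (that choice, the function names, and the numbers are assumptions for illustration):

```python
import math


def conservative_runtime_min(measured_runtimes_min):
    """Shortest measured runtime, rounded down to whole minutes."""
    return math.floor(min(measured_runtimes_min))


def display_overestimate_min(displayed_min, measured_min):
    """How many minutes the UPS display overstates the measured runtime."""
    return max(0.0, displayed_min - measured_min)


tests = [14.6, 13.2, 13.9]  # hypothetical discharge test results, in minutes
print(conservative_runtime_min(tests))
print(display_overestimate_min(displayed_min=15, measured_min=min(tests)))
```

The planning figure goes into the maintenance procedure, so that work on the switchboards is scheduled against the measured runtime rather than the display reading.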
Guaranteed power supply:
Failures can occur at any stage of DGU operation:
- The main power went down, but no start signal reached the DGU;
- The DGU did not start;
- The DGU started but did not take the load;
- The DGU ran for a while and shut down;
- The fire suppression system in the DGU container was falsely triggered by a sensor;
- The fuel ran out or was of poor quality.
To make the DGU work without surprises, regularly run test starts under load.
Fire suppression system:
- False triggering. You can protect against it by switching the system to semi-automatic mode: before the gas is released, a specially trained person checks whether there really is a problem where the sensor triggered. You never know: someone may have accidentally bumped a sensor under the raised floor and set off the alarm.
- The system did not trigger when it was needed. The cure is testing.
- Zone errors: the sensor triggered in one room, but the gas was released in another. The remedy is the same: testing.