Skimp on Server Room Air Conditioning? At Your Peril
The ways it goes wrong are many: AC units that fail; people who turn off the AC for a silly reason, such as it being winter and cold outside (so why would you need air conditioning?); people painting the server room without supervision who helpfully cover the servers with plastic; failing cooling fans; and a purchase decision that (as it turned out) left us with a pile of some of the most temperature-temperamental boxes on the planet.
One thing to make clear is that this isn’t just about running ambient air a bit warmer than you should. It is about setting up your facility to remove heat. Remember, the nodes GENERATE heat. It can be the dead of winter outside, -20°C with a cold wind blowing, and a midsize room with 64 nodes in it will still be burning between 6 and 15 kilowatts. That’s enough heat to keep a small log cabin chinked with paper towels toasty warm in the middle of winter.
At this point we have somewhere between 100 and 200 nodes in one medium-sized, fairly well insulated room. When the AC fails, we have a time measured in minutes (usually around 15-20) before the room temperature goes from maybe 15°C to 30°C (on its way through the roof), independent of the temperature outside. No matter WHAT your design, you’ll need enough AC to remove the heat you are releasing into the room as fast as you release it, and this is by far the bulk of your engineering requirement as far as AC is concerned.
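To put numbers on the heat that has to be removed, here is a minimal sketch of the sizing arithmetic. It assumes that essentially all the electrical power the nodes draw ends up as heat in the room. The function names and the per-node wattages (100 W and 235 W, picked only so that 64 nodes roughly span the 6 to 15 kilowatt range mentioned above) are illustrative assumptions, not measurements.

```python
# Rough cooling-load sketch. Function names and per-node wattages are
# illustrative assumptions, not figures from any particular installation.

WATTS_PER_TON = 3516.85          # one ton of refrigeration = 12,000 BTU/h
BTU_PER_HOUR_PER_WATT = 3.412    # 1 W is about 3.412 BTU/h


def heat_load_watts(nodes, watts_per_node):
    """Assume all electrical power drawn by the nodes becomes heat in the room."""
    return nodes * watts_per_node


def cooling_required(nodes, watts_per_node):
    """Return the heat that must be removed continuously, in common units."""
    watts = heat_load_watts(nodes, watts_per_node)
    return {
        "kilowatts": watts / 1000.0,
        "btu_per_hour": watts * BTU_PER_HOUR_PER_WATT,
        "tons": watts / WATTS_PER_TON,
    }


if __name__ == "__main__":
    for watts_per_node in (100, 235):  # assumed low and high per-node draw
        need = cooling_required(64, watts_per_node)
        print(
            f"64 nodes at {watts_per_node} W: "
            f"{need['kilowatts']:.1f} kW, "
            f"{need['btu_per_hour']:.0f} BTU/h, "
            f"{need['tons']:.1f} tons of cooling"
        )
```

Whatever figure comes out is a floor rather than a design: the AC has to remove at least this much heat continuously, whatever temperature the room is held at.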
You cannot really go without AC altogether, and whatever AC you have will still have to remove all that heat. What you’re really comparing, then, is the MARGINAL cost of conditioning the air at a (too) high temperature vs conditioning the air to a safe operating temperature. In my estimation (which could be wrong) the amount you save by keeping the room at 30°C instead of the far safer 20°C will be trivial, maybe $0.05-0.10/watt/year. That is a small fraction of your total expenditure on power for the nodes (in the US, roughly $0.60/watt/year), the AC hardware itself (anywhere from tens to hundreds of thousands of dollars), and the power required to remove the heat you MUST remove just to keep the room temperature stable at ANY temperature (perhaps $0.20/watt/year).
So you are risking all sorts of catastrophic meltdown-type situations to save maybe $5-20 per node per year in operating cost, against an inevitable budget for power for the nodes of $100-200 each per year. I don’t even think you’ll break even on the additional cost of the hardware that breaks from running things hot, let alone the human and downtime costs.
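The cost comparison above is simple arithmetic, sketched here in Python. The per-watt rates are the rough estimates quoted in the text. The 100 W and 200 W node draws and the $0.07 per kWh electricity price are assumptions chosen only for illustration (at that price a watt running all year costs about $0.61, close to the $0.60 figure).

```python
# Per-node yearly cost sketch. The rates are the article's rough estimates;
# node wattages and the electricity price are illustrative assumptions.

HOURS_PER_YEAR = 24 * 365


def dollars_per_watt_year(price_per_kwh):
    """Cost of running one watt continuously for a year at a given kWh price."""
    return price_per_kwh * HOURS_PER_YEAR / 1000.0


def per_node_costs(watts_per_node,
                   power_rate=0.60,      # $/watt/year to power the node
                   cooling_rate=0.20,    # $/watt/year to remove its heat
                   saving_low=0.05,      # $/watt/year saved at 30 C, low guess
                   saving_high=0.10):    # $/watt/year saved at 30 C, high guess
    """Return yearly dollars per node for power, cooling, and the hot-room saving."""
    return {
        "power": watts_per_node * power_rate,
        "cooling": watts_per_node * cooling_rate,
        "saving_low": watts_per_node * saving_low,
        "saving_high": watts_per_node * saving_high,
    }


if __name__ == "__main__":
    print(f"$0.07/kWh is about ${dollars_per_watt_year(0.07):.2f}/watt/year")
    for watts in (100, 200):  # assumed per-node draw
        c = per_node_costs(watts)
        print(
            f"{watts} W node: power ${c['power']:.0f}, cooling ${c['cooling']:.0f}, "
            f"saving from a hot room ${c['saving_low']:.0f}-{c['saving_high']:.0f} per year"
        )
```

Under these assumptions the saving from a hot room stays small next to the power and cooling bill that has to be paid in any case.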
To give you an idea of the magnitude of the problem: the ONE TIME our server room overheated for real, reaching 30-35°C for an extended period (many hours, because the thermal warning system that was supposedly in place had never been tested and did not actually work quite the way it was supposed to), we had node crashes galore and a string (literally) of hardware failures over the next three months -- some immediate and obviously due to the overheating itself, some a week later, two weeks later, four weeks later. When the overheating occurred I had 15 nodes racked that had run perfectly for a year; 3 blew during the event, 4 more failed over the next few months, and 2 more failed after that. Nowadays if the room gets hot we respond immediately, typically getting nodes shut down within minutes of a reported failure and incipient temperature spike.
Power supplies, motherboards, memory chips -- that kind of heat weakens components so that they are more susceptible to failure forever afterward, not just during the event. The overheating need only occur one time, for a few minutes, and you’ll be cursing and bitching for months afterward as you deal with all the stuff that got almost-damaged, including the stuff that isn’t actually broken, just bent out of spec so that it fails, sometimes, under load.
Another thing to think about is that server room temperature is rarely uniform. EVEN if you are running the room at 20°C, there will be places that are 15°C (right in front of the output vents) and other places that are 25-30°C (right behind the nodes). With any unexpected mixing or circulation of the air in a room running at 30°C, you could have 35-40°C ambient air entering some nodes some of the time, and at those temperatures I’d expect failure in a matter of days to weeks, not years. The warmest I’d ever run ambient air is 25°C in a workstation environment and 22°C in a server/cluster environment (where hot spots are more likely to occur).
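Because the room average hides hot spots, it helps to check the air going into each rack rather than a single thermostat reading. The sketch below is one hypothetical way to flag readings above the limits suggested above. The sensor names and sample temperatures are invented for illustration, and how the readings are actually collected is left open.

```python
# Hot-spot check sketch. Sensor names and readings are made up for
# illustration; the limits are the ambient maximums suggested in the text.

SERVER_ROOM_LIMIT_C = 22.0   # server/cluster environment
WORKSTATION_LIMIT_C = 25.0   # workstation environment


def hot_spots(inlet_temps_c, limit_c=SERVER_ROOM_LIMIT_C):
    """Return (sensor, temperature) pairs above the limit, hottest first."""
    over = [(name, temp) for name, temp in inlet_temps_c.items() if temp > limit_c]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    readings = {               # hypothetical inlet temperatures in degrees C
        "ac_vent": 15.0,
        "rack_a_inlet": 21.0,
        "rack_b_inlet": 27.5,
        "rack_c_inlet": 24.0,
    }
    for name, temp in hot_spots(readings):
        print(f"{name}: {temp:.1f} C exceeds {SERVER_ROOM_LIMIT_C:.1f} C")
```

A check like this is only useful if it is exercised regularly, since an alarm that has never been tested may not work when it is needed.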
This article has been reproduced with permission from a posting on the Beowulf mailing list by Robert G. Brown, Duke University Dept. of Physics. We would like to express our thanks to Robert for allowing us to reproduce this article.