Guide to Server Room Monitoring
Server room monitoring draws on several disciplines, which makes it a fascinating topic in its own right. Apart from the obvious environmental concerns, there is also physical access control and monitoring, as well as infrastructure monitoring, to consider.
Taken together, they form a rich blend of tools and techniques to understand.
Most people are aware that their servers need to be kept within temperature limits in order to achieve optimal performance and reduce errors or malfunctions. But the server room also holds valuable business data, so any environmental threat can be disastrous for an organisation that relies on that data in order to function. Imagine a legal firm, bank or GP surgery experiencing a flood and, despite having backups, the ensuing delay in getting the system up and running again. Clients may also like to be reassured that their personal data is safe and secure, and that as an organisation you are doing your level best to keep it that way. Let's take a look at some of the most important environmental threats, and why it is important to monitor them and have early warning.
Temperature is the environmental variable that most users consider first. IT equipment will operate over a very wide range of conditions, and equipment specifications will often show a wide range over which the equipment is approved. These figures should not, however, be taken as recommendations but rather as the limits within which the equipment will continue to function. Equipment in an environment anywhere near the operating extremes will experience a higher than usual number of malfunctions. More sensible limits are the practical ones intended for prolonged use of equipment. For practical purposes your servers, routers and switches should operate in an ambient temperature in the range 17 °C to 27 °C.
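As a rough illustration of checking readings against the practical 17 °C to 27 °C range given above, here is a minimal Python sketch. The sensor names and readings are hypothetical, and how you obtain readings from your own sensors is not covered here.

```python
# Practical ambient range quoted in the text above (degrees Celsius).
RECOMMENDED_MIN_C = 17.0
RECOMMENDED_MAX_C = 27.0


def classify_temperature(reading_c: float) -> str:
    """Return 'low', 'ok' or 'high' relative to the recommended range."""
    if reading_c < RECOMMENDED_MIN_C:
        return "low"
    if reading_c > RECOMMENDED_MAX_C:
        return "high"
    return "ok"


# Hypothetical sensor names and readings, for illustration only.
readings = {"rack-a-inlet": 22.5, "rack-b-inlet": 29.1}

for sensor, value in readings.items():
    status = classify_temperature(value)
    if status != "ok":
        print(f"ALERT {sensor}: {value:.1f} C is {status}")
```

In practice you would probably warn a degree or two inside these limits, so that there is time to react before the room leaves the recommended range.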
Servers and other equipment emit a lot of heat, and a faulty or broken down air conditioning system could turn your server room into a furnace in as little as an hour or two. When OPENXTRA first started, and environmental monitoring was not as widely adopted as it is today, we regularly took calls from users who returned after the weekend to find the aircon had broken down and the server room was cooking! Even if you do get there in time, running servers for any period above operating temperature puts great heat stress on equipment and can result in unreliable running and intermittent faults far into the future. Duke University had just one meltdown in their server room, where temperatures rose to between 30 °C and 35 °C, yet this one incident caused them to experience a whole series of hardware failures over the following three months.
Humidity is the second parameter that users think about. Strictly speaking we should talk of Relative Humidity (RH), because the amount of moisture the air can hold depends on its temperature. Too high a value can cause corrosion and failure of electronic components. Tape devices are especially susceptible, so if you use any such devices keep the RH on the low side, but not too low. Too little RH is also a potential problem, with electrostatic build up and discharge being a risk. Again, tape devices are particularly sensitive.
Humidity can be maintained between 20% and 80%, but aim for stable conditions with as little change as possible over time. Keep temperature changes limited to no more than 5 °C per hour and RH changes to less than 5% per hour. Under no circumstances should there be any condensation.
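The stability limits above can be checked against logged readings. The following Python sketch is one possible approach, not a definitive method: it takes the largest spread seen in any one-hour window as the hourly change. The sample data and function names are hypothetical.

```python
from datetime import datetime, timedelta

# Limits quoted in the text above.
MAX_TEMP_CHANGE_C_PER_HOUR = 5.0
MAX_RH_CHANGE_PCT_PER_HOUR = 5.0
RH_MIN_PCT, RH_MAX_PCT = 20.0, 80.0


def max_change_within_hour(samples):
    """Largest spread (max - min) seen in any one-hour window.

    samples is a chronological list of (datetime, value) pairs.
    """
    worst = 0.0
    for i, (t_end, _) in enumerate(samples):
        # Values recorded in the hour leading up to (and including) t_end.
        window = [v for t, v in samples[: i + 1] if t_end - t <= timedelta(hours=1)]
        worst = max(worst, max(window) - min(window))
    return worst


def rh_in_range(rh_pct):
    """True if a relative humidity reading is inside the 20% to 80% band."""
    return RH_MIN_PCT <= rh_pct <= RH_MAX_PCT


# Hypothetical half-hourly log entries, for illustration only.
start = datetime(2024, 1, 8, 9, 0)
times = [start + timedelta(minutes=30 * n) for n in range(4)]
temp_samples = list(zip(times, [21.0, 23.5, 27.0, 27.5]))
rh_samples = list(zip(times, [45.0, 46.0, 44.0, 47.0]))

if max_change_within_hour(temp_samples) > MAX_TEMP_CHANGE_C_PER_HOUR:
    print("ALERT: temperature changed by more than 5 C within an hour")
if max_change_within_hour(rh_samples) >= MAX_RH_CHANGE_PCT_PER_HOUR:
    print("ALERT: RH changed by 5% or more within an hour")
if not all(rh_in_range(v) for _, v in rh_samples):
    print("ALERT: RH outside the 20% to 80% band")
```

With these made-up figures the temperature rises 6 °C between 09:00 and 10:00, so only the temperature alert is printed.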
Over cooling is not only expensive in energy terms, it also risks changing humidity levels too fast. It is, though, a fairly common problem. An over large air conditioning system may be more of an issue than one that is too small.
Flood / Water Leakage
You would be pretty unlucky to have your server room flood due to weather conditions, but we have customers whose server rooms are down in the basement and who can only get insured for damage if they install flood sensors. More simply, many server rooms have raised floors. Suppose the air conditioning broke down and started to leak: where do you think the water goes? Stories of cables swimming in water are not uncommon.
With the increased density of modern servers and equipment, combined with so much power and heat, the risk of localised overheating is raised. Monitoring for smoke just makes sense.
Airflow is less frequently a concern, but in any high density server room it can be very important to make sure that cold and hot aisles are kept separate. You do not want to feed warm air from one device into the intake of another. Monitoring airflow can also give early warning of air conditioner failure.
Physical Access Management
Physical access to the server room is normally restricted to authorised personnel. There are plenty of things to go wrong without the added problems of people plugging and unplugging cables. Limiting access can be as simple as locking the door, but other methods are also very common.
Card Reader Systems
Access is often by a physical key, an electronic keypad, or a smart card used with either a swipe reader or a proximity reader. Monitoring access allows you to check that doors are not left open, or that too many people are entering or leaving too often. We have seen temperature issues caused by constant opening of doors in server rooms.
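To show how door monitoring of this kind might work, here is a minimal Python sketch that scans a log of door events for a door held open too long and counts recent openings. The event format, the two-minute limit and the ten-openings limit are all assumptions made for illustration; real access control systems expose their logs in their own formats.

```python
from datetime import datetime, timedelta

# Assumed limits, for illustration only.
MAX_OPEN = timedelta(minutes=2)
MAX_OPENINGS_PER_HOUR = 10


def door_held_open(events, now):
    """Return (opened_at, duration) for each time the door stayed open too long.

    events is a chronological list of (datetime, state) pairs, where state
    is 'open' or 'closed'. A door still open at 'now' is also checked.
    """
    too_long = []
    opened_at = None
    for when, state in events:
        if state == "open" and opened_at is None:
            opened_at = when
        elif state == "closed" and opened_at is not None:
            if when - opened_at > MAX_OPEN:
                too_long.append((opened_at, when - opened_at))
            opened_at = None
    if opened_at is not None and now - opened_at > MAX_OPEN:
        too_long.append((opened_at, now - opened_at))
    return too_long


def openings_in_last_hour(events, now):
    """Count 'open' events in the hour before 'now'."""
    one_hour = timedelta(hours=1)
    return sum(
        1
        for when, state in events
        if state == "open" and timedelta(0) <= now - when <= one_hour
    )


# Hypothetical door log.
t0 = datetime(2024, 1, 8, 14, 0)
events = [
    (t0, "open"),
    (t0 + timedelta(seconds=20), "closed"),
    (t0 + timedelta(minutes=10), "open"),
    (t0 + timedelta(minutes=15), "closed"),  # held open for five minutes
    (t0 + timedelta(minutes=40), "open"),    # still open
]
now = t0 + timedelta(minutes=41)

for opened_at, duration in door_held_open(events, now):
    print(f"ALERT: door opened at {opened_at:%H:%M} stayed open for {duration}")
if openings_in_last_hour(events, now) > MAX_OPENINGS_PER_HOUR:
    print("ALERT: door is being opened unusually often")
```

Correlating these events with the temperature log is one way to confirm whether frequent door opening is behind a temperature problem.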
Electronic Door Locks
Card readers are frequently teamed with electronic door release mechanisms and locks. Again, careful monitoring allows you to check who is entering and leaving, and how often access is required.
It may be useful to record the access to the room using motion detectors, video cameras etc. as part of a surveillance and recording system.
Night time and out of hours restrictions may also apply and physical access may also be part of an Intruder Detection System.
Many peripheral devices that attach to the network can also be managed and can raise alerts when various parameters change. One standard way to do this is to use Simple Network Management Protocol (SNMP). SNMP is supported by most equipment manufacturers.
Imagine a typical scenario where there is a battery backup system in the form of an Uninterruptible Power Supply (UPS). In the event of a power outage the battery backup switches in and keeps the equipment running until power is restored, or at least until a controlled shutdown can take place. When the battery backup kicks in, the UPS can send an SNMP Trap to the monitoring system, and an alert can then be raised based on this event.
If a network attached air conditioning system fails, or if the power supply to a critical device fails, they too could raise Trap alerts.
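The trap-to-alert step described above can be sketched in Python as a simple lookup. This sketch assumes that some trap receiver (not shown) has already decoded the trap and passes on its trap OID as a dotted string together with the sender's address. The on-battery OID is, to the best of our knowledge, the upsTrapOnBattery notification from the standard UPS-MIB (RFC 1628), but check it against your own device's MIB. The air conditioning OID is a made-up placeholder, since such traps are vendor specific.

```python
# Believed to be upsTrapOnBattery in the standard UPS-MIB (RFC 1628).
# Many UPS vendors use their own enterprise MIB instead, so verify this.
UPS_TRAP_ON_BATTERY = "1.3.6.1.2.1.33.2.1"

# Made-up placeholder OID: real air conditioning traps are vendor specific.
AIRCON_FAILURE_TRAP = "1.3.6.1.4.1.99999.1.0.1"

TRAP_ALERTS = {
    UPS_TRAP_ON_BATTERY: "UPS is running on battery",
    AIRCON_FAILURE_TRAP: "Air conditioning unit reported a failure",
}


def alert_for_trap(trap_oid, source):
    """Return an alert message for a known trap OID, or None if unknown.

    trap_oid is the dotted trap OID handed over by the (assumed) receiver;
    source is the address of the device that sent the trap.
    """
    oid = trap_oid.lstrip(".")  # some tools print OIDs with a leading dot
    message = TRAP_ALERTS.get(oid)
    if message is None:
        return None
    return f"[{source}] {message}"


# Hypothetical incoming traps.
for oid, source in [
    (".1.3.6.1.2.1.33.2.1", "10.0.0.20"),
    ("1.3.6.1.6.3.1.1.5.1", "10.0.0.31"),  # coldStart, not in the table
]:
    alert = alert_for_trap(oid, source)
    if alert:
        print(alert)
```

Unknown traps are ignored here for brevity; a real deployment would more likely log them so that new device types are not silently missed.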
Power consumption in the datacentre is always a major issue. It has already been mentioned that over cooling can be very expensive and with energy prices continuously rising this is likely to be an ongoing concern.
Monitoring of power consumption is also important in many other ways. It can help in identifying waste, in the provisioning of power to users, in identifying heavy and light users, and it can assist in the planning of future additions or changes. In some circumstances it may be useful to bill individuals or departments for their energy consumption. All of this can be achieved by careful monitoring.
Intelligent Power Distribution Units (PDUs) can log usage by individual socket, generate usage statistics in kWh, in CO2 generated and so on, and can raise alerts as required. In some cases individual sockets can be switched on and off remotely to reduce energy consumption.
Intelligent power sensors can be fitted between devices and the mains power inlet to monitor usage, in single and three phase systems, again generating statistics and alerts.
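The usage statistics and departmental billing described above come down to simple arithmetic on logged power readings. The Python sketch below is illustrative only: the tariff, the CO2 emission factor, the sampling interval, the socket readings and the department names are all hypothetical, and a real PDU may well report kWh totals directly.

```python
# Assumed figures, for illustration only. Substitute your own tariff and
# your electricity supplier's published emission factor.
PRICE_PER_KWH = 0.30     # currency units per kWh (hypothetical)
KG_CO2_PER_KWH = 0.20    # kg of CO2 per kWh (hypothetical)
SAMPLE_INTERVAL_S = 900  # one power reading every 15 minutes (hypothetical)


def energy_kwh(power_samples_w, interval_s):
    """Approximate energy in kWh from evenly spaced power readings in watts.

    watts x seconds gives joules, and 1 kWh is 3,600,000 joules.
    """
    return sum(power_samples_w) * interval_s / 3_600_000


def usage_by_department(sockets, interval_s):
    """Total kWh per department from per-socket readings."""
    totals = {}
    for department, samples in sockets.values():
        kwh = energy_kwh(samples, interval_s)
        totals[department] = totals.get(department, 0.0) + kwh
    return totals


# Hypothetical per-socket readings: socket number -> (department, watts).
sockets = {
    1: ("finance", [200, 200, 200, 200]),
    2: ("finance", [100, 100, 100, 100]),
    3: ("engineering", [400, 420, 380, 400]),
}

for department, kwh in usage_by_department(sockets, SAMPLE_INTERVAL_S).items():
    cost = kwh * PRICE_PER_KWH
    co2 = kwh * KG_CO2_PER_KWH
    print(f"{department}: {kwh:.2f} kWh, cost {cost:.2f}, CO2 {co2:.2f} kg")
```

The same per-socket totals can be sorted to pick out heavy and light users, or tracked over time to support planning of future additions.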
The server room is a key component in your business. Careful and accurate monitoring of the environmental conditions in it will maximise the return on your investment, extend the useful life of your equipment, improve the reliability of its components, minimise your running costs and help you avoid expensive problems. It will assist you in planning future changes and upgrades and help you sleep easy at night, safe in the knowledge that should anything untoward occur you will be notified in time to take corrective action before the damage becomes too severe.