Alarming & Alerting in Network Management
A roundup of the network manager’s options for the implementation of alarming & alerting solutions for their monitoring applications.
Why is Alerting Important?
Before we get into the nitty-gritty details of how to implement an alerting strategy, it is worth asking why you should be concerned about alerting at all.
There are two main methodologies you can use when managing your systems.
‘Headless Chicken’ Method
When using the headless chicken method, the burden of monitoring your systems to ensure that they are operating properly falls on you, the network manager.
You set up systems to graph a number of items on your servers, like free disk space & CPU utilization, and then graph things like network utilization on your switch and router ports. Unfortunately, you are the one who needs to go through all of the graphs to make sure everything is operating normally.
The headless chicken method is characterised by you doing lots of work checking that things are working correctly. Checking the system’s operation becomes a routine, almost a mental tic.
When new systems are added this only compounds the problem. Not only will you have to check all of your existing systems, you’ll have to check the new ones too. The headless chicken method just doesn’t scale well. The more systems you manage, the more monitoring you have to do.
Unfortunately, whilst you are doing lots of work, you are not being very productive.
‘Cool as a Cucumber’ Method
When using the Cool as a Cucumber method, the burden of monitoring your systems falls on the systems themselves. Either the system itself performs the monitoring, or you install one system to monitor another, such as a network monitoring package or an environmental monitor for your server room.
Packages are now available that are capable of monitoring a variety of systems and telling you in a variety of ways that something isn’t as it should be.
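As a minimal sketch of the idea, the script below checks free disk space against a threshold and produces a message only when something is wrong, instead of leaving you to read a graph. The function names, the 10% threshold and the message wording are assumptions made for illustration; a real monitoring package would offer far more than this.

```python
import shutil


def low_space_alert(label, total_bytes, free_bytes, min_free_fraction=0.10):
    """Return an alert message if free space is below the threshold, else None."""
    if total_bytes <= 0:
        return f"{label}: unable to determine disk size"
    free_fraction = free_bytes / total_bytes
    if free_fraction < min_free_fraction:
        return f"{label}: only {free_fraction:.0%} free disk space"
    return None


def check_path(path, min_free_fraction=0.10):
    """Check the filesystem that holds `path` using the standard library."""
    usage = shutil.disk_usage(path)
    return low_space_alert(path, usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    # Hypothetical usage: print a message only when the threshold is crossed.
    message = check_path("/")
    if message:
        print(message)
```

Run from a scheduler, a check like this stays silent while all is well, which is the whole point: you hear from it only when it has something to say.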
A good example is a network manager at a large site checking the utilization graphs of a number of switch ports to see whether the server backup was successful. Why doesn’t the backup system tell the network manager that something went wrong with the nightly backup? If the backup system doesn’t know whether it succeeded or not then nothing will.
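If a backup product offers no alerting of its own, one possible workaround is to wrap the job so that its outcome is reported for it. The sketch below assumes the backup is a command that returns a non-zero exit code on failure; `run_and_report` and the `notify` callback are hypothetical names, and the command used in the example is only a stand-in for a real backup job.

```python
import subprocess
import sys


def run_and_report(command, notify):
    """Run `command`; call `notify` with a message if it exits non-zero.

    Returns True when the command succeeded, False otherwise.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        detail = result.stderr.strip() or "no error output"
        notify(f"Backup failed with exit code {result.returncode}: {detail}")
        return False
    return True


if __name__ == "__main__":
    # Stand-in for a real backup command: a Python one-liner that fails.
    failing_job = [sys.executable, "-c", "import sys; sys.exit(3)"]
    run_and_report(failing_job, notify=print)
```

In practice `notify` would send an email or a text message rather than print to the console.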
When purchasing a new system, the degree to which you can manage the system should be close to the top of your list of requirements. Why should you spend time monitoring something when it should be doing the work for you?
Knowing that your systems will tell you when something goes wrong means that you can get on with other, more productive activities.
Alarm Strategies
In Small/Medium Enterprises a typical scenario is to develop a number of disparate systems for monitoring various facets of the IT infrastructure. For instance, there might be one system for monitoring network performance, another for server monitoring and another for server room environmental monitoring. Each system will likely use a separate mechanism for performing the alerting, and each will be configured separately too.
Whilst separate monitoring systems are not ideal they are certainly a lot better than nothing. So long as the number of separate monitoring systems remains relatively low then the job of maintaining the different alerting systems should be relatively easy.
Relatively easy is not the same as effortless, though, because every change has to be repeated in each system. For instance, say the support mobile phone dies and you have to replace it. You will need to go through each separate system and change the mobile phone number in each. If you forget one, you will not receive messages from that system when something fails.
Of course, eventually the number of systems will increase to a point where managing them all becomes a nightmare. At that point, you may need to consider an event correlation system.
Event Correlation
The main advantage that event correlators bring is that they allow you to configure your alerting policies from a single point.
So, for instance, you will be able to configure the event correlator to send messages to the local IT person during business hours and, outside those hours, to a different business unit in a more sociable time zone.
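A time-based routing policy of this kind could be sketched as follows. The business hours, the weekday rule and the contact names are all assumptions chosen for illustration; a real event correlator would express the same policy in its own rule syntax.

```python
from datetime import datetime, time

# Assumed business hours, Monday to Friday.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 30)


def choose_recipient(now, local_contact="local-it",
                     out_of_hours_contact="overseas-support"):
    """Pick who should receive an alert raised at the moment `now`."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = BUSINESS_START <= now.time() < BUSINESS_END
    if is_weekday and in_hours:
        return local_contact
    return out_of_hours_contact


if __name__ == "__main__":
    print(choose_recipient(datetime.now()))
```

Because the policy lives in one function, replacing a contact means changing it in one place rather than in every monitoring system.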
Event correlators are configured via rules. The rules can be very simple or extremely complex. As with any system, the event correlator must be tested before it goes live to ensure that it behaves as expected.
For instance, if your server room environmental monitor detects a high temperature it can send an SNMP trap to the event correlator. The event correlator would then trigger one or more rules you have configured for such an event. One action may be to send an SMS message to the on-call IT technician’s mobile phone.
Any faults in the rules will severely limit the usefulness of the event correlator.
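To make the rule idea concrete, here is a toy rule engine built around the server room temperature example. It is a sketch under stated assumptions, not how any particular product works: the `Event` and `Rule` classes, the 30 degree threshold and the `send_sms` stand-in (which records messages in a list rather than contacting an SMS gateway) are all hypothetical, and receiving the SNMP trap itself is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    source: str   # which system reported the event
    kind: str     # e.g. "temperature"
    value: float  # the reading that came with the event


@dataclass
class Rule:
    name: str
    matches: Callable[[Event], bool]  # condition deciding whether the rule fires
    action: Callable[[Event], None]   # what to do when it fires


def process(event, rules):
    """Run the action of every matching rule; return the names of rules fired."""
    fired = []
    for rule in rules:
        if rule.matches(event):
            rule.action(event)
            fired.append(rule.name)
    return fired


sent_messages = []  # stands in for a real SMS gateway


def send_sms(event):
    sent_messages.append(f"{event.source}: {event.kind} = {event.value}")


RULES = [
    Rule(
        name="server-room-overheat",
        matches=lambda e: e.kind == "temperature" and e.value > 30.0,
        action=send_sms,
    ),
]

if __name__ == "__main__":
    print(process(Event("server-room", "temperature", 35.0), RULES))
    print(sent_messages)
```

Keeping the action separate from the condition also makes the rules easy to test before going live: feed in sample events and check which rules fire, without sending a single real message.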
A number of event correlation systems exist, both open source and commercial. On the open source front there are SEC (Simple Event Correlator) & ruleCore.
Commercial event correlation systems range from low-end systems like Prism Microsystems Inc Event Log Manager to enterprise systems like Tavve EventWatch. As with most software, download a trial of the tools that are within your budget & see which one best suits your requirements.
Conclusion
Alarming & alerting are very important tools in the network manager’s toolkit. You can save yourself a lot of work by using your systems to alert you when things aren’t working as they should. If you cannot justify the time & effort to set up an event correlator, make sure you keep an up-to-date set of documentation on each system. Then, when a change is required, you can methodically go through the systems, updating each one as appropriate.
About the Author
Jack Hughes has extensive experience in the design and implementation of software projects, specialising in network management and communications. In 2003 he co-founded OPENXTRA with Denis Laverty, bringing his skills as the company’s technical and programming expert to the role of Chief Technical Officer.