Site Monitoring

September 9th, 2008

Reactive Monitoring

To address problems with their web sites quickly, most companies establish a platform and a process for site monitoring.  Normally, the monitors are set to detect failures in the common areas listed below and to automatically notify the individual(s) who can quickly rectify the problem.

* HTTP: Web Server
* POP3: Email Server
* SMTP: Outgoing Email Server
* FTP: File Transfer Protocol
* SSL: Secure Sockets Layer
* DNS: Domain Name Server
* Custom TCP Ports
* Ping
* Web Page Content

To run a reliable site, these monitors must be in place.  In addition, a documented plan for reacting to event notifications, along with an automated escalation procedure, must be established.  I refer to this as reactive monitoring, and it is quite common.
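
As a rough illustration of a reactive monitor with escalation, here is a minimal Python sketch.  It assumes a plain TCP connection test stands in for the protocol-specific checks listed above, and the names `check_tcp`, `run_checks`, `contacts_for`, and the contacts and delays in `ESCALATION_CHAIN` are hypothetical, not taken from any particular monitoring product.

```python
import socket

# Hypothetical escalation chain: (minutes without acknowledgement, who to notify).
ESCALATION_CHAIN = [
    (0, "on-call engineer"),
    (15, "team lead"),
    (60, "operations manager"),
]


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def run_checks(checks, notify, probe=check_tcp):
    """Run each (name, host, port) check and call notify(name) on failure.

    Returns the names of the checks that failed, in order.
    """
    failed = []
    for name, host, port in checks:
        if not probe(host, port):
            failed.append(name)
            notify(name)
    return failed


def contacts_for(minutes_unacknowledged, chain=ESCALATION_CHAIN):
    """Return everyone who should have been notified by now."""
    return [who for after, who in chain if minutes_unacknowledged >= after]
```

A scheduler would call something like `run_checks([("web", "www.example.com", 80)], notify=print)` every minute or so, and use `contacts_for` to decide who else to alert while a failure remains unacknowledged.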

Proactive Monitoring

While the ability to react to problems is extremely important, it is just as important to try to prevent problems from arising in the first place.  That’s why I advocate establishing a more comprehensive set of monitors that can help you identify potential problems, and correct them, before they become apparent to the end user.  This list can become quite extensive.  As a rule, the more traffic and transactions your site handles, the more monitors you should establish.  The goal is to identify thresholds that, when reached, indicate that problems are likely at some point in the future unless action is taken.  Examples include:

* CPU usage
* Disk space usage, physical and virtual, on both database and front-end servers
* Various application services beyond those listed under reactive monitoring
* Bandwidth usage, if you are bandwidth constrained
* Disk utilization to measure physical disk performance constraints
* Tempdb space
* Environmental factors, particularly temperature

These are proactive monitors and are part of an overall strategy to achieve the highest site availability and reliability that you can.
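
The threshold idea can be sketched in a few lines of Python.  This is only an illustration: the metric names and the limits in `THRESHOLDS` are assumptions chosen for the example, and a real deployment would gather the readings from its own monitoring agents.

```python
import shutil

# Hypothetical warning thresholds, in percent.  Tune these for your own site.
THRESHOLDS = {"cpu_percent": 85.0, "disk_percent": 80.0}


def disk_percent_used(path):
    """Return the percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that have reached their warning threshold.

    Metrics without a configured threshold are ignored.
    """
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    )
```

For example, `evaluate({"cpu_percent": 91.0, "disk_percent": disk_percent_used("/")})` reports the CPU metric as soon as it crosses 85 percent, well before the server actually stops responding.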

Performance Monitoring

The third type of monitoring is performance monitoring.  With performance monitoring, you usually contract with a vendor that has a number of geographically dispersed monitoring sites that measure the response time of requests to your servers.  In addition to providing information that can identify performance problems even when everything is running, this service can identify problems that users in specific geographic areas may be having in accessing your site.  If the performance issues are latency related, this information may lead you to adopt a strategy that mitigates the problems for users in the affected areas.
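
To make the measurement concrete, here is a small Python sketch of what each monitoring location might do.  It is an assumption-laden illustration, not a description of any vendor's service: the function names, the region labels, and the one-second limit used in the usage note are all hypothetical.

```python
import statistics
import time
import urllib.request


def measure_response_time(url, timeout=10.0, fetch=None):
    """Return the seconds taken to fetch url, including reading the body.

    fetch is an optional callable taking the URL; it exists so the timing
    logic can be exercised without a network connection.
    """
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=timeout) as response:
                response.read()
    start = time.perf_counter()
    fetch(url)
    return time.perf_counter() - start


def slow_regions(samples_by_region, limit_seconds):
    """Return the regions whose median response time exceeds limit_seconds.

    samples_by_region maps a region label to a list of timings in seconds.
    """
    return sorted(
        region
        for region, samples in samples_by_region.items()
        if statistics.median(samples) > limit_seconds
    )
```

Given timings collected from several locations, `slow_regions({"us-east": [0.2, 0.3], "asia": [1.4, 1.6]}, 1.0)` singles out the region where users are likely to be having a poor experience even though the site is up.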

Historical or Baseline Comparison Monitoring

This is not really a separate type of monitoring, but a means of storing the data values obtained during proactive monitoring so that they can be graphed to show historical trends.  This ability can help identify trends in such things as CPU and disk space usage, so that hardware upgrades can be made before they are needed and emergency upgrades, which are never a good thing, can be avoided.
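
One simple way to turn stored readings into an early warning is to fit a straight line through them and project it forward.  The Python sketch below assumes one disk usage reading per day, expressed as a percentage, and assumes that growth is roughly linear; `days_until_full` is a hypothetical helper name.

```python
def days_until_full(daily_usage_percent, capacity_percent=100.0):
    """Estimate the days remaining until usage reaches capacity_percent.

    Fits a least-squares line through one reading per day and extrapolates
    from the most recent day.  Returns None when there are fewer than two
    readings or when usage is flat or falling.
    """
    n = len(daily_usage_percent)
    if n < 2:
        return None
    days = range(n)
    mean_day = sum(days) / n
    mean_usage = sum(daily_usage_percent) / n
    spread = sum((d - mean_day) ** 2 for d in days)
    covariance = sum(
        (d - mean_day) * (u - mean_usage)
        for d, u in zip(days, daily_usage_percent)
    )
    slope = covariance / spread  # percentage points gained per day
    if slope <= 0:
        return None
    intercept = mean_usage - slope * mean_day
    day_full = (capacity_percent - intercept) / slope
    # Measure from the latest reading, never reporting a negative value.
    return max(0.0, day_full - (n - 1))
```

With readings of 50, 52, 54, 56, and 58 percent on consecutive days, usage grows two points a day, so the estimate is 21 days from the last reading: enough time to schedule an upgrade rather than rush one.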

Usually, site monitoring is an afterthought, and as a result only the basics are implemented initially.  However, it is important to develop a comprehensive monitoring strategy that incorporates reactive, proactive, performance, and historical monitoring.