As we explained in a related post, many IT managers recount stories of downtime in their distributed server rooms and remote wiring closets caused by unexpected but rather routine events. When analysing these stories, a common thread emerges: a lack of information leads to human error, which in turn causes the downtime.
Consider these statistics:
- IDC estimates there are 2.9 million server rooms and wiring closets in the United States alone.
- More than 70% of reported data centre outages can be directly attributed to human error, according to the Uptime Institute.
Video surveillance and sensors
Scalable monitoring and automation systems can collect, organise and distribute critical alerts and surveillance videos. By monitoring power, cooling, the backs and fronts of racks, and the environment, these systems can generate instant fault notifications, enable quick assessment of the situation, and help resolve critical infrastructure events that can adversely affect IT system availability.
Video surveillance systems can be tied to motion sensors so that detected motion triggers the camera to pan the area and send the video to an authorised administrator, who can quickly rectify situations such as contractors shrink-wrapping live servers.
A camera management system typically allows for tracking of facilities personnel, vendors, security personnel, custodians and other visitors who come into the server room or remote wiring closet. An administrator may opt to remotely log into the system and observe the actions of anyone who is in the room. Some systems can be equipped with speakers so that the administrator can deliver instructions or provide warnings to the visitor.
Intelligent rack outlets
Intelligent rack outlets, also known as rack-mounted PDUs, are long thin strips of electrical outlets mounted to the inside back of a rack. The devices allow users to remotely cycle power to locked-up equipment and to configure the sequence in which power is turned on or off for each outlet. Sequencing lets administrators predetermine which piece of equipment is turned on first, so that other equipment dependent on that unit will function properly.
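To make the idea of outlet sequencing concrete, here is a minimal sketch of how a power-on order could be derived from equipment dependencies. The device names and the `power_on_order` helper are hypothetical and are not taken from any particular PDU product; the sketch simply assumes each device lists the equipment it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def power_on_order(depends_on):
    """Return a power-on order in which every device follows what it depends on.

    `depends_on` maps a device name to the set of devices that must be
    powered before it. All names here are hypothetical examples.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical rack: the servers need the switch, and the app server
# also needs the storage array to be up first.
depends_on = {
    "app-server": {"storage-array", "network-switch"},
    "storage-array": {"network-switch"},
    "network-switch": set(),
}

order = power_on_order(depends_on)

# Powering down is assumed to be the reverse of powering up.
power_off_order = list(reversed(order))

print("Power on: ", order)
print("Power off:", power_off_order)
```

In practice the resulting order would be entered as per-outlet delays in whatever configuration interface the rack PDU provides; the sketch only shows the ordering logic.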
The monitoring system prevents overloads by measuring actual consumption through the intelligent rack outlets, giving administrators the information they need to make decisions about where to place new equipment.
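As a rough illustration of how measured consumption can guide placement decisions, the sketch below ranks racks by remaining headroom. The `RackPdu` class, the rack names, the readings, and the 80% safety fraction are all assumptions made for the example, not values from any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass
class RackPdu:
    """Hypothetical reading from one intelligent rack outlet strip."""
    name: str
    capacity_amps: float   # rated capacity of the strip
    measured_amps: float   # actual consumption reported by the strip


def headroom_amps(pdu, safety_fraction=0.8):
    """Amps still available before the (assumed) safety limit is reached."""
    return pdu.capacity_amps * safety_fraction - pdu.measured_amps


def racks_that_fit(pdus, new_load_amps, safety_fraction=0.8):
    """Return racks that can take the new load, most headroom first."""
    candidates = [
        p for p in pdus if headroom_amps(p, safety_fraction) >= new_load_amps
    ]
    return sorted(
        candidates,
        key=lambda p: headroom_amps(p, safety_fraction),
        reverse=True,
    )


# Illustrative readings only.
pdus = [
    RackPdu("rack-a", capacity_amps=30.0, measured_amps=18.5),
    RackPdu("rack-b", capacity_amps=20.0, measured_amps=15.0),
    RackPdu("rack-c", capacity_amps=30.0, measured_amps=10.0),
]

names = [p.name for p in racks_that_fit(pdus, new_load_amps=4.0)]
print(names)
```

The safety fraction is a parameter so that a site can apply its own derating policy rather than the value assumed here.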
Monitoring and automation software
A management and automation system provides administrators with a wealth of data that can help reduce downtime caused by human error, including:
- Alarms and notifications when thresholds are exceeded, via email, text message, phone call or whatever method the user chooses.
- Equipment status checks for everything from servers to batteries. Remember that the failure of a single battery can result in the loss of the critical load. The cost of replacing one or two batteries is minimal compared to experiencing a failure that causes the closet or server to crash.
- Reporting analytics: Data gathered by a monitoring system can be converted into customised reports for the IT administrator to review. Such reports can alert administrators to situations such as temperature fluctuations, who has been at which rack for how long, and how much load is on a particular UPS.
- Mass configuration: Administrators can issue mass change orders for all devices profiled into the central monitoring and automation system, such as locking 50 rack doors at once – perhaps to protect them from overzealous cleaning staff.
- Control: Detailed monitoring and automation system data helps give administrators the information they need to take control when problems arise. For example, a system can map the power path and physical system relationships and dependencies, to help identify the source of a problem. A system may also illustrate the consequence of a particular device failure on rack-based equipment, to help identify a critical business impact.
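The power-path mapping described in the last item can be pictured as a simple graph walk. The following is a minimal sketch, assuming the system knows which device feeds which; the `feeds` map, the device names, and the `affected_by` helper are hypothetical.

```python
from collections import deque


def affected_by(failed, feeds):
    """Return every device downstream of `failed` in the power path.

    `feeds` maps a device to the devices it supplies power to.
    Uses a breadth-first walk so each device is visited once.
    """
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in feeds.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical power path: one UPS feeds two rack PDUs,
# which in turn feed the IT equipment.
feeds = {
    "ups-1": ["pdu-a", "pdu-b"],
    "pdu-a": ["switch-1", "server-1"],
    "pdu-b": ["server-2"],
}

impact = affected_by("pdu-a", feeds)
print(sorted(impact))
```

Running the same walk from `ups-1` would list both PDUs and all three downstream devices, which is the kind of view that helps an administrator judge the business impact of a single failure.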