On Wednesday, October 22nd, digital.forest experienced two electrical interruptions lasting between 6 and 8 milliseconds each. The first event occurred at 13:11, and the second at 19:30 Pacific Daylight Time. These interruptions were caused by a mechanical contact switch fault inside one of our UPS units. This fault occurred on a single phase of the three-phase power within that UPS. The fault caused a voltage drop to be passed into the datacenter along that particular phase. Most computers connected to that electrical phase experienced the voltage drop as a brief interruption of power. Roughly 17% of the servers at digital.forest were affected. Discovering the root cause, repairing the UPS systems, and bringing the facility back to normal operations required 3 days of hard work by the digital.forest staff, VECA Electric, and MGE, the UPS manufacturer.
The events were triggered as we switched from Bypass mode (power routed around the UPS) to Protected mode (power routed through the UPS) following scheduled preventative maintenance. This maintenance, which is performed by the UPS manufacturer twice each year, involves taking the UPS system offline, powering it down, inspecting all components, and checking each individual battery.
Following maintenance, the UPS system must be transferred from Bypass mode to Protected mode. This switch is normally a near-zero-risk operation. Switchover is handled in such a way that power is not interrupted, and failure during this operation is exceedingly rare. The MGE Service Manager noted that he has seen this operation fail only one other time in his career. digital.forest has performed this switch operation twice per year as a routine part of our maintenance procedures, without incident.
Upon completing the preventative maintenance, our UPS vendor brought the system back online. During that process a mechanical contact switch inside one of the units, UPS 2, did not close completely and so did not provide continuous electrical flow. When we first performed the operation at 13:11, the UPS signaled a fault, and we experienced a brief interruption of power on a single phase. The UPS system automatically went offline again, properly reverting to Bypass mode. Unfortunately, the interruption on the single phase lasted long enough to affect some servers downstream.
At this point neither digital.forest nor its vendors knew that a component had failed, only that the switch to Protected mode was unsuccessful. According to the experts on-site, there was no apparent logical reason for the failure. MGE advised that we make some changes to our electrical distribution as a precautionary measure in preparation for a second transfer operation. At 19:30 power was again routed through the UPS system, and we experienced a second interruption identical to that of 13:11. digital.forest then ordered a stop to any further switch attempts and commenced a complete evaluation of UPS 1 & 2. MGE immediately dispatched a senior UPS engineer to our facility. Over the next two days comprehensive diagnosis and testing were performed on both UPS units, and the problems within UPS 2 were identified and repaired. After an inverter and several control and communications cards were replaced, the root cause was traced to the fault in the contact switch.
You can view photographs of the faulty contact switch and some of the damaged circuitry here:
An overall view of the contact switch mechanism.
A close-up view of the specific Phase-A contact that failed.
A close-up of a damaged communications circuit board in UPS 2.
UPS 2 is a relatively new unit, purchased in July of 2007. The physical failure of one of its contact switches is highly unusual. In fact, the manufacturer's specifications rate this component for ten million cycles, whereas we only engage it twice each year. The failed contact switch was inspected during every preventative maintenance, including the one performed earlier that same day, and showed no signs of trouble.
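To put the rated cycle count in perspective, the arithmetic can be sketched as follows. This is only an illustration using the two figures quoted above (ten million rated cycles, two engagements per year); the function and constant names are hypothetical, and a cycle rating describes expected wear, not a guarantee against an individual defect.

```python
# Illustrative arithmetic only. The two figures come from the post;
# the names below are hypothetical and not part of any vendor tool.
RATED_CYCLES = 10_000_000   # manufacturer's rating for the contact switch
CYCLES_PER_YEAR = 2         # one engagement per semi-annual maintenance


def years_to_rated_life(rated_cycles: int, cycles_per_year: int) -> float:
    """Years of use needed to reach the rated cycle count."""
    if cycles_per_year <= 0:
        raise ValueError("cycles_per_year must be positive")
    return rated_cycles / cycles_per_year


def fraction_used_per_year(rated_cycles: int, cycles_per_year: int) -> float:
    """Share of the rated life consumed by one year of use."""
    if rated_cycles <= 0:
        raise ValueError("rated_cycles must be positive")
    return cycles_per_year / rated_cycles


if __name__ == "__main__":
    years = years_to_rated_life(RATED_CYCLES, CYCLES_PER_YEAR)
    share = fraction_used_per_year(RATED_CYCLES, CYCLES_PER_YEAR)
    print(f"Years to reach rated cycles: {years:,.0f}")
    print(f"Rated life used per year: {share:.7%}")
```

At this usage rate the switch had consumed a negligible share of its rated life, which is why ordinary wear was not a plausible explanation for the fault.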
Following the installation of new parts, we again closely inspected and tested every contact switch (there are 6 total) in both UPS 1 & 2. We also re-inspected and tested every other connection and circuit board inside both of these UPS units. After this comprehensive inspection we tested the UPS units with load banks at 100% power and tested the transfer operation under artificial load to validate the diagnosis and repair. At 22:10 on Friday, October 24th, the UPS system was successfully brought online, and the datacenter was restored to normal operating conditions.
While this event was traced to a small component, many larger components of our facility and our procedures performed as intended:
- By design, the bypass equipment properly and automatically re-routed power when the UPS system faulted. This action contained the interruption to a very short duration, and to a limited portion of the datacenter.
- High-level experts were immediately dispatched by our UPS vendor when it became clear that something was out of the ordinary, and parts were quickly flown in, reducing our repair time by days.
- The backup power generation equipment carried our full electrical load continuously and flawlessly for three days.
- Our contracted diesel fuel vendor performed as we expected, making deliveries on demand with quality product. We topped off our fuel tank on 3 separate occasions during the event.
- Most importantly, our staff remained on-site, responsive and available to assist you with your servers, as well as to assist our vendors with the restoration of our facility to normal operations.
digital.forest remains committed to providing superior service and to continually examining and maintaining all of the systems upon which our customers rely. We deeply regret any inconvenience or interruption of service this event may have caused. We appreciate the patience of our customers and the close cooperation of our partners in working through this event, and we welcome any additional questions or comments you might have.
Kind Regards,
The digital.forest Executive Team
posted by Chuck G. at 07:11 PM on Thursday, October 30, 2008