digital.forest Technical Support
News archive: October 2008

On Wednesday, October 22nd digital.forest experienced two electrical interruptions lasting between 6 and 8 milliseconds each. The first event occurred at 13:11, and the second at 19:30 Pacific Daylight Time. These interruptions were caused by a mechanical contact switch fault inside one of our UPS units. This fault occurred on a single phase of the three-phase power within that UPS. The fault caused a voltage drop to be passed into the datacenter along that particular phase. Most computers connected to that electrical phase experienced the voltage drop as a brief interruption of power. Roughly 17% of the servers at digital.forest were affected. Discovering the root cause, repairing the UPS systems, and bringing the facility back to normal operations required 3 days of hard work by the digital.forest staff, VECA Electric, and MGE, the UPS manufacturer.
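For a sense of scale, the length of each interruption can be compared with one cycle of AC power. The sketch below assumes a 60 Hz supply, which is standard on the North American grid but is not stated in this report; the function name is only for illustration.

```python
# Compare a power interruption with the length of one AC cycle.
# Assumption: 60 Hz mains (the report does not state the frequency).

MAINS_HZ = 60.0


def cycles_lost(interruption_ms: float, mains_hz: float = MAINS_HZ) -> float:
    """Return how many AC cycles fit inside an interruption of this length."""
    cycle_ms = 1000.0 / mains_hz  # one full cycle, about 16.67 ms at 60 Hz
    return interruption_ms / cycle_ms


for ms in (6, 8):
    print(f"{ms} ms is about {cycles_lost(ms):.2f} of one cycle")
```

Under that assumption each event lasted somewhere between a third and a half of a single cycle.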

The events were triggered as we switched from Bypass mode (power routed around the UPS) to Protected mode (power routed through the UPS) following a scheduled preventative maintenance. This maintenance, which is performed by the UPS manufacturer twice each year, involves taking the UPS system offline, powering it down, inspecting all components, and checking each individual battery.

Following maintenance the UPS system must be transferred from Bypass mode to Protected mode - this switch is a near-zero risk operation. Switchover is handled in such a way that power is not interrupted, and failure during this operation is exceedingly rare. The MGE Service Manager noted that he has seen this operation fail only one other time in his career. digital.forest has performed this switch operation twice per year as a routine part of our maintenance procedures, without incident.

Upon completing the preventative maintenance our UPS vendor brought the system back online. During that process a mechanical contact switch inside one of the units, UPS 2, did not close completely to provide continuous electrical flow. The first time we performed the operation, at 13:11, the UPS signaled a fault and experienced a brief interruption of power on a single phase. The UPS system automatically went offline again, properly reverting to Bypass mode. Unfortunately the interruption on the single phase was long enough in duration to affect some servers downstream.
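The behavior described above, where a transfer to Protected mode falls back to Bypass if any contact fails to close, can be sketched as a small model. This is purely illustrative and is not MGE's control logic; the phase labels, the `contacts_closed` input, and the function name are assumptions made for the example.

```python
from enum import Enum


class Mode(Enum):
    BYPASS = "bypass"        # power routed around the UPS
    PROTECTED = "protected"  # power routed through the UPS


def attempt_transfer(contacts_closed: dict[str, bool]) -> tuple[Mode, list[str]]:
    """Model a Bypass-to-Protected transfer.

    contacts_closed maps a phase label to whether its contact closed fully
    (a hypothetical stand-in for whatever the UPS actually senses).
    Returns the resulting mode and the list of phases that faulted.
    """
    faulted = [phase for phase, closed in contacts_closed.items() if not closed]
    if faulted:
        # Any open contact counts as a fault: revert to the bypass path.
        return Mode.BYPASS, faulted
    return Mode.PROTECTED, []


# One contact failing to close sends the system back to Bypass.
mode, faulted = attempt_transfer({"A": False, "B": True, "C": True})
print(mode.value, faulted)
```

The model leaves out the part that hurt us: the brief voltage drop on the faulted phase happens before the fallback completes.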

At this point neither digital.forest nor its vendors knew that a component had failed - only that the switch to Protected Mode was unsuccessful. According to the experts on-site, there was no apparent logical reason for the failure. MGE advised that we make some changes to our electrical distribution as a precautionary measure in preparation for a second transfer operation. At 19:30 power was again routed through the UPS system, and we experienced a second interruption identical to that of 13:11. At this point digital.forest ordered a stop to any further switch attempts and commenced a complete evaluation of UPS 1 & 2. MGE immediately dispatched a senior UPS engineer to our facility. Over the next two days comprehensive diagnosis and testing were performed on both UPS units, and the problems within UPS 2 were identified and repaired. After replacing an inverter and several control and communications cards, the root cause was traced to the fault in the contact switch.

You can view photographs of the faulty contact switch, and some of the damaged circuitry here:
An overall view of the contact switch mechanism.
A close-up view of the specific Phase-A contact that failed.
A close-up of a damaged communications circuit board in UPS 2.

UPS 2 is a relatively new unit, purchased in July of 2007. The physical failure of one of its contact switches is highly unusual. In fact, the manufacturer's specifications rate this component for ten million cycles, whereas we only engage it twice each year. The failed contact switch was inspected during every previous preventative maintenance and showed no signs of trouble, including the preventative maintenance performed earlier that same day.
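To put that rating in perspective, here is a quick calculation using only the two figures quoted above. The variable names are just for illustration.

```python
RATED_CYCLES = 10_000_000  # manufacturer's rating quoted above
CYCLES_PER_YEAR = 2        # the switch is engaged twice each year

# Years of twice-yearly use needed to reach the rated cycle count.
years_to_rated_life = RATED_CYCLES / CYCLES_PER_YEAR
print(f"{years_to_rated_life:,.0f} years")
```

At our rate of use the component would take five million years to reach its rated cycle count, which is why wear was never a plausible explanation.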

Following the installation of new parts, we again closely inspected and tested every contact switch (there are 6 total) in both UPS 1 & 2. We also re-inspected and tested every other connection and circuit board inside both of these UPS units. After this comprehensive inspection we tested the UPS units with load banks at 100% power as well as tested the transfer operation under artificial load to validate the diagnosis and repair. At 22:10 on Friday, October 24th the UPS system was successfully brought online, and the datacenter was restored to normal operating conditions.

While this event was traced to a small component, many larger components of our facility and our procedures performed as intended:


  • By design, the bypass equipment properly and automatically re-routed power when the UPS system faulted. This action contained the interruption to a very short duration, and to a limited portion of the datacenter.

  • High-level experts were immediately dispatched by our UPS vendor when it became clear that something was out of the ordinary, and parts were quickly flown in, reducing our repair time by days.

  • The backup power generation equipment carried our full electrical load continuously and flawlessly for three days.

  • Our contracted Diesel fuel vendor performed as we expected, making deliveries on demand with quality product. We topped our fuel tank on 3 separate occasions during the event.

  • Most importantly our staff remained on-site, responsive and available to assist you with your servers, as well as assist our vendors with the restoration of our facility to normal operations.


Digital.forest remains committed to providing superior service and to continually examining and maintaining all of the systems upon which our customers rely. We deeply regret any inconvenience or interruption of service this event may have caused. We appreciate the patience of our customers and close cooperation of our partners in working through this event, and welcome any additional questions or comments you might have.

Kind Regards,

The digital.forest Executive Team

posted by Chuck G. at 07:11 PM on Thursday, October 30, 2008
Categories:

UPDATE: 11/06/08 12:19 AM PST

At approximately 12:10 AM PST today we experienced two short outages of this same upstream (45 seconds and 38 seconds). We still have not received an RFO for the first outage and will be escalating both events for resolution. During this event, as with the first, our other upstreams handled all of our traffic.

10/30/08 12:03 PM PST
At approximately 12:03 PM PST today one of our upstream connections went down and came back up about 45 seconds later. During this event our other upstreams took over the traffic load. There should have been minimal impact from this event.

We are investigating with the upstream as to the cause of this event and will update as soon as we have more information.

posted by Kyle at 04:08 PM on Thursday, October 30, 2008
Categories: Network

At 10:10 PM PDT we threw the bypass switch and brought UPS 1 & 2 back online. The transfer went seamlessly and everything is working within normal parameters. The facility is back on grid power with fully functioning UPS protection.

Thank you again for your patience and understanding as we dealt with this emergency situation.

posted by Chuck G. at 01:25 AM on Saturday, October 25, 2008
Categories: Emergency Maintenance

Starting a new support blog post on a new day to keep things easy to read.

Status as of 9:00 AM Friday, October 24, 2008:

* Facility remains on generator power. We have enough fuel on-site for a run until Monday.

* UPS 3 is online and functioning properly

* UPS 1 & 2 remain offline. UPS 1 is working properly but can not be brought online until UPS 2 is repaired.

* UPS 2 showed inverter errors whenever it was loaded last night. The spare parts package contained the wrong inverter. The correct one is on its way from California on a commercial flight this morning.

* When it arrives we will install and begin testing procedures again with an eye on going back to grid power sometime tonight.

Update 2:45 PM: Two bits of news...

* We have found a faulty connector inside of UPS 2. The fault is very minor, but it could have contributed to some of the issues we have experienced. We are replacing the whole assembly to be on the safe side. We'll post photos soon.

* We just finished topping off our fuel supply with 1300 gallons of Diesel and now have enough for a continuous generator run through Tuesday afternoon (October 28th).

Update 4:00 PM: Several updates...

* The replacement inverter is installed in UPS 2 and we have begun load testing that unit. If all goes well we will sync UPS 1 & 2 and begin load testing them together.

* The replacement connector component has an ETA of 7:30 PM. Once that arrives we should be able to proceed swiftly.

* We are planning to switch back to grid power this evening between 8 and 9 PM PDT.

* Cummins Northwest, our generator maintenance contractor, just finished an inspection of our generator system and proclaimed it in excellent condition.

Update 4:30 PM: UPS 2 is confirmed operational by the MGE tech on-site. We are testing the units synced now.

Based on a client request, the switch back to grid power has been postponed until 10:00 PM PDT.

Update 8:00 PM: All the required parts are here. We are finalizing tests and preparing to perform the switch back to grid power tonight just after 10 PM.


Update 8:35 PM: Here are some photos of the faulty connector. Earlier I stated that it was from the static switch; it is in fact from inside UPS 2. It was just my misunderstanding of what was reported to me, and I've corrected that statement above. This connector contains three large switches, and one of them has a slim gap which is making a poor connection.

Above: The connector, with a socket wrench for scale. The connector in the photo is in an "open" state.

Below: Here is a close up shot with the connector closed. The top red arrow is pointing to a solid connection. The bottom red arrow is pointing to the bad connector portion. Note the shadow cast from the camera's flash. There should be no shadow visible if the connector is tight.

We are certain that this is the root cause of our power event on Wednesday. The connection was loose enough to cause poor conduction, which is why we saw a voltage drop followed by a fall back to a bypass state. It would also explain the damage to the comm board in UPS 2 we showed you yesterday, and the damage to the inverter. UPS 2 was purchased just over a year ago and should not have failed in this manner. Since UPS 1 is an identical unit we have removed the same connector and inspected it. UPS 1's connector is good and solid. We tested UPS 1's connector in UPS 2 and it works perfectly. The technician from MGE is replacing this connector in UPS 1 right now with a new one, and we'll begin testing both of the units soon.


Update 9:30 PM: We are "a go." Both UPS units pass all tests, several times. Everything is working as it should under load. We are preparing for coming off generator power at just after 10 PM.

Update 10:00 PM: 10:10 PM is our specific target time for bringing the repaired UPS system back online.


We will continue to update this post as we learn more information. Thank you for your continued patience and understanding as we deal with this emergency situation.


Chuck Goolsbee
V.P. Technical Operations
digital.forest, Inc.


posted by Chuck G. at 11:37 AM on Friday, October 24, 2008
Categories: Emergency Maintenance

A critical security issue has been discovered in Microsoft Windows. You can read information about it here. Microsoft recommends that customers apply the suggested update immediately.

posted by Chuck G. at 02:04 AM on Friday, October 24, 2008
Categories: Security Alerts

We will be taking all of our Windows based shared hosting servers offline between 21:00 and 22:00 today to perform maintenance.

Thank you for your patience.

Update October 23, 2008 22:00: We will be extending the maintenance on our Windows servers through 00:00 tonight.

posted by digital.forest at 07:24 PM on Thursday, October 23, 2008
Categories: Hosting Servers

A briefing for you on the current situation regarding our power event yesterday:

* UPS 1 & 2 remain offline this morning, and the facility remains on generator power.

* UPS 3 is online and functioning normally. We migrated what circuits we could to this UPS system last night between 7 and 10 PM.

* We have enough fuel on-site for over four days of run time, but to be safe we have scheduled fuel deliveries for the next several days to ensure supply.

* We are performing our own hourly generator checks around the clock to ensure proper operation.

* Our generator maintenance vendor is also scheduled to arrive daily and inspect the operation of the system.

* Until we have a fully functioning UPS system we can not risk a power transfer back to the grid. We believe the fault is specific to UPS 2. Technicians will investigate further today.

* At 9 o'clock this morning we are meeting with our UPS manufacturer to determine next steps.


Update 11 AM: The UPS manufacturer has dispatched a high-level technician and is airfreighting a complete set of replacement parts for our UPS systems from California right now. ETA is mid-afternoon. We just received a shipment of load banks which are being installed on our roof. Here is the plan:

* Isolate UPS 1 & 2 from our facility input and load.

* Test every subsystem of each UPS, isolate and repair fault(s).

* Test each UPS under artificial load. If they fail, fix. If they pass...

* Sync the UPS units back into their parallel configuration, retest under artificial load. If they fail, fix. If they pass...

* Test again and confirm smooth transfer of artificial load.

* Plan for reinsertion of UPS 1 & 2 into our facility load.
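The "if they fail, fix; if they pass, move on" structure of the plan above is a gated sequence: each stage must pass before the next begins. A minimal sketch of that pattern follows. The stage names, the callables, and the attempt limit are all hypothetical and only illustrate the flow, not the actual test procedures used on-site.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool], Callable[[], None]]


def run_until_pass(name: str, test: Callable[[], bool],
                   fix: Callable[[], None], max_attempts: int = 5) -> int:
    """Run test(); on failure call fix() and try again.

    Returns the number of attempts the stage needed. The attempt limit
    is an arbitrary safeguard added for this example.
    """
    for attempt in range(1, max_attempts + 1):
        if test():
            return attempt
        fix()
    raise RuntimeError(f"{name}: still failing after {max_attempts} attempts")


def run_plan(stages: List[Stage]) -> List[Tuple[str, int]]:
    """Run stages in order; a stage must pass before the next one starts."""
    results = []
    for name, test, fix in stages:
        results.append((name, run_until_pass(name, test, fix)))
    return results


# Example: the first stage fails once, is fixed, then passes.
unit = {"repaired": False}
plan = [
    ("test each UPS under load", lambda: unit["repaired"],
     lambda: unit.update(repaired=True)),
    ("retest in parallel", lambda: True, lambda: None),
]
print(run_plan(plan))
```

Nothing is reconnected to the facility load until every earlier stage has passed.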

Our initial ETA for full restoration of protected power is sometime after midnight tonight. We will update this schedule with a specific time once more details are known. Meanwhile we continue to run the facility on generator power. Even though we have only consumed about 20% of our available fuel we still plan on refueling early this afternoon.
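As a rough cross-check of the fuel figures, total run time can be estimated from the fraction consumed so far. This sketch assumes a constant burn rate and about 25 hours on generator at the time of this update (from roughly 10 AM Wednesday to 11 AM Thursday); both are assumptions for illustration, not measured values.

```python
def estimated_total_runtime_hours(elapsed_hours: float,
                                  fraction_consumed: float) -> float:
    """Estimate how long a full tank lasts, assuming a constant burn rate."""
    if not 0 < fraction_consumed <= 1:
        raise ValueError("fraction_consumed must be in (0, 1]")
    return elapsed_hours / fraction_consumed


ELAPSED_HOURS = 25.0      # assumed time on generator so far
FRACTION_CONSUMED = 0.20  # "about 20%" from the update above

total = estimated_total_runtime_hours(ELAPSED_HOURS, FRACTION_CONSUMED)
remaining = total - ELAPSED_HOURS
print(f"total ~{total:.0f} h ({total / 24:.1f} days), remaining ~{remaining:.0f} h")
```

On those assumptions a full tank lasts about 125 hours, a little over five days, which is in line with the earlier estimate of over four days of run time.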

Update 12:25 PM: Fuel delivery truck is here and we're topping up the fuel tank.

Update 1:30 PM: Senior-level technician from the UPS manufacturer has arrived on-site.

Update 3:45 PM: Root Cause Suspect...

This is part of the circuit board of UPS 2 that communicates with the static switch. A replacement board is en route.

We're not completely satisfied, however, and are continuing to give each part of UPS 2 close examination and testing. Stay tuned for more details.

Update 7:15 PM: Both UPS units have been inspected and we're quite certain the item identified above is the only issue to be addressed. The replacement part is arriving at Sea-Tac airport very soon. Meanwhile we are working to test UPS 1 under artificial load. Once the replacement board arrives we will test UPS 2 as well. We should have a specific ETA for turn-up by the time we next update the site.

Informational Update 7:55 PM: We have successfully transferred a 100% artificial load on and off UPS 1 three times. We're certain that UPS 1 is operating properly. Repair on UPS 2 will begin soon.

Update 9:10 PM: Barring any new issues we are on track for a transfer back to grid power between midnight and 1 AM tonight. The replacement part has arrived and we should begin testing UPS 2 very soon.

Update 11:25 PM: Some further issues have been uncovered in testing UPS 2, so we have postponed a transfer back to grid power indefinitely. No transfer will take place tonight.


We will continue to update this post as we learn more information. Thank you for your continued patience and understanding as we deal with this emergency situation.

posted by Chuck G. at 11:26 AM on Thursday, October 23, 2008
Categories: Emergency Maintenance

Currently we are working to resolve faults with the following servers in order to return them to normal operating status. All services on these servers will be unavailable during this down time. Please know that we are working around the clock to restore these servers to normal operating status and that we will update this page as we have more information.

power.forest.net
boysenberry.forest.net
ara.wwwnexus.com
mercury.wwwnexus.com

Update October 23rd, 2008 01:24 PDT: Both the boysenberry.forest.net and mercury.wwwnexus.com servers have returned to normal operation.

Update October 23rd, 2008 03:24 PDT: The ara.wwwnexus.com server has returned to normal operation.

posted by digital.forest at 03:27 AM on Thursday, October 23, 2008
Categories: Hosting Servers, ara.wwwnexus.com, boysenberry.forest.net, mercury.wwwnexus.com, power.forest.net

At 13:11 today, the digital.forest datacenter experienced a power event related to our ongoing facility maintenance. As a factory technician from MGE was bringing UPS 1 & 2 back online there was a very brief drop in power to portions of the facility. An unknown percentage of the equipment in the facility lost power for less than one second. We are currently assessing the situation and attending to servers that did not automatically reboot. Most systems appear to have come through the event without trouble; however, we advise clients to log in and check their equipment.

Systems on UPS 3 were unaffected.

We are investigating this incident and will report findings as soon as possible.

Update 15:30: We are still running on generator power, and the UPS systems remain offline. When we brought UPS 1 & 2 online just after 1 PM, UPS 2's inverter malfunctioned. This should not have caused the resulting issue, however, as the static switch should have automatically stayed in bypass mode. While any transfer of power carries risk, this action is deemed very low risk. We have performed this action well over a dozen times over the past several years without incident. Static switches such as this do not usually fail in this fashion, and our vendors are trying to ascertain the most prudent plan of action in order to bring the UPS systems back online. We'll post an update as soon as we have more information.

posted by Chuck G. at 04:14 PM on Wednesday, October 22, 2008
Categories: Colocated & Dedicated Servers

We'll be providing updates within this blog post today concerning our scheduled UPS Maintenance & Upgrade today. Check back for details as the work progresses.

Switch to Generator Power: Around 10 am PDT the electricians and UPS techs were ready to shut down the UPS systems for their work and Kevin Teker, our Facilities Manager, cut us over to generator power. The transition went smoothly. The UPS systems were put into bypass mode and then shut down. We expect this work to take a few hours. Updates will be posted as work progresses.


A bit of background...
One of the procedures we are having performed today is a capacity upgrade on one of our UPS systems. It is an MGE EPS 7000 which is capable of this neat trick. With a software upgrade and the addition of some batteries it can scale from 300kVA to 500kVA. We took delivery of the additional batteries last week. As we've noted in previous posts here, due to some unusual circumstances we move UPS gear into the facility by crane.


Above: Preparing the cabinet for flight through the roof.


Above: A view from inside the UPS room looking out (and up).


Above: The new cabinet landed, safe and sound.

We've done this enough times that it has now become rather routine. The crane arrived before dawn, and within an hour the work was done and the roof re-sealed.

Update, Noon PDT: Work continues apace. The UPS maintenance is about halfway finished. The electricians are wiring up the new battery cabinet on UPS 3.


Above: Electricians from VECA installing some conduit for the new cabinet.


Above: Technicians from ECS checking the batteries on UPS1.

The only thing out of the ordinary so far is the relative quiet of the room with all the equipment there turned off.


Update, 1 PM PDT: The UPS maintenance on UPS 1 & 2 is complete. Those systems will come back online momentarily.

Please see this post for details on the incident involving UPS 1 & 2's turn-up.

Update, 3 PM PDT: UPS 1 & 2 remain offline. The facility remains on generator power. Work is continuing on UPS 3.


Update, 4:40 PM PDT: The battery cabinet installation to UPS 3 is complete. The upgrade process is now underway. We plan to have it back online after 6 PM. It is on a different static switch than UPS 1 & 2.

Update, 6:15 PM PDT: UPS 3 came online flawlessly.

We are scheduled to bring UPS 1 & 2 back online in a couple of hours. Our UPS vendor and electricians are pursuing a plan to minimize the risk of a repeat of the earlier event as much as possible. Check back for details.

Update 7:45 PM: We are planning to bring UPS 1 & 2 back online in about 5 minutes.

Update 9:10 PM: UPS 1 & 2 failed to transfer. After two attempts, and long conversations with the UPS manufacturer we have decided to keep them offline for the night. We will continue to run the facility on generator power until this UPS issue can be resolved. We have moved what loads we could onto UPS 3 which successfully completed its upgrade and has spare capacity.

Preliminary analysis suggests that at least one large capacitor in UPS 2 has likely failed. MGE will have technicians here in the morning to investigate and remedy this situation ASAP.

posted by Chuck G. at 01:52 PM on Wednesday, October 22, 2008
Categories: Datacenter Expansion, Facility Maintenance, Scheduled Maintenance

As we posted last week we will be performing our semiannual UPS maintenance tomorrow, Wednesday, October 22nd. Additionally we will be performing an upgrade on UPS 3. This involves software on the unit itself, along with the installation of an additional battery cabinet. The new battery cabinet arrived late last week (pictures coming soon) and was lifted into position.

As is normal during such maintenance intervals the UPS systems will be in bypass and the facility will be running on generator power. There should be no interruption of service during this procedure.

We'll post updates when the work starts and updates as it progresses.


posted by Chuck G. at 05:15 PM on Tuesday, October 21, 2008
Categories: Datacenter Expansion, Facility Maintenance, Scheduled Maintenance

We are currently experiencing some difficulties with FileMaker, specifically Instant Web Publishing, on our durian server. Technicians are working to resolve the issue, and we apologize for any inconvenience this may have caused.

posted by digital.forest at 03:38 PM on Monday, October 20, 2008
Categories: durian.forest.net

On Wednesday, October 22nd we will be performing scheduled maintenance on our UPS Systems. UPS 1 & 2 will be undergoing their biannual preventative maintenance. UPS 3 will be undergoing an upgrade, which involves the installation of an additional battery cabinet.

As is normal during such maintenance intervals the UPS systems will be in bypass and the facility will be running on generator power.

We will post reminders as the date approaches, and updates as the maintenance progresses.

posted by Chuck G. at 11:43 AM on Monday, October 13, 2008
Categories: Datacenter Expansion, Facility Maintenance, Scheduled Maintenance

The server sol.wwwnexus.com is currently experiencing technical difficulties. Technicians are currently working to resolve the issue. We apologize for any inconvenience this may cause.

posted by digital.forest at 01:12 PM on Sunday, October 12, 2008
Categories: sol.wwwnexus.com

We have had an increase in support calls in the past week as a result of some large cable ISPs putting limits on their network. Comcast for example appears to be in the process of blocking all outbound mail from their network on the default SMTP port 25**. What this means to you is that something that has worked normally up to now, sending mail via your server here at digital.forest, suddenly stops working.

Fortunately it is easy to fix this, via several different ways:

1. Change your mail software to use an alternate SMTP (sending) port.
Our servers support SMTP over port 587. The steps for making this configuration change differ a little from one mail program to another. If you do a Google search on "(mail software) SMTP port" you will likely find many step-by-step instructions for your particular mail software. If you use Outlook or Entourage, then search for "Outlook SMTP port" or "Entourage SMTP port".

Find the SMTP port and change it from 25 to 587 and you will once again be able to send mail normally. If this does not work contact digital.forest technical support and we can share some other port numbers to try.

2. Send your mail via your ISP's mail servers.
For example, if Comcast is your ISP, configure your mail software to send mail via their servers. You will need to contact your ISP to get the configuration specifics. This should only be required if your mail software does not support changing the default port as described above. However, most modern mail software allows this configuration change.

3. Use Webmail.
All of our mail systems support access via a web browser. This is especially handy if you are not on your own computer, such as traveling and using an Internet cafe's public computers. To access webmail just type http://mail.(yourdomain) into your web browser and follow the page's login instructions.


** The reason that ISPs enable these blocks is to cut down on spam outbreaks. It is common for home computers to become infected with malware that turns them into spam-sending devices. The ISP can't fix the source of that problem, they can only block the traffic.
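For anyone sending mail from a script rather than a desktop mail program, option 1 above looks roughly like the following, using Python's standard smtplib module. This is a sketch only: the host name, addresses, and credentials are placeholders, and it assumes the server offers STARTTLS and requires a login on port 587, which you should confirm with technical support.

```python
import smtplib
from email.message import EmailMessage

SUBMISSION_PORT = 587  # the alternate SMTP port described above


def build_message(sender: str, recipient: str,
                  subject: str, body: str) -> EmailMessage:
    """Assemble a plain-text message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_on_submission_port(host: str, username: str, password: str,
                            msg: EmailMessage,
                            port: int = SUBMISSION_PORT) -> None:
    """Send msg through host on port 587 instead of port 25.

    Assumes the server supports STARTTLS and authenticated sending.
    """
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        smtp.starttls()                  # upgrade the connection to TLS
        smtp.login(username, password)
        smtp.send_message(msg)


# Example usage (placeholder values; uncomment and fill in your own):
# msg = build_message("me@example.com", "you@example.com", "Test", "Hello")
# send_on_submission_port("mail.example.com", "me@example.com", "secret", msg)
```

If the connection times out on port 587 as well, the block is probably not the cause and it is worth contacting technical support.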

posted by Chuck G. at 01:21 PM on Friday, October 10, 2008
Categories: Mail