News archive: Emergency Maintenance
*****UPDATE 12:28 PDT******
Service has been restored. A filtering server crashed and interrupted mail service for all users. We are sorry for the inconvenience.
*****UPDATE*****
We are getting reports that mail service has been restored for some users, but other users continue to have trouble. Mail may shift between working and not working.
***************
Our Platinum mail server is currently experiencing an unknown issue that is preventing users from sending or receiving mail via local mail applications like Outlook and Apple Mail. Admins are troubleshooting the issue. As a workaround, you can log in to the webmail interface at mail.(yourdomain), using your account name as the username (the part before the "@" symbol) and your mailbox password. Currently there is no ETA for a fix. We will update with more information as it becomes available.
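As a rough illustration of the workaround above, the webmail host and username can be derived from a full email address. This is only a sketch: the function name `webmail_login` and the example address are hypothetical, and the only rule taken from this notice is that the host is "mail." plus your domain and the username is the part before the "@".

```python
def webmail_login(email_address):
    """Return (webmail_host, username) for a full email address.

    Hypothetical helper: it only encodes the rule stated in the notice
    (host is "mail." + domain, username is the part before the "@").
    """
    username, sep, domain = email_address.strip().partition("@")
    if not sep or not username or not domain:
        raise ValueError(f"not a full email address: {email_address!r}")
    return f"mail.{domain.lower()}", username


# Example with a made-up address
host, user = webmail_login("jane@example.com")
print(host, user)
```

The mailbox password is unchanged; only the username differs from what a desktop mail client may have configured.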
posted by digital.forest at 11:40 AM on Thursday, October 14, 2010
Categories: Emergency Maintenance, Mail
******Update 12:43 hrs******
This repair is complete and all power systems are again functioning according to design.
******Non-Service Impacting Maintenance******
At 11:00 hrs PST Wednesday, January 27th, 2010 digital.forest will make an emergency repair to one of our UPS systems. This repair should have no noticeable impact on delivered power in the facility.
This repair is necessary to replace a failed capacitor on a static transfer switch that was discovered during a routine inspection by digital.forest staff on Tuesday, January 26th. The capacitor does not provide power to any critical systems but is a part of the transfer switch that isolates the UPS system from the critical load during maintenance. The failure of the capacitor is such that the dielectric (in this case oil) is leaking in small amounts into the static switch. This failure poses no risk of power outage, fire, or short-circuit in any of the digital.forest data center infrastructure. It is our policy to repair any failed component in our critical power infrastructure immediately, so long as the repair does not increase risk of power delivery failure.
The repair will not require wrapping power around the UPS system, which would involve transferring power from utility to generator. The vendor engineer will perform the work while the equipment is energized. We estimate that the work will require one hour to complete and will provide an update here once all work is complete.
************
digital.forest remains committed to providing our customers with the highest level of service, the greatest degree of protection, and the most transparent communications. If you have any questions or concerns about the above maintenance, please contact your account manager. Our account management staff is available Monday through Friday from 08:00 hrs PST until 17:00 hrs PST at 877-720-0483 Option 2.
posted by digital.forest at 10:49 PM on Tuesday, January 26, 2010
Categories: Emergency Maintenance
Emergency systems maintenance is required this morning on boysenberry.forest.net. The server experienced a hardware failure this morning at approximately 6:30AM PST. Evaluation of the hardware issue and the legacy nature of the server have led us to conclude that users on boysenberry will be best served by a migration to our newest MySQL server, honeysuckle.forest.net.
Boysenberry's IP will be added to honeysuckle, eliminating a need to make changes to database code.
We are working to migrate customer data to the new server as quickly as possible.
posted by digital.forest at 10:27 AM on Tuesday, February 17, 2009
Categories: Emergency Maintenance
On Wednesday, October 22nd digital.forest experienced two electrical interruptions lasting between 6 and 8 milliseconds each. The first event occurred at 13:11, and the second at 19:30 Pacific Daylight Time. These interruptions were caused by a mechanical contact switch fault inside one of our UPS units. This fault occurred on a single phase of the three-phase power within that UPS. The fault caused a voltage drop to be passed into the datacenter along that particular phase. Most computers connected to that electrical phase experienced the voltage drop as a brief interruption of power. Roughly 17% of the servers at digital.forest were affected. Discovery of the root cause, repairing the UPS systems, and bringing the facility back to normal operations required 3 days of hard work by the digital.forest staff, VECA Electric, and MGE, the UPS manufacturer.
The events were triggered as we switched from Bypass mode (power routed around the UPS) to Protected mode (power routed through the UPS) following a scheduled preventative maintenance. This maintenance, which is performed by the UPS manufacturer twice each year, involves taking the UPS system offline, powering it down, inspecting all components, and checking each individual battery.
Following maintenance the UPS system must be transferred from Bypass mode to Protected mode - this switch is a near-zero risk operation. Switchover is handled in such a way that power is not interrupted, and failure during this operation is exceedingly rare. The MGE Service Manager noted that he has seen this operation fail only one other time in his career. digital.forest has performed this switch operation twice per year as a routine part of our maintenance procedures, without incident.
Upon completing the preventative maintenance our UPS vendor brought the system back online. During that process a mechanical contact switch inside one of the units, UPS 2, did not close completely to provide continuous electrical flow. The first time we performed the operation at 13:11, the UPS signaled a fault, and experienced the brief interruption of power on a single phase. The UPS system automatically went offline again, properly reverting to Bypass mode. Unfortunately the interruption on the single phase was long enough in duration to affect some servers downstream.
At this point neither digital.forest nor its vendors knew that a component had failed - only that the switch to Protected Mode was unsuccessful. According to the experts on-site, there was no apparent logical reason for the failure. MGE advised that we make some changes to our electrical distribution as a precautionary measure in preparation for a second transfer operation. At 19:30 power was again routed through the UPS system, and we experienced a second interruption identical to that of 13:11. At this point digital.forest ordered a stop to any further switch attempts and commenced a complete evaluation of UPS 1 & 2. MGE immediately dispatched a senior UPS engineer to our facility. Over the next two days comprehensive diagnosis and testing were performed on both UPS units, and the problems within UPS 2 were identified and repaired. After replacing an inverter and several control and communications cards, the root cause was traced to the fault in the contact switch.
You can view photographs of the faulty contact switch, and some of the damaged circuitry here:
An overall view of the contact switch mechanism.
A close-up view of the specific Phase-A contact that failed.
A close-up of a damaged communications circuit board in UPS 2.
UPS 2 is a relatively new unit, purchased in July of 2007. The physical failure of one of its contact switches is highly unusual. In fact, the manufacturer's specifications rate this component for ten million cycles, whereas we only engage it twice each year. The failed contact switch was inspected during every previous preventative maintenance and showed no signs of trouble, including the preventative maintenance performed earlier that same day.
Following the installation of new parts, we again closely inspected and tested every contact switch (there are 6 total) in both UPS 1 & 2. We also re-inspected and tested every other connection and circuit board inside both of these UPS units. After this comprehensive inspection we tested the UPS units with load banks at 100% power as well as tested the transfer operation under artificial load to validate the diagnosis and repair. At 22:10 on Friday, October 24th the UPS system was successfully brought online, and the datacenter was restored to normal operating conditions.
While this event was traced to a small component, many larger components of our facility, and our procedures, performed as intended:
- By design, the bypass equipment properly and automatically re-routed power when the UPS system faulted. This action contained the interruption to a very short duration, and to a limited portion of the datacenter.
- High-level experts were immediately dispatched by our UPS vendor when it became clear that something was out of the ordinary, and parts were quickly flown in, reducing our repair time by days.
- The backup power generation equipment carried our full electrical load continuously and flawlessly for three days.
- Our contracted Diesel fuel vendor performed as we expected, making deliveries on demand with quality product. We topped our fuel tank on 3 separate occasions during the event.
- Most importantly our staff remained on-site, responsive and available to assist you with your servers, as well as assist our vendors with the restoration of our facility to normal operations.
Digital.forest remains committed to providing superior service and to continually examining and maintaining all of the systems upon which our customers rely. We deeply regret any inconvenience or interruption of service this event may have caused. We appreciate the patience of our customers and close cooperation of our partners in working through this event, and welcome any additional questions or comments you might have.
Kind Regards,
The digital.forest Executive Team
posted by Chuck G. at 07:11 PM on Thursday, October 30, 2008
Categories: Emergency Maintenance
At 10:10 PM PDT we threw the bypass switch and brought UPS 1 & 2 back online. The transfer went seamlessly and everything is working within normal parameters. The facility is back on grid power with fully functioning UPS protection.
Thank you again for your patience and understanding as we dealt with this emergency situation.
posted by Chuck G. at 01:25 AM on Saturday, October 25, 2008
Categories: Emergency Maintenance
Starting a new support blog post on a new day to keep things easy to read.
Status as of 9:00 AM Friday, October 24, 2008:
* Facility remains on generator power. We have enough fuel on-site for a run until Monday.
* UPS 3 is online and functioning properly
* UPS 1 & 2 remain offline. UPS 1 is working properly but cannot be brought online until UPS 2 is repaired.
* UPS 2 showed inverter errors whenever it was loaded last night. The spare parts package contained the wrong inverter. The correct one is on its way from California on a commercial flight this morning.
* When it arrives we will install and begin testing procedures again with an eye on going back to grid power sometime tonight.
Update 2:45 PM: Two bits of news...
* We have found a faulty connector inside of UPS 2. The fault is very minor, but it could have contributed to some of the issues we have experienced. We are replacing the whole assembly to be on the safe side. We'll post photos soon.
* We just finished topping off our fuel supply with 1300 gallons of Diesel and now have enough for a continuous generator run through Tuesday afternoon (October 28th).
Update 4:00 PM: Several updates...
* The replacement inverter is installed in UPS 2 and we have begun load testing that unit. If all goes well we will sync UPS 1 & 2 and begin load testing them together.
* The replacement connector component has an ETA of 7:30 PM. Once that arrives we should be able to proceed swiftly.
* We are planning to switch back to grid power this evening between 8 and 9 PM PDT.
* Cummins Northwest, our generator maintenance contractor, just finished an inspection of our generator system and proclaimed it in excellent condition.
Update 4:30 PM: UPS 2 is confirmed operational by the MGE tech on-site. We are testing the units synced now.
Based on a client request, the timing for the switch back to grid power has been postponed until 10:00 PM PDT.
Update 8:00 PM: All the required parts are here; we are finalizing tests and preparing to perform the switch back to grid power tonight just after 10 PM.
Update 8:35 PM: Here are some photos of the faulty connector. Earlier I stated that it was from the static switch; it is in fact from inside UPS 2. It was just my misunderstanding of what was reported to me, and I've corrected that statement above. This connector contains three large switches, and one of them has a slim gap which is making a poor connection.

Above: The connector, with a socket wrench for scale. The connector in the photo is in an "open" state.
Below: Here is a close up shot with the connector closed. The top red arrow is pointing to a solid connection. The bottom red arrow is pointing to the bad connector portion. Note the shadow cast from the camera's flash. There should be no shadow visible if the connector is tight.

We are certain that this is the root cause of our power event on Wednesday. The connection was loose enough to cause poor conduction, which is why we saw a voltage drop and then a fallback to a bypass state. It would also explain the damage to the comm board in UPS 2 we showed you yesterday, and the damage to the inverter. UPS 2 was purchased just over a year ago and should not have failed in this manner. Since UPS 1 is an identical unit we have removed the same connector and inspected it. UPS 1's connector is good and solid. We tested UPS 1's connector in UPS 2 and it works perfectly. The technician from MGE is replacing this connector in UPS 1 right now with a new one, and we'll begin testing both of the units soon.
Update 9:30 PM: We are "a go." Both UPS units pass all tests, several times. Everything is working as it should under load. We are preparing for coming off generator power at just after 10 PM.
Update 10:00 PM: 10:10 PM is our specific target time for bringing the repaired UPS system back online.
We will continue to update this post as we learn more information. Thank you for your continued patience and understanding as we deal with this emergency situation.
Chuck Goolsbee
V.P. Technical Operations
digital.forest, Inc.
posted by Chuck G. at 11:37 AM on Friday, October 24, 2008
Categories: Emergency Maintenance
A briefing for you on the current situation regarding our power event yesterday:
* UPS 1 & 2 remain offline this morning, and the facility remains on generator power.
* UPS 3 is online and functioning normally. We migrated what circuits we could to this UPS system last night between 7 and 10 PM.
* We have enough fuel on-site for over four days of run time, but to be safe we have scheduled fuel deliveries for the next several days to ensure supply.
* We are performing our own hourly generator checks around the clock to ensure proper operation.
* Our generator maintenance vendor is also scheduled to arrive daily and inspect the operation of the system.
* Until we have a fully functioning UPS system we cannot risk a power transfer back to the grid. We believe the fault is specific to UPS 2. Technicians will investigate further today.
* At 9 o'clock this morning we are meeting with our UPS manufacturer to determine next steps.
Update 11 AM: The UPS manufacturer has dispatched a high-level technician and is airfreighting a complete set of replacement parts for our UPS systems from California right now. ETA is mid-afternoon. We just received a shipment of load banks which are being installed on our roof. Here is the plan:
* Isolate UPS 1 & 2 from our facility input and load.
* Test every subsystem of each UPS, isolate and repair fault(s).
* Test each UPS under artificial load. If they fail, fix. If they pass...
* Sync the UPS units back into their parallel configuration, retest under artificial load. If they fail, fix. If they pass...
* Test again and confirm smooth transfer of artificial load.
* Plan for reinsertion of UPS 1 & 2 into our facility load.
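The plan above is essentially a test-and-repair loop: each unit is tested alone until it passes, then the pair is synced and retested before reinsertion is planned. A minimal sketch of that control flow follows, assuming hypothetical callables `passes_test` and `repair` that stand in for the technicians' actual work; nothing here reflects real UPS tooling.

```python
from typing import Callable, List


def run_until_pass(name: str,
                   passes_test: Callable[[str], bool],
                   repair: Callable[[str], None],
                   log: List[str]) -> None:
    """Test `name` under artificial load; repair and retest until it passes."""
    while not passes_test(name):
        log.append(f"{name}: failed, repairing")
        repair(name)
    log.append(f"{name}: passed")


def restore_plan(units, passes_test, repair) -> List[str]:
    log: List[str] = []
    # Test each isolated UPS individually under artificial load.
    for unit in units:
        run_until_pass(unit, passes_test, repair, log)
    # Sync the units in parallel and retest them as a group.
    run_until_pass("+".join(units), passes_test, repair, log)
    log.append("ready to plan reinsertion into facility load")
    return log


# Simulated example (assumption): UPS 2 fails until it has been repaired once.
repaired = set()
log = restore_plan(
    ["UPS 1", "UPS 2"],
    passes_test=lambda n: n != "UPS 2" or n in repaired,
    repair=repaired.add,
)
print("\n".join(log))
```

Note that the loop only ends when a test passes, which mirrors the "if they fail, fix" wording of the plan; it does not model giving up or escalating.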
Our initial ETA for full restoration of protected power is sometime after midnight tonight. We will update this schedule with a specific time once more details are known. Meanwhile we continue to run the facility on generator power. Even though we have only consumed about 20% of our available fuel we still plan on refueling early this afternoon.
Update 12:25 PM: Fuel delivery truck is here and we're topping up the fuel tank.
Update 1:30 PM: Senior-level technician from the UPS manufacturer has arrived on-site.
Update 3:45 PM: Root Cause Suspect...

This is part of the circuit board of UPS 2 that communicates with the static switch. A replacement board is en route.
We're not completely satisfied, however, and are continuing to give each part of UPS 2 close examination and testing. Stay tuned for more details.
Update 7:15 PM: Both UPS units have been inspected and we're quite certain the item identified above is the only issue to be addressed. The replacement part is arriving at Sea-Tac airport very soon. Meanwhile we are working to test UPS 1 under artificial load. Once the replacement board arrives we will test UPS 2 as well. We should have a specific ETA for turn-up by the time we next update the site.
Informational Update 7:55 PM: We have successfully transferred a 100% artificial load on and off UPS 1 three times. We're certain that UPS 1 is operating properly. Repair on UPS 2 will begin soon.
Update 9:10 PM: Barring any new issues we are on track for a transfer back to grid power between midnight and 1 AM tonight. The replacement part has arrived and we should begin testing UPS 2 very soon.
Update 11:25 PM: Some further issues have been uncovered in testing UPS 2, so we have postponed a transfer back to grid power indefinitely. No transfer will take place tonight.
We will continue to update this post as we learn more information. Thank you for your continued patience and understanding as we deal with this emergency situation.
posted by Chuck G. at 11:26 AM on Thursday, October 23, 2008
Categories: Emergency Maintenance
sage.forest.net is back up. If you have any issues or questions, please let us know.
posted by digital.forest at 08:41 AM on Monday, September 29, 2008
Categories: Emergency Maintenance, sage.forest.net
Silverpine will be coming down for emergency maintenance shortly. It should be up again in less than 15 minutes.
posted by digital.forest at 12:04 AM on Monday, September 8, 2008
Categories: Emergency Maintenance
We will be taking thyme.forest.net offline for about 5-15 minutes starting at 11:45 AM PST today in order to address a performance degradation in the system.
This downtime will occur on Friday, August 29th, 2008, starting at 11:45 AM PST and ending at about 12:00 PM PST.
Update: August 29th, 2008 11:55 PST: The server is online again. The server's performance has greatly improved and the performance degradation that was affecting the server has been addressed.
posted by digital.forest at 01:32 PM on Friday, August 29, 2008
Categories: Emergency Maintenance
We are currently experiencing some issues with our Filemaker server savin. We are running maintenance and troubleshooting the server and hope to have it stable again as soon as possible.
We are notified immediately when the server is inaccessible and can correct the problem very shortly afterward, but if you notice that your database on savin is unavailable, please call us at 877-720-0483 and use option 3 to reach support.
posted by digital.forest at 03:02 PM on Thursday, August 28, 2008
Categories: Emergency Maintenance, savin.forest.net
Digital.forest has experienced a network level event that has caused portions of our network to be unavailable. Currently our Systems Administration and Network Administration as well as Network Engineering staff are on site working to resolve this issue.
If you are currently experiencing any problems connecting to any services at digital.forest, this is the cause.
We will keep this page updated as we get more information on this event and a detailed post will be made when this event has been resolved.
Update 9:30 AM PDT
Connectivity to the majority of our shared hosting services has been restored.
Update 10:55 AM PDT
As of 09:00 PDT most of our network was back online and accepting incoming and outgoing traffic normally. We have confirmed that the connection is stable and we do not expect any further interruptions to service here at digital.forest.
Expect a full writeup on the situation soon.
posted by digital.forest at 11:07 AM on Saturday, August 9, 2008
Categories: Emergency Maintenance, Network
We have restored the Instant Web Publishing Functionality on Rosemary and would like to thank everyone for their understanding and patience.
posted by digital.forest at 11:59 PM on Tuesday, August 5, 2008
Categories: Emergency Maintenance, FileMaker 9, rosemary.forest.net
The server "rosemary.forest.net" is experiencing some issues at the moment. Specifically FileMaker Pro "Instant Web Publishing" has ceased to function properly. We are making efforts to have it working again as soon as possible. Rosemary may be offline for periods as we address these issues. Thanks for your patience.
posted by Chuck G. at 07:04 PM on Tuesday, August 5, 2008
Categories: Emergency Maintenance, FileMaker 9, rosemary.forest.net
The mail server "palm.forest.net" will require some emergency maintenance tonight around midnight PDT. This will involve a restart of the server, which means mail service will be interrupted for about 5 minutes. No inbound mail will be missed as it will spool on secondary mail servers, but users will not be able to send or read mail during the brief outage. We appreciate your patience while we perform this required update.
posted by Chuck G. at 06:18 PM on Wednesday, July 16, 2008
Categories: Emergency Maintenance, Mail, palm.forest.net
We will be taking date.forest.net offline for about 5-10 minutes this evening to install some additional hardware in order to increase performance and total server traffic capacity.
This downtime will occur on Wednesday, July 16th, starting at 23:50 PDT and ending at about 00:00 PDT.
posted by digital.forest at 06:08 PM on Wednesday, July 16, 2008
Categories: Emergency Maintenance, date.forest.net
celestial.wwwnexus.com is currently offline for emergency maintenance. Unfortunately we do not have a time frame for its return. We will update the blog when we have more information. Sorry for the inconvenience.
posted by digital.forest at 04:43 PM on Saturday, June 21, 2008
Categories: Emergency Maintenance
The mail server "treehouse.forest.net" which acts as a primary mail server for a portion of our users has had some issues with stability over the past 24 hours. Our diagnosis points to memory-related errors. We have new RAM for this server and will replace it this evening around 7PM Pacific Daylight Time. The maintenance interval should be about 45 minutes in duration. No inbound mail will be lost, as it spools on secondary servers, however users will not be able to send or receive electronic mail during this maintenance interval.
Treehouse may be unreachable for short periods of time during the day today as we wait until the after-hours maintenance window. If it becomes untenable during business hours we may elect to perform the RAM replacement earlier, but will do everything in our power to avoid interrupting mail service during the business day.
Thank you for your patience while we address this critical issue.
posted by Chuck G. at 09:51 AM on Thursday, June 12, 2008
Categories: Emergency Maintenance, Mail, treehouse.forest.net
Effective immediately we will be performing emergency maintenance on the date.forest.net Lasso web server in order to address some performance issues. This maintenance will impact all websites on the date.forest.net server and will cause them to be unavailable sporadically until this maintenance is complete.
We will strive to keep this outage as short as possible but we do not currently have an ETA for when this emergency maintenance will be complete.
Thanks for your understanding while we address this issue.
Update: date.forest.net is back online and we do not expect to have this server down anymore. This concludes the emergency maintenance for date.forest.net.
posted by digital.forest at 08:03 AM on Monday, June 9, 2008
Categories: Emergency Maintenance
The former Trident Networks/Speedyweb server "Neptune" has been experiencing issues lately. In order to ensure future stability and performance of the sites served from this machine we've decided to migrate them to more reliable servers. Accounts will be moved to newer and faster servers, either running UNIX or Windows, depending on whether the website relies on FrontPage extensions. E-mail and any MySQL databases will be migrated to UNIX servers.
We apologize for any inconvenience that this migration may cause you and will be more than happy to answer any questions you may have; we will be working with you to resolve any issues that may arise due to this migration.
Users on Neptune will be contacted directly via our helpdesk ticketing system with more details. Please respond to the helpdesk ticket and/or call us at 877-720-0483 option #3. We will have staff onsite 24 hours a day, 7 days a week during this migration and they will be able to help you with any problems that you may have. If you believe that your e-mail address with us may be out of date we highly recommend that you respond to this ticket or call us at 877-720-0483 option #2 during business hours and update your contact information with an Account Manager.
Thank you for your patience.
posted by Chuck G. at 05:17 PM on Tuesday, May 6, 2008
Categories: Emergency Maintenance, Hosting Servers, Mail, MySQL hosting
Just a reminder, tonight at 11 PM we will be performing an emergency maintenance on part of our electrical system. This will have an impact on a portion of our shared web and database servers, and eight dedicated and colocated servers. The shared hosting servers affected are:
acacia, acorn, alder, aralia, arrowwood, ash, aspen, avocado, balsa, balsam, bamboo, banana, banyan, bayberry, beech, bigleaf, blackberry, boojum, boxwood, bubinga, buckeye, cactus, cedar, cherry, chestnut, cholla, cinnamon, clover, columbia, commerce2, commerce3, cork, cottonwood, db, deerwood, dogwood, ebony, elm, evergreen, ficus, fig, filbert, fuji, gingko, grape, grapevine, hackberry, hazel, hemlock, hickory, ironbark, ivy, kentia, kola, kudzu, larch, laurel, lilac, lime, madrona, magnolia, mango, mangrove, maple, mesquite, mimosa, moringa, mulberry, myrtle, newninewire, ninewire2, olive, orchid, palmetto, papaya, pear, pecan, plum, poplar, privet, quince, redbud, sassafras, savin, sequoia, sherwood, snowberry, spiceberry, spruce, strawberry, sycamore, tamarack, teak, truffula, tutsan, walnut, woodpecker, yucca, yulan/fern
Every effort will be made to minimize the downtime. Our electricians have estimated the time required to complete the task at 2 hours.
Thank you for your patience and understanding while we strive to build a better facility.
Update 1:15 AM: The maintenance went very smoothly and was completed in under 90 minutes. We were able to supply alternate power to the few dedicated and colo servers involved in the maintenance. We also took the opportunity to replace some failed parts in a couple of servers (acorn, for example, whose power supply fan had stopped working).
Again, thank you so much for your patience during this critical maintenance interval.
posted by Chuck G. at 07:30 PM on Tuesday, March 11, 2008
Categories: Emergency Maintenance, Hosting Servers, alder.forest.net, arrowwood.forest.net, aspen.forest.net, balsa.forest.net, bamboo.forest.net, banana.forest.net, bayberry.forest.net, bigleaf.forest.net, boysenberry.forest.net, cactus.forest.net, cedar.forest.net, chestnut.forest.net, cinnamon.forest.net, columbia.forest.net, elm.forest.net, evergreen.forest.net, fuji.forest.net, hazel.forest.net, kentia.forest.net, kola.forest.net, laurel.forest.net, lime.forest.net, olive.forest.net, orchid.forest.net, palmetto.forest.net, pear.forest.net, quince.forest.net, sassafras.forest.net, sherwood.forest.net, spruce.forest.net, sumac.forest.net, sycamore.forest.net, tamarack.forest.net, tutsan.forest.net, www.ninewire.com
On the night of Tuesday, March 11th, into the morning of Wednesday, March 12th, we will be performing emergency maintenance on our power infrastructure related to the installation of our new UPS system. This maintenance will impact a small portion of our shared hosting clients. We will provide specific times in the next couple of days as they firm up.
We will strive to keep this outage as short as possible but we are allowing up to two hours of downtime.
Thanks for your understanding while we grow to serve you better.
posted by Shawn Hammer at 05:25 PM on Friday, March 7, 2008
Categories: Emergency Maintenance, Hosting Servers
The gold.wwwnexus.com hosting server has experienced a hardware failure. We are working on repairing it, and will update as soon as we have more information.
posted by Bill D. at 01:05 PM on Friday, March 7, 2008
Categories: Emergency Maintenance
One of our mail servers, smtp.forest.net, is down to investigate recurring problems. We hope to have the server back up within an hour to ninety minutes. Thanks for your patience.
Update (4:38AM PST): smtp.forest.net is back up and running.
posted by digital.forest at 03:43 AM on Tuesday, January 15, 2008
Categories: Emergency Maintenance, Mail, smtp.forest.net
At 2am tomorrow morning we will be shutting down the mail server "treehouse" to install new memory in it. We have been seeing RAM-related errors which caused the server some problems last week. We figured tonight would be a good time to bring it down and perform the hardware installation. Downtime should be limited to 15 minutes or less.
posted by Chuck G. at 07:44 PM on Monday, December 31, 2007
Categories: Emergency Maintenance, Mail, treehouse.forest.net
One of our mail servers, treehouse.forest.net, is experiencing problems and is currently down. We are working to restore it as soon as possible.
Thanks for your patience.
Update 11:25AM: treehouse is back up and running.
Final Report 12:30PM: Treehouse was experiencing memory-related errors and when we rebooted it the hardware self-test showed a failed RAM card. This particular server uses RAM in pairs so we had to source a pair of equivalent cards from our inventory to get the mail server running again. We have ordered additional RAM for both treehouse and our inventory and will likely schedule a brief downtime for treehouse over the holidays to perform this work.
posted by digital.forest at 11:22 AM on Wednesday, December 19, 2007
Categories: Emergency Maintenance, Mail, treehouse.forest.net
Starting at around 2AM we started seeing temps rise in the datacenter, more so in DC 1. We've called our HVAC-systems vendor and they should have an emergency technician out here shortly. We've taken steps to mitigate temps in the meantime. We'll post updates as more information becomes available.
UPDATE: 03:55 Our HVAC vendor arrived within a few minutes of the posting above. They've corrected the issue and we are recovering nicely. Things should be back to normal in a few minutes. We'll post an update when we have definitive data on the cause.
posted by Chuck G. at 03:41 AM on Wednesday, December 5, 2007
Categories: Emergency Maintenance
One of our shared hosting servers, souari, is having trouble and is currently down. We apologize for the inconvenience.
Update 10:30AM: souari is back up.
posted by digital.forest at 10:29 AM on Tuesday, December 4, 2007
Categories: Emergency Maintenance
One of our legacy Trident servers, celestial.wwwnexus.com, is currently experiencing technical difficulties and is not serving pages.
We apologize for the inconvenience this presents, and are currently working to bring it back to full operation. We'll edit the support blog when there's progress to report.
Update 9:20AM: celestial is now back online.
posted by digital.forest at 07:29 AM on Monday, December 3, 2007
Categories: Emergency Maintenance
This morning we discovered a hardware failure with callisto.forest.net.
Thankfully all data has been recovered and we have already found replacement hardware. The new hardware is currently running some preliminary self-tests; callisto should be back as soon as these are complete. We will update this blog when callisto has fully recovered. Sorry for any inconvenience, and thanks for your patience.
Update 11/21/2007 09:49 PST: The callisto server has returned to operational status and all services of that server should be working properly again.
posted by digital.forest at 09:21 AM on Wednesday, November 21, 2007
Categories: Emergency Maintenance
We will be taking the FileMaker 9 Server named rosemary offline this morning to perform some emergency maintenance on the FileMaker Instant Web Publishing. We do not currently have an ETA for when this work will be completed but we will update this entry as we have more information.
Thank You
11/07/2007 10:25AM: Rosemary is now back online and the error in IWP has been corrected.
posted by digital.forest at 01:34 AM on Wednesday, November 7, 2007
Categories: Emergency Maintenance, Hosting Servers
Last night around 19:30, one of the three primary compressors of one of our two HVAC systems failed. This was noted by NOC personnel, who contacted our HVAC contractor, who set the controls to bypass the failed unit. We did run on single-stage cooling from one unit for several hours, which led to both datacenters reaching temperatures in the low-80s F/high-20s C. Once the unit was bypassed, 3-stage cooling was achieved and temps dropped to their normal mid-60s F/high-teens C. With outside summertime temperatures in the high-80s F/low-30s C possible, we require 4-stage cooling at peak hours.
We will be replacing the failed compressor tonight between 21:00 and 23:00. We have secured portable HVAC units to supplement our secondary HVAC while the primary unit is down for repairs. We will update this website with more information as the repair progresses.
Update: 8:00 AM The new compressor was installed last night. We've resumed normal operations.
Regards,
--Chuck Goolsbee
VP, Tech Ops
digital.forest
posted by Chuck G. at 03:38 PM on Thursday, July 5, 2007
Categories: Emergency Maintenance, Facility Maintenance
We are currently performing emergency maintenance on our Souari server (216.168.37.73). We currently have a team of technicians working on the problem to resolve it as quickly as possible and will update you as soon as we have more information. Unfortunately we do not currently have an ETA for when the server will be back up but we will let you know as soon as it is.
Please accept our sincere apologies for this service outage and please let us know if you have any questions or concerns.
Thank You
Update (10:50 PDT): Apache on souari.forest.net is online and all websites on the server are accessible at this time. We are still working to resolve an issue with the sendmail functionality on the server and our technicians hope to have that service online soon. Unfortunately we do not have an ETA for the repairs to the sendmail service at this time.
Update (11:44 PDT): Currently FTP on the souari.forest.net server is not available for the same reason that the sendmail service is not functioning on the server. We currently have technicians working to resolve these issues and hope to have an ETA for you soon.
Update (12:32 PDT): To correct a problem with bad library files on Souari, there will be another brief downtime, beginning immediately. We will update again when the server is back up and running.
Update (13:52 PDT): While correcting the problem with the bad libraries on souari.forest.net we discovered an additional problem and have been working to resolve it. This has resulted in additional emergency maintenance downtime. Please know that we are doing everything that we can to resolve these issues as quickly as possible and have assigned our entire technical department to work on this issue.
Update (17:31 PDT): The Emergency Maintenance on the souari.forest.net server has been concluded and the server has returned to normal operating status.
posted by digital.forest at 09:44 AM on Monday, July 2, 2007
Categories: Emergency Maintenance
The shrubbery.forest.net server has been taken offline in order to perform some emergency maintenance. We have technicians working on the server and will have it back online as soon as possible.
Please accept our apologies for this service outage and please let us know if you have any questions or concerns.
Digital Forest Technical Support
877-720-0483 option #3
Update (10:55 PDT): We have concluded our emergency maintenance on Shrubbery.forest.net and it is now online and functioning normally.
posted by digital.forest at 09:40 AM on Monday, July 2, 2007
Categories: Emergency Maintenance
One of the distribution switches in our facility, specifically in the rack-colocation area in rows 11 & 12 in DC1, has been showing errors and causing some network issues for servers in that area today. Our switch vendor believes that this is being caused by a bad gigabit port which uplinks that switch to our network core. The other possibility is a bad switch engine. Thankfully the former of these has some built-in redundancy. The latter is an easy card swap, and we have spares. So in a few moments we will be manually failing the gigabit uplink to the redundant port. It is unlikely that this will be any more noticeable than the intermittent issues that servers have seen on this network segment, namely some dropped packets and retransmissions. If that does not improve things, we'll replace the switch engine blade later tonight during a maintenance window.
Update 5:00 PM: The card reset seems to have solved the issue. We'll keep an eye on things over the next few days to be sure. We'll also order a replacement card for the switch. Thanks for your patience.

Above: Network Manager Kyle Murray pulls the problem gigabit card from the switch.
posted by Chuck G. at 08:28 AM on Monday, June 11, 2007
Categories: Emergency Maintenance, Network
The source of our HVAC system issue today was a malfunction in our fire suppression system. The fire system is one that is designed to suppress fires without damaging computer equipment. It works by sealing the datacenter and flooding it with a gas which smothers the fire. The first stage in the process, "sealing the room", is done by shutting the air conditioning by using motorized dampers in the ducting.
For reasons we do not know yet, these dampers closed due to some malfunction.
We were able to open them manually and then bypass the fire suppression system's control over the HVAC system. By this time, however, temperatures in the datacenter were out of the normal range. We elected to shut down as many systems as possible to accelerate the cooling and minimize the heat load.
We will have our Fire Suppression system vendor out to diagnose and correct this malfunction as soon as possible.
The timing of this event coincided with what is predicted to be the last of a streak of relatively hot days here in the Seattle area. We doubt this is linked to the cause, but it certainly contributed to the overall problem and the time taken to recover.
We will continue to update the support blog post below this one with breaking news as it happens. Thank you for your patience during this event.
posted by Chuck G. at 07:41 AM on Sunday, June 3, 2007
Categories: Emergency Maintenance
We are experiencing an HVAC-related issue right now. Our HVAC maintenance vendor is en-route. We may be shutting down systems to lower the datacenter temperature. Please stay tuned for more information.
Update: 12:35 PM Vendor is on site, the system is running again, but datacenter temperature is high. We are shutting down as many systems as possible to reduce the heat load on the facility in order to get the temperature back down as swiftly as possible.
Update: 12:45 PM It appears the HVAC was shut down by a malfunction in our fire suppression system. We have bypassed that system's HVAC controls for the time being to prevent it from reoccurring.
Update: 12:50 PM Temperatures are beginning to come back down.
Update: 1:30 PM We have additional staff on-site now to handle phone calls and server shut-downs/startups.
Update: 2:50 PM We will start bringing shared servers online very soon.
Update: 4:30 PM Datacenter temperature has stabilized and is back to normal. We have restarted most servers, but some still remain down due to miscellaneous problems. If any of these appear to have longer-term issues, we will begin to post individual lists and updates separate from this entry.
Thanks for your patience and understanding.
posted by Chuck G. at 05:27 AM on Sunday, June 3, 2007
Categories: Emergency Maintenance
At 9:50 AM this morning one of our Metropolitan Ethernet providers, OnFiber, had an equipment failure here in Seattle. We connect to one of our network peers, NTT/America, at The Westin Building via this circuit. This caused us to have intermittent connectivity over that particular circuit to NTT/America. Some digital.forest clients may have had "slow" or "intermittent" issues reaching servers here for a short period of time while we diagnosed the issue with the NOCs of NTT & OnFiber.
We have shut down our BGP connection to NTT/America while OnFiber fixes the problems on their network. At the moment we are running on two of our three network connections. We will update this post when we bring the third circuit back online.
Update: As of 11:02 AM PDT this issue is completely resolved. The OnFiber circuit was manually moved to a different port. After a successful 10-minute testing of the new circuit we turned up our BGP session with NTT/America.
We maintained connectivity to our other BGP network peers through this event, so at no time was our network "down". We do like to keep our clients informed of events here at our datacenter, even if they have no direct impact on your servers. In this case, it was a classic example of Internet Architecture and how it handles outages. The often-used phrase is that it "routes around damage." In this instance when one of our circuits had an issue our traffic just shifted to our other circuits. It is likely that none of our clients even noticed. If they did notice it would have been an intermittent connectivity for a brief period of time. Such is the nature and reason for designing redundant systems. Our fiber optic connectivity to the rest of the Internet flows over multiple physical paths. Those paths do not converge until they are physically inside our datacenter facility. This prevents complete outages through equipment failure or accidental fiber cut. Today's event confirms the built-in redundancies work as designed.
--Chuck Goolsbee
VP Technical Operations
digital.forest, Inc.
posted by Chuck G. at 11:00 AM on Tuesday, May 8, 2007
Categories: Emergency Maintenance, Miscellaneous, Network
Tonight at 8:00 pm we will be performing a software update on our helpdesk system. This will involve approximately 5 minutes of downtime. At that time our online trouble ticket system will be unavailable.
All other support options (telephone, emergency pager, etc) will remain online throughout the maintenance window. The trouble ticket system should be back online by 8:05 PM.
We apologize for any inconvenience this may cause.
posted by Chuck G. at 02:59 PM on Tuesday, May 1, 2007
Categories: Emergency Maintenance
4. Wrap up & Summary
We're pleased to report that the repair on our HVAC system is complete, and finished without incident. The final bit of work required brazing & welding within the unit itself. To mitigate any risk of having the pre-action fire suppression system discharge its gasses, we had our vendor, Fire Chief, come out and disable the system. Part of our annual maintenance procedure for the fire suppression system involves the shutdown of the HVAC system anyway, so Fire Chief took advantage of the situation to perform that maintenance.

Above: Technicians from Fire Chief perform preventative maintenance on the Fire Detection and Suppression system.
During the HVAC system shutdown, digital.forest staff monitored temperatures in various locations around the datacenter, while our Facilities Manager bounced between the roof and the datacenter monitoring our vendors. Below you can see digital.forest Tech Support member Will Winslow and Facilities Manager Kevin Teker in the darkened datacenter just after the HVAC shutdown occurred. They're carrying their temperature monitors and about to spread out to their stations. You can see the high-CFM fans mentioned earlier today in the open door behind them.

All of our preparation paid off, plus a bit of luck from the weather (it stayed very cool, plus it didn't rain) so that the natural tendency for the facility to warm up was mitigated by the combination of pre-cooling and the fans pulling outside air into the facility. We're happy to report that our highest temperature reached was about what we see here on a "normal" day. Our temporary portable HVAC units never even needed to be turned on.

Interesting conclusions:
Electrical capacity is a hot topic in the datacenter management business these days. There are various rules of thumb concerning the estimation of power usage split between "floor" (meaning the servers) and "mechanical" (meaning the HVAC systems to cool the servers). The variable is the delta between outside and inside ambient temperatures. The hotter it is outside, the harder the HVAC systems have to work to chill the inside. We're blessed to be located in a very moderate climate here in Seattle. It rarely gets very hot here. Nor does it get very cold. Our average temperature is actually quite a bit lower than ideal datacenter temperature. Even in summer, it cools enough at night to keep our average right at ideal datacenter temperature. We monitor electricity usage at several points along the flow, for a lot of reasons, but on our main panel in the datacenter we can check at a glance and see how much power is being used in total. The ammeter, for example, read this way earlier today when we were running the rooftop HVAC on 100% outside air:

That reads 274 Amps. That is 274 Amps of 3-phase power as it comes in off the grid. Our feed is 2000 Amps, so as you can see we have a lot of room for growth with regards to electricity. This is one of the things that really attracted us to this facility when we moved here just over two years ago. With so many datacenter operations running at nearly 100% of their power capacity, we felt it important to be able to accommodate our clients' expanding needs and requirements. This maintenance interval provided us some real-time data concerning the power needs of our mechanical infrastructure. Those rules of thumb mentioned earlier say "for every 1 amp you feed the floor, you feed the mechanical 1 to 1.75 amps." This seems to have been proven in our experience, but rounded down due to our temperate, if not downright cool, location here in Seattle. Here is a shot of the ammeter with the HVAC system shut down completely:

That is 219 Amps of 3-phase power. Looking at our monitoring history, we hit our maximum of 400 Amps last July when we had a week of temperatures in the 90-95° F (32-35°C) range. That means we are running at a roughly 1:1 floor:mechanical ratio in terms of electricity at our peak consumption. If anything we are favoring the floor, which is a great advantage in this industry.
Yet another benefit of colocation at digital.forest in cool Seattle!
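For readers who want to check the arithmetic, the ammeter readings quoted above can be turned into floor:mechanical ratios with a few lines of Python. This is only a rough sketch: it assumes the floor load during last July's peak was the same 219 Amps measured today with the HVAC off, which the readings above do not confirm, and the function name is made up for illustration.

```python
# Readings quoted in the post (amps of 3-phase power at the main panel).
FLOOR_AMPS = 219          # HVAC shut down completely
OUTSIDE_AIR_AMPS = 274    # rooftop HVAC running on 100% outside air
PEAK_AMPS = 400           # maximum recorded last July
FEED_AMPS = 2000          # capacity of the incoming feed


def mechanical_ratio(total_amps: float, floor_amps: float) -> float:
    """Return mechanical amps drawn per 1 amp of floor load.

    Assumption: everything above the floor load is mechanical (HVAC) load.
    """
    if floor_amps <= 0:
        raise ValueError("floor_amps must be positive")
    return (total_amps - floor_amps) / floor_amps


cool_day = mechanical_ratio(OUTSIDE_AIR_AMPS, FLOOR_AMPS)
# Assumption: the floor load at the July peak matched today's 219 Amps.
peak_day = mechanical_ratio(PEAK_AMPS, FLOOR_AMPS)
headroom = 1 - PEAK_AMPS / FEED_AMPS

print(f"cool day floor:mechanical  1:{cool_day:.2f}")
print(f"peak day floor:mechanical  1:{peak_day:.2f}")
print(f"unused feed capacity at peak: {headroom:.0%}")
```

Under that assumption the peak works out to a little under 1 amp of mechanical load per amp of floor load, which is consistent with the "roughly 1:1, if anything favoring the floor" reading above.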
posted by Chuck G. at 02:03 PM on Wednesday, March 14, 2007
Categories: Emergency Maintenance, Facility Maintenance
3. Shut Down Interval.
At 12:07 the entire HVAC system was shut down. Datacenter temps are well within reasonable tolerances after 20 minutes on fans alone. We'll update again with more information after the HVAC is returned to service.
Update: 12:32 PM PDT
HVAC systems are running again. We'll summarize the day's work soon.
posted by Chuck G. at 12:27 PM on Wednesday, March 14, 2007
Categories: Emergency Maintenance, Facility Maintenance
2. Repair Work
Thankfully it has remained nice and cold outside today so our HVAC system, which is designed to use outside cool air if available to reduce compressor load, is running 100% on outside air. This allowed us to continue to run the HVAC while the technicians remove the old compressor and install the new one. So from the perspective of the datacenter things appear no different than a normal day here at digital.forest. All the action is happening up on the roof:


In the top image above the techs wrestle the new compressor up a temporary ramp and into place. In the bottom shot you can see the new compressor in place, and the old broken one on the handtruck, ready to be removed.
The Trane Intellipak is an excellent HVAC system that has a myriad of control options. Below you can catch a glimpse into the heart of the controls, which are usually locked behind a steel panel. We usually interface with these systems via software down in the office, but occasionally it is good to have a look at the atoms represented by the bits.

Above is a close up of the breakers and control units for the compressors. You can see that several breakers are in the "off" position, providing safety for the technicians while they work. Others remain "on" so that the system can still function and provide air handling for the datacenter.

Above: digital.forest Facilities Manager Kevin Teker explains how all of this works.
The next step requires the complete shutdown of the HVAC and Fire Suppression Systems, as the HVAC technicians braze some plumbing. Stay tuned.
posted by Chuck G. at 11:57 AM on Wednesday, March 14, 2007
Categories: Emergency Maintenance, Facility Maintenance
1. Preparations
In our world Electricity is transformed into Bits, with the by-product of BTUs (heat). Our job is to handle (route) bits and manage (cool) BTUs. Despite the fact that we are fairly certain that the technicians can get their work done with minimal downtime of the HVAC system, we are living by the old adage "hope for the best, but prepare for the worst." To that end we have performed the following preparations. We are pretty intimate with our facility and know where the heavy users of electricity are located. We have the "hot spots" identified and covered by portable AC units.


We also have the ability to pull outside air into the facility in large volume, and use smaller local fans to provide ventilation to "warm spots." The outside temperature at the moment is 39°F/4°C, so it is a fairly good day to be performing this task.


This process of course requires a bit of preparation itself. High-CFM fans and portable HVAC units are not exactly light users of electricity themselves, and to protect the servers you depend upon we can't just plug them in wherever there's an open outlet. Mechanical motors put variable strains on electrical circuits and it is not smart to put them on the same circuit being used by computers. Therefore we have used building electricity circuits for these devices, rather than the power from our PDUs that feeds the racks. We've taken the extra step to lay extension cords to the various mechanical units, and gaffer-taped those to the floor. Additionally our Facilities Manager has diagrammed the circuits and breakers involved in feeding the mechanical units and calculated the amperage loads so we can avoid popping breakers.
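The amperage calculation described above can be sketched in a few lines of Python. Everything in this example is hypothetical: the device names, circuit labels, amp figures, and breaker ratings are invented for illustration, and the 80% margin for continuous loads is a commonly used rule of thumb that we are assuming here, not a figure taken from our Facilities Manager's diagram.

```python
from collections import defaultdict

# Assumption: keep continuous load at or below 80% of the breaker rating.
SAFE_FRACTION = 0.8

# Hypothetical inventory of (device, circuit, amps); not real measurements.
DEVICES = [
    ("high-CFM fan A", "building-1", 9.0),
    ("high-CFM fan B", "building-1", 9.0),
    ("portable AC 1", "building-2", 12.0),
    ("portable AC 2", "building-3", 12.0),
    ("spot fan", "building-3", 3.0),
]

# Hypothetical breaker ratings in amps for each building circuit.
BREAKER_AMPS = {"building-1": 20.0, "building-2": 20.0, "building-3": 20.0}


def circuit_loads(devices):
    """Sum the amps drawn on each circuit."""
    loads = defaultdict(float)
    for _name, circuit, amps in devices:
        loads[circuit] += amps
    return dict(loads)


def overloaded(devices, breakers, safe_fraction=SAFE_FRACTION):
    """Return the circuits whose total load exceeds the safe margin."""
    loads = circuit_loads(devices)
    return sorted(
        circuit for circuit, amps in loads.items()
        if amps > breakers[circuit] * safe_fraction
    )


for circuit, amps in sorted(circuit_loads(DEVICES).items()):
    print(f"{circuit}: {amps:.1f} A of {BREAKER_AMPS[circuit]:.0f} A breaker")
print("over the safe margin:", overloaded(DEVICES, BREAKER_AMPS))
```

With these made-up numbers, the two fans sharing one circuit exceed the margin, so one of them would be moved to a different circuit before the maintenance starts.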

We have also deployed some temperature probes in critical locations to monitor the ambient temperatures in "cool rows" to see what the intake air is like for servers. Finally, overnight we dropped the datacenter temperature several degrees below our normal 65°F/18°C to provide some "breathing room".


More info coming soon.
posted by Chuck G. at 09:58 AM on Wednesday, March 14, 2007
Categories: Emergency Maintenance
Last week we had a single compressor unit in our Trane Intellipak cooling system fail during an unseasonably warm day. The system has built-in redundancies to handle such situations so we recovered quickly from the condition. In order to prepare for the warmer weather coming soon, we have elected to replace this failed unit now. So tomorrow (Wednesday, March 14) we will have a vendor here replacing the compressor. This will involve occasional, brief shutdowns of our HVAC system.
We have brought in industrial sized high-CFM fans, to maintain air circulation in the facility during the maintenance. Additionally we have several portable 1-ton HVAC systems which we can deploy on an as-needed basis should any areas of our datacenter exceed standard temperatures. We have deployed temperature probes throughout the datacenter to monitor this as the maintenance progresses. As such we are confident that this event will have minimal-to-no impact on operations, since we will be prepared to mitigate any heat issues should we see temperatures rise.
We apologize for the short notice, and we hope you understand the reasons why. We strive to maintain our facility to the highest standard, as well as keep you informed as we take steps to do so. We will post updates throughout the day tomorrow.
Chuck Goolsbee
VP, Technical Operations
digital.forest, Inc.
posted by Chuck G. at 10:22 PM on Tuesday, March 13, 2007
Categories: Emergency Maintenance, Facility Maintenance
The server www.ninewire.com is currently experiencing hardware problems. We are working on the server and will have the issues resolved as soon as possible.
posted by digital.forest at 08:58 PM on Monday, November 13, 2006
Categories: Emergency Maintenance
Sorry for the wait everyone.
Celestial is back up and running. All webpages should be working again.
posted by digital.forest at 02:04 PM on Friday, November 3, 2006
Categories: Emergency Maintenance
We've been experiencing problems with our "Celestial" server for the past few days, and today it's had to come down for hardware maintenance. Any user accounts hosted on celestial will therefore be unavailable until the maintenance is completed. Our apologies for the inconvenience, we'll try and have it back up as soon as possible.
posted by digital.forest at 09:25 AM on Thursday, November 2, 2006
Categories: Emergency Maintenance
The hosting server Titan is currently down for maintenance. We expect to have all of the sites operational by mid-morning on October 10th.
posted by digital.forest at 12:36 AM on Tuesday, October 10, 2006
Categories: Emergency Maintenance
Most accounts have now been moved to the new hardware and are functioning normally. A few accounts are still experiencing some difficulties and we are working hard to find a resolution to each of their unique problems.
If you find that your site is available but not working properly, please submit a trouble ticket through http://www.forest.net/helpdesk providing the main URL, plus the URL and functionality that is not working properly. We will continue to give these issues top priority until all of the sites on Sabre are working properly again.
posted by digital.forest at 12:38 AM on Friday, June 23, 2006
Categories: Emergency Maintenance
The Windows ColdFusion and IIS hosting server Sabre is going down for emergency maintenance. We will be taking two maintenance windows today to improve the performance of the server. The first window will be at 10:30 AM Pacific and will last for approximately 15 minutes. The second will take place at 2:00 PM Pacific and will last for approximately 15 minutes.
We're working hard to improve the performance of this server and hope to see noticeable differences after the second maintenance window.
posted by digital.forest at 10:22 AM on Tuesday, June 20, 2006
Categories: Emergency Maintenance
Mars is currently down for an emergency maintenance due to a failed hardware component. Unfortunately at this time Mars is still down due to this problem. If you are a former Trident/Speedyweb customer and your website and email are currently not working, you are affected by this problem.
Update @ 3:30am PDT: The proper technicians have been notified and we are currently evaluating the situation in order to get this resolved as soon as possible. We anticipate that Mars will be back online around 9:30 AM PDT.
Update @ 8:30am PDT: We are executing our plan to get the machine fully back up.
posted by digital.forest at 10:50 PM on Wednesday, June 7, 2006
Categories: Emergency Maintenance, Hosting Servers
Update @ 12:30 PM PDT: Neptune is now back online
Currently Neptune.wwwnexus.com is down due to a catastrophic hardware failure. We are currently moving it over to new hardware and expect that Neptune will be fully operational by 2:30 PM PDT.
-digital.forest technical support
posted by digital.forest at 09:40 AM on Wednesday, April 26, 2006
Categories: Emergency Maintenance
Final Update:
Mango is now back online. All files are recovered and it is now running on hardware that is a major upgrade over what it was located on before. If you are hosted on Mango and you are experiencing problems, please notify our technical support immediately. -Yvo
UPDATE:
Mango suffered a catastrophic hardware failure. We are currently finding alternative hardware for the server now.
Currently, mango.forest.net is down and is being repaired. Downtime should not be more than 1 hour.
posted by digital.forest at 01:37 PM on Friday, April 7, 2006
Categories: Emergency Maintenance, Hosting Servers
Currently FTP service is disabled on Souari as we are in the process of moving over the files to another service. In light of yesterday's events we have decided to rebuild Souari and get rid of any problems it had. Web Service is NOT affected by this; only the FTP service (the service that allows you to make changes to your website) is affected while this move takes place.
If it is an absolute emergency (such as your website is broken) we will stop the process. We apologize for this major inconvenience however we are doing this to ensure the privacy of your data.
Please monitor this website for any updates concerning this situation.
posted by at 10:50 AM on Wednesday, March 29, 2006
Categories: Emergency Maintenance, souari.forest.net
Neptune.wwwnexus.com continues to be down at this time. We have replaced the power supply, however we think the power supply failed while data was being written to the hard drives. Because of the 'dirty' shutdown caused by the power supply, some of the services (such as the web service) are not starting up.
We are working hard on resolving this situation and at this time can't give an estimated time of repair.
Please stay tuned for any further updates.
posted by at 06:04 PM on Friday, March 10, 2006
Categories: Emergency Maintenance
At 1:30 PM PST the machine neptune.wwwnexus.com shut down due to a failed power supply. We are currently replacing the power supply and testing the machine's file system for any problems.
We anticipate that Neptune will be back online at approximately 2:45 PM.
posted by at 02:08 PM on Friday, March 10, 2006
Categories: Emergency Maintenance
We will be taking sage.forest.net (FileMaker 7 Server hosting) down in a few minutes in order to diagnose a problem we are experiencing with the machine.
The machine should be back up by 5:45pm PST
posted by at 04:53 PM on Friday, February 17, 2006
Categories: Emergency Maintenance
On Thursday night, catalpa.forest.net (smtp.forest.net) suffered a hard drive failure at approximately 7pm PST. Currently we are spooling the mail for the accounts hosted on this machine on another mail server and we have pointed smtp.forest.net to another mail server located at digital.forest. In the meantime we are migrating all the affected accounts over to another mail server.
Accounts on other email servers such as treehouse, palm (infoasis) and ninewire are not affected unless their mail client's outbound (SMTP) server is set to smtp.forest.net. If this is the case please change the outbound SMTP to the same as your incoming (POP3 or IMAP) mail server. Be sure to enable SMTP authentication in your mail client when you make this change.
We currently do not have an estimated time of repair, but we anticipate that we will have the affected customers moved over quickly.
If there are any updates concerning catalpa.forest.net we will post them here as soon as they are available.
digital.forest technical support
posted by digital.forest at 01:34 PM on Friday, January 27, 2006
Categories: Emergency Maintenance, Mail, catalpa.forest.net
At approximately 1:45 AM PST this morning we experienced a power overload condition on a single electrical circuit, which services half of two racks in our facility. This tripped a breaker in one of our Power Distribution Units. We reset the breaker and used power monitoring equipment to measure the load on that circuit as servers rebooted. With the data collected, we rerouted some power cables in those two racks to spread the load in a manner which should prevent this from happening again.
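For the curious, one simple way to think about spreading measured loads across circuits is a greedy placement: take the biggest draw first and put each server on whichever circuit is currently carrying the least. The sketch below is only an illustration of that idea. The server names, amp readings, and circuit labels are hypothetical, and this is not a record of how the cables in those two racks were actually rerouted.

```python
def spread_load(server_amps, circuits):
    """Greedy balancing: largest loads first, each onto the least-loaded circuit."""
    assignment = {circuit: [] for circuit in circuits}
    totals = {circuit: 0.0 for circuit in circuits}
    by_draw = sorted(server_amps.items(), key=lambda item: item[1], reverse=True)
    for name, amps in by_draw:
        # Ties go to the circuit listed first.
        target = min(circuits, key=lambda circuit: totals[circuit])
        assignment[target].append(name)
        totals[target] += amps
    return assignment, totals


# Hypothetical measured draw per server, in amps.
SERVERS = {
    "web1": 4.0,
    "web2": 3.5,
    "mail1": 3.0,
    "db1": 2.5,
    "db2": 2.0,
    "ftp1": 1.0,
}

assignment, totals = spread_load(SERVERS, ["circuit-A", "circuit-B"])
for circuit in assignment:
    print(f"{circuit}: {totals[circuit]:.1f} A  {assignment[circuit]}")
```

Greedy placement does not always give the best possible split, but it is quick to reason about and tends to keep the circuits close to even, which is the goal when the aim is to stay clear of a breaker's trip point.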
Most of the servers affected are digital.forest shared hosting servers, however one of the two racks contained some colocated and one dedicated server. We will be contacting the affected clients during the business day with a follow-up.
Server downtime was limited to about 20 minutes maximum, with most servers being down less than 15 minutes.
We apologize for any inconvenience.
Chuck Goolsbee
VP Technical Operations
posted by Chuck G. at 02:48 AM on Friday, December 30, 2005
Categories: Colocated & Dedicated Servers, Emergency Maintenance, Hosting Servers, Miscellaneous
We are currently troubleshooting a performance issue with Lasso on Banana and appreciate your patience.
---update---
Banana has stabilized.
posted by at 10:50 AM on Saturday, December 10, 2005
Categories: Emergency Maintenance
Treehouse will be shut down at 10pm PDT and we will be migrating all of the users to a new server on a different platform.
Stay tuned for updates by monitoring this space.
Thank you for your continued patience
digital.forest technical support
posted by digital.forest at 04:36 PM on Tuesday, November 1, 2005
Categories: Emergency Maintenance, Mail, treehouse.forest.net
The mail server is currently back online and churning through the queue that has built up due to the downtime. It is still performing poorly at this time.
Connections via your mail client (such as Eudora, Outlook, Entourage, etc.) will most likely not be working at all due to the state of the mail server.
Right now your best bet will be to access the web mail portion by going to http://mail.(yourdomain) and then to log in using your email username and password. This may not work well either, but will be more reliable than using a mail client.
Tonight we will replace the mail server, treehouse.forest.net, with a different mail server in order to restore complete functionality to our email.
We thank you for your patience,
digital.forest technical support
posted by digital.forest at 01:49 PM on Tuesday, November 1, 2005
Categories: Emergency Maintenance, Mail, treehouse.forest.net
In the next 30 minutes we expect to have an update from our email system administrator on a potential return to service time.
More information will be posted to this website at that time.
While message loss is a concern to everyone, we don't anticipate any significant email losses.
Thank you for your continuing patience,
digital.forest technical support
posted by digital.forest at 12:32 PM on Tuesday, November 1, 2005
Categories: Emergency Maintenance, Mail, treehouse.forest.net
In our continuing effort to get our mail server, treehouse.forest.net, fully back online, we will be taking it off line for emergency maintenance. It is down until further notice.
Please monitor this space for further updates.
Thank you for your continuing patience,
digital.forest technical support
posted by digital.forest at 11:33 AM on Tuesday, November 1, 2005
Categories: Emergency Maintenance, Mail, treehouse.forest.net
Update: 8:45 AM Mail.ninewire.com will be fixed next. Downtime should be around 15 minutes, starting at 9 AM PST.
Update: 8:25 AM Treehouse is back up. There are still some issues to work out (TLS/SSL related); stay tuned.
We are carrying out emergency maintenance on two of our five mail servers over the next few minutes. This is to address the problem we noted last night, which was crashing the server. "Treehouse.forest.net" will be the first, with "mail.ninewire.com" being the second. We apologize for any inconvenience this may cause, but these problems require immediate attention in order to provide stable and reliable email service going forward.
posted by Chuck G. at 08:03 AM on Tuesday, November 1, 2005
Categories: Emergency Maintenance, treehouse.forest.net
A network configuration change made during last night's scheduled maintenance has caused a minor issue with clients behind our shared firewall service. A reboot of the firewall will clear this issue and allow proper functioning again. We will reboot the firewall at noon PDT today. Downtime should only be a few seconds while the firewall restarts.
This should have no effect on the rest of our clients or network.
Update: 11:15 AM PDT The firewall reboot has been cancelled as it is no longer required. We have addressed the issue by other means.
For the terminally curious here are the details:
Last night just after midnight, as part of our plan for dramatically increasing our levels of network redundancy, we migrated one of our upstream fiber connections to our second boundary router. We also finished enabling Spanning Tree Protocol on all of our Ethernet switches to recognize redundant trunks we will be deploying in the coming weeks.
In this case it was the gigabit Ethernet connection from XO Communications (AS 2828) that we moved from our original Cisco 6509 router over to our secondary Cisco 6509 router.
When we did this, all network operations appeared to acknowledge the change via iBGP and OSPF protocols as expected.
Unfortunately our managed firewall device did not. We began to get calls from some clients concerning reachability of certain servers around 6:30 this morning. By 7:30 we had isolated the problem to the change made last night, and actually shut down the gigabit connection to XO to guarantee connectivity to shared firewall clients while we worked out how to address this problem with minimal downtime for the affected clients.
We planned a config change and reboot of the firewall for noon today, but in the meantime we were able to forestall that action by redistributing static routes between the firewall and the two different routers via OSPF.
That action was completed at 11 AM PDT today, and should prevent future routing issues like the one we experienced last night.
Please be aware of the following:
This did not mean that servers were "down." The firewall remained up, and all servers behind it were reachable via normal network channels. The issue was that if OUTBOUND traffic from those servers was destined for the XO connection, then the firewall had the incorrect routing information and was unable to send it. XO carries approximately 20-30% of our outbound traffic.
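For readers who want a picture of this failure mode, here is a minimal Python sketch. It is only an illustration under stated assumptions: the router names, the address prefixes (taken from documentation ranges), and the table layout are all hypothetical and are not our actual firewall or router configuration. It shows a longest-prefix next-hop lookup in which a fixed, hand-written table keeps pointing XO-bound traffic at the router that no longer holds the XO uplink, while a table that follows what the routers advertise picks the right one.

```python
import ipaddress


def next_hop(table, destination):
    """Return the next hop for the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    candidates = [net for net in table if dest in net]
    if not candidates:
        return None
    best = max(candidates, key=lambda net: net.prefixlen)
    return table[best]


# Hypothetical prefixes and router names, for illustration only.
XO_REACHED = ipaddress.ip_network("198.51.100.0/24")  # destinations best reached via XO
EVERYTHING = ipaddress.ip_network("0.0.0.0/0")        # default route

# Assumed "before" state: static entries written when XO was on router A.
static_table = {EVERYTHING: "router-a", XO_REACHED: "router-a"}

# After the migration the XO uplink lives on router B.
xo_uplink_router = "router-b"

# Assumed "after" state: the entry follows what the routers advertise.
learned_table = {EVERYTHING: "router-a", XO_REACHED: xo_uplink_router}

stale = next_hop(static_table, "198.51.100.25")
fixed = next_hop(learned_table, "198.51.100.25")
other = next_hop(learned_table, "203.0.113.9")
print(stale, fixed, other)
```

In this sketch only the XO-bound destination changes its answer between the two tables, which matches the observation that servers stayed reachable and only part of the outbound traffic was affected.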
posted by Chuck G. at 10:24 AM on Thursday, October 27, 2005
Categories: Emergency Maintenance, Managed Firewall Services, Network
The Neptune server has been taken down for emergency maintenance. It will likely be down for the remainder of the night and returned to service by 9AM Pacific Time.
We apologize for the inconvenience and hope to avoid further issues through this procedure.
posted by digital.forest at 11:08 PM on Wednesday, October 19, 2005
Categories: Emergency Maintenance
We have learned of a bug, confirmed by FileMaker Inc., that involves FileMakerPro Advanced Server Seven running on a dual-CPU machine. Sage.forest.net is a twin-CPU Xserve and it is experiencing very poor performance under high load due to this bug. Thankfully we have a single-CPU Xserve chassis that we can swap into sage. We will be shutting down sage.forest.net for about 5 minutes within the next 30 minutes to address this issue.
Thanks for your patience while we perform this much needed work.
Update: 13:40 PDT.
This work has been completed.
You can read more info about this particular bug on FileMaker's support FAQ website. Their suggestion for turning off a CPU in firmware seemed a bit over the top for us, especially when we have spare hardware at hand and the Xserve lends itself well to component swapping. Additionally, we added 512 MB of RAM to sage while we had it open.
Again, thanks for your patience while we addressed this issue.
posted by Chuck G. at 01:09 PM on Thursday, October 6, 2005
Categories: Emergency Maintenance