digital.forest Technical Support
News archive: March 2007

Earlier tonight a colocated server on our network was subjected to a Denial of Service (DoS) attack. It began around 7:20 pm. When the attacker was denied the specific target, they broadened the attack to an entire network segment. Clients with servers on one particular subnet here may have had trouble reaching their servers between 8:30 and 8:44 pm PDT. No other subnets were affected.

We've taken steps to minimize the chances of it happening again, and will post updates if required.

posted by Chuck G. at 11:57 PM on Wednesday, March 28, 2007
Categories: Network

4. Wrap up & Summary

We're pleased to report that the repair on our HVAC system is complete, and finished without incident. The final bit of work required brazing & welding within the unit itself. To mitigate any risk of the pre-action fire suppression system discharging its gases, we had our vendor, Fire Chief, come out and disable the system. Part of our annual maintenance procedure for the fire suppression system involves the shutdown of the HVAC system anyway, so Fire Chief took advantage of the situation to perform that maintenance.

Above: Technicians from Fire Chief perform preventative maintenance on the Fire Detection and Suppression system.

During the HVAC system shutdown, digital.forest staff monitored temperatures in various locations around the datacenter, while our Facilities Manager bounced between the roof and the datacenter monitoring our vendors. Below you can see digital.forest Tech Support member Will Winslow and Facilities Manager Kevin Teker in the darkened datacenter just after the HVAC shutdown occurred. They're carrying their temperature monitors and about to spread out to their stations. You can see the high-CFM fans mentioned earlier today in the open door behind them.

All of our preparation paid off, helped by a bit of luck with the weather (it stayed very cool, and it didn't rain), so the natural tendency for the facility to warm up was mitigated by the combination of pre-cooling and the fans pulling outside air into the facility. We're happy to report that the highest temperature we reached was about what we see here on a "normal" day. Our temporary portable HVAC units never even needed to be turned on.



Interesting conclusions:
Electrical capacity is a hot topic in the datacenter management business these days. There are various rules of thumb for estimating how power usage splits between "floor" (meaning the servers) and "mechanical" (meaning the HVAC systems that cool the servers). The variable is the delta between outside and inside ambient temperatures. The hotter it is outside, the harder the HVAC systems have to work to chill the inside. We're blessed to be located in a very moderate climate here in Seattle. It rarely gets very hot here. Nor does it get very cold. Our average temperature is actually quite a bit lower than ideal datacenter temperature. Even in summer, it cools enough at night to keep our average right at ideal datacenter temperature. We monitor electricity usage at several points along the flow, for a lot of reasons, but on our main panel in the datacenter we can check at a glance and see how much power is being used in total. The ammeter, for example, read this way earlier today when we were running the rooftop HVAC on 100% outside air:

That reads 274 Amps. That is 274 Amps of 3-phase power as it comes in off the grid. Our feed is 2000 Amps, so as you can see we have a lot of room for growth with regard to electricity. This is one of the things that really attracted us to this facility when we moved here just over two years ago. With so many datacenter operations running at nearly 100% of their power capacity, we felt it important to be able to accommodate our clients' expanding needs and requirements. This maintenance interval provided us some real-time data concerning the power needs of our mechanical infrastructure. Those rules of thumb mentioned earlier say "for every 1 amp you feed the floor, you feed the mechanical 1 to 1.75 amps." This seems to be borne out by our experience, though rounded down thanks to our temperate, if not downright cool, location here in Seattle. Here is a shot of the ammeter with the HVAC system shut down completely:

That is 219 Amps of 3-phase power. Looking at our monitoring history, we hit our maximum of 400 Amps last July when we had a week of temperatures in the 90-95°F (32-35°C) range. That means we are running at a roughly 1:1 floor:mechanical ratio in terms of electricity at our peak consumption. If anything, we are favoring the floor, which is a great advantage in this industry.
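
The arithmetic behind that ratio can be checked with a short sketch. The ammeter readings (274, 219, and 400 Amps) and the 2000 Amp feed come from this post; the function and variable names are illustrative only. The sketch treats the HVAC-off reading as the floor load, and it assumes the floor load during last July's peak was about the same as today's 219 Amps, which this post does not state.

```python
# Readings quoted in this post, all in Amps of 3-phase power.
FEED_CAPACITY = 2000       # size of the utility feed
HVAC_ON_OUTSIDE_AIR = 274  # total draw with rooftop HVAC on 100% outside air
HVAC_OFF = 219             # total draw with HVAC shut down (treated as floor load)
SUMMER_PEAK = 400          # maximum total draw recorded last July


def mechanical_load(total_amps, floor_amps):
    """Amps drawn by the mechanical (HVAC) side: total minus floor."""
    return total_amps - floor_amps


def mechanical_per_floor_amp(total_amps, floor_amps):
    """Mechanical amps drawn for every 1 amp fed to the floor."""
    return mechanical_load(total_amps, floor_amps) / floor_amps


# ASSUMPTION: the floor load at the July peak was close to today's 219 Amps.
today = mechanical_per_floor_amp(HVAC_ON_OUTSIDE_AIR, HVAC_OFF)
peak = mechanical_per_floor_amp(SUMMER_PEAK, HVAC_OFF)
headroom = FEED_CAPACITY - SUMMER_PEAK

print(f"Mechanical load today: {mechanical_load(HVAC_ON_OUTSIDE_AIR, HVAC_OFF)} A")
print(f"Floor:mechanical today: 1:{today:.2f}")
print(f"Floor:mechanical at peak: 1:{peak:.2f}")
print(f"Headroom at peak: {headroom} A of {FEED_CAPACITY} A")
```

Under those assumptions the peak works out to a little over 0.8 mechanical amps per floor amp, which is where the "roughly 1:1, favoring the floor" reading comes from.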

Yet another benefit of colocation at digital.forest in cool Seattle!

posted by Chuck G. at 02:03 PM on Wednesday, March 14, 2007
Categories: Emergency Maintenance, Facility Maintenance

3. Shut Down Interval.

At 12:07 the entire HVAC system was shut down. Datacenter temps are well within reasonable tolerances after 20 minutes on fans alone. We'll update again with more information after the HVAC is returned to service.

Update: 12:32 PM PDT
HVAC systems are running again. We'll summarize the day's work soon.

posted by Chuck G. at 12:27 PM on Wednesday, March 14, 2007
Categories: Emergency Maintenance, Facility Maintenance

2. Repair Work

Thankfully it has remained nice and cold outside today, so our HVAC system, which is designed to use cool outside air if available to reduce compressor load, is running 100% on outside air. This allows us to continue to run the HVAC while the technicians remove the old compressor and install the new one. So from the perspective of the datacenter things appear no different than a normal day here at digital.forest. All the action is happening up on the roof:

In the top image above the techs wrestle the new compressor up a temporary ramp and into place. In the bottom shot you can see the new compressor in place, and the old broken one on the handtruck, ready to be removed.

The Trane Intellipak is an excellent HVAC system that has a myriad of control options. Below you can catch a glimpse into the heart of the controls, which are usually locked behind a steel panel. We usually interface with these systems via software down in the office, but occasionally it is good to have a look at the atoms represented by the bits.

Above is a close up of the breakers and control units for the compressors. You can see that several breakers are in the "off" position, providing safety for the technicians while they work. Others remain "on" so that the system can still function and provide air handling for the datacenter.

Above: digital.forest Facilities Manager Kevin Teker explains how all of this works.

The next step requires the complete shutdown of the HVAC and Fire Suppression Systems, as the HVAC technicians braze some plumbing. Stay tuned.

posted by Chuck G. at 11:57 AM on Wednesday, March 14, 2007
Categories: Emergency Maintenance, Facility Maintenance

1. Preparations

In our world Electricity is transformed into Bits, with the by-product of BTUs (heat). Our job is to handle (route) bits and manage (cool) BTUs. Despite the fact that we are fairly certain that the technicians can get their work done with minimal downtime of the HVAC system, we are living by the old adage "hope for the best, but prepare for the worst." To that end we have performed the following preparations. We are pretty intimate with our facility and know where the heavy users of electricity are located. We have the "hot spots" identified and covered by portable AC units.

We also have the ability to pull outside air into the facility in large volume, and use smaller local fans to provide ventilation to "warm spots." The outside temperature at the moment is 39°F/4°C, so it is a fairly good day to be performing this task.

This process of course requires a bit of preparation itself. High CFM fans and portable HVAC units are not exactly light users of electricity themselves, and to protect the servers you depend upon we can't just plug them in wherever there's an open outlet. Mechanical motors put variable strains on electrical circuits and it is not smart to put them on the same circuit being used by computers. Therefore we have used building electricity circuits for these devices, rather than the power from our PDUs that feeds the racks. We've taken the extra step to lay extension cords to the various mechanical units, and gaffer-taped those to the floor. Additionally our Facilities Manager has diagramed the circuits and breakers involved in feeding the mechanical units and calculated the amperage loads so we can avoid popping breakers.
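
The kind of load calculation described above can be sketched as follows. Every circuit name, breaker rating, and device amperage here is hypothetical (the actual circuit diagram is not reproduced in this post), and the 80% continuous-load margin is a common rule of thumb assumed for illustration, not a figure from this post.

```python
# All values below are hypothetical placeholders, not measurements from the facility.
CONTINUOUS_LOAD_MARGIN = 0.8  # ASSUMPTION: plan for at most 80% of the breaker rating

# Breaker rating in amps for each building circuit (hypothetical).
breakers = {"building-A": 20, "building-B": 20}

# (device, circuit, amps drawn) for each mechanical unit (hypothetical).
devices = [
    ("high-CFM fan 1", "building-A", 9.0),
    ("portable AC 1", "building-A", 12.0),
    ("high-CFM fan 2", "building-B", 9.0),
]


def circuit_loads(devices):
    """Sum the planned amperage on each circuit."""
    loads = {}
    for _name, circuit, amps in devices:
        loads[circuit] = loads.get(circuit, 0.0) + amps
    return loads


def overloaded_circuits(devices, breakers, margin=CONTINUOUS_LOAD_MARGIN):
    """Return circuits whose planned load exceeds the allowed share of the breaker rating."""
    loads = circuit_loads(devices)
    return sorted(c for c, amps in loads.items() if amps > breakers[c] * margin)


for circuit, amps in sorted(circuit_loads(devices).items()):
    limit = breakers[circuit] * CONTINUOUS_LOAD_MARGIN
    status = "OVER" if amps > limit else "ok"
    print(f"{circuit}: {amps:.1f} A planned, {limit:.1f} A allowed ({status})")
```

With these made-up numbers the first circuit is over its allowance, which is the sort of result that would prompt moving a unit to a different circuit before the maintenance starts.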

We have also deployed some temperature probes in critical locations to monitor the ambient temperatures in "cool rows" to see what the intake air is like for servers. Finally, overnight we dropped the datacenter temperature several degrees below our normal 65°F/18°C to provide some "breathing room".

More info coming soon.

posted by Chuck G. at 09:58 AM on Wednesday, March 14, 2007
Categories: Emergency Maintenance

Last week we had a single compressor unit in our Trane Intellipak cooling system fail during an unseasonably warm day. The system has built-in redundancies to handle such situations so we recovered quickly from the condition. In order to prepare for the warmer weather coming soon, we have elected to replace this failed unit now. So tomorrow (Wednesday, March 14) we will have a vendor here replacing the compressor. This will involve occasional, brief shutdowns of our HVAC system.

We have brought in industrial-sized high-CFM fans to maintain air circulation in the facility during the maintenance. Additionally we have several portable 1-ton HVAC systems which we can deploy on an as-needed basis should any areas of our datacenter exceed standard temperatures. We have deployed temperature probes throughout the datacenter to monitor this as the maintenance progresses. As such we are confident that this event will have minimal-to-no impact on operations, since we will be prepared to mitigate any heat issues should we see temperatures rise.

We apologize for the short notice, and we hope you understand the reasons why. We strive to maintain our facility to the highest standard, as well as keep you informed as we take steps to do so. We will post updates throughout the day tomorrow.

Chuck Goolsbee
VP, Technical Operations
digital.forest, Inc.

posted by Chuck G. at 10:22 PM on Tuesday, March 13, 2007
Categories: Emergency Maintenance, Facility Maintenance

Tonight, a little before midnight, a client created a forwarding mail loop, back to themselves via an external address. A single message queued on the server "treehouse" looped out via our mailhub, to an external mail address, which then looped back via postini, and into treehouse. You will note that this loop is asynchronous, which prevented the built-in mail loop detection features from stopping it.

Within seconds, this loop started clogging our outbound SMTP queues, and it was finally detected by our monitoring systems as the disks of our mail servers began to fill.

It took us well over an hour to get this under control, and required us to stop processing mail for several minutes at a time. As it was a forwarding loop, the message grew in size every time it looped, and we ended up with tens of thousands of looped messages, each queued on multiple servers here. We were able to delete them from the SMTP queue on treehouse, but not on our outbound spam/virus filtering mail hub (due to limitations of that device's software).

The loop will have lingering effects, and we're taking the following steps to mitigate them:

* We have removed the filtering mail hub from its primary task of handling all of our outbound mail. It will take some time for it to unload its outbound and inbound (bounces) queues.

* We have configured our mail servers to relay mail directly outbound. This will slow normal delivery, as we cannot filter outbound, and any pollution of the mail stream with spam (usually via forwards to external addresses) may cause remote servers to temporarily reject our mail via "greylisting".

* We have scoured the queues for copies of this looping message and deleted them. Some are inevitably still "out there" on external servers, so we have created filters to reject them.

* We will contact the client who created the mail loop and explain to them how NOT to do that in the future.
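
One generic way to spot a looping message is to count its Received headers, since each mail server that handles a message normally adds one. The sketch below illustrates that idea only. It is not a description of the filters mentioned above, and the threshold of 25 hops, the function names, and the sample message are assumptions chosen for illustration.

```python
from email import message_from_string

MAX_HOPS = 25  # ASSUMPTION: illustrative threshold, not a value from this post


def hop_count(raw_message):
    """Count Received headers; each relaying server normally adds one."""
    msg = message_from_string(raw_message)
    return len(msg.get_all("Received", []))


def looks_like_loop(raw_message, max_hops=MAX_HOPS):
    """Flag a message that has passed through more servers than max_hops."""
    return hop_count(raw_message) > max_hops


# Hypothetical message that has been relayed 30 times.
sample = (
    "Received: from relay.example by mail.example\n" * 30
    + "Subject: test\n"
    + "\n"
    + "body\n"
)

print(hop_count(sample), looks_like_loop(sample))
```

A forwarding loop typically adds headers on every pass, so the count climbs quickly while ordinary mail stays well under the limit.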

We apologize for any inconvenience this may cause you. NO MAIL WAS LOST during this event, but we do expect that delivery will be delayed throughout the day today as queues clear out. There is no way for us to prioritize some mail over others as it will really be up to the willingness of remote servers to accept our mail in a timely fashion, as the queues for external domains rotate to the head of the line.

In the "good old days" before the spam problem, this issue was solved via automated technical means, but now the ubiquitous deployment of spam filtering technologies has complicated the environment significantly. We took every possible measure to detect and correct this issue before it escalated into an actual outage or a crash of the servers involved. We must now ask for your patience as the resulting backlog clears itself.

Please note: For obvious reasons, I have elected not to use the email notification system of this blog. If you rely on email to be notified of digital.forest support updates, I suggest switching to an RSS reader. The link for doing so is just over to the right side of this page. -->

Regards,
Chuck Goolsbee
VP, Technical Operations
digital.forest, Inc.

posted by Chuck G. at 01:41 AM on Thursday, March 8, 2007
Categories: Mail

Tonight during our scheduled maintenance we will be applying security patches to all of our Windows hosting servers. We will reboot them one at a time and expect servers to be down no more than 5 minutes each.

The work will begin at 11:00 pm.

posted by Kyle at 10:02 PM on Thursday, March 1, 2007
Categories: Windows Hosting Servers