digital.forest Technical Support
News archive: Network

Tomorrow night (Thursday, May 15th) between 11pm and Midnight Pacific Daylight Time, we will be shutting down a connection to a BGP peer. The circuit in question has been replaced with one of larger capacity. There should be no operational impact as traffic will automatically redistribute itself over our other connections. Persistent connections such as VPN tunnels may reset themselves if they were established over this particular route. This will have no impact on other forms of network traffic, such as web, email, FTP, etc.

posted by Kyle at 05:07 PM on Wednesday, May 14, 2008
Categories: Network

We're pleased to announce a new connectivity peer added to our BGP routing: Level(3). The fiber optic connection went live this morning, and we should have BGP routing fully operational by mid-week. This connection is a full Gigabit Ethernet circuit and should add to our network performance and reliability.

Our goal here at digital.forest is to provide multiple layers of redundancy, as well as physical diversity in our upstream connectivity. Not only do we have multiple connections, they terminate in diverse physical locations around the Seattle metropolitan area. This circuit from Level(3) terminates at their Seattle POP ("Point Of Presence") at 1000 Denny. We also have two Gigabit Ethernet circuits, from two separate providers, over two different pathways, landing at the Westin Building in downtown Seattle where we peer with NTT/America & the Seattle Internet Exchange. Additionally we have fiber connectivity to a provider who has a POP located right in our building, Time-Warner Telecom. There is another circuit which currently crosses the Intergate.Seattle datacenter campus and connects us to a provider whose main Seattle POP is in Intergate.East. This last circuit is scheduled for shutdown in early June as we plan to replace it with this new Level(3) connection. This Level(3) circuit has been in process for several months, and we expect it will provide the high quality of connectivity and performance you expect as a digital.forest client.

Regards,
Chuck Goolsbee
V.P. Technical Operations
digital.forest, Inc.

posted by Chuck G. at 06:35 PM on Tuesday, May 6, 2008
Categories: Network

We're happy to announce that our new dedicated fiber link to the Westin Building is now online. This circuit will allow our clients to secure small/medium sized (10-500mb) bandwidth connections from any carrier at the Westin with very reasonable loop costs. This offers a "best of both worlds" situation for d.f clients; the finest colocation facilities available in the Seattle area connected to the carrier-rich connectivity of the Westin Building. If you are interested in dedicated connectivity to your equipment at digital.forest contact your Sales Representative today.

Additionally we are using the circuit to re-establish our connection to the Seattle Internet eXchange (SIX). That BGP session will be turned up tonight around 11pm. This change will likely result in better routing to regional access networks and select long-haul transit providers who also peer at the SIX.

Stay tuned for more connectivity-related announcements as we're adding another tier-1 transit provider to the mix of our BGP routing within the next week or so.

Regards,
--Chuck Goolsbee
V.P. Technical Operations
digital.forest, Inc

posted by Chuck G. at 02:00 PM on Thursday, May 1, 2008
Categories: Network

Our new Fiber Optic installation was terminated this afternoon. The contractor will be back tonight around 11 pm to test the circuit. We hope to have this circuit up and running by the end of the week.

In other connectivity news, we have a new Gigabit Ethernet connection arriving soon for a BGP session with another major provider. It was originally scheduled for January, but has been delayed by fiber and power issues at another location. That circuit may also be up and running by the end of this week. We'll share more details about that as the turn-up approaches.

posted by Chuck G. at 07:04 PM on Monday, April 21, 2008
Categories: Datacenter Expansion, Dedicated Westin Circuit, Network

We have a contractor here today pulling a new fiber optic connection into the building and our datacenter. Initially we'll be using this new fiber for a dedicated connection to the Westin Building in downtown Seattle. Over this connection we'll connect our network to the Seattle Internet eXchange aka "The SIX"... we were SIX members when we were located in Bothell and are looking forward to the peering opportunities there again now that we've settled into our new facility in Seattle. More importantly however, this circuit will allow relatively inexpensive direct cross-connects for our valued clients who are seeking low-to-moderate bandwidth (10mb-300mb) at the Westin. Acquiring high-bandwidth (1Gb+) at our location is very cost effective, but the backhaul and loop costs can be prohibitive for smaller scale purchases. We're seeking to remedy that via this installation and provide our clients with more and better choices for direct connectivity.


Installing the fiber meant running a new 24-strand bundle from the vault out in front of our building up to the network core in our datacenter six floors above. The contractors arrived and found the vaults with a bit of water in them, not surprising due to our rainy climate. They ran a pump and removed the water so as to make working in the vaults less... wet. The water was pumped out onto the parking lot, where it ran down into a nearby storm drain.

Above: Looking down into one of the two fiber vaults in front of the building. Just a bit of water down there. The vaults are designed in such a way to keep the fiber conduits above the level of pooled water, but even so, the fiber itself is well-protected. Each strand is insulated and the whole bundle is encased in a weatherproof jacket seal, and then the bundles are run through plastic "innerduct".

Above: A view of the work from a balcony on our floor. The grassy area and shoulder of Tukwila International Blvd (SR 99) at the top of the frame is where virtually all the fiber optics that run southbound out of the Seattle metropolitan area are located. This is the principal reason why the Intergate.Seattle datacenter campus was built here.

Above: Kevin from digital.forest holds the ladder while Kevin from the cable contractor opens up the fiber junction box at the top of the conduit run from the vault six floors below.

Above: Preparing to run the innerduct.

Above: Cable Installer pulling the innerduct through to the network core. He is standing up on our ladder racking, which is about eight feet (2.5m) above the floor. All our previous fiber optic cable installations are on the left. If it comes from outside the building it arrives wrapped in innerduct. If it comes from within the building it is inside a simple jacket. We have some multi-pair bundles, as well as a few single-pair runs to other parts of the datacenter (usually for storage area networks.) All of our connectivity from the core out to the datacenter for IP networking is on the right hand side. Don't worry, no cables are ever stepped on, and we rarely climb up on the racking.

Above: A view down to the parking lot where the first pull from ground level is going up.

Above: The fiber bundle has arrived at the junction box (you can see the cable pulling harness hanging out of the innerduct to the left of the installer). He is on a radio to the installer at the other end of each innerduct. They pulled the full length of slack to here next, then made the final pull to the network core.

Above: Starting the last pull. They use cloth tape, which is pre-installed inside the innerduct, to pull the cable itself through.

Once the pull is complete they leave large slack loops at either end. On Monday another team will come and terminate the fiber into a panel here in the datacenter, and splice the other end down in the vault. You can see three Fiber Termination Panels just to the left of the installer's head. Another one of these will appear Monday.

The final step was tying down the innerduct to the ladder rack and labeling the install...

Stay tuned for an update on Monday evening.


posted by Chuck G. at 12:04 PM on Friday, April 18, 2008
Categories: Datacenter Expansion, Network

Between 4:45 am and 5:45 am on Monday, March 31st one of our upstream peers (Time Warner) will be replacing a faulty card in one of their routers.

This will cause an outage of up to 10 minutes on this circuit. Our other peers will handle all of our traffic during this maintenance.

posted by Kyle at 04:40 PM on Saturday, March 29, 2008
Categories: Network

Sometime between 12:01am and 4:00am on 03/20/08 one of our upstream providers will be starting maintenance on our circuit. Once they start, the maintenance should last about 30 minutes. They are going to be moving our connection from its current port to a new port. This will cause an outage of about 5 minutes on our connection with them.

Our other providers will carry our traffic while this maintenance is occurring.

posted by Kyle at 02:37 PM on Monday, March 17, 2008
Categories: Network

UPDATE 2/23/08 1:29 PM PST Work is complete and the circuit is stable again. We will now bring our BGP peer back up to restore full redundancy at our border.

This morning at approximately 10:00 am the transport provider to one of our BGP peers started experiencing intermittent connectivity issues. They have narrowed the problem down to a bad card in one of their switches in the Westin building.

They have a tech en route and should have this resolved in a few hours. In order to minimize customer impact while they work we have shut down the affected BGP session. Our other peers will take up the slack in the meantime.

posted by Kyle at 12:22 PM on Saturday, February 23, 2008
Categories: Network

No, that is not Rodin's Thinker; it is our fiber optic testing contractor running a certification check of the new fiber circuit we created for a customer yesterday. Clients of digital.forest not only benefit from our well-managed BGP-meshed network, they can also choose to directly connect to bandwidth providers here in the building, and in the Intergate.Seattle datacenter campus. In the business this is what is called being "carrier-neutral", so unlike, for example, an AT&T facility that only provides AT&T bandwidth, digital.forest clients have choice.

In this case our customer bought a small circuit from InterNAP, who has a network point of presence across the campus from us. We provided the connection to the campus fiber network and did end-to-end testing of the circuit after it was complete.

Very soon we will be formally announcing our own point of presence at The Westin Building in downtown Seattle. This will allow our customers to provision low cost dedicated circuits between their equipment at digital.forest and the myriad of providers found at the Westin. Contact your digital.forest sales or account manager today, or watch our support blog for more information.

posted by Chuck G. at 12:44 PM on Wednesday, February 20, 2008
Categories: Datacenter Expansion, Network

In our continued efforts to upgrade and improve our network we will be moving the GigE connections to one of our distribution switches to the new GigE modules in the core switches.

This maintenance will take place during our scheduled maintenance window on Thursday, Feb. 14th between 11:00 pm and midnight. The expected impact of this will be about 1 minute of downtime. This will affect servers in rows 11, 12 & 13 of Datacenter 1.

posted by Kyle at 03:18 PM on Tuesday, February 12, 2008
Categories: Network

Tonight we installed two new 16-port fiber optic cards in our network core. One of the two is pictured above - it is the one with the empty ports slotted between the 8-port fiber card above and the 48-port copper card in the middle. We've seen a sharp increase in clients requesting a fiber connection to our network, as well as more sophisticated connectivity such as BGP routing with our AS combined with fail over protocols such as HSRP. Additionally we are part of the way through a project to seriously upgrade and expand our network; better external connectivity, and more connectivity options for our clients. These cards are a small, but important part of that project. We'll have more news and some exciting announcements in the coming weeks. Stay tuned!

posted by Chuck G. at 11:23 PM on Tuesday, February 5, 2008
Categories: Datacenter Expansion, Network

This morning at 10:37 Pacific Standard Time, one of our network peers experienced an unusual event on their network. They related to us the following information: "A router in Salt Lake City dropped its routing tables, which caused a router in Chicago to lose several BGP sessions."

The event was felt here on our network as a sudden loss of traffic going out in their direction. It lasted about five minutes, as things returned to normal by 10:47 AM. The traffic appeared to shift to our other connections as they increased proportionally as the one decreased. Some clients may have noticed this in the form of resets on persistent connections such as VPNs. It was likely invisible to the average "web surfer" however.

We are still awaiting an official explanation from the NOC of the provider in question, but we felt it important to state what we know now so that you are aware. When we hear more, we'll update this post.

posted by Chuck G. at 02:43 PM on Tuesday, February 5, 2008
Categories: Network

We finally turned up the new circuit over the weekend. Friday night to be specific. Close observation over the weekend proved that the circuit was performing exactly as we had anticipated. At this time we're happy to declare the installation a success and start focussing on our next project. Stay tuned for news on that very soon.

Thanks for your patience as we completed this installation.

posted by Chuck G. at 10:35 AM on Monday, February 4, 2008
Categories: Network, Scheduled Maintenance

UPDATE 02/01/08: As of 11:07 PM PST we are fully up and running on our newest BGP Peer.

Last week I announced that we were adding a new BGP peer. It was originally scheduled for last weekend, but ended up not happening on schedule.

After clearing a few technical and procedural hurdles this week, we're finally ready for this to actually happen, and it is now scheduled for Friday night. If you are extremely curious as to the nature of the delay, feel free to read on.

---

Please accept our apologies for the delay which unfortunately was completely beyond our control. To explain what happened I need to provide a bit of background on how the Internet works. Please remember that I am vastly simplifying a very complex system in order to condense this into a small blog post... books as heavy as boat anchors have been written on this subject but I really can't go into the minutiae here without everyone's eyes glazing over... so here is the Cliff Notes version:

* The Internet is a collection of autonomous networks, all interconnected.
* Networks are collections of hosts each given a unique address.
* The glue that holds the networks together is called BGP.
* BGP sees networks as aggregate collections of addresses called "prefixes"

So when we connect to another network, we announce our prefixes to them and they announce theirs to us. At either end of the connection are network devices called routers and they do filtering and weighting to decide what routes work best for your traffic. Filtering is important because it allows networks to send & receive the proper traffic and ignore improper traffic. For example, digital.forest has a connection with both "Network A" and "Network B". However, we do not want to be a transit point BETWEEN "Network A" and "Network B", so we filter appropriately. We only want "our" traffic to go over these routes, not the whole world's traffic. Every network does this to a certain extent if they are connecting to multiple other autonomous networks.
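To make the "no transit" idea concrete, here is a minimal sketch of an outbound announcement filter in Python. It is only an illustration, not our router configuration: the prefixes are reserved documentation ranges, and the function names are made up for this example.

```python
import ipaddress

# Hypothetical prefixes (reserved documentation ranges, not real allocations).
OUR_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]
CUSTOMER_PREFIXES = [ipaddress.ip_network("198.51.100.0/24")]


def allowed_to_announce(route, permitted):
    """Return True if route equals, or is a subnet of, a permitted prefix."""
    route = ipaddress.ip_network(route)
    return any(
        route.version == p.version and route.subnet_of(p) for p in permitted
    )


def export_filter(routing_table, permitted):
    """Keep only the routes we are willing to announce to a peer."""
    return [r for r in routing_table if allowed_to_announce(r, permitted)]


# Routes our router knows about: our own, a customer's, and one that was
# learned from "Network A" (which we must not pass along to "Network B").
known_routes = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
to_network_b = export_filter(known_routes, OUR_PREFIXES + CUSTOMER_PREFIXES)
print(to_network_b)
```

With these sample values only the first two routes survive the filter, so "Network B" never learns a path to "Network A" through us.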

Most large transit networks use routing databases to associate autonomous networks with their announced prefixes. This acts as a security & authentication layer, as well as a basis for filtering policies at the networks that query the databases. The databases are maintained (usually) by the entities that allocate the addresses, so they are a trusted source. The databases are then replicated and shared among the network operators. There are also "route servers" and "looking glasses" at various locations around the Internet for network operators to check to see how they fit into this big meshed network and verify that what they want to happen is in fact happening.

Mind you all of the above is a vast simplification, so if you knew nothing about this until now, it is hopefully understandable. If you already know how all this works you know I left plenty of detail out, but you should hopefully recognize that it is all basically correct. Now on to what happened over the past week...


Here at digital.forest we announce several prefixes. A few of our own, and several on behalf of our customers who have been allocated specific IP address ranges different than ours. Last weekend we turned on our new circuit in the wee hours one night and from here it looked great - traffic flowed at a rate we expected it to. But before we went too far along in time we consulted the various route servers out there to see what the Internet saw: How did this new connection look from the outside looking in? What we saw was just one of our prefixes being carried by this new connection. Not wanting to risk weird routing issues we shut the new circuit down and got in contact with the provider's NOC to see why all the prefixes we announced were not picked up by their network. This prompted a round of paperwork and approvals on their end, as we discovered that they do not rely on the routing databases to determine their route filtering policies. Instead they do it manually. I will not make any judgement calls as to that policy of theirs... I understand why some entities choose manual methods over automatic ones, after all I shift my own gears when I drive... sometimes manual systems are a better choice. In this case though it certainly slowed down the process. We submitted our full prefix list to them early in the week. It took them until yesterday to enter them in their systems. We are waiting a full 48 hours for the projected propagation time so that their entire network, and their BGP peers pick up the changes, then we will re-enable the circuit. Kyle Murray, our Network Manager has been the man on point throughout this process and has done an excellent job making sure it all goes well.
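The "outside looking in" check boils down to comparing two lists: what we announce, and what a route server reports seeing. A rough sketch in Python follows. It assumes both lists have already been collected by hand or by some other tool; the prefixes are documentation ranges and the function name is hypothetical.

```python
import ipaddress


def missing_prefixes(announced, seen_by_route_server):
    """Return announced prefixes that do not appear in the outside view."""
    announced_set = {ipaddress.ip_network(p) for p in announced}
    seen_set = {ipaddress.ip_network(p) for p in seen_by_route_server}
    # Sort on a plain tuple so IPv4 and IPv6 networks can coexist in one list.
    return sorted(
        announced_set - seen_set,
        key=lambda n: (n.version, int(n.network_address), n.prefixlen),
    )


# Hypothetical sample data: three prefixes announced, only one seen upstream.
announced = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
seen = ["192.0.2.0/24"]

missing = missing_prefixes(announced, seen)
for prefix in missing:
    print("not visible upstream:", prefix)
```

An empty result would mean everything we announce is being carried; anything else is a reason to hold off on moving traffic, which is the situation described above.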


Several of our clients are looking hopefully at this new circuit with some expected performance increases as it is a recognized "better" network than the circuit we are replacing. These clients are also some of the specific secondary prefixes that we announce. We wanted to make sure that this circuit turn up goes very well with no possibility for unusual behavior of our clients' traffic. Hence the delays to make sure everything was exactly as it should be. We are now very confident, but will go through the same process as last time: turn up, then check and see how it looks both from within and without. Trust, but verify.

My goal in these posts is to provide you with clarity as to what happens here at an operational level at digital.forest. We are blessed with excellent staff, and truly the best clients a company could hope for. I enjoy sharing this information and I hope it serves to boost your confidence in us as we care for your vital systems in our facility and on our network. I know that you look to us to "just make it work" but it can only help for us to communicate on an ongoing basis what is involved behind the scenes to accomplish that task.

Regards,
Chuck Goolsbee
VP Technical Operations
digital.forest, Inc


posted by Chuck G. at 03:04 PM on Thursday, January 31, 2008
Categories: Network, Scheduled Maintenance

Our maintenance could not be completed this morning. We will be doing the work during our scheduled maintenance window this evening between 11:00 pm and midnight.

Our network maintenance originally scheduled for early tomorrow morning has been pushed back a bit by the vendor. It is now scheduled for the early morning hours of Monday, January 28th.


posted by Chuck G. at 03:09 PM on Friday, January 25, 2008
Categories: Miscellaneous, Network, Scheduled Maintenance

We will be adding a new BGP Peer over the weekend. The actual cross-connect of the fiber circuit is happening today and BGP turn-up will happen sometime in the wee hours of Friday or Saturday. We'll be adding AS4323, also known as Time-Warner Telecom via a gigabit Ethernet connection. TWTC will be replacing our Fast Ethernet circuit we've had with AS2828, also known as XO Communications.

This should have no impact on service for any of our clients, just a change in routing at our network boundary. If anything we should see an improvement in performance overall. No changes to our other connections are scheduled.

In a few weeks we will be adding another Gigabit Ethernet circuit with AS3356, also known as Level(3). We'll post more news on that as it approaches.


posted by Chuck G. at 09:12 AM on Thursday, January 24, 2008
Categories: Miscellaneous, Network, Scheduled Maintenance

On Saturday, December 29th during our scheduled maintenance window we will be making configuration changes to one of our upstream BGP peering sessions. This change will require a reset of the BGP session to complete. The reset will go unnoticed for the most part as our other peers will handle our traffic during the maintenance.

The maintenance should last about 30 seconds and will occur between 11:00 pm and 1:00 am.

posted by Kyle at 12:00 AM on Thursday, December 27, 2007
Categories: Network

Between the hours of 12:01 AM and 1:00 AM Thursday, November 29th, and Friday, November 30th, we will be performing some hardware upgrades on portions of our network.

We will be replacing gigabit Ethernet modules on a few switches deployed in the colocation portions of our facility. Expect intermittent outages, likely less than a minute in duration, but possibly lasting up to five minutes while we install these improvements. Every effort will be made to minimize downtime.

Thank you for your patience.

posted by Chuck G. at 08:59 AM on Wednesday, November 28, 2007
Categories: Colocated & Dedicated Servers, Network

UPDATE: 11-11-07 01:36 PDT Our upstream has resolved their routing issue by replacing a failed router processor card. We have turned up our interface and traffic is back to normal.

Tonight around 20:20 PDT we started seeing an issue with one of our network peers. Customers reported intermittent connectivity issues to our network. We contacted our peer and confirmed the issue and by 21:15 we determined the best course of action was to shut down our interface with them until the issue on their network was resolved. So for the time being one of our upstream connections is offline. This is not an issue really as we have more than enough bandwidth on our other connections to handle our full load and then some.

Our peer will notify us when they have corrected the issue on their network and we'll re-establish our connectivity with them at that time.

We'll update this post with more news as required.

posted by Chuck G. at 10:10 PM on Friday, November 16, 2007
Categories: Network

Tonight during our scheduled maintenance window (11:00pm-2:00am) we will be making some changes to our network configuration. This will momentarily interrupt service on one of our upstream network connections. There should be no service impact as our other connections should carry all our traffic.

This is being done to address the Comcast routing issue posted yesterday.

posted by Chuck G. at 01:19 PM on Thursday, October 18, 2007
Categories: Network, Scheduled Maintenance

We've noted an unusual network issue affecting Comcast customers in the Pacific Northwest. Packets coming into and out of our network are taking unusual paths to get to Comcast users, frequently via California, which is adding latency and occasionally causing timeouts when accessing services on our network.

The issue started some time late yesterday. We're keeping an eye on the situation and will update this notice when we have any further information.

posted by Chuck G. at 08:57 AM on Wednesday, October 17, 2007
Categories: Mail, Network

UPDATE 2007-10-04 11:35 PM: Tonight's maintenance will be rescheduled for a later date. No maintenance activity will occur tonight.

Tonight during our scheduled maintenance window we will be changing the switch processor engine in one of our distribution switches in order to resolve some intermittent issues with the switch. We will be doing this by moving each connection on the current switch to a new switch one at a time. Customers will experience an outage of up to 30 seconds as each connection is moved. In reality this outage will most likely be less than 5 seconds.

This maintenance will affect servers that are in rows 11,12 & 13 of the datacenter. The maintenance will occur between 11:00 pm and 1:00 am.

posted by Kyle at 03:55 AM on Thursday, October 4, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be resetting one of our BGP sessions in order to balance our outbound traffic. Our other providers will take the traffic during the reset.

The reset will occur between 11:00 pm and 1:00 am.

posted by Kyle at 02:06 AM on Tuesday, September 4, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be making some changes to the BGP config on our border routers. In order to complete these config changes we will need to reset the BGP peer sessions. As each session is reset our other peers will handle the traffic.

We will also be implementing a BGP trigger router that will give us more control over traffic in order to mitigate DoS attacks. While this will not be service affecting it is an important milestone in our efforts to build the most secure network possible.

The maintenance will occur between 11:00 pm and 1:00 am

posted by Kyle at 09:21 AM on Thursday, August 30, 2007
Categories: Network

One of our network peers is experiencing problems on their backbone. Our BGP session with them started failing at 4:18 PM PDT. We shut down that interface in order to prevent their issues from causing our customers any problems. We are in contact with their Network Operations Center and will re-enable our connectivity to them as soon as their issue(s) are resolved. At this time they have no ETA for a fix.

Our other circuits are taking our traffic, so this issue should be mostly invisible to our clients.


Update: 5:40 PM PDT We have a very solid understanding of what happened a little over an hour ago with regards to the upstream network event. One of our neighbor networks experienced a routing problem: we saw the routing table from them shrink from roughly 225,000 entries down to 84 entries over a period of several minutes. Unlike a link failure, this was a circuit whose performance was rapidly degrading, but still "up". This prevented normal, automatic fail-over procedures from working. We shut down the interface that connects our two networks, and things failed over gracefully at that point.
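A circuit that is "up" but has lost most of its routes can in principle be caught by watching the number of routes received from the peer. The Python sketch below shows one possible check. It is not the tooling we used that day: the 50% threshold, the function name and the intermediate sample counts are assumptions for illustration; only the 225,000 and 84 figures come from the event itself.

```python
def peer_looks_degraded(baseline_prefixes, current_prefixes, min_fraction=0.5):
    """Flag a peer whose received route count fell below a fraction of baseline."""
    if baseline_prefixes <= 0:
        return False  # no usable baseline, so make no judgement
    return current_prefixes < baseline_prefixes * min_fraction


# Start and end counts are from the post; the values in between are invented.
samples = [225000, 224800, 150000, 40000, 84]
baseline = samples[0]

for count in samples:
    status = "DEGRADED" if peer_looks_degraded(baseline, count) else "ok"
    print(count, status)
```

A check like this only raises a flag. Whether to shut down the interface is still a decision for a person, since a shrinking table can also have harmless causes such as a planned filter change on the peer's side.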

That circuit will remain shut down until we have confirmation from them that their network has stabilized.

We will update this post when we have more news.

Update: 8:50 PM PDT Our upstream peer's problems have been resolved. We will be bringing up our BGP session with them tonight at 10:00 PM.


Update: 10:03 PM PDT BGP session is up. Peer is taking traffic and route table is fully populated.

posted by Chuck G. at 09:48 AM on Wednesday, August 29, 2007
Categories: Network

Tonight we will be making some configuration changes to our border routers in our continued efforts to provide the most reliable and secure network possible. These changes will be non-service affecting.

This work will be performed between 9:00 pm and midnight.

posted by Kyle at 06:17 AM on Thursday, August 16, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be making a minor configuration change to one of our upstream connections. This may cause a brief interruption of the connection, however, our other upstreams will handle the traffic during this time.

This maintenance will occur between 11:00pm and midnight tonight.

posted by Kyle at 03:43 AM on Monday, August 13, 2007
Categories: Network

Final Summary

It is 3:00 PM PDT on Monday, August 6, and as of 1:15 PM PDT everything here at digital.forest is back to normal. At that time we brought up our connection with 'Network X' and it has been stable ever since. As promised I will provide a short timeline of events over the weekend, a summary of what we did to mitigate this attack, and what we have done to prevent future attacks from having a similar effect. This data, together with what we posted last week (below) should serve as a total recap of the entire event.

We have had varying levels of success working with our peer networks. One has been excellent, one has been awful, and the other just "OK"... for this reason I have chosen to make them anonymous using the "Network *" as a substitute for their names. The one network which provided excellent support is one we have been working with for many years. The other two are both up for contract renewal before the end of the year and the one we are very unhappy with, "Network X" is unlikely to keep our business. Please keep that in mind as you read the following.

The attack which started in the early morning hours Friday was best mitigated by identifying the attack source and destination, and configuring the routers in between to ignore that traffic. This requires coordination with the networks we connect to directly, and ideally the networks that connect to the attack sources directly. We started by making those configuration changes on our routers, then contacting our network peers and requesting they make a similar configuration change. The NOC staff at "Network Y" were VERY responsive and immediately made the changes we requested. This is what allowed us to come back online at 7:56 AM PDT Friday morning. Half of the attack traffic was coming in via the connection with "Network X" and we could not get a positive response from their NOC. We had opened a trouble ticket with them, but nothing was done on that ticket through all of Friday. The circuit stayed down until 7:15 PM PDT on Saturday. Unfortunately when that circuit came back up, the attack resumed. We once again shut it down on our end. "Network X" stayed offline over the entire weekend.

Our Network Manager, Kyle Murray, performed some forensic analysis on the data we collected during the attack. The attack seemed to be coming from a single IP address allocated to a company in New York which appears to have been out of business since 2004. The source address was likely a forgery, as the amount of traffic we saw coming inbound was impossible to generate with a single computer. It was also coming into our network over several network peers. This is what led us to believe that in reality it was a distributed attack. The source network of that IP was being announced to the Internet routing table by a network we'll call "Network C". We contacted their NOC to let them know about the attack we were seeing which theoretically came from their network. They agreed to null route that address as well. Kyle was hoping to hear back from them today with more information, but they are based on the east coast and their Security Staff have already left for the day.

Today at 1:15 PM PDT we finally brought up our BGP connection with "Network X" and the attack has been completely blocked.

We have fingerprinted the attack profile and created an alarm that pages us if any traffic matches the behavior of this attack traffic. We are building some automated systems to detect and null route such traffic.
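To illustrate the general idea of such an alarm, here is a minimal sketch. It is not the system we run: the flow fields, the fingerprint (protocol, destination port, rate threshold), and the names `Flow`, `Fingerprint`, and `should_page` are all hypothetical, chosen only to show how matching traffic in a time window could trigger a page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One simplified flow record (hypothetical fields, not a real NetFlow schema)."""
    src: str
    dst: str
    proto: str
    dst_port: int


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical attack profile: protocol, port, and a rate threshold."""
    proto: str
    dst_port: int
    min_flows_per_sec: float


def matches(flow: Flow, fp: Fingerprint) -> bool:
    """True if a single flow looks like the fingerprinted traffic."""
    return flow.proto == fp.proto and flow.dst_port == fp.dst_port


def should_page(flows: list[Flow], window_seconds: float, fp: Fingerprint) -> bool:
    """Page when the rate of matching flows in the window reaches the threshold."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    hits = sum(1 for f in flows if matches(f, fp))
    return hits / window_seconds >= fp.min_flows_per_sec
```

A real detector would also need to handle spoofed sources and changing attack patterns; the point here is only that the alarm keys on traffic behavior rather than on a single address.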


digital.forest was hit with a massive distributed denial of service (DDoS) attack this morning. We are working with our network peers to mitigate this as much as possible. Please be patient while we dedicate all available resources to resist this attack.

Update 8:25am As of 7:30 am we have good connectivity with one of our network peers. Another is partially up, with attack traffic coming in, but at a lower volume than before. One network connection is still down. We are working with the NOC staff of all our external networks to resolve this issue as best we can.

Update 9:51am PDT The DDoS against our network this morning has been stopped and as such our network has returned to normal operating status. All sites and servers should again be functioning normally and accessible. If you are still experiencing any difficulties getting onto your website or server please give us a call at our technical support line 877-720-0483 option 3.

Final Analysis and Time Line:
Some specific details are still under investigation, however we have a very good understanding of what happened early this morning and are prepared to share in general terms the following information.

* Around 3:45 AM PDT a Denial of Service attack, directed at a single IP address inside our network began. At first it was not very large.

* By 4:10 AM PDT it had grown large enough to set off alarms in our network monitoring systems. Emergency pages went out to NOC staff and our Network Manager.

* At 4:27 AM PDT we lost the BGP session with one of our network peers, we'll call them "Network X".

* BGP was reestablished with Network X at 4:28 AM PDT.

* 4:35 AM PDT Network Manager was awake and gathering data from d.f NetFlow server. Recognized the traffic patterns as a Denial of Service Attack. Was in contact with NOC staff on site at digital.forest.

* 4:40 AM PDT Discovered the target of the attack via NetFlow reports. DoS traffic now at 30,900 flows per second.

* 4:44 AM PDT added route to black hole DoS target to Boundary Router 1

* 4:59 AM PDT added route to black hole DoS target to Boundary Router 2

In past experience, this step has stopped every other attempted denial of service attack on our network. By telling the world that the target does not exist, the attack usually stops. What followed instead was more of the same. The attack continued, and in fact intensified.

* 5:10 AM PDT BGP with Network X goes down.

* 5:11 AM PDT BGP with Network X reestablished.

* 5:16 AM PDT BGP with Network X goes down.

* 5:17 AM PDT BGP with Network X reestablished.

"Blackholing" the attack target has had no effect. Our attempts to get attack source data from our NetFlow server is fruitless, it is unable to keep up with processing as flows begin to exceed 50,000 flows per second/3,000,000 flows per minute. If we can get source data, we can start making attempts to block the source, or work with our peer networks to block the attack.

* 5:40 AM PDT Boundary Router 1 goes non-responsive.

* 5:43 AM PDT On site tech restarts BR1 under direction from Network Manager.

* 5:52 AM PDT BR1 up again.

* 5:52 AM PDT BGP session with another provider we'll call "Network Y" is lost. This network is terminated on a separate router, Boundary Router 2.

* 5:53 AM PDT BGP with Network Y restored.

While we maintained BGP connectivity with one of our three providers ("Network Z") throughout the event, the attack traffic at times consumed 100% of the CPU of one, or both routers, causing such high latency that we were, for all intents and purposes, not passing traffic. At this time Network Manager calls the Vice President of Technical Operations and informs him of the situation. Ops VP starts calling technical support staff to have them get to the office and assist with telephone calls. Also informs the CEO and VP of Sales.

* 6:08 AM PDT Boundary Router 2 goes non-responsive and is restarted by on site staff.

* 6:12 AM PDT Boundary Router 2 is back up. Attack traffic has effectively blinded both routers. NetFlow server records over 70,000 flows per second/4.2million flows per minute before it goes non-responsive as well.

* 6:20 - 7:30 AM PDT Network Manager and Ops VP contacting the NOCs of peer networks to have them assist in DoS Mitigation. Network Manager logs trouble tickets with Network Y and related Metro Ethernet provider, proceeds to datacenter from home. Ops VP logs tickets with Network X and Network Z... (after much phone tree navigation... way too much with "Network X")

* 7:37 AM PDT Network Manager on site at d.f and making very good progress with NOC staff of Network Y.

* 7:56 AM PDT BGP session back up with Networks Y & Z. We are back "on the air" again, though down by one provider. Attack continues, but is being mitigated actively by Network Y. Network Z is up, steady and not included in the attack. Network X is still down. Most of the tech & customer service staff is on site, taking calls from customers.

* 8:03 AM PDT BGP with Network Y lost.

* 8:10 AM PDT BGP with Network Y restored.

* 8:50 AM PDT Ops VP leaves home headed for digital.forest.

* 8:58 AM PDT BGP with Network Y lost.

* 9:03 AM PDT BGP with Network Y restored.

* 9:06 AM PDT BGP with Network Y lost.

* 9:07 AM PDT BGP with Network Y restored.

* 9:10 AM PDT BGP with Network Y lost.

* 9:11 AM PDT BGP with Network Y restored. Attack now completely blocked. Traffic on Network Y stabilizes and returns to normal. Network Z has remained up and stable since 7:56 AM PDT. Network X still down.

* ~10:00 AM PDT Server which was the target of the attack is brought back online.
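Two steps in the timeline above lend themselves to a small worked example: converting the flow rates we quoted from per-second to per-minute, and finding the attack target by counting flows per destination. The sketch below is illustrative only. The record layout (source, destination pairs) and the function names are assumptions, not the format of our NetFlow reports, and the addresses are documentation-range placeholders.

```python
from collections import Counter


def flows_per_minute(flows_per_second: int) -> int:
    """Convert a flows-per-second rate into flows per minute."""
    return flows_per_second * 60


def top_destination(flow_records):
    """Return (destination, flow_count) for the busiest destination, or None.

    flow_records is an iterable of (src, dst) pairs; this layout is a
    simplifying assumption, not a real NetFlow export format.
    """
    counts = Counter(dst for _src, dst in flow_records)
    if not counts:
        return None
    return counts.most_common(1)[0]


# The rates quoted in the timeline, converted to per-minute figures.
print(flows_per_minute(50_000))  # 3000000
print(flows_per_minute(70_000))  # 4200000
```

In a flood aimed at one host, the target tends to stand out as the destination with far more flows than any other, which is why a per-destination count is a reasonable first query.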

As of now, 11:55 PM PDT, Network X, the first of our network peers to be lost, is still down. We have been calling their NOC and have a trouble ticket logged. We strongly suspect that the port on their equipment we connect to in downtown Seattle has failed an auto-negotiation. Hopefully we'll have this resolved soon.

Our other circuits, Networks Y & Z are stable and handling all our traffic normally.

We would like to thank our clients for their patience and understanding during this event. We will continue to work on this issue with the intent of learning as much as we can. We have been subjected to denial of service attacks before, but in each of those cases we have been able to successfully mitigate them, usually before they had any noticeable impact on our network. This was the first attack on our network since December 23, 2001 that had more than a few minutes impact on our ability to stay online. We've spoken to a number of DoS Mitigation experts today, and will continue to do so. We've made some configuration changes and will continue to harden our infrastructure against attacks.

As always, if you have questions or concerns, feel free to contact us.

Regards,
--Chuck Goolsbee
V.P. Technical Operations
digital.forest, Inc.


posted by Chuck G. at 06:15 AM on Friday, August 3, 2007
Categories: Network

Today @ 10:40 am we experienced a processor switchover on both of our core switches.

Both of these switches have redundant processor engines where the standby processor checks the primary processor every 5 to 10 milliseconds. If the standby processor cannot communicate with the primary, it takes over and becomes the active processor.

Today this happened on our secondary core switch and then approximately 30 seconds later on our primary core switch. It appears that this was a precautionary measure taken by the on board diagnostics on the switches as both devices are functioning normally at this time.

Since these switches also act as our border routers this had the effect of disconnecting the BGP sessions on both devices. There was less than a minute of downtime while the BGP sessions re-established themselves.
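For readers curious about the mechanism, the standby/primary check can be sketched in a few lines. This is a simplified model and not the switch vendor's implementation: the class name, the `poll` method, and the `missed_limit` parameter (how many failed checks trigger a takeover) are assumptions made for illustration.

```python
class StandbyProcessor:
    """Toy model of a standby engine that polls the primary and takes over.

    missed_limit is a hypothetical knob; the real switches decide this
    internally and we do not know the exact rule they apply.
    """

    def __init__(self, missed_limit: int = 1) -> None:
        if missed_limit < 1:
            raise ValueError("missed_limit must be at least 1")
        self.missed_limit = missed_limit
        self.missed = 0
        self.active = False

    def poll(self, primary_responded: bool) -> bool:
        """Record one check of the primary; return True once standby is active."""
        if self.active:
            return True  # a switchover is not undone by later polls
        if primary_responded:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.missed_limit:
                self.active = True
        return self.active
```

With checks running every few milliseconds, even a very short communication lapse between the two engines can be enough to trigger a switchover, which fits the precautionary behavior we saw.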

Kyle Murray
Network Manager
digital.forest

posted by Kyle at 02:04 PM on Thursday, July 19, 2007
Categories: Network

UPDATE: The problem port has been disabled

The GigE card replacement that occurred on July 9th has not resolved the problems with the switch. We are still experiencing intermittent outages. To prevent further downtime we will disable the port that is causing problems. There will be a very brief outage as the traffic reverts to the backup uplink.

We will continue to work with our vendor to find a resolution to this problem.

posted by Kyle at 12:13 PM on Wednesday, July 11, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be inserting a GigE card in one of our distribution switches. This is being done after the replacement of a fiber that was causing some intermittent connectivity problems.

There will be a very brief outage when the GigE card is re-enabled as it is connected to the primary core switch and traffic will be rerouted through this card.

The maintenance will occur between 11:00 pm and midnight tonight.

posted by Kyle at 07:43 PM on Monday, July 9, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be making some changes to our BGP configuration in our continued efforts to better balance traffic and reduce latency. During the window we will be resetting each of our BGP peers. As each peer is reset the other peers will take the traffic so downtime should be no more than a few seconds.

The maintenance will occur between 11:00 pm this evening and 1:00 am tomorrow morning.

posted by Kyle at 03:32 PM on Monday, June 18, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be making some changes to our BGP configuration in our continued efforts to better balance traffic and reduce latency. During the window we will be resetting each of our BGP peers. As each peer is reset the other peers will take the traffic so downtime should be no more than a few seconds.

The maintenance will occur between 11:00 pm this evening and 1:00 am tomorrow morning.

posted by Kyle at 02:57 PM on Friday, June 15, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be making some changes to our BGP configuration in order to better balance traffic and reduce latency. During the window we will be resetting one of our BGP peers. When this peer is reset the other peers will take the traffic so downtime should be no more than a few seconds.

The maintenance will occur between 11:00 pm this evening and 1:00 am tomorrow morning.

posted by Kyle at 09:58 PM on Thursday, June 14, 2007
Categories: Network

One of the distribution switches in our facility, specifically in the rack-colocation area in rows 11 & 12 in DC1, has been showing errors and causing some network issues for servers in that area today. Our switch vendor believes that this is being caused by a bad gigabit port which uplinks that switch to our network core. The other possibility is a bad switch engine. Thankfully the former of these has some built-in redundancy. The latter is an easy card swap, and we have spares. So in a few moments we will be manually failing the gigabit uplink to the redundant port. It is unlikely that this will be any more noticeable than the intermittent issues that servers have seen on this network segment, namely some dropped packets and retransmissions. If that does not improve things, we'll replace the switch engine blade later tonight during a maintenance window.

Update 5:00 PM: The card reset seems to have solved the issue. We'll keep an eye on things over the next few days to be sure. We'll also order a replacement card for the switch. Thanks for your patience.

Above: Network Manager Kyle Murray pulls the problem gigabit card from the switch.

posted by Chuck G. at 08:28 AM on Monday, June 11, 2007
Categories: Emergency Maintenance, Network

Tonight during our scheduled maintenance window we will be making some changes to our BGP configuration in order to better balance traffic and reduce latency. During the window we will be resetting each of our BGP peers. As each peer is reset the other peers will take the traffic so downtime should be no more than a few seconds.

The maintenance will occur between 11:00 pm this evening and 1:00 am tomorrow morning.

posted by Kyle at 04:19 AM on Thursday, June 7, 2007
Categories: Network

At 9:50 AM this morning one of our Metropolitan Ethernet providers, OnFiber, had an equipment failure here in Seattle. We connect to one of our network peers, NTT/America, at The Westin Building via this circuit. This caused us to have intermittent connectivity over that particular circuit to NTT/America. Some digital.forest clients may have had "slow" or "intermittent" issues reaching servers here for a short period of time while we diagnosed the issue with the NOCs of NTT & OnFiber.

We have shut down our BGP connection to NTT/America while OnFiber fixes the problems on their network. At the moment we are running on two of our three network connections. We will update this post when we bring the third circuit back online.

Update: As of 11:02 AM PDT this issue is completely resolved. The OnFiber circuit was manually moved to a different port. After a successful 10-minute testing of the new circuit we turned up our BGP session with NTT/America.

We maintained connectivity to our other BGP network peers through this event, so at no time was our network "down". We do like to keep our clients informed of events here at our datacenter, even if they have no direct impact on your servers. In this case, it was a classic example of Internet Architecture and how it handles outages. The often-used phrase is that it "routes around damage." In this instance when one of our circuits had an issue our traffic just shifted to our other circuits. It is likely that none of our clients even noticed. If they did notice it would have been an intermittent connectivity for a brief period of time. Such is the nature and reason for designing redundant systems. Our fiber optic connectivity to the rest of the Internet flows over multiple physical paths. Those paths do not converge until they are physically inside our datacenter facility. This prevents complete outages through equipment failure or accidental fiber cut. Today's event confirms the built-in redundancies work as designed.
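The "routes around damage" behavior can be pictured with a very small sketch. This is not how BGP actually selects paths (real route selection weighs many attributes per prefix); it only shows the principle that traffic falls through to the next usable circuit. The peer names other than NTT/America, and both function names, are hypothetical.

```python
def usable_circuits(status: dict[str, bool]) -> list[str]:
    """Return the names of circuits currently up, sorted for stable output."""
    return sorted(name for name, up in status.items() if up)


def choose_circuit(status: dict[str, bool], preferred: list[str]):
    """Pick the first circuit in preference order that is up, or None if all are down."""
    for name in preferred:
        if status.get(name, False):
            return name
    return None


# Hypothetical state during the event: the NTT/America circuit is shut down,
# and the two other peers (placeholder names) are still up.
status = {"NTT/America": False, "peer-b": True, "peer-c": True}
preference = ["NTT/America", "peer-b", "peer-c"]
print(choose_circuit(status, preference))  # peer-b
```

Only when every circuit is down does the function return nothing, which is the situation that physically diverse paths are meant to make very unlikely.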

--Chuck Goolsbee
VP Technical Operations
digital.forest, Inc.

posted by Chuck G. at 11:00 AM on Tuesday, May 8, 2007
Categories: Emergency Maintenance, Miscellaneous, Network

Earlier tonight a colocated server on our network was subjected to a Denial of Service (DoS) attack. It began around 7:20 pm. When the attacker was denied the specific target, they broadened the attack to an entire network segment. Clients with servers on a single particular subnet here may have had trouble reaching their servers between 8:30 and 8:44 pm PDT. No other subnets were affected.

We've taken steps to minimize the chances of it happening again, and will post updates if required.

posted by Chuck G. at 11:57 PM on Wednesday, March 28, 2007
Categories: Network

It is an ironic fact that Network Geeks like us love to have access to more fiber optic cable, but have an almost irrational fear of the equipment that performs the task, namely the backhoe. Yet another fiber provider is landing in our building, but the process to get there involves trenching into one of the fiber vaults. Needless to say, we've been keeping a sharp eye on the crew out in front of the building. (Thankfully our facility enjoys multiple fiber paths so even if there were a backhoe disaster out front, only 33% of our current Internet bandwidth would be at risk.) We've been impressed with the precision and care of this particular crew so far. Their work should wrap up later today.

What does this mean for you, the current or potential digital.forest client? Choices of course! More choices. You can choose to connect to our well-peered BGP4 network. You can choose to connect directly to your preferred carrier, right here in our facility or building. You can choose to use our fiber network to connect to any major Seattle Exchange Point, such as the Westin Building. You can choose to connect your office directly to your servers at digital.forest via a Metropolitan Ethernet connection. You can mix and match any of the above!

We're happy to assist you in the process. Talk to your digital.forest Account Manager for more information. Not yet a digital.forest customer? Contact our Sales staff at 877-720-0483, option 2.

posted by Chuck G. at 12:28 PM on Monday, February 12, 2007
Categories: Intergate.West Move, Network

Tonight during our scheduled maintenance window we will be resetting the BGP sessions with 2 of our service providers. The sessions will be reset one at a time causing a brief outage of the connection. Our other providers will carry all traffic during this time.

The maintenance will take place between 11:30 pm and 12:30 am PST.

posted by Kyle at 11:10 PM on Wednesday, January 17, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be changing a failing fan tray in one of our distribution switches. This, unfortunately, is the only non-hotswappable part in the switch. Because of this the downtime for this switch will be 5-10 minutes. This will only affect servers in row 6 of the datacenter.

The work will begin @ 11:00 pm PST.

posted by Kyle at 10:46 PM on Thursday, January 4, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be resetting the BGP sessions with 2 of our service providers. The sessions will be reset one at a time causing a brief outage of the connection. Our other providers will carry all traffic during this time.

The maintenance will take place between 12:00 am and 1:00 am PST.

posted by Kyle at 10:42 PM on Thursday, January 4, 2007
Categories: Network

Tonight during our scheduled maintenance window we will be resetting the BGP sessions with 2 of our service providers. The sessions will be reset one at a time causing a brief outage of the connection. Our other providers will carry all traffic during this time.

The maintenance will take place between 1:00 am and 2:00 am PST.

posted by Kyle at 09:44 PM on Wednesday, December 27, 2006
Categories: Network

As part of normal procedure in dealing with a potential security breach, we've suspended/blocked many administrative access protocols into our network while we investigate. We realize this is a major inconvenience for our clients with regards to routine management and development on their servers colocated here in our datacenter. However to mitigate the risks it is our responsibility to take this step. Better to be safe than sorry. Please be patient while we investigate this issue.

Last time we invoked these blocks, we noted the clients who required exceptions and have maintained connectivity for those IP ranges. If your business requires these protocols to function, please contact us to discuss the possibility of enabling an exception. Otherwise, please be patient while we conclude our work.

We will post an update when we re-enable administrative access protocols.

UPDATE 2:10 PM PST: The blocks have been removed. We apologize for any inconvenience this may have caused.


posted by Chuck G. at 10:09 AM on Friday, December 8, 2006
Categories: Network

In observance of the Thanksgiving Holiday, digital.forest will be closed Thursday November 23rd and Friday November 24th. We will resume regular business hours at 8am PST Monday November 27th.

Technical Support staff will remain on-site 24 hours a day throughout the holiday period. Please be aware that we'll have limited staff coverage for telephone tech support during the next few days. Please note that our building will be locked throughout the weekend and clients requiring access to the datacenter to work on colocated servers will have to call or email first to be allowed access to the building. Additionally we will likely take advantage of the holiday lulls to perform maintenance and upgrades on core equipment. We will post notice of these beforehand. Finally, we are closing the phone support queue for a while on Wednesday, November 22nd in the afternoon for a Tech Department meeting. We appreciate your patience.

All of us at digital.forest wish you a happy Thanksgiving and a wonderful holiday season!

Chuck Goolsbee
VP, Tech Ops
digital.forest

posted by Chuck G. at 01:52 PM on Wednesday, November 22, 2006
Categories: Colocated & Dedicated Servers, Miscellaneous, Network, Phone System, Scheduled Maintenance

The Internet is buzzing with news concerning a potential malware threat from a Microsoft Windows vulnerability which was patched this past Tuesday. I'd like to take this opportunity to remind our valued clients of our policies and procedures in instances such as this.

* We do our best to protect the hosts inside our network from such threats, by both patching and port-blocking on our boundary network and firewall devices.

* We ask that you also stay current on your patches, not only on your servers here, but also on any internal hosts used to access them. This is crucial because many of our clients use VPN technology to communicate with servers in our datacenter. Our port blocking and firewalling efforts have NO EFFECT on the contents and payload of VPN-tunnelled/encrypted traffic. This means that even if we have successfully stopped the malware from entering our network from "the wild" you or your users can still "infect" your own servers via a VPN connection.

* If an outbreak of some malware does occur, our first priority will be to secure our network from further spread. If your servers are infected, and being used to spread further malware or similarly abusive traffic, we will have no choice but to disconnect them from the network. We reserve the right to block any malicious traffic, or remove any system from our network being used to generate malicious traffic.

* We are available to assist clients in patching or repairing systems, but be aware that our priorities in the midst of an event will be protecting those clients and systems that are NOT affected first. In other words we may not be available to assist immediately as our resources will be focussed on prevention of the spread before curing of the ill.

It is therefore in your best interest to patch your systems now.

For more information on this issue, please see:
http://www.microsoft.com/technet/security/bulletin/ms06-040.mspx

http://www.dhs.gov/dhspublic/display?content=5789

http://www.eweek.com/article2/0,1895,2002142,00.asp

Excellent sources of up-to-date information should an event occur are:
SANS' Internet Storm Center
CERT
US-CERT


Regards,
--Chuck Goolsbee
V.P., Technical Operations
digital.forest

posted by Chuck G. at 02:16 PM on Friday, August 11, 2006
Categories: Colocated & Dedicated Servers, Network, Worms and Virii

Tonight during our scheduled maintenance window we will be performing config changes on several of our distribution switches. These changes require a reboot of the switch to complete. Once completed we will be able to add more subnets to these switches.

The work will begin @ 10:00 pm and will involve about 2 minutes of downtime per switch. The work will affect servers in rows 1,4,5,6 & 8.

posted by Kyle at 02:16 PM on Tuesday, August 1, 2006
Categories: Network

Tonight during our scheduled maintenance window we will be changing a failing fan tray in one of our distribution switches. This, unfortunately, is the only non-hotswappable part in the switch. Because of this the downtime for this switch will be 5-10 minutes. This will only affect servers in row 1 of the datacenter.

The work will begin @ 11:00 pm PST.

posted by Kyle at 10:30 AM on Thursday, July 6, 2006
Categories: Network

Update: We will also be changing the fan tray in another of the distribution switches. This, unfortunately, is the only non-hotswappable part in the switch. Because of this the downtime for this switch will be 5-10 minutes. Servers affected will be: oberon, europa, nafa, newbiz, pluto and ara.

Tonight during our scheduled maintenance window we will be replacing a module in one of our distribution switches. This will cause an outage of about 2 minutes for the servers connected to this module (8 ports).

The work will begin at 10:00 pm PST

posted by Kyle at 11:34 AM on Tuesday, June 20, 2006
Categories: Network

Tonight during our scheduled maintenance window we will be completing the implementation of our redundant network. During the course of this work there will be brief interruptions of service lasting no more than a few seconds.

Once completed this will give us a fully redundant (layer 2 and 3), meshed network both externally and internally.

The work will begin at midnight tonight and will be completed by 3:00 am.

posted by Kyle at 04:15 PM on Thursday, June 1, 2006
Categories: Network

Wednesday night during our maintenance window we will be moving several servers to a new switch. This will allow them to be dual homed to our core switches, increasing redundancy.

Downtime for each server should be less than a minute. The maintenance will take place between midnight and 2:00 am.

posted by Kyle at 10:33 PM on Tuesday, March 21, 2006
Categories: Network

Tonight during our maintenance window we will be resetting our BGP sessions. This is required to complete a config change to resolve some asymmetrical routing issues.

There will be minimal impact from this as the other links will carry the load as each BGP session is reset. The work will take place between 11:00pm and midnight.

posted by Kyle at 02:56 PM on Wednesday, February 22, 2006
Categories: Network

Tonight during our maintenance window we will be resetting our BGP sessions. This is required to complete a config change to resolve some asymmetrical routing issues.

There will be minimal impact from this as the other links will carry the load as each BGP session is reset. The work will take place between 11:00pm and midnight.

posted by Kyle at 10:24 AM on Monday, February 13, 2006
Categories: Network

Tonight during our scheduled maintenance window we will be adding additional RAM to one of our primary switches. This work will incur 2 small outages of approximately 1-3 minutes each.

This additional RAM will give us more processing power to be able to have a real time view of the type and source/destination of all traffic going through our border routers. This will enable us to respond even more rapidly to any threats to our infrastructure.

The work will begin at 11:00 pm with the outages happening between 11:00pm and 1:00 am.

posted by Kyle at 11:02 AM on Wednesday, January 25, 2006
Categories: Network

Tomorrow evening (Tuesday, Jan 10th) we will be upgrading the OS on one of our border routers. To do this requires upgrading hardware so that there will be room for the new OS.

What this means is there will be 2 outages as a result of the OS upgrade: one for the hardware upgrade and one for the OS upgrade. Both outages should last no more than a couple minutes. This is the time that it takes for the primary router module to fail over to the secondary. During the hardware upgrade there will be a forced failover after the secondary is upgraded so that the primary can be upgraded. During the OS upgrade there will be an automatic failover to the secondary when the primary is booted to the new OS (the secondary will have already booted to standby mode with the new OS).

This upgrade is one of the final steps in preparation for a fully redundant border mesh so that in the future we will be able to work on either border router without any significant outage.

The work will begin at 11:00 pm with the outages happening between 11:00pm and 1:00 am.

posted by Kyle at 02:54 PM on Monday, January 9, 2006
Categories: Network

During our scheduled maintenance window on Wednesday, Dec. 7th we will be migrating customers behind our shared firewall to a larger capacity switch. This move will also allow the addition of a redundant connection to our core switches.

The work will cause downtime of about 5 minutes and will occur between 11:00pm and midnight.

posted by Kyle at 05:42 PM on Tuesday, December 6, 2005
Categories: Network

Alder3 on alder.forest.net and Web Help Desk on helpdesk.forest.net are now back online.

Thank you for your continued patience,

digital.forest technical support

posted by digital.forest at 07:07 PM on Tuesday, November 15, 2005
Categories: Network, alder.forest.net

We experienced a failure of a supervisor card in one of our boundary routers at 16:52 PDT this evening. This caused a short (less than 1 second) outage on one of our upstream connections, and left us with some routing instability for the following several minutes as built-in redundancies took over and routes reconverged. The failed card was replaced and all systems were back to normal again by 17:15 PDT.

posted by Chuck G. at 05:40 PM on Tuesday, November 15, 2005
Categories: Network

Currently a key internal database that hosts both Alder & our Web Help Desk is experiencing major technical difficulties. We are working very hard to restore the connectivity and expect this to be fixed very soon.

digital.forest technical support

posted by digital.forest at 03:42 PM on Tuesday, November 15, 2005
Categories: Network, alder.forest.net

During our scheduled maintenance window tomorrow night (11-09-05) we will be moving Colo connections to a new switch. The move will only affect customers with half or full rack installations (in the Colo area) and will be about 30 seconds per connection.

This is another step towards increasing the redundancy of our network and will allow us to have dual redundant connections from the Colo area to the core.

The maintenance will take place between 11:00 pm 11-09-05 and 3:00 am 11-10-05.

posted by Kyle at 04:28 PM on Tuesday, November 8, 2005
Categories: Network

A network configuration change made during last night's scheduled maintenance has caused a minor issue with clients behind our shared firewall service. A reboot of the firewall will clear this issue and allow proper functioning again. We will reboot the firewall at noon PDT today. Downtime should only be a few seconds while the firewall restarts.

This should have no effect on the rest of our clients or network.

Update: 11:15 AM PDT The firewall reboot has been cancelled as it is no longer required. We have addressed the issue with other means.


For the terminally curious here are the details:

Last night just after midnight, as part of our plan for dramatically increasing our levels of network redundancy, we migrated one of our upstream fiber connections to our second boundary router. We also finished enabling Spanning Tree Protocol on all of our Ethernet switches to recognize redundant trunks we will be deploying in the coming weeks.

In this case it was the gigabit Ethernet connection from XO Communications (AS 2828), that we moved from our original Cisco 6509 router, over to our secondary Cisco 6509 router.

When we did this, all network operations appeared to acknowledge the change via iBGP and OSPF protocols as expected.

Unfortunately our managed firewall device did not. We began to get calls from some clients concerning reachability of certain servers around 6:30 this morning. By 7:30 we had isolated the problem to the change made last night, and actually shut down the gigabit connection to XO to guarantee connectivity to shared firewall clients while we worked out how to address this problem with minimal downtime for the affected clients.


We planned a config change and reboot of the firewall for noon today, but in the meantime we were able to forestall that action by redistributing static routes between the firewall and the two different routers via OSPF.

That action was completed at 11 AM PDT today, and should prevent future routing issues like the one we experienced last night.


Please be aware of the following:

This did not mean that servers were "down." The firewall remained up, and all servers behind it were reachable via normal network channels. The issue was that if OUTBOUND traffic from those servers was destined for the XO connection, then the firewall had the incorrect routing information and was unable to send it. XO carries approximately 20-30% of our outbound traffic.

posted by Chuck G. at 10:24 AM on Thursday, October 27, 2005
Categories: Emergency Maintenance, Managed Firewall Services, Network

During our scheduled maintenance window tonight we are performing some network configuration changes on various Ethernet switches in the datacenter. There should be virtually no effect on network traffic beyond a pause of a few seconds as we update each switch.

Maintenance will take place between midnight and 1 AM PDT.

posted by Chuck G. at 09:27 PM on Wednesday, October 26, 2005
Categories: Network, Scheduled Maintenance

Tonight, during our scheduled maintenance window, we will be moving one of the Gigabit Ethernet connections from an upstream provider to our second boundary router. This should have minimal impact on operations as our remaining upstreams will remain active during the switch over. The move will happen just after midnight PDT and should take no longer than two minutes.

This is the first of many steps we are taking to complete a full-mesh network configuration upgrade. When complete our internal network will be as fully meshed as our external network connections have been since we implemented BGP routing in the mid-90s. This will provide yet another layer of redundancy to your connectivity here at digital.forest.

Update: We will also be replacing a failed fan in an Ethernet switch in row 4 of our datacenter. This may require a reboot of the switch, which would result in a few moments of lost connectivity for the servers in row 4.

posted by Chuck G. at 09:09 AM on Wednesday, October 12, 2005
Categories: Network

Starting this evening at 10:00 PM PDT and ending tomorrow morning at 6:00 AM PDT our upstream network provider that was having issues earlier in the week will be performing maintenance on their switches.

They will be rebooting 12 switches one at a time during the maintenance window. The effect of this will be minimal as our other providers will handle all traffic during the reboots.

posted by Kyle at 04:33 PM on Friday, September 30, 2005
Categories: Network

Between 12:00 Midnight Monday and 2:00 a.m. Tuesday we will be doing tests on the border routers to determine the cause of Sunday's network issues.

There is a small chance of impact to the network from this activity. Downtime, if any, will be limited to no more than a couple of minutes.

posted by Kyle at 07:21 PM on Monday, September 26, 2005
Categories: Network

We experienced two unrelated network issues that caused some connectivity problems for some customers.

At 4:45 AM PDT one of our upstream networks went down for scheduled maintenance. This is usually not an issue, but in this case the circuit was down for several hours instead of the few minutes that were expected.

At 11:08 AM PDT this morning our core switch experienced some sort of network storm and shut down several ports/subnets due to excessive port errors. It took us about 30 minutes to diagnose this situation and another 30 minutes to manually bring up each port/subnet. Not all subnets were affected. It did affect one of our four mail servers (treehouse) which was the last port brought up around 12:20 PM PDT.

Our other mail servers and several colocation subnets were not affected and remained operational throughout.

Thanks for your patience, and as we learn more we will post details here.

Chuck Goolsbee,
V.P., Technical Operations,
digital.forest, Inc.

posted by Chuck G. at 12:37 PM on Sunday, September 25, 2005
Categories: Network

Tonight @ 10:15 we will be rebooting the shared firewall in order to complete the upgrade of the firewall O/S. The downtime should only be a minute or so.

UPDATE: Maintenance on shared firewall is complete.

posted by Kyle at 10:10 PM on Monday, September 5, 2005
Categories: Network

Tonight at 10:00 pm we will be making a small change to the BGP configuration on our border router. The upstream links will be modified one at a time with the other links handling all traffic during the change.

There should be little or no impact due to this change.

posted by Kyle at 07:21 PM on Wednesday, June 8, 2005
Categories: Network

We had a network event this evening between about 4 and 6 PM PDT. It appears a client's server here was the subject of, or the participant in, a denial of service attack. The result of this was the saturation of one of our upstream network connections. Latency remained low on that link, but packet loss varied between 5% and 20% while the event was in progress. Our other connections were unaffected.

The server has been removed from the network and all is back to a nice quiet normal Friday night.

posted by Chuck G. at 07:36 PM on Friday, May 20, 2005
Categories: Network

During our regularly scheduled maintenance window tonight we will be moving one of our connections from the current temporary fiber to a new permanent one. The work will occur between 12:00 midnight and 1:00 am.

Actual downtime will be about 5 to 10 minutes and during that time our other connections will carry all of our traffic so there will be little if any impact.

Update: 12:23 AM PDT The fiber move is complete.

posted by Kyle at 09:33 PM on Tuesday, April 19, 2005
Categories: Network

We must replace a cable that runs from one of the colocation areas of our datacenter to the network core. The cut-over will be very brief, likely less than 15 seconds, and should happen in the next 10 minutes. This is being done to correct some errors that we are seeing under high load, as a client here is seeing a particularly high-traffic day. We are replacing a copper cable with a fiber-optic one.

Update: 10:11 AM PDT The work is complete.

posted by Chuck G. at 10:03 AM on Wednesday, April 13, 2005
Categories: Network

During our regularly scheduled maintenance window tonight we will be resetting the BGP sessions with our network peers on our Seattle router. This will take about 5 seconds for each peer and will happen at some point between 12:01 AM and 12:15 AM PST Friday morning. There should be minimal impact on network traffic. As each peer is reset our other connections will carry the traffic while this work is performed.

This is being done to complete the balancing of our outbound traffic.

posted by Kyle at 04:46 PM on Thursday, April 7, 2005
Categories: Network

During our regularly scheduled maintenance window tonight we will be resetting the BGP session with one of our network peers on our Seattle router. This will take about 5 seconds and will happen at some point between 12:01 AM and 12:15 AM PST Tuesday morning. There should be minimal impact on network traffic, as our other connections will carry the traffic while this work is performed.

This is being done to make a configuration change and bring better balance to our outbound traffic.

posted by Chuck G. at 06:03 PM on Monday, March 28, 2005
Categories: Network

We experienced a BGP-related routing issue tonight that reduced our visibility to the outside world for about 5-15 minutes, depending upon where on the Internet you accessed us. It started about 8:15 PM PST, and was corrected just before 8:30 PM PST.

We are investigating and will post more details when they become available.

Update: The issue was the result of a small configuration error in our Seattle router. The slow DNS-like nature of BGP propagation is what caused it to not have immediate impact. Our FastEthernet circuit with NTT/Verio caused us some problems while we worked to migrate it to Seattle overnight, and our Network Management staff spent most of the day today dealing with wrapping up that process. This was completed at exactly 6:30 PM PST this evening. However, at some point during that process an error was introduced into our BGP configuration that caused this issue to arise later in the evening. Thankfully all of our senior network management were still here on site and could address the issue swiftly.

The good news is that, after almost three months of constant configuration changes to all of our routers, we have arrived at a point in our move where no significant network changes will be made for the foreseeable future. All servers, subnets, and connections to the Internet are now fully migrated to the Seattle facility and are here to stay. As I said in yesterday's move update, plenty of work remains to complete the new datacenter and decommission the Bothell facility, but the server migrations are complete.

posted by Chuck G. at 08:36 PM on Thursday, March 10, 2005
Categories: Intergate.West Move, Network

We have received a number of calls this morning from customers unable to reach their sites at digital.forest, our site, or their mail servers. We have now received confirmation from Qwest that they are experiencing routing issues, which they hope to have resolved "in a couple of hours". If you are a Qwest customer, it may be helpful for you to provide them with any troubleshooting data you have collected (traceroutes, for example); if not, you may still be affected if your traffic is routed through their network at some point. Either way, there is little to do but wait until they resolve the issue.

posted by Bill D. at 09:43 AM on Tuesday, March 8, 2005
Categories: Network

One of our clients behind our shared firewall was running an open mail relay. This was discovered by a spammer and has been exploited. Not only did they relay off that host, they are now attempting to relay through the entire firewalled subnet. We have had to block port 25 to that subnet in order to allow any "normal" traffic at all to and from the servers behind the firewall.

This block was in place most of the night.

We have lifted it (for all except the open relay server of course) as of 5:30 AM, though the SMTP traffic remains unusually high. We will monitor the situation and respond as required.

Please remember that a firewall is not a magical protection device. If you have vulnerable software on an open port, you can be compromised.

UPDATE 6:05 AM: We have been able to isolate the network (in Russia) performing the brute-force SMTP relay attack, and block it at our network boundary.

posted by Chuck G. at 05:47 AM on Wednesday, March 2, 2005
Categories: Colocated & Dedicated Servers, Mail, Network

The servers we moved tonight are those behind our shared firewall. We had some delays leaving Bothell due to some struggles with rackmount Dell "Versa Rails"... stripped screws specifically. We also are having to rebuild our Firewall config from backup, as the device did not seem to like its new location. Please be patient while we sort these issues out.

Thanks.

posted by Chuck G. at 03:12 AM on Monday, February 21, 2005
Categories: Colocated & Dedicated Servers, Intergate.West Move, Network

We took delivery of a very important piece of new equipment this week: a new Cisco Router/Switch. It is a 6509 series router.

Above: It arrived in a huge box, shrink wrapped to a wooden pallet. It took three of us to wrestle it out onto a cart so we could wheel it into the datacenter. It weighs close to 300 lbs (136 kg). Our new router.

Since our move to Bothell from Seattle in 1998, we have employed Cisco 7206 routers for our boundary. We have upgraded them along the way to 7206VXRs. Our core routing was originally done with a Cisco 3640 router, which was eventually replaced with Extreme Summit 48 (and later 48i) switches. We are taking the opportunity afforded by this move to upgrade again, this time consolidating the boundary routing and core switching functions into the well-proven Cisco Catalyst 6500-series hardware. The router pictured is the first of two we have acquired, which will be deployed as a redundant pair.

The move to a high-end carrier-class facility really requires similar carrier-class equipment. A large portion of the Internet's traffic is carried through Cisco 6500-series hardware. Very soon digital.forest's traffic will be too. These routers will also allow us to expand our internal network core. Take one last look before it becomes obscured by a lot of wire.

Above: Racked and ready to go.

This week (1/10-1/15) will likely be a slow week on the move blog here while we deal with the "people" side of the office move. If you have feedback or questions about the move, or anything covered in the blog to date, feel free to drop me an email. Thanks.

posted by Chuck G. at 01:11 PM on Saturday, January 8, 2005
Categories: Intergate.West Move, Network

At 5:01 PM PDT today, our Gigabit Ethernet provider, OnFiber, experienced a card failure in their switch that connects us to the Seattle gigapop at the Westin Building. We failed over to a secondary card, but since that time we have seen significant packet loss on that connection. OnFiber is currently working on it, and we are in constant communication with them and with the IP carrier on that circuit, NTT/Verio.

We may shut off that router port if it takes too long for the issue to be resolved. Traffic would just follow different routes. Details will be posted as information changes.

UPDATE: 7:30 PM The problem has been resolved.

posted by Chuck G. at 07:13 PM on Saturday, August 14, 2004
Categories: Network

Please note that Apple has released some security-related updates in the past weeks, including a new OS version (10.3.4) which includes these updates. Experience has shown that staying up-to-date with security-related patches can limit exposure if an exploit is released "into the wild." While we have yet to see a MacOS X worm or similar malware, we still prefer to update when the vendor releases a patch. OS updates are handled a little differently, as they usually involve changes outside the realm of security, and require some testing prior to deployment.

NOTE: If we have administrative access to a client-owned server we almost always install security patches when they are released. The nature of our network, directly connected to very high bandwidth "backbone" connections, means that we have a much greater risk and exposure to newly released malware, especially of the "worm" type, as they spread automatically.

These recent patches from Apple cover a set of vulnerabilities which are classified as "Trojan Horses", which means they require user intervention to activate. However, we felt it necessary to apply the patches, if only to set a precedent for our MacOS X-using clients with regard to how we handle security-related patches.

If you have a server colocated here running MacOS X, or MacOS X Server that you manage yourself, we strongly suggest that you run Software Update on a regular schedule. Install any security-related patches as soon as you are comfortable. Based on our experience with other platforms, it is better to be patched prior to the release of an actual exploit.

Our methodology for handling unpatched machines if there is a known exploit "in the wild" is to remove them from our network until patched. This is how we were able to survive high-profile issues such as CodeRed, SQLSlammer, etc., with minimal downtime and very low infection rates. Our Windows- and UNIX-using clients already know this, but given the recent widespread publicity about these MacOS X issues, we thought we should make the rest of our clients aware of this policy.

Thank you for your attention to this matter.

posted by Chuck G. at 12:04 PM on Thursday, May 27, 2004
Categories: Colocated & Dedicated Servers, Hosting Servers, Network, Worms and Virii

At approximately 2:20PM we experienced a routing issue that interfered with some of the traffic to our network. The issue was quickly resolved.

posted by digital.forest at 03:29 PM on Tuesday, May 18, 2004
Categories: Network

As of 02:45 PST we are once again seeing full connectivity to and from the Verio/NTT network. We will keep a close eye on things through the night and into the morning, but at the moment we believe this issue is completely resolved.

Thanks for your patience.

posted by Chuck G. at 03:00 AM on Friday, March 26, 2004
Categories: Network

The Good News: Further testing has pinpointed the exact issue. The problem does indeed lie with Verio (always the first trouble when dealing with Internet Service Providers: proving where the problem really is).

The Bad News: We have to wait for Verio to fix it.

The Current Status: The Verio connection will remain shut down until we hear from them that it is fixed. Traffic continues to flow via other circuits, so this will remain invisible to our clients.


The Technical Details: Yesterday, some automatic route-filtering scripts on Verio's network decided to change our network size from roughly 8000 IP addresses down to around 4000. From our initial investigation this is based on some very old (circa 2001) Routing Database info. After we get this current issue fixed, we will investigate what caused that to happen. The result of this action is that the "upper" half of our network, where the majority of our colocation subnets lie, was route-filtered by Verio. This led to intermittent connectivity to those subnets if your traffic passed through Verio's network. Thankfully the time frame for the connectivity issues was in the middle of the night, when traffic is usually pretty light. Our connection with Verio is a Fast Ethernet one, via our OnFiber Gigabit link. We have enough capacity on our other circuits to handle 100% of our traffic, so we can safely leave the Verio link shut down; however, we prefer to spread our traffic out as it is more efficient. As soon as Verio confirms that they are advertising our full network in their routing tables we will bring up our link with them again. This could happen at any time today, with the latest possible time being 7pm CST when their automatic route-filtering scripts start their nightly run. Of course we are urging them not to wait for that event. We will update this page with news as it happens.
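For the terminally curious, the effect of a filter that only knows about half of an address block can be sketched in a few lines of Python. This is an illustration only: the prefix 198.18.0.0/19 is a hypothetical stand-in (taken from a range reserved for benchmarking), not our actual address space, and the assumption that the stale filter matched exactly the lower half is ours for the sake of the example.

```python
import ipaddress

# Hypothetical block of "roughly 8000" addresses: a /19 holds 8192.
full_block = ipaddress.ip_network("198.18.0.0/19")

# Splitting one bit deeper gives two /20 halves of 4096 addresses each.
lower_half, upper_half = full_block.subnets(prefixlen_diff=1)

# Assumption for illustration: the stale filter only accepts the lower half.
accepted_by_filter = lower_half


def passes_filter(subnet: str) -> bool:
    """Return True if the subnet falls inside the range the filter accepts."""
    return ipaddress.ip_network(subnet).subnet_of(accepted_by_filter)


if __name__ == "__main__":
    print(full_block.num_addresses, lower_half.num_addresses)
    for subnet in ("198.18.4.0/24", "198.18.20.0/24"):
        state = "reachable" if passes_filter(subnet) else "filtered"
        print(subnet, state)
```

A subnet in the lower half still passes, while one in the upper half is dropped by the filter even though nothing changed on the originating network.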

Thanks for your patience.

posted by Chuck G. at 08:34 AM on Thursday, March 25, 2004
Categories: Colocated & Dedicated Servers, Network

At 05:40 PST our routing issue with Verio reappeared. We are working with them to resolve. Traffic is currently routed to other circuits.

posted by Chuck G. at 06:44 AM on Thursday, March 25, 2004
Categories: Network

Our routing issue with Verio has been resolved. We are receiving and sending traffic via their connection once again, as of 00:42 PST. There will likely be some intermittent accessibility issues on incoming traffic, as the route is likely dampened due to our link going up and down while we addressed this issue. It should take no more than an hour for those to clear up.
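As a rough illustration of why dampening clears on its own, route flap dampening is usually described as a penalty that is added on each flap and then decays exponentially until it falls below a reuse threshold. The sketch below uses assumed parameter values (1000 penalty points per flap, a reuse threshold of 750, a 15-minute half-life); the values actually configured on any remote network are unknown to us and may differ.

```python
import math

# Assumed values for illustration only; each network sets its own.
PENALTY_PER_FLAP = 1000
REUSE_THRESHOLD = 750
HALF_LIFE_MINUTES = 15


def minutes_until_reuse(penalty: float,
                        reuse: float = REUSE_THRESHOLD,
                        half_life: float = HALF_LIFE_MINUTES) -> float:
    """Minutes for an exponentially decaying penalty to fall to the reuse threshold."""
    if penalty <= reuse:
        return 0.0
    # penalty * 0.5 ** (t / half_life) == reuse, solved for t
    return half_life * math.log2(penalty / reuse)


if __name__ == "__main__":
    for flaps in (1, 2, 3):
        wait = minutes_until_reuse(flaps * PENALTY_PER_FLAP)
        print(f"{flaps} flap(s): about {wait:.0f} minutes until the route is reused")
```

Under these assumed numbers, three flaps in quick succession would keep the route suppressed for about 30 minutes, which fits within the hour mentioned above.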

Thanks for your patience while we worked this issue out.


Please Note: We were never "down" at any time tonight, as our other upstreams took over the load, but some clients likely saw connectivity issues here and there between ~9pm and ~midnight.

posted by Chuck G. at 01:15 AM on Thursday, March 25, 2004
Categories: Network

To prevent any further packet loss or routing issues, we have shut down the connection with Verio as of 22:30 PST until they have the situation resolved. We will update this site as soon as we have the problem solved.

posted by Chuck G. at 10:47 PM on Wednesday, March 24, 2004
Categories: Network

One of our upstream network providers is having some sort of issue. At exactly 20:24 PST we lost 100% of our *incoming* and 80% of our *outgoing* traffic to Verio. We are in constant contact with their Network Operations Center with an open ticket, and will work to get it resolved ASAP.

Our other upstream connections are unaffected.

posted by Chuck G. at 09:11 PM on Wednesday, March 24, 2004
Categories: Network

Due to a large-scale bandwidth event expected tonight, we are making a minor change to the BGP configuration of our boundary router. The change is being made to tune outbound traffic to slightly favor a particular link.

There should be no downtime associated with this change.

The expected maintenance window is 17:00-17:15 PST.

posted by Chuck G. at 01:28 PM on Wednesday, January 28, 2004
Categories: Network

We are taking advantage of the quiet time over the holiday to perform some network maintenance. Mostly this is in the form of cable cleanup and tie-down. It will have no effect on network performance beyond an occasional reboot of a switch, which should take no longer than 5 seconds.

Our offices and technical support are closed for the day - as always, emergency support is available via normal channels.

Happy New Year.

posted by digital.forest at 10:12 AM on Thursday, January 1, 2004
Categories: Network, Scheduled Maintenance

digital.forest experienced a brief network outage today at about 11:15AM. The cause was a denial-of-service attack against one of our colocation customers. We identified the servers targeted and can, and will, remove them from the network immediately should the attack resume.

posted by Bill D. at 11:40 AM on Tuesday, December 30, 2003
Categories: Network

Our AT&T DS3 circuit is down as of 11:33 AM. A trouble ticket is open with AT&T and Verizon (the provider of the SONET ring) to address the issue. Currently traffic is flowing fine on our other upstream connections.

Update: 12:25 PM: A Verizon tech has been dispatched to the Everett Central Office. It appears they inadvertently put the circuit in loopback state. ETA is 30 minutes to test/fix.

Final Update: As of 3:03 PM the AT&T circuit was back online.

posted by Chuck G. at 11:53 AM on Tuesday, November 4, 2003
Categories: Network

We have experienced a significant increase in ICMP traffic, from all corners of the 'Net recently. Today it caused a slowdown and some packet loss on our core switch due to the rapid growth of the forwarding table. We attempted to mitigate this with different switch configurations, but were unable to achieve a proper balance. At 2:30 PM we elected to just block ICMP echo traffic at our boundary router. We realize that this may "break" some monitoring systems, and make some troubleshooting more difficult - but the overall performance of our network is far more critical. We apologize for any inconvenience you may experience.

Should the "worms du jour" that generate this kind of traffic subside, we will enable ICMP echo packets again.
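To show why this block affects ping-based monitoring but not ordinary traffic, here is a minimal sketch of the kind of match a boundary filter makes. It is not our router configuration: the `Packet` type and `permitted` function are hypothetical names, and we assume for illustration that only ICMP echo requests (ICMP type 8) are dropped.

```python
from typing import NamedTuple, Optional

ICMP_PROTOCOL = 1       # IP protocol number for ICMP
ICMP_ECHO_REQUEST = 8   # ICMP type used by "ping"


class Packet(NamedTuple):
    """Hypothetical, heavily simplified view of a packet header."""
    protocol: int                    # 1 = ICMP, 6 = TCP, 17 = UDP
    icmp_type: Optional[int] = None  # only meaningful when protocol is ICMP


def permitted(packet: Packet) -> bool:
    """Return False for ICMP echo requests and True for everything else."""
    is_echo = (packet.protocol == ICMP_PROTOCOL
               and packet.icmp_type == ICMP_ECHO_REQUEST)
    return not is_echo


if __name__ == "__main__":
    print(permitted(Packet(protocol=1, icmp_type=8)))  # ping request
    print(permitted(Packet(protocol=6)))               # TCP, e.g. web or mail
    print(permitted(Packet(protocol=1, icmp_type=3)))  # ICMP destination unreachable
```

Monitoring that relies on ping will see a host as unreachable under such a rule, while TCP-based checks (for example, connecting to the web or mail port) are unaffected.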