UPDATE 02/01/08: As of 11:07 PM PST we are fully up and running on our newest BGP Peer.
Last week I announced that we were adding a new BGP peer. It was originally scheduled for last weekend, but ended up not happening on schedule.
After clearing a few technical and procedural hurdles this week, we're finally ready for this to actually happen, and it is now scheduled for Friday night. If you are extremely curious as to the nature of the delay, feel free to read on.
---
Please accept our apologies for the delay, which unfortunately was completely beyond our control. To explain what happened I need to provide a bit of background on how the Internet works. Please remember that I am vastly simplifying a very complex system in order to condense this into a small blog post... books as heavy as boat anchors have been written on this subject, but I really can't go into the minutiae here without everyone's eyes glazing over... so here is the CliffsNotes version:
* The Internet is a collection of autonomous networks, all interconnected.
* Networks are collections of hosts each given a unique address.
* The glue that holds the networks together is called BGP.
* BGP sees networks as aggregate collections of addresses called "prefixes".
So when we connect to another network, we announce our prefixes to them and they announce theirs to us. At either end of the connection are network devices called routers, which do filtering and weighting to decide what routes work best for your traffic. Filtering is important because it allows networks to send and receive the proper traffic and ignore improper traffic. For example, suppose digital.forest has a connection with both "Network A" and "Network B". We do not want to be a transit point BETWEEN "Network A" and "Network B", so we filter appropriately. We only want "our" traffic to go over these routes, not the whole world's traffic. Every network that connects to multiple other autonomous networks does this to some extent.
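For the curious, the "do not become a transit point" idea can be sketched in a few lines of Python. This is only an illustration, not how a real router is configured: the function name `routes_to_announce` is made up for this post, and the prefixes are reserved documentation ranges rather than anything we actually announce.

```python
from ipaddress import ip_network

# Hypothetical prefixes (reserved documentation ranges), not real digital.forest ones.
OUR_PREFIXES = {ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")}

def routes_to_announce(known_routes, own_prefixes=OUR_PREFIXES):
    """Keep only the routes we originate ourselves.

    Because routes learned from "Network A" are never passed on to
    "Network B", neither neighbor sees a path through us to the other.
    """
    return [route for route in known_routes if route in own_prefixes]

# Everything our router knows about: our two prefixes plus one learned from "Network A".
known = [
    ip_network("192.0.2.0/24"),
    ip_network("198.51.100.0/24"),
    ip_network("203.0.113.0/24"),  # learned from "Network A"
]

to_network_b = routes_to_announce(known)
print(to_network_b)
```

Only the two prefixes we own end up in the list offered to "Network B"; the route learned from "Network A" is dropped.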
Most large transit networks use routing databases to associate autonomous networks with their announced prefixes. This acts as a security and authentication layer, as well as a basis for the filtering policies of the networks that query the databases. The databases are maintained (usually) by the entities that allocate the addresses, so they are a trusted source. The databases are then replicated and shared among the network operators. There are also "route servers" and "looking glasses" at various locations around the Internet, which network operators can check to see how they fit into this big meshed network and verify that what they want to happen is in fact happening.
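As a rough sketch of how a routing database can drive filtering, assume the database boils down to a table of "this autonomous network is registered for these prefixes". The table `ROUTING_DB`, the function `accept_announcement`, and the AS number 64496 (a number reserved for documentation) are all hypothetical, and real filters apply more rules than this, such as limits on prefix length.

```python
from ipaddress import ip_network

# Hypothetical registry contents: AS number -> prefixes registered to that network.
ROUTING_DB = {
    64496: [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")],
}

def accept_announcement(origin_as, prefix, db=ROUTING_DB):
    """Accept a prefix only if it falls inside a range registered to the announcing AS."""
    prefix = ip_network(prefix)
    return any(
        prefix.version == registered.version and prefix.subnet_of(registered)
        for registered in db.get(origin_as, [])
    )

print(accept_announcement(64496, "192.0.2.0/24"))    # registered to this AS
print(accept_announcement(64496, "203.0.113.0/24"))  # not registered to this AS
```

A network that filters this way picks up new prefixes as soon as the registry changes, with nobody typing them in by hand.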
Mind you, all of the above is a vast simplification, so if you knew nothing about this until now, it is hopefully understandable. If you already know how all this works, you know I left plenty of detail out, but you should hopefully recognize that it is all basically correct. Now on to what happened over the past week...
Here at digital.forest we announce several prefixes: a few of our own, and several on behalf of customers who have been allocated IP address ranges separate from ours. Last weekend we turned on our new circuit in the wee hours one night, and from here it looked great: traffic flowed at the rate we expected. But before we went too far, we consulted the various route servers out there to see what the Internet saw. How did this new connection look from the outside looking in? What we saw was just one of our prefixes being carried by this new connection. Not wanting to risk weird routing issues, we shut the new circuit down and got in contact with the provider's NOC to find out why all the prefixes we announced were not picked up by their network.
This prompted a round of paperwork and approvals on their end, as we discovered that they do not rely on the routing databases to determine their route filtering policies. Instead they do it manually. I will not make any judgement calls as to that policy of theirs... I understand why some entities choose manual methods over automatic ones; after all, I shift my own gears when I drive... sometimes manual systems are a better choice. In this case, though, it certainly slowed down the process.
We submitted our full prefix list to them early in the week. It took them until yesterday to enter them in their systems. We are waiting a full 48 hours, the projected propagation time, so that their entire network and their BGP peers pick up the changes; then we will re-enable the circuit. Kyle Murray, our Network Manager, has been the man on point throughout this process and has done an excellent job making sure it all goes well.
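The outside-in check we ran amounts to comparing two lists: what we announce, and what a route server says it sees. Here is a minimal sketch of that comparison, with both lists typed in by hand as an assumption (in practice the second list comes from querying a route server or looking glass). The prefixes are documentation ranges and `missing_prefixes` is a name invented for this example.

```python
def missing_prefixes(announced, seen_outside):
    """Return the prefixes we announce that the outside world does not see."""
    return sorted(set(announced) - set(seen_outside))

# Hypothetical data: three announced prefixes, only one visible from outside.
announced = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
seen_outside = ["192.0.2.0/24"]

missing = missing_prefixes(announced, seen_outside)
print(missing)
```

An empty result is the goal; anything left in the list is a prefix to raise with the provider before leaving the circuit up.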
Several of our clients are looking hopefully at this new circuit and expect some performance increases, as it is a recognized "better" network than the circuit we are replacing. Some of the secondary prefixes that we announce belong to these same clients. We wanted to make sure that this circuit turn-up goes very well, with no possibility of unusual behavior in our clients' traffic. Hence the delays to make sure everything was exactly as it should be. We are now very confident, but will go through the same process as last time: turn up, then check and see how it looks both from within and without. Trust, but verify.
My goal in these posts is to provide you with clarity as to what happens here at an operational level at digital.forest. We are blessed with excellent staff, and truly the best clients a company could hope for. I enjoy sharing this information and I hope it serves to boost your confidence in us as we care for your vital systems in our facility and on our network. I know that you look to us to "just make it work" but it can only help for us to communicate on an ongoing basis what is involved behind the scenes to accomplish that task.
Regards,
Chuck Goolsbee
VP Technical Operations
digital.forest, Inc
posted by Chuck G. at 03:04 PM on Thursday, January 31, 2008
Categories: Network, Scheduled Maintenance