One of the distribution switches in our facility, the one serving the rack-colocation area in rows 11 and 12 of DC1, has been showing errors and causing network issues for servers in that area today. Our switch vendor believes the cause is a bad gigabit port on the uplink between that switch and our network core. The other possibility is a bad switch engine. Thankfully, the uplink has some built-in redundancy, and the switch engine is an easy card swap for which we have spares. In a few moments we will manually fail the gigabit uplink over to the redundant port. This is unlikely to be any more noticeable than the intermittent issues that servers on this network segment have already seen, namely some dropped packets and retransmissions. If that does not improve things, we'll replace the switch engine blade later tonight during a maintenance window.
Update 5:00 PM: The card reset seems to have solved the issue. We'll keep an eye on things over the next few days to be sure. We'll also order a replacement card for the switch. Thanks for your patience.

Above: Network Manager Kyle Murray pulls the problem gigabit card from the switch.
posted by Chuck G. at 08:28 AM on Monday, June 11, 2007
Categories: Emergency Maintenance, Network