digital.forest Technical Support
Apology and explanation regarding mail server issues.

First, let me say that we apologize deeply for the issues surrounding our mail servers over the past 24 hours. We TRULY do feel your pain, since we use the very same servers for our own email. Second, I want to tell you in detail what has transpired over the last 24 hours, and what steps we have taken to address the performance and reliability of the current mail systems. Finally, I'd like to take a moment to tell you what we are planning in order to prevent this or similar issues from happening again.

In short, a bug in the mail server software we use, CommuniGate Pro (CGP), caused two of our three CGP servers (and a few of the colocated CGP servers we manage for clients here) to crash last night when the clock rolled over to November 1st (midnight UTC). The software vendor suggested that the fix would be a downgrade to a previous version. We made that change overnight on most of the servers with few issues, and kept the other servers stable overnight by rolling their clocks back a day.

Treehouse is a large server with a significant number of users on it, including ourselves. We hesitate to make changes to its configuration, since we have had performance issues with it in the past and have had to work with the vendor extensively to get it running properly after upgrades. But stuck between a certain crash and a risky downgrade, we did not have much choice. The downgrade unfortunately brought back some long-standing file-system-related bugs that cause serious performance issues under heavy load. We have wrestled with this particular file-system bug several times over the past four years, and thought we finally had it fixed. To say this is frustrating for us is a huge understatement.

We have built a new server, equal or greater in specification to the existing treehouse server, but running on a different platform (FreeBSD). We have performed a clean install of CGP on it (a version without the crashing bug), and will migrate our user and mail data to it tonight. I cannot promise or guarantee that this will fix everything, or that it will not introduce some other software bug. Those are issues beyond our control. We do feel that it should stabilize the server enough for the next step.

What I CAN promise and guarantee is that we at digital.forest will dedicate all of our resources to replacing the current mail system with something far more robust and scalable. By the end of this year, we will have tested, and begun to deploy on a wide scale, a system to completely replace the one in use now. It must allow us to operate our servers in a cluster, so that we have failover and load balancing. It will support all the features you have come to expect: POP, IMAP, and Webmail, as well as delegated administration of domains.

Again, I apologize for this situation, and thank you for your patience today. As always, we will post progress and updates here as we proceed.

Chuck Goolsbee
V.P. Technical Operations
digital.forest, Inc.

posted by Chuck G. at 09:46 PM on Tuesday, November 1, 2005
Categories: Mail