The mail server "treehouse" will be undergoing emergency maintenance today.
We are performing some quick maintenance on the existing server right now, which requires a shutdown of mail service. The server should be back up and online around 8:45 AM PDT.
We will post details about what is going on as soon as possible.
Update: noon PDT
We are once again experiencing the mysterious slowdown bug. The server slows to a crawl with the primary symptom of slow disk writes to local mailboxes. This acts as a sea-anchor of sorts and slows the whole server down, which leads to the intermittent connectivity and glacial response that users experience.
Some background on the issue:
Communigate Systems (formerly Stalker Software), the maker of the mail server software, has never been able to determine the cause of this. Their suggestion each time has been to put the mail data on a faster disk subsystem. We kept upgrading every year and eventually migrated the data to a very high-end, high-performance Fibre Channel disk array. Yet the problem returned. In the interest of troubleshooting, we once moved the mail data back to the internal ATA disk of the server and, sure enough, the problem went away. This meant that it was NOT a hardware/performance issue of the underlying machine, but a true software problem within Communigate.
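As an illustration of how two storage locations can be compared for the "slow disk writes" symptom, here is a minimal Python sketch. It is not the procedure that was used on treehouse. The function names and the probed directory are hypothetical, and it only times small synchronous writes, which is a rough stand-in for mailbox appends.

```python
import os
import statistics
import tempfile
import time


def measure_sync_write_latency(directory, writes=20, size=4096):
    """Time small fsync'd writes to a scratch file in `directory`.

    Returns a list of per-write latencies in seconds. The scratch file
    is removed afterwards, so nothing is left behind in the directory.
    """
    payload = b"x" * size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="latency-probe-")
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to disk, as a mail store would
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies):
    """Reduce raw latencies to median and worst case, in milliseconds."""
    return {
        "median_ms": statistics.median(latencies) * 1000,
        "max_ms": max(latencies) * 1000,
    }


if __name__ == "__main__":
    # Assumption: the system temp directory stands in for a real mail
    # volume. Point this at each mount (array, local disk) to compare.
    target = tempfile.gettempdir()
    print(target, summarize(measure_sync_write_latency(target)))
```

Running the same probe against each volume while the server is under load gives a like-for-like number. If the faster hardware shows worse write latency than the slower disk, that points away from raw disk performance as the cause.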
So our "cure" of late has been to switch file systems as soon as the early symptoms appear. We have successfully done that twice over the past year.
This time, however, that did not fix the issue. In fact, it got worse.
Short-term attack plan:
This morning we shut down treehouse for about 30 minutes and moved the mail data back from the array to the local filesystem. The server now seems to be staying even instead of falling behind.
Now that the patient is stable, we are preparing the transplant. We are building a new server and plan on migrating treehouse over to it later tonight. It will be a completely different CPU and OS platform, with a different but still very high-performance disk subsystem. We will post here when it goes online. Our hope is that the software bug in Communigate Pro (CGP) is specific to its interaction with the current underlying architecture.
The search for a replacement:
In November of 2005 another software bug shut down treehouse, and our frustration level with Communigate went through the roof. We committed at that time to replace it as soon as possible. We have spent the past five months testing various replacement alternatives. Our ideal replacement system would match the current one for functionality and add a layer of redundancy that we really want and need for our mail system. This means a mail cluster. Unfortunately, we have yet to find something that meets those requirements. We have come close: we were within weeks of deploying a system when a flaw in IMAP handling revealed itself. The project is "on hold" at the moment, but given the events of this past week, we will have to start work on it again. The developer of the cluster solution has released a new version with some fixes, and it is time to give it another thorough test.
We appreciate your patience while we sort through this. Stay tuned for updates.
posted by Chuck G. at 08:03 AM on Monday, May 15, 2006
Categories: Mail