System Downtime.

mclemore

Administrator
Staff member

As many of you saw (via the announcement here or by browsing), we had some downtime on the main site on Wednesday. In short, our data center took some shortcuts. As a result, a single drive in a redundant RAID array that was beginning to fail took everything down. Essentially, write transactions would back up and bring the site down every 2-10 minutes.

There were multiple issues, but one was that they were using some servers with BIOS-controlled RAID (aka 'fake RAID'), and nothing good came of that. We eventually jointly agreed that we couldn't even remove the bad drive without a good chance of nuking everything. I'm so used to hot-swap drives that you just pull and replace.

We essentially made the site private and 'read-only', and brought up a new version of the site on new hardware at the same time. There was a tremendous amount of data to copy over, and a number of unexpected issues as we upgraded from CentOS 5.x to 6.x.
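For anyone curious what a 'private, read-only' lockdown amounts to in practice, here is a minimal sketch. It is not the actual forum software's code: the function name, the flags, and the assumption that 'private' means only signed-in members can view pages are all hypothetical, chosen purely for illustration.

```python
# Hypothetical sketch of a lockdown gate; none of these names come from
# the real site software.

# HTTP methods that only read data and never write to the database.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def is_allowed(method: str, is_member: bool,
               private: bool = True, read_only: bool = True) -> bool:
    """Return True if a request may proceed while the site is locked down."""
    # Assumption: 'private' means only signed-in members may view anything.
    if private and not is_member:
        return False
    # 'Read-only' rejects anything that would write (posts, edits, uploads).
    if read_only and method.upper() not in SAFE_METHODS:
        return False
    return True
```

With both flags on, a signed-in member can still browse (`is_allowed("GET", True)` is true), but nobody can post (`is_allowed("POST", True)` is false), which keeps new writes off the struggling disks while the data is copied.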

I personally put in a 21-hour, no-break workday yesterday and, after 3 hours of sleep, am putting in another 16-hour day today. No data has been lost, though there is still a little work to do on the system.

I'm glad it's done and I'm already working on a plan to prevent this particular problem from ever happening again.