Posted on 2005-09-09 11:06:32, modified on 2006-01-09 16:29:23
Tags: Computers, RAID
Friday morning, 07:05: I get a phone call. Problems with the database server. It gets lots of timeouts on its disks, but none of the disks shows up on the controller as bad. Annoying, but nothing we can do about it. The problem, however, is that timeouts on a database server don't make the users happy, so we come up with a way to determine which disk is broken: take one disk out of the RAID5 array, and see if the timeouts come back. In theory a good approach. Now reality: take out the first disk, acknowledge this change in the RAID5 controller. The machine comes back up, fsck fails again with lots of timeouts. Put the disk back, take out the second disk, acknowledge this in the... in the... oh shit.
By putting back the first disk and taking out the second disk at the same time, we broke the array. We should have put the first disk back, acknowledged it on the RAID5 controller, and only then taken out the second disk. Oh oh oh oh oh...
Let's restore from backups...
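For anyone wondering why that one swap was fatal, here is a minimal sketch of RAID5-style XOR parity. It is a toy model and not what our controller actually runs: the three-disk stripe, the block contents and the function names are all made up for illustration, and I am assuming the controller still counted the re-inserted first disk as missing until it was acknowledged.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_stripe(stripe):
    """Return the stripe with a single missing block (None) reconstructed.

    Raises ValueError when more than one block is missing, because XOR
    parity carries only enough redundancy for one lost block per stripe.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError(f"{len(missing)} blocks missing, stripe is unrecoverable")
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # XOR of all surviving blocks (data and parity) yields the missing block.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# Hypothetical three-disk stripe: two data blocks plus their parity.
data = [b"db-page1", b"db-page2"]
stripe = data + [xor_blocks(data)]

# One disk pulled: the missing block can be recomputed from the others.
one_pulled = [None, stripe[1], stripe[2]]
assert rebuild_stripe(one_pulled) == stripe

# First disk back but not yet accepted (assumption), second disk pulled:
# from the array's point of view two blocks are gone.
two_missing = [None, None, stripe[2]]
try:
    rebuild_stripe(two_missing)
except ValueError as err:
    print(err)
```

With one block gone, the XOR of the survivors gives it back. With two gone there is nothing left to XOR against, which is roughly where we ended up.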