The Munich CCC fileserver, like many other servers, runs software RAID 5 across its disks. We all know (or should know) that RAID is no substitute for backups, and a recent problem of ours reinforced that lesson. While RAID level 5 can recover gracefully from a single failed disk, it generally cannot cope with multiple disks failing at the same time.
One problem with large hard disks is that they may carry errors nobody has detected yet, simply because that part of the disk has not been read for quite some time. When you start rebuilding a RAID 5, these errors quickly pop up, because the rebuild process needs to read all the data. This is the main reason why you should regularly run complete surface scans on your RAID arrays.
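To make the idea of a surface scan concrete, here is a minimal sketch in Python. It simply reads a device (or any file) from start to end and notes which chunks could not be read. The function name `scan_for_read_errors` and the 1 MiB chunk size are our own choices for illustration, not part of any RAID tool, and it assumes a Unix-like system where `os.pread` is available. Keep in mind that reads may be answered from the operating system's cache, so a real scan would have to bypass it.

```python
import os


def scan_for_read_errors(path, chunk_size=1024 * 1024):
    """Read `path` sequentially and return (offset, length) for each unreadable chunk.

    Hypothetical helper for illustration only; `path` may be a regular
    file or a block device.
    """
    bad_chunks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, length, offset)
                if not data:
                    break  # unexpected end of file, nothing more to read
            except OSError:
                # The kernel reported an I/O error for this range:
                # remember it and carry on with the next chunk.
                bad_chunks.append((offset, length))
            offset += length
    finally:
        os.close(fd)
    return bad_chunks
```

An empty result means every chunk was readable at the time of the scan. A smaller `chunk_size` narrows down where an error sits, at the cost of a slower scan.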
Almost all RAID implementations tend to mark a whole disk as failed as soon as it reports a single error. This becomes a problem the moment a second disk shows an error while the array is already degraded and you are in the middle of rebuilding it.
Fortunately there is still hope. If the errors on your failing disks occur at non-overlapping points of the array, that is, no stripe is damaged on more than one disk, you can recover a complete copy of your data by assembling just the right pieces. Unfortunately, there appears to be no hardware or software RAID solution able to do that out of the box, so we're left to try this manually.
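Why this can work at all follows from how RAID 5 parity is built: within each stripe, the parity block is the XOR of the data blocks, so any single missing block equals the XOR of all the others. The following Python sketch shows the principle on in-memory blocks. It is an illustration under simplifying assumptions, not the procedure we ended up using: the names `xor_blocks` and `recover_stripes` are made up, every disk is modelled as a list of equally sized blocks, and `None` stands for a block that could not be read.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_stripes(disks):
    """Rebuild unreadable blocks stripe by stripe.

    `disks` is a list with one entry per disk; each entry is a list of
    blocks (bytes), with None marking an unreadable block.
    Returns (repaired, lost): a repaired copy of `disks` and the list of
    stripe numbers that are damaged on more than one disk.
    """
    repaired = [list(disk) for disk in disks]
    lost = []
    for stripe in range(len(disks[0])):
        missing = [i for i, disk in enumerate(disks) if disk[stripe] is None]
        if len(missing) == 1:
            # Exactly one block is gone: XOR of the rest restores it,
            # no matter whether it held data or parity.
            others = [disk[stripe] for disk in disks if disk[stripe] is not None]
            repaired[missing[0]][stripe] = xor_blocks(others)
        elif len(missing) > 1:
            # Overlapping errors: RAID 5 parity cannot restore this stripe.
            lost.append(stripe)
    return repaired, lost
```

As long as `lost` comes back empty, every stripe had at most one bad block and the whole array can be pieced together, even though each individual disk would have been kicked out as failed.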
More on this saga in part II, coming soon…