Without the underlying data, nothing else matters. Backing up your data is critical, but what about silent data corruption?
The biggest problem with silent data corruption is that it is completely silent. You receive no error notifications or alerts and your systems continue on their merry way having written an erroneous bit of data to your drive. And what’s the downside? Maybe just a lost file or maybe a corrupted operating system. Who knows?
CERN’s Large Hadron Collider creates around 15 million gigabytes (15 petabytes) of data every year. That data is used to draw conclusions about how the universe was created, so erroneous data would have a serious impact on those conclusions. That’s why CERN ran extensive tests to check the error rate in their data. When they checked 8.7 TB of user data for corruption (33,700 files), they found 22 corrupted files, or roughly 1 in every 1,500 files.
That’s actually quite a considerable number when you think about it. It means that if you have a 1 TB disk in your home machine, full of 4 MB MP3s (legally purchased of course!), you could expect around 170 of them to be corrupted, unusable, lost forever.
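If you want to check that back-of-the-envelope figure yourself, here is a minimal sketch of the arithmetic. It assumes decimal units (1 TB = 10^12 bytes, 4 MB = 4 × 10^6 bytes) and that the corruption rate CERN measured carries over to a home disk, which is a big assumption. The variable names are purely illustrative.

```python
# Figures quoted from the CERN test described above.
files_checked = 33_700
files_corrupted = 22

# Roughly one corrupted file per this many files checked.
files_per_corruption = files_checked / files_corrupted

# Assumed home setup: a 1 TB disk filled with 4 MB files (decimal units).
disk_bytes = 10**12
file_bytes = 4 * 10**6
files_on_disk = disk_bytes // file_bytes

# Apply the measured rate to the number of files on the disk.
expected_corrupted = files_on_disk * files_corrupted / files_checked

print(f"About 1 in {files_per_corruption:.0f} files corrupted")
print(f"{files_on_disk} files on disk, about {expected_corrupted:.0f} expected to be corrupted")
```

With the unrounded rate this lands in the low 160s; rounding the rate to 1 in 1,500 first gives a figure just under 170. Either way, it is a lot of lost music.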
“But surely this isn’t going to be a problem with RAID?” I hear you say. Actually, one of the biggest culprits is a very common RAID configuration that we all know and love:
RAID 5
RAID 5 is very good. It calculates parity bits for the data that it writes to disk and stripes that parity across the array. The parity is chosen so that the XOR of the corresponding bits on all the disks is zero. That way, if a disk is lost, you can recalculate what was on it by XORing together what is left on the surviving disks.
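To make the XOR trick concrete, here is a minimal sketch in Python. It is a toy model, not how any real controller is implemented: each "disk" is just a short byte string, and the helper `xor_blocks` and the sample data are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks making up one stripe.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data)

# Sanity check: XORing every block in the stripe, parity included, gives zero.
assert xor_blocks(data + [parity]) == bytes(4)

# Simulate losing disk 1: rebuild it from the surviving data and the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("rebuilt block:", rebuilt)
```

Because XOR is its own inverse, the same function both creates the parity and rebuilds a missing block.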
However, every time you update the data on the disk, you must also update the parity bit. One way that errors occur is when the data is updated and power is lost before the parity bit is written. This leaves you with a mismatch between your check data and the real, underlying data. The only way that is ever going to be fixed is if the data is completely written over and the parity bit updated correctly. Otherwise, if you were to recreate the data from the parity check, you’d recreate data that was incorrect and you’d never know. This is called the RAID 5 write hole.
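The write hole is easy to reproduce in a toy model. The sketch below is an illustration under simplified assumptions (byte strings standing in for disks, a made-up `xor_blocks` helper), not a simulation of a real array: it updates one data block, "loses power" before the parity is rewritten, and then tries to rebuild a different disk.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A consistent stripe: three hypothetical data blocks plus their parity.
data = [b"old0", b"old1", b"old2"]
parity = xor_blocks(data)

# Block 0 is updated on disk...
data[0] = b"new0"
# ...and power is lost here, so the parity is never rewritten.

# The stripe is now inconsistent, but nothing reports an error.
stripe_is_consistent = xor_blocks(data + [parity]) == bytes(4)

# Later, disk 1 fails and is rebuilt from the survivors and the stale parity.
rebuilt = xor_blocks([data[0], data[2], parity])

print("stripe consistent:", stripe_is_consistent)
print("rebuilt disk 1 matches original:", rebuilt == b"old1")
```

Note that the block that comes back wrong is not even the one that was being written: the damage lands on whichever disk you later have to rebuild.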
Because of the way this all ties together, there is also a significant performance impact to this way of working. If an update is made to the disk that is smaller than a single RAID stripe, existing contents of that stripe must be read back in order to recalculate the parity for it. You were making a very small write, and that has incurred additional reads plus a second write for the updated parity.
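Here is a minimal sketch of why a small write costs so much. It shows two ways the new parity could be worked out, both of which need data read back from disk first. The second is the commonly described read-modify-write shortcut; whether a given controller uses it is an assumption on my part, and the helper and sample data are illustrative only.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A hypothetical stripe of three data blocks and its current parity.
stripe = [b"aaaa", b"bbbb", b"cccc"]
old_parity = xor_blocks(stripe)

# A small update that only touches the block on disk 1.
new_block = b"ZZZZ"

# Option 1: read the untouched blocks of the stripe and recompute the parity.
recomputed = xor_blocks([stripe[0], new_block, stripe[2]])

# Option 2: read the old data and the old parity, then XOR the old data out
# and the new data in.
shortcut = xor_blocks([old_parity, stripe[1], new_block])

# Both routes give the same parity, and both need reads before the write.
assert recomputed == shortcut
print("new parity:", recomputed.hex())
```

Either way, one small logical write has turned into reads of existing blocks followed by two physical writes: the new data and the new parity.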
There are solutions to these problems, but they are all expensive and definitely don’t meet the description of a Redundant Array of Inexpensive Disks!
ZFS
ZFS is a transactional filesystem in much the same way as most database systems are transactional. Changes are committed to the disk as a single transaction, so a write either completes fully or not at all, which prevents “writes” being partially completed. Each block of data has its own checksum, which is saved alongside the pointer to that block rather than with the block itself. This means that whenever a block of data is accessed, its checksum is recalculated and compared with the stored one. If the two differ and a good redundant copy is available (in a mirror, for example), the filesystem can heal the block of data so that it is correct before being used.
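The check-then-heal idea can be sketched in a few lines of Python. This is a toy model and not ZFS’s actual code or on-disk format: a two-way mirror is just a list of byte strings, SHA-256 stands in for the checksum, and `read_block` is a hypothetical helper named for this example.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    """Checksum of a block (SHA-256 is used here as a stand-in)."""
    return hashlib.sha256(block).digest()

def read_block(copies: list, expected: bytes) -> bytes:
    """Return a copy that matches the expected checksum, repairing bad copies."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        # No valid copy left: the corruption is detected but cannot be healed.
        raise IOError("all copies fail their checksum")
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good  # overwrite the damaged copy with the good one
    return good

original = b"important data"

# The checksum is kept with the pointer to the block, apart from the block itself.
pointer_checksum = checksum(original)

# A two-way mirror where the first copy has been silently corrupted.
mirror = [b"important dat\x00", original]

data = read_block(mirror, pointer_checksum)
assert data == original
assert mirror[0] == original  # the bad copy has been healed
print("read OK, mirror repaired")
```

Keeping the checksum away from the data it describes is what makes this work: a block that was damaged, or written to the wrong place, cannot vouch for itself.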
If ZFS is used with hardware RAID, then it can’t guarantee that at a hardware level you won’t end up with the same “hole” as I outlined for RAID 5. So it needs to be in complete control of the underlying disks, and the system should only use HBAs (Host Bus Adapters) to access the drives.
ZFS isn’t a new technology by any means. We’ve been utilising it for many years. However, it is still relatively unknown and underused for what is effectively a Redundant Array of Inexpensive Disks!