[Info-vax] File Systems

Stephen Hoffman seaohveh at hoffmanlabs.invalid
Thu Mar 5 17:35:35 EST 2015


On 2015-03-05 21:54:32 +0000, mcleanjoh at gmail.com said:

> ZFS seems to have a lot of overheads.
> 
> If I read the information correctly, change one file and a new checksum 
> for that block (or those blocks plural) has to be calculated and 
> written, and a new checksum of the block that contains checksums ... 
> all the way up to the single checksum at the top of the tree of 
> checksums.
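The quoted description is essentially a Merkle tree: changing one block means recomputing one checksum per level of the tree up to the root, i.e. O(log n) hashes, not a re-checksum of the whole pool. A toy sketch in Python (a hypothetical heap-layout tree, not ZFS's actual on-disk format):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class ToyMerkle:
    """Toy Merkle tree over fixed-size blocks (illustrative only)."""
    def __init__(self, blocks):
        self.blocks = list(blocks)
        n = len(self.blocks)            # assume n is a power of two
        self.nodes = [b""] * (2 * n)    # 1-indexed heap layout
        for i, blk in enumerate(self.blocks):
            self.nodes[n + i] = h(blk)
        for i in range(n - 1, 0, -1):   # fill interior nodes bottom-up
            self.nodes[i] = h(self.nodes[2 * i] + self.nodes[2 * i + 1])

    def root(self) -> bytes:
        return self.nodes[1]

    def update(self, idx, blk) -> int:
        """Rewrite one block; return how many checksums were recomputed."""
        n = len(self.blocks)
        self.blocks[idx] = blk
        i = n + idx
        self.nodes[i] = h(blk)
        recomputed = 1
        i //= 2
        while i >= 1:                   # walk the single path to the root
            self.nodes[i] = h(self.nodes[2 * i] + self.nodes[2 * i + 1])
            recomputed += 1
            i //= 2
        return recomputed

tree = ToyMerkle([bytes([b]) * 512 for b in range(8)])  # 8 blocks
print(tree.update(3, b"x" * 512))  # 4: one leaf, two interior, one root
```

So for eight blocks a single-block rewrite touches four checksums, and the count grows only logarithmically with pool size.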

The CPU can often perform other tasks while it is waiting 
<http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html> 
for the slow mechanical disk and its platters to crawl slowly around, 
slowly passing the slowly-seeking disk heads slowly moving to slowly 
read the contents of the sectors slowly rotating past underneath the 
slowly-moving mechanical disk heads.

From Jeff Dean at Google, from a few years ago, here is a slightly 
edited list of order-of-magnitude system performance numbers that might 
interest folks making these trade-offs:

L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 100 ns
Main memory reference 100 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from network 10,000,000 ns
Read 1 MB sequentially from disk 30,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns
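Scaling those numbers makes the gap vivid. A quick back-of-the-envelope in Python, using the figures copied from the list above:

```python
# Order-of-magnitude latencies from the list above, in nanoseconds.
L1_CACHE_NS = 0.5
MAIN_MEMORY_NS = 100
DISK_SEEK_NS = 10_000_000

# How many L1 or memory references fit inside one disk seek?
print(int(DISK_SEEK_NS / L1_CACHE_NS))     # 20000000 L1 references
print(int(DISK_SEEK_NS / MAIN_MEMORY_NS))  # 100000 memory references

# If an L1 hit took one second, a single disk seek would take:
seconds = DISK_SEEK_NS / L1_CACHE_NS
print(f"{seconds / 86_400:.0f} days")      # 231 days
```

On that scale, one seek costs the same as twenty million L1 cache hits.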

Disks are... slow.   For various applications, it's actually faster to 
shovel the bits to another system than to write the same data out to 
local disks.  That means using redundant servers rather than RAID 
storage, and it further ignores that RAID writes are inherently slower 
than writes to a single disk.

Then there are the other weird behaviors of disks: 
<http://highscalability.com/blog/2013/6/13/busting-4-modern-hardware-myths-are-memory-hdds-and-ssds-rea.html> 


Then you discover that Itanium alignment faults are often really, 
really slow: <http://labs.hoffmanlabs.com/node/160>

Then you discover that the displays are slow, too: 
<http://superuser.com/questions/419070/transatlantic-ping-faster-than-sending-a-pixel-to-the-screen> 


Put another way, a performance-optimized checksum calculation is 
pretty speedy, too.   (Remember my kvetching over in the password-hash 
thread?  Same general issue.  But fast hash calculations are 
emphatically not good news for password hashes.)
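To put a rough number on "speedy": even from pure Python, checksumming a 1 MB buffer with the standard library's CRC-32 finishes in a small fraction of the ~30 ms the list above gives for reading that megabyte off a disk. (ZFS itself defaults to fletcher checksums, which are comparably cheap; SHA-256 is its slower, stronger option.) A quick sketch:

```python
import time
import zlib

buf = b"\xa5" * (1 << 20)  # 1 MB of data

start = time.perf_counter()
crc = zlib.crc32(buf)       # fast 32-bit checksum from the stdlib
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"crc32 of 1 MB: {crc:#010x} in {elapsed_ms:.3f} ms")
# On typical hardware this lands well under the ~30 ms budget for
# reading 1 MB sequentially from disk, per the list above.
```

The exact timing is machine-dependent, but the checksum cost disappears into the disk-latency noise either way.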

I won't bore you with how bad some optical devices and optical media 
can be, either — between horrid firmware bugs I've encountered and read 
errors secondary to crappy media, I'm surprised optical works as well 
as it usually did, err, does.  Related: 
<http://www.rdrop.com/~half/General/CDRot/CDRot.html> 
<http://www.nbcnews.com/id/4908081/ns/technology_and_science-games/t/when-optical-discs-go-bad/#.VPjWvCn0b8s> 
 (R.I.P., Snark.)

> Maybe the calculation of each checksum is that messy (done in 
> hardware?) but there's all the disk I/O's that need to be done.  The 
> alternative, of keeping some checksums in memory and only occasionally 
> writing to disk doesn't seem that smart.

But why might ZFS checksums be nice?    Because there are empirical 
studies showing an average of three to six uncorrectable errors per 
terabyte across a number of disks.
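Taking those figures at face value, the expected error count on a modern disk is not negligible. A trivial back-of-the-envelope (the 3-6 per TB rates are the ones just cited; the 4 TB disk size is only an example):

```python
# Uncorrectable-error rates from the studies mentioned above.
low_rate, high_rate = 3, 6  # errors per terabyte
disk_tb = 4                 # a hypothetical 4 TB disk

print(f"expected errors: {low_rate * disk_tb} to {high_rate * disk_tb}")
# -> expected errors: 12 to 24
```

A dozen or more silently flipped sectors per disk is exactly the failure mode end-to-end checksums exist to catch.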




-- 
Pure Personal Opinion | HoffmanLabs LLC



