[ZODB-Dev] Use of fsync in FileStorage

Thu Jul 29 02:00:54 EDT 2004

[Jeremy Hylton]
> I had a rationalization for the current implementation.  I assumed the
> fsync() would not fail or that if it did it was so catastrophic that it
> was okay for tpc_finish() to fail, too.  At that time, a failure in
> tpc_finish() would cause the whole ZODB to stop accepting transactions
> (hosed).  I assumed that fsync() failures probably meant serious disk
> failures.

Sure, but for the sake of other storages participating in the transaction, a
serious disk failure on one machine is something to take extra care against,
not less.  I'll grant that tpc_finish() has to do some I/O regardless, even
if it's only to overwrite one byte, and a disk can fail at any time.  But a
transaction can be very large, and just flushing stdio buffers doesn't
necessarily touch the disk at all.  If the disk is going to fail, it's
likely the fat transaction data that will provoke it.
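
To make the flush-versus-fsync distinction concrete, here is a minimal
sketch in Python.  It is not FileStorage code; append_record() and its
sync flag are made up for illustration.  f.flush() only hands the bytes
from the application's buffer to the OS, while os.fsync() asks the OS to
push them to the device, and that second step is where a failing disk is
more likely to show itself.

```python
import os

def append_record(path, data, sync=True):
    """Append data to path and return the resulting file size.

    Hypothetical helper, not part of ZODB.
    """
    with open(path, "ab") as f:
        f.write(data)
        # Moves bytes from the application buffer to the OS; the disk
        # need not be touched at all at this point.
        f.flush()
        if sync:
            # Asks the OS to write its cached data for this file to the
            # device; an I/O error can surface here as OSError.
            os.fsync(f.fileno())
    return os.path.getsize(path)
```

With sync=False, a successful return says nothing about whether the data
ever reached the disk.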

> Given that fsync() failures are very rare and fsync() is expensive,

Given the timings that have trickled in so far, it doesn't appear that the
expense of an fsync() will even be in the same ballpark across any two
specific boxes.  Seems to range from unthinkably expensive, through horridly
expensive, to a minor nuisance.

> I wanted to avoid an fsync() call in tpc_vote() in a ZEO server.

Obviously <wink>.

> In that case, the server calls flush, gets the data out of application
> buffers, and sends its response to the ZEO server.

Client was intended, right?

> The hope was that much of the data would already be written to disk
> by the time the client returned with a tpc_finish() call so that fsync()
> would go more quickly.  I never measured any of this so I don't know how
> naive it was.  It still seems that calling fsync() in the middle of the
> ZEO transaction is unfortunately slow.

That's why I'm asking people to run timings on many boxes.  I didn't realize
what a timing disaster *any* os.fsync() is on Windows -- but maybe that has
mostly to do with my having a server-class CPU but a consumer-laptop-class
IDE drive.

If we throw the laptops out of the results, it doesn't look so bad (but
still varies widely) across Linuxish boxes.  Christian has given the only
arguably "server-class machine" timing results so far here, and he reported
hundreds (> 400) of transactions (~2KB each) per second even with two fsyncs
and across an NFS-mounted partition.  God only knows what fsync() might mean
in that case, but those were the worst results he reported.

Jim suggested that if vote times in ZEO "are a problem", then that would be
better addressed by increasing concurrency on ZEO servers (easier said than
done, eh?).  Wouldn't help on my laptop, since os.fsync() made the test
purely I/O-bound.  I'm sure curious to see results on a wider range of
setups!
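
For anyone who wants to contribute numbers, a rough sketch of the kind of
timing loop I mean follows.  It is not the script used for the results
above; time_commits() and its parameters are invented here, and the
one-byte write is only a stand-in for the status byte that tpc_finish()
overwrites.  With fsyncs_per_txn=1 it syncs once at "finish" time; with 2
it also syncs after the transaction data, as a vote-time fsync would.

```python
import os
import tempfile
import time

def time_commits(n_txns=200, txn_size=2048, fsyncs_per_txn=1, directory=None):
    """Return simulated transactions per second (hypothetical benchmark)."""
    payload = b"x" * txn_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            start = time.perf_counter()
            for _ in range(n_txns):
                f.write(payload)
                f.flush()
                if fsyncs_per_txn >= 2:
                    os.fsync(f.fileno())   # sync at "vote" time
                f.write(b"\0")             # stand-in for the status byte
                f.flush()
                if fsyncs_per_txn >= 1:
                    os.fsync(f.fileno())   # sync at "finish" time
            elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return n_txns / elapsed if elapsed > 0 else float("inf")
```

Calling time_commits(fsyncs_per_txn=k) for k in 0, 1, 2 gives three rates
to compare on one box; pass directory= to aim it at a particular disk or
mount.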

Alas, this appears to be moving in the direction of adding an
incomprehensible (to most people) config option or two, with a "can't win"
default.