[ZODB-Dev] Use of fsync in FileStorage

Mon Jul 26 18:09:39 EDT 2004

(resend)

On Monday 26 July 2004 18:13, Tim Peters wrote:
> [Toby Dickenson]
> ...
> > 1. Calling fsync exactly twice is sufficient to preserve data integrity
> > against "pull the power cord at an arbitrary time" problems.
> >
> > 2. Calling it only once is a recipe for data corruption.
> 
> Your faith in fsync() is ... interesting.  You snipped the references from
> the original msg to (a) the POSIX spec, which doesn't promise anything of
> the sort from fsync();

The Linux behaviour of various filesystems has been well debated on the Linux 
kernel list. The BSDs have a traditional behaviour which is subtly 
different. Both aim to offer much stronger guarantees than POSIX promises.

> and, (b) the long python-dev discussion which 
> specifically questioned fsync()'s "reliability" on Linux.

I don't track python-dev.... I will read this and get back to you.

> > The backup tools recommended in that document also suffer from fsync
> > naivety. It is possible for a corrupt backup to be created if power is
> > lost soon after the backup completes.
> 
> The only backup tool recommended there is repozo.py.  It does its writes,
> explicitly closes the output file, then exits.  It does not do an fsync().
> I don't know why you believe adding an fsync() would make that bulletproof,
> if you do believe that. 

fsync alone is not sufficient. The easiest approach that is sufficient is to:
1. write the backup file content to a temporary file.
2. fflush, fsync
3. rename the temporary file to its proper filename.
4. for even nicer semantics, fsync the directory.

If the power is lost at any point, on recovery you will have either:
a. A file with the right contents at the right directory location, or
b. a temporary file that needs deleting.

The optional step 4 guarantees that you always have a good file if the power 
loss occurs after the backup process has terminated.
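
A minimal sketch of the four steps above in Python, assuming a POSIX system 
where a directory can be opened and passed to fsync. The function name 
write_backup_atomically is hypothetical and is not part of repozo.py. It 
relies on the same assumption as the steps themselves: that fsync really 
pushes the data to stable storage.

```python
import os
import tempfile


def write_backup_atomically(path, data):
    """Write data to path so a crash leaves the old file, the new file,
    or a stray temporary file, but never a half-written target."""
    dirname = os.path.dirname(os.path.abspath(path))

    # Step 1: write the content to a temporary file in the same directory,
    # so that the later rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            # Step 2: flush the userspace buffer, then fsync the file.
            f.flush()
            os.fsync(f.fileno())
        # Step 3: rename the temporary file to its proper filename.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise

    # Step 4 (optional): fsync the directory so the rename itself is durable.
    # Assumption: POSIX only; opening a directory this way fails on Windows.
    dir_fd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A recovery pass would then only need to delete leftover ".tmp-" files; 
that prefix is also just a choice made for this sketch.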

> > We neither expect nor need "all or nothing" behaviour from fsync over the
> > whole ZODB transaction....  In the design sketched out by Marius above
> > the second fsync is covering a change to only a single byte, and all
> > modern hardware can do that atomically.
> 
> Writing a byte atomically to a HW disk buffer isn't the same as writing a
> byte atomically to disk, and HW has its own ideas about when "a write" has
> finished.  The HW buffers may not even get written to disk in the order they
> were written (smart controllers dynamically reorder writes to minimize head
> movement), so it's still possible to get the "transaction succeeded" byte
> written to disk before all the data in the transaction appears on disk.

The design sketched by Marius has an fsync between writing the data and 
writing that last byte. This acts as a barrier: all completed writes are 
guaranteed to be on disk before the fsync returns. (The BSD and Linux 
semantics have subtle differences, but I believe this is one area where they 
agree.)

That's the idea anyway.... I'll check that python-dev thread to see if I have 
been misled.
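
To make the two-fsync idea concrete, here is a rough Python sketch. The 
record layout (a one-byte status flag, a 4-byte length, then the payload) and 
the names append_record and read_committed are invented for illustration; 
they are not FileStorage's format nor necessarily what Marius proposed. 
Whether fsync behaves as the barrier this code assumes is exactly the point 
under discussion.

```python
import os
import struct

PENDING = b"p"    # hypothetical flag: record written but not yet committed
COMMITTED = b"c"  # hypothetical flag: record is complete and valid


def _write_all(fd, data):
    """Loop until every byte is written; os.write may write fewer."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_record(path, payload):
    """Append one record using a data write, an fsync, a one-byte flag
    write, and a second fsync. Returns the record's start offset."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        start = os.lseek(fd, 0, os.SEEK_END)
        _write_all(fd, PENDING + struct.pack(">I", len(payload)) + payload)
        # First fsync: the data should be on disk before the flag changes.
        os.fsync(fd)
        os.lseek(fd, start, os.SEEK_SET)
        _write_all(fd, COMMITTED)  # the single-byte "transaction succeeded"
        # Second fsync: the flag itself should be on disk before we return.
        os.fsync(fd)
    finally:
        os.close(fd)
    return start


def read_committed(path):
    """Return payloads of committed records, stopping at the first record
    that is pending or truncated."""
    payloads = []
    with open(path, "rb") as f:
        while True:
            header = f.read(5)
            if len(header) < 5:
                break
            flag = header[:1]
            (length,) = struct.unpack(">I", header[1:])
            payload = f.read(length)
            if flag != COMMITTED or len(payload) < length:
                break
            payloads.append(payload)
    return payloads
```

With only the second fsync, the flag could reach the disk before the data it 
vouches for, which is the corruption case from point 2 above.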

<snip stuff I broadly agree on>

> > This storage has survived two days of intensive pull-the-power-cord
> > testing while under heavy write pressure, while running on ordinary IDE
> > hardware.
> 
> 1. There's nothing you can do manually that's "intensive" relative to
>    hardware speeds -- even relatively slow disk speeds.  
>    How many times were you able to power cycle in this test?

I had 8 hours of trying to beat it... tweaking the 'heavy write pressure test' 
to aim at other DirectoryStorage weak spots. As always, it's hard to prove 
that no race conditions remain. I would guess one power cycle every 3 
minutes over that 8 hour day, so roughly 160 cycles.

A second, similar day was spent stressing backup and packing. That was slower 
through being less repetitive.

>    Is "survive" the same thing as "no corruption"?

Yes.

> 2. Did you also try this test without your fsync() calls?  If so, what
>    was the mean number of power cycles between observed corruptions?

I'm confident of seeing corruption after a single power cycle.... but no, I 
haven't tried it. Yet.

-- 
Toby Dickenson