[ZODB-Dev] Data file corruption and recovery
Erik Dahl
edahl@zentinel.net
Thu, 13 Feb 2003 17:44:03 -0500
Jeremy,
I never got the ZEO server to tell me anything (I started it in debug
mode to make sure). I wasn't in a very patient mood, though, so I didn't
wait more than 3 minutes or so. I figured the problem was some sort of
corruption, since the older file loaded with no problem. I don't see any
.trN file, but maybe that is due to my impatience. I have the original
file, but it is fairly big (67MB when gzipped). fsrecover didn't say
anything other than that no data was lost during recovery. As Toby
suggests, the machine may have been writing bad data for a while as a
result of the failing CPU. Is it possible that there was no problem
and that I should have been more patient with the ZEO startup time?
-EAD
Jeremy Hylton wrote:
>On Thu, 2003-02-13 at 09:33, Erik Dahl wrote:
>
>
>>Yesterday I had a CPU failure on a box that caused a sudden reboot of
>>a ZEO server. When the service was brought up on the other side of the
>>cluster, it didn't start. I figured this was due to data corruption, and
>>when I used a backup the server started fine. The problem was that the
>>backup was a little stale, so I wanted to try recovering the corrupt
>>file. I found two methods for fixing the file: running fsrecover.py, or
>>running tranalyzer.py and then using its output to truncate the data
>>file. fsrecover.py did fix my problem, but only after running for around
>>6 hours and generating no output other than to say that no data was
>>lost. The tranalyzer method never worked. My questions are:
>>
>>
>
>What happened when you tried to start up the ZEO server? I must admit
>that I haven't run into this problem in real deployment, and I don't
>remember what the storage / server is supposed to say.
>
>Did it create a .trN file? That would indicate it figured out what
>transactions to delete.
>
>Do you have the original file? If you run into file storage corruption
>problems, it's helpful to us developers if you keep a copy of the
>damaged file.
>
>
>
>>1. How can you figure out what the server is doing when you have a
>>corrupted file? (I tried setting STUPID_LOG_SEVERITY to -300 with no
>>results.)
>>
>>
>
>It did say something, right? Otherwise you would not have known that
>the storage was corrupted. There should be some complaint during the
>initial startup, but if it starts up successfully I wouldn't expect
>further error reports in the log.
>
>
>
>>2. Any idea why taking transactions off the end of the file didn't fix
>>the problem?
>>
>>
>
>Do you know what changes fsrecover made to fix the problem?
>
>
>
>>3. Would directory storage handle this situation better, or do I need to
>>go to a Berkeley DB backend?
>>
>>
>
>It's surprising that truncating the file didn't solve the problem.
>FileStorage should be pretty robust against these sorts of crashes. It
>calls sync() in tpc_finish(), so it's quite unlikely that a reboot would
>do anything other than leave incomplete transaction data at the end.
>
>Jeremy
>
>
>
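
The reasoning in the quoted reply (data is synced at commit time, so a
crash should at worst leave an incomplete transaction at the end of the
file, which can be truncated away) can be sketched in Python. This is
only an illustration under stated assumptions: the record layout
(length, payload, repeated length) is a toy format invented for the
example and is not FileStorage's actual on-disk format, and the names
append_record, last_good_offset, and truncate_incomplete_tail are
hypothetical helpers, not part of ZODB.

```python
import os
import struct

# Toy layout (an assumption for this sketch, NOT FileStorage's format):
# 8-byte big-endian length, payload, then the same 8-byte length again.
LEN = struct.Struct(">Q")


def append_record(path, payload):
    """Append one record and force it to disk before returning."""
    with open(path, "ab") as f:
        f.write(LEN.pack(len(payload)))
        f.write(payload)
        f.write(LEN.pack(len(payload)))  # trailing copy marks completion
        f.flush()
        os.fsync(f.fileno())  # the "sync at commit" step


def last_good_offset(path):
    """Return the offset just past the last complete record."""
    size = os.path.getsize(path)
    pos = 0
    with open(path, "rb") as f:
        while True:
            f.seek(pos)
            header = f.read(LEN.size)
            if len(header) < LEN.size:
                break  # nothing left, or a partial header
            (n,) = LEN.unpack(header)
            end = pos + LEN.size + n + LEN.size
            if end > size:
                break  # record claims more bytes than the file holds
            f.seek(pos + LEN.size + n)
            if f.read(LEN.size) != header:
                break  # trailing length does not match the header
            pos = end
    return pos


def truncate_incomplete_tail(path):
    """Cut off anything after the last complete record.

    Returns the number of bytes removed (0 if the file was clean).
    """
    good = last_good_offset(path)
    removed = os.path.getsize(path) - good
    if removed:
        with open(path, "r+b") as f:
            f.truncate(good)
            f.flush()
            os.fsync(f.fileno())
    return removed
```

Note that a scan like this only detects an incomplete tail. If the
machine was writing bad data into the middle of the file, as suggested
earlier in the thread, truncating the end would not be expected to
help, which is one possible reading of why the truncation approach
failed here.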