Coroner's toolkit for zope, or how to figure out what went wrong.


Romain Slootmaekers

12 Aug 2002

3:28 p.m.

Yo,

we had a nasty crash of our Zope server, which we use for a B2B web application. The Data.fs ZODB lost a significant amount of data.

At this point, we have restored the Data.fs from our last backup and the server is back up and running. (breathing relieved)

What worries me is that we have no clue whatsoever about what happened, beyond the observation that somehow, somewhere we lost a whole tree of objects.

So does anyone have an object browser for the ZODB, or some tools/procedures for finding out what went wrong?

I really hope we are not going to lose a lot of sleep because of this.

TIA,

Sloot.


Joachim Werner

12 Aug

3:50 p.m.

New subject: [Zope-dev] Coroner's toolkit for zope, or how to figure out what went wrong.

Hi!

I know of exactly two cases that could really cause a ZODB to lose data: reaching the 2GB limit with a Python not compiled for large files, and reaching the physical limit of your storage. That is, unless your case adds a third one ...

Have you already tried the usual things, i.e. running fstest.py and/or fsrecover.py?

It's quite unlikely that you'd lose a whole tree, as the data is not physically stored in trees, but appended sequentially. You might have deleted a tree, but that can be rolled back by getting rid of the ZODB transaction that did the delete.

Cheers

Joachim Werner

----- Original Message -----
From: "Romain Slootmaekers" <romain@zzict.com>
To: <zope-dev@zope.org>
Sent: Monday, August 12, 2002 5:28 PM
Subject: [Zope-dev] Coroner's toolkit for zope, or how to figure out what went wrong.

...

Yo,

we had a nasty crash of our zope server that we use for a b2b web application. The Data.fs ZODB lost a significant amount of data.

At this point, we restored the Data.fs from our last backup and the server is back up and running. (breathing relieved)

What worries me is that we have no clue whatsoever on what happened, besides the observation that somehow, somewhere we lost a whole tree of objects.

So does anyone have an object browser of the ZODB or some tools/procedures on how to find out what went wrong?

I really hope we are not going to lose a lot of sleep because of this.

TIA,

Sloot.
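Joachim's point that data is appended sequentially, and that a delete can be undone by dropping the transaction that made it, can be illustrated with a toy model. This is only a sketch: it is not FileStorage's real on-disk format or API, and `AppendOnlyStore`, `commit`, `current` and `last_writer` are hypothetical names invented for the illustration.

```python
# Toy model of an append-only object store. NOT FileStorage's real format;
# every name here is hypothetical and exists only for this illustration.
class AppendOnlyStore:
    def __init__(self):
        # Each entry is (txn_id, {oid: state}); a state of None marks a delete.
        self.log = []

    def commit(self, changes):
        """Append one transaction to the end of the log and return its id."""
        txn_id = len(self.log) + 1
        self.log.append((txn_id, dict(changes)))
        return txn_id

    def current(self, upto=None):
        """Replay the log (optionally only up to transaction `upto`)."""
        state = {}
        for txn_id, changes in self.log:
            if upto is not None and txn_id > upto:
                break
            for oid, value in changes.items():
                if value is None:
                    state.pop(oid, None)
                else:
                    state[oid] = value
        return state

    def last_writer(self, oid):
        """Return the id of the most recent transaction that touched `oid`."""
        for txn_id, changes in reversed(self.log):
            if oid in changes:
                return txn_id
        return None


store = AppendOnlyStore()
store.commit({"root": "container", "item1": "a"})
store.commit({"item2": "b"})
bad = store.commit({"item1": None, "item2": None})  # the "lost tree"

print(store.current())               # state after the destructive transaction
print(store.last_writer("item1"))    # which transaction removed item1
print(store.current(upto=bad - 1))   # state as it was just before it
```

Because nothing is overwritten in place, the earlier records are still in the log after the destructive transaction, which is why both "which transaction did it?" and "what did things look like before?" remain answerable in this model.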


Toby Dickenson

4:08 p.m.

New subject: [Zope-dev] Coroner's toolkit for zope, or how to figure out what went wrong.

On Monday 12 Aug 2002 4:50 pm, Joachim Werner wrote:

...

Hi!

I know of exactly two cases that could really cause a ZODB to lose data: if you reach the 2GB limit with a Python not compiled for larger files and if you reach the physical limit of your storage. That is, if your case doesn't add a third one ...

FileStorage is robust and mature, but it's not as good as this statement suggests. There have been a number of bugs that cause packing to delete more than it should (a few very small holes still remain), bugs that cause FileStorage to overwrite the middle of its log file, and bugs that cause its position index to get muddled.

...

Have you already tried the usual things, i.e. run fstest.py and/or fsrecover.py? It's quite unlikely that you'd lose a whole tree, as the data is not physically stored in trees, but added sequentially. You might have deleted a tree, but that can be rolled back by getting rid of the ZODB transaction that did the delete.

The first thing I would recommend trying today is shutting down, removing data.fs.index, and restarting. In recent versions data.fs.index makes very heavy use of BTrees, and all released versions of the BTree code have small bugs.

<plug>I am currently developing DirectoryStorage, and one design goal is fault tolerance. http://dirstorage.sourceforge.net/ </plug>

Romain Slootmaekers

5:17 p.m.

New subject: [Zope-dev] Follow up: Coroner's toolkit for zope, or how to figure out what went wrong.

Toby Dickenson wrote:

...

On Monday 12 Aug 2002 4:50 pm, Joachim Werner wrote:

...
Hi!

I know of exactly two cases that could really cause a ZODB to lose data: if you reach the 2GB limit with a Python not compiled for larger files and if you reach the physical limit of your storage. That is, if your case doesn't add a third one ...

Well, it isn't the 2GB limit, nor the storage limit... BTW, I wish I still had your good faith in software :(

...

FileStorage is robust and mature, but it's not as good as this statement suggests. There have been a number of bugs that cause packing to delete more than it should (a few very small holes still remain), bugs that cause FileStorage to overwrite the middle of its log file, and bugs that cause its position index to get muddled.

Ouch. Packing couldn't be the problem though... (we haven't packed recently)

After spending some time looking at the logs, I could dig up the following traceback:

    File "/home/zope/Zope-2.5.1-linux2-x86/lib/python/ZODB/Connection.py", line 463, in setstate
      raise ReadConflictError(object=object)
    ReadConflictError: database read conflict error (oid 000000000000bc8d,

Our code that causes it basically changes some attributes and then does a self._p_changed=1 in some persistent object.

The problem is this: apparently that object is also touched in some other thread, causing the conflict error. So far I understand what's happening, but then it gets blurry: the conflict handling somehow drops everything in that object tree.

Basically we have

    class DB(Persistent, Implicit):
        .....
        def __init__(self):
            self.__dbItems=[]

    class DBItem(Persistent, Implicit):
        ....
        def setSomething(self,...):
            ...
            self._p_changed=1

and DB contains a set of DBItem objects, and touching one of them drops the DB object.

...

The first thing I would recommend trying today is shutting down, removing data.fs.index, and restarting. In recent versions data.fs.index makes very heavy use of BTrees, and all released versions of the BTree code have small bugs.

Hm, isn't there a policy of adding tests that expose the bugs to the set of unit tests?

If we (our company, that is) can't resolve the problem, we'll have to reconsider our strategy on data storage and perhaps even drop the use of the ZODB for anything but scripts and static content. All managed content types would then have to be stored in something more robust, like a relational database, and we all know how well object trees fit into relational databases. :(

Anyway, if we find more, we'll post it here.

Sloot.

Chris McDonough

6:06 p.m.

New subject: [Zope-dev] Follow up: Coroner's toolkit for zope, or how to figure out what went wrong.

...

ReadConflictError: database read conflict error (oid 000000000000bc8d,

The conflict error you have likely has nothing to do with your data loss, it's a normal artifact of Zope operation.

...

...
The first thing I would recommend trying today is shutting down, removing data.fs.index, and restarting. In recent versions data.fs.index makes very heavy use of BTrees, and all released versions of the BTree code have small bugs.

Did you try this?

...

hm, isn't there a policy on adding tests that expose the bugs to the set of unittests ?

If you can replicate it, sure.

...

If we (our company that is) can't resolve the problem, we'll have to reconsider our strategy on data storage and perhaps even drop the use of the ZODB for anything but scripts and static content. all managed content types then have to be stored in something more robust like some relational database, and we all know how well object trees fit into relational databases. :(

Yep. Fear, uncertainty, and doubt are sometimes more powerful than logic. ;-)

- C

Jim Fulton

5:53 p.m.

New subject: [Zope-dev] Coroner's toolkit for zope, or how to figure out what went wrong.

Romain Slootmaekers wrote:

...

Yo,

we had a nasty crash of our zope server that we use for a b2b web application. The Data.fs ZODB lost a significant amount of data.

What sort of crash? Was this a hardware failure, or a software failure?

...

At this point, we restored the Data.fs from our last backup and the server is back up and running. (breathing relieved)

What worries me is that we have no clue whatsoever on what happened, besides the observation that somehow, somewhere we lost a whole tree of objects.

Was this in the backup? Or in the damaged data file?

Jim

--
Jim Fulton           mailto:jim@zope.com       Python Powered!
CTO                  (888) 344-4332            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org

Romain Slootmaekers

6:10 p.m.

New subject: [Zope-dev] Coroner's toolkit for zope, or how to figure out what went wrong.

Jim Fulton wrote:

...

Romain Slootmaekers wrote:

...
Yo,

we had a nasty crash of our zope server that we use for a b2b web application. The Data.fs ZODB lost a significant amount of data.

What sort of crash? Was this a hardware failure, or a software failure?

Software. Basically, the server didn't crash, but our applications couldn't function anymore because some objects that really have to exist were gone.

The Data.fs was NOT corrupted, but (as far as I can tell) a bug in the conflict resolution code caused our object (the one on which we set self._p_changed=1) to be empty. This object is a container of other objects that are themselves Persistent, and at this point we don't believe the conflict resolution mechanism handles these cases correctly.

...

...
At this point, we restored the Data.fs from our last backup and the server is back up and running. (breathing relieved)

What worries me is that we have no clue whatsoever on what happened, besides the observation that somehow, somewhere we lost a whole tree of objects.

Was this in the backup? Or in the damaged data file?

Nope. The loss of data occurred in the 12 hours after our last backup, so we only (well, it actually is quite a lot :( ) lost the transactions that happened between the backup and the restore/restart.

The stack trace in the follow-up mail gives some clue as to where the problem is situated in the code (as well as the exact version of the Zope installation).

Anyway, Murphy's law is once again proven, as this thing happened on the first day of my vacation. :|

Sloot.

Jim Fulton

7:12 p.m.

New subject: [Zope-dev] Coroner's toolkit for zope, or how to figure out what went wrong.

Romain Slootmaekers wrote:

...

Jim Fulton wrote:

...
Romain Slootmaekers wrote:

...
Yo,

we had a nasty crash of our zope server that we use for a b2b web application. The Data.fs ZODB lost a significant amount of data.

What sort of crash? Was this a hardware failure, or a software failure?

software. basically, the server didn't crash, but our applications couldn't function anymore because some objects that really have to exist were gone.

the Data.fs was NOT corrupted, but (so far I can tell) a bug in the conflict resolution code caused our object (the one upon we set self._p_changed=1) to be empty. This object is a container of other objects that are Persistent themselves and at this point, we don't believe the conflict resolution mechanism handles these cases correctly.

I think you are pretty far off here. You said you saw a read conflict. No conflict resolution is done for a read conflict. Further, from the very brief description of your DB class, it doesn't appear to use any objects that actually try to resolve conflicts. I seriously doubt that this has anything to do with conflict resolution.

It is very doubtful that a database error would cause your data to simply disappear without some sort of error, like a database corruption error or an error about invalid object ids (dangling references).

Have you considered an application error?

If you still have the data file with the lost data, it should be possible to analyze it to figure out what went wrong. In particular, it would be helpful to figure out just what transaction made the data go away, to figure out what it might have been doing.

...

...

The stack trace in the follow up mail gives some clue on where the problem is situated in the code. (as well as the exact version of the Zope installation)

No, this is a red herring. A read conflict can't cause loss of data. It simply causes the transaction with the read conflict to be re-executed.

Jim
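The re-execution Jim describes can be sketched as a plain retry loop. This is a minimal illustration, not Zope's actual publisher code: `ConflictError` here is a hypothetical stand-in class, and `run_with_retries` and `flaky_update` are names made up for the example.

```python
class ConflictError(Exception):
    """Hypothetical stand-in for a database conflict exception."""


def run_with_retries(work, max_attempts=3):
    """Call work(); on ConflictError, throw the attempt away and run it again.

    A failed attempt leaves nothing behind, so a conflict by itself
    cannot lose data: the work either succeeds later or is reported.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except ConflictError:
            if attempt == max_attempts:
                raise  # give up and report the error to the caller


attempts = []


def flaky_update():
    """Simulated unit of work that conflicts on its first two runs."""
    attempts.append(len(attempts) + 1)
    if len(attempts) < 3:
        raise ConflictError("object changed by another transaction")
    return "committed"


result = run_with_retries(flaky_update)
print(result, "after", len(attempts), "attempts")
```

The point of the sketch is only that a conflict ends in a retry or a visible error, never in a silent partial write.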

Romain Slootmaekers

10:31 p.m.

New subject: [Zope-dev] Coroner's toolkit for zope, or how to figure out what went wrong.

Jim Fulton wrote:

...

Romain Slootmaekers wrote:

I think you are pretty far off here. You said you saw a read conflict. No conflict resolution is done for a read conflict. Further, from the very brief description of your DB class, it doesn't appear to use any objects that actually try to resolve conflicts. I doubt seriously that this has anything to do with conflict resolution. It is very doubtful that a database error would cause your data to simply disappear without some sort of error, like a database corruption error or an error about invalid object ids (dangling references). Have you considered an application error?

Yes, that's the first thing one does: doubt one's own code. The object in question is created once, and there is no code to delete it, since deleting it is of no use in that application. The only thing that happens is that we add/modify/delete other objects under that root node.

...

If you still have the data file with the lost data, it should be possible to analyze it to figure out what went wrong. In particular, it would be helpful to figure out just what transaction made the data go away to figure out what it might have been doing.

That was exactly the question I was asking in the first place: tools to browse the ZODB, to see where it broke.

...

It simply causes the transaction with the read conflict to be reexecuted.

OK, I have figured that out by now as well. The read conflict error indeed has nothing to do with our problem. Sorry 'bout that...

But we found something else:

I included a script below that reproduces a stripped-down analogue of our problem. (No Zope needed, just ZODB, and you might want to modify the first line to get it working.)

The script produces the following output:

    C:\zope\devel>bin\python.exe \temp\test.py
    <Foo instance at 008DCAC8> 0
    <Foo instance at 008E1280> 0
    Traceback (most recent call last):
      File "\temp\test.py", line 68, in ?
        get_transaction().commit()
      File "C:\zope\devel\lib\python\ZODB\Transaction.py", line 234, in commit
        j.commit(o,self)
      File "C:\zope\devel\lib\python\ZODB\Connection.py", line 348, in commit
        s=dbstore(oid,serial,p,version,transaction)
      File "C:\zope\devel\lib\python\ZODB\FileStorage.py", line 665, in store
        data=self.tryToResolveConflict(oid, oserial, serial, data)
      File "C:\zope\devel\lib\python\ZODB\ConflictResolution.py", line 108, in tryToResolveConflict
        resolved=resolve(old, committed, newstate)
      File "\temp\test.py", line 30, in _p_resolveConflict
        print savedState['data'].getHello()
    AttributeError: PersistentReference instance has no attribute 'getHello'

The question is: is this intended ZODB behaviour or not, and is there a workaround?

Have fun,

Sloot.

    swhome=r'C:\zope\devel'
    import sys
    sys.path.insert(0, '%s/lib/python' % swhome)
    sys.path.insert(1, '%s/bin/lib' % swhome)

    import ZODB
    from Persistence import Persistent

    class Dummy(Persistent):
        def __init__(self):
            self.hello = "Hello there..."

        def getHello(self):
            return self.hello

    class Foo(Persistent):
        def __init__(self):
            self.data = Dummy()
            self.count = 0

        def incCounter(self):
            self.count += 1

        def getCount(self):
            return self.count

        def _p_resolveConflict(self, oldState, savedState, newState):
            print savedState['data'].getHello()
            print newState['data'].getHello()
            print oldState['data'].getHello()

            diffsaved = savedState['count'] - oldState['count']
            diffnew = newState['count'] - oldState['count']
            newState['count'] = oldState['count'] + diffsaved + diffnew
            return newState

    from ZODB import FileStorage, DB

    storage = FileStorage.FileStorage('/temp/test.fs')
    db = DB( storage )

    # Initialise the test object
    conn = db.open()
    root = conn.root()
    root['foo'] = Foo()
    get_transaction().commit()
    conn.close()

    conn1 = db.open()
    root1 = conn1.root()
    foo1 = root1['foo']

    conn2 = db.open()
    root2 = conn2.root()
    foo2 = root2['foo']

    print foo1, foo1.getCount()
    print foo2, foo2.getCount()

    foo1.incCounter()
    get_transaction().commit()

    foo2.incCounter()
    get_transaction().commit()

    print foo1, foo1.getCount()
    print foo2, foo2.getCount()

Toby Dickenson

11:05 p.m.

New subject: [Zope-dev] Coroner's toolkit for zope, or how to figure out what went wrong.

On Monday 12 Aug 2002 11:31 pm, Romain Slootmaekers wrote:

...

But we found something else:

I included a script below that produces a stripped down analogy of our problem. (no zope needed, just ZODB, and you might wanna modify the first line to get it working)

I've read your sample, but not tried it.

...

The question is: is intended ZODB behaviour or not, and is there a work around ?

Yes, this is the intended behaviour. While executing the conflict resolution logic, your conflicting objects exist in a void, completely separate from all other ZODB objects. Any attributes which you might expect to be other persistent objects are replaced by attribute-less placeholders. This is the 'PersistentReference' object in your traceback.

Note that these attributes are only a 'PersistentReference' while processing the conflict resolution logic. If you access this attribute at any other time you get your normal class, with all its normal attributes.
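A resolver that respects this restriction could look roughly like the sketch below. It is an illustration in plain Python rather than ZODB code: `Ref` is a hypothetical stand-in for the placeholder (it carries an id and supports equality, nothing more), `resolve_counter` is a made-up free function with the same three-state shape as `_p_resolveConflict`, and `ValueError` stands in for whatever error a real resolver would raise when it cannot merge.

```python
class Ref:
    """Hypothetical stand-in for the placeholder that replaces a persistent
    sub-object during conflict resolution: an id plus equality, no methods."""

    def __init__(self, oid):
        self.oid = oid

    def __eq__(self, other):
        return isinstance(other, Ref) and self.oid == other.oid

    def __hash__(self):
        return hash(self.oid)


def resolve_counter(old, saved, new):
    """Three-way merge of a counter held directly in the object's state.

    Only plain values in the state dicts are read. The sub-object reference
    is compared for equality but never called or inspected.
    """
    if not (old["data"] == saved["data"] == new["data"]):
        # ValueError is an assumption standing in for a real conflict error.
        raise ValueError("sub-object reference changed; cannot merge")
    merged = dict(new)
    merged["count"] = (old["count"]
                       + (saved["count"] - old["count"])
                       + (new["count"] - old["count"]))
    return merged


old = {"data": Ref(1), "count": 0}    # state both transactions started from
saved = {"data": Ref(1), "count": 1}  # state committed by the first one
new = {"data": Ref(1), "count": 1}    # state the second one wants to write

merged = resolve_counter(old, saved, new)
print(merged["count"])
```

The counter arithmetic is the same as in the posted script; the only difference is that the resolver never calls a method such as getHello() on the placeholder.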

Jim Fulton

13 Aug

11:35 a.m.

New subject: [Zope-dev] Coroner's toolkit for zope, or how to figure out what went wrong.

Romain Slootmaekers wrote:

...

Jim Fulton wrote:

...
Romain Slootmaekers wrote:

...

the object in question is created once, and there is no code to delete it since in that application, it is of no use. The only thing that happens is that we add/modify/delete other object to that rootnode.

Right. In the problem database, did that object disappear, or did its contents disappear?

...

...
If you still have the data file with the lost data, it should be possible to analyze it to figure out what went wrong. In particular, it would be helpful to figure out just what transaction made the data go away to figure out what it might have been doing.

that was exactly the question I was asking in the first place : tools to browse the ZODB, to see where it broke.

It's not clear that the ZODB is broken. You didn't see any evidence of a broken ZODB. No invalid object ids. No invalid transaction or record data. The data in the database is not what you expect. ...

...

But we found something else:

I included a script below that produces a stripped down analogy of our problem. (no zope needed, just ZODB, and you might wanna modify the first line to get it working)

The script produces the following output:

    C:\zope\devel>bin\python.exe \temp\test.py
    <Foo instance at 008DCAC8> 0
    <Foo instance at 008E1280> 0
    Traceback (most recent call last):
      File "\temp\test.py", line 68, in ?
        get_transaction().commit()
      File "C:\zope\devel\lib\python\ZODB\Transaction.py", line 234, in commit
        j.commit(o,self)
      File "C:\zope\devel\lib\python\ZODB\Connection.py", line 348, in commit
        s=dbstore(oid,serial,p,version,transaction)
      File "C:\zope\devel\lib\python\ZODB\FileStorage.py", line 665, in store
        data=self.tryToResolveConflict(oid, oserial, serial, data)
      File "C:\zope\devel\lib\python\ZODB\ConflictResolution.py", line 108, in tryToResolveConflict
        resolved=resolve(old, committed, newstate)
      File "\temp\test.py", line 30, in _p_resolveConflict
        print savedState['data'].getHello()
    AttributeError: PersistentReference instance has no attribute 'getHello'

The question is: is this intended ZODB behaviour or not,

Yes, it is. When doing conflict resolution, you only get to use the state of the object involved in the conflict. You don't get to see the state of persistent subobjects. All sub-object references are converted to persistent references. This is so, during conflict resolution, you can compare the object references for equality.

...

and is there a work around ?

No.

Were you implementing application-level conflict resolution (_p_resolveConflict) for any of your application objects?

Jim
