Conflict errors on startup, revisited
This is a topic that comes up occasionally, but since the symptom seems to be rare and hard to provoke, it never gets resolved. Well it's biting me again and I'm determined to fix it, but would like some input from the list. Related collector issues: http://www.zope.org/Collectors/Zope/300 http://www.zope.org/Collectors/Zope/1201 The issue: On rare occasions I have seen ConflictErrors raised during product installation. Today, I can finally provoke them at will with a pretty minimal setup. The setup: Zope 2.7.3, using ZEO, with DirectoryStorage on the ZEO server. Hardware: dual-processor PIII 800, some kinda scsi disk, ext3, some old redhat flavor of linux. To provoke it I need merely do ./bin/zopectl run foo.py ... where foo.py is empty. Products/ now contains only one dummy product, which I've named Conflictor, which contains only an empty __init__.py. I get a ConflictError during registration of Conflictor. Interestingly I can only provoke this ConflictError in this particular INSTANCE_HOME. I have several other instances and am not seeing the symptom there. I started trivially hacking OFS/Application.py to add some debugging output (a diff against 2.7.3 is attached), and here's what I get: [pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ ls Products/ [pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ mkdir Products/Conflicter [pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ touch Products/Conflicter/__in it__.py [pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ touch foo.py [pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ ./bin/zopectl run foo.py installing product PluginIndexes...OK! installing product Conflicter... BZZT Traceback (most recent call last): File "<string>", line 1, in ? File "/usr/local/lib/python2.3/site-packages/PIL/__init__.py", line 51, in app File "/usr/local/lib/python2.3/site-packages/PIL/__init__.py", line 47, in startup File "/zope/Zope-2.7.3-SoftwareHome/lib/python/Zope/App/startup.py", line 96, in startup File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 279, in init ialize initializer.initialize() File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 306, in init ialize self.install_products() File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 561, in inst all_products return install_products(app) File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 597, in inst all_products folder_permissions, raise_exc=debug_mode) File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 803, in inst all_product get_transaction().commit() File "/zope/Zope-2.7.3-SoftwareHome/lib/python/ZODB/Transaction.py", line 252, in commit File "/zope/Zope-2.7.3-SoftwareHome/lib/python/ZODB/Connection.py", line 738, in tpc_vote File "/zope/Zope-2.7.3-SoftwareHome/lib/python/ZEO/ClientStorage.py", line 841 , in tpc_vote File "/zope/Zope-2.7.3-SoftwareHome/lib/python/ZEO/ClientStorage.py", line 825 , in _check_serials ZODB.POSException.ConflictError: database conflict error (serial this txn starte d with 0x035b25f36751f988 2005-02-10 18:59:24.215675, serial currently committed 0x035b743c6d186822 2005-02-24 17:00:25.569220) I guess what I'm asking, before I hack more on OFS/Application.py, is this: * any ideas re. what on earth could be causing this ConflictError? the product doesn't DO anything fer chrissakes. (Maybe registration is finishing too fast??) * Is the PIL stuff in the traceback a red herring or what? -- Paul Winkler http://www.slinkp.com
Two words: *hurt system*. Please, let me kill it. On Thu, 2005-02-24 at 14:29, Paul Winkler wrote:
This is a topic that comes up occasionally, but since the symptom seems to be rare and hard to provoke, it never gets resolved. Well it's biting me again and I'm determined to fix it, but would like some input from the list.
Related collector issues: http://www.zope.org/Collectors/Zope/300 http://www.zope.org/Collectors/Zope/1201
The issue: On rare occasions I have seen ConflictErrors raised during product installation.
Today, I can finally provoke them at will with a pretty minimal setup.
The setup: Zope 2.7.3, using ZEO, with DirectoryStorage on the ZEO server. Hardware: dual-processor PIII 800, some kinda scsi disk, ext3, some old redhat flavor of linux.
To provoke it I need merely do ./bin/zopectl run foo.py
... where foo.py is empty. Products/ now contains only one dummy product, which I've named Conflictor, which contains only an empty __init__.py. I get a ConflictError during registration of Conflictor.
Interestingly I can only provoke this ConflictError in this particular INSTANCE_HOME. I have several other instances and am not seeing the symptom there.
I started trivially hacking OFS/Application.py to add some debugging output (a diff against 2.7.3 is attached), and here's what I get:
[pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ ls Products/ [pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ mkdir Products/Conflicter [pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ touch Products/Conflicter/__in it__.py [pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ touch foo.py [pwinkler@dev-zope-knox01 PaulMiniInstanceHome]$ ./bin/zopectl run foo.py installing product PluginIndexes...OK! installing product Conflicter... BZZT Traceback (most recent call last): File "<string>", line 1, in ? File "/usr/local/lib/python2.3/site-packages/PIL/__init__.py", line 51, in app
File "/usr/local/lib/python2.3/site-packages/PIL/__init__.py", line 47, in startup
File "/zope/Zope-2.7.3-SoftwareHome/lib/python/Zope/App/startup.py", line 96, in startup File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 279, in init ialize initializer.initialize() File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 306, in init ialize self.install_products() File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 561, in inst all_products return install_products(app) File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 597, in inst all_products folder_permissions, raise_exc=debug_mode) File "/zope/ZopeSoftwareHome/lib/python/OFS/Application.py", line 803, in inst all_product get_transaction().commit() File "/zope/Zope-2.7.3-SoftwareHome/lib/python/ZODB/Transaction.py", line 252, in commit File "/zope/Zope-2.7.3-SoftwareHome/lib/python/ZODB/Connection.py", line 738, in tpc_vote File "/zope/Zope-2.7.3-SoftwareHome/lib/python/ZEO/ClientStorage.py", line 841 , in tpc_vote File "/zope/Zope-2.7.3-SoftwareHome/lib/python/ZEO/ClientStorage.py", line 825 , in _check_serials ZODB.POSException.ConflictError: database conflict error (serial this txn starte d with 0x035b25f36751f988 2005-02-10 18:59:24.215675, serial currently committed 0x035b743c6d186822 2005-02-24 17:00:25.569220)
I guess what I'm asking, before I hack more on OFS/Application.py, is this:
* any ideas re. what on earth could be causing this ConflictError? the product doesn't DO anything fer chrissakes. (Maybe registration is finishing too fast??)
* Is the PIL stuff in the traceback a red herring or what?
On Thu, Feb 24, 2005 at 02:52:45PM -0500, Chris McDonough wrote:
Two words: *hurt system*. Please, let me kill it.
Interesting. How on earth did you determine that? You think it has to do with registering help for PluginIndexes? I'm gonna go test that theory - i'll hack registerHelp() to no-op. -- Paul Winkler http://www.slinkp.com
On Thu, 2005-02-24 at 14:57, Paul Winkler wrote:
On Thu, Feb 24, 2005 at 02:52:45PM -0500, Chris McDonough wrote:
Two words: *hurt system*. Please, let me kill it.
Interesting. How on earth did you determine that?
Womanly intuition.
You think it has to do with registering help for PluginIndexes? I'm gonna go test that theory - i'll hack registerHelp() to no-op.
That's a good verifier. - C
On Thu, Feb 24, 2005 at 02:58:01PM -0500, Chris McDonough wrote:
On Thu, 2005-02-24 at 14:57, Paul Winkler wrote:
On Thu, Feb 24, 2005 at 02:52:45PM -0500, Chris McDonough wrote:
Two words: *hurt system*. Please, let me kill it.
Interesting. How on earth did you determine that?
Womanly intuition.
You think it has to do with registering help for PluginIndexes? I'm gonna go test that theory - i'll hack registerHelp() to no-op.
That's a good verifier.
Well, no joy. I hacked ProductContext.register* to return from the first line, so even registerClass() now does nothing. Barf, I still get the ConflictError. Twiddling around with some old suggestions from Dieter Maurer (see http://aspn.activestate.com/ASPN/Mail/Message/Zope-CMF/1978842 ), I added a _p_jar.sync() where he suggested - right before the final "except:" in initialize_product(). No joy. But - duh - my traceback clearly shows the ConflictError happens before that, in the call to get_transaction().commit(). SO I moved the call to app._p_jar.sync() to right before hte call to get_transaction().commit(). Aaaand... no more ConflictError. Weird weird weird. I'd be happy to check this in to cvs / svc, except that A) I have no clue how to write a test that exercises the bug, and B) so far nobody seems to understand what's really happening. If nobody objects, I wouldnt' mind checking the fix in anyway :-) -- Paul Winkler http://www.slinkp.com
[Paul Winkler]
... But - duh - my traceback clearly shows the ConflictError happens before that, in the call to get_transaction().commit(). SO I moved the call to app._p_jar.sync() to right before hte call to get_transaction().commit(). Aaaand... no more ConflictError.
Well, that's certain: the first thing sync() does is abort the current transaction. Then of course commit() won't raise an exception: sync() threw away all the changes.
... If nobody objects, I wouldnt' mind checking the fix in anyway :-)
It would be much clearer instead to change the get_transaction().commit(). to get_transaction().abort() That would get rid of the ConflictError too, and for the same reason. It has the advantage of being obviously insane, though <wink>.
On Thu, Feb 24, 2005 at 03:16:15PM -0500, Paul Winkler wrote:
But - duh - my traceback clearly shows the ConflictError happens before that, in the call to get_transaction().commit(). SO I moved the call to app._p_jar.sync() to right before hte call to get_transaction().commit(). Aaaand... no more ConflictError.
Never mind, this breaks existing unit tests. Duh, it prevents products from getting initialized at all :-\ now checking out tim's suggestion... -- Paul Winkler http://www.slinkp.com
Paul Winkler wrote at 2005-2-24 15:16 -0500:
... Twiddling around with some old suggestions from Dieter Maurer (see http://aspn.activestate.com/ASPN/Mail/Message/Zope-CMF/1978842 ), I added a _p_jar.sync() where he suggested - right before the final "except:" in initialize_product(). No joy.
Note that the database is initialized in a separate connection -- in case it is empty. This may cause "ConflictError"s in the primary connection. -- Dieter
On Thu, Feb 24, 2005 at 02:29:50PM -0500, Paul Winkler wrote:
* Is the PIL stuff in the traceback a red herring or what?
Note that PIL is not referred to by anything in my Products directory. (Heck, *nothing* is referred to by anything in Products.) That's what's so baffling about it appearing in the traceback. -- Paul Winkler http://www.slinkp.com
[Paul Winkler]
...
ZODB.POSException.ConflictError: database conflict error ( serial this txn started with 0x035b25f36751f988 2005-02-10 18:59:24.215675, serial currently committed 0x035b743c6d186822 2005-02-24 17:00:25.569220)
No time here for thought, but those timestamps are two weeks apart. Yikes! Just thought I'd mention that <wink>. A write conflict (which this is) happens when (and in this time order): 1. A transaction reads an object O. 2. A different transaction modifies O and commits its new state. 3. The original transaction also modifies O, and tries to commit its new state. #3 fails, because the original transaction has, in general, no way to know whether the changes it tried to make to O's state are consistent with the changes the other transaction made to its state in step #2. For your specific exception above, step #1 started by reading the state of the object as it existed on 2005-02-10. In step #2, someone else modified the same object, and committed its state around 2005-02-24 17:00:25. The original transaction then failed in step #3. Do you use a persistent ZEO client cache? If so, one thing to try is to physically remove all the cache files -- maybe they've gotten into an insane state. It's not impossible that you "should" have a 2-week difference in timestamps here, but it does look unusual. It's certainly possible (even expected) if two transactions try to modify the same object, and the last time that object was changed was in fact two weeks ago.
[Tim Peters]
... Do you use a persistent ZEO client cache? If so, one thing to try is to physically remove all the cache files -- maybe they've gotten into an insane state.
Off list, Paul said "bingo!" to that one -- deleting his ZEO client cache files apparently made the problem go away. I'm not really happy with that, because I have no theory for how ZEO client cache files _get_ out of synch. I suspect they do, though, more often than stray gamma rays and pilot error could account for. Alas, I've got no time to give to that these days. If someone else wants to solve that, I'll pass on the cigar Paul offered to me <wink>.
On Thu, Feb 24, 2005 at 04:11:35PM -0500, Tim Peters wrote:
[Tim Peters]
... Do you use a persistent ZEO client cache? If so, one thing to try is to physically remove all the cache files -- maybe they've gotten into an insane state.
Off list, Paul said "bingo!" to that one -- deleting his ZEO client cache files apparently made the problem go away.
Yep, meant to mail the list, hit the wrong button. Thanks again Tim!
I'm not really happy with that, because I have no theory for how ZEO client cache files _get_ out of synch. I suspect they do, though, more often than stray gamma rays and pilot error could account for.
Alas, I've got no time to give to that these days. If someone else wants to solve that, I'll pass on the cigar Paul offered to me <wink>.
Oh-sure-pretend-you-didn't-want-the-bag-of-ham-ly y'rs, -- Paul Winkler http://www.slinkp.com
Tim Peters wrote:
I'm not really happy with that, because I have no theory for how ZEO client cache files _get_ out of synch.
The most common one for me is connecting a ZEO client to one storage server (say a dev one) and then changing the config to connect to another storage server (say a staging one) without making sure the cache files are gone. That seems ot do it most time, and I'm sure I've got POSKeyErrors and all manner of other weirdness too. My workaround is never to use persistent client caches on clients that I'm likely to connect to different servers. This seems a shame, I wish there was some kind of safety check that ZEO could do to go "wait, these caches are insane, they don't even apply to the server you're connecting to" cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk
Tim Peters wrote at 2005-2-24 15:29 -0500:
[Paul Winkler]
... ZODB.POSException.ConflictError: database conflict error ( serial this txn started with 0x035b25f36751f988 2005-02-10 18:59:24.215675, serial currently committed 0x035b743c6d186822 2005-02-24 17:00:25.569220)
Where is the "oid" in this "ConflictError" report? It is vital information to analyse which object causes the problem... -- Dieter
ZODB.POSException.ConflictError: database conflict error ( serial this txn started with 0x035b25f36751f988 2005-02-10 18:59:24.215675, serial currently committed 0x035b743c6d186822 2005-02-24 17:00:25.569220)
[Dieter Maurer]
Where is the "oid" in this "ConflictError" report? It is vital information to analyse which object causes the problem...
I don't know. I reformatted Paul's traceback to line up the timestamps for easier reading, but there was no oid in the original either. There wouldn't be, either, if no oid was passed to the ConflictError constructor. ConflictError.__str__ only includes info for the stuff it was told about: def __str__(self): extras = [] if self.oid: extras.append("oid %s" % oid_repr(self.oid)) if self.class_name: extras.append("class %s" % self.class_name) if self.serials: current, old = self.serials extras.append("serial this txn started with %s" % readable_tid_repr(old)) extras.append("serial currently committed %s" % readable_tid_repr(current)) if extras: return "%s (%s)" % (self.message, ", ".join(extras)) else: return self.message In Paul's case, the ConflictError occurred on the ZEO server, but the traceback came from the ZEO client (just raising the Exception object the ZEO server passed to it). There's no way then to guess what the traceback might have looked like on the ZEO server side (boooo!). Perhaps looking at the ZEO server log would have helped.
Paul Winkler wrote at 2005-2-24 14:29 -0500:
This is a topic that comes up occasionally, but since the symptom seems to be rare and hard to provoke, it never gets resolved. Well it's biting me again and I'm determined to fix it, but would like some input from the list.
I know that "ZopeProfiler" can cause such "ConflictError"s and (what is more) why it does this. "ZopeProfiler" (like some other products) wants to install itself in "Control_Panel". Because, this requires a database connection and because "App.ProductContext.ProductContest" tries hard to hide its application object, "ZopeProfiler" opens a new one, modifies "Control_Panel" and commits the transaction and then closes its connection. Such modifications via a second ZODB connection can cause unexpected (and puzzling) "ConflictError" on startup. The next "ZopeProfiler" version will access "ProductContext.__app" instead (using the mangled name). -- Dieter
On Thu, Feb 24, 2005 at 02:29:50PM -0500, Paul Winkler wrote:
This is a topic that comes up occasionally, but since the symptom seems to be rare and hard to provoke, it never gets resolved. Well it's biting me again and I'm determined to fix it, ...
And since my ConflictError turned out to be merely caused by a borked ZEO client cache, I've given up on the hunt ... again :-\ For all I know, ZEO cache may have been the cause of my previous run-ins with ConflictError on startup too. -PW -- Paul Winkler http://www.slinkp.com
participants (5)
-
Chris McDonough -
Chris Withers -
Dieter Maurer -
Paul Winkler -
Tim Peters