database conflicts and the _p_oid missing attribute bug
Jeremy & List -- I have been having problems with Zope locking up (deadlock?) when ZODB conflicts occur. I am running Zope 2.6.2b3 with the 3.1.2b2 version of the ZODB (that is, a vanilla install). Python is 2.1.3. One of the failing methods, QLOG, that is a Python Script that computes some parameters (including calling an External Method to create a unique id) and then invokes a ZSQLMethod to insert information into a MySQL database. The other failing method is the logged_in element of the commonly used Cookie Crumbler authentication system. It does a bit of Javascript setup and then invokes a Python Script that creates a bunch of session variables which it populates from MySQL (4 queries) and writes a row into a MySQL table. Each of the sql queries calls a PythonScript as part of the dynamic expansion of the query. The final query calls an External Method to get a unique identifier. The diagnostic is always: 2004-01-22T12:25:15 ERROR(200) ZODB A storage error occured during transaction abort. This shouldn't happen. Traceback (innermost last): File /home/zopesys/lib/python/ZODB/Transaction.py, line 262, in commit File /home/zopesys/lib/python/ZODB/Transaction.py, line 393, in _commit_error AttributeError: Surrogate instance has no attribute '_p_oid' As I understand what's happening (after reading the logs carefully), I have one or more folks trying to use the same object. Zope's ZODB serializes the access and aborts all but the oldest of the conflicting transactions. The diagnostic occurs because a registered object is missing the _p_oid attribute that's used to rollback the transaction. The big question is which of the several sub-transactions is the culprit--or is it a bug as Dieter has suggested, soluable by providing a default value for _p_oid -- Dieter wrote: You can fix this by substituting "getattr(o,'_p_oid','<not persistent>')" for "o._p_oid" in "_commit_error". I've been a bit unwilling to modify this part of Zope until I fully understand the problem. How can I find the offending object(s)? Why don't they have the proper _p_oid attribute? Is this truly a bug? And, on another front, it looks like conflicts of the sort that are triggering the bug are going to be relatively common. They are real and are ascerbated by the time required for database, file, and computation. Would installing ZEO help minimize conflicts? -- Dennis Allison * Computer Systems Laboratory * Gates 227 * Stanford University * Stanford CA 94305 * (650) 723-9213 * (650) 723-0033 fax * allison@shasta.stanford.edu * allison@sumeru.stanford.edu
I'll be at LinuxWorld tomorrow and mostly aware from my email, but here's a brief response that might offer some clues. On Thu, 2004-01-22 at 21:45, Dennis Allison wrote:
I have been having problems with Zope locking up (deadlock?) when ZODB conflicts occur. I am running Zope 2.6.2b3 with the 3.1.2b2 version of the ZODB (that is, a vanilla install). Python is 2.1.3.
It's possible that upgrading to a non-beta release will fix, although most of the issues resolved in the 3.1.x maintenance releases were ZEO issues.
The diagnostic is always:
2004-01-22T12:25:15 ERROR(200) ZODB A storage error occured during transaction abort. This shouldn't happen. Traceback (innermost last): File /home/zopesys/lib/python/ZODB/Transaction.py, line 262, in commit File /home/zopesys/lib/python/ZODB/Transaction.py, line 393, in _commit_error AttributeError: Surrogate instance has no attribute '_p_oid'
There is more than one error here, which makes the log message very confusing. In particular, there is an exception that occurs during the second phase of the two-phase commit, a second error that occurs trying to recover from the first error, and a third error trying to log a message about the second error. It's a real mess. It might help to change the way logging is done so that you can see the first two error messages. In _commit_error(), change the self.log() call in these lines except: # nothing to do but log the error self.log("Failed to abort object %s" % repr(o._p_oid), error=sys.exc_info()) to say this: self.log("Failed to abort object %s" % repr(o), error=sys.exc_info()) And change one line in _commit # Ugh, we got an got an error during commit, so we # have to clean up. First save the original exception # in case the cleanup process causes another # exception. error = sys.exc_info() try: self._commit_error(objects, ncommitted, jars, subjars) except: LOG('ZODB', ERROR, "A storage error occured during transaction " "abort. This shouldn't happen.", error=sys.exc_info()) The LOG() call should be passing error=error, not error=sys.exc_info(). I don't think those changes will introduce any new problems and will tell us a lot about what is really going on.
As I understand what's happening (after reading the logs carefully), I have one or more folks trying to use the same object. Zope's ZODB serializes the access and aborts all but the oldest of the conflicting transactions. The diagnostic occurs because a registered object is missing the _p_oid attribute that's used to rollback the transaction. The big question is which of the several sub-transactions is the culprit--or is it a bug as Dieter has suggested, soluable by providing a default value for _p_oid --
That right in a rough sense. ZODB uses optimistic concurrency control, so later transactions are aborted if they conflict with already-committed ones. But the transactions actually run concurrently. The original error in the second-phase of two-phase commit is part of the problem. Since it's in the second phase, it's not caused by a ZODB conflict but by some more serious failure of one of the transactional resource managers. Unfortunately, we can't tell which one because the buggy logging code has swallowed the error. Jeremy
Thank you for the pointers. I'll modify the logging code and see what turns up. Unfortunately this is a problem we only see when the system is live with real users. On Thu, 22 Jan 2004, Jeremy Hylton wrote:
I'll be at LinuxWorld tomorrow and mostly aware from my email, but here's a brief response that might offer some clues.
On Thu, 2004-01-22 at 21:45, Dennis Allison wrote:
I have been having problems with Zope locking up (deadlock?) when ZODB conflicts occur. I am running Zope 2.6.2b3 with the 3.1.2b2 version of the ZODB (that is, a vanilla install). Python is 2.1.3.
It's possible that upgrading to a non-beta release will fix, although most of the issues resolved in the 3.1.x maintenance releases were ZEO issues.
The diagnostic is always:
2004-01-22T12:25:15 ERROR(200) ZODB A storage error occured during transaction abort. This shouldn't happen. Traceback (innermost last): File /home/zopesys/lib/python/ZODB/Transaction.py, line 262, in commit File /home/zopesys/lib/python/ZODB/Transaction.py, line 393, in _commit_error AttributeError: Surrogate instance has no attribute '_p_oid'
There is more than one error here, which makes the log message very confusing. In particular, there is an exception that occurs during the second phase of the two-phase commit, a second error that occurs trying to recover from the first error, and a third error trying to log a message about the second error. It's a real mess. It might help to change the way logging is done so that you can see the first two error messages.
In _commit_error(), change the self.log() call in these lines
except: # nothing to do but log the error self.log("Failed to abort object %s" % repr(o._p_oid), error=sys.exc_info())
to say this: self.log("Failed to abort object %s" % repr(o), error=sys.exc_info())
And change one line in _commit
# Ugh, we got an got an error during commit, so we # have to clean up. First save the original exception # in case the cleanup process causes another # exception. error = sys.exc_info() try: self._commit_error(objects, ncommitted, jars, subjars) except: LOG('ZODB', ERROR, "A storage error occured during transaction " "abort. This shouldn't happen.", error=sys.exc_info())
The LOG() call should be passing error=error, not error=sys.exc_info().
I don't think those changes will introduce any new problems and will tell us a lot about what is really going on.
As I understand what's happening (after reading the logs carefully), I have one or more folks trying to use the same object. Zope's ZODB serializes the access and aborts all but the oldest of the conflicting transactions. The diagnostic occurs because a registered object is missing the _p_oid attribute that's used to rollback the transaction. The big question is which of the several sub-transactions is the culprit--or is it a bug as Dieter has suggested, soluable by providing a default value for _p_oid --
That right in a rough sense. ZODB uses optimistic concurrency control, so later transactions are aborted if they conflict with already-committed ones. But the transactions actually run concurrently.
The original error in the second-phase of two-phase commit is part of the problem. Since it's in the second phase, it's not caused by a ZODB conflict but by some more serious failure of one of the transactional resource managers. Unfortunately, we can't tell which one because the buggy logging code has swallowed the error.
Jeremy
_______________________________________________ Zope maillist - Zope@zope.org http://mail.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope-dev )
On Thu, Jan 22, 2004 at 11:26:23PM -0500, Jeremy Hylton wrote:
That right in a rough sense. ZODB uses optimistic concurrency control, so later transactions are aborted if they conflict with already-committed ones. But the transactions actually run concurrently.
Could you explain this a bit more? Because we run relatively high-write operations on our Zope/ZEO setup, we hit ConflictErrors quite a bit. I'd like to understand the underlying machinery here, ideally so I can come up with a fix. srl -- Shane Landrum, Software Engineer srl@boston.com boston.com / NY Times Digital
On Fri, 23 Jan 2004 09:56:23 -0500 Shane Landrum <srl@boston.com> wrote:
On Thu, Jan 22, 2004 at 11:26:23PM -0500, Jeremy Hylton wrote:
That right in a rough sense. ZODB uses optimistic concurrency control, so later transactions are aborted if they conflict with already-committed ones. But the transactions actually run concurrently.
Could you explain this a bit more? Because we run relatively high-write operations on our Zope/ZEO setup, we hit ConflictErrors quite a bit. I'd like to understand the underlying machinery here, ideally so I can come up with a fix.
Here is a simple example: Suppose you have an object 'somefile' that has an attribute 'data'. Now suppose that there are two concurrent processes modifying 'somefile': Process 1: somefile.data = 'Mary had a little lamb' Process 2: somefile.data = 'The Quick Brown Fox...' The concurrency control is "optimistic" because it allows both processes to change the same file object (there are no locks). Now suppose that process 1 commits the transaction. This saves 'Mary had a little lamb' as the new file data in the ZODB. Inevitably, process 2 also commits its transaction. The ZODB sees that the file was modified since process 2's transaction began, and this is a conflict (specifically a write conflict). When the ZODB detects a write conflict it sees if the object supports conflict resolution (via a special method). If so, the method is called to allow the object to try to resolve the conflict. Most objects do not have conflict resolution, however. In this case ZODB raises a ConflictError (thereby aborting the transaction). This error is caught by Zope. In response, Zope retries the web request which replays the transaction again. If it gets another ConflictError is tries again (up to 3 times). After 3 times it gives up and returns an error to the user. In most cases, the first retry is sufficent. More retries are necessary as the system becomes busier (more concurrency). This causes extra work since many transactions need to be retried. This can exacerbate the problem and cause performance degradation. Conflicts tend to center around "hotspots", objects that are changed often by many requests. The Catalog is a classic example since it tends to get changed everytime new content is added to the system. Content repository folders are another example. Both of these "hotspots" have their own conflict resolution code (they typically use BTree objects which can resolve many write conflicts internally). Write conflicts tend to be pretty manageable if you recognize hotspots and use the proper data types (like BTrees) or add your own conflict resolution (not for the faint of heart, however). The more insidious kind of conflicts are Read Conflicts. They are caused (in simple terms) when one transaction changes an object that another concurrent transaction is about to use (These are called "dirty reads"). You can't resolve these kinds of conflicts and they can be very tricky to prevent especially in a busy system. Luckily ZODB 3.3 has a new feature: "Multi-version concurrency control" (MVCC) which resolves read conflicts (or at worst changes them to write conflicts). Once ZODB 3.3 gets released in Zope, we can look forward to many fewer conflict errors. -Casey
On Fri, Jan 23, 2004 at 11:36:24AM -0500, Casey Duncan wrote:
<a big long explanation>
Thanks, this is much clearer now.
Conflicts tend to center around "hotspots", objects that are changed often by many requests. The Catalog is a classic example since it tends to get changed everytime new content is added to the system. Content repository folders are another example.
Both of these "hotspots" have their own conflict resolution code (they typically use BTree objects which can resolve many write conflicts internally). Write conflicts tend to be pretty manageable if you recognize hotspots and use the proper data types (like BTrees) or add your own conflict resolution (not for the faint of heart, however).
Where can I read about writing my own conflict resolution? What risks are there with this approach, and how do I minimize them? srl -- Shane Landrum, Software Engineer srl@boston.com boston.com / NY Times Digital
On Fri, 23 Jan 2004 11:54:38 -0500 Shane Landrum <srl@boston.com> wrote:
Where can I read about writing my own conflict resolution? What risks are there with this approach, and how do I minimize them?
The subject isn't well-documented. I hope we'll improve the documentation for 3.3, and this is one of the issues the bears explaning. In the interim, here are two possibilities. Jim's wiki has a page that describes the basic API. It's all correct and has some simple example code. http://www.zope.org/Members/jim/ZODB/ApplicationLevelConflictResolution I think it doesn't address cross-object consistency issues. If you have two (or more) objects and there is an invariant that spans those two objects, you need to be very careful about maintaining that invariant in the face of conflict resolution. BTrees have to deal with that. I also wrote a technical note in my weblog about avoiding contention. It discusses several approaches, but it's just a sketch of some ideas. http://www.python.org/~jeremy/weblog/031031c.html If these don't answer your questions or provoke new questions, feel free to ask. It will help us understand what new documentation is needed. Jeremy
Jeremy -- when will 3.3 be ready for prime time? -d On Fri, 23 Jan 2004, Jeremy Hylton wrote:
On Fri, 23 Jan 2004 11:54:38 -0500 Shane Landrum <srl@boston.com> wrote:
Where can I read about writing my own conflict resolution? What risks are there with this approach, and how do I minimize them?
The subject isn't well-documented. I hope we'll improve the documentation for 3.3, and this is one of the issues the bears explaning.
In the interim, here are two possibilities.
Jim's wiki has a page that describes the basic API. It's all correct and has some simple example code. http://www.zope.org/Members/jim/ZODB/ApplicationLevelConflictResolution
I think it doesn't address cross-object consistency issues. If you have two (or more) objects and there is an invariant that spans those two objects, you need to be very careful about maintaining that invariant in the face of conflict resolution. BTrees have to deal with that.
I also wrote a technical note in my weblog about avoiding contention. It discusses several approaches, but it's just a sketch of some ideas. http://www.python.org/~jeremy/weblog/031031c.html
If these don't answer your questions or provoke new questions, feel free to ask. It will help us understand what new documentation is needed.
Jeremy
-- Dennis Allison * Computer Systems Laboratory * Gates 227 * Stanford University * Stanford CA 94305 * (650) 723-9213 * (650) 723-0033 fax * allison@shasta.stanford.edu * allison@sumeru.stanford.edu
On Fri, 23 Jan 2004 09:10:47 -0800 (PST) Dennis Allison <allison@shasta.stanford.edu> wrote:
Jeremy -- when will 3.3 be ready for prime time? -d
I'm not really the right person to ask about the schedule :-). Luckily, Jim Fulton has posted a Zope 2.8 schedule. ZODB 3.3 will be released in conjunction with Zope 2.8. http://dev.zope.org/Wikis/DevSite/Projects/Zope2.8/MilestonePlan The schedule says beta in March. I think that's realistic technically. Jeremy
Jeremy -- I want to follow up on your suggestions -- I put in the changes you suggested and we have been stable since. It appears that the real culprit was the diagnostic itself (although I cannot be sure since the fault was not something we could reproduce). We've been running now for several days with a roughly similar load and no failures. We do see occasional errors which go through the transaction mechanism (MySQL transactions that cannot be backed out) and I suspect that they are the events that triggered the _p_oid bug. I've not put in a bug report to the Collector on this -- do you want me to do so or will you do it. (It is your code (-: ) Many thanks for your help. -d On Thu, 22 Jan 2004, Jeremy Hylton wrote:
I'll be at LinuxWorld tomorrow and mostly aware from my email, but here's a brief response that might offer some clues.
On Thu, 2004-01-22 at 21:45, Dennis Allison wrote:
I have been having problems with Zope locking up (deadlock?) when ZODB conflicts occur. I am running Zope 2.6.2b3 with the 3.1.2b2 version of the ZODB (that is, a vanilla install). Python is 2.1.3.
It's possible that upgrading to a non-beta release will fix, although most of the issues resolved in the 3.1.x maintenance releases were ZEO issues.
The diagnostic is always:
2004-01-22T12:25:15 ERROR(200) ZODB A storage error occured during transaction abort. This shouldn't happen. Traceback (innermost last): File /home/zopesys/lib/python/ZODB/Transaction.py, line 262, in commit File /home/zopesys/lib/python/ZODB/Transaction.py, line 393, in _commit_error AttributeError: Surrogate instance has no attribute '_p_oid'
There is more than one error here, which makes the log message very confusing. In particular, there is an exception that occurs during the second phase of the two-phase commit, a second error that occurs trying to recover from the first error, and a third error trying to log a message about the second error. It's a real mess. It might help to change the way logging is done so that you can see the first two error messages.
In _commit_error(), change the self.log() call in these lines
except: # nothing to do but log the error self.log("Failed to abort object %s" % repr(o._p_oid), error=sys.exc_info())
to say this: self.log("Failed to abort object %s" % repr(o), error=sys.exc_info())
And change one line in _commit
# Ugh, we got an got an error during commit, so we # have to clean up. First save the original exception # in case the cleanup process causes another # exception. error = sys.exc_info() try: self._commit_error(objects, ncommitted, jars, subjars) except: LOG('ZODB', ERROR, "A storage error occured during transaction " "abort. This shouldn't happen.", error=sys.exc_info())
The LOG() call should be passing error=error, not error=sys.exc_info().
I don't think those changes will introduce any new problems and will tell us a lot about what is really going on.
As I understand what's happening (after reading the logs carefully), I have one or more folks trying to use the same object. Zope's ZODB serializes the access and aborts all but the oldest of the conflicting transactions. The diagnostic occurs because a registered object is missing the _p_oid attribute that's used to rollback the transaction. The big question is which of the several sub-transactions is the culprit--or is it a bug as Dieter has suggested, soluable by providing a default value for _p_oid --
That right in a rough sense. ZODB uses optimistic concurrency control, so later transactions are aborted if they conflict with already-committed ones. But the transactions actually run concurrently.
The original error in the second-phase of two-phase commit is part of the problem. Since it's in the second phase, it's not caused by a ZODB conflict but by some more serious failure of one of the transactional resource managers. Unfortunately, we can't tell which one because the buggy logging code has swallowed the error.
Jeremy
_______________________________________________ Zope maillist - Zope@zope.org http://mail.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope-dev )
participants (5)
-
Casey Duncan -
Dennis Allison -
Dennis Allison -
Jeremy Hylton -
Shane Landrum