[ZODB-Dev] Re: BTrees strangeness (was [Zope-dev] Zope 2.X BIG
Session problems - blocker - our site dies - need help of experience
Zope developer, please)
Chris McDonough
chrism at plope.com
Tue May 11 00:54:13 EDT 2004
Whew. I finally got a chance to review at least a bit of Michael's
excellent work...
On Mon, 2004-04-26 at 18:34, Michael Dunstan wrote:
> Okay - I've gone ahead and implemented a flavour of transaction
> isolation for errors. (See below for patch.) That seems to have done
> the trick just nicely. I have not seen any more KeyError's thrown from
> Transience.py. Yay!
So what do folks think of this error-case transaction isolation patch?
Michael's original patch is preserved here (although it didn't survive
cut and paste from my mail client in a pristine way, you'll get the
idea):
http://www.plope.com/Members/chrism/transaction_isolation_error.patch/file_view
Personally it looks fine to me. It wraps each error invocation in its
own transaction.
We need to do something like this to fix problems exposed by session
writes (or any other database writes) that occur during an error. I
know there were some opinions about this before as to whether the error
should occur as part of the preceding transaction or whether the error
should be wrapped in its own transaction, but I've forgotten the pros
and cons (I've even forgotten which side I was on at that point ;-).
Does anyone have a strong opinion? I think I remember that there were
issues here with ZPatterns, mail-on-error, and some other things.
> However there is another KeyError reported in 848 too. This one is
> thrown from TempoaryStoarge. After some further tweaking of my test rig
> I was able to produce such KeyErrors quite frequently. Using Zope 2.7.0
> and recent the rewrite of Transience.py:
>
> http://cvs.zope.org/*checkout*/Products/Transience/Transience.py?
> rev=1.32.12.2.2.2&only_with_tag=chrism-sessiongeddon
>
> errors look like:
<snip 90% of traceback>
> TemporaryStorage.py", line 94, in load
> s=self._index[oid]
> KeyError: '\x00\x00\x00\x00\x00\x00\x08\x82'
>
> So far it looks like commenting out the construction of the to_notify
> list in _gc makes this go away. So _gc looks like:
>
> def _gc(self, now=None):
> if self._timeout_slices == 0:
> return # dont do gc if there is no timeout
>
> if now is None:
> now = getCurrentTimeslice() # for unit tests
> max_ts = now - (PERIOD * (self._timeout_slices + 1))
> keys = self._data.keys(None, max_ts)
>
> #to_notify = []
>
> for key in list(self._data.keys(None, max_ts)):
> assert(key <= max_ts)
> STRICT and _assert(self._data.has_key(key))
> #for v in self._data[key].values():
> # to_notify.append(v)
> del self._data[key]
>
> #for v in to_notify:
> # self.notifyDel(v)
>
> Note that the test rig does not set onAdd or onDelete.
I was unable to verify/review this, but I believe you! However, I think
the above might be fixing the symptom rather than the underlying
problem. I was able to provoke these errors out of temporary storage in
a completely different way.
There is a "session stress test" in the Sessions product
(stresstests/stressTestMultipleThreads) that kicks off multiple threads
emulating simultaneous reads and writes to a temporary storage that
exposes some key errors out of Transience as well.
I took a look at this and I have a sneaky suspicion that the "conflict
cache" in TemporaryStorage may be causing problems. TemporaryStorage
attempts to emulate a revisioning/undoing storage (which it isn't),
where older revisions of objects are kept around in order to play with
the conflict resolution strategy in ConflictResolvingStorage. I think
TemporaryStorage fails to clear this cache at some appropriate times
(particularly when an object is gc'ed in the storage by the
_takeOutGarbage method... TemporaryStorage does its own form of in-band
"packing" via reference counts).
In the amount of time I could devote to this, I was unable to prove or
disprove this objectively, but looking at the code made me believe it
was possible.
> Along the way there where a few things that I was not too sure about so
> I tried fiddling with to see what would happen:
>
> (1) Use of hasattr in Transience.py. hasattr should be avoided? See
> http://mail.zope.org/pipermail/zodb-dev/2003-October/005968.html. I did
> change usage of hasattr to match the recipe given by Tim. And added
> some logging to explicitly look for ConflictError's at those points.
> Found a few of these in the logs. But that did not solve the KeyErrors.
Good point, the hasattrs should all be changed to getattrs.
> (2) _p_resolveConflict of Increaser of Transience does not always
> return the max of the three given states. I replaced the implementation
> with return max(old, state1, state2). But that did not solve the
> KeyErrors.
Whoops. ;-) Thanks!
>
> (3) Use of try/except in TempoaryStorage. For example:
>
> try: del referenceCount[oid]
> except: pass
>
> I removed all such try/except wrappings of del. But that did not
> solve the KeyErrors. Nor did it raise any new errors in the test rig.
>
> (4) Checking consistency of btrees. I added the following code to _gc
> just before the assignment of keys:
>
> from BTrees.check import check, display
> try:
> self._data._check()
> check(self._data)
> except AssertionError:
> LOG('Transience',ERROR,'yikes')
> display(self._data)
> raise
>
> I never saw any noise from this. Looks like self._data is healthy.
>
> Just to be explicit... I removed all the above hacks leaving just the
> one that comments out the construction of the to_notify list. Reran the
> test rig. And have yet to see any KeyErrors.
Have you seen any since the time you reported this?
Thanks for your patience and work!
- C
More information about the Zope-Dev
mailing list