Hello,
We have spent most of the day tracking down obscure hangs of Zope (2.6.4rc1) under Python 2.1.3 on a RHEL3 machine.
The problem seems to be a logic flaw somewhere related to the cPickleCache, when using a destructor in a Zope object that accesses itself.
In our case (shortened to the offending line):
def __del__(self): print "About to destroy: ", self.id
What seems to happen is that the "self.id" access causes the object to be cached again, causing scan_gc_items() to run in circles.
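The mechanism can be mimicked with a toy model (hypothetical names, not real ZODB internals): on a ghost, any attribute lookup reloads the object's state, i.e. unghostifies it, which is exactly what the `self.id` access in `__del__` does.

```python
class FakeGhost:
    """Toy stand-in for a persistent object that reloads its state
    on attribute access (not real ZODB code)."""

    def __init__(self, oid):
        self._state = {'id': oid}
        self._ghost = False

    def ghostify(self):
        self._ghost = True   # forget the loaded state

    def __getattr__(self, name):
        # Called only for attributes not found by normal lookup
        # (e.g. 'id').  On a ghost, this is where ZODB would reload
        # the object's state from storage:
        if self._ghost:
            self._ghost = False   # "unghostified" as a side effect
        try:
            return self._state[name]
        except KeyError:
            raise AttributeError(name)

obj = FakeGhost('doc1')
obj.ghostify()
_ = obj.id                 # the attribute access un-ghosts the object
assert obj._ghost is False
```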
Any ideas on how to best fix this?
Mario
Defining __del__ on a persistent object has unknown effects, FWIW. A persistent object's __del__ method may be called many times during its lifetime. See http://zope.org/Wikis/ZODB/FrontPage/guide/node3.html#SECTION000360000000000... for more info.
- C
On Fri, 2004-01-23 at 09:55, Mario Lorenz wrote:
Hello,
We have spent most of the day tracking down obscure hangs of Zope (2.6.4rc1) under Python 2.1.3 on a RHEL3 machine.
The problem seems to be a logic flaw somewhere related to the cPickleCache, when using a destructor in a Zope object that accesses itself.
In our case (shortened to the offending line):
def __del__(self): print "About to destroy: ", self.id
What seems to happen is that the "self.id" access causes the object to be cached again, causing scan_gc_items() to run in circles.
Any ideas on how to best fix this?
Mario
On Fri, 23 Jan 2004 10:31:59 -0500 Chris McDonough chrism@plope.com wrote:
Defining __del__ on a persistent object has unknown effects, FWIW. A persistent object's __del__ method may be called many times during its lifetime. See
http://zope.org/Wikis/ZODB/FrontPage/guide/node3.html#SECTION000360000000000...
for more info.
In fact, the original poster points out a very real problem with __del__. If the __del__ method accesses any attribute of the object, it will put the object back in the cache. I hadn't thought of that before, but it's obvious now that it would work this way.
The only useful answer is to avoid using __del__ on persistent objects. If there are resources that really need to be finalized whenever the object is ghosted, you can put them in a non-persistent sub-object that does have a __del__.
Jeremy
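The non-persistent sub-object pattern Jeremy describes might look like this sketch (all names hypothetical; the persistent object would hold the helper, e.g. as a volatile `_v_` attribute, so the helper dies when the object is flushed from memory):

```python
class ResourceHolder:
    """Plain (non-persistent) Python object: a safe home for __del__."""

    def __init__(self, resource):
        self.resource = resource

    def __del__(self):
        # Finalize the real resource here; never touch any persistent
        # object (and especially not a persistent parent) from __del__.
        self.resource.close()

class FakeFile:
    """Stand-in for a real resource such as an open file."""
    closed = False
    def close(self):
        self.closed = True

f = FakeFile()
holder = ResourceHolder(f)
del holder       # dropping the last reference runs __del__ (CPython)
assert f.closed
```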
Jeremy Hylton wrote at 2004-1-23 12:12 -0500:
... The only useful answer is to avoid using __del__ on persistent objects. If there are resources that really need to be finalized whenever the object is ghosted, you can put them in a non-persistent sub-object that does have a __del__.
A minor note (not essential for the current discussion):
"__del__" is not called when the object is ghostified but when the last (non cache) reference to it is deleted (i.e. when the object is flushed from memory).
By the way, there seem to be some objects around which trigger this "infinite loop behaviour".
I noticed that a "minimize" in our production environment often leads to a no longer responding Zope instance. I think now that this may be caused by the above problem. We use mainly stock Zope and CMF objects, Postgres and MySQL database adapters and "LocalFS".
[Mario Lorenz]
We have spent most of the day tracking down obscure hangs of Zope (2.6.4rc1) under Python 2.1.3 on a RHEL3 machine.
Based on what you say next, it sure sounds like this isn't unique to 2.6.4rc1. Did the same code "work" under some previous release? The infinite loop appears to be an inherent property of this iteration of the cPickleCache design, and that's not new in 2.6.4rc1.
The problem seems to be a logic flaw somewhere related to the cPickleCache, when using a destructor in a Zope object that accesses itself.
In our case (shortened to the offending line):
def __del__(self): print "About to destroy: ", self.id
What seems to happen is that the "self.id" access causes the object to be cached again, causing scan_gc_items() to run in circles.
Based on eyeballing the C code, "the ring" is a list of objects in cache, ordered from least recently used to most recently used. scan_gc_items traverses this list once, from LRU to MRU, ghosting ghostifiable objects until the target number of non-ghost objects remaining is reached, or the entire list has been traversed.
It looks like ghostifying your "self" triggers self.__del__(). Then the __del__ method unghostifies self, which has the side effect of moving self to the MRU end of the ring, which in turn means the list traversal will visit self *again*. When it does, same thing happens all over again, ad infinitum.
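That by-eyeball analysis can be mimicked with a toy model (a plain deque standing in for the ring; hypothetical names, not the real C code). A scan that runs until the target is reached never terminates once a resurrecting object keeps hopping back to the MRU end, so the model caps iterations just to make the loop visible:

```python
from collections import deque

def scan_gc_items(ring, target, max_iterations=100):
    """Toy model of the cache scan: ghostify LRU entries until only
    `target` remain.  Entries are (name, resurrects) pairs; returns
    the number of iterations used."""
    iterations = 0
    while len(ring) > target and iterations < max_iterations:
        iterations += 1
        name, resurrects = ring.popleft()      # take the LRU entry
        if resurrects:
            # Its __del__ touched self: the object is re-activated and
            # moved to the MRU end, so the scan will meet it again.
            ring.append((name, resurrects))
    return iterations

# A well-behaved cache drains in one pass:
assert scan_gc_items(deque([('a', False), ('b', False)]), target=0) == 2

# One resurrecting object with target 0 never drains; we hit the cap:
assert scan_gc_items(deque([('a', False), ('bad', True)]), target=0) == 100
```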
Any ideas on how to best fix this?
As the docs Chris pointed you at say, persistent objects shouldn't have __del__ methods. If the by-eyeball analysis above is correct, then a persistent object with a __del__ method that references an attribute of self makes an infinite loop in scan_gc_items() inevitable.
So I see only three workarounds short of rewriting the C code:
1. Lose the __del__ method (recommended).
2. If you need a __del__ method (it's hard to imagine why, since it will get called whenever the object is ghostified, and has nothing to do with the object's actual lifetime), don't reference any persistent objects (and esp. not self) within it.
3. Recompile with MUCH_RING_CHECKING defined. Then scan_gc_items will give up after max(cache_size * 10, 10000) iterations instead of running forever.
On Fri, Jan 23, 2004 at 12:08:27PM -0500, Tim Peters wrote:
def __del__(self): print "About to destroy: ", self.id
I don't know what your intention is there, but fwiw, if what you're *really* interested in is the object being marked for deletion in the ZODB, you can use:
def manage_beforeDelete(self): print "About to destroy:", self.id
On Friday 23 January 2004 17:08, Tim Peters wrote:
It looks like ghostifying your "self" triggers self.__del__(). Then the __del__ method unghostifies self, which has the side effect of moving self to the MRU end of the ring, which in turn means the list traversal will visit self *again*. When it does, same thing happens all over again, ad infinitum.
Not necessarily ad infinitum. It will only run forever if the number of __del__-resurrecting objects in the cache is larger than the cache target size. Does that fit with your scenario?
- If you need a __del__ method (it's hard to imagine why, since it will get called whenever the object is ghostified, and has nothing to do with the object's actual lifetime), don't reference any persistent objects (and esp. not self) within it.
Or 2b, as Jeremy suggested: put your __del__ on a non-persistent sub-object.
[Tim Peters]
It looks like ghostifying your "self" triggers self.__del__(). Then the __del__ method unghostifies self, which has the side effect of moving self to the MRU end of the ring, which in turn means the list traversal will visit self *again*. When it does, same thing happens all over again, ad infinitum.
[Toby Dickenson]
Not necessarily ad infinitum. It will only run forever if the number of __del__-resurrecting objects in the cache is larger than the cache target size.
It's actually that the number of __del__-resurrecting objects *plus* the number of non-ghostifiable objects in cache is larger than the cache target size, right?
Does that fit with your scenario?
Similarly <wink>, except that if there's a large number of non-ghostifiable objects (more than the cache target size), then only one __del__-resurrected object is enough to provoke an infinite loop.
- If you need a __del__ method (it's hard to imagine why, since it will get called whenever the object is ghostified, and has nothing to do with the object's actual lifetime), don't reference any persistent objects (and esp. not self) within it.
Or 2b, as Jeremy suggested: put your __del__ on a non-persistent sub-object.
Yup.
On Monday 26 January 2004 17:22, Tim Peters wrote:
It's actually that the number of __del__-resurrecting objects *plus* the number of non-ghostifiable objects in cache is larger than the cache target size, right?
Yes, right. That is more achievable than I thought.
On 26 Jan 2004, at 12:22:43, Tim Peters wrote:
It's actually that the number of __del__-resurrecting objects *plus* the number of non-ghostifiable objects in cache is larger than the cache target size, right?
Yes, but when you push the minimize button, the cache target size is 0.
Notwithstanding our code (it was mostly debugging/tracing functionality anyway, so it was easy to fix), the thing is that with Zope 2.5 it seemed to work (at least I never got a bug report back then), so it took us a while to track this down. Especially since we had just moved to RHEL3, we were expecting something more like a threading issue...
Given that this property is not that widely published (in the various tutorials etc.), I wonder if it might be a good idea to improve that loop-check code and walk through the ring no more than once, using a counter rather than the comparison against the ring start, which, as we have seen, may be a target that moves too quickly.
Similarly <wink>, except that if there's a large number of non-ghostifiable objects (more than the cache target size), then only one __del__-resurrected object is enough to provoke an infinite loop.
Yes, this is exactly what happened here.
Mario
It's actually that the number of __del__-resurrecting objects *plus* the number of non-ghostifiable objects in cache is larger than the cache target size, right?
[Mario Lorenz]
Yes, but when you push the minimize button, the cache target size is 0.
That would do it <wink>.
Notwithstanding our code (it was mostly debugging/tracing functionality anyway, so it was easy to fix),
I'm glad to hear that!
it is just that with Zope 2.5, it seemed to work (at least I never got a bug report back then) so it took us a while to track this down.
I expect that in your Zope 2.5 installation, the cache was implemented in a very different way. The current implementation is universally regarded as a major improvement. Maybe Toby remembers which release(s) of ZODB the current cache implementation first appeared in; it's not brand new, but is relatively recent.
Especially since we had just moved to RHEL3, so we were expecting more like a threading issue...
Be patient <wink>.
Given that this property is not that widely published (in the various tutorials etc.),
I believe the idea that a persistent object with a __del__ method could provoke an infinite loop in cPickleCache was news to all of us, and yours was the first report.
I wonder if it might be a good idea to improve that loop-check code and walk through the ring no more than once, using a counter rather than the comparison against the ring start, which, as we have seen, may be a target that moves too quickly.
We need to open Collector issues for ZODB bugs and patches, because they have a history of getting lost sometimes when confined to the mailing list. I opened an issue for this one here:
On Tuesday 27 January 2004 09:39, Mario Lorenz wrote:
Given that this property is not that widely published (in the various tutorials etc.), I wonder if it might be a good idea to improve that loop-check code and walk through the ring no more than once, using a counter rather than the comparison against the ring start, which, as we have seen, may be a target that moves too quickly.
You got me curious....
The attached patch to 2.6 uses an extra temporary ring node to mark where scanning should stop. Normally this will be right next to the home node, and behaviour is unchanged. Any of these troublesome objects that move themselves to the top of the LRU list *during* the scan will end up *after* the scan terminator node, and therefore do not get rescanned.
This works for me on 2.6, which is the only branch I'm set up to use at the moment.
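For readers not set up to read the C, the terminator-node idea can be sketched in Python (toy model with hypothetical names; the real patch operates on cPickleCache's doubly-linked ring): a sentinel is appended at the MRU end before the scan, so any object that resurrects itself mid-scan lands behind the sentinel and is never revisited.

```python
from collections import deque

SENTINEL = object()   # temporary node marking where this scan must stop

def scan_once(ring, target):
    """Toy model of the patched scan: ghostify entries until `target`
    remain, visiting each entry at most once.  Entries are
    (name, resurrects) pairs; returns the number of entries examined."""
    ring.append(SENTINEL)             # placed at the current MRU end
    examined = 0
    non_ghosts = len(ring) - 1        # don't count the sentinel
    while non_ghosts > target:
        item = ring.popleft()
        if item is SENTINEL:
            break                     # completed one full pass; give up
        examined += 1
        name, resurrects = item
        if resurrects:
            ring.append(item)         # lands *behind* the sentinel,
                                      # so it will not be rescanned
        else:
            non_ghosts -= 1           # successfully ghostified
    if SENTINEL in ring:              # clean up if the loop ended early
        ring.remove(SENTINEL)
    return examined

# The troublesome case now terminates after exactly one pass:
ring = deque([('a', False), ('bad', True)])
assert scan_once(ring, target=0) == 2
assert list(ring) == [('bad', True)]   # survived, but scan is finite
```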