On our fairly busy site (http://cnx.org) we're seeing in the logs some instances of an error in Shared.DC.ZRDB.DA:: ... Module None, line 97, in search_form_handler - <FSPythonScript at /plone/search_form_handler used for /plone/content> - Line 97 Module Products.RhaptosRepository.Repository, line 537, in searchRepository Module Products.RhaptosRepository.VersionFolder, line 456, in search Module Products.RhaptosModuleStorage.ZSQLFile, line 44, in __call__ Module Shared.DC.ZRDB.DA, line 492, in __call__ - <ExtZSQLMethod at /plone/portal_moduledb/20071212233625.240206723892> Module Shared.DC.ZRDB.DA, line 393, in _cached_result KeyError: ("\nSELECT p.*\nFROM persons p\nwhere\n p.firstname ~* req('href='::text)\n or\n p.surname ~* req('href='::text)\n or\n p.fullname ~* req('href='::text)\n or \n p.personid ~* ('^'||req('href='::text)||'$')\n or\n p.email ~* (req('href='::text)||'.*@')\n\n", 0, 'devrep') This is trying to remove a key from the ZSQL cache to shrink it down to size, but doesn't find the key. From Shared.DC.ZRDB.DA._cached_result: # if the cache is too big, we purge entries from it if len(cache) >= max_cache: keys=tcache.keys() # We also hoover out any stale entries, as we're # already doing cache minimisation. # 'keys' is ordered, so we purge the oldest results # until the cache is small enough and there are no # stale entries in it while keys and (len(keys) >= max_cache or keys[0] < t): key=keys[0] q=tcache[key] del tcache[key] del cache[q] # <===== line 393, with the error del keys[0] It looks a lot like: https://bugs.launchpad.net/zope2/+bug/143805 but we have that fix in our Zope 2.9.8: http://osdir.com/ml/web.zope.all-cvs/2006-11/msg00150.html Perhaps it is another high-load leak? I don't think it can be multiple threads doing cleanup at the same time, unless maybe there's a transaction commit in there somewhere I don't know about. Or maybe I'm running into the problem described in the big comment at the end?:: # When a ZSQL method is handled by one ZPublisher thread twice in # less time than it takes for time.time() to return a different # value, the SQL generated is different, then this code will leak # an entry in 'cache' for each time the ZSQL method generates # different SQL until time.time() returns a different value. # # On Linux, you would need an extremely fast machine under extremely # high load, making this extremely unlikely. On Windows, this is a # little more likely, but still unlikely to be a problem. # # If it does become a problem, the values of the tcache mapping # need to be turned into sets of cache keys rather than a single # cache key. Would it be unsafe to toss a try/except around the del cache[q] bit on the theory that it's already deleted, so, hey, why fail? It'd be really nice to keep this off of users, even with it if does cause a bit of a leak. I'll probably be setting up some logging to try and characterize this further, but anybody have any clues? --jcc -- Connexions http://cnx.org "Building Websites with Plone" http://plonebook.packtpub.com
Hi there, J Cameron Cooper wrote:
but we have that fix in our Zope 2.9.8: http://osdir.com/ml/web.zope.all-cvs/2006-11/msg00150.html
Are you absolutely sure you're using a version of Zope with my patches included?
Perhaps it is another high-load leak? I don't think it can be multiple threads doing cleanup at the same time, unless maybe there's a transaction commit in there somewhere I don't know about.
I'm afraid it's long enough ago that this code was touched that I can't really comment on this bit :-S
Or maybe I'm running into the problem described in the big comment at the end?::
Don't think so, this describes a leak, not a cause of exceptions...
Would it be unsafe to toss a try/except around the del cache[q] bit on the theory that it's already deleted, so, hey, why fail? It'd be really nice to keep this off of users, even with it if does cause a bit of a leak.
Not sure about the try/except, would be better to nail why the key isn't there... Best bet is to start trying to reproduce this in a unit test, which can be pretty hard :-S
I'll probably be setting up some logging to try and characterize this further,
Any results from this? cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk
Chris Withers wrote:
Hi there,
J Cameron Cooper wrote:
but we have that fix in our Zope 2.9.8: http://osdir.com/ml/web.zope.all-cvs/2006-11/msg00150.html
Are you absolutely sure you're using a version of Zope with my patches included?
Yes; I've checked both version (this went in for 2.9.6) and for that specific code. You can even see some of it in the excerpt in the original message, which I pasted straight from our production Zope. I had hoped it was simply missing the fix, too, but no dice. Now, I can't guarantee that the fix in question has remained working since it was checked in, but the release code looks much like the the patch.
Perhaps it is another high-load leak? I don't think it can be multiple threads doing cleanup at the same time, unless maybe there's a transaction commit in there somewhere I don't know about.
I'm afraid it's long enough ago that this code was touched that I can't really comment on this bit :-S
Or maybe I'm running into the problem described in the big comment at the end?::
Don't think so, this describes a leak, not a cause of exceptions...
Would it be unsafe to toss a try/except around the del cache[q] bit on the theory that it's already deleted, so, hey, why fail? It'd be really nice to keep this off of users, even with it if does cause a bit of a leak.
Not sure about the try/except, would be better to nail why the key isn't there...
I agree, of course, but these do end up being user visible, and a workaround, as long as it's not going to cause worse things, would be nice. Something like the original workaround in the bug might be in order for me (rather than a try/except) until the real fix is found. http://launchpadlibrarian.net/9460742/Zope-2.8.6-cached-result.patch I don't imagine it can cause more than cache bloat, and we restart enough that that shouldn't kill us.
Best bet is to start trying to reproduce this in a unit test, which can be pretty hard :-S
Yes, indeed. I did notice that the test you mention in: https://bugs.launchpad.net/zope2/+bug/143805 doesn't seem to exist, at least in my version.
I'll probably be setting up some logging to try and characterize this further,
Any results from this?
I haven't yet inserted anything in production (I get funny looks at the suggestion) but we'll be trying to cause it in a development system soon. --jcc -- Connexions http://cnx.org "Building Websites with Plone" http://plonebook.packtpub.com
J Cameron Cooper wrote at 2008-1-15 10:01 -0600: .... I do not have a fix (or an explanation) for your problem -- but maybe I can suggest a workaround: you might replace "ZSQLMethod" by my "CCSQLMethod". <http://www.dieter.handshake.de/pyprojects/zope> -- Dieter
participants (3)
-
Chris Withers -
Dieter Maurer -
J Cameron Cooper