Re: [Zope-DB] KeyError from Shared.DC.DA.DA
This probably merits the Zope Dev list too, since this is part of "core zope"... cheers, Chris FAIRS, Dan, GBM wrote:
Hi,
Under load, we're seeing KeyErrors from the _cached_result method of the DA class in the code used to clear out cached results. We're using Python 2.3.5, Zope 2.8.5, mxODBCZopeDA-1.0.8 running on RHEL3 connecting to a remote Microsoft SQL Server 2000 database.
Note this is a different issue to Collector #1745.
Here's the salient part of the traceback:
File "/opt/intranet/cbfmnet/zope_home/Products/Membership/Member.py", line 118, in _getDBValue values = self._getDBValues(attributes=[attribute]) File "/opt/intranet/cbfmnet/zope_home/Products/Membership/Member.py", line 96, in _getDBValues , userid=login_elements[1] File "/opt/intranet/cbfmnet/zope/lib/python/Shared/DC/ZRDB/DA.py", line 453, in __call__ result=self._cached_result(DB__, (query, self.max_rows_)) File "/opt/intranet/cbfmnet/zope/lib/python/Shared/DC/ZRDB/DA.py", line 375, in _cached_result if int(cache[q][0]) == key: KeyError: ("execute dbo.EmployeeByNT 'DOMAIN', 'user'\n", 1000, '\nDBConnId: None')
The ZSQL Method EmployeeByNT looks like this (it's actually an FSZSQL method in a skin):
<dtml-comment> connection_id: DB_Connection arguments:domain userid cache_time: 60 max_cache: 50000 </dtml-comment> execute dbo.EmployeeByNT <dtml-sqlvar domain type="string">, <dtml-sqlvar userid type="string">
We do have other ZSQL methods, but this one is by far the most heavily used.
Here's (most of) the offending method, _cached_result (I've snipped some code which just appends the db connection id to the query. Note also that Bucket is also defined as lambda: {} at the top of DA.py):
# Try to fetch from cache if hasattr(self,'_v_cache'): cache=self._v_cache else: cache=self._v_cache={}, Bucket() cache, tcache = cache max_cache=self.max_cache_ now=time() t=now-self.cache_time_ if len(cache) > max_cache / 2: keys=tcache.keys() keys.reverse() while keys and (len(keys) > max_cache or keys[-1] < t): key=keys[-1] q=tcache[key] del tcache[key] if int(cache[q][0]) == key: del cache[q] del keys[-1]
if cache.has_key(query): k, r = cache[query] if k > t: return r
# call the pure query result=apply(DB__.query, pure_query) if self.cache_time_ > 0: tcache[int(now)]=query cache[query]= now, result
return result
In this code, 'tcache' is a dict which maps integers representing the time to queries (the result of calling int(t) where t is the floating-point result from time.time()) , and 'cache' maps queries to a 2-tuple consisting of the floating-point timestamp of when the method was called, and the result of the query.
I've been trying to figure out a circumstance which could cause the KeyError we're seeing in production, without success.
Here are some observations which may give some clues:
The result of time.time() is converted to an integer and used as a key into tcache. AFAICS, this means that only 1 entry per second will be stored in tcache, even though every query will be stored in cache. It also looks like cache could be the source of a memory leak.
For example: - 'SELECT a FROM b' and 'SELECT c FROM b' run in the same 1-second window - 'SELECT c FROM b' will end up stored in tcache. - Both queries end up in cache. - 'SELECT a FROM b' will only ever be removed from cache if 'SELECT a FROM b' is executed once more later that second - and then 'SELECT c FROM b' will remain.
We only see the KeyError after 'some time' in the application. Presumably this is the time it takes for the caches to fill and start actively removing entries as per the while loop. (We restarted the cluster last night, and have yet to see the errors return).
The SQL query we are running maps an NT credential (domain and userid) to an internal user id from our corporate database. To give an idea of the usage pattern of this ZSQL method, it gets run at least once per request, with application serving approximately 150,000 - 180,000 requests per day, with 12,000 - 14,000 distinct users (and hence different arguments to the ZSQL method).
I'm going to put some logging code around the line that causes the KeyError next time we do a release. However - has anyone seen this before? Can anyone see what might be going wrong here?
Cheers, Dan
-- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk
participants (1)
-
Chris Withers