[Zope-dev] Re: [Zope-DB] KeyError from Shared.DC.DA.DA

Mon Sep 25 14:04:24 EDT 2006

This probably merits the Zope Dev list too, since this is part of "core 
zope"...

cheers,

Chris

FAIRS, Dan, GBM wrote:
> Hi,
> 
> Under load, we're seeing KeyErrors from the _cached_result method of the DA
> class in the code used to clear out cached results. We're using Python
> 2.3.5, Zope 2.8.5, mxODBCZopeDA-1.0.8 running on RHEL3 connecting to a
> remote Microsoft SQL Server 2000 database.
> 
> Note this is a different issue to Collector #1745.
> 
> Here's the salient part of the traceback:
> 
>   File "/opt/intranet/cbfmnet/zope_home/Products/Membership/Member.py", line
> 118, in _getDBValue
>     values = self._getDBValues(attributes=[attribute])
>   File "/opt/intranet/cbfmnet/zope_home/Products/Membership/Member.py", line
> 96, in _getDBValues
>     , userid=login_elements[1]
>   File "/opt/intranet/cbfmnet/zope/lib/python/Shared/DC/ZRDB/DA.py", line
> 453, in __call__
>     result=self._cached_result(DB__, (query, self.max_rows_))
>   File "/opt/intranet/cbfmnet/zope/lib/python/Shared/DC/ZRDB/DA.py", line
> 375, in _cached_result
>     if int(cache[q][0]) == key:
> KeyError: ("execute dbo.EmployeeByNT 'DOMAIN', 'user'\n", 1000, '\nDBConnId:
> None')
> 
> The ZSQL Method EmployeeByNT looks like this (it's actually an FSZSQL method
> in a skin):
> 
> <dtml-comment>
> connection_id: DB_Connection
> arguments:domain userid
> cache_time: 60
> max_cache: 50000
> </dtml-comment>
> execute dbo.EmployeeByNT <dtml-sqlvar domain type="string">, <dtml-sqlvar
> userid type="string">
> 
> We do have other ZSQL methods, but this one is by far the most heavily used.
> 
> 
> Here's (most of) the offending method, _cached_result (I've snipped some
> code which just appends the db connection id to the query. Note also that
> Bucket is also defined as lambda: {} at the top of DA.py):
> 
>         # Try to fetch from cache
>         if hasattr(self,'_v_cache'): cache=self._v_cache
>         else: cache=self._v_cache={}, Bucket()
>         cache, tcache = cache
>         max_cache=self.max_cache_
>         now=time()
>         t=now-self.cache_time_
>         if len(cache) > max_cache / 2:
>             keys=tcache.keys()
>             keys.reverse()
>             while keys and (len(keys) > max_cache or keys[-1] < t):
>                 key=keys[-1]
>                 q=tcache[key]
>                 del tcache[key]
>                 if int(cache[q][0]) == key:
>                     del cache[q]
>                 del keys[-1]
> 
>         if cache.has_key(query):
>             k, r = cache[query]
>             if k > t: return r
> 
>         # call the pure query
>         result=apply(DB__.query, pure_query)
>         if self.cache_time_ > 0:
>             tcache[int(now)]=query
>             cache[query]= now, result
> 
>         return result
> 
> In this code, 'tcache' is a dict which maps integers representing the time
> to queries (the result of calling int(t) where t is the floating-point
> result from time.time()) , and 'cache' maps queries to a 2-tuple consisting
> of the floating-point timestamp of when the method was called, and the
> result of the query.
> 
> I've been trying to figure out a circumstance which could cause the KeyError
> we're seeing in production, without success. 
> 
> Here are some observations which may give some clues:
> 
> The result of time.time() is converted to an integer and used as a key into
> tcache. AFAICS, this means that only 1 entry per second will be stored in
> tcache, even though every query will be stored in cache. It also looks like
> cache could be the source of a memory leak.
> 
> For example:
> - 'SELECT a FROM b' and 'SELECT c FROM b' run in the same 1-second window 
> - 'SELECT c FROM b' will end up stored in tcache. 
> - Both queries end up in cache. 
> - 'SELECT a FROM b' will only ever be removed from cache if 'SELECT a FROM
> b' is executed once more later that second - and then 'SELECT c FROM b' will
> remain.
> 
> We only see the KeyError after 'some time' in the application. Presumably
> this is the time it takes for the caches to fill and start actively removing
> entries as per the while loop. (We restarted the cluster last night, and
> have yet to see the errors return).
> 
> The SQL query we are running maps an NT credential (domain and userid) to an
> internal user id from our corporate database. To give an idea of the usage
> pattern of this ZSQL method, it gets run at least once per request, with
> application serving approximately 150,000 - 180,000 requests per day, with
> 12,000 - 14,000 distinct users (and hence different arguments to the ZSQL
> method).
> 
> I'm going to put some logging code around the line that causes the KeyError
> next time we do a release. However - has anyone seen this before? Can anyone
> see what might be going wrong here?
> 
> Cheers,
> Dan

-- 
Simplistix - Content Management, Zope & Python Consulting
            - http://www.simplistix.co.uk