[ZODB-Dev] ZEO communication deadlock?

Tue May 11 10:39:52 EDT 2004

We occasionally see the following behaviour (Zope 2.7 + ZEO + psycopg):

   Zope suddenly stops processing requests (usually in high load
   situations) and apparently hangs.

   An external liveness checker detects the unresponsiveness
   and sends a SIGHUP to the Zope process. Zope does not react
   to this SIGHUP (we would see a log message if
   "Signals.Signals.restartHandler" were activated).

I interpret the "does not react to SIGHUP" as meaning that Zope is not
running any Python code, since signals are only noticed at Python
bytecode boundaries.

In principle, this could be caused by a C extension with
a blocking function that does not release the GIL. Indeed,
older versions of "psycopg" behaved this way in "connect".
However, we log all interactions with Postgres and would see
a blocking call there.

That leaves the possibility of a deadlock.
The asyncore thread would need to be part of this deadlock,
as it executes Python code whenever a new request arrives (and
in any case after 30 s have elapsed). Medusa itself tries hard
to decouple from the worker threads, so a deadlock among worker
threads should not affect Medusa itself.

That leaves the ZEO communication. If I am right, it does not decouple
communication from work and calls methods directly from the "asyncore"
thread. Some of these calls acquire locks, e.g. the calls that process
invalidations. Could we get a deadlock this way?
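
To make the suspected scenario concrete, here is a minimal sketch in
modern Python. It is not ZEO code. The names "cache_lock", "io_loop"
and "run_demo" are hypothetical, and the queue is only a stand-in for
the asyncore loop calling handlers inline. A "worker" holds a lock while
it waits for a reply that only the loop thread can deliver, and the loop
thread is stuck in an invalidation handler that wants the same lock.
The timeout exists only so that the demo terminates.

```python
import queue
import threading


def run_demo(wait=0.3):
    """Return (stalled, recovered) for a loop thread blocked by a held lock."""
    cache_lock = threading.Lock()      # hypothetical lock the handler needs
    events = queue.Queue()             # work for the single I/O loop thread
    reply_arrived = threading.Event()  # set when the reply is delivered

    def io_loop():
        # Stand-in for the asyncore loop: handlers run inline, one at a time.
        while True:
            event = events.get()
            if event == "stop":
                return
            if event == "invalidate":
                with cache_lock:       # blocks while the worker holds the lock
                    pass
            elif event == "reply":
                reply_arrived.set()

    loop_thread = threading.Thread(target=io_loop, daemon=True)
    loop_thread.start()

    cache_lock.acquire()               # the "worker" takes the lock ...
    events.put("invalidate")           # ... an invalidation arrives first ...
    events.put("reply")                # ... then the reply the worker needs
    # Without the timeout, this wait would never return: a deadlock.
    stalled = not reply_arrived.wait(wait)

    cache_lock.release()               # releasing the lock unblocks the loop
    recovered = reply_arrived.wait(5)
    events.put("stop")
    loop_thread.join(5)
    return stalled, recovered


if __name__ == "__main__":
    print(run_demo())
```

If the real code has this shape, neither thread can make progress once
the worker waits without a timeout, and the asyncore thread would stop
executing Python code in exactly the way described above.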

Maybe a ZEO communication expert can say something about this...

-- 
Dieter