[Zope-dev] Dangerous shutdown procedure behaviour

9 Jul 2008

      Hi.

I think I have discovered dangerous behaviour in Zope's shutdown code.

I've had a strange problem on a site where persistent objects are created from
data inserted in an SQL table. Upon object creation, the SQL table is updated
to mark the row as imported.
Such an import was triggered just before a shutdown. After a restart and
another import, the documents had been created twice.

What I believe happened (though I could not find any hard evidence of it) is
that Zope blindly exited while the worker thread was running, and in the worst
possible method: tpc_finish.
ZODB had already committed, but MySQL had not. So MySQL rolled back its
changes, leaving the rows in a "ready to import" state, and they were imported
again at the next import attempt.
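
To make the suspected sequence concrete, here is a toy simulation. It is not
Zope, ZODB or MySQL code: ToyStore, ProcessKilled, commit and import_rows are
all hypothetical names, and the "crash" is just an exception raised between the
two tpc_finish calls.

```python
class ToyStore:
    """Hypothetical stand-in for a storage with pending and committed data."""

    def __init__(self, name, data):
        self.name = name
        self.committed = data
        self.pending = None

    def write(self, data):
        self.pending = data

    def tpc_finish(self):
        self.committed, self.pending = self.pending, None

    def rollback(self):
        self.pending = None


class ProcessKilled(Exception):
    """Simulates the process exiting in the middle of a commit."""


def commit(stores, die_after=None):
    # Finish each store in order; optionally "die" right after one of them.
    for index, store in enumerate(stores):
        store.tpc_finish()
        if die_after == index:
            raise ProcessKilled(store.name)


def import_rows(zodb, sql, die_after=None):
    # Create one document per row still marked "ready", then mark rows imported.
    ready = sorted(row for row, state in sql.committed.items() if state == "ready")
    zodb.write(zodb.committed + ready)
    sql.write({row: "imported" for row in sql.committed})
    try:
        commit([zodb, sql], die_after=die_after)
    except ProcessKilled:
        # Assumption: the SQL server rolls back when its client disappears.
        sql.rollback()


zodb = ToyStore("zodb", [])
sql = ToyStore("mysql", {"row1": "ready"})

import_rows(zodb, sql, die_after=0)  # killed after ZODB finished, before MySQL
import_rows(zodb, sql)               # next import after restart

print(zodb.committed)
print(sql.committed)
```

After the second call the toy ZODB holds the document for row1 twice, which
matches the duplication described above.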

Reading the shutdown code, I discovered 2 distinct timeout mechanisms (note:
either one alone is enough to trigger the problem):
 - Lifetime.py: iterating through asyncore sockets, it alerts
   servers that it will shut down soon. If they hold their veto for too long,
   the veto is ignored and shutdown continues.
   Default timeout is 20 seconds, meaning there is at most one minute from the
   first shutdown notice to the effective process exit (taking all running
   threads down).
   When invoking "zopectl stop", it's running a "fast" shutdown, which means
   the timeout is shortened to 1 second, so total maximum shutdown time is 3
   seconds.
   This timeout can be worked around by just writing blocking shutdown
   methods and not using the veto system.
 - zdaemon/zdrun.py: if the instance being shut down is still
   running after 10 seconds, it will be sent a SIGKILL.
   This cannot be worked around without changing code in zdrun.py or not
   executing it at all (I have no idea whether there is an alternative).
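
The veto timeout described in the first item can be sketched with a simulated
clock. This is not the Lifetime.py code: BusyServer and run_shutdown are
hypothetical, and the three phases are only inferred from the figures above
(20 seconds giving one minute, 1 second giving 3 seconds).

```python
class BusyServer:
    """Hypothetical server that vetoes shutdown while it still has work."""

    def __init__(self, busy_for):
        self.busy_for = busy_for  # simulated seconds of work remaining

    def vetoes(self, now):
        return now < self.busy_for


def run_shutdown(servers, timeout, phases=3, tick=1):
    """Return (simulated seconds waited, servers cut off mid-work)."""
    now = 0
    for _ in range(phases):
        waited = 0
        # Honour vetoes only until this phase's timeout expires.
        while waited < timeout and any(s.vetoes(now) for s in servers):
            waited += tick
            now += tick
    cut_off = [s for s in servers if s.vetoes(now)]
    return now, cut_off


stuck = BusyServer(busy_for=3600)
print(run_shutdown([stuck], timeout=20)[0])  # normal shutdown
print(run_shutdown([stuck], timeout=1)[0])   # "fast" shutdown
```

A server that is still busy when the last phase times out ends up in cut_off,
which is the case where a running tpc_finish would be interrupted.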

I could easily reproduce the problem by writing a simple connection manager
which calls time.sleep(3600) in its _finish method and defines a sortKey
method to make it commit after another connection manager.
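
A rough sketch of such a manager follows. It is not the code used for the
reproduction: SlowFinishManager is a hypothetical class that only mimics the
data manager methods called during two-phase commit, and the "~slow" sort key
is an assumption chosen to sort after typical keys.

```python
import time


class SlowFinishManager:
    """Hypothetical data manager whose finish step blocks for `delay` seconds."""

    def __init__(self, delay=3600, sort_key="~slow"):
        self.delay = delay
        self._sort_key = sort_key
        self.finished = False

    def sortKey(self):
        # Managers are committed in sortKey order, so a late-sorting key
        # makes this one finish after the others.
        return self._sort_key

    def tpc_begin(self, txn):
        pass

    def commit(self, txn):
        pass

    def tpc_vote(self, txn):
        pass

    def tpc_finish(self, txn):
        self._finish()

    def _finish(self):
        # The window in which a shutdown can kill the process.
        time.sleep(self.delay)
        self.finished = True

    def tpc_abort(self, txn):
        pass

    def abort(self, txn):
        pass
```

Joining an instance to a transaction and asking Zope to stop while _finish is
sleeping should leave the earlier manager committed and this one not.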

I could not find a trace of any mechanism preventing a commit from happening
while a shutdown is in progress, and I don't think there should be one:
some storages might be accessed through a network, so latency can make
tpc_finish take a long time to complete. Just checking that there is no
pending shutdown before entering this function would therefore not solve
the problem.
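
The reason a check on entry is not enough can be shown with two threads. This
is only an illustration with hypothetical names (shutdown_requested,
inside_finish, tpc_finish, commit); the wait inside tpc_finish stands in for
network latency.

```python
import threading

shutdown_requested = threading.Event()
inside_finish = threading.Event()
events = []


def tpc_finish():
    inside_finish.set()
    # Stand-in for a slow storage: block until the shutdown request arrives
    # (bounded to 5 seconds so the sketch cannot hang).
    shutdown_requested.wait(timeout=5)
    if shutdown_requested.is_set():
        events.append("shutdown arrived during tpc_finish")


def commit():
    # The check on entry passes because no shutdown is pending yet.
    if shutdown_requested.is_set():
        events.append("commit refused")
        return
    events.append("check passed")
    tpc_finish()


worker = threading.Thread(target=commit)
worker.start()
inside_finish.wait(timeout=5)  # worker is now inside tpc_finish
shutdown_requested.set()       # shutdown is requested only now
worker.join()
print(events)
```

The check passes, yet the shutdown request still lands while tpc_finish is
running, so the commit remains exposed to a timeout-driven exit.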

I suggest removing all those timeouts. If a user wants Zope to shut down for
a reason serious enough to send it a SIGKILL or to cause immediate python
thread termination, that is the user's responsibility.
But I think the regular shutdown mechanism must not do that.

Also, the same problem can happen with "zopectl fg" since Zope does not go 
through any shutdown sequence as far as I can tell (it just dies).

-- 
Vincent Pelletier