[Zope-dev] Dangerous shutdown procedure behaviour
Vincent Pelletier
vincent at nexedi.com
Wed Jul 9 09:00:32 EDT 2008
Hi.
I think I have discovered dangerous behaviour in the Zope shutdown code.
I've had a strange problem on a site where persistent objects are created from
data inserted in an SQL table. Upon object creation, the SQL table is updated
to mark the row as imported.
Such an import was triggered just before a shutdown. After a restart and
another import, documents were created twice.
What I believe happened (but I could not find any hard evidence of it) is that
Zope blindly exited while the worker thread was running, and in the worst
possible method: tpc_finish.
ZODB was already committed, but MySQL was not. So MySQL rolled back its
changes, leaving the rows in a "ready to import" state, and they were imported
again at the next import attempt.
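To make the suspected sequence concrete, here is a minimal sketch in plain
Python. It does not use ZODB or MySQL at all: FakeStore, run_import and
kill_between_finishes are hypothetical names that only model the ordering
described above (the object store finishes, the process dies, the SQL side
rolls back).

```python
class FakeStore:
    """Hypothetical stand-in for one transactional resource."""

    def __init__(self, data):
        self.committed = data   # durable state
        self.pending = None     # state staged by the current transaction

    def stage(self, data):
        self.pending = data

    def finish(self):
        # Plays the role of tpc_finish: staged state becomes durable.
        self.committed = self.pending
        self.pending = None

    def rollback(self):
        # What the database does when its client vanishes before commit.
        self.pending = None


def run_import(zodb, sql, kill_between_finishes=False):
    """Import every row still marked 'ready', then mark all rows 'imported'."""
    ready = [row for row, state in sql.committed.items() if state == "ready"]
    zodb.stage(zodb.committed + ready)
    sql.stage({row: "imported" for row in sql.committed})
    zodb.finish()               # first resource is now durable
    if kill_between_finishes:
        sql.rollback()          # process killed here: SQL never commits
        return
    sql.finish()


zodb = FakeStore([])
sql = FakeStore({"row1": "ready", "row2": "ready"})
run_import(zodb, sql, kill_between_finishes=True)  # shutdown hits mid-commit
run_import(zodb, sql)                              # import after restart
print(zodb.committed)  # every row has been imported twice
```

The second run sees the rows as still "ready" because only the first of the
two resources made its changes durable.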
Reading the shutdown code, I discovered 2 distinct timeout mechanisms (note:
having just one is enough to trigger the problem):
- Lifetime.py: iterating through asyncore sockets, it alerts
servers that it will shut down soon. If they hold their veto for too long,
the veto is ignored and shutdown continues.
Default timeout is 20 seconds, meaning there is at most one minute from the
first shutdown notice to the effective process exit (taking all running
threads down).
"zopectl stop" runs a "fast" shutdown, which means the timeout is
shortened to 1 second, so the total maximum shutdown time is 3
seconds.
This timeout can be worked around by just writing blocking shutdown
methods and not using the veto system.
- zdaemon/zdrun.py: if the instance being shut down still
responds after 10 seconds, it will be sent a SIGKILL.
This cannot be worked around without changing code in zdrun.py or not
executing it at all (no idea if there is any alternative).
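The second mechanism follows a common supervisor pattern: ask politely, wait
for a fixed grace period, then kill. The sketch below is not the zdrun.py
code, only an illustration of that pattern on a POSIX system;
stop_with_deadline and grace are hypothetical names, and the grace period is
shortened so the example finishes quickly. The child ignores SIGTERM to stand
in for a process that is busy inside a commit.

```python
import signal
import subprocess
import sys

# Child process: ignores SIGTERM, as if it were stuck in a long commit.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_deadline(proc, grace):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL."""
    proc.terminate()
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()   # the child gets no chance to finish its work
        proc.wait()
    return proc.returncode


proc = subprocess.Popen(
    [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
)
proc.stdout.readline()  # wait until the child has installed its handler
returncode = stop_with_deadline(proc, grace=0.5)
proc.stdout.close()
print(returncode == -signal.SIGKILL)
```

Whatever the child was doing when the grace period expired is cut off, which
is exactly the window in which a half-finished commit can be lost.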
I could easily reproduce the problem by writing a simple connection manager
which calls time.sleep(3600) in its _finish method and defining a sortKey
method to make it commit after another connection manager.
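A rough, standalone sketch of that reproduction follows. It does not import
Zope or the transaction package: RecordingManager and finish_all are
hypothetical, and finish_all only imitates the way data managers are ordered
by sortKey before tpc_finish is called on each. The delay is a parameter so
the example runs quickly; the real reproduction would use a much longer sleep
and request a shutdown while it is in progress.

```python
import time


class RecordingManager:
    """Hypothetical data manager that records when its finish step ran."""

    def __init__(self, name, log, delay=0.0):
        self.name = name
        self.log = log
        self.delay = delay

    def sortKey(self):
        # Managers are finished in ascending sortKey order.
        return self.name

    def _finish(self):
        time.sleep(self.delay)      # simulates a slow remote commit
        self.log.append(self.name)

    def tpc_finish(self, transaction):
        self._finish()


def finish_all(managers, transaction=None):
    """Imitation of the final commit phase: finish managers in sortKey order."""
    for manager in sorted(managers, key=lambda m: m.sortKey()):
        manager.tpc_finish(transaction)


log = []
fast = RecordingManager("a-object-store", log)
slow = RecordingManager("z-slow-sql", log, delay=0.1)
finish_all([slow, fast])
print(log)  # the slow manager finishes last, after the fast one is durable
```

Any process exit during the slow manager's sleep leaves the first manager
committed and the second one not.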
I could not find a trace of any mechanism preventing a commit from happening
when a shutdown is in progress, and I don't think there should be one:
some storages are accessed through a network, so latency can make
tpc_finish take a long time to complete. Checking that there is no pending
shutdown before entering this function would therefore not solve the
problem, since the shutdown can still start while tpc_finish is running.
I suggest removing all those timeouts. If a user wants a Zope to shut down for
a reason serious enough to send it a SIGKILL or to cause immediate Python
thread termination, that is the user's responsibility.
But I think the regular shutdown mechanism must not do that.
Also, the same problem can happen with "zopectl fg" since Zope does not go
through any shutdown sequence as far as I can tell (it just dies).
--
Vincent Pelletier