[Zope-dev] Runaway processes

5 Dec 2007

Hi everyone,

I have a problem that I hope someone has already solved, or that I can at
least get some input on. I apologize in advance for the lengthy e-mail, but I
wanted to provide a detailed discussion as a starting point.

Zope is designed to have very short-lived transactions. If transactions are
long-lived, all sorts of problems arise, most notably:

1. We occupy one thread for a long time.

2. The chance of conflict errors increases.

Problem 1 can be addressed by increasing the number of allowed threads or by
simply adding more Zope servers. But this clearly has its limits and is really
just a work-around. Another way to solve the problem is to identify
long-running operations and call them asynchronously. Many of us have
implemented solutions for this, one of which is lovely.remotetask.

Problem 2 can only be addressed by identifying the long-running tasks
beforehand and moving them into an async call, again via lovely.remotetask,
for example.
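
To make the async idea concrete, here is a minimal sketch in plain Python 3. It
is NOT the lovely.remotetask API; `TaskQueue`, `add`, `wait` and `slow_report`
are hypothetical names made up for illustration. The request thread only
enqueues the job and returns, while a single background worker does the work.

```python
import queue
import threading


class TaskQueue:
    """Hypothetical stand-in for a task service (not lovely.remotetask)."""

    def __init__(self):
        self._jobs = queue.Queue()
        self.results = {}
        # One daemon worker processes jobs outside the request threads.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def add(self, job_id, func, *args):
        """Enqueue a job and return immediately."""
        self._jobs.put((job_id, func, args))

    def _run(self):
        while True:
            job_id, func, args = self._jobs.get()
            try:
                self.results[job_id] = func(*args)
            except Exception as exc:
                # Store the failure instead of killing the worker.
                self.results[job_id] = exc
            finally:
                self._jobs.task_done()

    def wait(self):
        """Block until every queued job has been processed."""
        self._jobs.join()


def slow_report(n):
    # Placeholder for an operation known to be long-running.
    return sum(range(n))


tasks = TaskQueue()
tasks.add("report-1", slow_report, 1000)  # the caller is free right away
tasks.wait()                              # only needed here to inspect results
```

A real task service would also persist the jobs and run the work in its own
transaction; the sketch only shows the hand-off.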

But what happens if something unexpected occurs and we have an
unanticipated long-running process? The worst case is something that runs
forever. Then, whenever this problem occurs, one thread will be locked
forever, and we can have a total system lockdown in no time.

So how can this be solved? Effectively, from within Zope we cannot do 
anything, because (a) Zope makes no assumption about running in a thread, and 
(b) the application is stuck and won't have a hook to get unstuck.

So we have to solve the problem from outside. Currently, Zope is commonly run
in an application thread. At least both WSGI servers that we commonly use,
twisted and zserver, are implemented this way. This means that once some
criterion is met, most likely a timeout, the thread should be killed.
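
As a rough illustration of such a timeout criterion, a supervising thread can
at least detect that a worker has overrun its deadline. This is only a sketch
in Python 3; `run_with_deadline`, `quick` and `stuck` are hypothetical names,
and the code only detects the overrun, it does not stop the worker.

```python
import threading
import time


def run_with_deadline(func, timeout):
    """Run func in a worker thread; return True if it finished in time."""
    worker = threading.Thread(target=func, daemon=True)
    worker.start()
    worker.join(timeout)          # wait at most `timeout` seconds
    return not worker.is_alive()  # still alive means the deadline was missed


def quick():
    pass


def stuck():
    time.sleep(2)  # simulates a request that takes far too long


finished_quick = run_with_deadline(quick, 0.5)
finished_stuck = run_with_deadline(stuck, 0.2)
```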

But hold on! In Python, threads cannot be killed. :-( I have done some research
and found issue 221115 [1], which discusses the shortcoming of not being able
to kill a thread. The discussion ended in a feature request being added to
PEP 42 [2], which has not been implemented as far as I can tell. So I googled
some more to find possible implementations. Here are two distinctly different
solutions (others I have found are either obviously trivial and will not
work, or are derivatives of these two):

1. A Python-only solution using sys.settrace [3].

  Besides making everything very slow, the trace function installed with
  sys.settrace() is only called while Python byte code is being executed. So
  if a low-level call hangs the process, the trace hook will never be called.

2. Use an exception to intercept execution on the C-level [4].

  This looked very promising, until I read the following comment on the page:

    The exception will be raised only when executing python bytecode. If your   
    thread calls a native/built-in blocking function, the exception will be  
    raised only when execution returns to the python code.
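
For reference, the second approach can be sketched with `ctypes` and the
C-API function `PyThreadState_SetAsyncExc`. This is my own minimal Python 3
version, not the code from [4]; `ThreadKilled`, `async_raise` and `busy_loop`
are hypothetical names. It works here only because the worker keeps returning
to Python byte code; a thread stuck inside a blocking C call would never see
the exception.

```python
import ctypes
import threading
import time


class ThreadKilled(Exception):
    """Raised inside the target thread to make it stop."""


def async_raise(thread, exc_type):
    """Ask the interpreter to raise exc_type in the given thread."""
    tid = ctypes.c_ulong(thread.ident)
    affected = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        tid, ctypes.py_object(exc_type))
    if affected == 0:
        raise ValueError("no thread with that id")
    if affected > 1:
        # Should not happen; undo the request as the C-API docs advise.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(tid, None)
        raise SystemError("PyThreadState_SetAsyncExc affected several threads")


outcome = []


def busy_loop():
    try:
        while True:
            time.sleep(0.01)  # short sleeps, so byte code runs regularly
    except ThreadKilled:
        outcome.append("killed")


worker = threading.Thread(target=busy_loop, daemon=True)
worker.start()
time.sleep(0.1)
async_raise(worker, ThreadKilled)
worker.join(2)
```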

So my conclusion is that Python threads cannot be unconditionally killed. BTW,
if a blocking low-level call does not release the global interpreter lock,
then all Python threads are blocked. From the Python `thread` library
documentation [5]:

  Not all built-in functions that may block waiting for I/O allow other 
  threads to run. (The most popular ones (time.sleep(), file.read(), 
  select.select()) work as expected.)

In all fairness, though, those are very rare occurrences. Most libraries are 
non-blocking and the above solutions would be just fine. 

But in my case, I really need to find a way to kill a Zope execution 
environment when a C call hangs. So what other choices do we have?

On Unix-like systems, we can use `os.fork()`. The advantage of this approach
is that I can use OS calls to kill the process. However, ZODB database
storages cannot be shared between processes. Nikolay Kim has done some
preliminary experiments and found that `db.open()` locks up (for both
`FileStorage` and `ZeoClientStorage`). I have not verified these
results or tried to figure out why it is hanging, but I can see the problem
for `FileStorage`.
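
Leaving the ZODB questions aside, the fork-and-kill part could look roughly
like the sketch below (Python 3, Unix only). `run_in_child` and its return
values are hypothetical names of mine, and `time.sleep(60)` merely stands in
for a hung C call. The parent acts as the controller: it polls the child and
sends SIGKILL once the timeout has passed.

```python
import os
import signal
import time


def run_in_child(func, timeout, poll=0.05):
    """Fork, run func in the child, and SIGKILL it if it exceeds timeout.

    Returns "ok", "failed" or "killed".
    """
    pid = os.fork()
    if pid == 0:
        # Child process: do the work, then exit without running the
        # parent's cleanup handlers.
        status = 1
        try:
            func()
            status = 0
        finally:
            os._exit(status)

    # Parent process: poll the child until the deadline.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            clean = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
            return "ok" if clean else "failed"
        time.sleep(poll)

    os.kill(pid, signal.SIGKILL)  # cannot be caught or ignored by the child
    os.waitpid(pid, 0)            # reap the child to avoid a zombie
    return "killed"


result_fast = run_in_child(lambda: None, timeout=2.0)
result_hung = run_in_child(lambda: time.sleep(60), timeout=0.3)
```

The sketch deliberately opens nothing in the parent before forking; how an
already open storage or connection behaves across the fork is exactly the
open question.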

Are there any known side effects if I fork after the connection has been
made? Since I am using the original process merely as a controller, I guess I
should be fine. Of course, the interesting question is: what happens to the
ZODB connection, not to mention the DB, if it is in the middle of writing? I
guess the safest solution would be to fork within the bounds of the
transaction. Any comments will be very much appreciated.

Once we decide on the forking approach, we of course have to solve the issue
for Windows too. My googling did not immediately turn up anything, but I
think Windows' native threads will provide us with the necessary API, since
such a thread can be terminated at any time.
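
One alternative that avoids `os.fork()` altogether, and is not the
native-thread idea above, would be to run the work in a separate interpreter
process via `subprocess`, which can also be killed on Windows. A minimal
Python 3 sketch follows; `run_script_with_timeout` is a hypothetical helper,
and how Zope would hand a request to such a process is left open.

```python
import subprocess
import sys


def run_script_with_timeout(source, timeout):
    """Run Python source in a child interpreter.

    Returns the exit code, or None if the child had to be killed.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source], timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        return None
    return completed.returncode


exit_code = run_script_with_timeout("pass", 10)
hung_result = run_script_with_timeout("import time; time.sleep(60)", 0.5)
```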

.. [1]: http://bugs.python.org/issue221115
.. [2]: http://www.python.org/dev/peps/pep-0042/
.. [3]: http://www.velocityreviews.com/forums/t330554-kill-a-thread-in-python.html
.. [4]: http://sebulba.wikispaces.com/recipe+thread2
.. [5]: http://docs.python.org/lib/module-thread.html

Regards,
Stephan
-- 
Stephan Richter
CBU Physics & Chemistry (B.S.) / Tufts Physics (Ph.D. student)
Web2k - Web Software Design, Development and Training