[Zope-dev] Segfault and Deadlock

Sun May 2 11:10:02 EDT 2004

Hi Zope (and Python) experts!

There seems to be a problem when an external Python module segfaults
during a Zope request: the remaining worker threads end up deadlocked.

I think this is the same problem as Dieter pointed out in his message
to zope-dev "[Problem] strange state after SIGSEGV":

  http://mail.zope.org/pipermail/zope-dev/2004-March/022092.html

The reason is the way Python handles threads on some systems
(RedHat-7.3, kernel 2.4.20, without NPTL). I've written a small Python
extension that does nothing but segfault[1]. With it, I ran the
following simulation, in which one thread acquires a lock and then
segfaults:

  #!/usr/bin/env python2.3

  import thread
  import time
  import _segfault

  _lock = thread.allocate_lock()

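  def worker():
      time.sleep(10)
      # The first worker to get the lock crashes while holding it ...
      _lock.acquire()
      _segfault.segfault()
      # ... so this release is never reached.
      _lock.release()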

  thread.start_new_thread(worker, ())
  thread.start_new_thread(worker, ())
  thread.start_new_thread(worker, ())
  thread.start_new_thread(worker, ())

  time.sleep(3600)

  print 'Bye...'

On my RedHat-7.3 box (kernel 2.4.20-18, without NPTL) I get the
following behaviour. After starting the program, pstree shows this:

  bash(4103,wlang)---python2.3(4333)---python2.3(4334)-+-python2.3(4335)
                                                       |-python2.3(4336)
                                                       |-python2.3(4337)
                                                       `-python2.3(4338)

After the 10-second sleep, one worker gets the lock and segfaults.
After that, pstree shows this:

  init(1)-+-[...]
          |-python2.3(4336,wlang)
          |-python2.3(4337,wlang)
          |-python2.3(4338,wlang)

Three worker threads remain, now children of init; the main thread is
gone.

gdb shows that they are waiting for the lock, which they will never
get because the thread holding it has died:

  (gdb) info stack
  #0  0x420293d5 in sigsuspend () from /lib/i686/libc.so.6
  #1  0x40031609 in __pthread_wait_for_restart_signal ()
     from /lib/i686/libpthread.so.0
  #2  0x4003272c in sem_wait@@GLIBC_2.1 () from /lib/i686/libpthread.so.0
  #3  0x080c7b2d in PyThread_acquire_lock (lock=0x8170728, waitflag=1)
                    ^^^^^^^^^^^^^^^^^^^^^
      at Python/thread_pthread.h:406
  [...]
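
The waiting state in this backtrace does not depend on the crash
itself, only on the lock holder disappearing without releasing the
lock. A minimal sketch of that, assuming a modern Python 3 with the
`threading` module (not the python2.3 `thread` module used above).
The names `run_demo`, `holder` and `waiter` are invented for
illustration, and the timeout exists only so that the sketch
terminates; the real workers block forever in sem_wait:

```python
import threading


def run_demo(waiters=3, timeout=0.2):
    """Return one bool per waiter: did it manage to acquire the lock?"""
    lock = threading.Lock()
    outcomes = []

    def holder():
        # Stand-in for the segfaulting worker: it takes the lock and
        # ends without ever releasing it.
        lock.acquire()

    def waiter():
        # Stand-in for the surviving workers. The timeout is only
        # here so that the demo finishes.
        outcomes.append(lock.acquire(timeout=timeout))

    h = threading.Thread(target=holder)
    h.start()
    h.join()  # the holder is gone, the lock is still held

    threads = [threading.Thread(target=waiter) for _ in range(waiters)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


if __name__ == "__main__":
    print(run_demo())
```

No waiter gets the lock, which corresponds to the three processes
stuck in PyThread_acquire_lock.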

(On a side note, as python threads block all signals, these worker
threads cannot be stopped with SIGTERM. They must be killed with SIGKILL.)
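
Why a blocked SIGTERM has no effect can be sketched as follows. This
assumes a Unix system and a modern Python 3, where the mask has to be
set explicitly with `signal.pthread_sigmask` (python2.3 did the
blocking itself when it created the thread). The function name
`blocked_sigterm_stays_pending` is made up for this example:

```python
import signal
import threading


def blocked_sigterm_stays_pending():
    """Block SIGTERM in this thread, send it, and report if it is pending."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    try:
        # Send SIGTERM to the calling thread. Because it is blocked,
        # it is not delivered and the thread keeps running.
        signal.pthread_kill(threading.get_ident(), signal.SIGTERM)
        pending = signal.SIGTERM in signal.sigpending()
        if pending:
            # Consume the pending signal so that it does not terminate
            # the process once the old mask is restored.
            signal.sigwait({signal.SIGTERM})
        return pending
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    print(blocked_sigterm_stays_pending())
```

SIGKILL cannot be blocked this way, which is why it still works on
the stuck workers.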

All this has the consequences Dieter described:
>   Consequences:
> 
>     *  Zope did no longer respond to requests
> 
>     *  "stop" did not work (as "SIGTERM" was ineffective)
> 
>     *  "start" did not work, as the dangling processes kept
>        the HTTP port bound.

So I think I know what's happening, but I don't know how to fix it!
Can anyone help, please? Any hints are highly appreciated!

\wlang{}

PS: A RedHat-9 system (kernel 2.4.20, with NPTL) shows a different
behaviour. After the segfault, all threads disappeared. So maybe
all is ok with NPTL, but I've not tested it yet...

[1] segfault module

-segfault.c---------------

void
segfault(void)
{
  char *x = 0;

  *x = 'a';
}

-segfault.i----------------

%module segfault
%{
%}

void segfault(void);

-building:------------------

$ swig -python segfault.i
$ gcc -I/usr/local/include/python2.3 -c segfault_wrap.c -o segfault_wrap23.o
$ gcc -c -o segfault.o segfault.c
$ gcc -shared segfault_wrap23.o segfault.o -o _segfault.so
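
Without SWIG and a compiler, a similar crash can probably be provoked
from pure Python. This is only a sketch and rests on assumptions: it
needs the `ctypes` module, which python2.3 does not ship, and it runs
the crash in a child process so that the calling interpreter
survives. `CRASH_SNIPPET` and `run_crashing_child` are names chosen
for illustration:

```python
import signal
import subprocess
import sys

# Reading a C string from address 0 dereferences a NULL pointer,
# much like segfault() in segfault.c above.
CRASH_SNIPPET = "import ctypes; ctypes.string_at(0)"


def run_crashing_child():
    """Run the crashing snippet in a child and return its exit status."""
    proc = subprocess.run(
        [sys.executable, "-c", CRASH_SNIPPET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # On Unix a negative return code -N means "killed by signal N".
    return proc.returncode


if __name__ == "__main__":
    rc = run_crashing_child()
    if rc < 0:
        print("child killed by", signal.Signals(-rc).name)
    else:
        print("child exited with status", rc)
```

On Linux the child should be killed by SIGSEGV, so the return code
would be negative.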

-- 
Willi.Langenberger at wu-wien.ac.at                Fax: +43/1/31336/9207
Zentrum fuer Informatikdienste, Wirtschaftsuniversitaet Wien, Austria