[Python 2.3.4c1] nasty LinuxThread problem not solved
Hello Tim, I just checked that Python 2.3.4c1 does not yet fix our LinuxThread-Crash problem -- the problem that lets a multi-threaded application enter a curious state when one of the threads crashes. And this despite the patch for this problem in Python's collector... Unfortunately, I am not a "python-dev" subscriber. Could you please check whether this patch has a chance of becoming part of the official Python 2.3.4? If not, we would be forced to maintain our own Python version as Python's current crashing behaviour with LinuxThreads is unacceptable. -- Dieter
[Dieter Maurer]
I just checked that Python 2.3.4c1 does not yet fix our LinuxThread-Crash problem -- the problem that lets a multi-threaded application enter a curious state when one of the threads crashes.
I'm sorry to report that 2.3.4 final won't fix it either. The active Python bug report is here: [ 756924 ] SIGSEGV causes hung threads (Linux) http://www.python.org/sf/756924
And this despite the patch for this problem in Python's collector...
There's a patch that squashes the specific symptom you have in mind, but at the cost of other breakage -- the original patch was added for a reason too. Read the bug report. It's complicated, and too complicated to slam in any change here for 2.3.4 (according to Python's release manager, and according to Guido).
Unfortunately, I am not a "python-dev" subscriber. Could you please check whether this patch has a chance of becoming part of the official Python 2.3.4?
Doesn't look like it.
If not, we would be forced to maintain our own Python version as Python's current crashing behaviour with LinuxThreads is unacceptable.
Please add your concerns to the bug report. There are still open issues in coming up with a correct patch, but if those are resolved and "enough" people care, it should be enough to justify a 2.3.5 release.
Tim Peters wrote at 2004-5-21 10:16 -0400:
[Dieter Maurer]
I just checked that Python 2.3.4c1 does not yet fix our LinuxThread-Crash problem -- the problem that lets a multi-threaded application enter a curious state when one of the threads crashes. ... And this despite the patch for this problem in Python's collector...
There's a patch that squashes the specific symptom you have in mind, but at the cost of other breakage -- the original patch was added for a reason too.
I verified that <http://sourceforge.net/tracker/index.php?func=detail&aid=949332&group_id=5470&atid=305470> indeed fixes the problem. It might introduce other subtle problems but at least none that are revealed by Python's regression test suite... Moreover, I doubt that such problems will be significant in practice: The patch prevents blocking of signals that should (as specified by the PThreads standard) not be blocked -- as the operating system uses these signals to report serious problems. No application should use these signals for application-specific communication. Therefore, a violation of Python's principle to only deliver signals to the main thread seems appropriate for these signals. As an automatic restart after a crash is vital for our production Zope2 installations, we will probably bite the bullet and maintain our own Python version. -- Dieter
[Tim Peters]
There's a patch that squashes the specific symptom you have in mind, but at the cost of other breakage -- the original patch was added for a reason too.
[Dieter Maurer]
I verified that
<http://sourceforge.net/tracker/index.php?func=detail&aid=949332&group_id=5470&atid=305470>
indeed fixes the problem.
It fixes a problem, yes, and let's be clear that it does so by avoiding provoking a bug in LinuxThreads. If LinuxThreads had a POSIX-compliant implementation of signals, this discussion wouldn't be happening.
It might introduce other subtle problems but at least none that are revealed by Python's regression test suite...
The comments on the bug report are extensive: [ 756924 ] SIGSEGV causes hung threads (Linux) http://www.python.org/sf/756924 Guido applied Jason Lowe's original signal-blocking patch because he was persuaded it fixed significant thread problems at the time. Everyone (including Jason) now agrees that patch was too extreme, but the platform problems it intended to address still exist. It's certainly true that Python's test suite doesn't cover all endcase threading+signal interaction behaviors across dozens of incompatible thread implementations, and many such problems are exposed by GNU readline, which is plain difficult to test except interactively. So it goes.
Moreover, I doubt that such problems will be significant in practice:
As above, the original signal-blocking patch was added for reasons "in practice" that appeared sufficient at the time. If you want to argue that, the right (helpful) place to do so is in a comment attached to the bug report. ...
As an automatic restart after a crash is vital for our production Zope2 installations, we will probably bite the bullet and maintain our own Python version.
The 2.3.4 release manager rejected any change in this area for 2.3.4, and Guido agreed with that decision. 2.3.4 is just days away now, and there are several issues on several quite different platforms that need to be addressed simultaneously. Talking to them (via the Python bug report) may change their minds, but (a) I doubt it, and (b) nobody on zope-dev can change this. Adding comments to the bug report will still help to get it resolved for 2.3.5. An alternative to maintaining your own Python, and/or your own Linux, is to move to the current Linux thread implementation (NPTL), which doesn't have the LinuxThread signal bug that's the deeper cause of Zope's problems (on Linux boxes using LinuxThreads).
Tim Peters wrote at 2004-5-23 14:46 -0400:
... Not blocking signals that should not be blocked according to the PThreads standard ...
Moreover, I doubt that such problems will be significant in practice:
As above, the original signal-blocking patch was added for reasons "in practice" that appeared sufficient at the time. If you want to argue that, the right (helpful) place to do so is in a comment attached to the bug report.
You have snipped my explanation why I am convinced that the patch can only improve things! I have not argued that there was no case to block *some* signals, just not the ones that the operating system uses to signal major problems -- SIGSEGV, SIGBUS, ... and friends. The patch states that the pthreads standard says that such signals should not be blocked. This is a Python issue independent of the bug in LinuxThreads.
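For illustration only, the policy described above (block ordinary signals in spawned threads, but leave the fault signals the operating system raises alone) can be sketched in modern Python. This is not the SourceForge patch, which changes Python's C thread startup code. The sketch assumes Python 3.8 or later on a POSIX system, where signal.pthread_sigmask() and signal.valid_signals() exist; neither was available in Python 2.3. The names FAULT_SIGNALS, start_worker and current_mask are hypothetical, and the exact list of fault signals is an assumption.

```python
import signal
import threading

# Assumed list of signals the OS uses to report serious faults.
FAULT_SIGNALS = {signal.SIGSEGV, signal.SIGBUS, signal.SIGFPE, signal.SIGILL}


def start_worker(target, *args):
    """Start a thread that blocks every signal except the fault signals."""
    def wrapper():
        # pthread_sigmask() changes the mask of the calling thread only,
        # so the main thread keeps receiving ordinary signals.
        to_block = signal.valid_signals() - FAULT_SIGNALS
        signal.pthread_sigmask(signal.SIG_BLOCK, to_block)
        target(*args)

    thread = threading.Thread(target=wrapper)
    thread.start()
    return thread


def current_mask():
    """Return the calling thread's signal mask without changing it."""
    return signal.pthread_sigmask(signal.SIG_BLOCK, [])
```

Calling current_mask() from inside a worker started this way should show an ordinary signal such as SIGUSR1 as blocked while SIGSEGV and SIGBUS stay deliverable.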
... An alternative to maintaining your own Python, and/or your own Linux, is to move to the current Linux thread implementation (NPTL), which doesn't have the LinuxThread signal bug that's the deeper cause of Zope's problems (on Linux boxes using LinuxThreads).
Our system administrators have been sceptical about switching to NPTL. They say there are still some problems with it. I will raise the question again and see which way my colleagues feel is less problematic: using NPTL or our own Python version. -- Dieter
[Dieter Maurer]
You have snipped my explanation why I am convinced that the patch can only improve things!
Yes, because zope-dev isn't a useful place to discuss this complicated Python issue. If you missed it the first two times <wink>, let me suggest again that you add your comments to the bug report: that's the only place where the people fixing this problem, and the people making the release decisions, will see what you have to say (I have no say in what happens here, and I'm not working on resolving this issue either -- all I did is agitate to "do something" for 2.3.4, but I lost that battle).
I have not argued that there was no case to block *some* signals,
As the discussion in the bug report makes clearer, it's unclear whether Python "should be" blocking any signals anywhere.
just not the ones that the operating system uses to signal major problems -- SIGSEGV, SIGBUS, ... and friends. The patch states that the pthreads standard says that such signals should not be blocked.
This is a Python issue independent of the bug in LinuxThreads.
Python has its own threading model, which has to work across several thread implementations besides just pthreads. It doesn't *intend* to mimic the native platform thread gimmicks, it has to build on them. If LinuxThreads handled signals correctly (according to the pthreads standard), there wouldn't be a problem here.
Our system administrators have been sceptical about switching to NPTL. They say there are still some problems with it.
I don't know, but most things I've read about the *current* NPTL are very positive (orders of magnitude faster in some stress tests than LinuxThreads, and much better conformance to the standard). Earlier versions of NPTL got worse press.
I will reraise the question and see what my colleagues feel as the less problematic way: use NPTL or our own Python version.
It would sure help if people running Zope on an NPTL system spoke up here!
On Mon, 2004-05-24 at 11:29, Tim Peters wrote:
I will reraise the question and see what my colleagues feel as the less problematic way: use NPTL or our own Python version.
It would sure help if people running Zope on an NPTL system spoke up here!
I run Zope on an NPTL system (Fedora Core 1), but it's just a development box; as a result it doesn't get much load and doesn't stay up indefinitely. - C
[Chris McDonough]
I run Zope on an NPTL system (Fedora Core 1), but it's just a development box; as a result it doesn't get much load and doesn't stay up indefinitely.
It's better than knowing nothing <wink>. If that's the box you do overnight test runs on too, that's also interesting, since you frequently see failures in certain ZEO tests that are rarely (if ever) seen elsewhere.
On Mon, 2004-05-24 at 13:05, Tim Peters wrote:
[Chris McDonough]
I run Zope on an NPTL system (Fedora Core 1), but it's just a development box; as a result it doesn't get much load and doesn't stay up indefinitely.
It's better than knowing nothing <wink>. If that's the box you do overnight test runs on too, that's also interesting, since you frequently see failures in certain ZEO tests that are rarely (if ever) seen elsewhere.
That box actually runs LinuxThreads:

  [chrism@beach chrism]$ getconf GNU_LIBPTHREAD_VERSION
  linuxthreads-0.10

- C
On Mon, 24 May 2004 11:29:01 -0400, Tim Peters wrote:
[Dieter Maurer]
I will reraise the question and see what my colleagues feel as the less problematic way: use NPTL or our own Python version.
It would sure help if people running Zope on an NPTL system spoke up here!
My development machine uses NPTL:

  $ getconf GNU_LIBPTHREAD_VERSION
  NPTL 0.60

I am now starting to create a test deployment. That machine, which will be the production machine, also uses the same version of NPTL. Both systems are Debian with kernel 2.6.5. (FWIW)

-- 
mailhost:/etc/mail# less sendmail.cf
less: syntax of file "sendmail.cf" may induce nausea, show anyway? [n]

www: http://dman13.dyndns.org/~dman/    jabber: dman@dman13.dyndns.org
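The same check can be made from inside Python, which may be handy for a startup warning. The following is a minimal sketch, assuming a glibc-based system where os.confstr() recognises the name CS_GNU_LIBPTHREAD_VERSION; on other platforms the lookup is expected to fail and the function returns None. The function names are hypothetical.

```python
import os


def pthread_implementation():
    """Return the glibc thread library string, or None if unavailable."""
    try:
        # Same value that `getconf GNU_LIBPTHREAD_VERSION` prints on glibc.
        return os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (ValueError, OSError, AttributeError):
        # Unknown name, lookup failure, or no os.confstr on this platform.
        return None


def is_linuxthreads(version):
    """True if the version string identifies the old LinuxThreads library."""
    return version is not None and version.lower().startswith("linuxthreads")


if __name__ == "__main__":
    version = pthread_implementation()
    print("thread library:", version)
    if is_linuxthreads(version):
        print("warning: LinuxThreads detected")
```

With the two strings reported in this thread, is_linuxthreads() is true for "linuxthreads-0.10" and false for "NPTL 0.60".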
Dieter Maurer wrote:
<http://sourceforge.net/tracker/index.php?func=detail&aid=949332&group_id=5470&atid=305470>
indeed fixes the problem.
It might introduce other subtle problems but at least none that are revealed by Python's regression test suite...
Moreover, I doubt that such problems will be significant in practice:
You might want to take a look at: <http://sourceforge.net/tracker/?func=detail&aid=960406&group_id=5470&atid=305470> and comment on its effectiveness. It fixes the original problem that led to having signals blocked in spawned threads (essentially, that a Control-C in readline might be caught by the wrong thread), but without the blocking. Besides the segfault problem, it also fixes the way that os.system() and os.popen() spawn children with blocked signals, and similar problems. It does so in a more general-purpose way that should help Python on all supported platforms, and not just with the specifics of LinuxThreads.
alangmead@boston.com wrote at 2004-5-26 11:28 -0400:
... You might want to take a look at:
<http://sourceforge.net/tracker/?func=detail&aid=960406&group_id=5470&atid=305470>
and comment on its effectiveness. It fixes the original problem that led to having signals blocked in spawned threads (essentially, that a Control-C in readline might be caught by the wrong thread), but without the blocking. Besides the segfault problem, it also fixes the way that os.system() and os.popen() spawn children with blocked signals, and similar problems. It does so in a more general-purpose way that should help Python on all supported platforms, and not just with the specifics of LinuxThreads.
Note the following extract from the Python Library Reference (doc-2.3/lib/module-signal.html): "... and the main thread will be the only one to receive signals (this is enforced by the Python signal module, even if the underlying thread implementation supports sending signals to individual threads)." Thus, besides the "Control-C" issue, we also need to respect or update this assertion about signal delivery. -- Dieter
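One visible consequence of the main-thread rule quoted above is that the signal module refuses to install a Python-level handler from any other thread. A minimal sketch follows, assuming a POSIX system where SIGUSR1 exists; the function names are hypothetical. It shows only the handler-installation side of the rule and says nothing about which thread the operating system picks for delivery.

```python
import signal
import threading


def try_install_handler(signum=signal.SIGUSR1):
    """Try to install a no-op handler; return False if Python refuses."""
    try:
        previous = signal.signal(signum, lambda signo, frame: None)
    except ValueError:
        # Raised when called from a thread other than the main thread.
        return False
    if previous is not None:
        # Put the earlier handler back so the sketch has no side effects.
        signal.signal(signum, previous)
    return True


def install_from_worker():
    """Run try_install_handler() in a worker thread and return its result."""
    result = []
    worker = threading.Thread(target=lambda: result.append(try_install_handler()))
    worker.start()
    worker.join()
    return result[0]
```

install_from_worker() is expected to return False, because signal.signal() raises ValueError outside the main thread.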
participants (5)
- alangmead@boston.com
- Chris McDonough
- Derrick 'dman' Hudson
- Dieter Maurer
- Tim Peters