Hello everybody! We are encountering some really strange problems with Zope 2.7.7 on our RedHat EL 4 Linux machines. The Zope 2.7.7 compilation works - however most of the time "make test" returns a random number of errors (somewhere between 20 and 30), ALL related to ZEO. The funny thing is, we've managed to do a "make test" without any failures - however after doing a "make distclean" and compiling everything again, "make test" produces the above-mentioned errors (using *exactly* the same source code!). I have absolutely no idea how this can happen - ANY hints are appreciated! Is this a known issue? What could it be related to? Thanks a lot! Regards, Andreas Krasa
During the Zope 2.7.7 compilation works - however most of the time "make test" returns a random number of errors (somewhere between 20 and 30) ALL related to ZEO.
Maybe someone can help if you actually *tell us* what these errors are. At least my own crystal ball is in the shop for repairs right now... :) jens
Jens Vagelpohl wrote:
During the Zope 2.7.7 compilation works - however most of the time "make test" returns a random number of errors (somewhere between 20 and 30) ALL related to ZEO.
Maybe someone can help if you actually *tell us* what these errors are. At least my own crystal ball is in the shop for repairs right now... :)
jens
Hi!

Oops, almost forgot about those - the errors are as follows. They are always related to ZEO and an OSError "No child processes".

Thanks & best regards,
Andreas Krasa

---
======================================================================
ERROR: checkMultipleAddresses (ZEO.tests.testConnection.MappingStorageConnectionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkMultipleServers (ZEO.tests.testConnection.MappingStorageConnectionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkReadOnlyClient (ZEO.tests.testConnection.MappingStorageConnectionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkReadOnlyFallbackReadOnlyServer (ZEO.tests.testConnection.MappingStorageConnectionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkReadOnlyFallbackWritable (ZEO.tests.testConnection.MappingStorageConnectionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkReconnectWritable (ZEO.tests.testConnection.MappingStorageConnectionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkReconnection (ZEO.tests.testConnection.MappingStorageConnectionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkTimeout (ZEO.tests.testConnection.MappingStorageTimeoutTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkTimeoutAfterVote (ZEO.tests.testConnection.MappingStorageTimeoutTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkTimeoutOnAbortNoLock (ZEO.tests.testConnection.MappingStorageTimeoutTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: checkTimeoutProvokingConflicts (ZEO.tests.testConnection.MappingStorageTimeoutTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

======================================================================
ERROR: testMonitor (ZEO.tests.testMonitor.MonitorTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes
[Andreas Krasa] So you showed a number of test errors here, all on line 121 of ConnectionTests.py. They all seem identical to the first one:
ERROR: checkMultipleAddresses (ZEO.tests.testConnection.MappingStorageConnectionTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown os.waitpid(pid, 0) OSError: [Errno 10] No child processes
That's peculiar for several reasons, and-- sorry --I have no theory to offer. Because the failure is in tearDown(), the test body has _completed_. If there's no other failure reported for this test, that means the test succeeded. Instead it's dying while it's tearing the test framework down. Some of the ZEO tests create additional processes. The place it's dying is in this teardown loop:

    def tearDown(self):
        """Try to cause the tests to halt"""
        ...
        for pid in self._pids:
            os.waitpid(pid, 0)  # here

The test driver is simply waiting for the other process(es) it spawned to exit. If there was a failure to create the other process(es), then I'd expect a test to die long before teardown. But if it did create the other process(es), then I have no theory for why Linux waitpid() would claim there are no child processes.

Maybe this is relevant (it's the only vaguely similar report I recall seeing across years, but I don't think it reached a conclusion): http://mail.zope.org/pipermail/zope3-users/2005-July/000794.html
Andreas Krasa // WUW wrote at 2005-8-16 18:37 +0200:
... ====================================================================== ERROR: checkMultipleAddresses (ZEO.tests.testConnection.MappingStorageConnectionTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown os.waitpid(pid, 0) OSError: [Errno 10] No child processes
I have seen similar errors happening nondeterministically in the presence of a "SIGCHLD" handler set to "SIG_IGN". Such a handler causes the operating system to reap so-called zombie processes itself, and if the zombie no longer exists, "waitpid" will fail. Some *nix variants automatically pass the "SIG_IGN" down to child processes. Our Debian and SuSE Linux versions do. I had to change "Zope.Startup.run" not to use "SIG_IGN" as the "SIGCHLD" handler in order to avoid such problems. If you run your tests with "zopectl test", you may see this problem... -- Dieter
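To illustrate the effect Dieter describes, here is a minimal sketch in modern Python 3 (not the Python 2.3 used in the thread). It assumes a Linux system; the function name wait_with_sigchld_ignored is made up for this example and is not part of Zope or ZEO. It ignores SIGCHLD, spawns a trivial child, and reports which errno, if any, waitpid() fails with.

```python
import errno
import os
import signal
import sys


def wait_with_sigchld_ignored():
    """Spawn a child while SIGCHLD is ignored; return waitpid()'s errno or None."""
    previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    try:
        # The child exits at once. With SIGCHLD set to SIG_IGN, Linux does
        # not keep a zombie around for the parent to collect.
        pid = os.spawnv(os.P_NOWAIT, sys.executable,
                        [sys.executable, "-c", "pass"])
        try:
            os.waitpid(pid, 0)
        except OSError as exc:
            return exc.errno
        return None
    finally:
        # Restore whatever disposition was in place before the experiment.
        signal.signal(signal.SIGCHLD, previous)


if __name__ == "__main__":
    code = wait_with_sigchld_ignored()
    print("waitpid errno:", errno.errorcode.get(code, code))
```

On a Linux box this is expected to report ECHILD (errno 10), the same "No child processes" error shown in the tracebacks. Whether that is what actually happened on the RHEL4 machines is not established by the thread.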
Dieter Maurer wrote:
Andreas Krasa // WUW wrote at 2005-8-16 18:37 +0200:
... ====================================================================== ERROR: checkMultipleAddresses (ZEO.tests.testConnection.MappingStorageConnectionTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/src/__zope__/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown os.waitpid(pid, 0) OSError: [Errno 10] No child processes
I have seen similar errors happening non deterministically in the presence of a "SIGCHLD" handler set to "SIG_IGN". Such a handler causes the operating system to reap away so called zombie processes and if the zombie no longer exists, "waitpid" will fail.
Some *nix variants automatically pass the "SIG_IGN" down to child processes. Our Debian and SuSE Linux versions do. I had to change "Zope.Startup.run" not to use "SIG_IGN" as "SIGCHLD" handler in order to avoid such problems.
In case, you run your tests with "zopectl test", you may see this problem...
Hi Dieter! Thanks very much for your help! I will give this one a try! Btw. since this also happens on 5 other machines - all natively installed with RHEL4 - there actually might really be something wrong within the OS. Is that worth submitting a bug to RedHat? Or is it more like a "feature"? ;) Thanks again, Andreas
On 18 Aug 2005, at 07:50, Andreas Krasa // WUW wrote:
Is that worth submitting a bug to RedHat? Or is ist more like a "feature"? ;)
Why would RedHat care? They will just throw it back at you and say "sorry, Zope is not one of our supported packages". By the way, I hope you are not running Zope on the system-installed Python? If you do, then change your setups to build and install your own Python just for Zope and test again. jens
Jens Vagelpohl wrote:
On 18 Aug 2005, at 07:50, Andreas Krasa // WUW wrote:
Is that worth submitting a bug to RedHat? Or is ist more like a "feature"? ;)
Why would RedHat care? They will just throw it back at you and say "sorry, Zope is not one of our supported packages".
By the way, I hope you are not running Zope on the system-installed Python? If you do, then change your setups to build and install your own Python just for Zope and test again.
jens
Hi Jens, no, we've rebuilt python (2.3.5) from sources, and, as our main Zope product Silva requires this, also libxml2 and libxslt (of course with pointing to our own python). This stuff all resides in /usr/local. We've compiled Zope pointing to /usr/local/bin/python23, so I guess that RedHat's own python RPM does not interfere with Zope, at least I hope so. As I understood Dieter's mail, this strange behavior is caused by the way RedHat Enterprise Linux 4 system libraries handle SIG_IGN/SIGCHLD. If this problem was due to some improper Zope methods, most people would have this sort of problems. Which is not the case. That makes me believe that the failure of ZEO tests actually is caused by some uncommon or improper implementation of those two handles - which, in my opinion, makes it something RedHat should take a look at. Anyway - how severe are those testing failures for actually USING a ZEO client/server on that particular OS as a production system? Cheers, Andreas
On 18 Aug 2005, at 11:00, Andreas Krasa // WUW wrote:
As I understood Dieter's mail, this strange behavior is caused by the way RedHat Enterprise Linux 4 system libraries handle SIG_IGN/SIGCHLD.
That makes me wonder why it does not happen on my CentOS 4 box. CentOS 4 is compiled from RHEL4 SRPMs with only very minor changes, mostly to remove copyrighted RedHat logos and names.
Anyway - how severe are those testing failures for actually USING a ZEO client/server on that particular OS as a production system?
I personally would not trust a system for production where the unit tests don't run until it's proven that the unit tests themselves are dodgy, or whatever else caused it is fixed. jens
[Andreas Krasa]
... As I understood Dieter's mail, this strange behavior is caused by the way RedHat Enterprise Linux 4 system libraries handle SIG_IGN/SIGCHLD.
I don't know. Dieter asked whether you ran the tests via "zopectl test", but I didn't see an answer to that. If you run the Zope tests directly ("python test.py"), then the ZODB/ZEO tests never touch the OS's default handler for SIGCHLD; if you do use zopectl, zopectl.py _does_ set its own handler for SIGCHLD.

I'm not sure Dieter's info is current either. The SIGCHLD handler in current Zope 2.7.7's zopectl.py explicitly catches and ignores the specific exception you reported:

    def _ignoreSIGCHLD(*unused):
        while 1:
            try:
                os.waitpid(-1, os.WNOHANG)
            except OSError:
                break
    ...
    signal.signal(signal.SIGCHLD, _ignoreSIGCHLD)

But looks like Dieter added that code to begin with, so hard to believe he forgot about it ;-)
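As a hedged illustration of why a SIGCHLD handler matters at all here: a handler in the style of the one quoted above collects the exit status of any child that has finished, so a later waitpid() for that specific pid finds nothing left to wait for. The sketch below is modern Python 3, and the names reap_children, reaped and demo are invented for this example. Unlike the quoted handler, it also stops when waitpid() reports that children exist but none has exited yet.

```python
import errno
import os
import signal
import sys
import time

reaped = []  # pids whose exit status the handler has already collected


def reap_children(signum, frame):
    """SIGCHLD handler: collect every child that has already exited."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except OSError:
            break      # no children at all
        if pid == 0:
            break      # children exist, but none has exited yet
        reaped.append(pid)


def demo():
    """Return (handler_got_child, errno name of the later waitpid failure)."""
    previous = signal.signal(signal.SIGCHLD, reap_children)
    try:
        pid = os.spawnv(os.P_NOWAIT, sys.executable,
                        [sys.executable, "-c", "pass"])
        # Give the child time to exit and the handler time to run.
        deadline = time.time() + 10
        while pid not in reaped and time.time() < deadline:
            time.sleep(0.05)
        try:
            os.waitpid(pid, 0)
        except OSError as exc:
            return pid in reaped, errno.errorcode.get(exc.errno)
        return pid in reaped, None
    finally:
        signal.signal(signal.SIGCHLD, previous)


if __name__ == "__main__":
    print(demo())
```

This only shows the mechanism. According to Tim, running "python test.py" directly installs no such handler, so the sketch applies to the "zopectl test" case at most.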
If this problem was due to some improper Zope methods, most people would have this sort of problems. Which is not the case. That makes me believe that the failure of ZEO tests actually is caused by some uncommon or improper implementation of those two handles - which, in my opinion, makes it something RedHat should take a look at.
I don't believe anyone at this point knows why you're seeing this problem; the best way to make progress is to whittle it down to a small, self-contained test case. "Some ZEO tests fail sometimes" still involves mountains of code, including everything from the OS kernel to hundreds of .py files. The ZEO test process setup isn't anywhere near as complicated as zopectl, or as anything relying on zdaemon: the ZEO tests spawn processes directly via Python's os.spawnve(), and later wait for them to end, via the waitpid() code shown earlier. They don't muck around with signals, forks, or anything else that should be platform-dependent (the same ZEO-test process code is used on both Linux and Windows, BTW -- for this reason, it can't rely on any fancy signal or process gimmicks; spawnve+waitpid is the entire story here).
... Anyway - how severe are those testing failures for actually USING a ZEO client/server on that particular OS as a production system?
All the failures you showed were in test teardown. If that's all the failures you got, then all the test bodies actually passed. Of course you have to be wary that normal methods of detecting child-process termination aren't working as hoped on this box, because all the test failures you reported were exactly failures to detect child-process termination. I don't know how much of that Zope does, but can say ZODB/ZEO never does that in normal operation (spawning multiple processes, in ZODB+ZEO, is unique to the testing code; a ZEO server is a single process, and doesn't spawn other processes while it's running).
According to Tim Peters:
I don't know. Dieter asked whether you ran the tests via "zopectl test", but I didn't see an answer to that.
Ok, here are some data points...

bender:~/Zope-2.7.7-final$ cat /proc/version
Linux version 2.6.9-11.ELsmp (bhcompile@decompose.build.redhat.com) (gcc version 3.4.3 20050227 (Red Hat 3.4.3-22)) #1 SMP Fri May 20 18:26:27 EDT 2005

bender:~/Zope-2.7.7-final$ python2.3
Python 2.3.5 (#1, Apr 19 2005, 14:53:39)
[GCC 3.4.3 20041212 (Red Hat 3.4.3-9.EL4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Running one single test:

bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
======================================================================
ERROR: checkNoVerificationOnServerRestart (ZEO.tests.testConnection.FileStorageReconnectionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/wlang/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown
    os.waitpid(pid, 0)
OSError: [Errno 10] No child processes

----------------------------------------------------------------------
Ran 1 test in 0.689s

FAILED (errors=1)

After some retries, the same test passes:

bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
----------------------------------------------------------------------
Ran 1 test in 0.691s

OK

Interestingly, if I run the test with strace, I never see the test fail (I tried at least 30 times):

bender:~/Zope-2.7.7-final$ strace -e trace=signal -o /var/tmp/zeotest.trc python2.3 test.py testConnection checkNoVerificationOnServerRestart\$
Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python
----------------------------------------------------------------------
Ran 1 test in 0.710s

OK

(Obviously a Heisenberg effect -- the observation influences the behaviour ;-)

If anyone is interested in the trace file -- it can be found at:

http://slime.wu-wien.ac.at/misc/zeotest.trc

(However, it would be way more interesting to see the syscalls while the test is failing...)

Also, I debugged the whole test with the Python debugger. Unfortunately (as with strace), I was not able to reproduce the failing of the test in the debugger.
the ZEO tests spawn processes directly via Python's os.spawnve(), and later waits for them to end, via the waitpid() code shown earlier. It doesn't muck around with signals, forks, or anything else that should be platform-dependent (the same ZEO-test process code is used on both Linux and Windows, BTW -- for this reason, it can't rely on any fancy signal or process gimmicks; spawnve+watipid is the entire story here).
Yes, it's as simple as that: ZEO is started, ZEO is stopped, and when the parent calls waitpid, we get the "No child processes" error most of the time :-( Any ideas what we can try to narrow this down?
All the failures you showed were in test teardown. If that's all the failures you got, then all the test bodies actually passed. Of course you have to be wary that normal methods of detecting child-process termination aren't working as hoped on this box, because all the test failures you reported were exactly failures to detect child-process termination.
Sure -- we could just make this change:

bender:.../ZEO/tests$ diff ConnectionTests.py.ori ConnectionTests.py
121c121,124
<             os.waitpid(pid, 0)
---
>             try:
>                 os.waitpid(pid, 0)
>             except OSError:
>                 pass

then all tests will pass. But then we will not know why the ZEO zombie vanishes before the waitpid can reap the exit code ;-)

\wlang{}

PS: I'm afraid it turns out to be a Python thread / signals / race problem -- yuck!

--
Willi.Langenberger@wu-wien.ac.at Fax: +43/1/31336/9207
Zentrum fuer Informatikdienste, Wirtschaftsuniversitaet Wien, Austria
On 19 Aug 2005, at 02:22, Willi Langenberger wrote:
Interesstingly, if i run the test with strace, i never see the test fail (i tried at least 30 times):
This sounds like something Tim had mentioned at one point, where tests can fail on a machine that is "too fast"? Both with strace and the debugger you "slowed the test down" a bit, and they pass. Maybe that's a clue, basically pointing to the fact that the tests are dodgy rather than the software? jens
[Willi Langenberger]
Interesstingly, if i run the test with strace, i never see the test fail (i tried at least 30 times):
[Jens Vagelpohl]
This sounds like something Tim had mentioned at one point, where tests can fail on a machine that is "too fast"?
Dieter said that. I believe him, but it doesn't match my experience. Then again, I spend most of my time on Windows boxes, where process creation is so much slower (than on Linux) that Dieter can't even imagine that many orders of magnitude <wink>. Andreas Krasa reported that his tests were running on an Intel Xeon 3 GHz Dual-CPU box. Does nobody else here run on a Linux box that fast? I normally run tests on a 3.4 GHz hyperthreaded box, but it _is_ running WinXP instead, and the spawnve and waitpid implementations on Windows have nothing in common with their Linux implementations (e.g., most Unix signals, including SIGCHLD, don't exist on Windows -- the OSes have very different ways of managing processes).
Both with strace and the debugger you "slowed the test down" a bit, and they pass. Maybe that's a clue, basically pointing to the fact that the tests are dodgy rather than the software?
Trying to come up with a small, self-contained test case remains (IMO) the best way to make progress here. "The tests are dodgy" sounds appealing until you think about what they do related to the point of failure: spawn a process, and wait for it to exit later, passing waitpid() the pid returned by spawnve(). There just isn't anything complicated (at the Python level) going on there.
Hi Tim, Tim Peters wrote at 2005-8-19 11:15 -0400:
... "The tests are dodgy" sounds appealing until you think about what they do related to the point of failure: spawn a process, and wait for it to exit later, passing waitpid() the pid returned by spawnve(). There just isn't anything complicated (at the Python level) going on there.
There is one essential thing you stress over and over again -- but of which I am not sure: You say, the exception in "tearDown" means that the test completed successfully -- without any error. However, I am convinced that "tearDown" is called, too, when the test fails. I did not point this out earlier, because you are probably right. If the test itself had failed, we should probably have seen a previous exception and a "pid" cannot be registered for later clean up before it was created. Looks as if there were something that eats the dead child before the "waitpid" could take care of it. I know that a SIGCHLD/SIG_IGN can do that or a "waitpid(pid)" with "pid <= 0". If for some reason, a value "<= 0" happened to arrive in the list of processes to be cleaned up, then this could explain the strange non-deterministic behaviour. -- Dieter
[Dieter Maurer]
There is one essential thing you stress over and over again -- but which I am not sure:
You say, the exception in "tearDown" means that the test completed successfully -- without any error.
Oh no, that's not what I'm saying. As you say next,
However, I am convinced that "tearDown" is called, too, when the test fails.
That's right, it does. What I said is that _if_ the only listing of errors/failures we've seen here was in fact an exhaustive list of all the errors/failures that were seen in that run, _then_ we can deduce that the tests passed. That's simply because if a test body failed, that would have produced an _additional_ error/failure report. But every one of the error/failure output blocks in the message showing them was the same waitpid() complaint, reached from tearDown(). Assuming this was an exhaustive listing, there were no error/failure reports of any kind stemming from test setup or test body code, only from test teardown code. So it's not that I saw an error in tearDown() that causes me to believe "the tests passed", it's that we haven't seen any errors/failures _other_ than tearDown() errors. Willi was later kind enough to include what looked like a screen scrape of an entire test run, and I think we can be sure of that there: """ Running one single test: bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$ Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python ====================================================================== ERROR: checkNoVerificationOnServerRestart (ZEO.tests.testConnection.FileStorageReconnectionTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/wlang/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown os.waitpid(pid, 0) OSError: [Errno 10] No child processes ---------------------------------------------------------------------- Ran 1 test in 0.689s FAILED (errors=1) """ If the setup code or body of checkNoVerificationOnServerRestart had something to complain about too, I would expect to see an additional blob of ERROR or FAILURE output. 
The ConnectionTests.py code that starts a ZEO server process doesn't swallow exceptions, and simply cannot add the pid returned from spawnve() to its list of pids to wait for later unless ZEO/tests/forker.py's start_zeo_server() returns normally.
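The reasoning above (tearDown() runs whether or not the test body passed, and a failing body produces its own, additional report) can be checked with a small sketch using the standard unittest module in modern Python 3. The class and function names here are invented for the illustration, and the OSError is raised by hand to mimic the reported error.

```python
import io
import unittest


def make_case():
    """Build a TestCase whose tearDown always fails like the ZEO tests did."""
    class TearDownFails(unittest.TestCase):
        def tearDown(self):
            # Stand-in for the failing os.waitpid(pid, 0) call.
            raise OSError(10, "No child processes")

        def test_body_passes(self):
            pass

        def test_body_fails(self):
            self.fail("body failed")

    return TearDownFails


def run_one(name):
    """Run a single test method; return (number of failures, number of errors)."""
    suite = unittest.TestSuite([make_case()(name)])
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return len(result.failures), len(result.errors)


if __name__ == "__main__":
    print("body passes:", run_one("test_body_passes"))
    print("body fails: ", run_one("test_body_fails"))
```

With a passing body only the tearDown error is recorded; with a failing body there is one failure plus the tearDown error. A run that lists nothing but tearDown errors is therefore consistent with all test bodies having passed.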
I did not point this out earlier, because you are probably right.
If the test itself had failed, we should probably have seen a previous exception and a "pid" cannot be registered for later clean up before it was created.
As above, yes.
Looks as if there were something that eats the dead child before the "waitpid" could take care of it.
Yup.
I know that a SIGCHLD/SIG_IGN can do that or a "waitpid(pid)" with "pid <= 0".
If for some reason, a value "<= 0" happened to arrive in the list of processes to be cleaned up, then this could explain the strange non-deterministic behaviour.
Perhaps they can add some print statements or asserts then, to test that possibility. From the Python docs: If pid is 0, the request is for the status of any child in the process group of the current process. If pid is -1, the request pertains to any child of the current process. If pid is less than -1, status is requested for any process in the process group -pid (the absolute value of pid). If the OS happens to return a pid "with the sign bit set", I'm not sure whether the Python implementation of all this stuff would manage to do "the right thing". Python's waitpid() wrapper definitely treats the pid as a native signed C int, not as being of type pid_t. OTOH, pid_t isn't part of standard C, it's a Unix thing, and I believe pid_t _is_ C int in glibc. If so, then a pid "with the sign bit set" is simply impossible to use in a call to waitpid(), so it would be an OS bug if it ever returned a pid with the sign bit set.
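The kind of assert suggested here could look like the following sketch (modern Python 3). checked_waitpid is a hypothetical helper, not something that exists in ConnectionTests.py; it refuses any pid value that waitpid() would interpret as "any child" or "a process group".

```python
import os
import sys


def checked_waitpid(pid):
    """Wait for one specific child, rejecting pids that mean "any child"."""
    # pid == 0, pid == -1 and pid < -1 all have special meanings for
    # waitpid(), so only a strictly positive integer names a single child.
    if not isinstance(pid, int) or pid <= 0:
        raise AssertionError("suspicious pid in cleanup list: %r" % (pid,))
    return os.waitpid(pid, 0)


if __name__ == "__main__":
    child = os.spawnv(os.P_NOWAIT, sys.executable,
                      [sys.executable, "-c", "pass"])
    print(checked_waitpid(child))
```

If such a check ever fired in tearDown(), it would point at the pid list rather than at the operating system.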
[Willi Langenberger]
Ok, here some data points...
bender:~/Zope-2.7.7-final$ cat /proc/version Linux version 2.6.9-11.ELsmp (bhcompile@decompose.build.redhat.com) (gcc version 3.4.3 20050227 (Red Hat 3.4.3-22)) #1 SMP Fri May 20 18:26:27 EDT 2005
bender:~/Zope-2.7.7-final$ python2.3 Python 2.3.5 (#1, Apr 19 2005, 14:53:39) [GCC 3.4.3 20041212 (Red Hat 3.4.3-9.EL4)] on linux2 ...
Running one single test:
bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$ Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python ====================================================================== ERROR: checkNoVerificationOnServerRestart (ZEO.tests.testConnection.FileStorageReconnectionTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/wlang/Zope-2.7.7-final/lib/python/ZEO/tests/ConnectionTests.py", line 121, in tearDown os.waitpid(pid, 0) OSError: [Errno 10] No child processes
---------------------------------------------------------------------- Ran 1 test in 0.689s
FAILED (errors=1)
After some retries, the same test passes:
bender:~/Zope-2.7.7-final$ python2.3 test.py testConnection checkNoVerificationOnServerRestart\$ Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python ---------------------------------------------------------------------- Ran 1 test in 0.691s
OK
Interesstingly, if i run the test with strace, i never see the test fail (i tried at least 30 times):
bender:~/Zope-2.7.7-final$ strace -e trace=signal -o /var/tmp/zeotest.trc python2.3 test.py testConnection checkNoVerificationOnServerRestart\$ Running unit tests from /home/wlang/Zope-2.7.7-final/lib/python ---------------------------------------------------------------------- Ran 1 test in 0.710s
OK
(Obviously a Heisenberg effect -- the observation influences the behaviour ;-)
Not unusual, alas. What's more peculiar is that nobody else on this list reports the same problem. Then again, we have no way to know whether anyone other than Jens and I _tried_ to ;-)
If anyone is interessted in the trace file -- it can be found at:
http://slime.wu-wien.ac.at/misc/zeotest.trc
(However, it would be way more interessting to see the syscalls while the test is failing...)
Someone more up-to-date than I on the vagaries of Linux signals and strace might be able to deduce something from that about how SIGCHLD is treated by this OS. AFAICT, the SIGCHLD handler was set to SIG_DFL shortly after Python started, and wasn't fiddled with again. I don't know the exact intended meaning of every character in the: --- SIGCHLD (Child exited) @ 0 (0) --- lines near the end of the trace either.
Also, i debugged the whole test with the python debugger. Unfortunatly (as with strace), i was not able to reproduce the failing of the test in the debugger.
[Tim]
the ZEO tests spawn processes directly via Python's os.spawnve(), and later waits for them to end, via the waitpid() code shown earlier. It doesn't muck around with signals, forks, or anything else that should be platform-dependent (the same ZEO-test process code is used on both Linux and Windows, BTW -- for this reason, it can't rely on any fancy signal or process gimmicks; spawnve+watipid is the entire story here).
Yes, its as simple as that: zeo ist started, zeo is stopped, and when the parent calls waitpid, we get the "No child processes" error most of the time :-(
Any ideas what we can try to narrow this down?
Whittle it down. If I had a box on which I saw the problem, the next thing I'd try is writing a tiny Python program that did nothing other than spawn a simple process and then wait for it to finish. So far, there's no particular reason to believe that the mountain of Zope/ZODB/ZEO code really has anything to do with this, right? The outcome of trying to remove all that from the equation would suggest a next step. ...
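A tiny program of the kind described might look like this sketch. It is written for modern Python 3, not the Python 2.3.5 from the thread; the function name, the number of rounds and the delay are arbitrary choices for the illustration. The short sleep lets the child exit before the parent waits, which mirrors the "ZEO is stopped, then the parent calls waitpid" order.

```python
import os
import sys
import time


def spawn_and_wait(rounds=20, delay=0.2):
    """Spawn a trivial child with os.spawnve() and wait for it, repeatedly.

    Returns a list of (pid, errno) pairs for every waitpid() that failed.
    """
    failures = []
    args = [sys.executable, "-c", "pass"]
    for _ in range(rounds):
        pid = os.spawnve(os.P_NOWAIT, sys.executable, args, os.environ)
        time.sleep(delay)  # let the child finish before we wait for it
        try:
            os.waitpid(pid, 0)
        except OSError as exc:
            failures.append((pid, exc.errno))
    return failures


if __name__ == "__main__":
    failed = spawn_and_wait()
    print("%d waitpid() calls failed: %r" % (len(failed), failed))
```

If this never fails on an affected machine, the cause is more likely somewhere in the test harness or its environment; if it does fail, Zope, ZODB and ZEO are out of the picture.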
Sure -- we could just make this change:
bender:.../ZEO/tests$ diff ConnectionTests.py.ori ConnectionTests.py 121c121,124 < os.waitpid(pid, 0) ---
try: os.waitpid(pid, 0) except OSError: pass
then all tests will pass.
That should be verified (by actually trying it). For one thing, I count 8 instances of "os.waitpid(pid, 0)" in the Zope-2_7-branch branch, and it would be surprising if the other 7 always worked on your box, right? ;-)
But then we will not know why the zeo zombie vanishes before the waitpid can reap the exit code ;-)
Right, it would be papering over a symptom, leaving the cause unknown. If you find that expedient in your installation, that's fine, it's a key advantage of open source that you can worm around problems on your own. Of course I don't want to do that in the distributed code without understanding the problem first (for example, catching OSError here could _also_ end up hiding genuine bugs later -- there's no reason we know of to expect that waitpid() can fail here).
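For a local workaround, one way to reduce the risk of hiding unrelated bugs is to swallow only the specific "No child processes" error and let every other OSError propagate. This is a sketch in modern Python 3; wait_ignoring_echild is a made-up name, and the cause of the missing child stays unexplained either way.

```python
import errno
import os
import sys


def wait_ignoring_echild(pid):
    """Wait for pid, tolerating only ECHILD ("No child processes").

    Returns the (pid, status) tuple from os.waitpid(), or None when the
    child had already been reaped by something else.
    """
    try:
        return os.waitpid(pid, 0)
    except OSError as exc:
        if exc.errno != errno.ECHILD:
            raise  # any other failure is still reported
        return None


if __name__ == "__main__":
    child = os.spawnv(os.P_NOWAIT, sys.executable,
                      [sys.executable, "-c", "pass"])
    print(wait_ignoring_echild(child))  # first wait collects the status
    print(wait_ignoring_echild(child))  # second wait finds no such child
```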
... PS: I'm afraid it turns out to be a python thread / signals / race problem -- yuck!
If you can whittle it down, possible causes will become clearer. If you want to try some random thrashing, try Python 2.4.1. Dealing with signals is a cross-Unix mess, and LinuxThreads fails to conform to the POSIX standard in some obscure ways related to signals. 2.4.1 tried to worm around that.
Tim Peters wrote at 2005-8-18 17:16 -0400:
... I'm not sure Dieter's info is current either. The SIGCHLD handler in current Zope 2.7.7's zopectl.py explicitly catches and ignores the specific exception you reported:
Good! Something like that I did for our Zope 2.7.2.
... But looks like Dieter added that code to begin with, so hard to believe he forgot about it ;-)
Sometimes, changes by me happen to land in the core distribution, either via the Zope collector or because Andreas put them in there. In both cases, I do not necessarily know about that. I have not (yet) directly modified core Zope code. Maybe, in the near future... -- Dieter
Andreas Krasa // WUW wrote at 2005-8-18 08:50 +0200:
... Btw. since this also happens on 5 other machines - all natively installed with RHEL4 - there actually might really be something wrong within the OS.
Is that worth submitting a bug to RedHat? Or is it more like a "feature"? ;)
The Linux documentation explicitly mentions that it is not defined whether or not an ignored SIGCHLD is inherited by child processes. Thus, the RedHat behaviour at least conforms to the documentation. I would not call it a bug (although I had to change Zope code to get rid of the nasty side effects). -- Dieter
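For readers who want to see what an ignored SIGCHLD does to waitpid(), here is a small Unix-only sketch, assuming Python 3. It does not reproduce the inheritance question itself; it sets the disposition in the waiting process directly, which is an assumption made to keep the example short. The name wait_result is hypothetical.

```python
import errno
import os
import signal
import sys


def wait_result(disposition):
    """Spawn a child under the given SIGCHLD disposition and try to reap it.

    Returns "reaped" if os.waitpid() succeeds, or "ECHILD" if it fails
    with "No child processes". Hypothetical helper for illustration only.
    """
    previous = signal.signal(signal.SIGCHLD, disposition)
    try:
        pid = os.spawnve(os.P_NOWAIT, sys.executable,
                         [sys.executable, "-c", "pass"], os.environ)
        try:
            os.waitpid(pid, 0)
        except OSError as exc:
            if exc.errno == errno.ECHILD:
                return "ECHILD"
            raise
        return "reaped"
    finally:
        # Restore whatever was installed before; a handler that was not
        # installed from Python is reported as None, so fall back to SIG_DFL.
        if previous is None:
            previous = signal.SIG_DFL
        signal.signal(signal.SIGCHLD, previous)


if __name__ == "__main__":
    print("SIG_DFL:", wait_result(signal.SIG_DFL))
    print("SIG_IGN:", wait_result(signal.SIG_IGN))
```

With SIGCHLD ignored, terminated children are not kept as zombies, so there is nothing left for waitpid() to collect; that matches the "No child processes" error from the test run. A test runner started from a parent that ignores SIGCHLD could, under that assumption, see exactly this.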
[Andreas Krasa]
We are encountering some really strange problems with Zope 2.7.7 on our RedHat EL 4 Linux machines.
The Zope 2.7.7 compilation works - however most of the time "make test" returns a random number of errors (somewhere between 20 and 30) ALL related to ZEO.
The funny thing is, we've managed to do a "make test" without any failures - however after doing a "make distclean" and compiling everything again "make test" produces the above mentioned errors (using *exactly* the same source code!).
I have absolutely no idea how this can happen - ANY hints are appreciated! Is this a known issue?
No. For example, it doesn't happen in the daily overnight testrunner reports.
What could it be related to?
ZEO <wink>? You'll have to give more info about which tests fail, and precisely how they fail. Because many of the ZEO tests create multiple processes, and try to assign sockets so that these processes can communicate, they're vulnerable to vagaries of OS process scheduling and socket use by other apps. For example, on a slow or overburdened (with other simultaneous work) machine, some ZEO tests can fail due to not getting enough cycles soon enough. The worst tests of that sort wait as long as a minute now for another process to "do something" they're waiting for before failing, but not even waiting a minute can _guarantee_ success. Might be informative to run the tests on an otherwise-quiet machine.
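The kind of bounded waiting described here can be sketched roughly as follows, assuming Python 3. This is not the ZEO test code; wait_for_server and its parameters are hypothetical names chosen for the illustration.

```python
import socket
import time


def wait_for_server(host, port, timeout=60.0, interval=0.1):
    """Poll until something accepts connections on (host, port).

    Returns True once a connection succeeds, False if the deadline
    passes first. Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            conn = socket.create_connection((host, port), timeout=interval)
        except OSError:
            # Not listening yet (or too slow to answer): retry shortly.
            time.sleep(interval)
        else:
            conn.close()
            return True
    return False


if __name__ == "__main__":
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(1)
    print(wait_for_server("127.0.0.1", listener.getsockname()[1], timeout=2.0))
    listener.close()
```

However large the timeout, a loop like this can still give up just before the other process becomes ready, which is why a deadline reduces spurious failures without ruling them out.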
Tim Peters wrote:
Might be informative to run the tests on an otherwise-quiet machine.
Thank you Tim for the feedback! Our system is an Intel Xeon 3 GHz Dual-CPU with 2.5 GB RAM running RedHat Enterprise Linux 4 (SELinux disabled). As this is a test machine, it doesn't run any CPU-consuming tasks I can think of - the server load is usually somewhere between 0.00 and 0.10. But I'll check that nevertheless! Best regards Andreas
On 16 Aug 2005, at 17:42, Andreas Krasa // WUW wrote:
Our system is an Intel Xeon 3 GHz Dual-CPU with 2.5 GB RAM running RedHat Enterprise Linux 4 (SELinux disabled).
I just downloaded and ran all tests for Zope 2.7.7 on one of my boxes, a CentOS 4 install (same as RHEL 4) with all the latest fixes, and they all ran fine. This must be something having to do either with your specific environment or with the way you ran the unit tests. jens
Tim Peters wrote at 2005-8-16 12:35 -0400:
... For example, on a slow or overburdened (with other simultaneous work) machine, some ZEO tests can fail due to not getting enough cycles soon enough.
My experience has been that ZEO tests fail preferentially on especially fast machines. -- Dieter
participants (6)
- Andreas Krasa // WUW
- Andreas Krasa // ZID
- Dieter Maurer
- Jens Vagelpohl
- Tim Peters
- Willi Langenberger