Hi,

with our new ZEO client systems, we frequently observe this problem:

A ZEO process starts to use 100% CPU time (user) without a significant increase in requests. Sometimes (but not always) the process stops answering requests while still using 100% CPU.

When we kill such a process, it changes to zombie state (shown in top as 'Z' and '<defunct>'). It still uses 100% CPU, but now as system time, not user time. The HTTP port is still in use, so we have to reboot the node to restart the ZEO client. The reboot usually fails because some filesystems cannot be unmounted: there are still locked files.

I tried both start modes, runzope and zopectl, but it made no difference.

All of this contradicts what I know about zombie processes: they should use no CPU time.

Versions are:

RedHat RHEL4, Kernel 2.6.9-42.0.10.ELsmp, with address extension (16 GB RAM)
Python 2.3.6
Zope 2.8.8

The older cluster nodes work perfectly and have never shown this zombie problem (they are connected to the same storage server); they run Debian Sarge, Kernel 2.4.27 SMP.

Any hint is appreciated.

Regards
Bengt
Bengt Giger wrote at 2007-4-10 15:36 +0200:
with our new ZEO client systems, we frequently observe this problem:
a ZEO process starts to use 100% CPU time (user) without a significant increase of requests. Sometimes (but not always) the process stops answering requests, still using 100% CPU.
When we kill such a process, it changes to zombie state (shown in top as 'Z' and '<defunct>'). It still uses 100% CPU, but now as system time, not user time. The HTTP port is still in use, so we have to reboot the node to restart the ZEO client. The reboot usually fails because some filesystems cannot be unmounted: there are still locked files.
I tried both start modes, runzope and zopectl, but it made no difference.
All of this contradicts what I know about zombie processes: they should use no CPU time.
I may have seen a similar problem (though I am not sure about the details):

The problem was buggy Python signal handling combined with a doubtful Linux thread implementation (2.4 kernel). When a fatal signal occurred, it killed the main thread, but all other threads were left in a strange state. Only a "SIGKILL" could get them out of this state. Of course, the ports remained open although nobody serviced them.

I am not sure whether the threads were in zombie state or used CPU. Probably not.

--
Dieter
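One way to check whether the scenario above applies (main thread gone, other threads still alive) is to look at the state and CPU counters of each thread rather than of the process as a whole. The following is only a minimal sketch, assuming a Linux kernel that exposes /proc/<pid>/task (2.6 does) and a current Python rather than the 2.3.6 mentioned in the thread. The function names parse_proc_stat and thread_states are made up for illustration and are not part of Zope or ZEO.

```python
import os


def parse_proc_stat(line):
    """Parse one line in /proc/<pid>/stat format into state and CPU ticks."""
    # The command name is parenthesised and may itself contain spaces or ')',
    # so split at the last closing parenthesis.
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition(" (")
    fields = tail.split()
    return {
        "pid": int(pid),
        "comm": comm,
        "state": fields[0],         # field 3 of stat: R, S, D, Z, ...
        "utime": int(fields[11]),   # field 14: clock ticks spent in user mode
        "stime": int(fields[12]),   # field 15: clock ticks spent in kernel mode
    }


def thread_states(pid):
    """Yield parsed stat info for every thread of a process (Linux only)."""
    task_dir = "/proc/%d/task" % pid
    for tid in sorted(os.listdir(task_dir), key=int):
        with open(os.path.join(task_dir, tid, "stat")) as f:
            yield parse_proc_stat(f.read())
```

Calling list(thread_states(pid)) twice, a few seconds apart, for the defunct process shows which thread IDs are still accumulating utime or stime. If the thread whose ID equals the process ID is in state 'Z' while another thread keeps gaining ticks, that would be consistent with surviving threads keeping the port and the open files busy.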