We have been chasing a problem with our Zope server. Comments from others here on the mailing list have given me the impression that others might be facing the same quandary.

Basically, the problem is that Zope intermittently "hangs". When I say "hangs", what I mean is:

1. It stops serving web requests. When attempting to access Zope via HTTP, the TCP connection succeeds, but nothing is ever returned.

2. We cannot get in through the monitor_client.

3. Zope appears to be using no CPU time. 'ps' reports that there are several Python processes associated with Zope, but none of them are doing anything.

4. Since HTTP does not work at all, access to /manage or /manage_debug doesn't get anywhere.

5. The only remedy which has worked is to kill the Zope server and restart it.

6. netstat does *not* report any unusual number of sockets owned by python Zope processes.

7. The size of the Python Zope processes does not seem unusual. Last time I checked, they were all around 13M each.

We run Zope 2.1.0 final, the version compiled by Digital Creations, downloaded from their site. The same problem existed on 2.1.0b2 and the 2.0 series.

The problem is not specifically reproducible. There is no sequence of steps which causes Zope to hang in the manner described. Nonetheless, it happens. Sometimes it happens a lot. Sometimes it happens infrequently.

We run ZServer as our web server. We have never tried running Zope behind Apache. (Should we?)

Our machine is a dual-PIII 550 with 512MB of RAM. The operating system is Debian 2.1 (potato). The problem has appeared under other configurations, but we have not tried Zope under any other version of Linux. Virtually nothing else ever runs on this machine, other than Zope.

The machine is sitting on a 10-base-T Ethernet, connected to a T-1 line. This Zope server has never been exposed to anything which I would consider to be a really heavy load.

The Zope server in question can be seen at www.sourcegear.com

It is [usually] up and running. :-)

Suspicions:

This seems to happen more often when we are editing things using /manage. But then again, there have been several times when it has hung while nobody was even around.

We have tried using wget -r to abuse ZServer in the hopes that we could get it to fail predictably. This worked once, causing a hang after five minutes or so. The next time we tried it, Zope carried on for half an hour under heavy load, with no problems. I don't mind fighting a bug, but I don't enjoy intermittent ones. :-(

Sometimes Zope stays up longer than others. We have had situations where it stayed up for three days. And we have had situations where it stayed up for three minutes. This latter situation *seemed* to be associated with the presence of a robot which was crawling our site. Every time we brought the site back up, the robot would resume, and the site would go back down.

Watching the log file caused us to suspect a problem with certain types of acquisition. We had some bad relative URLs in our content which were causing deep recursion. The crawler was happily crunching through everything it could find, and our log file revealed some extremely long URLs. [ We blocked the crawler at our firewall. :-) ]

However, we fixed every bad relative URL we could find, and the hangs have continued. Less frequently, I suppose, but they still continue.

Our site is not terribly large. We do have quite a few documents, but nothing out of the ordinary. Most or all of our images are PNG.

We *do* have a ZCatalog running.

We use the Knowledge product.

One section of our site is a database front end using ZMySQLDA. However, this part of the site is not currently visible to the rest of the world (at least there is no link to it). We have seen no particular correlation between frequency of access to this section of the site and frequency of "hangs".

We tried activating the debug_log once, using instructions obtained from a message posted to this list. Unfortunately, that was the attempt wherein Zope ran for half an hour under heavy load with no problem.

Our Data.fs file is 77MB. We have never compacted it.

Any advice would be *much* appreciated. We have a new service that we are *almost* ready to deploy, and we really need to resolve this problem before it goes out.

Thanks in advance.

--
Eric W. Sink, Software Craftsman
SourceGear Corporation
eric@sourcegear.com
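
Symptom 1 above (the TCP connection succeeds, but nothing is ever returned) can be told apart from a server that is simply down by an external probe. The following is only a minimal sketch, written in present-day Python and meant to run outside of Zope; the function name probe, its return values, and the default timeout are assumptions made for illustration, not anything taken from Zope or ZServer.

```python
import socket


def probe(host, port, path="/", timeout=10.0):
    """Classify a web server as 'ok', 'hung', or 'down'.

    'hung' means the TCP connection succeeded but no bytes came back
    before the timeout, which is the symptom described in the post.
    The name and the three labels are hypothetical.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, host unreachable, or the connect itself timed out.
        return "down"
    try:
        sock.settimeout(timeout)
        request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
        sock.sendall(request.encode("ascii"))
        data = sock.recv(1024)
    except socket.timeout:
        # Connected, request sent, but nothing was returned in time.
        return "hung"
    except OSError:
        return "down"
    finally:
        sock.close()
    # An empty read means the peer closed the connection without answering.
    return "ok" if data else "down"
```

A watchdog script could call probe("www.sourcegear.com", 80) every minute or so and record the time of the first "hung" result, which would at least narrow down when each hang begins.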
A followup to my own post. We are still having problems, but we have some new information:
We tried activating the debug_log once, using instructions obtained from a message posted to this list. Unfortunately, that was the attempt wherein Zope ran for half an hour under heavy load with no problem.
Our server got hung up again, last night at 3:42am. But this time, I had the debug log running. The result may be interesting to somebody, but I don't think I know enough to interpret it.

The last two entries in Z2.log are:

aaa.bbb.ccc.ddd - - [13/Dec/1999:03:42:50 -0500] "GET /SRF/images/srflogo64 HTTP/1.0" 200 18236 "http://www.sourcegear.com/SRF" "Mozilla/4.7 [en] (WinNT; U)"
aaa.bbb.ccc.ddd - - [13/Dec/1999:03:42:51 -0500] "GET /SRF/images/srfad HTTP/1.0" 200 49994 "http://www.sourcegear.com/SRF" "Mozilla/4.7 [en] (WinNT; U)"

and those entries seem to be reflected properly in the debug.log file:

B 143352400 1999-12-13T03:42:50 GET /SRF/images/srflogo64
I 143352400 1999-12-13T03:42:50 0
B 144071968 1999-12-13T03:42:50 GET /SRF/images/srfad
I 144071968 1999-12-13T03:42:50 0
A 143352400 1999-12-13T03:42:50 200 18236
E 143352400 1999-12-13T03:42:50
A 144071968 1999-12-13T03:42:50 200 49994
E 144071968 1999-12-13T03:42:51

The debug.log file has an obvious pattern. Every web server hit seems to generate four entries in the file, each with a different letter at the beginning of the line: B, I, A, and E. I'm speculating that these entries correspond to four stages of serving the transaction, and that the B and E probably refer to Beginning and End.

There are no further entries in Z2.log, but there are LOTS of subsequent entries in debug.log. The first few look like this:

B 143461304 1999-12-13T03:43:04 GET /robots.txt
I 143461304 1999-12-13T03:43:04 0
B 143095944 1999-12-13T03:43:10 GET /Store
I 143095944 1999-12-13T03:43:10 0
B 143514056 1999-12-13T03:44:59 GET /SOS
I 143514056 1999-12-13T03:44:59 0

and that remains the pattern for the remainder of the debug.log file. Every attempted web transaction generates a B line and an I line, but there is never an A or E line.

It is interesting to note that the first failed transaction attempt was a fetch of /robots.txt, but I think that's a coincidence. We do not have a robots.txt file on our site right now, but there are lots of other occurrences in the log files of attempts to fetch that file, returning the 404 properly.

Any ideas?

--
Eric W. Sink, Software Craftsman
SourceGear Corporation
eric@sourcegear.com
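
The B/I/A/E pattern described above lends itself to a small script that lists every request which was begun but never finished. The following is a minimal sketch in present-day Python. It assumes only what the excerpts show: whitespace-separated fields of letter, request id, timestamp, and optional detail, with a B line opening a request and an E line closing it. The reading of B and E as begin and end is the speculation from the post, not a documented fact, and the function name unfinished_requests is hypothetical.

```python
def unfinished_requests(lines):
    """Return (request_id, timestamp, detail) for each B line with no later E line.

    Assumes the line layout seen in the debug.log excerpts:
        <letter> <request id> <timestamp> [detail...]
    Request ids are treated as opaque strings; if an id is reused while
    still pending, the newer B line replaces the older one.
    """
    pending = {}
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # blank or malformed line
        code, req_id, stamp = parts[0], parts[1], parts[2]
        detail = parts[3].strip() if len(parts) > 3 else ""
        if code == "B":
            pending[req_id] = (stamp, detail)
        elif code == "E":
            pending.pop(req_id, None)
    # Dicts keep insertion order, so results follow the order of the B lines.
    return [(rid, stamp, detail) for rid, (stamp, detail) in pending.items()]
```

Passing the lines of debug.log to unfinished_requests (for example, an open file object) would give the first entry in the result as the earliest request that never completed, which is the natural place to start looking for whatever the publishing threads are stuck on.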