Last Friday, my Zope site locked up again. The same thing happened to me 9 days before ("Zope locks up", 05/27/2004). I'll try to give more details on what happened.

SETUP
=====

* RedHat 8
* Apache 2.0.40
* Zope 2.6.4
* Python 2.1.3
* CMF 1.2

Apache and Zope run on separate servers, both Dell PowerEdges. The connection between Apache and Zope is established through RewriteRules and SiteRoots (I'm planning to switch to VirtualHostMonster). I also use Apache for SSL.

PERFORMANCE MONITORING
======================

I monitor the performance of the site by checking the number of httpd threads. When everything is fine, there are about 20 to 23 threads. When this number goes up, the site is slowing down. When the number of threads reaches 150 (the maximum number of threads configured in httpd.conf), the site has locked up. Nothing happens anymore.

I run Zope with 4 threads to handle requests. This should be enough, everybody tells me. But what if some of these threads are waiting for data from external databases? In my case, LDAP was a problem. We used an old version of OpenLDAP (1.2) which had some serious memory leaks. When the load of the LDAP server started peaking, the number of httpd threads peaked as well. When the load of the LDAP server went down, the number of httpd threads also went down. In that case I don't have to do anything; the problem disappears all by itself.

THE PROBLEM
===========

On Friday, the number of httpd threads suddenly started rising, and soon reached the maximum of 150. The site had locked up.

I checked, but the problem was not related to LDAP (as far as I could see). First, there were no load problems with LDAP. Second, I switched to another LDAP server, which uses a recent version of OpenLDAP. This version doesn't have those memory leak problems.

And the worst part: the problem didn't go away. The performance problems with LDAP usually don't last very long. But this time (and the time before), the problem lasted for hours.
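For anyone who wants to reproduce the monitoring described under PERFORMANCE MONITORING, here is a minimal sketch in Python. It rests on several assumptions that are not in my setup notes above: it is written for a current Python 3 (not the 2.1.3 that Zope runs on), it assumes Apache runs in a process-per-worker mode so that each worker shows up as a process named "httpd" in `ps -eo comm`, and the function names are made up for illustration. The two thresholds are the numbers quoted above.

```python
import subprocess

BASELINE_MAX = 23   # "about 20 to 23" when everything is fine
HARD_LIMIT = 150    # maximum configured in httpd.conf


def count_httpd(ps_output):
    """Count lines of `ps -eo comm` output whose command name is httpd."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == "httpd")


def classify(count, baseline_max=BASELINE_MAX, hard_limit=HARD_LIMIT):
    """Map a worker count onto the three states described in the post."""
    if count >= hard_limit:
        return "locked up"
    if count > baseline_max:
        return "slowing down"
    return "ok"


def sample():
    """Take one measurement on the local machine (assumes a Linux-style ps)."""
    result = subprocess.run(
        ["ps", "-eo", "comm"], capture_output=True, text=True, check=True
    )
    return count_httpd(result.stdout)
```

Calling `classify(sample())` from cron every minute and logging the result with a timestamp would at least show when a rise starts, which can then be matched against the Apache access log.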

INTERLUDE: WEBDAV/MICROSOFT PROBLEMS
====================================

I had a similar problem about a year ago. My website contains some MS Word documents. A couple of times, the site locked up when someone tried to download one of those documents. What happened was this: they clicked a link to a Word document, so it would open immediately in their browser. When this happens, some versions of Microsoft Word keep sending PROPFIND requests (PROPFIND is a WebDAV method) to the server (to the HTTP port, not the WebDAV port). This caused Zope to slow down, and eventually lock up.

I solved this problem by adding a RewriteRule that returns a "403 - Forbidden" when someone sends a WebDAV request to port 80 or 443.

THINGS I'VE TRIED
=================

* Restarting Zope:
  This didn't work. Immediately after I restarted Zope, the number of httpd threads started rising again.

* Switching back to an older Data.fs:
  I had made some major changes on Friday, so maybe they were the cause of the problems. I replaced Data.fs with the Data.fs from the day before, but it didn't work. The number of httpd threads rose again.

* Using another Zope server:
  Maybe it's a problem with my Zope server. Luckily, I have another Zope server running, with the same site. I changed httpd.conf so that requests were forwarded to this other server. The problem didn't go away.

* Raising the maximum number of httpd threads:
  This was a suggestion of a colleague. I raised the maximum number of threads to 300. This time, the number of threads rose to 300. So the problem didn't go away.

* Excluding internal users with iptables:
  I configured iptables so that any internal requests (requests from users at the university) were rejected. Immediately, the number of httpd threads went down to a comfortable 21 (I did not restart Zope or Apache). When I restored iptables, so that internal requests were accepted again, the number of httpd threads remained at 21.
Temporarily excluding local users seemed to have solved the problem.

CONCLUSIONS
===========

The only thing I can conclude is that the problem is caused by requests that somehow lock up Zope. These requests come from one or more users at the university.

Unfortunately, I can only be sure if the problem happens again. I'm not sure if I want this to happen. I am thinking of using some kind of log replayer, so I can simulate what happened on Friday with another server. Maybe that way I can see if the requests are indeed the problem.

QUESTIONS
=========

1. Is my conclusion correct?
2. What would you do in this situation?

Bert Vanderbauwhede...

--
"I laugh in the face of danger. Then I hide until it goes away."
-- Alexander LaVelle Harris