Last Friday, my Zope site locked up again. The same thing happened to me 9 days before ("Zope locks up", 05/27/2004). I'll try to give more details on what happened.

SETUP
=====

* RedHat 8
* Apache 2.0.40
* Zope 2.6.4
* Python 2.1.3
* CMF 1.2

Apache and Zope run on separate servers, both Dell PowerEdges. The connection between Apache and Zope is established through RewriteRules and SiteRoots (I'm planning to switch to VirtualHostMonster). I also use Apache for SSL.

PERFORMANCE MONITORING
======================

I monitor the performance of the site by checking the number of httpd-threads. When everything is fine, there are about 20 to 23 threads. When this number goes up, the site is slowing down. When the number of threads reaches 150 (the maximum number of threads configured in httpd.conf), the site has locked up. Nothing happens anymore.

I run Zope with 4 threads to resolve requests. This should be enough, everybody tells me. But what if some of these threads are waiting for data from external databases? In my case, LDAP was a problem. We used an old version of OpenLDAP (1.2) which had some serious memory leaks. When the load of the LDAP-server started peaking, the number of httpd-threads peaked too. When the load of the LDAP-server went down, the number of httpd-threads also went down. In those cases I don't have to do anything; the problem disappears by itself.

THE PROBLEM
===========

On Friday, the number of httpd-threads suddenly started rising, and soon reached the maximum of 150. The site had locked up.

I checked, but the problem was not related to LDAP (as far as I could see). First, there were no load problems with LDAP. Second, I switched to another LDAP-server, which uses a recent version of OpenLDAP. This version doesn't have those memory leak problems.

And the worst part: the problem didn't go away. The performance problems with LDAP usually don't last very long. But this time (and the time before), the problem lasted for hours.
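The check described under PERFORMANCE MONITORING could be scripted roughly as follows. This is only a sketch in modern Python 3 (not the Python 2.1.3 that Zope runs on here). It assumes the Apache workers appear as entries named `httpd` in the output of `ps -e -o comm=`. The `WARN_AT` threshold and the function names are made up for illustration; only the figures 20 to 23 and 150 come from the message above.

```python
import subprocess

WARN_AT = 50       # hypothetical threshold, well above the usual 20 to 23
MAX_WORKERS = 150  # the limit from httpd.conf mentioned in the message


def count_httpd(ps_output, name="httpd"):
    """Count the lines of `ps -e -o comm=` output that equal the worker name."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def classify(count, warn_at=WARN_AT, max_workers=MAX_WORKERS):
    """Translate a worker count into the three states described in the message."""
    if count >= max_workers:
        return "locked up"
    if count >= warn_at:
        return "slowing down"
    return "ok"


def current_httpd_count():
    """Run ps once and count the httpd entries (assumes a ps that accepts -o comm=)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return count_httpd(result.stdout)
```

Calling `classify(current_httpd_count())` from cron every minute and logging the result would give a timeline showing when the count started to climb.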
INTERLUDE: WEBDAV/MICROSOFT PROBLEMS
====================================

I had a similar problem about a year ago. My website contains some MS Word documents. A couple of times, the site locked up when someone tried to download one of those documents. What happened was this: they clicked a link to a Word document, so it would open immediately in their browser. When this happens, some versions of Microsoft Word keep on sending PROPFIND requests (PROPFIND is a WebDAV method) to the server (to the HTTP port, not the WebDAV port). This caused Zope to slow down, and eventually lock up.

I solved this problem by adding a RewriteRule that returns a "403 - Forbidden" when someone sends a WebDAV request to port 80 or 443.

THINGS I'VE TRIED
=================

* Restarting Zope: This didn't work. Immediately after I restarted Zope, the number of httpd-threads started rising again.

* Switching back to an older Data.fs: I had made some major changes on Friday, so maybe they were the cause of the problems. I replaced Data.fs with the Data.fs from the day before, but it didn't work. The number of httpd-threads rose again.

* Using another Zope server: Maybe it's a problem with my Zope server. Luckily, I have another Zope server running, with the same site. I changed httpd.conf so that requests were forwarded to this other server. The problem didn't go away.

* Raising the maximum number of httpd-threads: This was a suggestion from a colleague. I raised the maximum number of threads to 300. This time, the number of threads rose to 300. So the problem didn't go away.

* Excluding internal users with iptables: I configured iptables so that any internal requests (requests from users at the university) were rejected. Immediately, the number of httpd-threads went down to a comfortable 21 (I did not restart Zope or Apache). When I restored iptables, so that internal requests were accepted again, the number of httpd-threads remained at 21.
Temporarily excluding local users seemed to have solved the problem.

CONCLUSIONS
===========

The only thing I can conclude is that the problem is caused by requests that somehow lock up Zope. These requests come from one or more users at the university.

Unfortunately, I can only be sure if the problem happens again. I'm not sure if I want this to happen. I am thinking of using some kind of log replayer, so I can simulate what happened Friday with another server. Maybe that way I can see if the requests are indeed the problem.

QUESTIONS
=========

1. Is my conclusion correct?
2. What would you do in this situation?

Bert Vanderbauwhede...
--
"I laugh in the face of danger. Then I hide until it goes away."
-- Alexander LaVelle Harris
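The log replayer idea from CONCLUSIONS might start from something like the sketch below. It is not an existing tool, and it assumes things the message does not state: that the access log holds request lines of the form `"GET /path HTTP/1.x"`, that replaying GET requests alone is enough, and that a test server is reachable at a base URL of your choosing. Requests are sent one after another without cookies or logins, so it cannot reproduce the concurrency of the original traffic. The names `get_paths` and `replay` are hypothetical.

```python
import re
import urllib.error
import urllib.request

# Matches the quoted request part of a log line, e.g. "GET /index_html HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*"')


def get_paths(log_lines):
    """Yield the path of every GET request found in the log lines, in order."""
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group("method") == "GET":
            yield match.group("path")


def replay(log_lines, base_url, timeout=10):
    """Re-issue each logged GET against base_url; return (path, outcome) pairs.

    The outcome is the HTTP status code, or the exception class name when the
    request failed without an HTTP response (for example a timeout).
    """
    results = []
    for path in get_paths(log_lines):
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                outcome = resp.status
        except urllib.error.HTTPError as exc:
            outcome = exc.code
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            outcome = type(exc).__name__
        results.append((path, outcome))
    return results
```

A run such as `replay(open("access.log"), "http://testserver:8080")` (both names are placeholders) would show which paths time out on the test server. Those would be the first candidates for the requests that lock up Zope.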
On Mon, Jun 07, 2004 at 11:21:25AM +0200, Bert Vanderbauwhede wrote:
The only thing I can conclude is that the problem is caused by requests that somehow lock up Zope. These requests come from one or more users at the university.
(snip)

Just a guess, but maybe there are some local user(s) who are infected with worms / viruses that are DoSing your Zope? I'd look through your Zope access log and see if it's only the number of requests that are a problem, or the

I've seen on some of my sites that extremely large numbers of requests to non-existent pages can effectively DoS Zope. But I've never found a solution other than waiting for the storm to subside. I'd like to do some research on this "if I get time"...

Another guess: Do you use ZEO? Can you see if there were several overlapping requests for downloading large files? I found that this slowed Zope so much that no more requests completed until one or several of the large file requests finally succeeded. I noticed in the ZEO server log that there were a bunch of cache flip messages in quick succession. My interpretation is that the ZEO cache is basically overloaded to the point of being useless, and most/all requests are waiting on the very busy ZEO connection. Things got better when I drastically increased the size of the ZEO client cache so that no single download could trigger a cache flip.

--
Paul Winkler
http://www.slinkp.com
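A first pass through the access log, as suggested above, might look like this sketch. It assumes the log lines start with the Common Log Format fields (host, ident, user, timestamp, request, status, size); whether a given Zope access log matches this exactly is an assumption to verify. The function name `summarize` is hypothetical. It counts requests per client, requests per HTTP method (which would make a PROPFIND flood visible), and 404 responses per client (the "non-existent pages" case).

```python
import re
from collections import Counter

# Assumed prefix: host ident user [time] "METHOD path PROTO" status size
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize(lines, top=5):
    """Count requests per host, per method, and 404 responses per host."""
    by_host = Counter()
    by_method = Counter()
    not_found = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        by_host[match["host"]] += 1
        by_method[match["method"]] += 1
        if match["status"] == "404":
            not_found[match["host"]] += 1
    return {
        "top_hosts": by_host.most_common(top),
        "methods": dict(by_method),
        "top_404_hosts": not_found.most_common(top),
    }
```

Running it only on the lines from the hours of the lock-up, and comparing with a normal day, would show whether a few internal addresses dominate the traffic.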
Bert Vanderbauwhede wrote:
The only thing I can conclude is that the problem is caused by requests that somehow lock up Zope. These requests come from one or more users at the university.
Unfortunately, I can only be sure if the problem happens again. I'm not sure if I want this to happen. I am thinking of using some kind of log replayer, so I can simulate what happened Friday with another server. Maybe that way I can see if the requests are indeed the problem.

QUESTIONS
=========
1. Is my conclusion correct?
2. What would you do in this situation?
Try www.apsis.ch/pound as a reverse proxy in front of Zope.

--
Jaroslav Lukesh
--
This e-mail cannot contain VIRUSES, because it does not come from the virus-ridden M$ Windows system!
Jaroslav Lukesh wrote:
Bert Vanderbauwhede wrote:
2. What would you do in this situation?
Try www.apsis.ch/pound as a reverse proxy in front of Zope.
A reverse proxy is on my to-do list. This is part of the switch from SiteRoot to VirtualHostMonster. Once I've got VirtualHosts defined in Apache, I can use Apache as a reverse proxy. Unfortunately, I can't cache pages for logged-in users.

BTW: It seems that I've used the wrong Subject line. Sorry, my mistake.

Bert Vanderbauwhede...
--
"I laugh in the face of danger. Then I hide until it goes away."
-- Alexander LaVelle Harris
participants (3)
- Bert Vanderbauwhede
- Jaroslav Lukesh
- Paul Winkler