[Zope] Link of the Day

Mon Jun 7 05:21:25 EDT 2004

Last Friday, my Zope-site locked up again. The same thing happened to
me 9 days earlier ("Zope locks up", 05/27/2004). I'll try to give more
details on what happened.

SETUP
=====

* RedHat 8
* Apache 2.0.40
* Zope 2.6.4
* Python 2.1.3
* CMF 1.2

Apache and Zope run on separate servers, both Dell PowerEdges.

The connection between Apache and Zope is established through RewriteRules
and SiteRoots (I'm planning to switch to VirtualHostMonster). I also
use Apache for SSL.

PERFORMANCE MONITORING
======================

I monitor the performance of the site by checking the number of
httpd-threads.  When everything is fine, there are about 20 to 23
threads. When this number goes up, the site is slowing down. When the
number of threads reaches 150 (the maximum number of threads configured
in httpd.conf), the site has locked up.  Nothing happens anymore.
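
A minimal sketch of this check in Python 3 (newer than the Python 2.1.3
listed above, so it would run beside the site rather than inside Zope).
It assumes that `ps -e -o comm=` prints one command name per process and
that the Apache workers show up as `httpd`. The thresholds 23 and 150 are
the figures mentioned above; the function names are made up for
illustration.

```python
import subprocess

MAX_CLIENTS = 150   # maximum configured in httpd.conf (figure from the text)
NORMAL_MAX = 23     # upper end of the "everything is fine" range


def count_httpd(ps_output):
    """Count lines of ps output whose command name is exactly 'httpd'."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == "httpd")


def classify(count, normal_max=NORMAL_MAX, max_clients=MAX_CLIENTS):
    """Map a worker count to a rough status label."""
    if count >= max_clients:
        return "locked up"
    if count > normal_max:
        return "slowing down"
    return "ok"


def current_status():
    """Run ps once and return (count, label)."""
    # Assumption: this lists processes, not threads; adjust the ps options
    # if the Apache workers are threads on your system.
    out = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    n = count_httpd(out)
    return n, classify(n)
```

Calling `current_status()` from cron every minute and logging the result
would give a rough timeline of when the count starts to climb.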

I run Zope with 4 threads to handle requests. Everybody tells me this
should be enough. But what if some of these threads are waiting for
data from external databases? In my case, LDAP was a problem. We used an
old version of OpenLDAP (1.2) which had some serious memory leaks. When
the load of the LDAP-server started peaking, the number of httpd-threads
peaked as well. When the load of the LDAP-server went down, the number
of httpd-threads went down too. I didn't have to do anything; the
problem disappeared all by itself.
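
A rough back-of-the-envelope sketch of why a few slow Zope threads can
fill up Apache. All numbers in the usage note are hypothetical (the
arrival rate and the service times were not measured), and the model is
deliberately simple: steady arrivals, a fixed time per request, and the
requests currently being served are ignored.

```python
def backlog_after(seconds, arrival_rate, workers, service_time):
    """Requests still waiting after `seconds`, assuming steady arrivals."""
    capacity = workers / service_time          # requests/second the pool finishes
    growth = max(0.0, arrival_rate - capacity)  # backlog only grows if overloaded
    return growth * seconds


def seconds_until_full(max_clients, arrival_rate, workers, service_time):
    """Time until `max_clients` requests are waiting, or None if never."""
    capacity = workers / service_time
    growth = arrival_rate - capacity
    if growth <= 0:
        return None   # the pool keeps up; the limit is never reached
    return max_clients / growth
```

With 4 threads and a hypothetical 10 requests per second, a service time
of 0.2 seconds gives a capacity of 20 requests per second and no backlog.
If a slow backend stretches the service time to 2 seconds, capacity drops
to 2 requests per second and 150 waiting requests pile up in under 19
seconds.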

THE PROBLEM
===========

On Friday, the number of httpd-threads suddenly started rising, and soon
reached the maximum of 150. The site had locked up.

I checked, but the problem was not related to LDAP (as far as I could
see). First, there were no load-problems with LDAP. Second, I switched
to another LDAP-server, which uses a recent version of OpenLDAP. This
version doesn't have those memory leak problems.

And the worst part: the problem didn't go away. The performance problems
with LDAP usually don't last very long. But this time (and the time
before), the problem lasted for hours.

INTERLUDE: WEBDAV/MICROSOFT PROBLEMS
====================================

I had a similar problem about a year ago. My website contains some
MS Word-documents. A couple of times, the site locked up when someone
tried to download one of those documents.

What happened was this: they clicked a link to a Word-document, so
it would open immediately in their browser. When this happens, some
versions of Microsoft Word keep on sending PROPFIND-requests (PROPFIND is
a WebDAV method) to the server (to the http-port, not the WebDAV-port).
This caused Zope to slow down, and eventually lock up.

I solved this problem by adding a RewriteRule that returns a "403 -
Forbidden" when someone sends a WebDAV-request to ports 80 or 443.
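
The decision that rule makes can be sketched in Python. This is not the
Apache configuration itself, only the logic behind it. The method list is
the set of methods that WebDAV (RFC 2518) adds to HTTP; the function name
is made up for illustration.

```python
# Methods that WebDAV (RFC 2518) adds on top of plain HTTP.
WEBDAV_METHODS = frozenset(
    ["PROPFIND", "PROPPATCH", "MKCOL", "COPY", "MOVE", "LOCK", "UNLOCK"]
)
# Ports on which only ordinary HTTP/HTTPS traffic is expected.
PLAIN_HTTP_PORTS = frozenset([80, 443])


def status_for(method, port):
    """Return 403 for a WebDAV method on port 80/443, else None (pass on)."""
    if port in PLAIN_HTTP_PORTS and method.upper() in WEBDAV_METHODS:
        return 403
    return None
```

For example, `status_for("PROPFIND", 80)` gives 403, while a normal
`GET` on port 80, or a `PROPFIND` on any other port, is passed on.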

THINGS I'VE TRIED
=================

* Restarting Zope: This didn't work. Immediately after I restarted
   Zope, the number of httpd-threads started rising again.

* Switching back to an older Data.fs: I had made some major changes on
   Friday, so maybe they were the cause of the problems. I replaced
   Data.fs with the Data.fs from the day before, but it didn't work. The
   number of httpd-threads rose again.

* Using another Zope server: Maybe it's a problem with my Zope
   server. Luckily, I have another Zope server running, with the same
   site. I changed httpd.conf, so that requests were forwarded to this
   other server. The problem didn't go away.

* Raising the maximum number of httpd-threads: This was a suggestion of
   a colleague.  I raised the maximum number of threads to 300. This
   time, the number of threads rose to 300. So the problem didn't
   go away.

* Excluding internal users with iptables: I configured iptables, so that
   any internal requests (requests from users at the university) were
   rejected. Immediately, the number of httpd-threads went down to a
   comfortable 21 (I did not restart Zope or Apache). When I restored
   iptables, so that internal requests were accepted again, the number
   of httpd-threads remained at 21.

   Temporarily excluding local users seemed to have solved the problem.

CONCLUSIONS
===========

The only thing I can conclude is that the problem is caused by requests
that somehow lock up Zope. These requests come from one or more users
at the university.

Unfortunately, I can only be sure if the problem happens again. I'm not
sure if I want this to happen. I am thinking of using some kind of log
replayer, so I can simulate what happened Friday with another server.
Maybe that way I can see if the requests are indeed the problem.
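
A minimal sketch of such a log replayer in Python 3. It assumes the
Apache access log is in Common Log Format and that clients are logged by
IP address. The network and the base URL passed in are placeholders you
would fill in yourself, and the function names are made up for
illustration. Only GET requests are replayed, one after another, so the
original timing and concurrency are not reproduced.

```python
import ipaddress
import re
import urllib.request

# Common Log Format: host ident user [time] "METHOD path protocol" status size
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "([A-Z]+) (\S+) [^"]*" \d{3} \S+'
)


def internal_requests(lines, network):
    """Yield (method, path) for requests whose client is inside `network`."""
    net = ipaddress.ip_network(network)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # not a request line in the expected format
        host, method, path = m.groups()
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            continue  # logged as a hostname; skipped in this sketch
        if addr in net:
            yield method, path


def replay(requests, base_url):
    """Re-send GET requests to a test server; other methods are skipped."""
    for method, path in requests:
        if method != "GET":
            continue
        try:
            with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                print(resp.status, path)
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            print("failed", path, exc)
```

Pointing `replay()` at the second Zope server while watching the number
of httpd-threads there would show whether the internal requests alone
are enough to trigger the lock-up.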

QUESTIONS
=========

1. Is my conclusion correct?

2. What would you do in this situation?

Bert Vanderbauwhede...
-- 
"I laugh in the face of danger. Then I hide until it goes away."
   -- Alexander LaVelle Harris