FuBuJo wrote at 2008-3-14 13:31 +0000:
You need to be a bit more careful in your description. For example, the diagram "Apache -> Zeo -> Zope(ZODB)" is very confusing. It is very rare that Apache speaks to Zeo. The confusion between Zope and Zeo may run through your whole description, so that it is often unclear whether you really mean Zope when you write Zope and Zeo when you write Zeo. More below.
... The traffic is heavy write traffic (I read some of Dieters posts and am testing that out as well). Once overall load hits about 100 people or so the Zeo's start dying
Here again, you use the wrong word: "dying" would mean that your ZEO process terminates, but below you say that it gets slower.
- heavy load, slow response, python takes all CPU/Memory.
Which "python"? The "python" executing Zeo? Or the one executing Zope?
Then when traffic is removed from the ZEO instance ... the system remains CPU bound by the python process ... and you have to bounce Zope(Zeo instance) and Apache to free it.
Which system? The one running ZEO (the ZEO server) or the one running Zope?
The ZODB reports heavy Clients waiting ... but doesn't budge on load.
You see this in the ZEO logfile? Then it is ZEO which reports the waiting -- not the ZODB.
So ... anyone have any suggestions.
We are having similar problems -- I call them commit congestions. As far as we understand it by now, it is a problem with multiple causes. Commit congestions can be caused on the client (=Zope) side and on the server (=ZEO) side.

A client drastically increases the probability of commit congestions when it does expensive things while it holds the commit lock, i.e. during the second phase of the two-phase commit protocol. We have identified three causes:

* garbage collections
  During a garbage collection, the garbage collector holds the GIL and blocks all Python activity. We found that a single generation 2 (i.e. full) garbage collection can take between 10 and 20 s. We had a bad text index implementation that caused excessive object creation and thereby lots of garbage collections. Our measure has been to drop the bad index implementation and to reconfigure the garbage collector to reduce the garbage collection frequency by a factor of 1000.
* "stat"s in the second commit phase
  In our system, "stat"s for NFS served files could take up to 27 s. It is a complete mystery why. Local IO, too, occasionally seemed to need excessive time. This, too, is still mysterious. We may have some hints: some ranking bugs in a search engine could cause millions of IO operations within a short timeframe and may have significantly affected the Linux IO behaviour.
* invalidation message reception and corresponding client cache updates during the second commit phase

Other causes for commit contention come from the (Zeo) server:

* "FileStorage.pack" unnecessarily holds the commit lock during large periods of the copying phase, drastically increasing the probability of commit contentions.
* during some pack phase (reachability analysis), access to the storage file is high volume and erratic. This drastically reduces the performance of the storage and makes commit contentions likely.
* other heavy use of the file system can affect the IO performance available for storage access and can increase the likelihood of commit contentions.
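
As a minimal sketch of the garbage collector reconfiguration mentioned above: Python's standard `gc` module exposes the collection thresholds through `gc.get_threshold()` and `gc.set_threshold()`. The helper name `reduce_gc_frequency` and the choice to scale only the generation 0 threshold are assumptions for illustration; the post does not say which thresholds were changed, and the exact effect depends on the CPython version.

```python
import gc


def reduce_gc_frequency(factor=1000):
    """Raise the generation 0 threshold by `factor` (hypothetical helper).

    With the classic generational collector, a larger generation 0
    threshold means more allocations are needed before a collection is
    triggered, so all generations are collected less often.
    Returns the previous thresholds so the caller can restore them.
    """
    previous = gc.get_threshold()
    gen0, gen1, gen2 = previous
    gc.set_threshold(gen0 * factor, gen1, gen2)
    return previous
```

A Zope client would call something like `reduce_gc_frequency(1000)` once at startup. The trade-off is that cyclic garbage lives longer, so memory use may grow between collections.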
I can throw 10 more Apache/Zeo instances as it - but not sure if that's the right approach.
It is not. Commit contention is a synchronization problem. It does not go away but is likely to increase when you scale your frontends up.
So I guess here's my questions.
1. Is there a Zeo Client limit you can have when connecting to a Zope(Zeo Server) instance?
There is no limit in principle -- but as you can see, lots of clients can affect performance. Invalidation message processing imposes a load on the server which grows linearly with the number of clients (each client must get all invalidations). Most other Zeo load contributions depend more on the actual number of requested operations (reads, writes, commits) and less on the number of clients that request these operations (of course, more clients can generate more requests).
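
The linear growth can be made concrete with a back-of-the-envelope calculation. The function name and the sample numbers below are made up for illustration; the model simply assumes that every commit produces one invalidation message per connected client.

```python
def invalidation_deliveries(commits_per_second, clients):
    """Invalidation messages the server sends per second (simplified model).

    Assumes each commit is announced to every connected client.
    """
    return commits_per_second * clients


# Hypothetical workload: 5 commits per second at different cluster sizes.
for clients in (2, 10, 20):
    print(clients, "clients:", invalidation_deliveries(5, clients), "messages/s")
```

Doubling the number of clients doubles this part of the server load even if the commit rate stays the same.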
2. Are there any special setting to allow for 'many' Zeo clients connecting to Zeo server?
Reconfigure the Python garbage collector such that it runs far less often. Get rid of components that (unnecessarily) create lots of Python objects. Check whether you do unnecessary operations during the second commit phase. Place your ZODB storage files intelligently in the file system such that other high volume IO operations do not badly affect IO on the storage.

-- Dieter