We currently face a strange problem in a high-performance Zope setup. Its features are special enough that not all of you need fear the same problem: the setup consists of a large number of Zope instances, all talking to a single ZEO server. Almost every non-trivial request (i.e. every request not delivering static content) writes to the ZODB. This means we are using the ZODB in a high-frequency write scenario -- one it has not been designed for.

The problem: ZEO reports large numbers of "client blocked" occurrences, *BUT* neither the IO nor the CPU on the ZEO host is utilized significantly. This implies that it is not ZEO which is the bottleneck, but the ZEO clients.

Up to yesterday, I was extremely puzzled by the behaviour of the ZEO clients (i.e. the Zope processes). Usually they worked quickly and reliably, but occasionally everything, and especially the IO, seemed extremely slow. I added log messages to prominent IO operations and could, for example, see occasional times for "stat" on the order of 5 to 20 s.

The system is complex and there are many possible causes for such disastrous behaviour. One may be problems with the IO subsystem. Our administrators told me they could not see any unusual load on the IO subsystem -- but for them, too, the system is complex.

I knew that the strange timings could mean one of two things: either the IO is really that slow, or the measurement is flawed. All measurements pointed in the same direction -- but all measurements used Python. And that could mean that Python's thread scheduling (controlled by the GIL) is extremely distorted -- distorted on the order of 10 to 20 s.

Yesterday (far too late), I decided to prove whether or not there are scheduling distortions. I put an additional thread into the Zope processes, scheduled periodically, which complains when there is too large a discrepancy between the expected and the observed scheduling time. The thread proved it: there are scheduling distortions of at least 5 s. Next question: why?
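For the curious: such a watchdog thread can be sketched roughly as below. This is a minimal sketch, not the code from our installation; the name "start_scheduling_watchdog", its parameters, and the reporting callback are my own inventions for illustration, written for a modern Python.

```python
import threading
import time

def start_scheduling_watchdog(interval=1.0, tolerance=0.5, report=print):
    """Start a daemon thread that sleeps `interval` seconds in a loop
    and reports whenever it wakes up more than `tolerance` seconds
    later than expected -- a sign that something (e.g. a long GIL
    hold) delayed thread scheduling."""
    def watchdog():
        expected = time.monotonic() + interval
        while True:
            time.sleep(interval)
            now = time.monotonic()
            lag = now - expected  # how late were we scheduled?
            if lag > tolerance:
                report("scheduling distortion: %.2f s late" % lag)
            expected = now + interval
    thread = threading.Thread(target=watchdog, daemon=True)
    thread.start()
    return thread
```

On a healthy system the watchdog stays quiet; during a long GIL-holding operation it reports lags in exactly the observed range.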
This morning I found a candidate: Python's garbage collector. Of course, it holds the GIL (and thereby prevents Python thread scheduling) while it runs. From the documentation, I had the impression that it should be efficient and quick, even for full (generation 2) collections -- but who knows. I disabled automatic garbage collection and replaced it with a periodic call to "gc.collect", logging the execution times. The times started at 0.6 s but have meanwhile increased to 1.9 s per run, and I expect they will easily grow further, to 5 or 6 s per run. Moreover, scheduling distortions are reported precisely when the garbage collector runs.

Usually, occasional garbage collection times of 5 to 6 s are not so problematic -- provided they are not so frequent that they severely interfere with transaction commits. In our case, they do interfere with transaction commits and can severely slow down the complete ZEO client farm, due to contention for the commit lock. The reason is a broken TextIndex implementation -- again, maybe special to our installation.

The aim of this message: should you see strangely behaving ZEO clients and lots of unexplainable "clients blocked" entries in your ZEO logfile, you may think of scheduling distortions caused by Python's garbage collector.

-- Dieter