[Zope-dev] 100k+ objects, or...Improving Performance of BTreeFolder...

Sun, 09 Dec 2001 19:36:21 -0800

Interesting FYI for those looking to support lots of cataloged objects in
ZODB and Zope (Chris W., et al)... I'm working on a project to put ~350k
Cataloged objects (customer database) in a single BTreeFolder-derived
container; these objects are 'proxy' objects which each expose a single
record in a relational dataset, and allow about 8 fields to be indexed (2 of
which are TextIndexes).

Some informal stress tests using 100k+ _Cataloged_ objects in a BTreeFolder
in Zope 2.3.3 on my PIII/500/256mb laptop are proving to be successful, but
not without some stubborn investigation and a few caveats.  

BTreeFolder, using ObjectManager APIs, frankly, just won't scale for
bulk-adds of objects to folders.  I was adding CatalogAware objects to my
folder (including index_object()).  After letting a bulk-add process run for
2 days without finishing, I killed Zope and started trying to optimize,
figuring that the problem was related to Catalog and my own RDB access code.
That got me nowhere (well, I tuned my app, but this didn't solve my
problem).  I then went to #zope, got a few ideas, and ended up with the
conclusion that my problem was not Catalog-related, but related to
BTreeFolder.  I initially thought the C-based generic BTree implementation
was failing to scale well past 10k objects, but felt I couldn't point the
finger at that before some more basic stuff was ruled out.

The easiest thing to do in this case was to figure out what was heavily
accessing the BTree via its dictionary-like interface, and the thought
occurred to me that there might be multiple has_key checks, security stuff,
and the like called by ObjectManager._setObject(), and I was right. I
figured a switch to use the simple BasicBTreeFolder._setOb() for my stress
tests might reveal an increase in speed, and...

...it works, acceptably, no less, on my slow laptop for 100,000 objects.  It
took ~50 minutes to do this on meager hardware with a 4200 RPM IDE disk, and
I figure a bulk add process like this on fast, new hardware (i.e. something
with upwards of 22k pystones and lots of RAM) with a dedicated server for my
RDB, would likely take 1/5th this time, or about 10 minutes (by increasing
both MySQL performance, and Zope performance); combine this with ZEO and
have a dedicated node do this, and I think this is a small amount of proof
of Zope's ability to scale to many objects. (See my caveats at the bottom of
this message, though).
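To put rough numbers on the estimate above, the arithmetic below just divides out the figures quoted in this message. The 5x speedup is the guess stated above, and extending the same rate linearly to the full ~350k objects is an added assumption, not something that was measured.

```python
# Back-of-the-envelope throughput from the figures quoted in this message.
objects_added = 100000     # objects in the stress test
minutes_taken = 50.0       # approximate time on the laptop
speedup = 5.0              # guessed gain on faster hardware (assumption)
target_objects = 350000    # approximate size of the full dataset


def rate_per_second(objects, minutes):
    """Average objects added per second."""
    return objects / (minutes * 60.0)


def minutes_for(objects, rate):
    """Minutes needed to add `objects` at `rate` objects/sec (linear model)."""
    return objects / rate / 60.0


laptop_rate = rate_per_second(objects_added, minutes_taken)
fast_rate = laptop_rate * speedup

print("laptop rate:  %.1f objects/sec" % laptop_rate)
print("guessed rate: %.1f objects/sec" % fast_rate)
print("350k on laptop:   %.0f min" % minutes_for(target_objects, laptop_rate))
print("350k at 5x guess: %.0f min" % minutes_for(target_objects, fast_rate))
```

Under those assumptions the laptop manages roughly 33 objects per second, and the full dataset would take about 175 minutes there or about 35 minutes at the guessed 5x rate. Real bulk adds may well not scale linearly as the BTree and indexes grow.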

After days of frustration, I'm actually impressed by what I found.  My
data-access APIs are very computationally expensive, since they establish a
MySQLdb cursor object for each call and execute a query, and these data
access methods are called during Cataloging via index_object(), right after
_setOb(), for each of the 100k objects in the bulk add.  (The transaction is
done all in memory for now, but will likely be moved to subtransactions soon
to support up to 4x that data.)
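The move to subtransactions mentioned above could be sketched as a batching loop like the one below. This is only an illustration: bulk_add(), add_one and checkpoint are hypothetical names, and the batch size of 1000 is arbitrary. In a Zope of this vintage the checkpoint would presumably be a subtransaction commit such as get_transaction().commit(1), but that is an assumption that has not been checked against 2.3.3 here.

```python
def bulk_add(records, add_one, checkpoint, batch_size=1000):
    """Call add_one(record) for every record, calling checkpoint() after
    each full batch and once more at the end if a partial batch remains.

    Returns the number of records processed.
    """
    count = 0
    for record in records:
        add_one(record)
        count += 1
        if count % batch_size == 0:
            checkpoint()
    if count % batch_size != 0:
        # Flush the final, partial batch.
        checkpoint()
    return count


# Stand-ins for the real work: a list instead of a folder, and a
# checkpoint that just records how many objects existed when it ran.
added = []
checkpoints = []
total = bulk_add(range(2500), added.append,
                 lambda: checkpoints.append(len(added)))
print(total, checkpoints)
```

The point of the design is that memory held by the pending transaction is bounded by the batch size rather than by the whole dataset.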

So far, the moral of the story: use _setOb(), not _setObject() for this many
objects!
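To illustrate why the bare call is cheaper, here is a toy model, not Zope's actual code. CountingTree, ToyFolder, raw_set() and checked_set() are made-up names, and the particular checks in checked_set() are invented stand-ins for the has_key checks and security work mentioned above; the real _setObject() does different and more work.

```python
class CountingTree(dict):
    """dict stand-in for a BTree that counts explicit membership checks."""

    def __init__(self):
        dict.__init__(self)
        self.lookups = 0

    def has_key(self, key):
        self.lookups += 1
        return key in self


class ToyFolder:
    """Hypothetical container with a bare store and a checked store."""

    def __init__(self):
        self.tree = CountingTree()

    def raw_set(self, id, ob):
        # In the spirit of a bare _setOb(): just store the object.
        self.tree[id] = ob

    def checked_set(self, id, ob):
        # In the spirit of the heavier path: check, store, check again.
        # These specific checks are invented for illustration.
        if self.tree.has_key(id):
            raise KeyError("duplicate id: %s" % id)
        self.raw_set(id, ob)
        if not self.tree.has_key(id):
            raise RuntimeError("store failed for id: %s" % id)
        return id


fast = ToyFolder()
slow = ToyFolder()
for i in range(1000):
    fast.raw_set("obj%d" % i, i)
    slow.checked_set("obj%d" % i, i)

print("extra lookups, raw path:    ", fast.tree.lookups)
print("extra lookups, checked path:", slow.tree.lookups)
```

Each extra check is another trip into the tree per object added, so whatever the real per-add overhead is, it gets multiplied by the number of objects in a bulk load.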

I haven't seen any material documenting anything like this for BTreeFolder,
so I figured I would share with zope-dev what I found in the hopes that
developers creating products with BTreeFolder and/or future implementations
of BTreeFolder might take this into account, in docs, if nothing else.

Caveats:
- I'm using FileStorage and an old version of Zope (2.3.3).  I can't say how
this will perform with Python 2.1/Zope 2.[4/5].  I imagine that one would
want to pack the storage between full rebuilds or have very, very fast
storage hardware.

- Catalog searches without any limiting queries to indexes will simply be
too slow for practical use with this many objects, so they need to be
forbidden with a permission to prevent accidental over-utilization of system
resources or DoS-style attacks.  Otherwise, Catalog searches on my slow hard
drive seem acceptable. 

- I'm not too concerned with BTreeFolder __getattr__() performance
penalties, though I modified BTreeFolder.__getattr__ just in case to remove
the 'if tree and tree.has_key(name)', replacing with try/except; I'm not
sure if this helps/hinders, because my stress-test code uses _getOb()
instead.

- objectIds() doesn't work; or, more accurately, at first glance, <dtml-var
"_.len(objectIds())"> doesn't work; I haven't tested anything else.  I would
like to find out why this is, and fix it.  I suppose that there is something
done in ObjectManager that BTreeFolder's simple _setOb() doesn't do.  If
anyone wants to help me figure out the obvious here, I'd appreciate it. ;)

- I don't think un-indexed access of records is likely to be very practical
with this many, esp. if things like objectIds() are broken, which increases
the value of Catalog, and I think that what my experiences here with this
project are showing is that Catalog indexing isn't as expensive/slow as I
initially thought it would be.  That said, I'm sure there can be
improvements in Catalog, as has often been discussed here recently, but for now, I
think I'm happy. :)  

- I haven't compared these results with OFS.Folder.Folder yet.  I'm too
lazy/busy to run a comparison test.

- I'm relatively sure that, in my app, the text index BTrees in the Catalog
are very 'bushy' (more so than normal) because I am indexing people's full
names and street addresses, which means there are fewer common words than
when indexing, say, an every-day document.

- Also, I want to make it clear that if I had a data access API that needed
more than simple information about my datasets (i.e. I was trying to do
reporting on patterns, like CRM-ish types of applications), I would likely
wrap a function around indexes done in the RDB, not in Catalog.  My app requires
no reporting functionality, and thus really needs no indexes, other than for
finding a record for customer service purposes and account validation
purposes.  The reason, however, that I chose ZCatalog was for full text
indexing that I could control/hack/customize easily.  My slightly uninformed
belief now is that for big datasets or "enterprise" applications (whatever
that means), I would use a hybrid set of (faster) indexes using the RDB's
indexes where appropriate (heavily queried fields), and ZCatalog for
TextIndexes (convenient).   I'm sure inevitable improvements to ZCatalog
(there seems to be community interest in such) will help here.

- I wonder if "directory-storage" combined with ReiserFS might make for an
interesting future ZODB choice for this sort of app.
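On the __getattr__ caveat above (replacing the 'if tree and tree.has_key(name)' test with try/except), the two lookup styles can be sketched as follows. This uses a plain dict in place of the BTree, and lookup_checked() and lookup_try() are hypothetical names, not BTreeFolder methods. Which style is faster depends on how often lookups miss, and that was not measured here.

```python
def lookup_checked(tree, name):
    """Look before you leap: test membership, then fetch.

    A hit costs two trips into the mapping.
    """
    if tree and name in tree:
        return tree[name]
    raise AttributeError(name)


def lookup_try(tree, name):
    """Just fetch, and translate a failure into AttributeError.

    A hit costs one trip; a miss pays for raising and catching an
    exception.  TypeError covers the case where tree is None.
    """
    try:
        return tree[name]
    except (KeyError, TypeError):
        raise AttributeError(name)


tree = {"cust1": "record one"}
print(lookup_checked(tree, "cust1"))
print(lookup_try(tree, "cust1"))
```

Both functions behave the same from the caller's point of view: a value on a hit, AttributeError on a miss or when there is no tree at all.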

Sean

=========================
Sean Upton
Senior Programmer/Analyst
SignOnSanDiego.com
The San Diego Union-Tribune
619.718.5241
sean.upton@uniontrib.com
=========================