Re: [Zope] ZCatalog performance issues - catalogging objects takes ages

31 Mar 2003

      On Monday 31 March 2003 05:47 am, Wankyu Choi wrote:
...
Dear All,
May I have your expertise on this? ;-)
As much as I'm new to Zope/Python, ZCatalog (Catalog) internals vex me even
more.
I have a message board product called NeoBoard, some of you might know.
Recently I rewrote its core to have a built-in catalog for indexing articles
and displaying them automatically sorted on thread keys. It showed quite a
boost in performance. Previous versions without the built-in catalog used to
ramrod all article objects into/out of memory whenever they need to display
them. What a waste of memory and CPU power as Toby Dickenson suggested.
Here's what I did to solve this problem:
- Rewrote the parent class of the NeoBoard/NeoBoardArticle ( article
container/article objects themselves ), NeoPortalElementContainer to inherit
ZCatalog. Basically NeoPortalElementContainer automatically natural
sorts/numbers objects (elements) when they're added to the container:
page_1, page_2, ... etc.
Subclassing ZCatalog can be a maintenance headache. I did it for 
DocumentLibrary and regretted it.
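
One alternative to subclassing is to have the container hold a catalog and delegate to it. The following is only a minimal sketch under stated assumptions: StubCatalog stands in for a real ZCatalog instance, and ElementContainer, add_element and remove_element are hypothetical names, not NeoBoard's or DocumentLibrary's actual code.

```python
class StubCatalog:
    """Stand-in for a ZCatalog instance; only the two calls used below."""

    def __init__(self):
        self.entries = {}

    def catalog_object(self, obj, uid):
        self.entries[uid] = obj

    def uncatalog_object(self, uid):
        del self.entries[uid]


class ElementContainer:
    """Hypothetical container that owns a catalog rather than being one."""

    def __init__(self, catalog):
        self._catalog = catalog
        self._count = 0

    def add_element(self, obj):
        # Number elements as they arrive (page_1, page_2, ...) and index them.
        self._count += 1
        uid = "page_%d" % self._count
        self._catalog.catalog_object(obj, uid)
        return uid

    def remove_element(self, uid):
        self._catalog.uncatalog_object(uid)


container = ElementContainer(StubCatalog())
first = container.add_element("first article")
second = container.add_element("second article")
```

With this shape the container's own API stays small, and changes to the catalog class between Zope releases only touch the two delegating methods.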
...
- NeoBoardArticle looks toward NeoBoard when the catalog methods defined in
NeoPortalElementContainer are called. So NeoBoard's catalog methods are
always used no matter where you are in the path hierarchy.
- When you call a NeoBoard instance, it calls ZCatalog's searchResults(),
which returns brains objects. A threaded (expanded) look does require a
step further: NeoBoard sorts a pageful of threads and their replies before
returning them; it doesn't care about the other threads that are not
displayed in the current request.
Performance? Not as fast as the SQL-backed PHP version ( displaying a pageful of
threads takes only a fraction of a second ), but not bad.
Is this Zope 2.6.1? What do the queries look like?
...
Okay, I partially solved one problem ( wasting memory/horsepower, etc - I'm
still not satisfied with the performance, though ) but created another set
of problems while so doing.  I could display 5,000 threads ( about 20,000
article objects including all replies to the threads) in less than a second (
it takes a bit more when you load the board for the first time. ) The
problems are...
I would be interested in using this data as a benchmark for improvements in 
2.7...
...
- It takes ages when cataloging even a small number of articles. 18 seconds
for cataloging 50 or so article objects with so little to index? Is it
normal? Can't imagine recataloging 20,000 objects.  For example, if you move
a thread from one NeoBoard instance to another, you have to uncatalog the
thread including all its replies in NeoBoard A and catalog them in NeoBoard
B: cataloging a single article object takes more than 1 second. Don't think
it's normal... Or is it?
Profiling may be necessary to pin this down. Likely culprits are TextIndexes, 
but it's hard to say. Are you sure you are doing a minimum of work (i.e., only 
indexing each message once)?
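
A rough sketch of how such profiling could be done with the standard library follows. It assumes a current Python with cProfile; fake_catalog_object is a made-up stand-in for the real indexing call, and profile_call is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under the profiler; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def fake_catalog_object(text):
    # Stand-in for the real indexing call; it only splits and sorts words.
    return sorted(set(text.lower().split()))


words, report = profile_call(fake_catalog_object, "Some article body some more")
```

Printing the report for a single article being cataloged should show whether the time goes into one index, into the methods being indexed, or into loading objects.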
...
- When I attempt to uncatalog an object that's not been cataloged, Zope
spews out errors in the log. Can I suppress the errors in code? In my
applications they are meaningless.
These errors are harmless. It might be better to check whether the objects are 
cataloged before uncataloging them.
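
Such a check could look roughly like the sketch below. It is written against a stand-in class, not the real ZCatalog: the `uids` mapping here is an assumption modelled on a uid-to-record-id table, the stub raises KeyError where Zope would log an error, and safe_uncatalog is a hypothetical helper name.

```python
class StubCatalog:
    """Stand-in with a uid -> record id mapping, loosely modelled on a catalog."""

    def __init__(self):
        self.uids = {}
        self._next_rid = 0

    def catalog_object(self, obj, uid):
        if uid not in self.uids:
            self.uids[uid] = self._next_rid
            self._next_rid += 1

    def uncatalog_object(self, uid):
        # The stub raises for an unknown uid; the real catalog logs instead.
        del self.uids[uid]


def safe_uncatalog(catalog, uid):
    """Uncatalog uid only if it is present; report whether anything was removed."""
    if uid not in catalog.uids:
        return False
    catalog.uncatalog_object(uid)
    return True


catalog = StubCatalog()
catalog.catalog_object("an article", "/board/page_1")
removed_known = safe_uncatalog(catalog, "/board/page_1")
removed_unknown = safe_uncatalog(catalog, "/board/page_2")
```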
...
- Catalogs sometimes do get corrupted so recataloging is required from time
to time. Is it also normal? All of my article objects are catalog-aware and
they catalog/uncatalog/recatalog themselves when getting added, deleted, or
modified using manage_afterAdd(), manage_beforeDelete() and CMF'ish _edit()
method. When a missing article (ghost catalog entry) causes a KeyError,
NeoBoard attempts to refresh the catalog: well, takes too much time. But
manually recreating its catalog is not an alternative. Any ideas why this'd
happen? Any tips on maintaining catalog integrity?
Although there have historically been BTree bugs that can cause KeyErrors, 
they have slowly been stamped out. It would be helpful to find a test case 
that causes these KeyErrors. Do these KeyErrors happen at search time?
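
When hunting for a test case, one option is to list the catalog entries whose objects no longer resolve. This is only a sketch: find_ghost_uids is a hypothetical helper, the dictionary stands in for the object database, and in Zope the resolver would be whatever lookup the application uses (which may raise something other than KeyError).

```python
def find_ghost_uids(uids, resolve):
    """Return the uids for which resolve(uid) raises KeyError."""
    ghosts = []
    for uid in uids:
        try:
            resolve(uid)
        except KeyError:
            ghosts.append(uid)
    return ghosts


# Stand-in data: page_2 is cataloged but no longer stored.
stored = {"/board/page_1": "article 1", "/board/page_3": "article 3"}
cataloged = ["/board/page_1", "/board/page_2", "/board/page_3"]
ghosts = find_ghost_uids(cataloged, stored.__getitem__)
```

Uncataloging only the uids reported this way is much cheaper than refreshing the whole catalog, and the list itself is a starting point for reproducing the problem.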
...
- Here're the indexes NeoBoard uses:
    security.declarePublic( 'enumerateIndexes' )
    def enumerateIndexes( self ):
        """
            Return a list of ( index_name, type ) pairs for the initial
            index set.
        """
        return ( ('Title', 'TextIndex')
               , ('meta_type', 'FieldIndex')
               , ('getSortKey', 'FieldIndex')
               , ('getThreadSortKey', 'FieldIndex')
               , ('isThreadParent', 'FieldIndex')
               , ('creation_date', 'FieldIndex')
               , ('Creator', 'FieldIndex')
               , ('CreatorEmail', 'FieldIndex')
               , ('getArticleCategory', 'FieldIndex')
               , ('getNeoPortalContentSearchText', 'TextIndex')
               , ('getInlineCommentsSearchText', 'TextIndex')
               , ('getInlineCommentCreators', 'TextIndex')
               , ('getAttachmentsSearchText', 'TextIndex')
               , ('getNeoPortalReadCount', 'FieldIndex')
               , ('getNeoPortalNumContentRatings', 'FieldIndex')
               , ('getNeoPortalElementNumber', 'FieldIndex')
               , ('isTempNeoBoardArticle', 'FieldIndex')
               )
I'm concerned that the CommentsSearchText and AttachmentsSearchText are 
arbitrarily expensive. Maybe as a test try removing one index at a time to 
see if any one is causing a noticeable performance decrease. Start with the 
TextIndexes.
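
The one-index-at-a-time test could be scripted roughly as follows. This is a sketch under assumptions: time_without_each is a hypothetical helper, and fake_index_one_object stands in for whatever code catalogs one article using only the named indexes.

```python
import time


def time_without_each(index_names, index_one_object, repeats=3):
    """Return {dropped_index: best elapsed seconds with that index removed}."""
    timings = {}
    for dropped in index_names:
        active = [name for name in index_names if name != dropped]
        best = None
        for _ in range(repeats):
            start = time.perf_counter()
            index_one_object(active)
            elapsed = time.perf_counter() - start
            best = elapsed if best is None else min(best, elapsed)
        timings[dropped] = best
    return timings


calls = []


def fake_index_one_object(active_indexes):
    # Stand-in: records which indexes were active instead of indexing anything.
    calls.append(tuple(active_indexes))


names = ["Title", "getInlineCommentsSearchText", "getAttachmentsSearchText"]
timings = time_without_each(names, fake_index_one_object, repeats=1)
```

The index whose removal gives the smallest timing is the most likely culprit.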
...
I came to know that 'TextIndex' is deprecated. Have yet to try ZCTextIndex
or TextIndexNG ( the latter seems like overkill). Found 'TopicIndex' very
interesting. Would they make much difference? Especially, I was surprised to
find the simple 'Title' index takes almost one full second when applied on
an object: that getIndex( name ) call alone in Catalog.py takes this
much. So I suspect it's not about Catalog but I'm doing something very
stupid in setting up this built-in catalog.
That delay may be exposing an index bug. getIndex just does a single 
dictionary lookup and wraps it, so I'm not sure why this should take a long 
time, unless the TextIndex object is taking a *long* time to load from the 
database. But its main ZODB record should not be very big.

I would definitely try ZCTextIndex, just because its searching works so much 
better.
...
ONE FINAL QUESTION: I strongly suspect I wouldn't be able to get any faster
using ZCatalog. At least not as fast as using RDBMS. I'm thinking... "Not
fast enough, not flexible enough since I can't perform sophisticated queries
on ZCatalog and stuff... why not revert to MySQL?" Got any thoughts on this?
How does ZCatalog compare to a reasonably fast RDBMS?
One general suggestion: What is your ZODB cache set to? The default of 400 is 
*way* too small for heavy ZCatalog use. I would try upping it to 2000, maybe 
higher (depending on RAM). Use the activity monitor to see how much reading 
happens when you query and index. Upping the cache size can dramatically 
reduce reading from disk. Going from 400 to 2000 gave me roughly a factor of 
10 improvement in one test I had querying ZCTextIndex. It also can 
dramatically help index time since more of the lexicon and index BTrees can 
remain in memory.
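
Why the cache size matters so much can be illustrated with a toy model. The sketch below is not the ZODB cache, just a plain least-recently-used cache; the working set of 1500 objects and the three passes over it are made-up numbers chosen only to show the effect.

```python
from collections import OrderedDict


def count_loads(cache_size, accesses):
    """Count how many accesses miss an LRU cache holding cache_size objects."""
    cache = OrderedDict()
    loads = 0
    for oid in accesses:
        if oid in cache:
            cache.move_to_end(oid)  # mark as most recently used
        else:
            loads += 1  # a miss means reading the object from disk
            cache[oid] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return loads


working_set = list(range(1500))  # hypothetical objects touched per request
accesses = working_set * 3       # three identical requests in a row
small = count_loads(400, accesses)
large = count_loads(2000, accesses)
```

In this toy model the 400-object cache reloads every object on every pass, while the 2000-object cache only loads each object once. Real workloads are less extreme, but the same mechanism applies when the index BTrees and lexicon do not fit in the cache.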
...
NeoBoard (1.1) will be taken out of its beta phases when I solve this
cataloging weirdness, and might start working on 1.2 using MySQL or SAPDB
as backend. Hope somebody can persuade me out of this path... just the
thought of having to rewrite the core to use SQL makes me
shudder....arrrrrrgh...
Any help, hints or comments would be much appreciated.  I do need to move on
with this project :-( It's been almost a year now...ouch. Weeks became
months; months became a whole year... whew.
Yup, been there ;^)

-Casey

Casey Duncan