On Monday 31 March 2003 12:02 pm, Wankyu Choi wrote:
So glad to catch your attention :-)
Subclassing ZCatalog can be a maintenance headache. I did it for DocumentLibrary and regretted it.
Can you expound on this? In fact, I did just the opposite. First, I tried "self._np_catalog = NeoPortalCatalog()", where NeoPortalCatalog is a subclass of ZCatalog. Then I thought better (or worse, should I say :-)) of it and inherited directly from NeoPortalCatalog. I thought it would be easier in terms of maintenance.
What I found was that forward compatibility was a problem. I just so happened to land right before ZCatalog was heavily refactored for Zope 2.4. It really depends on how much internal ZCatalog machinery/data structures you depend on... See more comments inline:
Performance? Not as fast as the SQL-backed PHP version (displaying a pageful of threads takes only a fraction of a second), but not bad.
Is this Zope 2.6.1? What do the queries look like?
Zope 2.6.1 with Python 2.2. I've heard Python 2.2 works just fine with Zope 2.6.1, and I've had no obvious problems yet.
Here's the code block that returns the query results and sorts a portion of them if necessary (a non-threaded look doesn't require this sort):
    security.declarePrivate( '_getArticles' )
    def _getArticles( self, top=None, expand_all=0, skey=None, reverse=0,
                      default_reverse=0, search_fields='', search_keywords='',
                      b_start=0, limit=None ):
        """ Return a complete list of brain objects from the Catalog """
        if search_fields and search_keywords:
            index_map = { 'title': 'Title',
                          'creator': 'Creator',
                          'creator_email': 'CreatorEmail',
                          'body': 'getNeoPortalContentSearchText',
                          'category': 'getArticleCategory',
                          'comments': 'getInlineCommentsSearchText',
                          'comment_creators': 'getInlineCommentCreators',
                          'attachments': 'getAttachmentsSearchText' }
            new_indexes = []
            for index in search_fields.keys():
                new_indexes.append( index_map[index] )
            results = self.search( indexes=new_indexes,
                                   keywords=search_keywords.split( ' ' ) )
        else:  # threads are automatically reverse-sorted on sort keys
            reverse = int( reverse )
            result_limit = None
            b_start = int( b_start )
            if limit is not None:
                limit = int( limit )
                result_limit = b_start + limit
            if skey:
                if skey == 'np_read_count':
                    skey = 'getNeoPortalReadCount'
                elif skey == 'np_num_ratings':
                    skey = 'getNeoPortalNumContentRatings'
            else:
                skey = 'creation_date'
            sort_order = ''
            if reverse:
                sort_order = 'reverse'
            results = self.searchResults( meta_type=NeoBoardArticle.meta_type,
                                          isTempNeoBoardArticle=0,
                                          sort_on=skey,
                                          sort_order=sort_order,
                                          limit=result_limit )
If you are trying to use the new sort limits, use:

    sort_limit = result_limit

[snip]
    security.declarePublic( 'sortArticleThreads' )
    def sortArticleThreads( self, brains ):
        """ Sort a list of brains """
        import operator
        temp_list = map( lambda x: ( getattr( x.getObject(), '_sort_key' ),
                                     getattr( x.getObject(), '_thread_sort_key' ),
                                     x ),
                         brains )
        temp_list.sort()
        brains[:] = map( operator.getitem, temp_list,
                         ( -1, ) * len( temp_list ) )
        return brains
This sorting code is not going to scale well at all.
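One way to avoid waking every object is to sort on precomputed keys (e.g. values the brains already carry as metadata columns) instead of calling getObject() per result. A minimal pure-Python sketch of that decorate-sort-undecorate variant; the Brain class here is a hypothetical stand-in for a catalog brain:

```python
# Sketch: decorate-sort-undecorate on keys the brain already carries,
# instead of waking each object with getObject().  "Brain" is a
# hypothetical stand-in for a catalog brain whose sort keys are
# available as metadata columns.
class Brain:
    def __init__(self, sort_key, thread_sort_key, title):
        self.sort_key = sort_key              # assumed metadata column
        self.thread_sort_key = thread_sort_key
        self.title = title

def sort_article_threads(brains):
    """Sort brains in place by (sort_key, thread_sort_key), no getObject()."""
    # include the position i so ties never fall through to comparing Brains
    decorated = [(b.sort_key, b.thread_sort_key, i, b)
                 for i, b in enumerate(brains)]
    decorated.sort()
    brains[:] = [item[-1] for item in decorated]
    return brains

brains = [Brain(-1, 0, 'first'), Brain(-3, 0, 'third'), Brain(-2, 0, 'second')]
sort_article_threads(brains)
# the most negative sort key (newest article) sorts first
```

The cost is one tuple per result instead of one ZODB object load per result, which is the difference that matters at 20,000 articles.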
- isTempNeoBoardArticle: NeoBoard does what CMF does. When a user posts an article, it first creates a temp article and examines it; if unacceptable, it deletes it. This method tells whether the article is a temporary one or not.
- isThreadParent: tells whether the article is the top-most one in the thread.
- expand_all: Boolean value telling whether we need a threaded look.
- getSortKey: returns the thread sorting key. It's the inverted article number: article 50's sort key becomes -50 when it is added to the board, for automatic reverse sorting. (Tried creation_date once, but it turned out to be a disaster when you do import/export.)
To make creation date work, you'd need to make it an application-modified attribute.
- result_limit: calculated on the basis of the current batch.
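The result_limit arithmetic above can be sketched in plain Python; batch_window is a hypothetical helper, with the slice standing in for trimming the catalog results to the current batch:

```python
# Sketch of the batch-window arithmetic from _getArticles: fetch only
# the first b_start + limit results, then slice out the current page.
def batch_window(results, b_start, limit=None):
    b_start = int(b_start)
    result_limit = None
    if limit is not None:
        limit = int(limit)
        result_limit = b_start + limit   # total results worth fetching
    # result_limit is what would be handed to the catalog as its sort
    # limit; the current page is the slice below.
    return results[b_start:result_limit]

articles = list(range(100))   # stand-in for catalog results
page = batch_window(articles, b_start=20, limit=10)
# page covers articles 20..29
```

The point of passing result_limit to the catalog is that everything past b_start + limit never needs to be sorted or fetched at all.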
If you want to take a look at it in context, I've got a viewCVS set up here: http://cvs.zoper.net:3333/cgi-bin/viewcvs.cgi/NeoBoard/NeoBoard.py
And you can see the board in action here: http://www.zoper.net/Boards/qa/view
I learned while reading the ZCatalog docs that I'd get better results by adding metadata to brain objects. I'll remove that expensive sorting method soon.
No, actually, metadata won't help sorting much. If you want "out of band" sorting, Catalog (as of 2.6.1) has a method called sortResults, whose signature looks like this:

    sortResults(rs, sort_index, reverse=0, limit=None, merge=1)

where:

- rs is the bare record set (which can be had by calling searchResults(..., _merge=0))
- sort_index is the index to sort by (the object, not the name)
- reverse is the direction (sort_order)
- limit is the sort limit
- merge determines what is returned (1=brains, 0=a sorted list of rids); you probably want 1

So you could do:

    catalog = self._catalog
    rs = catalog.searchResults(..., _merge=0)
    # ...do some stuff with rs...
    return catalog.sortResults(rs, self.getIndex(sort_key), ...)
I ran into another set of problems while doing so. I could display 5,000 threads (about 20,000 article objects including all replies to the threads) in less than a second (it takes a bit more when you load the board for the first time). The problems are...
I would be interested in using this data as a benchmark for improvements in 2.7...
It took me a whole day to generate those articles; I had fun with them for about a week and lost them last night when the board's catalog went crazy with missing keys. I had to remove the board, and the data went with it :-(
On a different note: creating an article object doesn't require that much computation power, just a bunch of init values for its properties. But instantiating articles in a for loop, for example, takes more than a second apiece, and it gets worse as the loop goes on. Is it because of ZODB's transaction/version support? Normally, how long should it take to generate 20,000 not-so-heavy objects? Taking more than an hour doesn't seem right with enough horsepower. While creating that test data, I had to take a long nap :-(
You should probably commit a subtransaction every so often so as not to use too much memory. Sounds like it was trying to commit a really big transaction. If these objects are all nested and you create a big hierarchy, that might explain it a bit. [snip]
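The subtransaction advice can be sketched like this; create and commit are hypothetical callbacks, with commit standing in for Zope 2's get_transaction().commit(1) (a subtransaction commit):

```python
# Sketch: commit a subtransaction every batch_size object creations so
# pending state gets flushed to disk instead of piling up in RAM.
# "create" and "commit" are hypothetical callbacks; under Zope 2 the
# commit would be get_transaction().commit(1).
def create_articles(n, create, commit, batch_size=100):
    for i in range(n):
        create(i)
        if (i + 1) % batch_size == 0:
            commit()          # subtransaction: frees memory mid-run
    commit()                  # final commit for the remainder

created = []
commits = []
create_articles(250, created.append,
                lambda: commits.append(len(created)),
                batch_size=100)
# subtransactions fire after 100 and 200 objects, then a final commit
```

Without the periodic commits, all 20,000 new objects stay in the transaction's pending state at once, which matches the "gets worse as the loop goes on" symptom.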
that causes these key errors. Do these key errors happen at search time?
No, I meant a key error in 'mybrain.getObject()', that is, a ghost entry in the Catalog without the corresponding object. I guess it happens after a massive set of additions or deletions; I can't pinpoint a case. Fast reloads sometimes do generate ZODB conflict errors: if you reload while reindexing everything with heavy disk I/O, you usually get these ZODB conflicts. Maybe I should do some work on conflict resolution?
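For simple hot-spot attributes like read counters, ZODB supports application-level conflict resolution through the _p_resolveConflict hook (the pattern BTrees.Length uses). A minimal sketch; ReadCounter is hypothetical, and a real version would subclass Persistent:

```python
# Sketch of ZODB's _p_resolveConflict hook, modeled on the pattern
# BTrees.Length uses for a conflict-free counter.  ReadCounter is
# hypothetical.  On a write conflict, ZODB calls the hook with the
# state both transactions started from (old), the state already
# committed (saved), and the state we tried to commit (new); returning
# a merged state resolves the ConflictError instead of raising it.
class ReadCounter:
    def __init__(self):
        self.value = 0

    def hit(self):
        self.value += 1

    def __getstate__(self):
        return self.value

    def __setstate__(self, value):
        self.value = value

    def _p_resolveConflict(self, old, saved, new):
        # both transactions incremented independently: keep both deltas
        return saved + (new - old)

counter = ReadCounter()
# two transactions started at 10; one committed 12, ours tried 11
merged = counter._p_resolveConflict(10, 12, 11)
# merged is 13: both increments survive
```

This only helps with conflicting writes to such mergeable state; it would not fix the ghost-entry problem, which is an unindexed-delete bug as noted below.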
That means an object was deleted without being unindexed. Sounds like an application bug somewhere. BTW: Calling getObject for every object found is a really bad idea and will kill performance. [snip]
I was wrong. I looked into Catalog.py more closely, and it was not the getIndex() call that was taking too long, but each index itself. For example, 'Title' is a TextIndex and 'creation_date' is a DateIndex. Why would a TextIndex take that long to index an article with the simple title 'test'? The DateIndex is also painfully slow.
I can't tell you without seeing it myself ;^). If you can demonstrate this behavior in a relatively simple test case, I'd be interested in helping to fix it, if not for TextIndex then at least for DateIndex.
I would definitely try ZCTextIndex, just because its searching works so much better.
Will try :-)
One general suggestion: What is your ZODB cache set to?
I'm running these tests both on my desktop Linux box and on a set of enterprise servers.
Desktop: Pentium 4, RH Linux 8.0 with all the latest errata applied, 512M RAM, cache set to 8,000, FileStorage
Enterprise servers:
- ZEO Storage Server: dual Xeon P4, RH Linux 8.0 with all the latest errata applied, 4G RAM, 430G SCSI HDDs with RAID 5, DirectoryStorage on ReiserFS with noatime on
- ZEO Client: dual P3 Tualatin, RH Linux 8.0 with all the latest errata applied, 2G RAM with ZODB cache set to 20,000
Both my desktop and the ZEO client show the same symptoms. The ZEO servers render CMF/Plone + NeoBoard pages in an average of 0.3 ~ 0.5 seconds, so I don't think there are any hardware/cache problems.
Any help, hints or comments would be much appreciated. I do need to move on with this project :-( It's been almost a year now...ouch. Weeks became months; months became a whole year... whew.
Yup, been there ;^)
Been there too many times with other tools. Just hoped this time would be different with Zope :-)
Thanks for your help.
---------------------------------------------------------------
Wankyu Choi
CEO/President
NeoQuest Communications, Inc.
http://www.zoper.net
http://www.neoboard.net
---------------------------------------------------------------