[Zope] ZCatalog performance issues - catalogging objects takes ages

Wankyu Choi wankyu@neoqst.com
Tue, 1 Apr 2003 05:15:31 +0900


Thanks a million.

I'll give it a try first thing in the morning ( oops, it's already 5 AM in
Korea; gotta go to bed ;-).

---------------------------------------------------------------
  Wankyu Choi
  CEO/President
  NeoQuest Communications, Inc.
  http://www.zoper.net
  http://www.neoboard.net
---------------------------------------------------------------

-----Original Message-----
From: zope-admin@zope.org [mailto:zope-admin@zope.org] On Behalf Of Casey Duncan
Sent: Tuesday, April 01, 2003 4:37 AM
To: Wankyu Choi; zope@zope.org
Subject: Re: [Zope] ZCatalog performance issues - catalogging objects takes ages


On Monday 31 March 2003 12:02 pm, Wankyu Choi wrote:
> So glad to catch your attention :-)
>
> > Subclassing ZCatalog can be a maintenance headache. I did it for
> > DocumentLibrary and regretted it.
>
> Can you expound on this? In fact, I did just the opposite. First, I
> tried "self._np_catalog = NeoPortalCatalog()" where NeoPortalCatalog
> is a subclass of ZCatalog. Thought better ( or worse, should I say :-)
> of it, and inherited directly from NeoPortalCatalog. Thought it would
> be easier in terms of maintenance.

What I found was that forward compatibility was a problem. I just so happened
to land right before ZCatalog was refactored majorly for Zope 2.4. It really
depends on how much internal ZCatalog machinery/data structures you depend
on...

See more comments inline:
> >> Performance? Not as fast as the SQL-backed PHP version ( displaying a
> >> pageful of threads takes only a fraction of a second ), but not bad.
>
> > Is this Zope 2.6.1? What do the queries look like?
>
> Zope 2.6.1 with Python 2.2. I heard Python 2.2 works just fine with Zope
> 2.6.1, and I've had no obvious problems yet.
>
> Here's the code block that returns the query results and sorts a
> portion of them if necessary ( a non-threaded look doesn't require
> this sort ):
>
>     security.declarePrivate( '_getArticles' )
>     def _getArticles( self
>                     , top=None
>                     , expand_all=0
>                     , skey=None
>                     , reverse=0
>                     , default_reverse=0
>                     , search_fields=''
>                     , search_keywords=''
>                     , b_start=0
>                     , limit=None ):
>         """
>             Return a complete list of brains objects from the Catalog
>         """
>
>         if search_fields and search_keywords:
>
>             index_map = {'title': 'Title'
>                     , 'creator': 'Creator'
>                     , 'creator_email': 'CreatorEmail'
>                     , 'body': 'getNeoPortalContentSearchText'
>                     , 'category': 'getArticleCategory'
>                     , 'comments': 'getInlineCommentsSearchText'
>                     , 'comment_creators': 'getInlineCommentCreators'
>                     , 'attachments': 'getAttachmentsSearchText'}
>
>             new_indexes = []
>             for index in search_fields.keys():
>                 new_indexes.append( index_map[index] )
>
>             results = self.search( indexes=new_indexes,
>                                    keywords=search_keywords.split( ' ' ) )
>
>         else:
>             # threads are automatically reverse-sorted on sort_keys
>
>             reverse = int( reverse )
>             result_limit = None
>             b_start = int( b_start )
>             if limit is not None:
>                 limit = int( limit )
>                 result_limit = b_start + limit
>
>             if skey:
>                 if skey == 'np_read_count': skey = 'getNeoPortalReadCount'
>                 elif skey == 'np_num_ratings': skey = 'getNeoPortalNumContentRatings'
>                 else: skey = 'creation_date'
>                 sort_order = ''
>                 if reverse: sort_order = 'reverse'
>                 results = self.searchResults( meta_type=NeoBoardArticle.meta_type,
>                                               isTempNeoBoardArticle=0,
>                                               sort_on=skey,
>                                               sort_order=sort_order,
>                                               limit=result_limit )

If you are trying to use the new sort limits, use: sort_limit = result_limit
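For example, the searchResults call in the quoted _getArticles above would
then read something like this (my sketch of the suggestion, reusing the names
from the quoted code; as I understand it, sort_limit is only a hint, so the
result may still need slicing to the batch size afterwards):

results = self.searchResults( meta_type=NeoBoardArticle.meta_type,
                              isTempNeoBoardArticle=0,
                              sort_on=skey,
                              sort_order=sort_order,
                              sort_limit=result_limit )  # hint: about b_start + limit results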

[snip]
>
>     security.declarePublic( 'sortArticleThreads' )
>     def sortArticleThreads( self, brains ):
>         """
>             Sort a list of brains
>         """
>
>         import operator
>
>         temp_list = map( lambda x: ( getattr( x.getObject(), '_sort_key' ),
>                                      getattr( x.getObject(), '_thread_sort_key' ),
>                                      x ),
>                          brains )
>
>         temp_list.sort()
>         brains[:] = map( operator.getitem, temp_list, ( -1, ) * len( temp_list ) )
>
>         return brains

This sorting code is not going to scale well at all.
> - isTempNeoBoardArticle: NeoBoard does what CMF does. When a user
> posts an article, it first creates a temp article and examines it. If
> unacceptable, it deletes it. The method tells whether the article is a
> temporary one or not.
> - isThreadParent: tells whether the article is the top-most one in the thread.
> - expand_all: Boolean value to tell if we need a threaded look.
> - getSortKey: returns the thread sorting key. It's the negated article number:
> article 50's sort key becomes -50 when added to the board, for automatic
> reverse-sorting. ( Tried creation_date once, but it turned out to be a
> disaster when you do import/export.)

To make creation date work, you'd need to make it an application-modified
attribute.
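In other words (my reading of the suggestion, not code from the thread): have
the article set and own its timestamp at creation time, so export/import
can't reset it the way the ZODB-derived modification time gets reset. A
pared-down, hypothetical skeleton:

from DateTime import DateTime
from OFS.SimpleItem import SimpleItem

class NeoBoardArticle( SimpleItem ):
    # hypothetical skeleton; the real class lives in NeoBoard.py

    def __init__( self, id ):
        self.id = id
        # an attribute the application owns: it survives export/import
        # and copy/paste, unlike bobobase_modification_time
        self.creation_date = DateTime()

    def getCreationDate( self ):
        return self.creation_date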

> - result_limit: calculated on the basis of the current batch.
>
> If you want to take a look at it in context, I've got a ViewCVS set up
> here:
> http://cvs.zoper.net:3333/cgi-bin/viewcvs.cgi/NeoBoard/NeoBoard.py
>
> And you can see the board in action here:
> http://www.zoper.net/Boards/qa/view
>
> I learned while reading the docs on ZCatalog that I'd get better results
> by adding metadata to brain objects. Will remove that expensive
> sorting method soon.

No, actually, metadata won't help sorting much. If you want "out of band"
sorting, Catalog (as of 2.6.1) has a method called sortResults, whose
signature looks like this:

sortResults(rs, sort_index, reverse=0, limit=None, merge=1)

where rs is the bare record set (which can be had by calling
searchResults(..., _merge=0)), sort_index is the index to sort by (the index
object, not its name), reverse is the direction (sort_order), limit is the
sort limit, and merge determines what is returned (1 = brains, 0 = a sorted
list of rids); you probably want 1.

So you could do:

catalog = self._catalog
rs = catalog.searchResults(..., _merge=0)
...do some stuff with rs...
return catalog.sortResults(rs, self.getIndex(sort_key), ...)
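Fleshed out against the quoted _getArticles, that might look roughly like the
following. This is a sketch based only on the signature above; the query
keywords mirror the quoted code, and self.getIndex is assumed to return the
index object, as in the snippet just shown:

catalog = self._catalog
rs = catalog.searchResults( meta_type=NeoBoardArticle.meta_type,
                            isTempNeoBoardArticle=0,
                            _merge=0 )           # bare record set, no brains yet
# ...filter or combine the raw record set here if needed...
return catalog.sortResults( rs,
                            self.getIndex( skey ),   # the index object, not its name
                            reverse=reverse,
                            limit=result_limit,
                            merge=1 )                # 1 = return brains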

>
> >> another set of problems while so doing.  I could display 5,000 threads
> >> ( about 20,000 article objects including all replies to the threads ) in
> >> less than a second ( it takes a bit more when you load the board for
> >> the first time. ) The problems are...
>
> > I would be interested in using this data as a benchmark for
> > improvements in 2.7...
>
> Took me a whole day to generate these articles; had fun with them for
> about a week and lost them last night when the board's catalog went
> crazy with missing keys; I had to remove the board and the data went
> with it :-(
>
> On a different note: creating an article object doesn't require that
> much computational power. Just a bunch of init values for its
> properties. But instantiating an article in a for loop, for example,
> takes more than a second, and it gets worse as the loop goes on. Is it
> because of ZODB's transaction/version support? Normally, how long should
> it take to generate 20,000 not-so-heavy objects? Taking more than an
> hour doesn't seem right with enough horsepower. While creating that test
> data, I had to take a long nap :-(

You should probably commit a subtransaction every so often so as not to use
too much memory. Sounds like it was trying to commit a really big
transaction. If these objects are all nested and you create a big hierarchy,
that might explain it a bit.
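A rough sketch of what that batching could look like in a test-data generator
(my illustration, not from the thread; addNeoBoardArticle is a hypothetical
factory method, and get_transaction().commit(1) is the Zope 2.6-era
subtransaction commit):

def generateTestArticles( self, count=20000, batch=500 ):
    for i in range( count ):
        # hypothetical helper that creates and catalogs one article
        self.addNeoBoardArticle( 'test-article-%d' % i )
        if i and i % batch == 0:
            # subtransaction commit: changed objects are written out to a
            # temporary storage so the ZODB cache can release them, instead
            # of everything piling up for one huge final commit
            get_transaction().commit( 1 )
    get_transaction().commit()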
[snip]
> > that causes these key errors. Do these key errors happen at search
> > time?
>
> No, I meant a key error in 'mybrain.getObject()', that is, a ghost
> entry in the Catalog without the corresponding object. I guess it
> happens after a massive set of additions or deletions. Can't pinpoint
> a case. Fast reloads sometimes do generate ZODB conflict errors. If
> you reload while reindexing everything with heavy disk I/O, you
> usually get these ZODB conflicts. Maybe I should do some work on
> conflict resolution?

That means an object was deleted without being unindexed. Sounds like an
application bug somewhere.

BTW: calling getObject for every object found is a really bad idea and will
kill performance.
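For the sortArticleThreads method quoted earlier, one way around that is to
register the sort keys as metadata columns on the catalog and read them
straight off the brains. A sketch of mine, assuming getSortKey and a
hypothetical getThreadSortKey were added as metadata columns (and noting the
point above that sortResults on an index is still the better route for large
result sets):

def sortArticleThreads( self, brains ):
    # decorate-sort-undecorate using catalog metadata only: no getObject()
    # call, so no article objects are loaded from the ZODB
    temp_list = [ ( b.getSortKey, b.getThreadSortKey, b ) for b in brains ]
    temp_list.sort()
    return [ item[-1] for item in temp_list ]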

[snip]
> I was wrong. I looked into Catalog.py more closely and it was not the
> getIndex() call that was taking too long, but each index was. For example,
> 'Title' is a TextIndex and 'creation_date' is a DateIndex. Why would a
> TextIndex take that long to index an article with a simple title, 'test'?
> The DateIndex is also painfully slow.

I can't tell you without seeing it myself ;^). If you can demonstrate this
behavior in a relatively simple test case, I'd be interested in helping to
fix it, if not for TextIndex then at least for DateIndex.
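Something like the following could be a starting point for timing it inside
the board (my sketch, not from the thread; addNeoBoardArticle is a
hypothetical helper, while catalog_object is the standard ZCatalog method):

import time

def timeCataloguing( self, n=100 ):
    # index n throwaway articles with a trivial title and report elapsed time
    start = time.time()
    for i in range( n ):
        article = self.addNeoBoardArticle( 'timing-%d' % i, title='test' )
        self.catalog_object( article, '/'.join( article.getPhysicalPath() ) )
    return '%d objects catalogued in %.2f seconds' % ( n, time.time() - start )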
> > I would definitely try ZCTextIndex, just because its searching works
> > so much better.
>
> Will try :-)
>
> > One general suggestion: What is your ZODB cache set to?
>
> I'm running these tests both on my desktop Linux box and on a set of
> enterprise servers.
>
> Desktop: Pentium 4, RH Linux 8.0 with all the latest errata applied,
> 512 MB RAM, ZODB cache set to 8,000 objects, FileStorage
>
> Enterprise servers:
>
> - ZEO Storage Server: dual Xeon P4, RH Linux 8.0 with all the latest
> errata applied, 4 GB RAM, 430 GB SCSI HDDs with RAID 5, DirectoryStorage
> on ReiserFS with noatime on
> - ZEO Client: dual P3 Tualatin, RH Linux 8.0 with all the latest
> errata applied, 2 GB RAM with the ZODB cache set to 20,000 objects.
>
> Both my desktop and the ZEO client show the same symptoms. The ZEO
> servers render CMF/Plone + NeoBoard pages in 0.3 to 0.5 seconds on
> average, so I don't think there are any hardware or cache problems.
>=20
> >> Any help, hints or comments would be much appreciated.  I do need to
> >> move on with this project :-( It's been almost a year now... ouch.
> >> Weeks became months; months became a whole year... whew.
>
> > Yup, been there ;^)
>
> Been there too many times with other tools. Just hoped this time would
> be different with Zope :-)
>
> Thanks for your help.
>
> ---------------------------------------------------------------
>   Wankyu Choi
>   CEO/President
>   NeoQuest Communications, Inc.
>   http://www.zoper.net
>   http://www.neoboard.net
> ---------------------------------------------------------------


_______________________________________________
Zope maillist  -  Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
**   No cross posts or HTML encoding!  **
(Related lists -
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )