[Zope] ZCatalog performance issues - catalogging objects takes ages

Mon, 31 Mar 2003 14:36:56 -0500

On Monday 31 March 2003 12:02 pm, Wankyu Choi wrote:
> So glad to catch your attention :-)
>=20
> >Subclassing ZCatalog can be a maintenance headache. I did it for=20
> DocumentLibrary and regretted it.
>=20
> Can you expound on this? In fact, I did just the opposite. First, I tri=
ed
> "self._np_catalog =3D NeoPortalCatalog()" where NeoPortalCatalog is a s=
ubclass
> of ZCatalog. Thought better ( or worse should I say :-) of it, and inhe=
rited
> directly from NeoPortalCatalog. Thought it would be easier in terms of
> maintenance.

What I found was that forward compatibility was a problem. I just so happ=
ened=20
to land right before ZCatalog was refactored majorly for Zope 2.4.  I rea=
lly=20
depends on how much internal ZCatalog machinery/data structures you depen=
d=20
on...

See more comments inline:
=20
> >> Performance? Not so fast as SQL-backed PHP version ( displaying a=20
> >> pageful of threads takes only a fraction of a second ), but not bad.
>=20
> > Is this Zope 2.6.1? What do the queries look like?
>=20
> Zope 2.6.1 with Python 2.2. Heard Python 2.2 works just fine with Zope =
2.6.1
> and has had no obvious problems yet.
>=20
> Here's the code block that returns the query results and sorts a portio=
n of
> them if necessary ( a non-threaded look doesn't require this sort.=20
>=20
> security.declarePrivate( '_getArticles' )               =20
>     def _getArticles( self
>                     , top=3DNone
>                     , expand_all=3D0
>                     , skey=3DNone
>                     , reverse=3D0
>                     , default_reverse=3D0
>                     , search_fields=3D''
>                     , search_keywords=3D''
>                     , b_start=3D0
>                     , limit=3DNone ):
>         """
>             Return a complete list of brains objects from the Catalog
>         """
>                   =20
>         if search_fields and search_keywords:
>=20
>             index_map =3D {'title': 'Title'
>                     , 'creator':'Creator'
>                     , 'creator_email':'CreatorEmail'
>                     , 'body':'getNeoPortalContentSearchText'
>                     , 'category':'getArticleCategory'
>                     , 'comments':'getInlineCommentsSearchText'
>                     , 'comment_creators':'getInlineCommentCreators'
>                     , 'attachments': 'getAttachmentsSearchText'}   =20
>                =20
>             new_indexes =3D []
>             for index in search_fields.keys():
>                 new_indexes.append( index_map[index] )
>=20
>             results =3D self.search( indexes=3Dnew_indexes,
> keywords=3Dsearch_keywords.split( ' ' )  )
>=20
>         else:
>             # threads are automatically reverse-sorted on sort_keys
>=20
>             reverse =3D int( reverse )
>             result_limit =3D None
>             b_start =3D int( b_start )
>             if limit is not None:
>                 limit =3D int( limit )
>                 result_limit =3D b_start + limit
>=20
>             if skey:
>                 if skey=3D=3D'np_read_count': skey =3D 'getNeoPortalRea=
dCount'
>                 elif skey=3D=3D'np_num_ratings': skey =3D
> 'getNeoPortalNumContentRatings'
>                 else: skey =3D 'creation_date'
>                 sort_order =3D ''               =20
>                 if reverse: sort_order =3D 'reverse'
>                 results =3D self.searchResults(
> meta_type=3DNeoBoardArticle.meta_type, isTempNeoBoardArticle=3D0, sort_=
on=3Dskey,
> sort_order=3Dsort_order, limit =3D result_limit )

If you are trying to use the new sort limits, use: sort_limit =3D result_=
limit

[snip]
>            =20
>     security.declarePublic( 'sortAritlceThreads' )               =20
>     def sortArticleThreads( self, brains ):
>         """
>             Sort a list of brains=20
>         """
>=20
>         import operator=20
>=20
>         temp_list =3D map( lambda x: ( getattr( x.getObject(), '_sort_k=
ey' ),
> getattr( x.getObject(), '_thread_sort_key' ), x ), brains )
>               =20
>         temp_list.sort()
>         brains[:] =3D map( operator.getitem, temp_list, ( -1, ) * len(
> temp_list ) )
>          =20
>         return brains

This sorting code is not going to scale well at all.
=20
> - isTempNeoBoardArticle: NeoBoard does what CMF does. When a user posts=
 an
> article, it first creates a temp article and examines it. If unacceptab=
le,
> deletes it. The method tells if the article is a temporary one or not.
> - isThreadParent: tells if the article is the top-most one in the threa=
d.
> - expand_all: Boolean value to tell if we need a threaded-look.
> - getSortKey: returns thread sorting key. It's the inversed article num=
ber:
> article 50's sort key  becomes -50 when added to the board for automati=
c
> reverse-sorting. ( Tried creation_date once, but it turned out to be a
> disaster when you do import/export.)

To make creation date work, you'd need to make it an application modified=
=20
attribute.

> - result_limit: calculated on the basis of the current batch.
>=20
> If you want to take a look at it in context, I've got a viewCVS set up =
here:
> http://cvs.zoper.net:3333/cgi-bin/viewcvs.cgi/NeoBoard/NeoBoard.py
>=20
> And you can see the board in action here:
> http://www.zoper.net/Boards/qa/view
>=20
> I learned while reading docs on ZCatalog that I'd get better results by
> adding meta data to brain objects. Will remove that expensive sorting m=
ethod
> soon.

No actually, metadata won't help sorting much. If you want "out of band"=20
sorting Catalog (as of 2.6.1) has a method called sortResults, whose=20
signature looks like this:

sortResults(rs, sort_index, reverse=3D0, limit=3DNone, merge=3D1)

Where rs is the bare recordset (which can be had by calling searchResults=
(...,=20
_merge=3D0))
sort_index is the index to sort by (the object not the name)
reverse is the direction (sort_order)
limit is the sort limit
merge determines what is returned (1=3Dbrains, 0=3Da sorted list of rids)=
, you=20
probably want 1

So you could do:

catalog =3D self._catalog
rs =3D catalog.searchResults(..., _merge=3D0)
=2E..do some stuff with rs...
return catalog.sortResults(rs, self.getIndex(sort_key), ...)

>=20
> >> another set of problems while so doing.  I could display 5,000 threa=
ds=20
> >> ( about 20,000 article obects incuding all replies to the threads) i=
n=20
> >> less than a second ( it takes a bit more when you load the board for=
=20
> >> the first time. ) The problems are...
>=20
> > I would be interested in using this data as a benchmark for improveme=
nts
> in=20
> 2.7...
>=20
> Took me a whole day to generate these articles; had fun with them for a=
bout
> a week and lost them last night when the board's catalog went crazy wit=
h
> missing keys; I had to remove the board and the data went with it :-(
>=20
> On a different note. Creating an article object doesn't require that mu=
ch
> computation power. Just a bunch of init values for its properties. But
> instantiating an article in a for loop, for example, takes more than a
> second and it gets worse as the loop goes on. Is it because ZODB's
> transaction/version support? Normally, how long should it take to gener=
ate
> 20,000 not-so-heavy objects? Taking more than an hour seems not right w=
ith
> enough horsepower. When creating those test data, I had to take a long =
nap
> :-(

You should probably commit a subtransaction every so often so as not to u=
se=20
too much memory. Sounds like it was trying to commit a really big=20
transaction. If these objects are all nested and you create a big hierarc=
hy,=20
that might explain it a bit.
 =20
[snip]
> > that causes these key errors. Do these keyerrors happen at search tim=
e?
>=20
> No, I meant a key error in 'mybrain.getObject()', that is, a ghost entr=
y in
> the Catalog without the corresponding object. Guess it happens after a
> massive set of additions or deletions. Can't pinpoint a case. Fast relo=
ads
> sometimes do generate ZODB conflict errors. If you reload while reindex=
ing
> everything with heavy disk I/O, you usually get these ZODB conflicts. M=
aybe
> I should do some work on conflict resolutions?

That means an object was deleted without being unindexed. Sounds like an=20
application bug somewhere.

BTW: Calling getObject for every object found is a really bad idea and wi=
ll=20
kill performance.

[snip]
> I was wrong. I looked into the Catalog.py more closely and it was not t=
he
> getIndex() call that was taking too long, but each index was. For examp=
le,
> 'Title' is a textindex and 'creation_date' is a DateIndex. Why would a
> TextIndex take that long to index an article with a simple title 'test'=
? The
> DateIndex is also painfully slow.=20

I can't tell you without seeing it myself ;^). If you can demonstrate thi=
s=20
behavior in a relatively simple test case, I'd be interested in helping t=
o=20
fix it, if not for TextIndex then at least for DateIndex.
=20
> > I would definitely Try ZCTextIndex, just because its searching works =
so
> much better.
>=20
> Will try :-)
>=20
> > One general suggestion: What is your ZODB cache set to?=20
>=20
> I'm running these tests both on my desktop Linux box and on a set of
> enterprise servers.
>=20
> Desktop: Pentium 4, RH Linux 8.0 with all the latest errata applied, 51=
2M
> RAM, cache set to 8,000, FileStorage
>=20
> Enterprise servers:=20
>=20
> - ZEO Storage Server: dual Xeon P4, RH Linux 8.0 with all the latest er=
rata
> applied, 4G RAM, 430G SCSI HDDs with RAID 5, DirectoryStorage on Reiser=
FS
> with noatime on
> - ZEO Client: dual P3 tualatin, RH Linux 8.0 with all the latest errata
> applied, 2G RAM with ZODB cache set to 20,000.
>=20
> Both my desktop and the ZEO client show the same symptoms. The ZEO serv=
ers
> render CMF/Plone + NeoBoard pages in an average of 0.3 ~ 0.5 second, so=
 I
> don't think there's any hardware/cache problems.
>=20
> >> Any help, hints or comments would be much appreciated.  I do need to=
=20
> >> move on with this project :-( It's been almost a year now...ouch.=20
> >> Weeks became months; months became a whole year... whew.
>=20
> > Yup, been there ;^)
>=20
> Been there too many times with other tools. Just hoped this time would =
be
> different with Zope :-)=20
>=20
> Thanks for your help.
>=20
> ---------------------------------------------------------------
>   Wankyu Choi
>   CEO/President
>   NeoQuest Communications, Inc.
>   http://www.zoper.net
>   http://www.neoboard.net
> ---------------------------------------------------------------  =20
>=20
>=20