ZCatalog performance issues - cataloging objects takes ages
Dear All,

May I have your expertise on this? ;-) As new as I am to Zope/Python, ZCatalog (Catalog) internals vex me even more.

I have a message board product called NeoBoard; some of you might know it. Recently I rewrote its core to have a built-in catalog for indexing articles and displaying them automatically sorted on thread keys. It showed quite a boost in performance. Previous versions without the built-in catalog used to ramrod all article objects into/out of memory whenever they needed to display them. What a waste of memory and CPU power, as Toby Dickenson suggested.

Here's what I did to solve this problem:

- Rewrote the parent class of NeoBoard/NeoBoardArticle (the article container and the article objects themselves), NeoPortalElementContainer, to inherit from ZCatalog. Basically, NeoPortalElementContainer automatically natural-sorts/numbers objects (elements) when they're added to the container: page_1, page_2, etc.
- NeoBoardArticle looks toward NeoBoard when the catalog methods defined in NeoPortalElementContainer are called, so NeoBoard's catalog methods are always used no matter where you are in the path hierarchy.
- When you call a NeoBoard instance, it calls ZCatalog's searchResults(), which returns brains objects. A threaded (expanded) look requires a step further: NeoBoard sorts a pageful of threads and their replies before returning them; it doesn't touch the threads that are not displayed in the current request.

Performance? Not as fast as the SQL-backed PHP version (displaying a pageful of threads takes only a fraction of a second there), but not bad.

Okay, I partially solved one problem (wasting memory/horsepower, etc.; I'm still not satisfied with the performance, though) but created another set of problems in the process. I can display 5,000 threads (about 20,000 article objects including all replies to the threads) in less than a second (it takes a bit more when you load the board for the first time). The problems are...
- It takes ages to catalog even a small number of articles: 18 seconds for 50 or so article objects with so little to index. Is that normal? I can't imagine recataloging 20,000 objects. For example, if you move a thread from one NeoBoard instance to another, you have to uncatalog the thread and all its replies in NeoBoard A and catalog them in NeoBoard B; cataloging a single article object takes more than 1 second. I don't think that's normal... or is it?
- When I attempt to uncatalog an object that hasn't been cataloged, Zope spews errors into the log. Can I suppress these errors in code? In my application they are meaningless.
- Catalogs sometimes do get corrupted, so recataloging is required from time to time. Is that also normal? All of my article objects are catalog-aware; they catalog/uncatalog/recatalog themselves when added, deleted, or modified, using manage_afterAdd(), manage_beforeDelete(), and a CMF-ish _edit() method. When a missing article (a ghost catalog entry) causes a KeyError, NeoBoard attempts to refresh the catalog, which takes too much time; but manually recreating its catalog is no real alternative. Any ideas why this happens? Any tips on maintaining catalog integrity?
- Here are the indexes NeoBoard uses:

    security.declarePublic( 'enumerateIndexes' )
    def enumerateIndexes( self ):
        """ Return a list of ( index_name, type ) pairs for the initial index set. """
        return ( ('Title', 'TextIndex')
               , ('meta_type', 'FieldIndex')
               , ('getSortKey', 'FieldIndex')
               , ('getThreadSortKey', 'FieldIndex')
               , ('isThreadParent', 'FieldIndex')
               , ('creation_date', 'FieldIndex')
               , ('Creator', 'FieldIndex')
               , ('CreatorEmail', 'FieldIndex')
               , ('getArticleCategory', 'FieldIndex')
               , ('getNeoPortalContentSearchText', 'TextIndex')
               , ('getInlineCommentsSearchText', 'TextIndex')
               , ('getInlineCommentCreators', 'TextIndex')
               , ('getAttachmentsSearchText', 'TextIndex')
               , ('getNeoPortalReadCount', 'FieldIndex')
               , ('getNeoPortalNumContentRatings', 'FieldIndex')
               , ('getNeoPortalElementNumber', 'FieldIndex')
               , ('isTempNeoBoardArticle', 'FieldIndex')
               )

I came to know that 'TextIndex' is deprecated. I have yet to try ZCTextIndex or TextIndexNG (the latter seems like overkill). I found 'TopicIndex' very interesting. Would they make much of a difference? I was especially surprised to find that the simple 'Title' index takes almost one full second when applied to an object: that getIndex( name ) call alone in Catalog.py takes this much. So I suspect it's not about Catalog; I'm probably doing something very stupid in setting up this built-in catalog.

ONE FINAL QUESTION: I strongly suspect I wouldn't be able to get any faster using ZCatalog, at least not as fast as an RDBMS. I'm thinking, "Not fast enough, not flexible enough since I can't perform sophisticated queries on ZCatalog and stuff... why not revert to MySQL?" Got any thoughts on this? How does ZCatalog compare to a reasonably fast RDBMS?

NeoBoard (1.1) will be taken out of its beta phase when I solve this cataloging weirdness, and I might start working on 1.2 using MySQL or SAPDB as the backend. I hope somebody can persuade me off this path... just the thought of having to rewrite the core to use SQL makes me shudder... arrrrrrgh...

Any help, hints or comments would be much appreciated. I do need to move on with this project :-( It's been almost a year now... ouch.
Weeks became months; months became a whole year... whew. Thanks in advance.

---------------------------------------------------------------
Wankyu Choi
CEO/President
NeoQuest Communications, Inc.
http://www.zoper.net
http://www.neoboard.net
---------------------------------------------------------------
On Monday 31 March 2003 05:47 am, Wankyu Choi wrote:
Dear All,
May I have your expertise on this? ;-)
As much as I'm new to Zope/Python, ZCatalog (Catalog) internals vex me even more.
I have a message board product called NeoBoard; some of you might know it. Recently I rewrote its core to have a built-in catalog for indexing articles and displaying them automatically sorted on thread keys. It showed quite a boost in performance. Previous versions without the built-in catalog used to ramrod all article objects into/out of memory whenever they needed to display them. What a waste of memory and CPU power, as Toby Dickenson suggested.
Here's what I did to solve this problem:
- Rewrote the parent class of the NeoBoard/NeoBoardArticle ( article container/article objects themselves ), NeoPortalElementContainer to inherit ZCatalog. Basically NeoPortalElementContainer automatically natural sorts/numbers objects (elements) when they're added to the container: page_1, page_2, ... etc.
Subclassing ZCatalog can be a maintenance headache. I did it for DocumentLibrary and regretted it.
- NeoBoardArticle looks toward NeoBoard when the catalog methods defined in NeoPortalElementContainer are called, so NeoBoard's catalog methods are always used no matter where you are in the path hierarchy.
- When you call a NeoBoard instance, it calls ZCatalog's searchResults(), which returns brains objects. A threaded (expanded) look requires a step further: NeoBoard sorts a pageful of threads and their replies before returning them; it doesn't touch the threads that are not displayed in the current request.
Performance? Not so fast as SQL-backed PHP version ( displaying a pageful of threads takes only a fraction of a second ), but not bad.
Is this Zope 2.6.1? What do the queries look like?
Okay, I partially solved one problem (wasting memory/horsepower, etc.; I'm still not satisfied with the performance, though) but created another set of problems in the process. I can display 5,000 threads (about 20,000 article objects including all replies to the threads) in less than a second (it takes a bit more when you load the board for the first time). The problems are...
I would be interested in using this data as a benchmark for improvements in 2.7...
- It takes ages to catalog even a small number of articles: 18 seconds for 50 or so article objects with so little to index. Is that normal? I can't imagine recataloging 20,000 objects. For example, if you move a thread from one NeoBoard instance to another, you have to uncatalog the thread and all its replies in NeoBoard A and catalog them in NeoBoard B; cataloging a single article object takes more than 1 second. I don't think that's normal... or is it?
Profiling may be necessary to pin this down. The likely culprits are the TextIndexes, but it's hard to say. Are you sure you are doing a minimum of work (i.e., only indexing each message once)?
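[As a starting point for the profiling suggested here, something like the sketch below works against any callable. It uses the profiler from current Python (cProfile/pstats); in the Python 2.2 of this thread it would be the older `profile` module instead. `index_article` is a hypothetical stand-in for whatever does the cataloging work; against Zope 2.6 you would profile a call to `catalog_object()` instead.]

```python
import cProfile
import pstats
import io

def profile_call(func, *args, **kwargs):
    """Run func under the profiler and return (result, stats_report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the ten most expensive calls by cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()

# Hypothetical stand-in for the real cataloging call, e.g.
# board.catalog_object(article, path) in Zope 2.6.
def index_article(article):
    return sorted(article["title"].split())

result, report = profile_call(index_article, {"title": "test article"})
```

Reading the report top-down usually shows immediately whether the time is going into one index's plugin code or into ZODB loads.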
- When I attempt to uncatalog an object that hasn't been cataloged, Zope spews errors into the log. Can I suppress these errors in code? In my application they are meaningless.
These errors are harmless, but it might be better to check whether objects are cataloged before uncataloging them.
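[The check can be a one-line guard. In Zope 2.6, ZCatalog keeps a path-to-record-id mapping, so something like `catalog.getrid(path) is not None` tells you whether a path is cataloged; the exact method name should be verified against your Zope version. The stub below only sketches the pattern so it runs standalone.]

```python
class StubCatalog:
    """Minimal stand-in for the ZCatalog API used below (assumed names)."""
    def __init__(self):
        self._uids = {}  # path -> record id, like Catalog.uids in Zope 2.6

    def catalog_object(self, obj, path):
        self._uids[path] = id(obj)

    def getrid(self, path):
        return self._uids.get(path)  # None when the path is not cataloged

    def uncatalog_object(self, path):
        del self._uids[path]  # raises KeyError for unknown paths

def safe_uncatalog(catalog, path):
    """Uncatalog path only if it is actually cataloged; True if removed."""
    if catalog.getrid(path) is None:
        return False  # nothing to do -- avoids the logged error
    catalog.uncatalog_object(path)
    return True

catalog = StubCatalog()
catalog.catalog_object(object(), "/board/article_1")
```

The guard costs one BTree lookup, which is cheap compared to the logged traceback it avoids.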
- Catalogs sometimes do get corrupted, so recataloging is required from time to time. Is that also normal? All of my article objects are catalog-aware; they catalog/uncatalog/recatalog themselves when added, deleted, or modified, using manage_afterAdd(), manage_beforeDelete(), and a CMF-ish _edit() method. When a missing article (a ghost catalog entry) causes a KeyError, NeoBoard attempts to refresh the catalog, which takes too much time; but manually recreating its catalog is no real alternative. Any ideas why this happens? Any tips on maintaining catalog integrity?
Although there have historically been BTree bugs that can cause KeyErrors, they have slowly been stamped out. It would be helpful to find a test case that triggers these KeyErrors. Do they happen at search time?
- Here're the indexes NeoBoard uses:
    security.declarePublic( 'enumerateIndexes' )
    def enumerateIndexes( self ):
        """ Return a list of ( index_name, type ) pairs for the initial index set. """
        return ( ('Title', 'TextIndex')
               , ('meta_type', 'FieldIndex')
               , ('getSortKey', 'FieldIndex')
               , ('getThreadSortKey', 'FieldIndex')
               , ('isThreadParent', 'FieldIndex')
               , ('creation_date', 'FieldIndex')
               , ('Creator', 'FieldIndex')
               , ('CreatorEmail', 'FieldIndex')
               , ('getArticleCategory', 'FieldIndex')
               , ('getNeoPortalContentSearchText', 'TextIndex')
               , ('getInlineCommentsSearchText', 'TextIndex')
               , ('getInlineCommentCreators', 'TextIndex')
               , ('getAttachmentsSearchText', 'TextIndex')
               , ('getNeoPortalReadCount', 'FieldIndex')
               , ('getNeoPortalNumContentRatings', 'FieldIndex')
               , ('getNeoPortalElementNumber', 'FieldIndex')
               , ('isTempNeoBoardArticle', 'FieldIndex')
               )
I'm concerned that the CommentsSearchText and AttachmentsSearchText indexes are arbitrarily expensive. As a test, try removing one index at a time to see whether any single one is causing a noticeable slowdown. Start with the TextIndexes.
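[The remove-one-index-at-a-time test can be automated with a small timing harness like this sketch. `index_object` is a hypothetical stand-in; in a real test it would recreate the board's catalog with the given index subset and catalog one article, so the most expensive index shows up as the biggest drop in elapsed time when it is the one removed.]

```python
import time

def time_index_subsets(index_names, index_object):
    """Time one indexing pass per subset that drops a single index.

    index_object(active_indexes) stands in for a full catalog_object()
    pass; returns {dropped_index: seconds}.
    """
    timings = {}
    for dropped in index_names:
        active = [name for name in index_names if name != dropped]
        start = time.time()
        index_object(active)
        timings[dropped] = time.time() - start
    return timings

# Hypothetical stand-in: pretend each active index does some work.
def fake_index_object(active_indexes):
    for name in active_indexes:
        sum(range(1000))

timings = time_index_subsets(
    ['Title', 'creation_date', 'getSortKey'], fake_index_object)
```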
I came to know that 'TextIndex' is deprecated. I have yet to try ZCTextIndex or TextIndexNG (the latter seems like overkill). I found 'TopicIndex' very interesting. Would they make much of a difference? I was especially surprised to find that the simple 'Title' index takes almost one full second when applied to an object: that getIndex( name ) call alone in Catalog.py takes this much. So I suspect it's not about Catalog; I'm probably doing something very stupid in setting up this built-in catalog.
That delay may be exposing an index bug. getIndex just does a single dictionary lookup and wraps it, so I'm not sure why it should take a long time, unless the TextIndex object is taking a *long* time to load from the database. But its main ZODB record should not be very big. I would definitely try ZCTextIndex, just because its searching works so much better.
ONE FINAL QUESTION: I strongly suspect I wouldn't be able to get any faster using ZCatalog, at least not as fast as an RDBMS. I'm thinking, "Not fast enough, not flexible enough since I can't perform sophisticated queries on ZCatalog and stuff... why not revert to MySQL?" Got any thoughts on this? How does ZCatalog compare to a reasonably fast RDBMS?
One general suggestion: What is your ZODB cache set to? The default of 400 is *way* too small for heavy ZCatalog use. I would try upping it to 2000, maybe higher (depending on RAM). Use the activity monitor to see how much reading happens when you query and index. Upping the cache size can dramatically reduce reading from disk. Going from 400 to 2000 gave me roughly a factor of 10 improvement in one test I had querying ZCTextIndex. It also can dramatically help index time since more of the lexicon and index BTrees can remain in memory.
NeoBoard (1.1) will be taken out of its beta phase when I solve this cataloging weirdness, and I might start working on 1.2 using MySQL or SAPDB as the backend. I hope somebody can persuade me off this path... just the thought of having to rewrite the core to use SQL makes me shudder... arrrrrrgh...
Any help, hints or comments would be much appreciated. I do need to move on with this project :-( It's been almost a year now...ouch. Weeks became months; months became a whole year... whew.
Yup, been there ;^) -Casey
So glad to catch your attention :-)
Subclassing ZCatalog can be a maintenance headache. I did it for DocumentLibrary and regretted it.
Can you expound on this? In fact, I did just the opposite. First, I tried "self._np_catalog = NeoPortalCatalog()" where NeoPortalCatalog is a subclass of ZCatalog. Thought better ( or worse should I say :-) of it, and inherited directly from NeoPortalCatalog. Thought it would be easier in terms of maintenance.
Performance? Not so fast as SQL-backed PHP version ( displaying a pageful of threads takes only a fraction of a second ), but not bad.
Is this Zope 2.6.1? What do the queries look like?
Zope 2.6.1 with Python 2.2. I've heard Python 2.2 works just fine with Zope 2.6.1, and I've had no obvious problems yet.

Here's the code block that returns the query results and sorts a portion of them if necessary (a non-threaded look doesn't require this sort):

    security.declarePrivate( '_getArticles' )
    def _getArticles( self
                    , top=None
                    , expand_all=0
                    , skey=None
                    , reverse=0
                    , default_reverse=0
                    , search_fields=''
                    , search_keywords=''
                    , b_start=0
                    , limit=None ):
        """ Return a complete list of brains objects from the Catalog """
        if search_fields and search_keywords:
            index_map = { 'title': 'Title'
                        , 'creator': 'Creator'
                        , 'creator_email': 'CreatorEmail'
                        , 'body': 'getNeoPortalContentSearchText'
                        , 'category': 'getArticleCategory'
                        , 'comments': 'getInlineCommentsSearchText'
                        , 'comment_creators': 'getInlineCommentCreators'
                        , 'attachments': 'getAttachmentsSearchText' }
            new_indexes = []
            for index in search_fields.keys():
                new_indexes.append( index_map[index] )
            results = self.search( indexes=new_indexes
                                 , keywords=search_keywords.split( ' ' ) )
        else:
            # threads are automatically reverse-sorted on sort_keys
            reverse = int( reverse )
            result_limit = None
            b_start = int( b_start )
            if limit is not None:
                limit = int( limit )
                result_limit = b_start + limit
            if skey:
                if skey == 'np_read_count':
                    skey = 'getNeoPortalReadCount'
                elif skey == 'np_num_ratings':
                    skey = 'getNeoPortalNumContentRatings'
                else:
                    skey = 'creation_date'
                sort_order = ''
                if reverse:
                    sort_order = 'reverse'
                results = self.searchResults( meta_type=NeoBoardArticle.meta_type
                                            , isTempNeoBoardArticle=0
                                            , sort_on=skey
                                            , sort_order=sort_order
                                            , limit=result_limit )
            else:
                expand_all = int( expand_all )
                current_thread = None
                if expand_all:
                    results = self.searchResults( meta_type=NeoBoardArticle.meta_type
                                                , isTempNeoBoardArticle=0
                                                , sort_on='getSortKey'
                                                , limit=result_limit )
                else:
                    results = self.searchResults( meta_type=NeoBoardArticle.meta_type
                                                , isTempNeoBoardArticle=0
                                                , isThreadParent=( not expand_all )
                                                , sort_on='getSortKey'
                                                , limit=result_limit )
                    # also pull all the replies to the current article
                    if top is not None and top.meta_type == NeoBoardArticle.meta_type:
                        top = top.getThreadParent()
                        sort_key = top.getSortKey()
                        current_thread = self.searchResults( meta_type=NeoBoardArticle.meta_type
                                                           , isTempNeoBoardArticle=0
                                                           , getSortKey=sort_key
                                                           , isThreadParent=0
                                                           , sort_on='getThreadSortKey' )
                if limit is not None:
                    # sort only the meaningful portion of the results specified by 'limit'
                    first_half = results[ : b_start ]
                    middle = results[ b_start : b_start + limit ]
                    if not expand_all and current_thread is not None:
                        middle = middle[:] + current_thread[:]
                    second_half = results[ b_start + limit : ]
                    middle = self.sortArticleThreads( middle )
                    results = first_half + middle + second_half
        notices = self.getNotices()
        if len( notices ) > 0:
            results = notices[:] + results[:]
        return results

    security.declarePublic( 'sortArticleThreads' )
    def sortArticleThreads( self, brains ):
        """ Sort a list of brains """
        import operator
        temp_list = map( lambda x: ( getattr( x.getObject(), '_sort_key' )
                                   , getattr( x.getObject(), '_thread_sort_key' )
                                   , x )
                       , brains )
        temp_list.sort()
        brains[:] = map( operator.getitem, temp_list, ( -1, ) * len( temp_list ) )
        return brains

Some notes on the names:

- isTempNeoBoardArticle: NeoBoard does what CMF does. When a user posts an article, it first creates a temp article and examines it; if unacceptable, it deletes it. This method tells whether the article is a temporary one.
- isThreadParent: tells whether the article is the top-most one in its thread.
- expand_all: Boolean telling whether we need a threaded look.
- getSortKey: returns the thread sorting key. It's the inverted article number: article 50's sort key becomes -50 when it is added to the board, for automatic reverse sorting. (I tried creation_date once, but it turned out to be a disaster when you do import/export.)
- result_limit: calculated on the basis of the current batch.
If you want to take a look at it in context, I've got a ViewCVS set up here: http://cvs.zoper.net:3333/cgi-bin/viewcvs.cgi/NeoBoard/NeoBoard.py

And you can see the board in action here: http://www.zoper.net/Boards/qa/view

I learned while reading the docs on ZCatalog that I'd get better results by adding metadata to brain objects. I will remove that expensive sorting method soon.
another set of problems in the process. I can display 5,000 threads (about 20,000 article objects including all replies to the threads) in less than a second (it takes a bit more when you load the board for the first time). The problems are...
I would be interested in using this data as a benchmark for improvements in 2.7...
Took me a whole day to generate these articles; I had fun with them for about a week and lost them last night when the board's catalog went crazy with missing keys; I had to remove the board, and the data went with it :-(

On a different note: creating an article object doesn't require that much computational power, just a bunch of init values for its properties. But instantiating articles in a for loop, for example, takes more than a second each, and it gets worse as the loop goes on. Is it because of ZODB's transaction/version support? Normally, how long should it take to generate 20,000 not-so-heavy objects? More than an hour doesn't seem right with enough horsepower. While creating that test data, I had to take a long nap :-(
cataloging a single article object takes more than 1 second. Don't think it's normal... Or is it?
Profiling may be necessary to pin this down. The likely culprits are the TextIndexes, but it's hard to say. Are you sure you are doing a minimum of work (i.e., only indexing each message once)?
I used to re-render the contents of articles when copying/moving them to another NeoBoard instance, but removed that code since it took too long. It doesn't do anything now except cataloging/uncataloging moved/copied articles. Excerpts from NeoBoardArticle's manage_afterAdd():

    if item is self and item.isThreadParent():
        item._setSortKey()
        item._setThreadSortKey()
        neoboard = item.getNeoBoard()
        neoboard.addToNeoPortalElementContainerCatalog( item )
        articles = [ obj for id, obj in item.ZopeFind( item
                                                     , obj_metatypes=[ NeoBoardArticle.meta_type, ]
                                                     , search_sub=1 ) ]
        for article in articles:
            article._setSortKey()
            article._setThreadSortKey()
            neoboard.addToNeoPortalElementContainerCatalog( article )

I had to rename the indexObject method to 'addToNeoPortalElementContainerCatalog()' to prevent confusion with CMF. Nothing fancy there; it's an exact copy of ZCatalog's.
- When I attempt to uncatalog an object that hasn't been cataloged, Zope spews errors into the log. Can I suppress these errors in code? In my application they are meaningless.
These errors are harmless, but it might be better to check whether objects are cataloged before uncataloging them.
Guess I was stupid in thinking I'd save some time by skipping the check ;-)
Although there have historically been BTree bugs that can cause KeyErrors, they have slowly been stamped out. It would be helpful to find a test case
that triggers these KeyErrors. Do they happen at search time?
No, I meant a KeyError in 'mybrain.getObject()', that is, a ghost entry in the Catalog without a corresponding object. I guess it happens after a massive set of additions or deletions; I can't pinpoint a case. Fast reloads sometimes generate ZODB conflict errors: if you reload while reindexing everything with heavy disk I/O, you usually get these conflicts. Maybe I should do some work on conflict resolution?
I'm concerned that the CommentsSearchText and AttachmentsSearchText indexes are arbitrarily expensive. As a test, try removing one index at a time to
see whether any single one is causing a noticeable slowdown. Start with the
TextIndexes.
Actually not. My test article objects didn't have any inline comments, for example. I tried a single index after removing all the others, getNeoPortalSearchText, which returns title + body text. No obvious improvement in performance. The funny thing is the test articles didn't have much body: just a couple of words ('test article').
That delay may be exposing an index bug. getIndex just does a single dictionary lookup and wraps it, so I'm not sure why this should take a long time, unless the TextIndex object is taking a *long* time to load from the
database. But its main ZODB record should not be very big.
I was wrong. I looked into Catalog.py more closely, and it was not the getIndex() call that was taking too long but each index itself. For example, 'Title' is a TextIndex and 'creation_date' is a DateIndex. Why would a TextIndex take that long to index an article with the simple title 'test'? The DateIndex is also painfully slow.
I would definitely try ZCTextIndex, just because its searching works so much better.
Will try :-)
One general suggestion: What is your ZODB cache set to?
I'm running these tests both on my desktop Linux box and on a set of enterprise servers.

Desktop: Pentium 4, RH Linux 8.0 with all the latest errata applied, 512M RAM, cache set to 8,000, FileStorage

Enterprise servers:
- ZEO Storage Server: dual Xeon P4, RH Linux 8.0 with all the latest errata applied, 4G RAM, 430G SCSI HDDs with RAID 5, DirectoryStorage on ReiserFS with noatime on
- ZEO Client: dual P3 Tualatin, RH Linux 8.0 with all the latest errata applied, 2G RAM with the ZODB cache set to 20,000

Both my desktop and the ZEO client show the same symptoms. The ZEO servers render CMF/Plone + NeoBoard pages in an average of 0.3~0.5 seconds, so I don't think there are any hardware/cache problems.
Any help, hints or comments would be much appreciated. I do need to move on with this project :-( It's been almost a year now...ouch. Weeks became months; months became a whole year... whew.
Yup, been there ;^)
Been there too many times with other tools. Just hoped this time would be different with Zope :-)

Thanks for your help.

---------------------------------------------------------------
Wankyu Choi
CEO/President
NeoQuest Communications, Inc.
http://www.zoper.net
http://www.neoboard.net
---------------------------------------------------------------
On Monday 31 March 2003 12:02 pm, Wankyu Choi wrote:
So glad to catch your attention :-)
Subclassing ZCatalog can be a maintenance headache. I did it for DocumentLibrary and regretted it.
Can you expound on this? In fact, I did just the opposite. First, I tried "self._np_catalog = NeoPortalCatalog()" where NeoPortalCatalog is a subclass of ZCatalog. Thought better ( or worse should I say :-) of it, and inherited directly from NeoPortalCatalog. Thought it would be easier in terms of maintenance.
What I found was that forward compatibility was a problem. I just so happened to land right before ZCatalog was heavily refactored for Zope 2.4. It really depends on how much internal ZCatalog machinery/data structures you depend on... See more comments inline:
Performance? Not so fast as SQL-backed PHP version ( displaying a pageful of threads takes only a fraction of a second ), but not bad.
Is this Zope 2.6.1? What do the queries look like?
Zope 2.6.1 with Python 2.2. Heard Python 2.2 works just fine with Zope 2.6.1 and has had no obvious problems yet.
Here's the code block that returns the query results and sorts a portion of them if necessary ( a non-threaded look doesn't require this sort.
    security.declarePrivate( '_getArticles' )
    def _getArticles( self
                    , top=None
                    , expand_all=0
                    , skey=None
                    , reverse=0
                    , default_reverse=0
                    , search_fields=''
                    , search_keywords=''
                    , b_start=0
                    , limit=None ):
        """ Return a complete list of brains objects from the Catalog """
        if search_fields and search_keywords:
            index_map = { 'title': 'Title'
                        , 'creator': 'Creator'
                        , 'creator_email': 'CreatorEmail'
                        , 'body': 'getNeoPortalContentSearchText'
                        , 'category': 'getArticleCategory'
                        , 'comments': 'getInlineCommentsSearchText'
                        , 'comment_creators': 'getInlineCommentCreators'
                        , 'attachments': 'getAttachmentsSearchText' }
            new_indexes = []
            for index in search_fields.keys():
                new_indexes.append( index_map[index] )
            results = self.search( indexes=new_indexes
                                 , keywords=search_keywords.split( ' ' ) )
        else:
            # threads are automatically reverse-sorted on sort_keys
            reverse = int( reverse )
            result_limit = None
            b_start = int( b_start )
            if limit is not None:
                limit = int( limit )
                result_limit = b_start + limit
            if skey:
                if skey == 'np_read_count':
                    skey = 'getNeoPortalReadCount'
                elif skey == 'np_num_ratings':
                    skey = 'getNeoPortalNumContentRatings'
                else:
                    skey = 'creation_date'
                sort_order = ''
                if reverse:
                    sort_order = 'reverse'
                results = self.searchResults( meta_type=NeoBoardArticle.meta_type
                                            , isTempNeoBoardArticle=0
                                            , sort_on=skey
                                            , sort_order=sort_order
                                            , limit=result_limit )
If you are trying to use the new sort limits, use: sort_limit = result_limit [snip]
    security.declarePublic( 'sortArticleThreads' )
    def sortArticleThreads( self, brains ):
        """ Sort a list of brains """
        import operator
        temp_list = map( lambda x: ( getattr( x.getObject(), '_sort_key' )
                                   , getattr( x.getObject(), '_thread_sort_key' )
                                   , x )
                       , brains )
        temp_list.sort()
        brains[:] = map( operator.getitem, temp_list, ( -1, ) * len( temp_list ) )
        return brains
This sorting code is not going to scale well at all.
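[One reason it won't scale is that every brain is woken with getObject(), loading the full article from ZODB just to read two sort keys. A sketch of an alternative: record both keys as catalog metadata columns when indexing, then sort on the brains' metadata directly, so no objects are loaded at all. Plain dictionaries stand in for brains here, and the key names mirror the thread's own attributes.]

```python
def sort_article_threads(brains):
    """Sort brains on (sort_key, thread_sort_key) without loading objects.

    Assumes both keys were added as metadata columns when cataloging,
    so each brain already carries them -- no getObject() calls needed.
    """
    return sorted(brains, key=lambda b: (b["sort_key"], b["thread_sort_key"]))

# Stand-ins for brains; real brains would expose these as attributes.
brains = [
    {"id": "article_2", "sort_key": -2, "thread_sort_key": 0},
    {"id": "article_1", "sort_key": -1, "thread_sort_key": 0},
    {"id": "reply_2_1", "sort_key": -2, "thread_sort_key": 1},
]
ordered = sort_article_threads(brains)
```

The cost drops from one ZODB load per result to a pure in-memory sort of small tuples.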
- isTempNeoBoardArticle: NeoBoard does what CMF does. When a user posts an article, it first creates a temp article and examines it; if unacceptable, it deletes it. This method tells whether the article is a temporary one.
- isThreadParent: tells whether the article is the top-most one in its thread.
- expand_all: Boolean telling whether we need a threaded look.
- getSortKey: returns the thread sorting key. It's the inverted article number: article 50's sort key becomes -50 when it is added to the board, for automatic reverse sorting. (I tried creation_date once, but it turned out to be a disaster when you do import/export.)
To make creation date work, you'd need to make it an application-modified attribute.
- result_limit: calculated on the basis of the current batch.
If you want to take a look at it in context, I've got a viewCVS set up here: http://cvs.zoper.net:3333/cgi-bin/viewcvs.cgi/NeoBoard/NeoBoard.py
And you can see the board in action here: http://www.zoper.net/Boards/qa/view
I learned while reading the docs on ZCatalog that I'd get better results by adding metadata to brain objects. I will remove that expensive sorting method soon.
No, actually, metadata won't help sorting much. If you want "out of band" sorting, Catalog (as of 2.6.1) has a method called sortResults, whose signature looks like this:

    sortResults(rs, sort_index, reverse=0, limit=None, merge=1)

where:
- rs is the bare record set (which can be had by calling searchResults(..., _merge=0))
- sort_index is the index to sort by (the object, not the name)
- reverse is the direction (sort_order)
- limit is the sort limit
- merge determines what is returned (1=brains, 0=a sorted list of rids); you probably want 1

So you could do:

    catalog = self._catalog
    rs = catalog.searchResults(..., _merge=0)
    # ...do some stuff with rs...
    return catalog.sortResults(rs, self.getIndex(sort_key), ...)
another set of problems in the process. I can display 5,000 threads (about 20,000 article objects including all replies to the threads) in less than a second (it takes a bit more when you load the board for the first time). The problems are...
I would be interested in using this data as a benchmark for improvements in 2.7...
Took me a whole day to generate these articles; I had fun with them for about a week and lost them last night when the board's catalog went crazy with missing keys; I had to remove the board, and the data went with it :-(
On a different note: creating an article object doesn't require that much computational power, just a bunch of init values for its properties. But instantiating articles in a for loop, for example, takes more than a second each, and it gets worse as the loop goes on. Is it because of ZODB's transaction/version support? Normally, how long should it take to generate 20,000 not-so-heavy objects? More than an hour doesn't seem right with enough horsepower. While creating that test data, I had to take a long nap :-(
You should probably commit a subtransaction every so often so as not to use too much memory. Sounds like it was trying to commit a really big transaction. If these objects are all nested and you create a big hierarchy, that might explain it a bit. [snip]
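[The batching pattern looks roughly like the sketch below. In the Zope 2.6 of this thread the subtransaction commit would be `get_transaction().commit(1)`, which flushes changed objects to temporary storage so the in-memory working set stays bounded; a stub commit stands in here so the sketch runs outside Zope, and the `board`/`create_articles` names are illustrative.]

```python
def create_articles(board, count, batch_size=100, commit=lambda: None):
    """Create count articles, committing a subtransaction per batch.

    In Zope 2.6 you would pass commit=lambda: get_transaction().commit(1);
    the default no-op keeps this sketch runnable outside Zope.
    """
    commits = 0
    for i in range(count):
        board.append({"id": "article_%d" % (i + 1)})
        if (i + 1) % batch_size == 0:
            commit()           # flush the batch so memory use stays flat
            commits += 1
    return commits

board = []
committed = []
n = create_articles(board, 250, batch_size=100,
                    commit=lambda: committed.append(1))
```

Without the periodic commits, every created object stays dirty in the one big transaction, which matches the "gets worse as the loop goes on" symptom.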
that triggers these KeyErrors. Do they happen at search time?
No, I meant a KeyError in 'mybrain.getObject()', that is, a ghost entry in the Catalog without a corresponding object. I guess it happens after a massive set of additions or deletions; I can't pinpoint a case. Fast reloads sometimes generate ZODB conflict errors: if you reload while reindexing everything with heavy disk I/O, you usually get these conflicts. Maybe I should do some work on conflict resolution?
That means an object was deleted without being unindexed. Sounds like an application bug somewhere. BTW: Calling getObject for every object found is a really bad idea and will kill performance. [snip]
I was wrong. I looked into Catalog.py more closely, and it was not the getIndex() call that was taking too long but each index itself. For example, 'Title' is a TextIndex and 'creation_date' is a DateIndex. Why would a TextIndex take that long to index an article with the simple title 'test'? The DateIndex is also painfully slow.
I can't tell you without seeing it myself ;^). If you can demonstrate this behavior in a relatively simple test case, I'd be interested in helping to fix it, if not for TextIndex then at least for DateIndex.
I would definitely Try ZCTextIndex, just because its searching works so much better.
Will try :-)
One general suggestion: What is your ZODB cache set to?
I'm running these tests both on my desktop Linux box and on a set of enterprise servers.
Desktop: Pentium 4, RH Linux 8.0 with all the latest errata applied, 512M RAM, cache set to 8,000, FileStorage
Enterprise servers:
- ZEO Storage Server: dual Xeon P4, RH Linux 8.0 with all the latest errata applied, 4G RAM, 430G SCSI HDDs with RAID 5, DirectoryStorage on ReiserFS with noatime on
- ZEO Client: dual P3 Tualatin, RH Linux 8.0 with all the latest errata applied, 2G RAM with the ZODB cache set to 20,000
Both my desktop and the ZEO client show the same symptoms. The ZEO servers render CMF/Plone + NeoBoard pages in an average of 0.3~0.5 seconds, so I don't think there are any hardware/cache problems.
Any help, hints or comments would be much appreciated. I do need to move on with this project :-( It's been almost a year now...ouch. Weeks became months; months became a whole year... whew.
Yup, been there ;^)
Been there too many times with other tools. Just hoped this time would be different with Zope :-)
Thanks for your help.
--------------------------------------------------------------- Wankyu Choi CEO/President NeoQuest Communications, Inc. http://www.zoper.net http://www.neoboard.net ---------------------------------------------------------------
Thanks a million. I'll give it a try first thing in the morning ( oops, it's already 5AM in Korea; gotta go to bed ;-).

---------------------------------------------------------------
Wankyu Choi
CEO/President
NeoQuest Communications, Inc.
http://www.zoper.net
http://www.neoboard.net
---------------------------------------------------------------

-----Original Message-----
From: zope-admin@zope.org [mailto:zope-admin@zope.org] On Behalf Of Casey Duncan
Sent: Tuesday, April 01, 2003 4:37 AM
To: Wankyu Choi; zope@zope.org
Subject: Re: [Zope] ZCatalog performance issues - catalogging objects takes ages

On Monday 31 March 2003 12:02 pm, Wankyu Choi wrote:
So glad to catch your attention :-)
Subclassing ZCatalog can be a maintenance headache. I did it for DocumentLibrary and regretted it.
Can you expound on this? In fact, I did just the opposite at first: I tried "self._np_catalog = NeoPortalCatalog()", where NeoPortalCatalog is a subclass of ZCatalog. Then I thought better ( or worse, should I say :-) of it and inherited directly from NeoPortalCatalog, thinking it would be easier in terms of maintenance.
What I found was that forward compatibility was a problem. I just so happened to land right before ZCatalog was majorly refactored for Zope 2.4. It really depends on how much internal ZCatalog machinery/data structures you depend on... See more comments inline:
Performance? Not so fast as SQL-backed PHP version ( displaying a pageful of threads takes only a fraction of a second ), but not bad.
Is this Zope 2.6.1? What do the queries look like?
Zope 2.6.1 with Python 2.2. I've heard Python 2.2 works just fine with Zope 2.6.1, and I've had no obvious problems yet.
Here's the code block that returns the query results and sorts a portion of them if necessary ( a non-threaded look doesn't require this sort ):
    security.declarePrivate( '_getArticles' )
    def _getArticles( self
                    , top=None
                    , expand_all=0
                    , skey=None
                    , reverse=0
                    , default_reverse=0
                    , search_fields=''
                    , search_keywords=''
                    , b_start=0
                    , limit=None
                    ):
        """ Return a complete list of brains objects from the Catalog """
        if search_fields and search_keywords:
            index_map = { 'title': 'Title'
                        , 'creator': 'Creator'
                        , 'creator_email': 'CreatorEmail'
                        , 'body': 'getNeoPortalContentSearchText'
                        , 'category': 'getArticleCategory'
                        , 'comments': 'getInlineCommentsSearchText'
                        , 'comment_creators': 'getInlineCommentCreators'
                        , 'attachments': 'getAttachmentsSearchText'
                        }
            new_indexes = []
            for index in search_fields.keys():
                new_indexes.append( index_map[index] )
            results = self.search( indexes=new_indexes
                                 , keywords=search_keywords.split( ' ' ) )
        else:
            # threads are automatically reverse-sorted on sort_keys
            reverse = int( reverse )
            result_limit = None
            b_start = int( b_start )
            if limit is not None:
                limit = int( limit )
                result_limit = b_start + limit
            if skey:
                if skey == 'np_read_count':
                    skey = 'getNeoPortalReadCount'
                elif skey == 'np_num_ratings':
                    skey = 'getNeoPortalNumContentRatings'
            else:
                skey = 'creation_date'
            sort_order = ''
            if reverse:
                sort_order = 'reverse'
            results = self.searchResults( meta_type=NeoBoardArticle.meta_type
                                        , isTempNeoBoardArticle=0
                                        , sort_on=skey
                                        , sort_order=sort_order
                                        , limit=result_limit )
If you are trying to use the new sort limits, use: sort_limit = result_limit [snip]
    security.declarePublic( 'sortArticleThreads' )
    def sortArticleThreads( self, brains ):
        """ Sort a list of brains """
        import operator
        temp_list = map( lambda x: ( getattr( x.getObject(), '_sort_key' )
                                   , getattr( x.getObject(), '_thread_sort_key' )
                                   , x )
                       , brains )
        temp_list.sort()
        brains[:] = map( operator.getitem, temp_list, ( -1, ) * len( temp_list ) )
        return brains
This sorting code is not going to scale well at all.
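For what it's worth, even keeping the sort at the application level, waking each object only once would roughly halve this method's cost, since the lambda above calls getObject() twice per brain. A minimal decorate-sort-undecorate sketch in modern Python, with hypothetical stand-ins for brains and articles (FakeBrain/FakeArticle are illustrative, not ZCatalog classes):

```python
class FakeArticle:
    """Stand-in for a cataloged article (hypothetical)."""
    def __init__(self, sort_key, thread_sort_key):
        self._sort_key = sort_key
        self._thread_sort_key = thread_sort_key

class FakeBrain:
    """Stand-in for a catalog brain; getObject() wakes the real object."""
    def __init__(self, obj):
        self._obj = obj
        self.wakeups = 0
    def getObject(self):
        self.wakeups += 1
        return self._obj

def sort_article_threads(brains):
    # Decorate-sort-undecorate: wake each object exactly once,
    # instead of twice per brain as in the lambda above.
    decorated = []
    for brain in brains:
        obj = brain.getObject()
        decorated.append((obj._sort_key, obj._thread_sort_key, brain))
    # Note: ties on both keys would need an extra tie-breaker column.
    decorated.sort()
    brains[:] = [item[-1] for item in decorated]
    return brains
```

This only reduces the constant factor, though; every matching object still gets pulled into memory, which is the scaling problem Casey points out.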
- isTempNeoBoardArticle: NeoBoard does what CMF does. When a user posts an article, it first creates a temp article and examines it; if unacceptable, it deletes it. This method tells whether the article is a temporary one or not.
- isThreadParent: tells whether the article is the top-most one in the thread.
- expand_all: Boolean value telling whether we need a threaded look.
- getSortKey: returns the thread sorting key. It's the inverted article number: article 50's sort key becomes -50 when added to the board, for automatic reverse sorting. ( Tried creation_date once, but it turned out to be a disaster when you do import/export. )
To make creation date work, you'd need to make it an application modified attribute.
- result_limit: calculated on the basis of the current batch.
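The inverted-article-number trick above works because an ascending sort on -n is a descending sort on n, and unlike creation_date the number survives import/export unchanged. A tiny illustration (getSortKey's real implementation may differ):

```python
def make_sort_key(article_number):
    # Article 50 gets sort key -50, so a plain ascending sort on
    # the key lists newer (higher-numbered) articles first.
    return -article_number

article_numbers = [3, 50, 17]
newest_first = sorted(article_numbers, key=make_sort_key)  # [50, 17, 3]
```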
If you want to take a look at it in context, I've got a viewCVS set up here: http://cvs.zoper.net:3333/cgi-bin/viewcvs.cgi/NeoBoard/NeoBoard.py
And you can see the board in action here: http://www.zoper.net/Boards/qa/view
I learned while reading docs on ZCatalog that I'd get better results by adding meta data to brain objects. Will remove that expensive sorting method soon.
No actually, metadata won't help sorting much. If you want "out of band" sorting, Catalog (as of 2.6.1) has a method called sortResults, whose signature looks like this:

    sortResults(rs, sort_index, reverse=0, limit=None, merge=1)

where:

- rs is the bare recordset (which can be had by calling searchResults(..., _merge=0))
- sort_index is the index to sort by (the object, not the name)
- reverse is the direction (sort_order)
- limit is the sort limit
- merge determines what is returned (1=brains, 0=a sorted list of rids); you probably want 1

So you could do:

    catalog = self._catalog
    rs = catalog.searchResults(..., _merge=0)
    # ...do some stuff with rs...
    return catalog.sortResults(rs, self.getIndex(sort_key), ...)
another set of problems while so doing. I could display 5,000 threads ( about 20,000 article objects including all replies to the threads ) in less than a second ( it takes a bit more when you load the board for the first time ). The problems are...
I would be interested in using this data as a benchmark for improvements in 2.7...
Took me a whole day to generate these articles; had fun with them for about a week and lost them last night when the board's catalog went crazy with missing keys; I had to remove the board and the data went with it :-(
On a different note: creating an article object doesn't require that much computational power, just a bunch of init values for its properties. But instantiating articles in a for loop, for example, takes more than a second apiece, and it gets worse as the loop goes on. Is it because of ZODB's transaction/version support? Normally, how long should it take to generate 20,000 not-so-heavy objects? Taking more than an hour doesn't seem right with enough horsepower. While creating that test data, I had to take a long nap :-(
You should probably commit a subtransaction every so often so as not to use too much memory. Sounds like it was trying to commit a really big transaction. If these objects are all nested and you create a big hierarchy, that might explain it a bit. [snip]
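Casey's suggestion boils down to flushing a subtransaction every N creations so uncommitted state never piles up (in Zope 2.6 that would be get_transaction().commit(1)). A hypothetical sketch of the batching pattern, with the transaction machinery stubbed out as a plain callable (these names are illustrative, not NeoBoard's real API):

```python
def create_articles(container, titles, commit, batch_size=100):
    """Create many objects, committing a subtransaction every
    batch_size objects so pending state doesn't pile up in memory.

    `commit` stands in for Zope 2.6's get_transaction().commit(1).
    """
    created = 0
    for title in titles:
        container.append(title)   # stand-in for real article creation
        created += 1
        if created % batch_size == 0:
            commit()              # flush the full batch to storage
    if created % batch_size != 0:
        commit()                  # flush the final partial batch
    return created
```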
that causes these key errors. Do these key errors happen at search time?
No, I meant a key error in 'mybrain.getObject()', that is, a ghost entry in the Catalog without the corresponding object. I guess it happens after a massive set of additions or deletions; I can't pinpoint a case. Fast reloads sometimes do generate ZODB conflict errors. If you reload while reindexing everything with heavy disk I/O, you usually get these ZODB conflicts. Maybe I should do some work on conflict resolution?
That means an object was deleted without being unindexed. Sounds like an application bug somewhere. BTW: Calling getObject for every object found is a really bad idea and will kill performance. [snip]
I was wrong. I looked into Catalog.py more closely, and it was not the getIndex() call that was taking too long, but each index itself. For example, 'Title' is a TextIndex and 'creation_date' is a DateIndex. Why would a TextIndex take that long to index an article with the simple title 'test'? The DateIndex is also painfully slow.
I can't tell you without seeing it myself ;^). If you can demonstrate this behavior in a relatively simple test case, I'd be interested in helping to fix it, if not for TextIndex then at least for DateIndex.
I would definitely try ZCTextIndex, just because its searching works so much better.
Will try :-)
One general suggestion: What is your ZODB cache set to?
I'm running these tests both on my desktop Linux box and on a set of enterprise servers.
Desktop: Pentium 4, RH Linux 8.0 with all the latest errata applied, 512M RAM, cache set to 8,000, FileStorage
Enterprise servers:
- ZEO Storage Server: dual Xeon P4, RH Linux 8.0 with all the latest errata applied, 4G RAM, 430G SCSI HDDs with RAID 5, DirectoryStorage on ReiserFS with noatime on
- ZEO Client: dual P3 Tualatin, RH Linux 8.0 with all the latest errata applied, 2G RAM with ZODB cache set to 20,000.
Both my desktop and the ZEO client show the same symptoms. The ZEO servers render CMF/Plone + NeoBoard pages in 0.3 to 0.5 seconds on average, so I don't think there are any hardware or cache problems.
Any help, hints or comments would be much appreciated. I do need to move on with this project :-( It's been almost a year now...ouch. Weeks became months; months became a whole year... whew.
Yup, been there ;^)
Been there too many times with other tools. Just hoped this time would be different with Zope :-)
Thanks for your help.
---------------------------------------------------------------
Wankyu Choi
CEO/President
NeoQuest Communications, Inc.
http://www.zoper.net
http://www.neoboard.net
---------------------------------------------------------------
_______________________________________________ Zope maillist - Zope@zope.org http://mail.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope-dev )
Wankyu Choi wrote at 2003-3-31 19:47 +0900:
- It takes ages to catalog even a small number of articles: 18 seconds for cataloging 50 or so article objects with so little to index. Is that normal? I can't imagine recataloging 20,000 objects. For example, if you move a thread from one NeoBoard instance to another, you have to uncatalog the thread, including all its replies, in NeoBoard A and catalog them in NeoBoard B; cataloging a single article object takes more than 1 second. I don't think that's normal... or is it?
The connection between objects and cataloguing is very generic and flexible, but unfortunately also costly. Each index and each metadata entry causes one method to be called per document. If the call is expensive, this can slow things down considerably. Use only the metadata entries you really need. We have about 5 metadata entries and about 7 indexes and can index about 10 documents per second (on a 1.4 GHz AMD). Batching indexing operations can help a lot to improve efficiency. We use Shane's QueueCatalog for this purpose.

Dieter
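Dieter's batching suggestion (Shane Hathaway's QueueCatalog) works by accepting catalog requests cheaply at request time and replaying them against the real catalog later in one pass, which also collapses repeated reindexes of the same object. A rough sketch of the idea only; the names here are hypothetical, not the real Products.QueueCatalog API:

```python
class QueueingCatalog:
    """Defers indexing so each request only pays for an append;
    the expensive per-index work happens in batched process() calls."""

    def __init__(self, index_func):
        self._index = index_func   # the real (slow) indexing operation
        self._queue = []

    def catalog_object(self, path):
        self._queue.append(path)   # cheap: just remember what changed

    def process(self):
        # Replay the queue in one pass; duplicate entries collapse,
        # so an object touched five times is only reindexed once.
        pending, self._queue = self._queue, []
        processed = 0
        for path in sorted(set(pending)):
            self._index(path)
            processed += 1
        return processed
```

The trade-off is that search results lag behind reality until the queue is processed, which is usually acceptable for a message board.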
participants (3)
- Casey Duncan
- Dieter Maurer
- Wankyu Choi