I am trying to install a new version of our database and have encountered some performance problems. We are using: Zope 2.6.1 Python 2.1 Linux2 ZCatalog 2.2.0 ZCTextIndex (with Lexicon) The 'old' Data.fs file contains 1,830,854 ('objects in the database'). Retrieval times, based on CallProfiler, are typically < 2 secs. The ZCatalog contains 671,873 entries, and the actual documents are stored in two standard Folders. The Lexicon contains 605,984 words. The 'new' version uses DBTab to mount a second .fs file which contains the ZCatalog and corresponding documents (which are now stored in two BTreeFolder2 type folders). The .fs file contains 7,670,410 ('objects in the database'). Retrieval times, based on CallProfiler, are are in the range 1 to 14 secs. The ZCatalog contains 782,216entries. The Lexicon contains 1,741,003 words. In both versions, the Lexicon 'Input Pipeline Stages' are the same (Splitter, CaseNormalizer, StopWordAndSingleCharRemover). In the 'old' version, the documents stored in the ZCatalog (zclass instances) are 'catalog-aware'. In the 'new' version they are not. Also, in the 'old' version the complete full text of the document was compressed and stored in the metadata table (this was eliminated in the 'new' version to reduce the size of the metadata table). Because of the very slow retrieval times, I installed ZopeProfiler and here is an extract of 'typical' results: ncalls tottime percall cumtime percall filename:lineno(function) 2 0.000 0.000 11.462 5.731 Catalog.py:717(searchResults) 2 2.069 1.035 11.461 5.730 Catalog.py:446(search) 1 0.216 0.216 9.320 9.320 OkapiIndex.py:120(_search_wids) 2 0.000 0.000 9.391 4.696 ZCTextIndex.py:167(_apply_index) 1 0.000 0.000 9.390 9.390 ParseTree.py:117(executeQuery) 1 0.000 0.000 9.390 9.390 BaseIndex.py:169(search) 1 0.216 0.216 9.320 9.320 OkapiIndex.py:120(_search_wids) 2085 0.239 0.000 9.174 0.004 Connection.py:499(setstate) 2085 0.045 0.000 8.761 0.004 FileStorage.py:647(load) 2085 8.510 0.004 8.716 0.004 FileStorage.py:620(_load) 2085 0.151 0.000 0.181 0.000 fsIndex.py:61(__getitem__) This profile was generated using a very simple script to access the zcatalog: <dtml-let res="Catalog.searchResults({'all_searchable_text' : all_searchable_text})"> <b>Done Search</b><br> </dtml-let> It appears that the delay is due to the time required to load the zctextindex into memory. This is supported by the fact that subsequent searches for the same search term take less than 1 second and do not generate any FileStorage '_load's, and running iostat during the search shows LOTS of disk activity (cpu usage is minimal). The cache 'target' is set at 100,000 and currently has about 59,000 objects (61,000 ghost objects). So, finally, the questions: 1) What would cause the Lexicon to jump from 605k words to 1741k words (a 187% increase), even though the zcatalog entries only grew from 671k to 782k (a 16.5% increase)? 2) What would cause the .fs file to jump from 1.8m entries to 7.6m entries (a 322% increase), even though the zcatalog entries only grew from 671k to 782k (a 16.5% increase)? 3) Is there any way to preload the entire zctextindex? Any and all ideas, comments and suggestions will be greatfully received! Jonathan
Jonathan Hobbs wrote at 2004-9-16 11:44 -0400:
I am trying to install a new version of our database and have encountered some performance problems. We are using: ... 2 0.000 0.000 11.462 5.731 Catalog.py:717(searchResults) 2 2.069 1.035 11.461 5.730 Catalog.py:446(search) 1 0.216 0.216 9.320 9.320 OkapiIndex.py:120(_search_wids) ...
As you can see, most of the time is spend in "_search_wids".
... 1 0.216 0.216 9.320 9.320 OkapiIndex.py:120(_search_wids) 2085 0.239 0.000 9.174 0.004 Connection.py:499(setstate)
And it spends most of the time in "Connection...setstate". This indicates poor persistency design. To find out what goes wrong, you would need an instrumentation of the load process. I recently posted my instrumentation of "ZEO.ServerStub". Maybe, you do something similar for FileStorage? An alternative could be to look at the code of "OkapiIndex..._search_wids". Apparently, it loads to many persistent objects. -- Dieter
participants (2)
-
Dieter Maurer -
Jonathan Hobbs