[Zope] ZCatalog Performance problem

Thu Sep 16 11:44:42 EDT 2004

I am trying to install a new version of our database and have encountered
some performance problems. We are using:

Zope 2.6.1
Python 2.1
Linux2
ZCatalog 2.2.0
ZCTextIndex (with Lexicon)

The 'old' Data.fs file contains 1,830,854 ('objects in the database').
Retrieval times, based on CallProfiler, are typically < 2 secs. The ZCatalog
contains 671,873 entries, and the actual documents are stored in two
standard Folders. The Lexicon contains 605,984 words.

The 'new' version uses DBTab to mount a second .fs file which contains the
ZCatalog and corresponding documents (which are now stored in two
BTreeFolder2 type folders). The .fs file contains 7,670,410 ('objects in the
database'). Retrieval times, based on CallProfiler, are are in the range 1
to 14 secs. The ZCatalog contains 782,216entries. The Lexicon contains
1,741,003 words.

In both versions, the Lexicon 'Input Pipeline Stages' are the same
(Splitter, CaseNormalizer, StopWordAndSingleCharRemover).

In the 'old' version, the documents stored in the ZCatalog (zclass
instances) are 'catalog-aware'.  In the 'new' version they are not. Also, in
the 'old' version the complete full text of the document was compressed and
stored in the metadata table (this was eliminated in the 'new' version to
reduce the size of the metadata table).

Because of the very slow retrieval times, I installed ZopeProfiler and here
is an extract of 'typical' results:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

        2    0.000    0.000   11.462    5.731 Catalog.py:717(searchResults)
        2    2.069    1.035   11.461    5.730 Catalog.py:446(search)
        1    0.216    0.216    9.320     9.320
OkapiIndex.py:120(_search_wids)
        2    0.000    0.000    9.391     4.696
ZCTextIndex.py:167(_apply_index)
        1    0.000    0.000    9.390     9.390
ParseTree.py:117(executeQuery)
        1    0.000    0.000    9.390     9.390 BaseIndex.py:169(search)
        1    0.216    0.216    9.320     9.320
OkapiIndex.py:120(_search_wids)
     2085    0.239    0.000    9.174    0.004 Connection.py:499(setstate)
     2085    0.045    0.000    8.761    0.004 FileStorage.py:647(load)
     2085    8.510    0.004    8.716    0.004 FileStorage.py:620(_load)
     2085    0.151    0.000    0.181    0.000 fsIndex.py:61(__getitem__)

This profile was generated using a very simple script to access the
zcatalog:

<dtml-let res="Catalog.searchResults({'all_searchable_text' :
all_searchable_text})">
   <b>Done Search</b><br>
</dtml-let>

It appears that the delay is due to the time required to load the
zctextindex into memory. This is supported by the fact that subsequent
searches for the same search term take less than 1 second and do not
generate any FileStorage '_load's, and running iostat during the search
shows LOTS of disk activity (cpu usage is minimal).

The cache 'target' is set at 100,000 and currently has about 59,000 objects
(61,000 ghost objects).

So, finally, the questions:

1) What would cause the Lexicon to jump from 605k words to 1741k words (a
187% increase), even though the zcatalog entries only grew from 671k to 782k
(a 16.5% increase)?

2) What would cause the .fs file to jump from 1.8m entries  to 7.6m entries
(a 322% increase), even though the zcatalog entries only grew from 671k to
782k (a 16.5% increase)?

3) Is there any way to preload the entire zctextindex?

Any and all ideas, comments and suggestions will be greatfully received!

Jonathan