Hi Casey,

I am trying to implement your suggestion of accessing the '_docwords' structure in an attempt to eliminate duplicate storage of data in the ZCatalog. I have created a test external method to retrieve the _docwords entry for a specific object in an existing ZCatalog:

    def jtmp(self):
        res = self.Catalog({'id': '1086793690.85'})
        for item in res:
            rid = item.data_record_id_
        return self.Catalog.getIndex('all_searchable_text').getEntryForObject(rid)

Executing this external method gives me a Zope error:

    Traceback (innermost last):
      Module ZPublisher.Publish, line 98, in publish
      Module ZPublisher.mapply, line 88, in mapply
      Module ZPublisher.Publish, line 39, in call_object
      Module Products.ExternalMethod.ExternalMethod, line 224, in __call__
       - __traceback_info__: ((<Folder instance at a063d58>,), {}, None)
      Module /apps/zope/Extensions/jtmp.py, line 13, in jtmp
    AttributeError: getIndex

I am confused (being a relative Python newbie) because 'getIndex' and 'getEntryForObject' are functions defined within the Catalog class, so shouldn't they be available? Is there a better way to go about this?

Thanks,
Jonathan

----- Original Message -----
From: "Casey Duncan" <casey@zope.com>
To: "Small Business Services" <toolkit@magma.ca>
Sent: November 21, 2003 4:28 PM
Subject: Re: [Zope] ZCTextIndex - prefix wildcards not supported?
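[Editorial aside omitted; a hedged guess at the AttributeError above: `getIndex` is defined on the internal `Catalog` class, but `self.Catalog` resolves to the wrapping `ZCatalog` object, which keeps the real catalog in its `_catalog` attribute and does not necessarily forward every method. The class names below are stand-ins, not the real Zope classes, just to illustrate why the attribute lookup fails on the wrapper but succeeds one level down:]

```python
class Catalog:
    """Stand-in for the internal catalog class that defines getIndex."""
    def __init__(self, indexes):
        self.indexes = indexes

    def getIndex(self, name):
        return self.indexes[name]


class ZCatalog:
    """Stand-in for the ZCatalog wrapper: it holds the real catalog
    in _catalog and does not expose getIndex itself."""
    def __init__(self, indexes):
        self._catalog = Catalog(indexes)


zc = ZCatalog({'all_searchable_text': 'the-index-object'})

# zc.getIndex('all_searchable_text')  # would raise AttributeError: getIndex
index = zc._catalog.getIndex('all_searchable_text')  # go through _catalog
```

[If that guess is right, `self.Catalog._catalog.getIndex('all_searchable_text')` is the shape of the call to try in the external method.]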
On Fri, 21 Nov 2003 14:08:08 -0500 "Small Business Services" <toolkit@magma.ca> wrote:
The Zope Cache size is set at 10,000
There are 1,985,183 objects in the 'database'
Hmm, that's less than I would have thought.
Specifications for our update linux box:
Zope 2.6.1 1 ghz PIII 1.25 Gb RAM (pc133) 3 disks (IBM ultrastar, scsi, ultra2mode - 10,000 rpm, 4.5ms access)
We are running the disks striped on a single controller, which gives us amazing read/write capacity. We rarely run at full capacity on the disks. We set the cache at the highest point possible (any higher and the machine swaps itself to death).
I think you could definitely use more RAM. But that is a given pretty much. How big is the Data.fs file when you're through indexing? How does that compare to the size of the document corpus itself?
Also I think you may want to try Zope 2.6.2. I made some changes to ZCTextIndex in that version that could help performance. I would be interested to hear if they help.
[snip]
We eventually came up with our current solution: at index time we compress the full text and store it as binary data in the metadata table (getting this to work was a challenge in itself). We then decompress and scan this data to locate the relevant 2-3 lines at retrieval time (it is far faster to decompress and scan metadata than to access the objects directly).
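[The compress-and-scan scheme described above can be sketched roughly as follows. This is a minimal illustration using zlib, not the actual implementation; the real code would store the compressed bytes in a ZCatalog metadata column and the relevance/context rules would be application-specific:]

```python
import zlib


def compress_text(text):
    """Compress the full text for storage alongside the catalog metadata."""
    return zlib.compress(text.encode('utf-8'))


def matching_lines(blob, term, context=1):
    """Decompress the stored blob and return the lines containing the
    search term, plus `context` lines on either side."""
    lines = zlib.decompress(blob).decode('utf-8').splitlines()
    hits = []
    for i, line in enumerate(lines):
        if term in line:
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            hits.extend(lines[lo:hi])
    return hits


blob = compress_text("first line\nthe search term is here\nlast line")
print(matching_lines(blob, "search term"))
# → ['first line', 'the search term is here', 'last line']
```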
Using metadata tends to wake up far fewer objects, which can be a win. Interestingly, ZCTextIndex actually stores a similar compressed word list internally. The actual index object stored in ZCTextIndex has a _docwords BTree which stores a compressed wordlist for each document. This is used for unindexing and phrase matching. Look at the search_phrase method in BaseIndex.py for more info.
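[A rough sketch of the idea behind _docwords and phrase matching, with plain dicts and lists standing in for the lexicon, the BTree, and the compressed word-id encoding ZCTextIndex actually uses. The names and data here are illustrative, not the real API:]

```python
# Stand-in for the index's lexicon: word -> integer word id.
lexicon = {'quick': 1, 'brown': 2, 'fox': 3, 'jumps': 4}

# Stand-in for _docwords: document id -> ordered list of word ids.
docwords = {
    42: [1, 2, 3, 4],  # "quick brown fox jumps"
    43: [2, 3, 1, 4],  # "brown fox quick jumps"
}


def contains_phrase(docid, phrase):
    """Check whether the document's word-id sequence contains the
    phrase's word ids contiguously and in order."""
    wids = [lexicon[w] for w in phrase.split()]
    doc = docwords[docid]
    return any(doc[i:i + len(wids)] == wids
               for i in range(len(doc) - len(wids) + 1))


print(contains_phrase(42, 'quick brown'))  # → True
print(contains_phrase(43, 'quick brown'))  # → False (ids 1, 2 not adjacent)
```

[The same per-document wordlist that makes phrase tests possible is what would let you reconstruct a document's word sequence without storing it a second time.]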
If you could use _docwords, you might be able to get rid of that redundant data structure and the time it takes to build and store it. Retrieval time should be on par with metadata.
Retrieval speeds for end users are excellent. We have only been running into difficulties lately because of the size of the database. The update process now runs 24 hours per day for about 30 days (automating an update process that runs for 30 days was another exciting challenge!). The fact that Zope can handle this volume of processing is a testament to its reliability and robustness!
I'm concerned that it takes that long to index. 30 days is like a millennium of processor time. I'm curious how big your transactions are during index processing.
I'm glad to see the retrieval speeds are good. Roughly what is the average document size?
We have been working with Zope for about 3 years and think that it is a FANTASTIC product! We keep coming up with new things to use it for; it's great!
Thanks in advance for any ideas you may have - we are open to any and all suggestions!
Sounds like you have a very interesting application. I'd be very interested to hear more about it and to help make it faster if I can.
-Casey