[Zope] ZCTextIndex - prefix wildcards not supported?
Andreas Jung
lists at zopyx.com
Thu Jun 24 01:44:23 EDT 2004
TextIndexNG2 supports "*term" queries.
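As a rough illustration (the catalog name and index name are assumptions, and the index would have to be a TextIndexNG2 instance), such a query might look like:

def find_by_suffix(self, term):
    # Hypothetical sketch: a leading wildcard matches words *ending* in
    # 'term'. ZCTextIndex rejects this form; TextIndexNG2 accepts it.
    return self.Catalog({'all_searchable_text': '*' + term})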
-aj
--On Monday, June 21, 2004 9:49 AM -0400 Small Business Services
<toolkit at magma.ca> wrote:
> Hi Casey,
>
> I am trying to implement your suggestion of accessing the '_docwords'
> structure in an attempt to eliminate duplicate storage of data in the
> ZCatalog.
>
> I have created a test external method to retrieve the _docwords entry for
> a specific object in an existing ZCatalog:
>
> def jtmp(self):
>     res = self.Catalog({'id': '1086793690.85'})
>     for item in res:
>         rid = item.data_record_id_
>     return self.Catalog.getIndex('all_searchable_text').getEntryForObject(rid)
>
>
> Executing this external method gives me a zope error:
>
> Traceback (innermost last):
> Module ZPublisher.Publish, line 98, in publish
> Module ZPublisher.mapply, line 88, in mapply
> Module ZPublisher.Publish, line 39, in call_object
> Module Products.ExternalMethod.ExternalMethod, line 224, in __call__
> - __traceback_info__: ((<Folder instance at a063d58>,), {}, None)
> Module /apps/zope/Extensions/jtmp.py, line 13, in jtmp
> AttributeError: getIndex
>
> I am confused (being a relative python newbie) because 'getIndex' and
> 'getEntryForObject' are functions defined within the Catalog class, so
> shouldn't they be available?!
>
> Is there a better way to go about this?
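A hedged, untested sketch of one alternative: in the Zope versions discussed here, getIndex() is defined on the low-level Catalog object, which the ZCatalog wrapper keeps in its _catalog attribute, so going through _catalog from trusted (external method) code may avoid the AttributeError. The names below follow the example above:

def jtmp2(self):
    # Hypothetical variant of the external method above (untested).
    res = self.Catalog({'id': '1086793690.85'})
    if not res:
        return None
    rid = res[0].data_record_id_
    # The ZCatalog wrapper stores its internal Catalog in _catalog;
    # getIndex() is defined there, getEntryForObject() on the index itself.
    index = self.Catalog._catalog.getIndex('all_searchable_text')
    return index.getEntryForObject(rid)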
>
> Thanks,
>
> Jonathan
>
>
> ----- Original Message -----
> From: "Casey Duncan" <casey at zope.com>
> To: "Small Business Services" <toolkit at magma.ca>
> Sent: November 21, 2003 4:28 PM
> Subject: Re: [Zope] ZCTextIndex - prefix wildcards not supported?
>
>
>> On Fri, 21 Nov 2003 14:08:08 -0500
>> "Small Business Services" <toolkit at magma.ca> wrote:
>>
>> > The Zope Cache size is set at 10,000
>> >
>> > There are 1,985,183 objects in the 'database'
>>
>> Hmm, that's less than I would have thought.
>>
>> > Specifications for our update linux box:
>> >
>> > Zope 2.6.1
>> > 1 ghz PIII
>> > 1.25 Gb RAM (pc133)
>> > 3 disks (IBM ultrastar, scsi, ultra2mode - 10,000 rpm, 4.5ms access)
>> >
>> > We are running the disks striped on a single controller, which gives us
>> > amazing read/write capacity. We rarely run at full capacity on the disks.
>> > We set the cache at the highest point possible (any higher and the
>> > machine swaps itself to death).
>>
>> I think you could definitely use more RAM. But that is a given pretty
>> much. How big is the Data.fs file when you're through indexing? How does
>> that compare to the size of the document corpus itself?
>>
>> Also I think you may want to try Zope 2.6.2. I made some changes to
>> ZCTextIndex in that version that could help performance. I would be
>> interested to hear if they help.
>>
>> [snip]
>> > We eventually came up with our current solution: at index time we
>> > compress the full-text and store it as binary data in the metadata table
>> > (getting this to work was a challenge in itself). We then decompress and
>> > scan this data to locate the relevant 2-3 lines at retrieval time (it is
>> > far faster to decompress & scan metadata than to access the objects
>> > directly).
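As a rough sketch of that approach (not the poster's actual code; zlib is assumed as the compressor and the helper names are hypothetical), it might look like:

import zlib

def compress_fulltext(text):
    # At index time: compress the full text for storage as binary metadata.
    return zlib.compress(text, 9)

def context_lines(blob, term, n=3):
    # At retrieval time: decompress the stored blob and return up to n
    # lines containing the search term.
    text = zlib.decompress(blob)
    matches = [line for line in text.splitlines() if term in line]
    return matches[:n]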
>>
>> Using metadata tends to wake up far fewer objects, which can be a win.
>> Interestingly, ZCTextIndex actually stores a similar compressed word list
>> internally. The actual index object stored in ZCTextIndex has a _docwords
>> BTree which stores a compressed word list for each document. This is used
>> for unindexing and phrase matching. Look at the search_phrase method in
>> BaseIndex.py for more info.
>>
>> If you could use _docwords, you might be able to get rid of that redundant
>> data structure and the time it takes to build and store it. Retrieval time
>> should be on par with metadata.
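A hedged sketch of what reading that stored word list back out might look like, assuming the method names of the ZCTextIndex sources of that era (get_words() on the internal index decodes the _docwords entry, and the lexicon maps word ids back to words); treat it as an untested illustration:

def words_for_record(catalog, index_name, rid):
    # Hypothetical helper: recover the compressed word list stored in
    # _docwords for one catalog record id and map it back to words.
    zctextindex = catalog._catalog.getIndex(index_name)
    wids = zctextindex.index.get_words(rid)   # decoded _docwords entry
    lexicon = zctextindex.getLexicon()
    return [lexicon.get_word(wid) for wid in wids]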
>>
>> > Retrieval speeds for end users are excellent. We have only been running
>> > into difficulties lately because of the size of the database. The update
>> > process now runs 24 hours per day for about 30 days (automating an
>> > update process that runs for 30 days was another exciting challenge!).
>> > The fact that Zope can handle this volume of processing is a testament
>> > to its reliability and robustness!
>>
>> I'm concerned that it takes that long to index. 30 days is like a
>> millennium of processor time. I'm curious how big your transactions are
>> during index processing.
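For what it's worth, one common way to keep transactions bounded during a long indexing run in Zope 2.6-era code is to (sub)commit every few hundred objects; a rough sketch, with the batch size and the way objects are obtained left as assumptions:

def index_in_batches(catalog, objects, batch_size=500):
    # Hypothetical bulk-indexing loop with periodic subcommits so a single
    # huge transaction does not exhaust RAM or bloat the Data.fs.
    n = 0
    for obj in objects:
        catalog.catalog_object(obj, '/'.join(obj.getPhysicalPath()))
        n = n + 1
        if n % batch_size == 0:
            # Zope 2.6 / ZODB 3: get_transaction() is the transaction
            # accessor; commit(1) performs a subtransaction commit.
            get_transaction().commit(1)
    get_transaction().commit()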
>>
>> I'm glad to see the retrieval speeds are good. What, roughly, is the
>> average document size?
>>
>> > We have been working with Zope for about 3 years and think that it is a
>> > FANTASTIC product! We keep coming up with new things to use it for;
>> > it's great!
>> >
>> > Thanks in advance for any ideas you may have - we are open to any and
>> > all suggestions!
>>
>> Sounds like you have a very interesting application. I'd be very
>> interested to hear more about it, and possibly to help make it faster if
>> I can.
>>
>> -Casey
>
>
>
> _______________________________________________
> Zope maillist - Zope at zope.org
> http://mail.zope.org/mailman-20/listinfo/zope
> ** No cross posts or HTML encoding! **
> (Related lists -
> http://mail.zope.org/mailman-20/listinfo/zope-announce
> http://mail.zope.org/mailman-20/listinfo/zope-dev )
Andreas Jung
zopyx.com - Software Development and Consulting