Hi Casey,

I am trying to implement your suggestion of accessing the '_docwords' structure in an attempt to eliminate duplicate storage of data in the ZCatalog. I have created a test external method to retrieve the _docwords entry for a specific object in an existing ZCatalog:

    def jtmp(self):
        res = self.Catalog({'id': '1086793690.85'})
        for item in res:
            rid = item.data_record_id_
        return self.Catalog.getIndex('all_searchable_text').getEntryForObject(rid)

Executing this external method gives me a Zope error:

    Traceback (innermost last):
      Module ZPublisher.Publish, line 98, in publish
      Module ZPublisher.mapply, line 88, in mapply
      Module ZPublisher.Publish, line 39, in call_object
      Module Products.ExternalMethod.ExternalMethod, line 224, in __call__
       - __traceback_info__: ((<Folder instance at a063d58>,), {}, None)
      Module /apps/zope/Extensions/jtmp.py, line 13, in jtmp
    AttributeError: getIndex

I am confused (being a relative Python newbie) because 'getIndex' and 'getEntryForObject' are functions defined within the Catalog class, so shouldn't they be available? Is there a better way to go about this?

Thanks,
Jonathan

----- Original Message -----
From: "Casey Duncan" <casey@zope.com>
To: "Small Business Services" <toolkit@magma.ca>
Sent: November 21, 2003 4:28 PM
Subject: Re: [Zope] ZCTextIndex - prefix wildcards not supported?
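[Editorial aside omitted; a hedged guess at the AttributeError above: `getIndex` is defined on the internal `Catalog` class, but `self.Catalog` resolves to the wrapping `ZCatalog` object, which keeps the real catalog in its `_catalog` attribute and does not necessarily forward every method. The class names below are stand-ins, not the real Zope classes, just to illustrate why the attribute lookup fails on the wrapper but succeeds one level down:]

```python
class Catalog:
    """Stand-in for the internal catalog class that defines getIndex."""
    def __init__(self, indexes):
        self.indexes = indexes

    def getIndex(self, name):
        return self.indexes[name]


class ZCatalog:
    """Stand-in for the ZCatalog wrapper: it holds the real catalog
    in _catalog and does not expose getIndex itself."""
    def __init__(self, indexes):
        self._catalog = Catalog(indexes)


zc = ZCatalog({'all_searchable_text': 'the-index-object'})

# zc.getIndex('all_searchable_text')  # would raise AttributeError: getIndex
index = zc._catalog.getIndex('all_searchable_text')  # go through _catalog
```

[If that guess is right, `self.Catalog._catalog.getIndex('all_searchable_text')` is the shape of the call to try in the external method.]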
On Fri, 21 Nov 2003 14:08:08 -0500 "Small Business Services" <toolkit@magma.ca> wrote:
The Zope Cache size is set at 10,000
There are 1,985,183 objects in the 'database'
Hmm, that's less than I would have thought.
Specifications for our update linux box:
Zope 2.6.1 1 ghz PIII 1.25 Gb RAM (pc133) 3 disks (IBM ultrastar, scsi, ultra2mode - 10,000 rpm, 4.5ms access)
We are running the disks striped on a single controller, which gives us amazing read/write capacity. We rarely run at full capacity on the disks. We set the cache at the highest point possible (any higher and the machine swaps itself to death).
I think you could definitely use more RAM. But that is a given pretty much. How big is the Data.fs file when you're through indexing? How does that compare to the size of the document corpus itself?
Also I think you may want to try Zope 2.6.2. I made some changes to ZCTextIndex in that version that could help performance. I would be interested to hear if they help.
[snip]
We eventually came up with our current solution: at index time we compress the full text and store it as binary data in the metadata table (getting this to work was a challenge in itself). We then decompress and scan this data to locate the relevant 2-3 lines at retrieval time (it is far faster to decompress and scan metadata than to access the objects directly).
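[The compress-and-scan scheme described above can be sketched roughly as follows. This is a minimal illustration using zlib, not the actual implementation; the real code would store the compressed bytes in a ZCatalog metadata column and the relevance/context rules would be application-specific:]

```python
import zlib


def compress_text(text):
    """Compress the full text for storage alongside the catalog metadata."""
    return zlib.compress(text.encode('utf-8'))


def matching_lines(blob, term, context=1):
    """Decompress the stored blob and return the lines containing the
    search term, plus `context` lines on either side."""
    lines = zlib.decompress(blob).decode('utf-8').splitlines()
    hits = []
    for i, line in enumerate(lines):
        if term in line:
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            hits.extend(lines[lo:hi])
    return hits


blob = compress_text("first line\nthe search term is here\nlast line")
print(matching_lines(blob, "search term"))
# → ['first line', 'the search term is here', 'last line']
```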
Using metadata tends to wake up far fewer objects, which can be a win. Interestingly, ZCTextIndex actually stores a similar compressed word list internally. The actual index object stored in ZCTextIndex has a _docwords BTree which stores a compressed wordlist for each document. This is used for unindexing and phrase matching. Look at the search_phrase method in BaseIndex.py for more info.
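[A rough sketch of the idea behind _docwords and phrase matching, with plain dicts and lists standing in for the lexicon, the BTree, and the compressed word-id encoding ZCTextIndex actually uses. The names and data here are illustrative, not the real API:]

```python
# Stand-in for the index's lexicon: word -> integer word id.
lexicon = {'quick': 1, 'brown': 2, 'fox': 3, 'jumps': 4}

# Stand-in for _docwords: document id -> ordered list of word ids.
docwords = {
    42: [1, 2, 3, 4],  # "quick brown fox jumps"
    43: [2, 3, 1, 4],  # "brown fox quick jumps"
}


def contains_phrase(docid, phrase):
    """Check whether the document's word-id sequence contains the
    phrase's word ids contiguously and in order."""
    wids = [lexicon[w] for w in phrase.split()]
    doc = docwords[docid]
    return any(doc[i:i + len(wids)] == wids
               for i in range(len(doc) - len(wids) + 1))


print(contains_phrase(42, 'quick brown'))  # → True
print(contains_phrase(43, 'quick brown'))  # → False (ids 1, 2 not adjacent)
```

[The same per-document wordlist that makes phrase tests possible is what would let you reconstruct a document's word sequence without storing it a second time.]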
If you could use _docwords, you might be able to get rid of that redundant data structure and the time it takes to build and store it. Retrieval time should be on par with metadata.
Retrieval speeds for end users are excellent. We have only been running into difficulties lately because of the size of the database. The update process now runs 24 hours per day for about 30 days (automating an update process that runs for 30 days was another exciting challenge!). The fact that Zope can handle this volume of processing is a testament to its reliability and robustness!
I'm concerned that it takes that long to index. 30 days is like a millennium of processor time. I'm curious how big your transactions are during index processing.
I'm glad to see the retrieval speeds are good. Roughly what is the average document size?
We have been working with Zope for about 3 years and think that it is a FANTASTIC product! We keep coming up with new things to use it for; it's great!
Thanks in advance for any ideas you may have - we are open to any and all suggestions!
Sounds like you have a very interesting application. I'd be very interested to hear more about it and to help make it faster if I can.
-Casey