From: "Johan Carlsson" <johanc@easypublisher.com>
I am doing this to try to squeeze some performance improvements out of a ZCTextIndex. We have a zcatalog with about 1 million documents that we are full-text indexing, and it no longer fits into memory, so retrieval requires many disk I/O operations and performance is seriously degraded.
Our zcatalog currently has 5 indexes: 4 minor indexes and one major index (the main ZCTextIndex). I am attempting to split the zcatalog into two separate zcatalogs: one containing the 4 minor indexes and one containing the ZCTextIndex. The hope is that the zcatalog containing only the ZCTextIndex will be smaller and will again fit into memory.
Why would it be smaller? You still need to load the indexes when you do a search, right? Or do you intend to index different objects in different catalogs? In that case couldn't you use the idxs argument of ZCatalog::catalog_object(self, obj, uid=None, idxs=None, update_metadata=1)?
Moving only the ZCTextIndex (and its Lexicon) into a separate ZCatalog should result in a smaller ZCatalog, as the space required by the other 4 indexes (3 Field Indexes and another ZCTextIndex) will be located in a different folder. I am going to load the ZCatalog containing the main ZCTextIndex into a Temporary Folder (to hold it in memory). Both ZCatalogs will index the same documents (stored in a separate BTreeFolder2).
The only difficulty is in combining the results from searches of two separate zcatalogs in an efficient manner. My best guess at this point is that I will have to patch the 'search' routine in ZCTextIndex to stop it from 'Lazifying' the result sets, so that I can join/intersect the result sets based on OIDs (instead of RIDs - which should be doable as the result sets prior to 'lazifying' are xxBTrees and the BTrees product comes with methods for join/intersection). I can then 'Lazify' the final result set and return it. At least that's the theory!
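The join described above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain Python dicts and sets stand in for the BTrees structures, and the names join_on_uid, text_catalog_paths and field_catalog_paths are hypothetical, not part of ZCatalog. The point is that RIDs are private to each catalog, so each result set has to be translated to a shared key (here a document path) before intersecting.

```python
def join_on_uid(rids_a, rid_to_uid_a, rids_b, rid_to_uid_b):
    """Intersect two result sets whose RIDs come from different catalogs.

    Each catalog numbers its records independently, so the RIDs are first
    mapped to a key both catalogs share (a path or OID) and then intersected.
    """
    uids_a = {rid_to_uid_a[rid] for rid in rids_a}
    uids_b = {rid_to_uid_b[rid] for rid in rids_b}
    return uids_a & uids_b


# Hypothetical data: the same three documents, numbered differently per catalog.
text_catalog_paths = {101: "/docs/a", 102: "/docs/b", 103: "/docs/c"}
field_catalog_paths = {7: "/docs/a", 8: "/docs/b", 9: "/docs/c"}

text_hits = [101, 103]  # RIDs matched by the full-text query
field_hits = [8, 9]     # RIDs matched by the field-index query

common = join_on_uid(text_hits, text_catalog_paths,
                     field_hits, field_catalog_paths)
```

With real catalogs the translation step costs one lookup per hit, so it is probably worth translating the smaller result set first and probing the other one.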
Maybe do a version of ZCatalog (or rather Catalog) that uses OIDs as RIDs? The only problem is that OIDs are int64 and BTrees.IISet et al. use int32, so you would need an IISet that takes longs.
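To make the size mismatch concrete, here is a small sketch using only the standard library. It assumes an OID is an 8-byte big-endian string, which is how ZODB represents them; the helper names oid_to_int and fits_int32 are made up for this example.

```python
import struct

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def oid_to_int(oid):
    """Unpack an 8-byte big-endian OID into an unsigned integer."""
    return struct.unpack(">Q", oid)[0]


def fits_int32(value):
    """Return True if value can be stored as a signed 32-bit integer key."""
    return INT32_MIN <= value <= INT32_MAX


small_oid = struct.pack(">Q", 42)      # an OID early in the storage
large_oid = struct.pack(">Q", 2**40)   # an OID beyond the 32-bit range

small_fits = fits_int32(oid_to_int(small_oid))
large_fits = fits_int32(oid_to_int(large_oid))
```

A small storage may never hand out OIDs above the 32-bit range, but nothing guarantees that, so a set keyed on OIDs needs 64-bit integer keys to be safe.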
Thanks for the 'heads-up'. I had hoped to use OIDs instead of RIDs, but hadn't considered the 64/32-bit problem. I'll have to see if I can find a 64-bit BTrees package, and failing that, try to modify the existing package to use long ints. This just keeps getting better and better :) Thanks for the help! Jonathan