[Zope-dev] ZCataloging imported data
Jason Spisak
444@hiretechs.com
Mon, 06 Mar 2000 14:47:34 -0500
R. David Murray:
I'm not sure if you got a reply on this one yet, but I have experience
here so I hope this helps. Michael P. is the real ZCatalog guru and
this information is a summary of what he imparted to me.
<snip>
The first problem I ran into was during the import. I wrote an
external method that read the data from a tab delimited file
and used the data to build ZClass instances. I tried to create
them all in one folder in one transaction. This failed miserably
at somewhere around 1500 records. I got an error about an
'frexp' call being out of range, somewhere in ZCatalog's BTree
methods
</snip>
I loaded 35,000 records from a tab delimited file using an external
method. I never got an 'frexp' error so I can't help with that. I did
create them all in one folder, though, and was told that there is
effectively no hard limit to the number of items that can be contained
in a folder.
<snip>
So I tried loading the records in batches, and at first that seemed
to work. Then I got an out of memory error.
</snip>
I tried batches too, but only after I got an out of memory error and then a
disk full error. I was told to take the subtransaction threshold on the
indexing tab in the Catalog and push it up to 1000000. That way a
subtransaction would only commit to disk after quite a few records had
gone in. I was indexing some text that was 1000 words long and it was
committing on every single one. You have to figure out what you are
loading and make the transactions commit after a reasonable number of
records so that you don't exceed your resources.
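
To make the "commit after a reasonable number of records" idea concrete,
here is a minimal sketch in plain Python. It is not the external method
either of us actually used. The names load_in_batches, create_record and
commit are hypothetical: create_record stands in for whatever builds one
ZClass instance from a row, and commit stands in for whatever transaction
commit call your Zope version provides. The batch size of 500 is an
arbitrary starting point, not a recommendation.

```python
import csv


def load_in_batches(lines, create_record, commit, batch_size=500):
    """Create one record per tab-delimited row, committing every batch_size rows.

    lines         -- any iterable of text lines (an open file works)
    create_record -- hypothetical callable that builds one object from a row
    commit        -- hypothetical callable that commits the current transaction
    """
    reader = csv.reader(lines, delimiter="\t")
    pending = 0  # rows created since the last commit
    total = 0    # rows created overall
    for row in reader:
        if not row:
            continue  # skip blank lines
        create_record(row)
        pending += 1
        total += 1
        if pending >= batch_size:
            commit()
            pending = 0
    if pending:
        commit()  # flush the final partial batch
    return total
```

Tune batch_size to what you are loading: large text fields want a smaller
batch, short records can take a bigger one.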
<snip>
Now we find that when we enter certain keywords (the examples we have
found so far are 'well' and 'fire', which you probably don't need
to know) on one of these by-hand ZClass instances, they are *not* found
by a ZCatalog search on the keywords field index. Other words
entered into the keywords field do cause the record to be found
('wells', for example).
</snip>
Try finding C/C++ ;-) 'bill' doesn't show up either.
There are certain words that are hard coded to be skipped in the
indexing program. You can change these because this is Open Source and
you get it all. I think it's a C program but I don't know which one.
Michael, what was that one again? This is not to be confused with the
'stop words' in Catalog.py. Those can be changed too, but at the
risk of bloating your indexes. Also '+' and '-' (plus and minus signs) get
skipped. To get speed and a small, manageable size some sacrifices have to
be made, but you can always work around them because this is Open
Source. I am also using a 'keyword' index for C/C++ and other items that I
know are important to the usefulness of the Catalog.
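
As a rough illustration of why a text index can lose a term like C/C++
while a keyword index keeps it, here is a toy model in plain Python. It is
not the real Zope splitter or Catalog code. The stop word list, the rule
that drops one-letter words, and all function names are assumptions made
up for this example.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"and", "the", "of"}


def split_text(text, stop_words=STOP_WORDS):
    """Toy splitter: lowercase, break on punctuation, drop stop words.

    Dropping one-letter words is an assumption of this sketch.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) > 1 and w not in stop_words]


def build_text_index(docs):
    """Map each surviving word to the ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in split_text(text):
            index.setdefault(word, set()).add(doc_id)
    return index


def build_keyword_index(docs):
    """Map each keyword, stored exactly as given, to its document ids."""
    index = {}
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            index.setdefault(keyword, set()).add(doc_id)
    return index
```

In this toy model, split_text("C/C++ programmer") keeps only "programmer",
so a text search for C/C++ has nothing to match, whereas the keyword index
stores "C/C++" whole and finds it.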
I hope this helped.
All my best,
--
Jason Spisak
444@hiretechs.com