ZCataloging imported data
R. David Murray:

I'm not sure if you got a reply on this one yet, but I have experience here, so I hope this helps. Michael P. is the real ZCatalog guru, and this information is a summary of what he imparted to me.

<snip> The first problem I ran into was during the import. I wrote an external method that read the data from a tab delimited file and used the data to build ZClass instances. I tried to create them all in one folder in one transaction. This failed miserably at somewhere around 1500 records. I got an error about an 'frexp' call being out of range, somewhere in ZCatalog's BTree methods </snip>

I loaded 35,000 records from a tab delimited file using an external method. I never got an 'frexp' error, so I can't help with that. I did create them all in one folder, though, and was told that there is effectively no hard limit on the number of items a folder can contain.

<snip> So I tried loading the records in batches, and at first that seemed to work. Then I got an out of memory error. </snip>

I tried batches too, but only after I got an out of memory error and then a disk full error. I was told to take the subtransaction threshold in the indexing tab of the Catalog and push it up to 1000000. That way a subtransaction only commits to disk after quite a few records have gone in. I was indexing text that was 1000 words long, and it was committing on every single record. You have to figure out what you are loading and make the transactions commit after a reasonable number of records so that you don't exceed your resources.

<snip> Now we find that when we enter certain keywords (the examples we have found so far are 'well' and 'fire', which you probably don't need to know) on one of these by-hand ZClass instances, they are *not* found by a ZCatalog search on the keywords field index. Other words entered into the keywords field do cause the record to be found ('wells', for example). </snip>

Try finding C/C++ ;-) 'bill' doesn't show up either.
There are certain words that are hard coded to be skipped in the indexing program. You can change these because this is Open Source and you get all of the code. I think it's a C program, but I don't know which one. Michael, what was that one again? This is not to be confused with the 'stop words' in Catalog.py. Those can be changed too, but at the risk of bloating your indexes. Also, the '+' and '-' characters (pluses, dashes, and minuses) get skipped. To get speed and a small, manageable index size some sacrifices have to be made, but you can always work around them because this is Open Source. I am also using a 'keyword' index for C/C++ and other items that I know are important to the usefulness of the Catalog.

I hope this helped.

All my best,

--
Jason Spisak
444@hiretechs.com
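
For anyone who wants to try the batching advice from this message, here is a minimal sketch of a loader that commits after a fixed number of records. It is not my actual external method. The names load_in_batches, create_record, and commit are hypothetical, and the two callbacks are plain Python stand-ins so the example runs without Zope. In a real external method, create_record would build the ZClass instance and commit would do whatever commit your Zope version provides (that wiring is an assumption and is not shown).

```python
import csv
import io


def load_in_batches(lines, create_record, commit, batch_size=500):
    """Create one record per tab-delimited row, committing every batch_size rows.

    create_record and commit are caller-supplied callbacks (hypothetical
    names); nothing here talks to Zope or ZCatalog directly.
    """
    reader = csv.reader(lines, delimiter="\t")
    pending = 0  # records created since the last commit
    total = 0    # records created overall
    for row in reader:
        if not row:
            continue  # skip blank lines in the input file
        create_record(row)
        pending += 1
        total += 1
        if pending >= batch_size:
            commit()
            pending = 0
    if pending:
        commit()  # flush the final partial batch
    return total


# Stand-in callbacks: collect rows in a list and note when each commit happens.
records = []
commits = []
sample = io.StringIO("1\tC/C++ developer\n2\tZope admin\n3\tDBA\n")
count = load_in_batches(
    sample,
    records.append,
    lambda: commits.append(len(records)),
    batch_size=2,
)
print(count, commits)
```

Pick batch_size by the size of each record, not by a fixed rule: long text fields may need a smaller batch to stay within memory, while short records can go in much larger batches.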