Well, I've been playing around with static DTML Documents and ZCatalogs from my SQL database. I use a set of DTML methods to generate static pages in a 2-level folder hierarchy from my SQL database: one parent method calling 3 sub-methods (one to clear out the old structure and make a new one, one to get the sub-list of folders, one to make the actual documents in the folders). This seems to force Zope to commit a little more often than putting it all in one document.. which is a good thing.

I have found the following on my 500MHz P-II with 196MB of memory:

a - it takes 22 minutes to create 8800 documents (smallish) in 1200 folders within Zope. Not too fast :( but not exactly a user-interaction-limiting factor :)

b - it takes too long to then try to do a search-based add to a ZCatalog, i.e. Netscape times out after only around 8 minutes, and the search has not finished!

BTW, Zope's python process and postgresql take about 50% of the CPU each, and there is basically zero disk thrashing during this process (although Zope does get up around 50MB of memory use..)

Am I trying to exceed the ZCatalog's capabilities here? It seems like a reasonable ask to me, but without some reference it's hard to tell. I'm really wanting to add 22000 larger documents, but that will have to wait till I find a faster way. Research continuing :)

------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------
Stuart Woolford wrote:
a - 22 minutes to create 8800 documents (smallish) in 1200 folders within zope, not too fast :( but not exactly a user-interaction-limiting factor :)
Yes, ZCatalog right now is tuned for incremental indexing and retrieval. There is a knob where you can trade memory during indexing for speed of indexing. This works by decreasing the frequency of subtransaction commits (which is where the performance hit lies).
b - too long to then try to do a search based add to a zcatalog, ie: netscape times out after only around 8 minutes, and the search has not finished!
This is absolutely, obviously something wrong with your ZCatalog settings. Make sure that you have indexed the properties that you are searching on; otherwise it will be a "grep"-style search. I have seen ZCatalog go through a 12 MB, 23-thousand-line RDF version of the RPM repository. Search time was less than a tenth of a second.

--Paul
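The indexed-vs-"grep" distinction Paul draws can be sketched in a few lines of Python. This is a toy illustration of the idea, not ZCatalog's actual data structures: an indexed property is answered from an inverted index in a single lookup, while an unindexed one forces a scan over every document.

```python
from collections import defaultdict

# Toy corpus of "documents" keyed by id.
docs = {
    1: "zope catalog indexing",
    2: "postgresql database tuning",
    3: "catalog search speed",
}

# Build an inverted index: word -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def indexed_search(word):
    # One dictionary lookup, independent of corpus size.
    return index.get(word, set())

def grep_search(word):
    # Linear scan over every document: slow for large corpora.
    return {doc_id for doc_id, text in docs.items() if word in text.split()}

print(sorted(indexed_search("catalog")))  # [1, 3]
```

Both functions return the same result set; the difference is that the inverted index pays the scanning cost once, at indexing time, rather than on every search.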
On Mon, 13 Sep 1999, Paul Everitt wrote:
Stuart Woolford wrote:
a - 22 minutes to create 8800 documents (smallish) in 1200 folders within zope, not too fast :( but not exactly a user-interaction-limiting factor :)
Yes, ZCatalog right now is tuned for incremental indexing and retrieval. There is a knob where you can trade memory during indexing for speed of indexing. This works by decreasing the frequency of subtransaction commits (which is where the performance hit lies).
I'll have a look at that; currently I'm trying to redo the stuff I'm going to index using ZClasses to cut down on the raw size..
b - too long to then try to do a search based add to a zcatalog, ie: netscape times out after only around 8 minutes, and the search has not finished!
This is absolutely, obviously something wrong with your ZCatalog settings. Make sure that you have indexed the properties that you are searching on; otherwise it will be a "grep"-style search.
I'm not talking about the actual searching using the ZCatalog, but trying to do an add to the ZCatalog, using the 'search for' system to locate the items to add to the catalog..
I have seen ZCatalog go through a 12 MB, 23-thousand-line RDF version of the RPM repository. Search time was less than a tenth of a second.
Once I get the items indexed, I bet it will rock!
--Paul
--
------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------
Stuart Woolford wrote:
on my 500MHz P-II with 196MB of memory it takes:
a - 22 minutes to create 8800 documents (smallish) in 1200 folders within zope, not too fast :( but not exactly a user-interaction-limiting factor :)
7 documents per second ain't too bad, I don't think; it would be interesting to see how fast you could dump them to the filesystem.
b - too long to then try to do a search based add to a zcatalog, ie: netscape times out after only around 8 minutes, and the search has not finished!
Let me make sure we have the same terminology. 'Finding' objects into the catalog involves using the Find tab to search recursively down from the catalog. 'Searching' means typing search criteria into an already loaded catalog and getting results. It sounds like you're talking about 'finding'. If it's taking 8 minutes to do a *search*, that's a bug.

If it's finding you're talking about, try increasing the subtransaction threshold (on the status screen) by an order of magnitude or two. This will cause Zope to commit subtransactions less frequently. 1000, the default, is probably too low, but since this is the first version of Zope with a catalog in it, it's not gotten any real world use. We'll probably jack it up to at least 10,000 for 2.1.
BTW, Zopes python process and postgresql take about 50% of the CPU each, and there is basically zero disk thrashing during this process (although zope does get up around 50MB of memory use..)
Yes, mass indexing is inefficient at the moment. I recently received 'Managing Gigabytes', which was recommended by someone on the list. It has some very cool stuff in it that we might put into the catalog to speed up indexing and searching (although as far as I can tell, searches with ZCatalog are *damn* fast), and reduce memory and object database consumption with slicker algorithms and compression. It also has some cool stuff about wildcard/globbing searches at the expense of some extra memory.

Note that the time it takes to mass index will improve as we improve the algorithm, but in reality indexing always takes time. Once your 'corpus' of documents is created, it would be much, much faster to incrementally index new and changed documents into the catalog than to mass index everything over again.

-Michel
On Tue, 14 Sep 1999, Michel Pelletier wrote:
Stuart Woolford wrote:
on my 500MHz P-II with 196MB of memory it takes:
a - 22 minutes to create 8800 documents (smallish) in 1200 folders within zope, not too fast :( but not exactly a user-interaction-limiting factor :)
7 documents per second ain't too bad, I don't think; it would be interesting to see how fast you could dump them to the filesystem.
I can produce documents to the FS around 10 times that speed, but I'm not complaining, I think it is not too bad..
b - too long to then try to do a search based add to a zcatalog, ie: netscape times out after only around 8 minutes, and the search has not finished!
Let me make sure we have the same terminology. 'Finding' objects into the catalog involves using the Find tab to search recursively down from the catalog. 'Searching' means typing search criteria into an already loaded catalog and getting results. It sounds like you're talking about 'finding'. If it's taking 8 minutes to do a *search*, that's a bug.

If it's finding you're talking about, try increasing the subtransaction threshold (on the status screen) by an order of magnitude or two. This will cause Zope to commit subtransactions less frequently. 1000, the default, is probably too low, but since this is the first version of Zope with a catalog in it, it's not gotten any real world use. We'll probably jack it up to at least 10,000 for 2.1.
You are right, I'm finding docs into the ZCatalog, not searching it (yet).
BTW, Zopes python process and postgresql take about 50% of the CPU each, and there is basically zero disk thrashing during this process (although zope does get up around 50MB of memory use..)
Yes, mass indexing is inefficient at the moment. I recently received 'Managing Gigabytes', which was recommended by someone on the list. It has some very cool stuff in it that we might put into the catalog to speed up indexing and searching (although as far as I can tell, searches with ZCatalog are *damn* fast), and reduce memory and object database consumption with slicker algorithms and compression. It also has some cool stuff about wildcard/globbing searches at the expense of some extra memory.
I was thinking that a 50% share was not too bad for a non-natively-compiled system.. pretty much on target, I would say.
Note that the time it takes to mass index will improve as we improve the algorithm, but in reality indexing allways takes time. Once your 'corpus' of documents is created, it would be much, much faster to incrementally index new and changed documents into the catalog then to mass index everything over again.
One VERY interesting thing I have noticed: around 5 minutes into the add, watching top on the unix system, I see that the python process splits (it's around 11MB at this stage), then a little after I get another postmaster (the database) process appearing, and from then on we have a 4-way split of CPU instead of 2-way. I don't see any reason for Zope to split off a new process (it has no other connections while doing this) - is this a bug, perhaps?
-Michel
-- ------------------------------------------------------------ Stuart Woolford, stuartw@newmail.net Unix Consultant. Software Developer. Supra Club of New Zealand. ------------------------------------------------------------
Stuart Woolford wrote:
If it's finding you're talking about, try increasing the subtransaction threshold (on the status screen) by an order of magnitude or two. This will cause Zope to commit subtransactions less frequently. 1000, the default, is probably too low, but since this is the first version of Zope with a catalog in it, it's not gotten any real world use. We'll probably jack it up to at least 10,000 for 2.1.
You are right, I'm finding docs into the ZCatalog, not searching it (yet).
Did increasing the threshold help? -Michel
On Tue, 14 Sep 1999, Michel Pelletier wrote:
Stuart Woolford wrote:
If it's finding you're talking about, try increasing the subtransaction threshold (on the status screen) by an order of magnitude or two. This will cause Zope to commit subtransactions less frequently. 1000, the default, is probably too low, but since this is the first version of Zope with a catalog in it, it's not gotten any real world use. We'll probably jack it up to at least 10,000 for 2.1.
You are right, I'm finding docs into the ZCatalog, not searching it (yet).
Did increasing the threshold help?
Well, I upped it to 10000, and also converted all the docs to ZClasses, and index specific properties instead of an HTML body. The down side is it now takes 40 minutes to generate 8800 items (I've still got to optimise this, I'm sure it can be improved), but the finding into the ZCatalog is not great - 3 minutes, with indexing taking another 4 minutes.

I've noticed one 'feature' - when doing a basic ZSearch, I have a text-indexed 'name' field (the name of a book, FWIW); when I search (for 'computer', for example) I only get the search word back here, not the whole name. Is this a bug or a feature? I've not looked closely yet, so it can quite probably be fixed..

------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------
Stuart Woolford wrote:
On Tue, 14 Sep 1999, Michel Pelletier wrote:
Stuart Woolford wrote:
If it's finding you're talking about, try increasing the subtransaction threshold (on the status screen) by an order of magnitude or two. This will cause Zope to commit subtransactions less frequently. 1000, the default, is probably too low, but since this is the first version of Zope with a catalog in it, it's not gotten any real world use. We'll probably jack it up to at least 10,000 for 2.1.
You are right, I'm finding docs into the ZCatalog, not searching it (yet).
Did increasing the threshold help?
Well, I upped it to 10000, and also converted all the docs to ZClasses, and index specific properties instead of an HTML body.
The down side is it now takes 40 minutes to generate 8800 items (I've still got to optimise this, I'm sure it can be improved), but the finding into the ZCatalog is not great - 3 minutes, with indexing taking another 4 minutes.
Yes, but this is the first index. In the next revision, I'll have implemented an optimization where, when you run Find the second time, it sniffs the modification time of each object and only bothers to re-index the objects that changed since the last index. Trivial, but big wins for large bodies of unchanging documents. I hope we still got some CVS testers out there.

A further optimization Jim pointed out today is a bit more advanced, using multiple sorted indexes with merges. This should reduce a lot of the IO thrashing that mass indexing does. In terms of ZCatalog as it stands now, mass indexing is its weakness. 3-4 minutes isn't bad, though; it would be nice to know a total count of how many 'unique entities' (stemmed words) a catalog has seen over a period of time, or even better a log of total words indexed in the last n transaction commits.
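The modification-time optimization described above can be sketched as follows. All names here are hypothetical; this is a toy model of the idea, not the planned ZCatalog code.

```python
def incremental_find(objects, last_run, get_mtime, reindex):
    """Re-index only the objects modified since the previous catalog run.

    Objects whose modification time is at or before `last_run` are
    skipped entirely - the big win for large, mostly unchanging corpora.
    """
    count = 0
    for obj in objects:
        if get_mtime(obj) > last_run:
            reindex(obj)
            count += 1
    return count

# Toy example: objects as (name, mtime) pairs; previous run was at t=100.
objs = [("a", 50), ("b", 150), ("c", 99), ("d", 200)]
changed = []
n = incremental_find(objs, 100, lambda o: o[1], changed.append)
print(n)        # 2
print(changed)  # [('b', 150), ('d', 200)]
```

Only two of the four objects are touched; with 8800 mostly unchanged documents, a second Find would do a fraction of the original work.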
I've noticed one 'feature' - when doing a basic ZSearch, I have a text-indexed 'name' field (the name of a book, FWIW); when I search (for 'computer', for example) I only get the search word back here, not the whole name. Is this a bug or a feature? I've not looked closely yet, so it can quite probably be fixed..
I'm sorry, I don't understand your problem. Can you rephrase it? -Michel
participants (3)
- Michel Pelletier
- Paul Everitt
- Stuart Woolford