Hi,

I'm currently working on a product that allows attaching relational information to Zope objects. It works quite well so far, but to further enhance it I need to make some changes to the Catalog. I could perhaps implement it as a separate product, but I strongly feel that those changes are best applied to the Catalog itself, as they are of general use (I think) and involve a lot of changes to the inner workings of the Catalog. In particular I need the following:

- named/stored queries: precompiled queries, so they can be executed without parsing and are easily cacheable; similar to what is implemented in CMFTopic, but stored in the Catalog and a bit smarter
- caching support
- unions and intersections: sub-queries (i.e. queries that are directed at a certain index) should be more flexibly combinable

I searched this mailing list as well as zope.org to get an idea of what has already been discussed and requested, and there seems to be some interest in improving the Catalog. Some people even seem to have worked on this; perhaps they could give an update? Possibly I don't have to write everything from scratch...

I would have put this into a proposal, but there already are two proposals that deal with the features I want: one is dedicated to unions/intersections, the other (TopicIndexes) to performance issues (I don't know the status of these, though; especially the first one is rather old), and I don't want to hijack them without asking. As so often, I will need to complete my current project first, but would then like to help in improving the Catalog for more general use.

So, if there is interest, I would propose to collect some ideas and comments about what a better Catalog should look like, how it could best be implemented, and how to organize this effort (with respect to the already existing proposals).

-- Wolfram Kerber
Gallileus GmbH
http://www.gallileus.info/
On Tuesday, November 20, 2001, at 03:35 PM, Wolfram Kerber wrote:
Hi,
I'm currently working on a product that allows attaching relational information to Zope objects. It works quite well so far, but to further enhance it I need to make some changes to the Catalog. I could perhaps implement it as a separate product, but I strongly feel that those changes are best applied to the Catalog itself, as they are of general use (I think) and involve a lot of changes to the inner workings of the Catalog. In particular I need the following:
- named/stored queries: precompiled queries, so they can be executed without parsing and are easily cacheable; similar to what is implemented in CMFTopic, but stored in the Catalog and a bit smarter
There used to be something like this in ZTables/Tabula (a Zope 1.x product that was sort of the genesis of the Catalog, for better or worse) called 'Hierarchies'. Hierarchies were actually indexes (I think the current Keyword index is descended from the Keyword Hierarchy). I don't know what happened to that code. If it's not available, you could probably achieve the effect that you're looking for here with PluginIndexes, which wouldn't require changing the Catalog at all. Just write a "Query Index" that indexes objects that match its pre-cooked Query. This would speed up searching tremendously, but you could take a big hit at indexing time if you have many of them. Jeffrey P Shell, jeffrey@cuemedia.com
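Jeffrey's "Query Index" idea might be sketched roughly as follows. This is a hypothetical illustration in plain Python, not the real PluginIndexes API; the class and method names are invented, though `index_object`/`unindex_object` mirror the general shape of Zope index interfaces.

```python
# Hypothetical sketch of a "Query Index": names are invented for
# illustration and do not match any real Zope API.

class QueryIndex:
    """Indexes the ids of objects that match a stored, pre-cooked query."""

    def __init__(self, query):
        self.query = query   # e.g. {'meta_type': 'Document'}
        self.ids = set()     # docids currently matching the query

    def index_object(self, docid, obj):
        # The per-object test happens at indexing time (the "big hit"
        # Jeffrey mentions), so searching is just a set lookup.
        if all(getattr(obj, k, None) == v for k, v in self.query.items()):
            self.ids.add(docid)
        else:
            self.ids.discard(docid)

    def unindex_object(self, docid):
        self.ids.discard(docid)

    def search(self):
        # Searching is trivially fast: return the precomputed set.
        return set(self.ids)
```

With many such indexes, every cataloged object must be re-tested against each stored query, which is where the indexing-time cost comes from.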
----- Original Message ----- From: "Jeffrey P Shell" <jeffrey@cuemedia.com> To: <zope-dev@zope.org> Sent: Wednesday, November 21, 2001 7:38 PM Subject: Re: [Zope-dev] Catalog improvements
On Tuesday, November 20, 2001, at 03:35 PM, Wolfram Kerber wrote:
Hi,
I'm currently working on a product that allows attaching relational information to Zope objects. It works quite well so far, but to further enhance it I need to make some changes to the Catalog. I could perhaps implement it as a separate product, but I strongly feel that those changes are best applied to the Catalog itself, as they are of general use (I think) and involve a lot of changes to the inner workings of the Catalog. In particular I need the following:
- named/stored queries: precompiled queries, so they can be executed without parsing and are easily cacheable; similar to what is implemented in CMFTopic, but stored in the Catalog and a bit smarter
There used to be something like this in ZTables/Tabula (a Zope 1.x product that was sort of the genesis of the Catalog, for better or worse) called 'Hierarchies'. Hierarchies were actually indexes (I think the current Keyword index is descended from the Keyword Hierarchy).
I don't know what happened to that code. If it's not available, you could probably achieve the effect that you're looking for here with PluginIndexes
I think you're right. Indexes also have a management interface that could be used to define the query. It could result in a nesting problem, however, if 'QueryIndexes' rely on each other's results (which they should be able to do). I would possibly need a management view that shows the hierarchical structure of the indexes, but it can be merely that, a view. I'll try this out...
, which wouldn't require changing the Catalog at all.
I'd say that if I did _not_ store the result of the query and just delegated to other indexes, this would be true. Otherwise I would need some notify mechanism to tell whether my result is affected by an indexing call, and/or at least be notified when the call is over so I can update the result by issuing a query. But the latter would mean 'taking the big hit' as you mentioned, which I think isn't acceptable.
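The notify mechanism Wolfram describes might look something like this dirty-flag sketch (all names invented for illustration): the catalog tells the index that something changed, and the stored query is only re-issued lazily on the next search, rather than after every indexing call.

```python
# Sketch of a lazy "notify" variant (invented names): instead of
# re-running the stored query on every indexing call, the cached
# result is merely invalidated and recomputed on the next search.

class LazyQueryIndex:
    def __init__(self, run_query):
        self._run_query = run_query  # callable returning matching docids
        self._result = None          # cached result set
        self._dirty = True

    def notify_changed(self):
        # Called by the catalog once an indexing call is over.
        self._dirty = True

    def search(self):
        if self._dirty:
            self._result = set(self._run_query())
            self._dirty = False
        return self._result
```

This spreads the cost differently: a burst of indexing calls costs one invalidation each, and the full query price is paid at most once, by the next search.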
Just write a "Query Index" that indexes objects that match its pre-cooked Query. This would speed up searching tremendously, but you could take a big hit at indexing time if you have many of them.
Jeffrey P Shell, jeffrey@cuemedia.com
thanks, Wolfram
On Tuesday 20 November 2001 05:35 pm, Wolfram Kerber allegedly wrote:
Hi,
I'm currently working on a product that allows attaching relational information to Zope objects. It works quite well so far, but to further enhance it I need to make some changes to the Catalog. I could perhaps implement it as a separate product, but I strongly feel that those changes are best applied to the Catalog itself, as they are of general use (I think) and involve a lot of changes to the inner workings of the Catalog. In particular I need the following:
- named/stored queries: precompiled queries, so they can be executed without parsing and are easily cacheable; similar to what is implemented in CMFTopic, but stored in the Catalog and a bit smarter
- caching support
- unions and intersections: sub-queries (i.e. queries that are directed at a certain index) should be more flexibly combinable
I have some code that implements this in my CatalogQuery product. It creates a query object from a string. Presently these are not persistent, but they could easily be made persistent to create precompiled queries. Code at: http://www.zope.org/Members/Kaivo/CatalogQuery
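For illustration, a toy version of evaluating a query string against per-index result sets might look like this. This is not CatalogQuery's actual grammar or implementation, just a sketch of the union/intersection idea, evaluated strictly left to right.

```python
# Toy evaluator for a query string over per-term result sets:
# AND is set intersection, OR is set union, evaluated left to right.
# Only a sketch of the idea, not CatalogQuery's real grammar.

def run_query(expr, results):
    tokens = expr.split()
    out = set(results[tokens[0]])
    for i in range(1, len(tokens), 2):
        op, term = tokens[i], tokens[i + 1]
        if op == 'AND':
            out &= results[term]
        else:  # 'OR'
            out |= results[term]
    return out
```

A persistent version would store the parsed token list instead of the raw string, which is all "precompiled" needs to mean here.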
I searched this mailing list as well as zope.org to get an idea of what has already been discussed and requested, and there seems to be some interest in improving the Catalog. Some people even seem to have worked on this; perhaps they could give an update? Possibly I don't have to write everything from scratch...
I would be willing to help both in coding and getting the code put into the Zope core.
I would have put this into a proposal, but there already are two proposals that deal with the features I want: one is dedicated to unions/intersections, the other (TopicIndexes) to performance issues (I don't know the status of these, though; especially the first one is rather old), and I don't want to hijack them without asking. As so often, I will need to complete my current project first, but would then like to help in improving the Catalog for more general use.
Possibly we need to rekindle discussion. I would suggest contacting the authors of those proposals to see how compatible your concepts are with theirs. Perhaps a new proposal should be drafted with the new ideas, tying them back to the previous ones. If there is redundancy, that can be worked out.
So, if there is interest, I would propose to collect some ideas and comments about what a better Catalog should look like, how it could best be implemented, and how to organize this effort (with respect to the already existing proposals).
I am very interested in such a discussion. Let me know what I can do to help. /---------------------------------------------------\ Casey Duncan, Sr. Web Developer National Legal Aid and Defender Association c.duncan@nlada.org \---------------------------------------------------/
Casey Duncan wrote:
I have some code that implements this in my CatalogQuery product. It creates a query object from a string. Presently these are not persistent, but they could easily be made persistent to create precompiled queries.
Casey, did you get a chance to look at my patches for adding an extended uniqueValues method to CatalogQuery?
I would be willing to help both in coding and getting the code put into the Zope core.
<raises hand> me too!
So, if there is interest, I would propose to collect some ideas and comments about what a better Catalog should look like, how it could best be implemented, and how to organize this effort (with respect to the already existing proposals).
I am very interested in such a discussion. Let me know what I can do to help.
I'm interested in this too, and I'm keen to get a solution that will work with just the ZODB, without needing all of Zope. -- Steve Alexander Software Engineer Cat-Box limited
On Tuesday 27 November 2001 09:49 am, Steve Alexander allegedly wrote:
Casey Duncan wrote:
I have some code that implements this in my CatalogQuery product. It creates a query object from a string. Presently these are not persistent, but they could easily be made persistent to create precompiled queries.
Casey, did you get a chance to look at my patches for adding an extended uniqueValues method to CatalogQuery?
No, unfortunately I think it got lost in the shuffle around the time of my cross-country move. Any chance of sending it over again? I am revamping some of my "old" products; perhaps this will give me an excuse to release a new version of catquery.
I would be willing to help both in coding and getting the code put into the Zope core.
<raises hand> me too!
So, if there is interest, I would propose to collect some ideas and comments about what a better Catalog should look like, how it could best be implemented, and how to organize this effort (with respect to the already existing proposals).
I am very interested in such a discussion. Let me know what I can do to help.
I'm interested in this too, and I'm keen to get a solution that will work with just the ZODB, without needing all of Zope.
Yes, I second, third and fourth that motion. I have a bunch of ideas kicking around for ZODB-level indexing. Let's talk more. Perhaps we should arrange an "indexing and catalog" chat on #zope. /---------------------------------------------------\ Casey Duncan, Sr. Web Developer National Legal Aid and Defender Association c.duncan@nlada.org \---------------------------------------------------/
On Tue, 27 Nov 2001, Casey Duncan wrote:
I'm interested in this too, and I'm keen to get a solution that will work with just the ZODB, without needing all of Zope.
Yes, I second, third and fourth that motion. I have a bunch of ideas kicking around for ZODB-level indexing. Let's talk more. Perhaps we should arrange an "indexing and catalog" chat on #zope.
I would like in on that too :) About a year or so ago I was working on a full-text indexing system for indexing several gigabytes of text (mailing list archives). Most of it was written in C and uses quite a lot of cool algorithms from various information retrieval papers and books. I have been hoping to have the time to take parts of it and work it into the new PluginIndex architecture. The existing code uses BerkeleyDB files to hold the index structures, but I would like to use ZODB instead to give it a bit more modularity. -Matt -- Matt Hamilton matth@netsight.co.uk Netsight Internet Solutions, Ltd. Business Vision on the Internet http://www.netsight.co.uk +44 (0)117 9090901 Web Hosting | Web Design | Domain Names | Co-location | DB Integration
Is this code available to the public? Andreas ----- Original Message ----- From: "Matt Hamilton" <matth@netsight.co.uk> To: "Casey Duncan" <c.duncan@nlada.org> Cc: "Steve Alexander" <steve@cat-box.net>; "Wolfram Kerber" <wk@gallileus.de>; <zope-dev@zope.org> Sent: Tuesday, November 27, 2001 10:06 Subject: Re: [Zope-dev] Catalog improvements
On Tue, 27 Nov 2001, Casey Duncan wrote:
I'm interested in this too, and I'm keen to get a solution that will work with just the ZODB, without needing all of Zope.
Yes, I second, third and fourth that motion. I have a bunch of ideas kicking around for ZODB-level indexing. Let's talk more. Perhaps we should arrange an "indexing and catalog" chat on #zope.
I would like in on that too :) About a year or so ago I was working on a full-text indexing system for indexing several gigabytes of text (mailing list archives). Most of it was written in C and uses quite a lot of cool algorithms from various information retrieval papers and books. I have been hoping to have the time to take parts of it and work it into the new PluginIndex architecture. The existing code uses BerkeleyDB files to hold the index structures, but I would like to use ZODB instead to give it a bit more modularity.
On Tue, 27 Nov 2001, Andreas Jung wrote:
Is this code available for public ?
Sort of :) It used to be around, but the server with it on is currently offline and in need of a new disk controller, so it is not to hand. It is also poorly commented :( and written in very highly optimised (read: illegible) C. The main bits needed from it are the routines to store and retrieve compressed lists of ascending integers (i.e. as used in indexes). I want to write a Python wrapper around them and release a list-like Python data structure that will allow efficient storage of indexes. The other bit is the code for doing the cosine ranking similarity comparison in order to rank the documents in order of relevance to a query. Most of the code is taken from the book/code 'Managing Gigabytes' by Witten, Moffat & Bell (http://www.cs.mu.OZ.AU/mg/). The code is quite old now (1999) and designed for quite large systems, or relatively static text (i.e. it doesn't do incremental indexing very well). I worked on developing a 'forward' index which could be easily updated, and then inverted quite quickly on a regular basis (since it didn't need to parse the source text again). -Matt -- Matt Hamilton matth@netsight.co.uk Netsight Internet Solutions, Ltd. Business Vision on the Internet http://www.netsight.co.uk +44 (0)117 9090901 Web Hosting | Web Design | Domain Names | Co-location | DB Integration
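The cosine ranking Matt mentions can be illustrated with a small sketch. This is simplified to raw term frequencies; the MG code uses weighted tf-idf variants, so treat this only as the shape of the computation.

```python
# Simplified cosine-ranking sketch: documents are ranked by the cosine
# of the angle between their term-frequency vector and the query
# vector.  Real MG-style ranking weights terms by inverse document
# frequency; raw counts are used here to keep the sketch short.

import math

def cosine_rank(query_terms, doc_term_freqs):
    """doc_term_freqs maps docid -> {term: frequency}; returns docids
    sorted from most to least similar to the query."""
    qvec = {t: 1.0 for t in query_terms}
    qnorm = math.sqrt(len(qvec))
    ranked = []
    for docid, tf in doc_term_freqs.items():
        dot = sum(tf.get(t, 0) * w for t, w in qvec.items())
        dnorm = math.sqrt(sum(f * f for f in tf.values()))
        score = dot / (qnorm * dnorm) if dnorm else 0.0
        ranked.append((score, docid))
    return [docid for score, docid in sorted(ranked, reverse=True)]
```

Normalising by document length (the `dnorm` division) is what stops long documents from dominating the ranking.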
Matt Hamilton wrote:
I would like in on that too :) About a year or so ago I was working on a full-text indexing system for indexing several gigabytes of text (mailing list archives). Most of it was written in C and uses quite a lot of cool algorithms from various information retrieval papers and books. I have been hoping to have the time to take parts of it and work it into the new PluginIndex architecture. The existing code uses BerkeleyDB files to hold the index structures, but I would like to use ZODB instead to give it a bit more modularity.
Hi Matt, Are any of these algorithms publicly available? I'd be _very_ interested in them :-) Chris
----- Original Message ----- From: "Chris Withers" <chrisw@nipltd.com> To: "Matt Hamilton" <matth@netsight.co.uk> Cc: "Casey Duncan" <c.duncan@nlada.org>; "Steve Alexander" <steve@cat-box.net>; "Wolfram Kerber" <wk@gallileus.de>; <zope-dev@zope.org> Sent: Wednesday, November 28, 2001 09:27 Subject: Re: [Zope-dev] Catalog improvements
Matt Hamilton wrote:
I would like in on that too :) About a year or so ago I was working on a full-text indexing system for indexing several gigabytes of text (mailing list archives). Most of it was written in C and uses quite a lot of cool algorithms from various information retrieval papers and books. I have been hoping to have the time to take parts of it and work it into the new PluginIndex architecture. The existing code uses BerkeleyDB files to hold the index structures, but I would like to use ZODB instead to give it a bit more modularity.
Hi Matt,
Are any of these algorithms publicly available? I'd be _very_ interested in them :-)
I think the software "MG" from the book "Managing Gigabytes" is GPLed and currently released as mg-1.21. Walking through the TOC of the book, it seems to be a very detailed source on text processing and gives a great deal of information about different index types. But I miss some explanation of newer data structures like suffix arrays or suffix trees, which have several advantages for text processing compared to B-Trees. Andreas --------------------------------------------------------------------- - Andreas Jung Zope Corporation - - EMail: andreas@zope.com http://www.zope.com - - "Python Powered" http://www.python.org - - "Makers of Zope" http://www.zope.org - - "Life is a fulltime occupation" - ---------------------------------------------------------------------
On Wed, 28 Nov 2001, Andreas Jung wrote:
I think the software "MG" from the book "Managing Gigabytes" is GPLed and currently released as mg-1.21. Walking through the TOC of the book, it seems to be a very detailed source on text processing and gives a great deal of information about different index types. But I miss some explanation of newer data structures like suffix arrays or suffix trees, which have several advantages for text processing compared to B-Trees.
Suffix trees/tries take up a *lot* of space. But they are very fast, and useful for searching for substrings. The main gist of the stuff in 'Managing Gigabytes' is that it is possible to store an ascending list of integers in a compressed form such that, on average, each integer requires only 4 bits to represent. This is obviously much more compact than a straight list of 32- or 64-bit integers/longs (plus any overhead Python adds to its built-in list type). The other point is that you can read and decode the lists very quickly (you don't need to decompress the entire list before reading it). Also, consecutive numbers take only 1 bit of storage; this means that 'stopwords', which are normally omitted from indexes due to their very high frequency (and hence bloat of the index), can be stored very efficiently. One problem is that all of the research done in MG is based on much older hardware than is currently available, so they make certain optimisations which nowadays don't save much time. -Matt -- Matt Hamilton matth@netsight.co.uk Netsight Internet Solutions, Ltd. Business Vision on the Internet http://www.netsight.co.uk +44 (0)117 9090901 Web Hosting | Web Design | Domain Names | Co-location | DB Integration
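The "one bit per consecutive number" property follows from delta-encoding the ascending list and then coding each gap with a scheme such as Elias gamma, where the value 1 occupies a single bit. A small sketch (bit strings are used for clarity; a real index would pack the bits):

```python
# Elias gamma coding sketch: an integer n >= 1 is written as a unary
# length prefix followed by its binary form, so n = 1 costs a single
# bit.  Delta-encoding an ascending id list turns runs of consecutive
# ids into runs of gap 1, i.e. one bit per id.

def gamma_encode(n):
    b = bin(n)[2:]                 # binary form, always starts with '1'
    return '0' * (len(b) - 1) + b  # unary length prefix + binary body

def encode_postings(ascending_ids):
    bits, prev = [], 0
    for i in ascending_ids:
        bits.append(gamma_encode(i - prev))  # code the gap, not the id
        prev = i
    return ''.join(bits)
```

So a run of stopword postings like [1, 2, 3, 4] compresses to four bits total, which is why stopwords become cheap to keep in the index.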
----- Original Message ----- From: "Matt Hamilton" <matth@netsight.co.uk> To: "Andreas Jung" <andreas@zope.com> Cc: "Chris Withers" <chrisw@nipltd.com>; "Casey Duncan" <c.duncan@nlada.org>; "Steve Alexander" <steve@cat-box.net>; "Wolfram Kerber" <wk@gallileus.de>; <zope-dev@zope.org> Sent: Wednesday, November 28, 2001 09:55 Subject: Re: [Zope-dev] Catalog improvements
On Wed, 28 Nov 2001, Andreas Jung wrote:
I think the software "MG" from the book "Managing Gigabytes" is GPLed and currently released as mg-1.21. Walking through the TOC of the book, it seems to be a very detailed sources about text processing and gives very much informations about different indexes types. But I miss some explanations about current data structures like suffix arrays or suffix tree that have several advantages for text processing compared to B-Trees.
Suffix Trees/Tries take up a *lot* of space. But they are very fast, and useful for searching for substrings.
Usually four times the amount of the data to be indexed ;-) Andreas
Andreas Jung wrote:
I think the software "MG" from the book "Managing Gigabytes" is GPLed and currently released as mg-1.21. Walking through the TOC of the book, it seems to be a very detailed sources about text processing and gives very much informations about different indexes types. But I miss some explanations about current data structures like suffix arrays or suffix tree that have several advantages for text processing compared to B-Trees.
Hmmm... looks like it's time to go buy a book :-) cheers, Chris
Casey Duncan wrote:
I would be willing to help both in coding and getting the code put into the Zope core.
<raises hand> me too!
Me three! :-) Just to put my take on all of this... As some of you may know, I've been looking at indexing for a while now in one way or another...
I'm interested in this too, and I'm keen to get a solution that will work with just the ZODB, without needing all of Zope.
Yes, I second, third and fourth that motion. I have a bunch of ideas kicking around for ZODB-level indexing. Let's talk more.
I don't believe this is a good idea any more, especially if you get into any serious amount of data. ZODB simply doesn't seem to scale to indexing very well. You all have no doubt experienced this with ZCatalog TextIndexes... I have a more flexible and pluggable indexer written for ZODB (not only Zope! ;-) but it didn't scale to anything like what I needed :-(

FileStorage goes through RAM at a rate of knots. Jim has a patch for this, but I haven't had a chance to stress test it yet. bsddb2Storage currently hammers disk, meaning it has worse performance when indexing than FileStorage ;-)

I'm currently working on a MySQL-based full text indexer with phrase matching, and potentially wildcards some time soon. For me, once this is cracked, FieldIndexes and the like are trivial in SQL, and I intend to encapsulate the whole thing in a Python class for ease of use. This is what I think might be the best solution: relational databases do tables well, and that's what indexing is all about: tables.

That said, I wasn't aware of Matt's work until very recently. I'd love to see an indexer that didn't require an RDB (or BerkeleyDB :-P) and scaled to gigabytes of data...
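Chris's "indexing is tables" point can be sketched with a tiny postings-table design. SQLite stands in for MySQL here purely for illustration, and the schema and function names are invented; the point is that phrase matching falls out of storing word positions and joining on adjacent positions.

```python
# Sketch of full-text indexing as plain SQL tables: one row per
# (word, docid, position).  A two-word phrase match is then just a
# self-join on adjacent positions.  SQLite used for illustration only.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE postings (word TEXT, docid INTEGER, pos INTEGER)')

def index_doc(docid, text):
    for pos, word in enumerate(text.lower().split()):
        conn.execute('INSERT INTO postings VALUES (?, ?, ?)',
                     (word, docid, pos))

def phrase_search(first, second):
    rows = conn.execute(
        'SELECT DISTINCT a.docid FROM postings a JOIN postings b'
        ' ON a.docid = b.docid AND b.pos = a.pos + 1'
        ' WHERE a.word = ? AND b.word = ?', (first, second)).fetchall()
    return [r[0] for r in rows]
```

A FieldIndex in this scheme really is trivial: one table of (value, docid) pairs and an ordinary indexed equality query.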
Perhaps we should arrange an "indexing and catalog" chat on #zope.
...definitely. When shall we set a time and date? cheers, Chris
On Wed, 28 Nov 2001 14:37:57 +0000, Chris Withers <chrisw@nipltd.com> wrote:
bsddb2Storage currently hammers disk meaning it has worse performance when indexing than FileStorage ;-)
FileStorage is 'damn fast', so almost anything is going to be slower. How much slower was it? Did you measure ratios (between the two storages) of time per indexing operation, or ratios of disk blocks transferred per indexing operation? Toby Dickenson tdickenson@geminidataloggers.com
Toby Dickenson wrote:
FileStorage is 'damn fast', so almost anything is going to be slower.
Indeed, until it runs out of RAM for its indexes ;-)
How much slower was it? Did you measure ratios (between the two storages) of time per indexing operation, or ratios of disk blocks transferred per indexing operation?
In my tests, Barry agreed with me that Berkeley was turning out between one and two _orders of magnitude_ slower than FileStorage :-( Chris
"CW" == Chris Withers <chrisw@nipltd.com> writes:
>> How much slower was it? Did you measure ratios (between the two >> storages) of time per indexing operation, or ratios of disk >> blocks transferred per indexing operation? CW> In my tests, Barry agreed with me that Berkeley was turning out CW> between one and two _orders of magnitude_ slower than CW> FileStorage :-( Actually, let me clarify this! I just pointed out that your numbers showed you were seeing a two orders of magnitude difference. However, in my own testing, on my own data, I've been able to reduce the performance difference to about a factor of 4.5 -- much better than the factor of 100 your numbers showed for your data! I would not make the blanket assertion that Berkeley storage is 100 times slower than FileStorage. Let me just reiterate: it's vitally important to tune your Berkeley storage for your system and application, especially with regard to cachesize. E.g. getting the cachesize wrong can definitely destroy your performance, maybe producing numbers as bad as you're seeing. I won't claim that Berkeley DB is easy to tune, though. -Barry
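For reference, Berkeley DB reads tuning directives from a DB_CONFIG file in the environment home directory; the directives supported depend on the installed Berkeley DB version, so check the local documentation, but the cachesize setting Barry refers to typically looks like:

```
# DB_CONFIG, placed in the Berkeley DB environment home directory.
# set_cachesize takes <gbytes> <bytes> <ncaches>; this requests a
# single 64 MB cache.
set_cachesize 0 67108864 1
```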
"Barry A. Warsaw" wrote:
than the factor of 100 your numbers showed for your data! I would not make the blanket assertion that Berkeley storage is 100 times slower than FileStorage.
Sorry, let me clarify as well, I only meant in the context of searching and indexing...
Let me just reiterate: it's vitally important to tune your Berkeley storage for your system and application, especially with regards to cachesize. E.g. Getting the cachesize wrong can definitely destroy your performance, maybe producing numbers as bad as you're seeing. I won't claim that Berkeley DB is easy to tune, though.
Indeed... and I've spent a while twiddling cache sizes to no avail ;-) cheers, Chris
Chris Withers wrote:
Toby Dickenson wrote:
FileStorage is 'damn fast', so almost anything is going to be slower.
Indeed, until it runs out of RAM for its indexes ;-)
I wish you would finish testing the change I made for you. It should reduce the memory consumption by an order of magnitude. I took an afternoon out of a rather busy schedule to put this together for you. Jim -- Jim Fulton mailto:jim@zope.com Python Powered! CTO (888) 344-4332 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org
Jim Fulton wrote:
Chris Withers wrote:
Toby Dickenson wrote:
FileStorage is 'damn fast', so almost anything is going to be slower.
Indeed, until it runs out of RAM for its indexes ;-)
I wish you would finish testing the change I made for you.
Sorry, to be clear, my comment was in the context of using a FileStorage exclusively to store searching and indexing information. Jim has provided a patch which I was trying to test; sadly, for whatever reason, it went wrong and killed the box I was testing on. There are some issues preventing me from resurrecting the box (location, staff, etc.) but I will let you guys know as soon as I get some information... cheers, Chris
Jim Fulton wrote:
Chris Withers wrote:
Toby Dickenson wrote:
FileStorage is 'damn fast', so almost anything is going to be slower.
Indeed, until it runs out of RAM for its indexes ;-)
I wish you would finish testing the change I made for you. It should reduce the memory consumption by an order of magnitude. I took an afternoon out of a rather busy schedule to put this together for you.
As I currently run 30-60 storage servers on a machine, I would be very interested in testing out such a patch, if you'd be willing to send it along. Thanks, -- ethan mindlace fremen | iMeme - The most full featured Zope Host http://mindlace.net | Root, ZEO, MySQL, Mailman, Unlimited Domains iMeme Partner | http://iMeme.net "It is our desire to remain what we are that limits us. -- Project 2501"
emf wrote:
As I currently run 30-60 storage servers on a machine, I would be very interested in testing out such a patch, if you'd be willing to send it along.
It's in CVS, just check out the appropriate branch: Jim Fulton wrote:
OK, I made a CVS branch, BTreeFSIndex-branch (made from the Zope-2_4-branch), for just the BTrees and ZODB directories. If you update to that branch you should get my experimental changes. The BTrees package has a new extension, _fsBTrees that has 2-char to 6-char BTree types.
The ZODB fsIndex.py provides a FileStorage index based on this BTree. You should get a memory consumption of only a little more than 8 bytes per object. Note that the file size is limited to about 256 terabytes. Nothing is free. :)
cheers, Chris PS: Still haven't managed to get the machine resurrected :-(
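The fsIndex structure Jim describes can be sketched as a two-level mapping that factors out the shared 6-byte oid prefix. The class name is invented and plain dicts stand in for the special-purpose _fsBTrees of the real code; the BTree types are what actually keep the per-object overhead down to roughly 8 bytes.

```python
# Two-level index sketch (invented name): each 8-byte oid is split
# into a 6-byte prefix and a 2-byte suffix, so the prefix is stored
# once per group of up to 65536 oids instead of once per oid.

class TwoLevelIndex:
    def __init__(self):
        self._data = {}  # 6-byte prefix -> {2-byte suffix: file position}

    def __setitem__(self, oid, pos):
        assert len(oid) == 8
        self._data.setdefault(oid[:6], {})[oid[6:]] = pos

    def __getitem__(self, oid):
        return self._data[oid[:6]][oid[6:]]
```

Since oids are allocated roughly sequentially, most objects share a small number of prefixes, which is why the factoring pays off so well in practice.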
Thanks Chris, Important note: To get the benefit of this, you have to remove your old index. FileStorage uses an index object that it pickles to its index file when you save the index (normal shutdown or pack). If FileStorage reads the old index from the index file, the index will have the old type (aka type({}) ;). To get the FileStorage to use the new BTree-based index implementation, you need to get it to build a new index by starting without an index file or packing. Jim Chris Withers wrote:
emf wrote:
As I currently run 30-60 storage servers on a machine, I would be very interested in testing out such a patch, if you'd be willing to send it along.
It's in CVS, just check out the appropriate branch:
Jim Fulton wrote:
OK, I made a CVS branch, BTreeFSIndex-branch (made from the Zope-2_4-branch), for just the BTrees and ZODB directories. If you update to that branch you should get my experimental changes. The BTrees package has a new extension, _fsBTrees that has 2-char to 6-char BTree types.
The ZODB fsIndex.py provides a FileStorage index based on this BTree. You should get a memory consumption of only a little more than 8 bytes per object. Note that the file size is limited to about 256 terabytes. Nothing is free. :)
cheers,
Chris
PS: Still haven't managed to get the machine resurrected :-(
-- Jim Fulton mailto:jim@zope.com Python Powered! CTO (888) 344-4332 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org
----- Original Message ----- From: "Chris Withers" <chrisw@nipltd.com> To: "Casey Duncan" <c.duncan@nlada.org> Cc: "Steve Alexander" <steve@cat-box.net>; "Wolfram Kerber" <wk@gallileus.de>; <zope-dev@zope.org> Sent: Wednesday, November 28, 2001 09:37 Subject: Re: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB
Casey Duncan wrote:
I would be willing to help both in coding and getting the code put
into
the Zope core.
<raises hand> me too!
Me three! :-)
Just to put my take on all of this...
As some of you may know, I've been looking at indexing for a while now in one way or another...
I'm interested in this too, and I'm keen to get a solution that will work with just the ZODB, without needing all of Zope.
Yes, I second, third and fourth that motion. I have a bunch of ideas kicking around for ZODB-level indexing. Let's talk more.
I don't believe this is a good idea any more, especially if you get into any serious amount of data. ZODB simply doesn't seem to scale to indexing very well. You all have no doubt experienced this with ZCatalog TextIndexes... I have a more flexible and pluggable indexer written for ZODB (not only Zope! ;-) but it didn't scale to anything like what I needed :-(
FileStorage goes through RAM at a rate of knots. Jim has a patch for this, but I haven't had a chance to stress test it yet. bsddb2Storage currently hammers disk, meaning it has worse performance when indexing than FileStorage ;-)
I'm currently working on a MySQL-based full text indexer with phrase matching, and potentially wildcards some time soon. For me, once this is cracked, FieldIndexes and the like are trivial in SQL, and I intend to encapsulate the whole thing in a Python class for ease of use. This is what I think might be the best solution: relational databases do tables well, and that's what indexing is all about: tables.
That said, I wasn't aware of Matt's work until very recently. I'd love to see an indexer that didn't require an RDB (or BerkeleyDB :-P) and scaled to gigabytes of data...
Perhaps we should arrange an "indexing and catalog" chat on #zope.
Storage of indexed data is one aspect, but there is also a need for components like lexers, stemmers, splitters etc. Oracle Intermedia, as an example, has a very flexible architecture for handling these components (for all that Oracle Intermedia sucks). It would also be interesting to catalog structured documents (e.g. XML), to be able to specify queries that involve structural information. Such a project is not trivial and cannot be handled by one person, but requires several volunteers ;-) Andreas
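The kind of pluggable splitter/stemmer/stopword pipeline Andreas describes might be sketched like this. All names are invented, and the "s-stemmer" is a deliberately crude stand-in for a real stemmer such as Porter's.

```python
# Pluggable text-processing pipeline sketch (invented names):
# a splitter feeds word-level stages such as a stopword filter and a
# stemmer; a stage returns None to drop a word.

STOPWORDS = {'the', 'a', 'an', 'of'}

def stop_filter(word):
    return None if word in STOPWORDS else word

def s_stemmer(word):
    # Deliberately crude stand-in for a real stemmer (e.g. Porter's):
    # strip one trailing 's' from longer words.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

def make_pipeline(*stages):
    def pipeline(text):
        words = text.lower().split()  # the "splitter"
        for stage in stages:
            words = [out for out in map(stage, words) if out is not None]
        return words
    return pipeline

index_words = make_pipeline(stop_filter, s_stemmer)
```

The point of the architecture is that each stage is swappable: a different language just means a different stopword list and stemmer plugged into the same pipeline.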
Andreas Jung wrote:
Storage of indexed data is one aspect, but there is also a need for components like lexers, stemmers, splitters etc. Oracle Intermedia, as an example, has a very flexible architecture for handling these components (for all that Oracle Intermedia sucks).
Hmmm... hopefully that isn't _why_ it sucks ;-)
It would also be interesting to catalog structured documents (e.g. XML), to be able to specify queries that involve structural information.
Yup, although right now I'm specifically interested in giving Python and Zope an easy-to-use indexing engine that does full text properly and scales well. The other problems appear to be somewhat easier to solve...
Such a project is not trivial and cannot be handled by one person, but requires several volunteers ;-)
If you're making an assumption about the way I'm working there, you may be mistaken ;-) Chris
On Wednesday 28 November 2001 09:37 am, Chris Withers allegedly wrote:
Casey Duncan wrote:
I would be willing to help both in coding and getting the code put into the Zope core.
<raises hand> me too!
Me three! :-)
Just to put my take on all of this...
As some of you may know, I've been looking at indexing for a while now in one way or another...
I'm interested in this too, and I'm keen to get a solution that will work with just the ZODB, without needing all of Zope.
Yes, I second, third and fourth that motion. I have a bunch of ideas kicking around for ZODB-level indexing. Let's talk more.
I don't believe this is a good idea any more, especially if you get into any serious amount of data. ZODB simply doesn't seem to scale to indexing very well. You all have no doubt experienced this with ZCatalog TextIndexes... I have a more flexible and pluggable indexer written for ZODB (not only Zope! ;-) but it didn't scale to anything like what I needed :-(
I'm not sure I want to store the indexes in the ZODB, just index ZODB data at a low level.
FileStorage goes through RAM at a rate of knots. Jim has a patch for this, but I haven't had a chance to stress test it yet. bsddb2Storage currently hammers disk, meaning it has worse performance when indexing than FileStorage ;-)
Yup, I think I have a solution, but it'll involve some coding ;^)
I'm currently working on a MySQL-based full text indexer with phrase matching, and potentially wildcards some time soon. For me, once this is cracked, FieldIndexes and the like are trivial in SQL, and I intend to encapsulate the whole thing in a Python class for ease of use. This is what I think might be the best solution: relational databases do tables well, and that's what indexing is all about: tables.
I would rather avoid having to use a relational database unless I have to. Perhaps the index pluggability could be made to support different backends (as FileStorage et al. do).
That said, I wasn't aware of Matt's work until very recently. I'd love to see an indexer that didn't require an RDB (or BerkeleyDB :-P) and scaled to gigabytes of data...
Yup, me too.
Perhaps we should arrange an "indexing and catalog" chat on #zope.
...definitely. When shall we set a time and date?
OK, I'm available all this week, but I'm not as available the next two weeks. Let's find a good time.
cheers,
Chris
/---------------------------------------------------\ Casey Duncan, Sr. Web Developer National Legal Aid and Defender Association c.duncan@nlada.org \---------------------------------------------------/
participants (11)
- Andreas Jung
- barry@zope.com
- Casey Duncan
- Chris Withers
- emf
- Jeffrey P Shell
- Jim Fulton
- Matt Hamilton
- Steve Alexander
- Toby Dickenson
- Wolfram Kerber