Request for a Pluggin Index (NameIndex)
Hi,

If anyone's got the time or fancies a challenge, could they write an index that behaves as follows:

Indexed values:
1) C.J.Withers
2) Chris Withers
3) C Petrilli
4) Christopher McDonough

search              result
C                   1,2,3,4
C.J.Withers         1
c.j.Withers         1
withers mcdonough   1,2,4
Chris               2,4
Christo             4

I think the basic rules are:
- split on whitespace and punctuation (not accented characters and the like ;-)
- index each remaining name part
- when searching, return all records where any of the name parts match something like: string.find(name_part,search_expression)

...oh yeah, and do it blindingly quickly ;-)

This would be really useful for the Creator Dublin Core field and anywhere you're searching for someone's name. The CMF could benefit from it, and it would eliminate the phrase next to the Creator field which has haunted me from Squishdot:

" Note that you must enter their username exactly. "

cheers,

Chris
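A minimal sketch of the rules above in plain Python, with no Zope dependency. The names `NameIndexSketch`, `name_parts`, `index_value` and `search` are hypothetical. Treating whitespace-separated query terms as alternatives (OR) and the punctuation-separated fragments inside one term as all required (AND) is an assumption: it is one reading that reproduces the table of expected results, not a rule stated in the request. The linear scan over the vocabulary keeps the sketch short and is not "blindingly quick".

```python
import re

# Split on anything that is not a letter or digit. In Python 3, \W is
# Unicode-aware, so accented letters stay inside a name part.
_SPLIT = re.compile(r"[\W_]+")


def name_parts(value):
    """Return the lowercased name parts of a string, e.g. 'C.J.Withers' -> c, j, withers."""
    return [part.lower() for part in _SPLIT.split(value) if part]


class NameIndexSketch:
    """Hypothetical illustration only; not a Zope PluggableIndex."""

    def __init__(self):
        self._parts = {}  # lowercased name part -> set of record ids

    def index_value(self, rid, value):
        for part in name_parts(value):
            self._parts.setdefault(part, set()).add(rid)

    def _match(self, fragment):
        # Substring test against every indexed name part (linear scan).
        hits = set()
        for part, rids in self._parts.items():
            if fragment in part:
                hits |= rids
        return hits

    def search(self, query):
        result = set()
        for term in query.split():          # whitespace-separated terms: OR (assumption)
            fragments = name_parts(term)
            if not fragments:
                continue
            hits = self._match(fragments[0])
            for fragment in fragments[1:]:  # fragments within one term: AND (assumption)
                hits &= self._match(fragment)
            result |= hits
        return sorted(result)


index = NameIndexSketch()
names = ["C.J.Withers", "Chris Withers", "C Petrilli", "Christopher McDonough"]
for rid, name in enumerate(names, 1):
    index.index_value(rid, name)

for query in ["C", "C.J.Withers", "c.j.Withers", "withers mcdonough", "Chris", "Christo"]:
    print(query, index.search(query))
```

Replacing the linear scan in `_match` with a structure built for substring lookup is where the speed would have to come from.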
Looks like you should write your own index type. Zope 2.4 comes with a PluggableIndex interface to allow third-party indexes to be integrated into the Catalog.

Andreas

----- Original Message -----
From: "Chris Withers" <chrisw@nipltd.com>
To: <zope-dev@zope.org>
Sent: Monday, June 04, 2001 4:05 PM
Subject: [Zope-dev] Request for a Pluggin Index (NameIndex)
_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
Looks like you should write your own index type. Zope 2.4 comes with a PluggableIndex interface to allow third-party indexes to be integrated into the Catalog.
Yeah, I know all that, and I'm very much looking forward to playing with this. :-) However, the email was an invitation for anyone who's interested and currently has time on their hands (yeah, I know, there's lots of us like that ;-) to have a go at writing the index type for me... cheers, Chris
----- Original Message ----- From: "Chris Withers" <chrisw@nipltd.com> To: "Andreas Jung" <andreas@andreas-jung.com> Cc: "zope-dev" <zope-dev@zope.org> Sent: Tuesday, June 05, 2001 11:30 AM Subject: Re: [Zope-dev] Request for a Pluggin Index (NameIndex)
Looks like you should write your own index type. Zope 2.4 comes with a PluggableIndex interface to allow third-party indexes to be integrated into the Catalog.
Yeah, I know all that, and I'm very much looking forward to playing with this. :-) However, the email was an invitation for anyone who's interested and currently has time on their hands (yeah, I know, there's lots of us like that ;-) to have a go at writing the index type for me...
I think it should not be a large problem to write such an index because it looks like you can subclass the TextIndex class and replace/extend the needed functionality. Andreas
On Tue, 5 Jun 2001, Chris Withers wrote:
Looks like you should write your own index type. Zope 2.4 comes with a PluggableIndex interface to allow third-party indexes to be integrated into the Catalog.
Yeah, I know all that, and I'm very much looking forward to playing with this. :-) However, the email was an invitation for anyone who's interested and currently has time on their hands (yeah, I know, there's lots of us like that ;-) to have a go at writing the index type for me...
I would like to help if I had time :)

I think the most efficient way of doing what you want is to construct an index based on a 'Suffix Trie'. This essentially allows matching of arbitrary substrings very quickly; the only problem is that it takes up a fair amount of space. The upside is that it can be updated and incrementally added to quite easily (unlike many inverted list implementations).

I confess I have not had the chance to look at the pluggable index types in 2.4, but would really like to, as I would like to port over some indexing code I was working on for another project that allows compressed storage of inverted lists for indexes. On average you can store a 32-bit document id/ref in around 4 bits, which means you save a lot of space and can keep stopwords in the lexicon (as an example, try searching for 'to be or not to be' in an index that removes stopwords :). Not only do you save space, but due to the way the inverted list is read and decompressed you save time on disk access for large indexes, as there is less to physically read.

-Matt

--
Matt Hamilton matth@netsight.co.uk
Netsight Internet Solutions, Ltd. Business Vision on the Internet
http://www.netsight.co.uk +44 (0)117 9090901
Web Hosting | Web Design | Domain Names | Co-location | DB Integration
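A rough sketch of the suffix-trie idea, assuming a plain in-memory structure; `SuffixTrieSketch` and its methods are hypothetical names, and this is not Matt's code. Every suffix of every word is inserted, and each node remembers which records pass through it, so a substring lookup costs one step per character of the fragment. The space cost he mentions is visible here: a word of length n creates on the order of n*n/2 node visits.

```python
class _Node:
    __slots__ = ("children", "rids")

    def __init__(self):
        self.children = {}  # character -> child node
        self.rids = set()   # records containing the substring that ends at this node


class SuffixTrieSketch:
    """Hypothetical illustration of substring lookup via a suffix trie."""

    def __init__(self):
        self._root = _Node()

    def add(self, rid, word):
        # Incremental update: insert every suffix of the word.
        word = word.lower()
        for start in range(len(word)):
            node = self._root
            for ch in word[start:]:
                node = node.children.setdefault(ch, _Node())
                node.rids.add(rid)

    def find(self, fragment):
        # Walk one node per character; an unknown or empty fragment gives an empty set.
        node = self._root
        for ch in fragment.lower():
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.rids)


trie = SuffixTrieSketch()
for rid, word in [(1, "C"), (1, "J"), (1, "Withers"), (2, "Chris"), (2, "Withers"),
                  (3, "C"), (3, "Petrilli"), (4, "Christopher"), (4, "McDonough")]:
    trie.add(rid, word)

print(sorted(trie.find("hris")))
print(sorted(trie.find("ither")))
```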
Matt Hamilton wrote:
I would like to help if I had time :) I think the most efficient way of doing what you want is to construct an index based on a 'Suffix Trie'. This essentially allows matching of arbitrary substrings very quickly; the only problem is that it takes up a fair amount of space. The upside is that it can be updated and incrementally added to quite easily (unlike many inverted list implementations).
I confess I have not had the chance to look at the pluggable index types in 2.4, but would really like to as I would like to port over some indexing code I was working on for another project that allows compressed storage of inverted lists for indexes. On average you can store a 32-bit document id/ref in around 4 bits, which means you save a lot of space and can keep stopwords in the lexicon (as an example try searching for 'to be or not to be' in an index that removes stopwords :). Not only do you save space, but due to the way the inverted list is read and decompressed you save time on disk access for large indexes as there is less to physically read.
Wow Matt, you seem to know what you're talking about :-) If you get a chance to implement the index I asked about, please gimme a shout, I'd love to try it out... cheers, Chris PS: Whereabouts in the UK are you?
There is a new How-To for PluginIndexes:

http://www.zope.org/Members/ajung/howto/PluginIndexes/index_html

Andreas

----- Original Message -----
From: "Chris Withers" <chrisw@nipltd.com>
To: "Matt Hamilton" <matth@netsight.co.uk>
Cc: "Andreas Jung" <andreas@andreas-jung.com>; "zope-dev" <zope-dev@zope.org>
Sent: Monday, June 11, 2001 9:10 AM
Subject: Re: [Zope-dev] Request for a Pluggin Index (NameIndex)
Andreas Jung wrote:
There is a new How-To for PluginIndexes:
http://www.zope.org/Members/ajung/howto/PluginIndexes/index_html
Looks great :-)

Coupla Questions:

Is there anything you can do in the index_object method to re-use ZCatalog's "get all attributes and call them if they're callable"?

In uniqueValues, what do the lengths that withLengths returns actually mean?

In _apply_index, are ResultSet objects and how to build them documented anywhere? What is cid used for?

I take it query_options is better understood by looking at the PathIndex example?

And finally, has anyone considered writing a Pluggable Index that uses an SQL index or table of some sort to do its indexing?

How about a 'Generic Pluggable Index' that lets you implement the interface using Python Scripts, ZSQL methods, etc?

cheers,

Chris
----- Original Message ----- From: "Chris Withers" <chrisw@nipltd.com> To: "Andreas Jung" <andreas@digicool.com> Cc: "Matt Hamilton" <matth@netsight.co.uk>; "zope-dev" <zope-dev@zope.org> Sent: Tuesday, June 12, 2001 9:41 AM Subject: [Zope-dev] Pluggable Index How-To Questions
Andreas Jung wrote:
There is a new How-To for PluginIndexes:
http://www.zope.org/Members/ajung/howto/PluginIndexes/index_html
Looks great :-)
Coupla Questions:
Is there anything you can do in the index_object method to re-use ZCatalog's "get all attributes and call them if they're callable"?
Don't understand the question...maybe I don't know this ZCatalog feature.
In uniqueValues, what do the lengths that withLengths returns actually mean?
Good question - I think uniqueValues is only used for FieldIndex. I think you usually don't need to implement it - I must check this...
In _apply_index, are ResultSet objects and how to build them documented anywhere? What is cid used for?
Best way is to take a look in PathIndex.py..
I take it query_options is better understood by looking at the PathIndex example?
There is a new API for passing parameters to the searchResults() of the ZCatalog (see ZCatalog/help/ZCatalog_Parameters.stx and doc/changenotes). query_options is a list of the options that the index is interested in when it gets a search request.
And finally, has anyone considered writing a Pluggable Index that uses an SQL index or table of some sort to do its indexing?
never heard of it.
How about a 'Generic Pluggable Index' that lets you implement the interface using Python Scripts, ZSQL methods, etc?
uuuuuuuhhhhhhhhhhhhh.....I think you can write that as a Product. But I don't think we will write this. And I don't like the idea...I prefer to write such a package with VI and store it in the filesystem :-) Andreas
Andreas Jung wrote:
Is there anything you can do in the index_object method to re-use ZCatalog's "get all attributes and call them if they're callable"?
Don't understand the question...maybe I don't know this ZCatalog feature.
My perception is that the 'classic' ZCatalog Indexes have a method something like:

    def getValue(self, obj):
        try:
            # self.id is the name of the index.
            # The interface doesn't specify how to
            # get hold of this :-S
            value = getattr(obj, self.id)
            if callable(value):
                value = value()
        except (AttributeError, TypeError):
            value = None
        return value

...which gets the value to index. I was wondering if that function is available anywhere rather than having to re-implement it each time you write a new pluggable index. That said, I guess the 'classic' indexes have been re-implemented as PluggableIndexes?
In uniqueValues, what do the lengths that withLengths returns actually mean?
Good question - I think uniqueValues is only used for FieldIndex. I think you usually don't need to implement it - I must check this...
Well, IMHO, uniqueValues shouldn't be part of the interface. AFAIK, it only makes sense with certain types of index: KeywordIndexes and possibly FieldIndexes. Is that the case?
In _apply_index, are ResultSet objects and how to build them documented anywhere? What is cid used for?
Best way is to take a look in PathIndex.py..
OK :-)
There is a new API for passing parameters to the searchResults() of the ZCatalog (see ZCatalog/help/ZCatalog_Parameters.stx and doc/changenotes). query_options is a list of the options that the index is interested in when it gets a search request.
OK...
How about a 'Generic Pluggable Index' that lets you implement the interface using Python Scripts, ZSQL methods, etc?
uuuuuuuhhhhhhhhhhhhh.....I think you can write that as a Product.
hehe... cool :-)
But I don't think we will write this. And I don't like the idea...I prefer to write such a package with VI and store it in the filesystem :-)
I agree, but when you're exploring what you want to do and don't have access to the filesystem, it could be really useful. thanks for the answers, Chris
On Tue, 12 Jun 2001, Andreas Jung wrote:
In uniqueValues, what do the lengths that withLengths returns actually mean?
Good question - I think uniqueValues is only used for FieldIndex.
Right
I think you usually don't need to implement it - I must check this...
if you're making a text index, I guess you could return a list of all the unique words, but this is probably not that useful. uniqueValuesFor is for field indexes (and keyword indexes) -Michel
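For a field- or keyword-style index, one plausible reading of the lengths is "how many catalogued records hold each value". That reading is an assumption and is not confirmed anywhere in this thread; the class below is a hypothetical stand-in written in plain Python, not Zope's FieldIndex.

```python
class FieldIndexSketch:
    """Hypothetical value -> record-id mapping, to illustrate uniqueValues."""

    def __init__(self):
        self._index = {}  # indexed value -> set of record ids

    def index_value(self, rid, value):
        self._index.setdefault(value, set()).add(rid)

    def uniqueValues(self, withLengths=False):
        values = sorted(self._index)
        if not withLengths:
            return tuple(values)
        # Assumption: the "length" is the number of records holding the value.
        return tuple((value, len(self._index[value])) for value in values)


creators = FieldIndexSketch()
for rid, creator in [(1, "chrisw"), (2, "chrisw"), (3, "ajung"), (4, "matth")]:
    creators.index_value(rid, creator)

print(creators.uniqueValues())
print(creators.uniqueValues(withLengths=True))
```

This also suggests why the method fits poorly on a text index: the "values" there would be every word in the lexicon.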
On Mon, 11 Jun 2001, Chris Withers wrote:
Wow Matt, you seem to know what you're talking about :-)
My final year University project was to create an Open Source mailing list archive :) I did quite a bit of reading into information retrieval and assorted algorithms and data structures. I had a prototype running for quite some time, but it is currently down as I am wiping the machine to start again in Python :) The original system was a mix of C/Perl/Python and returned results in XML, which were then formatted via XSLT.

Once I get a spare minute I am going to try and re-implement it in Python using ZODB (with BerkeleyDB storage). I might try and port some of the code over to work as a PluggableIndex too.

One of the main tasks is to write a Python wrapper around my compression code. I will have to look more closely at how to write Python modules in C, as it does lots of bit twiddling in a very tight loop. The object will basically be a compressed list to which you can append ascending integers and which will allow various fast union/intersection operations with other similar objects. This in itself may be sufficient to use in a PluginIndex.
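To make the "compressed list of ascending integers" concrete, here is a small pure-Python sketch. It stores the gaps between consecutive ids in a variable-byte encoding, which is a simpler technique swapped in for illustration: it needs at least 8 bits per id, so it does not reach the figure of around 4 bits quoted earlier in the thread, which would need a bit-level code. `CompressedIdList` and its methods are hypothetical names, not Matt's code.

```python
import heapq


class CompressedIdList:
    """Hypothetical append-only list of strictly ascending document ids."""

    def __init__(self, ids=()):
        self._data = bytearray()  # variable-byte encoded gaps
        self._last = -1           # last id appended (-1 means empty)
        self._count = 0
        for doc_id in ids:
            self.append(doc_id)

    def append(self, doc_id):
        if doc_id <= self._last:
            raise ValueError("ids must be appended in strictly ascending order")
        gap = doc_id - self._last
        while gap >= 0x80:        # 7 payload bits per byte, high bit = "more follows"
            self._data.append((gap & 0x7F) | 0x80)
            gap >>= 7
        self._data.append(gap)
        self._last = doc_id
        self._count += 1

    def __len__(self):
        return self._count

    def __iter__(self):
        current, gap, shift = -1, 0, 0
        for byte in self._data:
            gap |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7
            else:
                current += gap
                yield current
                gap, shift = 0, 0

    def intersection(self, other):
        # Classic merge of two sorted streams, decoding lazily.
        result = CompressedIdList()
        it_a, it_b = iter(self), iter(other)
        a, b = next(it_a, None), next(it_b, None)
        while a is not None and b is not None:
            if a == b:
                result.append(a)
                a, b = next(it_a, None), next(it_b, None)
            elif a < b:
                a = next(it_a, None)
            else:
                b = next(it_b, None)
        return result

    def union(self, other):
        result = CompressedIdList()
        previous = None
        for doc_id in heapq.merge(self, other):
            if doc_id != previous:  # drop ids present in both lists
                result.append(doc_id)
                previous = doc_id
        return result


a = CompressedIdList([1, 5, 9, 300, 301])
b = CompressedIdList([5, 6, 300, 70000])
print(list(a.intersection(b)))
print(list(a.union(b)))
```

Because both operations only walk the lists front to back, they never need the whole list decompressed at once, which is the property that makes such an object usable for combining index results.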
If you get a chance to implement the index I asked about, please gimme a shout, I'd love to try it out...
Unfortunately I don't have the time. Unless I can use it myself directly in a project we have funding for (or unless anyone wants to fund my time to develop it) I will have to wait until I have some more time on my hands.
PS: Whereabouts in the UK are you?
Bristol.

-Matt
Matt Hamilton wrote:
On Mon, 11 Jun 2001, Chris Withers wrote:
Wow Matt, you seem to know what you're talking about :-)
My final year University project was to create an Open Source mailing list archive :) I did quite a bit of reading into information retrieval and assorted algorithms and data structures.
Ah, okay :-)
Once I get a spare minute I am going to try and re-implement it in Python and using ZODB (with BerkeleyDB storage) I might try and port some of the code over to work as a PluggableIndex too.
Cool...
One of the main tasks is to write a Python wrapper around my compression code. I will have to look more closely at how to write Python modules in C, as it does lots of bit twiddling in a very tight loop. The object will basically be a compressed list to which you can append ascending integers and which will allow various fast union/intersection operations with other similar objects. This in itself may be sufficient to use in a PluginIndex.
Yeah, I'd love to see it...
Unfortunately I don't have the time. Unless I can use it myself directly in a project we have funding for (or unless anyone wants to fund my time to develop it) I will have to wait until I have some more time on my hands.
No worries... cheers, Chris
PS: Whereabouts in the UK are you?
Bristol.
hehe... will be out celebrating my birthday there this Wednesday evening :-) If you see me lying in a gutter on Thursday morning, please don't kick me too hard ;-)
On Monday 04 June 2001 16:55, Andreas Jung wrote:
Looks like you should write your own index type. Zope 2.4 comes with a PluggableIndex interface to allow third-party indexes to be integrated into the Catalog.
This brings up an interesting question: what is the best way to register a new plugin index that's distributed with a product? Glancing over the CVS logs, it looks as though plugin indexes are arranged to be the first product installed in Application.py. Given that, what is the suggested method for registering a new plugin index?

Kapil
----- Original Message ----- From: "ender" <kthangavelu@earthlink.net> To: "Andreas Jung" <andreas@andreas-jung.com> Cc: "zope-dev" <zope-dev@zope.org> Sent: Wednesday, June 06, 2001 5:30 PM Subject: Re: [Zope-dev] Request for a Pluggin Index (NameIndex)
On Monday 04 June 2001 16:55, Andreas Jung wrote:
Looks like you should write your own index type. Zope 2.4 comes with a PluggableIndex interface to allow third-party indexes to be integrated into the Catalog.
This brings up an interesting question: what is the best way to register a new plugin index that's distributed with a product? Glancing over the CVS logs, it looks as though plugin indexes are arranged to be the first product installed in Application.py. Given that, what is the suggested method for registering a new plugin index?
I think this should be the subject of a small How-To. Anyway... to register a plugin index you have to call "context.registerClass(...)". Take a look at PluginIndexes/__init__.py to see how Zope's indexes are registered. Other indexes should do it in the same way. The reason why PluginIndexes is installed as the first product is that there are some dependencies between PluginIndexes and other Zope Products. Products are usually initialized in alphabetical order, but in this case we made an exception.

Andreas
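A sketch of what such a registration could look like in a third-party product's __init__.py. Only the call to context.registerClass(...) comes from the message above; the NameIndex class, the two constructor functions, the permission string and the icon path are assumptions for illustration and should be checked against PluginIndexes/__init__.py. The RecordingContext class is a hypothetical stand-in so the sketch can be run outside Zope; inside Zope the real product context is passed to initialize().

```python
class NameIndex:
    """Hypothetical third-party index class (would implement the PluggableIndex interface)."""
    meta_type = "NameIndex"


def manage_addNameIndexForm(dispatcher):
    """Hypothetical add form; a real product would typically use a DTML file here."""
    return "<form>...</form>"


def manage_addNameIndex(dispatcher, id, REQUEST=None):
    """Hypothetical constructor that would create the index inside the catalog."""
    return NameIndex()


def initialize(context):
    # Called by Zope at product initialization with the product context.
    context.registerClass(
        NameIndex,
        permission="Add Pluggable Index",  # assumption: same permission as the stock indexes
        constructors=(manage_addNameIndexForm, manage_addNameIndex),
        icon="www/index.gif",              # assumption: an icon shipped with the product
        visibility=None,                   # assumption: keep it out of the global add list
    )


class RecordingContext:
    """Stand-in for Zope's product context, only for trying the sketch outside Zope."""

    def __init__(self):
        self.registered = []

    def registerClass(self, instance_class, **options):
        self.registered.append((instance_class, options))


context = RecordingContext()
initialize(context)
print(context.registered[0][0].meta_type)
```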
participants (7)
- Andreas Jung
- Andreas Jung
- Andreas Jung
- Chris Withers
- ender
- Matt Hamilton
- Michel Pelletier