Casey Duncan wrote:
I'm not sure I want to store the indexes in the ZODB, just index ZODB data at a low level.
Ah, okay, and yes, in that case, I am in complete agreement ;-) (the level I'm aiming at is just to be able to index python objects, I'll leave plugging that into the ZODB architecture up to someone who understands it better...)
Yup, I think I have a solution, but it'll involve some coding ;^)
Ooo...care to explain? :-)
I would rather avoid having to use a relational database unless I have to. Perhaps the index pluggability could be made to support different backends (like FileStorage et al does).
Yeah, unfortunately, the difficult bit is combining queries: gimme the results where index1=='fish' and index2 is between 2 and 5kg.
if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?
That said, I wasn't aware of Matt's work up until very recently. I'd love to see an Indexer that didn't require an RDB (or BerkeleyDB :-P) and scaled to gigabytes of data...
Yup, me too.
Well, I'm just purchasing my copy of Managing Gigabytes now ;-)
OK, I'm available all this week, but I'm not as available the next two weeks. Let's find a good time.
I'm available any time and date, just as long as I get a coupla days notice...
cheers,
Chris
On Thu, 29 Nov 2001, Chris Withers wrote:
I would rather avoid having to use a relational database unless I have to. Perhaps the index pluggability could be made to support different backends (like FileStorage et al does).
Yeah, unfortunately, the difficult bit is combining queries: gimme the results where index1=='fish' and index2 is between 2 and 5kg.
if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?
Is there not a set datatype in python that could be used? Admittedly, most of the stuff in MG is about textual searches rather than exact searches (it can do boolean searches too, but the book is mainly about ranking). It uses an algorithm called the 'Cosine Ranking Algorithm'. Basically, imagine an N-dimensional space, where N is the number of terms in your vocabulary, and represent each document as a vector in that space whose direction is the composite of the terms that appear in it. You then represent a query string as a vector in the same space. The similarity between the document and the query is given by the angle between the two vectors: the smaller the angle, the greater the similarity.
Still with me? :)
-Matt
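A minimal sketch of the cosine ranking idea described above, assuming raw term counts as the vector components and whitespace tokenisation. The thread does not say how MG weights terms, so the weighting and the function names here are illustrative assumptions, not the book's algorithm.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document or query matches nothing
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, doc_id) pairs for a doc_id -> text mapping, best first."""
    query_vec = term_vector(query)
    scored = [(cosine_similarity(query_vec, term_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return sorted(scored, reverse=True)
```

A larger cosine means a smaller angle, so sorting by descending score puts the most similar documents first. A real indexer would precompute the document vectors rather than rebuilding them on every query.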
On Thursday, November 29, 2001, at 04:03 AM, Matt Hamilton wrote:
On Thu, 29 Nov 2001, Chris Withers wrote:
I would rather avoid having to use a relational database unless I have to. Perhaps the index pluggability could be made to support different backends (like FileStorage et al does).
Yeah, unfortunately, the difficult bit is combining queries: gimme the results where index1=='fish' and index2 is between 2 and 5kg.
if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?
Is there not a set datatype in python that could be used? Admittedly,
[SNIP]
There is a 'sets' directory in the Python CVS (in the 'nondist/sandbox' area). I think it was a proposed datatype that didn't quite make the cut for 2.2..(?)
Jeffrey P Shell, jeffrey@cuemedia.com
There are also set objects like OOSets and IISets that can be used in intersection and union operations as documented in the BTrees module.
----- Original Message ----- From: "Jeffrey P Shell" jeffrey@cuemedia.com To: "Matt Hamilton" matth@netsight.co.uk Cc: zope-dev@zope.org Sent: Thursday, November 29, 2001 12:18 PM Subject: Re: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB
Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
Chris McDonough wrote:
There are also set objects like OOSets and IISets that can be used in intersection and union operations as documented in the BTrees module.
I never understood the documentation :-(
What's the difference between an OOSet and an IISet? (I think I know the answer to that one but thought I'd check...)
Do things like IOSets and OISets exist, if so, what are they?
cheers,
Chris
AFAIK, OOSet is a set of objects and IISet is a set of integers. That's the simplest usage of them. Anything beyond that you'll need to experiment.
----- Original Message ----- From: "Chris Withers" chrisw@nipltd.com To: "Chris McDonough" chrism@zope.com Cc: "Matt Hamilton" matth@netsight.co.uk; "Jeffrey P Shell" jeffrey@cuemedia.com; zope-dev@zope.org Sent: Tuesday, December 04, 2001 11:29 AM Subject: Re: [Zope-dev] Sets
Matt Hamilton wrote:
if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?
Is there not a set datatype in python that could be used?
There is, but what would happen if index1 returned 25,000 results and index2 returned 250 and you were going to AND the results. Pouring the sets into Python data structures all the time doesn't sound too efficient...
I favour the idea of having several 'engines', like MySQL has table handlers, and letting the user pick which one they want to use.
I'm gonna try all this out at: http://sourceforge.net/projects/pythonindexer/
Admittedly, most of the stuff in MG is about textual searches rather than exact searches
Yeah, but that's the most difficult thing ;-)
(it can do boolean searches too, but the book is mainly about ranking).
Please god tell me they cover phrase matching :-S
[snip headf*ck]
Urm, maybe they'll take it a little slower than that? ;-)
cheers,
Chris
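One way to AND results from different engines without pouring both full result sets into Python, sketched under assumptions: each engine can estimate a result size, stream matching ids, and probe a single id. The `DictEngine` class and its method names are hypothetical stand-ins for a SQL or ZODB backend, not an existing API.

```python
class DictEngine:
    """Toy in-memory engine mapping an indexed value to a set of doc ids.

    Hypothetical stand-in for any backend (SQL, ZODB, ...).
    """

    def __init__(self):
        self._postings = {}  # value -> set of doc ids
        self._values = {}    # doc id -> value

    def index(self, doc_id, value):
        self._postings.setdefault(value, set()).add(doc_id)
        self._values[doc_id] = value

    def count(self, predicate):
        """Size of the result set for a predicate, without building it."""
        return sum(len(ids) for value, ids in self._postings.items()
                   if predicate(value))

    def search(self, predicate):
        """Yield matching doc ids lazily."""
        for value, ids in self._postings.items():
            if predicate(value):
                yield from ids

    def matches(self, doc_id, predicate):
        """Probe one document instead of fetching the whole result set."""
        return doc_id in self._values and predicate(self._values[doc_id])


def and_query(clauses):
    """AND (engine, predicate) clauses, driving from the smallest result."""
    if not clauses:
        return set()
    ordered = sorted(clauses, key=lambda clause: clause[0].count(clause[1]))
    (driver, driver_pred), rest = ordered[0], ordered[1:]
    return {doc_id for doc_id in driver.search(driver_pred)
            if all(engine.matches(doc_id, pred) for engine, pred in rest)}


# index1 == 'fish' and index2 between 2 and 5 kg, each in its own engine
species, weight = DictEngine(), DictEngine()
for doc_id, (name, kg) in enumerate([("fish", 3.0), ("fish", 7.5), ("fowl", 4.0)]):
    species.index(doc_id, name)
    weight.index(doc_id, kg)

result = and_query([(species, lambda v: v == "fish"),
                    (weight, lambda v: 2 <= v <= 5)])
```

With 25,000 hits on one index and 250 on the other, this walks only the 250 and probes the larger index once per candidate. Whether that beats a bulk intersection depends on how cheap a single probe is in the backend concerned.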
On Tue, 4 Dec 2001, Chris Withers wrote:
(it can do boolean searches too, but the book is mainly about ranking).
Please god tell me they cover phrase matching :-S
No they don't really (if I remember right). I think they do talk about storing the position of the word in the document, so that could help. I need to dig the book out, I looked at it about 18 months ago.
[snip headf*ck]
Urm, maybe they'll take it a little slower than that? ;-)
Yes they do. Condensing ~500 pages into one paragraph is a bit tricky :)
-Matt
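A minimal sketch of how storing word positions could support phrase matching, as suggested above. This is an illustrative positional index with hypothetical function names and naive whitespace tokenisation, not the scheme from the book.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Build term -> {doc_id -> [positions]} from a doc_id -> text mapping."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return doc ids in which the terms of `phrase` appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # every later term must sit at the next position in the same doc
            if all(doc_id in index[term] and start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms[1:], start=1)):
                hits.add(doc_id)
                break
    return hits
```

The cost is index size: every occurrence of every word is recorded, not just the fact that a word appears in a document.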
Matt Hamilton wrote:
No they don't really (if I remember right). I think they do talk about storing the position of the word in the document, so that could help. I need to dig the book out, I looked at it about 18 months ago.
*puts gun to head*
Chris
Can you all recommend any other books on information retrieval? I was looking at Amazon last night and found a few that looked interesting, but I'm just the type of guy that has to "leaf through" before I buy the damn thing. I guess I just need to find a good tech book store around here...
I think my first real proposal of any significance will be to replace the catalog with a truly industrial-strength indexing behemoth that can be plugged into this whole "component arch." thingamawhammy.
BTW: I still wanna chat some time about this, I just need to decide when, perhaps late next week...
-Casey
__________________________________________________ Do You Yahoo!? Buy the perfect holiday gifts at Yahoo! Shopping. http://shopping.yahoo.com
On Tue, 4 Dec 2001, Casey Duncan wrote:
Can you all recommend any other books on information retrieval? I was looking at Amazon last night and found a few that looked interesting, but I'm just the type of guy that has to "leaf through" before I buy the damn thing. I guess I just need to find a good tech book store around here...
"Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto is pretty good too. Covers much more than just indexing (eg. user interfaces, languages, evaluation of effectiveness, distributed IR, Digital Libraries etc).
If you are a member of the ACM there is lots of IR stuff in their digital library. Also the New Zealand Digital Library (www.nzdl.org) has some good links on it (the site can be quite slow at times)
I think my first real proposal of any significance will be to replace the catalog with a truly industrial-strength indexing behemoth that can be plugged into this whole "component arch." thingamawhammy.
BTW: I still wanna chat some time about this, I just need to decide when, perhaps late next week...
I've normally got a window open on #zope most of the day (GMT), my nick is HammerToe (long story, that is not as interesting or painful as it sounds, and involves neither a hammer nor a toe). You might need to /msg me or beep me to get my attention :)
-Matt
Casey Duncan wrote:
I think my first real proposal of any significance will be to replace the catalog with a truly industrial-strength indexing behemoth that can be plugged into this whole "component arch." thingamawhammy.
*cough* --->
http://sourceforge.net/projects/pythonindexer
Care to help? I'm hoping to get started on it this evening with Docs and the like :-)
cheers,
Chris
Could you explain the problems that should be solved by this a little? I find it rather hard to contribute anything useful without knowing some concrete examples ... Btw, i have a prototype Catalog that speeds up things a bit in usual scenarios. It is cacheable, implements an extended query interface and solves two performance issues with certain queries. I intend to put that into a proposal :-), but probably not until next week. If anybody wants to discuss that and/or take a look, just mail me.
Wolfram
----- Original Message ----- From: "Chris Withers" chrisw@nipltd.com To: "Casey Duncan" casey_duncan@yahoo.com Cc: "Matt Hamilton" matth@netsight.co.uk; "Casey Duncan" c.duncan@nlada.org; "Steve Alexander" steve@cat-box.net; "Wolfram Kerber" wk@gallileus.de; zope-dev@zope.org Sent: Tuesday, December 04, 2001 10:36 PM Subject: Re: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB
Wolfram Kerber wrote:
Could you explain the problems that should be solved by this a little?
The idea is to provide a flexible, scalable and powerful indexing solution that works out of the box for Python, rather than Zope.
I find it rather hard to contribute anything useful without knowing some concrete examples ...
Hehe, wait for the docs, which I'm hopefully gonna start checking in today...
Btw, i have a prototype Catalog that speeds up things a bit in usual scenarios. It is cacheable, implements an extended query interface and solves two performance issues with certain queries. I intend to put that into a proposal :-), but probably not until next week. If anybody wants to discuss that and/or take a look, just mail me.
Well, I'd love to see this, where can I get the code?
cheers,
Chris
Chris Withers wrote:
Well, I'd love to see this, where can I get the code?
Ok, i've put it here: http://www.gallileus.info/gallileus/members/m_wolf/publications/100756611705/10075665840/protoCat.zip
I should put together some doc about the changes as well ...
Wolfram Kerber wrote:
Ok, i've put it here: http://www.gallileus.info/gallileus/members/m_wolf/publications/100756611705/10075665840/protoCat.zip
I should put together some doc about the changes as well ...
Indeed :-)
I shall have a look though...
Chris
Chris, hows about adding me to this project, my s'forge username is cduncan.
Thanks.
-Casey
I posted a few references I found around the web on info retrieval and indexing to the s'forge doc area. I think Chris'll need to "approve" them first before they are publicly accessible.
Chris: Do I have commit privs to the CVS? If so, I'll start helping out with some requirements and whatever other documentation we need.
Also, we should probably create a prototypes or some-such directory in the CVS for existing code. I have a couple of things that can go up there for reference or just as samples.
Also Chris, please create a mailing list for the project. And grant me admin to whatever you want to, I can certainly help administer the project as well.
ttfn,
Casey
--- Chris Withers chrisw@nipltd.com wrote:
Casey Duncan wrote:
Chris, hows about adding me to this project, my s'forge username is cduncan.
done... lemme know if ya need anything else :-)
Chris
Casey Duncan wrote:
I posted a few references I found around the web on info retrieval and indexing to the s'forge doc area.
SF understands HTML, not structured text ;-)
I think Chris'll need to "approve" them first before they are publicly accessible.
Yeah, you can do this now too...
Chris: Do I have commit privs to the CVS? If so, I'll start helping out with some requirements and whatever other documentation we need.
You do now :-) Please have a look at the 'Documentation' module and commit any changes you wanna make on a branch :-)
Also, we should probably create a prototypes or some-such directory in the CVS for existing code.
I've created a SourceForge Tracker for this and put up my and Marcus Collins' prototype SQLIndexer there...
I have a couple of things that can go up there for reference or just as samples.
Cool :-)
Also Chris, please create a mailing list for the project.
http://lists.sourceforge.net/mailman/listinfo/pythonindexer-discuss
Can anyone who's interested please sign up to that...
cheers,
Chris