Re: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Chris Withers

29 Nov 2001

10 a.m.

Casey Duncan wrote:

...

I'm not sure I want to store the indexes in the ZODB, just index ZODB data at a low level.

Ah, okay, and yes, in that case, I am in complete agreement ;-) (the level I'm aiming at is just to be able to index python objects, I'll leave plugging that into the ZODB architecture up to someone who understands it better...)

...

Yup, I think I have a solution, but it'll involve some coding ;^)

Ooo...care to explain? :-)

...

I would rather avoid having to use a relational database unless I have to. Perhaps the index pluggability could be made to support different backends (like FileStorage et al does).

Yeah, unfortunately, the difficult bit is combining queries: gimme the results where index1=='fish' and index2 is between 2 and 5kg. if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?

...

...
That said, I wasn't aware of Matt's work up until very recently. I'd love to see an Indexer that didn't require an RDB (or BerkleyDB :-P) and scaled to GigaBytes of Data...

Yup, me too.

Well, I'm just purchasing my copy of Managing Gigabytes now ;-)

...

OK, I'm available all this week, but I'm not as available the next two weeks. Let's find a good time.

I'm available any time and date, just as long as I get a coupla days notice... cheers, Chris

Matt Hamilton

29 Nov

11:03 a.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

On Thu, 29 Nov 2001, Chris Withers wrote:

...

...
I would rather avoid having to use a relational database unless I have to. Perhaps the index pluggability could be made to support different backends (like FileStorage et al does).

Yeah, unfortunately, the difficult bit is combining queries: gimme the results where index1=='fish' and index2 is between 2 and 5kg.

if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?

Is there not a set datatype in python that could be used?

Admittedly, most of the stuff in MG is about textual searches rather than exact searches (it can do boolean searches too, but the book is mainly about ranking). It uses an algorithm called the 'Cosine Ranking Algorithm'. Basically, imagine an N-dimensional space, where N is the number of terms in your vocabulary, and represent a document as a vector in that space whose direction is the composite of the terms that appear in it. You then represent a query string as a vector in the same space. The similarity between the document and the query is the angle between the two vectors: the smaller the angle, the greater the similarity. Still with me? :)

-Matt

-- Matt Hamilton matth@netsight.co.uk Netsight Internet Solutions, Ltd. Business Vision on the Internet http://www.netsight.co.uk +44 (0)117 9090901 Web Hosting | Web Design | Domain Names | Co-location | DB Integration
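A minimal sketch of the cosine ranking idea described in the message above, for readers who prefer code to geometry. It assumes raw term counts as vector components and whitespace tokenisation; the book's actual weighting scheme is not given in this thread, and the names `term_vector`, `cosine_similarity` and `rank` are made up for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a text into a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, doc_id) pairs sorted by descending similarity."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return sorted(scored, reverse=True)


# Hypothetical three-document collection.
docs = {
    "d1": "fresh fish for sale",
    "d2": "fish and chips",
    "d3": "zope object database",
}
ranking = rank("fresh fish", docs)
```

A larger cosine means a smaller angle, so the best match sorts first. Here `d1` shares both query terms, `d2` shares one, and `d3` shares none.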

Jeffrey P Shell

5:18 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

On Thursday, November 29, 2001, at 04:03 AM, Matt Hamilton wrote:

...

On Thu, 29 Nov 2001, Chris Withers wrote:

...
...
I would rather avoid having to use a relational database unless I have to. Perhaps the index pluggability could be made to support different backends (like FileStorage et al does).

Yeah, unfortunately, the difficult bit is combining queries: gimme the results where index1=='fish' and index2 is between 2 and 5kg.

if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?

Is there not a set datatype in python that could be used? Admittedly,

[SNIP] There is a 'sets' directory in the Python CVS (in the 'nondist/sandbox' area). I think it was a proposed datatype that didn't quite make the cut for 2.2..(?) Jeffrey P Shell, jeffrey@cuemedia.com

Chris McDonough

5:54 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

There are also set objects like OOSets and IISets that can be used in intersection and union operations as documented in the BTrees module. ----- Original Message ----- From: "Jeffrey P Shell" <jeffrey@cuemedia.com> To: "Matt Hamilton" <matth@netsight.co.uk> Cc: <zope-dev@zope.org> Sent: Thursday, November 29, 2001 12:18 PM Subject: Re: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB
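As a rough illustration of the set-based combining discussed here, the sketch below answers the "index1 == 'fish' and index2 between 2 and 5kg" query from earlier in the thread. It uses Python's built-in `set` type rather than BTrees so that it runs without ZODB installed; the BTrees modules offer module-level `intersection` and `union` functions for their own set types that play the same role. The index layouts and the names `species_index`, `weight_index` and `range_query` are assumptions made for this example only.

```python
from bisect import bisect_left, bisect_right

# Hypothetical equality index: value -> set of integer document ids.
species_index = {
    "fish": {1, 2, 3, 5},
    "fowl": {4, 6},
}

# Hypothetical range index: (weight_kg, doc_id) pairs kept sorted by weight.
weight_index = sorted([(0.5, 1), (2.0, 2), (3.5, 3), (4.0, 4), (5.0, 5), (7.0, 6)])


def range_query(index, low, high):
    """Return the ids whose indexed value lies in the closed range [low, high]."""
    start = bisect_left(index, (low, float("-inf")))
    stop = bisect_right(index, (high, float("inf")))
    return {doc_id for _, doc_id in index[start:stop]}


fish = species_index["fish"]
mid_weight = range_query(weight_index, 2, 5)

both = fish & mid_weight     # AND: ids present in both result sets
either = fish | mid_weight   # OR: ids present in at least one result set
```

Each index only has to hand back a set of document ids, so the combining step does not care which backend produced them.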

...

On Thursday, November 29, 2001, at 04:03 AM, Matt Hamilton wrote:

...
On Thu, 29 Nov 2001, Chris Withers wrote:

...
...
I would rather avoid having to use a relational database unless I have to. Perhaps the index pluggability could be made to support different backends (like FileStorage et al does).

Yeah, unfortunately, the difficult bit is combining queries: gimme the results where index1=='fish' and index2 is between 2 and 5kg.

if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?

Is there not a set datatype in python that could be used? Admittedly,

[SNIP]

There is a 'sets' directory in the Python CVS (in the 'nondist/sandbox' area). I think it was a proposed datatype that didn't quite make the cut for 2.2..(?)

Jeffrey P Shell, jeffrey@cuemedia.com

_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )

Chris Withers

4 Dec

4:29 p.m.

New subject: [Zope-dev] Sets

Chris McDonough wrote:

...

There are also set objects like OOSets and IISets that can be used in intersection and union operations as documented in the BTrees module.

I never understood the documentation :-( What's the difference between an OOSet and an IISet? (I think I know the answer to that one but thought I'd check...) Do things like IOSets and OISets exist, if so, what are they? cheers, Chris

Chris McDonough

4:47 p.m.

New subject: [Zope-dev] Sets

AFAIK, OOSet is a set of objects and IISet is a set of integers. That's the simplest usage of them. Anything beyond that you'll need to experiment. ----- Original Message ----- From: "Chris Withers" <chrisw@nipltd.com> To: "Chris McDonough" <chrism@zope.com> Cc: "Matt Hamilton" <matth@netsight.co.uk>; "Jeffrey P Shell" <jeffrey@cuemedia.com>; <zope-dev@zope.org> Sent: Tuesday, December 04, 2001 11:29 AM Subject: Re: [Zope-dev] Sets

...

Chris McDonough wrote:

...
There are also set objects like OOSets and IISets that can be used in intersection and union operations as documented in the BTrees module.

I never understood the documentation :-(

What's the difference between an OOSet and an IISet? (I think I know the answer to that one but thought I'd check...)

Do things like IOSets and OISets exist, if so, what are they?

cheers,

Chris

Chris Withers

4:26 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Matt Hamilton wrote:

...

...
if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?

Is there not a set datatype in python that could be used?

There is, but what would happen if index1 returned 25,000 results and index2 returned 250 and you were going to AND the results? Pouring the sets into Python data structures all the time doesn't sound too efficient...

I favour the idea of having several 'engines', like MySQL has table handlers, and letting the user pick which one they want to use.

I'm gonna try all this out at: http://sourceforge.net/projects/pythonindexer/
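One possible way to soften the 25,000-versus-250 problem, sketched under the assumption that every index can report how many hits it has and can answer "is this id in your results?" cheaply: copy only the smallest result set and probe the larger ones for membership. The function name `intersect_smallest_first` is hypothetical and is not taken from the pythonindexer project.

```python
def intersect_smallest_first(result_sets):
    """AND together several result sets, starting from the smallest one."""
    if not result_sets:
        return set()
    ordered = sorted(result_sets, key=len)
    # Only the smallest result set is copied into a new Python set.
    candidates = set(ordered[0])
    for other in ordered[1:]:
        # Larger sets are only probed for membership, never copied.
        candidates = {doc_id for doc_id in candidates if doc_id in other}
        if not candidates:
            break  # nothing left to match, so skip the remaining sets
    return candidates


big = set(range(25000))            # stand-in for index1's 25,000 hits
small = set(range(24900, 25150))   # stand-in for index2's 250 hits
matches = intersect_smallest_first([big, small])
```

The number of membership probes is bounded by the size of the smallest set (250 here), whatever the size of the others. Anything that supports `len()` and `in` could stand in for `big`, so a wrapper around another backend would fit the same shape.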

...

Admittedly, most of the stuff in MG is about textual searches rather than exact searches

Yeah, but that's the most difficult thing ;-)

...

(it can do boolean searches too, but the book is mainly about ranking).

Please god tell me they cover phrase matching :-S [snip headf*ck] Urm, maybe they'll take it a little slower than that? ;-) cheers, Chris

Matt Hamilton

4:33 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

On Tue, 4 Dec 2001, Chris Withers wrote:

...

...
(it can do boolean searches too, but the book is mainly about ranking).

Please god tell me they cover phrase matching :-S

No they don't really (if I remember right). I think they do talk about storing the position of the word in the document, so that could help. I need to dig the book out, I looked at it about 18 months ago.
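To show how stored word positions could help with phrase matching, here is a small sketch of a positional inverted index. It is an illustration only, not something taken from the book: tokenisation is a plain whitespace split, and `build_positional_index` and `phrase_search` are hypothetical names.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each term to {doc_id: [positions]} across all documents."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the phrase's words consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    # Try every occurrence of the first word as a possible phrase start.
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Each following word must sit exactly `offset` positions later.
            if all(start + offset in index[term].get(doc_id, ())
                   for offset, term in enumerate(terms[1:], 1)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical two-document collection.
docs = {
    1: "the object database stores python objects",
    2: "python stores the object in a database",
}
index = build_positional_index(docs)
```

Both documents contain "object" and "database", but only document 1 has them next to each other, which is exactly the distinction a plain boolean AND cannot make.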

...

[snip headf*ck]

Urm, maybe they'll take it a little slower than that? ;-)

Yes they do. Condensing ~500 pages into one paragraph is a bit tricky :) -Matt -- Matt Hamilton matth@netsight.co.uk Netsight Internet Solutions, Ltd. Business Vision on the Internet http://www.netsight.co.uk +44 (0)117 9090901 Web Hosting | Web Design | Domain Names | Co-location | DB Integration

Chris Withers

4:46 p.m.

New subject: [Zope-dev] *bang*

Matt Hamilton wrote:

...

No they don't really (if I remember right). I think they do talk about storing the position of the word in the document, so that could help. I need to dig the book out, I looked at it about 18 months ago.

*puts gun to head* Chris

Casey Duncan

6:52 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Can you all recommend any other books on information retrieval? I was looking at Amazon last night, and I found a few that looked interesting, I'm just the type of guy that has to "leaf through" before I buy the damn thing. I guess I just need to find a good tech book store around here...

I think my first real proposal of any significance will be to replace the catalog with a truly industrial strength indexing behemoth, that can be plugged into this whole "component arch." thingamawhammy.

BTW: I still wanna chat some time about this, I just need to decide when, perhaps late next week...

-Casey

--- Matt Hamilton <matth@netsight.co.uk> wrote:

...

On Tue, 4 Dec 2001, Chris Withers wrote:

...
...
(it can do boolean searches too, but the book is mainly about ranking).

Please god tell me they cover phrase matching :-S

No they don't really (if I remember right). I think they do talk about storing the position of the word in the document, so that could help. I need to dig the book out, I looked at it about 18 months ago.

...
[snip headf*ck]

Urm, maybe they'll take it a little slower than that? ;-)

Yes they do. Condensing ~500 pages into one paragraph is a bit tricky :)

-Matt

-- Matt Hamilton matth@netsight.co.uk Netsight Internet Solutions, Ltd. Business Vision on the Internet http://www.netsight.co.uk +44 (0)117 9090901 Web Hosting | Web Design | Domain Names | Co-location | DB Integration

__________________________________________________ Do You Yahoo!? Buy the perfect holiday gifts at Yahoo! Shopping. http://shopping.yahoo.com

Matt Hamilton

7:03 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

On Tue, 4 Dec 2001, Casey Duncan wrote:

...

Can you all recommend any other books on information retrieval? I was looking at Amazon last night, and I found a few that looked interesting, I'm just the type of guy that has to "leaf through" before I buy the damn thing. I guess I just need to find a good tech book store around here...

"Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto is pretty good too. Covers much more than just indexing (eg. user interfaces, languages, evaluation of effectiveness, distributed IR, Digital Libraries etc). If you are a member of the ACM there is lots of IR stuff in their digital library. Also the New Zealand Digital Library (www.nzdl.org) has some good links on it (the site can be quite slow at times)

...

I think my first real proposal of any significance will be to replace the catalog with a truly industrial strength indexing behemoth, that can be plugged into this whole "component arch." thingamawhammy.

BTW: I still wanna chat some time about this, I just need to decide when, perhaps late next week...

I've normally got a window open on #zope most of the day (GMT), my nick is HammerToe (long story, that is not as interesting or painful as it sounds, and involves neither a hammer nor a toe). You might need to /msg me or beep me to get my attention :) -Matt -- Matt Hamilton matth@netsight.co.uk Netsight Internet Solutions, Ltd. Business Vision on the Internet http://www.netsight.co.uk +44 (0)117 9090901 Web Hosting | Web Design | Domain Names | Co-location | DB Integration

Chris Withers

9:36 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Casey Duncan wrote:

...

I think my first real proposal of any significance will be to replace the catalog with a truly industrial strength indexing behemoth, that can be plugged into this whole "component arch." thingamawhammy.

*cough* ---> http://sourceforge.net/projects/pythonindexer Care to help? I'm hoping to get started on it this evening with Docs and the like :-) cheers, Chris

Wolfram Kerber

5 Dec

6:09 a.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Could you explain the problems that should be solved by this a little? I find it rather hard to contribute anything useful without knowing some concrete examples ... Btw, i have a prototype Catalog that speeds up things a bit in usual scenarios. It is cacheable, implements an extended query interface and solves two performance issues with certain queries. I intend to put that into a proposal :-), but probably not until next week. If anybody wants to discuss that and/or take a look, just mail me. Wolfram ----- Original Message ----- From: "Chris Withers" <chrisw@nipltd.com> To: "Casey Duncan" <casey_duncan@yahoo.com> Cc: "Matt Hamilton" <matth@netsight.co.uk>; "Casey Duncan" <c.duncan@nlada.org>; "Steve Alexander" <steve@cat-box.net>; "Wolfram Kerber" <wk@gallileus.de>; <zope-dev@zope.org> Sent: Tuesday, December 04, 2001 10:36 PM Subject: Re: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

...

Casey Duncan wrote:

...
I think my first real proposal of any significance will be to replace the catalog with a truly industrial strength indexing behemoth, that can be plugged into this whole "component arch." thingamawhammy.

*cough* --->

http://sourceforge.net/projects/pythonindexer

Care to help? I'm hoping to get started on it this evening with Docs and the like :-)

cheers,

Chris

Chris Withers

9:26 a.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Wolfram Kerber wrote:

...

Could you explain the problems that should be solved by this a little?

The idea is to provide a flexible, scalable and powerful indexing solution that works out of the box for Python, rather than Zope.

...

I find it rather hard to contribute anything useful without knowing some concrete examples ...

Hehe, wait for the docs, which I'm hopefully gonna start checking in today...

...

Btw, i have a prototype Catalog that speeds up things a bit in usual scenarios. It is cacheable, implements an extended query interface and solves two performance issues with certain queries. I intend to put that into a proposal :-), but probably not until next week. If anybody wants to discuss that and/or take a look, just mail me.

Well, I'd love to see this, where can I get the code? cheers, Chris

Wolfram Kerber

6:12 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

...

Wolfram Kerber wrote:

...
Could you explain the problems that should be solved by this a little?

The idea is to provide a flexible, scalable and powerful indexing solution that works out of the box for Python, rather than Zope.

...
I find it rather hard to contribute anything useful without knowing some concrete examples ...

Hehe, wait for the docs, which I'm hopefully gonna start checking in today...

...
Btw, i have a prototype Catalog that speeds up things a bit in usual scenarios. It is cacheable, implements an extended query interface and solves two performance issues with certain queries. I intend to put that into a proposal :-), but probably not until next week. If anybody wants to discuss that and/or take a look, just mail me.

Well, I'd love to see this, where can I get the code?

Ok, i've put it here : http://www.gallileus.info/gallileus/members/m_wolf/publications/100756611705/10075665840/protoCat.zip

I should put together some doc about the changes as well ...

Chris Withers

6 Dec

12:05 a.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Wolfram Kerber wrote:

...

Ok, i've put it here : http://www.gallileus.info/gallileus/members/m_wolf/publications/100756611705/10075665840/protoCat.zip

I should put together some doc about the changes as well ...

Indeed :-) I shall have a look though... Chris

Casey Duncan

4 Dec

6:59 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Chris, hows about adding me to this project, my s'forge username is cduncan. Thanks. -Casey --- Chris Withers <chrisw@nipltd.com> wrote:

...

Matt Hamilton wrote:

...
...
if index1 is in SQL and index2 is in ZODB, for example, how would you go about efficiently combining results?

Is there not a set datatype in python that could be used?

There is, but what would happen if index1 returned 25,000 results and index2 returned 250 and you were going to AND the results? Pouring the sets into Python data structures all the time doesn't sound too efficient...

I favour the idea of having several 'engines', like MySQL has table handlers, and letting the user pick which one they want to use.

I'm gonna try all this out at: http://sourceforge.net/projects/pythonindexer/

...
Admittedly, most of the stuff in MG is about textual searches rather than exact searches

Yeah, but that's the most difficult thing ;-)

...
(it can do boolean searches too, but the book is mainly about ranking).

Please god tell me they cover phrase matching :-S

[snip headf*ck]

Urm, maybe they'll take it a little slower than that? ;-)

cheers,

Chris

Chris Withers

9:38 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Casey Duncan wrote:

...

Chris, hows about adding me to this project, my s'forge username is cduncan.

done... lemme know if ya need anything else :-) Chris

Casey Duncan

5 Dec

2:01 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

I posted a few references I found around the web on info retrieval and indexing to the s'forge doc area. I think Chris'll need to "approve" them first before they are publicly accessible.

Chris: Do I have commit privs to the CVS? If so, I'll start helping out with some requirements and whatever other documentation we need.

Also, we should probably create a prototypes or some-such directory in the CVS for existing code. I have a couple of things that can go up there for reference or just as samples.

Also Chris, please create a mailing list for the project. And grant me admin to whatever you want to, I can certainly help administer the project as well.

ttfn, Casey

--- Chris Withers <chrisw@nipltd.com> wrote:

...

Casey Duncan wrote:

...
Chris, hows about adding me to this project, my s'forge username is cduncan.

done... lemme know if ya need anything else :-)

Chris

Chris Withers

4:08 p.m.

New subject: [Zope-dev] Searching/Indexing/ZODB/SQL/BerkleyDB

Casey Duncan wrote:

...

I posted a few references I found around the web on info retrieval and indexing to the s'forge doc area.

SF understands HTML, not structured text ;-)

...

I think Chris'll need to "approve" them first before they are publicly accessible.

Yeah, you can do this now too...

...

Chris: Do I have commit privs to the CVS? If so, I'll start helping out with some requirements and whatever other documentation we need.

You do now :-) Please have a look at the 'Documentation' module and commit any changes you wanna make on a branch :-)

...

Also, we should probably create a prototypes or some-such directory in the CVS for existing code.

I've created a SourceForge Tracker for this and put up my and Marcus Collins' prototype SQLIndexer there...

...

I have a couple of things that can go up there for reference or just as samples.

Cool :-)

...

Also Chris, please create a mailing list for the project.

http://lists.sourceforge.net/mailman/listinfo/pythonindexer-discuss Can anyone who's interested please sign up to that... cheers, Chris

participants (6)

Casey Duncan
Chris McDonough
Chris Withers
Jeffrey P Shell
Matt Hamilton
Wolfram Kerber