[ZODB-Dev] Re: [Zope3-dev] Re: Community opinion about
search+filter
Jim Fulton
jim at zope.com
Mon Mar 26 15:55:33 EDT 2007
On Mar 26, 2007, at 3:28 PM, Dieter Maurer wrote:
> Jim Fulton wrote at 2007-3-25 09:53 -0400:
>>
>> On Mar 25, 2007, at 3:01 AM, Adam Groszer wrote:
>>> MF> I think one of the main limitations of the current catalog (and
>>> MF> hurry.query) is efficient support for sorting and batching the
>>> query
>>> MF> results. The Zope 3 catalog returns all matching results, which
>>> can then
>>> MF> be sorted and batched. This will stop being scalable for large
>>> MF> collections. A relational database is able to do this
>>> internally, and is
>>> MF> potentially able to use optimizations there.
>>
>> What evidence to you have to support this assertion? We did some
>> literature search on this a few years ago and found no special trick
>> to avoid sorting costs.
>>
>> I know of 2 approaches to reducing sort cost:
>>
>> 1. Sort your results based on the "primary key" and therefore, pick
>> your primary key to match your sort results. In terms of the Zope
>> catalog framework, the primary keys are the document IDs, which are
>> traditionally chosen randomly. You can pick your primary keys based
>> on a desired sort order instead. A variation on this theme is to use
>> multiple sets of document ids, storing multiple sets of ids in each
>> index. Of course, this approach doesn't help with something like
>> relevance ranks.
>>
>> 2. Use an N-best algorithm. If N is the size of the batch and M is
>> the corpus size, then this is O(M*ln(N)) rather than O(M*ln(M)) which
>> is a significant improvement if N << M, but still quite expensive.
>
> The major costs in sorting are usually not the "log(n)" but
> the very high linear costs fetching the sort keys (although for
> huge n,
> we will reach the asymptotic limits).
Right. The problem is the "N" not the log(N). :)
> Under normal conditions, a relational database can be far more
> efficient
> to fetch values either from index structures or the data records
> than Zope -- as
>
> * its data representation is much more compact
>
> * it often supports direct access
>
> * the server itself can access and process all data.
>
>
> With the ZODB, the data is hidden in pickles (less compact), there is
> no direct access (instead the complete pickle need to be decoded)
The catalog sort index mechanism uses the un-index data structure in
the sort index to get sort keys. This is a pretty compact data
structure.
> and
> all operations are done in the client (rather than in the server).
Which is often fine if the desired data are in the client cache. It
avoids making the storage a bottleneck.
Jim
--
Jim Fulton mailto:jim at zope.com Python Powered!
CTO (540) 361-1714 http://www.python.org
Zope Corporation http://www.zope.com http://www.zope.org
More information about the ZODB-Dev
mailing list