[ZODB-Dev] RFE: Spec for ZODB Indexing

Thu, 6 Jun 2002 18:24:50 -0300

I've discussed with Patrick O'Brien and Johan Dahlin over the past week
on and off something that has turned into this specification, and I
would like to share it with the other list members so they can make a
critique of our intention. We already have some code that shows that it
can be done, but we're not very experienced with the ZODB, and it will
probably show :-)

ZODB Indexing DRAFT 1
Christian Reis for Async Open Source
kiko@async.com.br
June 2002

1. Introduction

    ZODB is an object-oriented persistence mechanism for Python.  In the
    relational world, a database provides both persistence and retrieval
    means; the ZODB has very simplified provisions for retrieval, and
    this makes applications always harder to create when they need to
    access data without a very structured path.

    This proposal intends to uncover the key aspects of a retrieval and
    indexing mechanism that would work externally from the ZODB, and
    independently from Zope, the most common application to use the
    ZODB. In other words, it intends to specify a set of classes that
    could be used directly from any application that desires to retrieve
    objects easily from a ZODB store.

2. Key concepts

    The ZODB provides more than decent persistence. This means that we
    can trust the objects are saved accordingly, and that they can be
    retrieved through a graph of object references, including extended
    types which allow the use of lists and mappings of objects. It also
    provides transactional mechanisms that allow application-database
    consistency to be maintained.

    Our mechanism intends to store and manage these references, so it
    provides a front-end for getting objects out of the database. 

    For retrieval to work, there must be a way to specify which
    object(s) we want. There are a couple of scenarios here:

        1. We want all objects in the database
        2. We want a subset of objects in the database, which possess 
           attributes according to a specified condition. 
        3. We want a specific object, identified by a reference number
           which would be something like an oid. XXX: does such a thing
           exist for OO programming?

    The first is easily provided by a dump()-like call. The last can be
    provided by the existing mechanisms for ZODB retrieval. However, the
    most common form of request is the 2nd, and for that an API for
    specifying the instances desired must be created.

    One way to implement the API for retrievals is to provide a
    simplified query language, where the attributes and conditions could
    be specified. OQL is quite probably overkill for this, but a
    simplified, pythonesque language could be devised.

2.1. Common conditions

    When retrieving objects, we'd like to be able to specify the
    attributes that identify the object in a certain way. The basic
    Python conditional operators provide a mecahnism we could use:

    - Equality, and the "is" operator
    - Greater/Less than comparisons
    - The in operator for lists
    - XXX: anything else?

    We would also like to support boolean operations upon these
    conditionals - and, or and not.

2.2. Joins and totals

    The SQL join concept is probably not immediately necessary here; it can
    be argued that joins should really be, in OO modelling, replaced by
    designing a good network of object references between instances and
    the other data they will commonly be associated with.

    However, in the case of making summaries of information stored in
    the ZODB, there is still need for the suggestion of a good solution.
    For a simple example, we'll have a database with a collection of
    Product instances, each with individual stock quantities. How do we
    provide a means to discover the total stock in the collection?
    Should the application handle this by caching this information based
    on updates to individual stock?

2.3 Query language

    An idea for specifying conditions is using a string as a pseudo-query
    language, which can be easily parsed by a python eval(), such as:

        "obj.a == foo and obj.b <= bar and obj.c in [ baz, noogie ]"

    This has the advantage of having this format is familiarity with the
    actualy language. Granted, we are generating a string for our query,
    which is definitely not native, but it seems to be an acceptable
    compromise. XXX: is there another solution, apart from ugly things
    like [ "obj.a", "<", "43" ]?

    In this string, we use the special name "obj" to refer to the object
    which we are querying against. All attributes must be qualified by
    this name.

    Additionally, if we can hook into the Python parser for the
    evaluation, type errors can be automatically caught.  Finally, the
    possibility of doing str([ a, b, c]) seems interesting for
    implementing the condition for the in operation. 

    However, it remains to be seen how easy this is to integrate with
    indexing, described further on, specially for comparisons and
    substring matches.

3. Implementation

    The most simple way to implement this API is to provide parsing, and
    when processing a query, to do sequencial, brute-force searches
    using an API for the conditionals described above. However, for
    large collections of objects, this will be insufficient for
    performance reasons.

    In this case, an indexing mechanism is necessary. Indexes would
    allow us to avoid sequential searching of all the objects,
    providing a fast lookup means for each attribute specified. Indexes
    require knowledge of instance meta-data, so either they are kept
    dynamically adapted to the instances schema, or they are explicitly
    created and destroyed. XXX: do we provide for schema changes?

    A concept for instance aggregation has to be provided too. This
    aggregator would hold a collection or instance references and the
    indexes, and would provide both query interface and index
    maintenence functionality. This aggregator would also be a ZODB
    persistent object.

    String Indexes can use the TextIndex module that is available from
    Zope.com. XXX: only allows searches based on words, would have to
    revert to sequential scan for substring, is there a solution?

3.1 Index queries

    The simplest identity comparisons are easily performed with indexes;
    however comparisons and substring/list operations may prove more
    difficult. To address this, there should be a fallback to sequential
    scanning when addressing an query unsupported by the index, as in
    relational database query processing.

    This will allow development of this indexing mechanism to evolve
    gradually; as it becomes more mature, less queries will result in
    sequential scans.

4. API example

        class Catalog(Persistent):
            XXX: Entity?
            def __init__(self, XXX):
                XXX: define what kind of instance it stores
                XXX: instance meta-data acquisition?
                XXX: index auto-creation?
            def dump(self):
                XXX: return list of instances
            def query(self, q):
                XXX: process query
            def init_index(self, instance_attr):
                XXX: specify only indexes you want to avoid bloat?

4.1 Query regex

The expression we are using for parsing at the moment is:

    "\s*(\".*?\"|'.*?'|\S+)\s+(==|\!=|<=|>=|<|>)\s+(\".*?\"|'.*?'|\S+)"

4.2 Query samples 

    "obj.a < 10"
    "obj.b > 10 and obj.a < 10"
    'obj.a == "foobar"'
    'obj.a in %s' % str([1,2,3]) # ugly
    'obj.x and obj.y == 1 and obj.z <= 3'

5. Development steps

    - Stabilize requirements for conditionals, summaries and joins
    - Think up Catalog API
    - Design query language
    - Implement basic, brute-force queries
    - Implement simple indexing
    - Link queries into indexes
    - Implement complete indexes for more data types
    - Make test application
    - Extend indexing to conditionals
    - Extend indexing to summaries?

Take care,
--
Christian Reis, Senior Engineer, Async Open Source, Brazil.
http://async.com.br/~kiko/ | [+55 16] 272 3330 | NMFL