[ZODB-Dev] RFE: Spec for ZODB Indexing
Christian Reis
kiko@async.com.br
Thu, 6 Jun 2002 18:24:50 -0300
I've discussed with Patrick O'Brien and Johan Dahlin over the past week
on and off something that has turned into this specification, and I
would like to share it with the other list members so they can make a
critique of our intention. We already have some code that shows that it
can be done, but we're not very experienced with the ZODB, and it will
probably show :-)
ZODB Indexing DRAFT 1
Christian Reis for Async Open Source
kiko@async.com.br
June 2002
1. Introduction
ZODB is an object-oriented persistence mechanism for Python. In the
relational world, a database provides both persistence and retrieval
means; the ZODB has very simplified provisions for retrieval, and
this makes applications always harder to create when they need to
access data without a very structured path.
This proposal intends to uncover the key aspects of a retrieval and
indexing mechanism that would work externally from the ZODB, and
independently from Zope, the most common application to use the
ZODB. In other words, it intends to specify a set of classes that
could be used directly from any application that desires to retrieve
objects easily from a ZODB store.
2. Key concepts
The ZODB provides more than decent persistence. This means that we
can trust the objects are saved accordingly, and that they can be
retrieved through a graph of object references, including extended
types which allow the use of lists and mappings of objects. It also
provides transactional mechanisms that allow application-database
consistency to be maintained.
Our mechanism intends to store and manage these references, so it
provides a front-end for getting objects out of the database.
For retrieval to work, there must be a way to specify which
object(s) we want. There are a couple of scenarios here:
1. We want all objects in the database
2. We want a subset of objects in the database, which possess
attributes according to a specified condition.
3. We want a specific object, identified by a reference number
which would be something like an oid. XXX: does such a thing
exist for OO programming?
The first is easily provided by a dump()-like call. The last can be
provided by the existing mechanisms for ZODB retrieval. However, the
most common form of request is the 2nd, and for that an API for
specifying the instances desired must be created.
One way to implement the API for retrievals is to provide a
simplified query language, where the attributes and conditions could
be specified. OQL is quite probably overkill for this, but a
simplified, pythonesque language could be devised.
2.1. Common conditions
When retrieving objects, we'd like to be able to specify the
attributes that identify the object in a certain way. The basic
Python conditional operators provide a mecahnism we could use:
- Equality, and the "is" operator
- Greater/Less than comparisons
- The in operator for lists
- XXX: anything else?
We would also like to support boolean operations upon these
conditionals - and, or and not.
2.2. Joins and totals
The SQL join concept is probably not immediately necessary here; it can
be argued that joins should really be, in OO modelling, replaced by
designing a good network of object references between instances and
the other data they will commonly be associated with.
However, in the case of making summaries of information stored in
the ZODB, there is still need for the suggestion of a good solution.
For a simple example, we'll have a database with a collection of
Product instances, each with individual stock quantities. How do we
provide a means to discover the total stock in the collection?
Should the application handle this by caching this information based
on updates to individual stock?
2.3 Query language
An idea for specifying conditions is using a string as a pseudo-query
language, which can be easily parsed by a python eval(), such as:
"obj.a == foo and obj.b <= bar and obj.c in [ baz, noogie ]"
This has the advantage of having this format is familiarity with the
actualy language. Granted, we are generating a string for our query,
which is definitely not native, but it seems to be an acceptable
compromise. XXX: is there another solution, apart from ugly things
like [ "obj.a", "<", "43" ]?
In this string, we use the special name "obj" to refer to the object
which we are querying against. All attributes must be qualified by
this name.
Additionally, if we can hook into the Python parser for the
evaluation, type errors can be automatically caught. Finally, the
possibility of doing str([ a, b, c]) seems interesting for
implementing the condition for the in operation.
However, it remains to be seen how easy this is to integrate with
indexing, described further on, specially for comparisons and
substring matches.
3. Implementation
The most simple way to implement this API is to provide parsing, and
when processing a query, to do sequencial, brute-force searches
using an API for the conditionals described above. However, for
large collections of objects, this will be insufficient for
performance reasons.
In this case, an indexing mechanism is necessary. Indexes would
allow us to avoid sequential searching of all the objects,
providing a fast lookup means for each attribute specified. Indexes
require knowledge of instance meta-data, so either they are kept
dynamically adapted to the instances schema, or they are explicitly
created and destroyed. XXX: do we provide for schema changes?
A concept for instance aggregation has to be provided too. This
aggregator would hold a collection or instance references and the
indexes, and would provide both query interface and index
maintenence functionality. This aggregator would also be a ZODB
persistent object.
String Indexes can use the TextIndex module that is available from
Zope.com. XXX: only allows searches based on words, would have to
revert to sequential scan for substring, is there a solution?
3.1 Index queries
The simplest identity comparisons are easily performed with indexes;
however comparisons and substring/list operations may prove more
difficult. To address this, there should be a fallback to sequential
scanning when addressing an query unsupported by the index, as in
relational database query processing.
This will allow development of this indexing mechanism to evolve
gradually; as it becomes more mature, less queries will result in
sequential scans.
4. API example
class Catalog(Persistent):
XXX: Entity?
def __init__(self, XXX):
XXX: define what kind of instance it stores
XXX: instance meta-data acquisition?
XXX: index auto-creation?
def dump(self):
XXX: return list of instances
def query(self, q):
XXX: process query
def init_index(self, instance_attr):
XXX: specify only indexes you want to avoid bloat?
4.1 Query regex
The expression we are using for parsing at the moment is:
"\s*(\".*?\"|'.*?'|\S+)\s+(==|\!=|<=|>=|<|>)\s+(\".*?\"|'.*?'|\S+)"
4.2 Query samples
"obj.a < 10"
"obj.b > 10 and obj.a < 10"
'obj.a == "foobar"'
'obj.a in %s' % str([1,2,3]) # ugly
'obj.x and obj.y == 1 and obj.z <= 3'
5. Development steps
- Stabilize requirements for conditionals, summaries and joins
- Think up Catalog API
- Design query language
- Implement basic, brute-force queries
- Implement simple indexing
- Link queries into indexes
- Implement complete indexes for more data types
- Make test application
- Extend indexing to conditionals
- Extend indexing to summaries?
Take care,
--
Christian Reis, Senior Engineer, Async Open Source, Brazil.
http://async.com.br/~kiko/ | [+55 16] 272 3330 | NMFL