On Tuesday 6 April 2010 at 18:54:24, Jim Fulton wrote:
Is there a document you can point to that provides a description of the approach used?
A document is in the works; in the meantime I'll try to describe the architecture briefly (it should be readable to anyone knowing a bit about TPC) and give some typical use cases.

NEO is distributed, with different types of nodes involved. The 3 main types are:
- Client nodes: what gets embedded in any application using NEO, so typically Zope + ZODB.
- Master nodes: one of them gets elected (becoming the "primary master node", the others becoming "secondary master nodes"). The primary handles all centralised tasks, such as:
  - oid/tid generation
  - cluster consistency checking (high level: "are there enough nodes to cover all data")
  - broadcasting cluster changes to all nodes (new storage node, storage node getting disconnected, etc.)
  When the primary master fails, the secondaries take over by restarting an election. There is no persistent data on a master node, aside from its configuration (the name of the NEO cluster it belongs to, the addresses of the other master nodes, and its own listening address).
- Storage nodes: they "contain" object data, which is stored and accessed through them and ultimately kept in a local storage backend (MySQL currently, but all NEO really needs is something that can guarantee some atomicity and limited index lookups).

Other nodes (existing or planned) are:
- Admin node: cluster monitoring (human-level: book-keeping of cluster health).
- Control command: administrator CLI tool to trigger cluster actions. Technical note: for now it uses the admin node as a stepping stone in all cases, though that could be avoided in some of them.
- Backup node: very similar technically to a storage node, but it only dumps data out of storage nodes, and is a CLI tool, probably not a daemon (not implemented yet).
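The election step above can be sketched as follows. This is only an illustration of the idea, not NEO's actual protocol: the node identifiers, the `peer_ids` list and the "smallest id wins" tie-breaking rule are assumptions made for the example (the real election involves a network handshake between master nodes).

```python
# Toy primary-master election: every master applies the same
# deterministic rule to the set of known master ids, so they all
# agree on the primary without further coordination.

def elect_primary(self_id, peer_ids):
    """Return the id of the elected primary master (smallest id wins).

    Illustrative rule only; NEO's real election is a handshake
    between master nodes.
    """
    candidates = [self_id] + list(peer_ids)
    return min(candidates)

# If the current primary disappears, the surviving secondaries
# simply re-run the election over the remaining set of masters.
# Here "master-a" just died:
print(elect_primary("master-b", ["master-c"]))  # -> master-b
```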
Remarks:
- There is currently no security in NEO (cryptography/authentication).
- The expected cluster size is around 1000 (client + storage) nodes, probably more (it depends on the node type ratio and the usage pattern).

Some sample use cases:

"Client wants data"
The client connects to the primary master node (this happens once upon startup, and again if the primary master dies). During the initial handshake, the client receives the "partition table", which tells it which storage nodes are part of the cluster and which one contains which object. Then the client connects to one of the storage nodes expected to contain the desired data, and asks it to be served.

"Client wants to commit"
The client is already known to the cluster and enters TPC in ZODB: for each object, it sends the object data to all candidate storage nodes (found by looking them up in its partition table). Those storage nodes handle locking locally at the object level, a write lock being taken upon such a data transmission. Once the "store" phase is over, the vote phase starts; its result depends on whether any storage node refuses the data (the base version of an object not being the latest one at commit time, etc.), which leads to conflict resolution and ultimately to a ConflictError exception. Then the client notifies the primary master of ZODB's decision (finish or abort), which in turn asks all involved storage nodes to make the changes persistent or discard them (releasing the object-level write locks). If the decision was to finish, the involved storage nodes take a read lock on the objects (a "barrier" kind of use) and answer the master that they are done acquiring this lock. In turn, the master asks them to release all locks (the barrier effect is achieved, and the write locks get released to allow further transactions) and sends invalidation requests to all clients.

"New storage enters cluster"
When a storage node enters an existing (and working) cluster, it is assigned some partitions, in order to achieve load balancing.
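The partition-table lookup used in "Client wants data" and in the store phase of "Client wants to commit" can be sketched like this. It is a minimal model: the modulo mapping, the table contents and the node names are assumptions for illustration, not NEO's exact data structures.

```python
# Toy partition table: maps a partition index to the storage nodes
# holding a replica of that partition. The partition of an oid is
# derived from the oid itself, so any client can compute it locally.

NUM_PARTITIONS = 4

# partition index -> list of storage node addresses (replicas)
partition_table = {
    0: ["storage-1", "storage-2"],
    1: ["storage-2", "storage-3"],
    2: ["storage-3", "storage-1"],
    3: ["storage-1", "storage-3"],
}

def partition_of(oid):
    """Map an 8-byte ZODB oid to a partition index (illustrative rule)."""
    return int.from_bytes(oid, "big") % NUM_PARTITIONS

def storages_for(oid):
    """All candidate storage nodes for an oid.

    For a load, the client picks any one of them; for a store,
    it sends the object data to all of them.
    """
    return partition_table[partition_of(oid)]

oid = (42).to_bytes(8, "big")
print(partition_of(oid))   # 42 % 4 -> 2
print(storages_for(oid))   # -> ['storage-3', 'storage-1']
```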
It will then connect by itself to the storage nodes holding data for those partitions, and start replicating them. Those partitions may or may not ultimately be dropped from their original container nodes, depending on the data replication constraints (how many copies of those partitions exist in the whole cluster).

"Storage dies"
When a storage node dies (is disconnected, or a request times out), it is marked as "temporarily down" in the primary master's partition table (a change which is then broadcast to all other nodes). This means that the storage node might still have its data, but is currently unavailable. If the lost storage node contained the last copy of any partition, the cluster has lost part of its content: the primary master asks all nodes to interrupt service (they stay running, but refuse to serve further requests until the cluster gets back to a running state). Note that currently, losing a storage node does not automatically trigger the action symmetric to adding one: partitions are not reassigned to existing nodes. That can only be triggered manually, via the admin node and its command-line tool. This was chosen to avoid the cluster wasting time balancing data over and over if a node starts flapping, and it could be made configurable.

"Storage comes back"
When a storage node comes back to life (after a power failure or whatever), it asks the other storage nodes for the transactions it missed, and replicates them.
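The "Storage comes back" catch-up can be sketched as follows. Again a simplified model under stated assumptions: the transaction log as a tid-to-data mapping, and the pull loop, are illustrative inventions, not NEO's replication protocol.

```python
# Toy catch-up replication: a returning storage node asks a peer for
# every transaction committed after the last tid it knows about, then
# applies them locally in tid order.

def missed_transactions(peer_log, last_known_tid):
    """Transactions the returning node must replicate, in tid order."""
    return [(tid, data) for tid, data in sorted(peer_log.items())
            if tid > last_known_tid]

def catch_up(local_log, peer_log):
    """Replicate missed transactions into local_log; return their count."""
    last = max(local_log, default=0)
    missed = missed_transactions(peer_log, last)
    for tid, data in missed:
        local_log[tid] = data
    return len(missed)

peer = {1: "t1", 2: "t2", 3: "t3", 4: "t4"}
local = {1: "t1", 2: "t2"}       # this node was down during tids 3 and 4
print(catch_up(local, peer))     # -> 2
print(sorted(local))             # -> [1, 2, 3, 4]
```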
That's unfortunate. Why not a less restrictive license?
We have several FSF members in Nexedi, so we use the GPL for all our products (with very few exceptions, notably monkey-patches reusing code under different licenses).
These seem to be very high level. You provide a link to the source, which is rather low level. Anything in between?
You mean a presentation of NEO (maybe what I wrote above would fit)? Or one on the "writing scalable Zope applications" topic?

Regards,
--
Vincent Pelletier