On Tuesday 6 April 2010 at 18:54:24, Jim Fulton wrote:
Is there a document you can point to that provides a description of the approach used?
A document is in the works; in the meantime I'll try to describe the architecture briefly (it should be readable to anyone knowing a bit about TPC) and give some typical use cases.

NEO is distributed, with different types of nodes involved. The 3 main types are:
- Client nodes: what gets embedded in any application using NEO, so typically Zope + ZODB.
- Master nodes: one of them gets elected (becoming the "primary master node", the others becoming "secondary master nodes"). The primary handles all centralised tasks, such as:
  - oid/tid generation
  - cluster consistency checking (high level: "are there enough nodes to cover all data")
  - broadcasting cluster changes to all nodes (new storage node, storage node getting disconnected, etc.)
  When the primary master fails, the secondaries take over by restarting an election. There is no persistent data on a master node, aside from its configuration (the name of the NEO cluster it belongs to, the addresses of the other master nodes, and its own listening address).
- Storage nodes: they "contain" object data, which is stored and accessed through them and ultimately kept in a local storage backend (MySQL currently, but all NEO really needs is something that can guarantee some atomicity and limited index lookups).

Other nodes (existing or planned) are:
- Admin node: cluster monitoring (human-level: book-keeping of cluster health).
- Control command: administrator CLI tool to trigger cluster actions. Technical note: for now it uses the admin node as a stepping stone in all cases, though that could be avoided in some of them.
- Backup node: very similar technically to a storage node, but it only dumps data out of storage nodes, and is a CLI tool, probably not a daemon (not implemented yet).
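The election step above can be sketched as follows. This is only an illustration of the idea, not NEO's actual protocol: the node identifiers, the `peer_ids` list and the "smallest id wins" tie-breaking rule are assumptions made for the example (the real election involves a network handshake between master nodes).

```python
# Toy primary-master election: every master applies the same
# deterministic rule to the set of known master ids, so they all
# agree on the primary without further coordination.

def elect_primary(self_id, peer_ids):
    """Return the id of the elected primary master (smallest id wins).

    Illustrative rule only; NEO's real election is a handshake
    between master nodes.
    """
    candidates = [self_id] + list(peer_ids)
    return min(candidates)

# If the current primary disappears, the surviving secondaries
# simply re-run the election over the remaining set of masters.
# Here "master-a" just died:
print(elect_primary("master-b", ["master-c"]))  # -> master-b
```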
Remarks:
- There is currently no security in NEO (cryptography/authentication).
- The expected cluster size is around 1000 (client + storage) nodes, probably more (it depends on the node type ratio and the usage pattern).

Some sample use cases:

"Client wants data"
The client connects to the primary master node (this happens once upon startup, and again if the primary master dies). During the initial handshake, the client receives the "partition table", which tells it which storage nodes are part of the cluster and which one contains which object. Then the client connects to one of the storage nodes expected to contain the desired data, and asks it to be served.

"Client wants to commit"
The client is already known to the cluster and enters TPC in ZODB: for each object, it sends the object data to all candidate storage nodes (found by looking them up in its partition table). Those storage nodes handle locking locally at the object level, a write lock being taken upon such a data transmission. Once the "store" phase is over, the vote phase starts; its result depends on whether any storage node refuses the data (the base version of an object not being the latest one at commit time, etc.), which leads to conflict resolution and ultimately to a ConflictError exception. Then the client notifies the primary master of ZODB's decision (finish or abort), which in turn asks all involved storage nodes to make the changes persistent or discard them (releasing the object-level write locks). If the decision was to finish, the involved storage nodes take a read lock on the objects (a "barrier" kind of use) and answer the master that they are done acquiring this lock. In turn, the master asks them to release all locks (the barrier effect is achieved, and the write locks get released to allow further transactions) and sends invalidation requests to all clients.

"New storage enters cluster"
When a storage node enters an existing (and working) cluster, it is assigned some partitions, in order to achieve load balancing.
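The partition-table lookup used in "Client wants data" and in the store phase of "Client wants to commit" can be sketched like this. It is a minimal model: the modulo mapping, the table contents and the node names are assumptions for illustration, not NEO's exact data structures.

```python
# Toy partition table: maps a partition index to the storage nodes
# holding a replica of that partition. The partition of an oid is
# derived from the oid itself, so any client can compute it locally.

NUM_PARTITIONS = 4

# partition index -> list of storage node addresses (replicas)
partition_table = {
    0: ["storage-1", "storage-2"],
    1: ["storage-2", "storage-3"],
    2: ["storage-3", "storage-1"],
    3: ["storage-1", "storage-3"],
}

def partition_of(oid):
    """Map an 8-byte ZODB oid to a partition index (illustrative rule)."""
    return int.from_bytes(oid, "big") % NUM_PARTITIONS

def storages_for(oid):
    """All candidate storage nodes for an oid.

    For a load, the client picks any one of them; for a store,
    it sends the object data to all of them.
    """
    return partition_table[partition_of(oid)]

oid = (42).to_bytes(8, "big")
print(partition_of(oid))   # 42 % 4 -> 2
print(storages_for(oid))   # -> ['storage-3', 'storage-1']
```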
It will then connect by itself to the storage nodes holding data for those partitions, and start replicating them. Those partitions may or may not ultimately be dropped from their original container nodes, depending on the data replication constraints (how many copies of those partitions exist in the whole cluster).

"Storage dies"
When a storage node dies (is disconnected, or a request times out), it is marked as "temporarily down" in the primary master's partition table (a change which is then broadcast to all other nodes). This means that the storage node might still have its data, but is currently unavailable. If the lost storage node contained the last copy of any partition, the cluster has lost part of its content: the primary master asks all nodes to interrupt service (they stay running, but refuse to serve further requests until the cluster gets back to a running state). Note that currently, losing a storage node does not automatically trigger the action symmetric to adding one: partitions are not reassigned to existing nodes. That can only be triggered manually, via the admin node and its command-line tool. This was chosen to avoid the cluster wasting time balancing data over and over if a node starts flapping, and it could be made configurable.

"Storage comes back"
When a storage node comes back to life (after a power failure or whatever), it asks the other storage nodes for the transactions it missed, and replicates them.
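The "Storage comes back" catch-up can be sketched as follows. Again a simplified model under stated assumptions: the transaction log as a tid-to-data mapping, and the pull loop, are illustrative inventions, not NEO's replication protocol.

```python
# Toy catch-up replication: a returning storage node asks a peer for
# every transaction committed after the last tid it knows about, then
# applies them locally in tid order.

def missed_transactions(peer_log, last_known_tid):
    """Transactions the returning node must replicate, in tid order."""
    return [(tid, data) for tid, data in sorted(peer_log.items())
            if tid > last_known_tid]

def catch_up(local_log, peer_log):
    """Replicate missed transactions into local_log; return their count."""
    last = max(local_log, default=0)
    missed = missed_transactions(peer_log, last)
    for tid, data in missed:
        local_log[tid] = data
    return len(missed)

peer = {1: "t1", 2: "t2", 3: "t3", 4: "t4"}
local = {1: "t1", 2: "t2"}       # this node was down during tids 3 and 4
print(catch_up(local, peer))     # -> 2
print(sorted(local))             # -> [1, 2, 3, 4]
```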
That's unfortunate. Why not a less restrictive license?
We have several FSF members in Nexedi, so we use the GPL for all our products (with very few exceptions, notably monkey-patches reusing code under different licenses).
These seem to be very high level. You provide a link to the source, which is rather low level. Anything in between?
You mean a presentation of NEO (maybe what I wrote above would fit)? Or one on the "writing scalable Zope applications" topic?

Regards,
--
Vincent Pelletier