Hi All,
As my ZODB data files become larger and larger I am looking at ways to make the structure of my objects more efficient. To simplify my question, suppose I have two different classes and both contain a list of objects from a third class:
class x has the attribute x.elements = [objects of class z]
class y has the attribute y.elements = [objects of class z]
As far as I understand Python, the lists x.elements and y.elements contain pointers to the z objects previously defined. What I wanted to know is how ZODB handles that (or maybe I should say: how pickle handles that) when saving to a file. Will the pointers be converted to a copy of the z class objects, or will one copy of the z class objects be saved and then x.elements and y.elements will still be lists of pointers?
Thanks for the help, Yair
As far as I am aware, ZODB will store a list of pointers to the z objects. What you should be careful of for efficient use of ZODB is that the list itself is stored in an efficient way, at least if the list is long or updated often.
When you pack your ZODB, does it take up a lot less space? If so, it may be that a lot of space is being wasted storing the updated lists of object references. Unless you use a special PersistentList, ZODB will have no choice but to store a new copy of the whole list when that list is modified. If you have long lists then this can be a big problem. The Persistent classes have special handling to make them more efficient.
So instead of lists use PersistentLists and instead of dicts use BTrees, as these may be stored more efficiently in the ZODB.
Also have a look at the analyze.py script to try and track down where the space is being used. My notes here may be helpful too http://zopelabs.com/cookbook/1114086617
Hope that helps,
Laurence
[Laurence Rowe]
... Unless you use a special PersistentList ZODB will have no choice but to store a new copy of the whole list when that list is modified.
Caution: that's true of a PersistentList too. The purpose of PersistentList isn't really to supply more efficient storage (that's the purpose of the various BTree classes). The purpose of PersistentList is this:
myobject.my_list_attribute[3] = 4
If my_list_attribute is a plain Python list, the persistence machinery has no way to know that my_list_attribute's state mutated, so the assignment above will not get stored to disk at the next commit unless you _also_ do
myobject._p_changed = True # or 1
If my_list_attribute is a PersistentList, then the persistence machinery does know when its state mutates, and there's no need to manage _p_changed manually.
But in either case, the entire state of my_list_attribute gets stored to disk whenever any part of it changes. The only difference in what gets stored in the example above is that myobject's state also gets stored to disk if my_list_attribute is a Python list (assuming myobject._p_changed gets set to a true value by hand), while myobject's state does not need to get written to disk again if my_list_attribute is a PersistentList (then myobject refers to my_list_attribute via the latter's oid, and that oid hasn't changed, so there's no need to store myobject's state again). The entire state of the list attribute gets written out in either case.
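The change-detection point above can be illustrated with a toy model. This is only a sketch using the standard library: ToyPersistent, ToyPersistentList and Owner are hypothetical stand-ins, not the real Persistent or PersistentList classes. The assumption it encodes is the one described above: assigning to an attribute is noticed, while mutating a plain list held by that attribute is not.

```python
class ToyPersistent:
    """Toy stand-in for Persistent: only attribute assignment marks it changed."""

    def __init__(self):
        object.__setattr__(self, "_p_changed", False)

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        if not name.startswith("_p_"):
            object.__setattr__(self, "_p_changed", True)


class ToyPersistentList(ToyPersistent):
    """Toy stand-in for PersistentList: item assignment marks the list itself."""

    def __init__(self, items):
        super().__init__()
        object.__setattr__(self, "data", list(items))

    def __setitem__(self, index, value):
        self.data[index] = value
        self._p_changed = True


class Owner(ToyPersistent):
    """Hypothetical object holding the list attribute from the example above."""

    def __init__(self, my_list_attribute):
        super().__init__()
        object.__setattr__(self, "my_list_attribute", my_list_attribute)


plain = Owner([0, 1, 2, 3, 4])
plain.my_list_attribute[3] = 4      # mutates the list; Owner.__setattr__ never runs
plain_seen = plain._p_changed       # still False, so the change would be missed
plain._p_changed = True             # the manual fix described above

wrapped = Owner(ToyPersistentList([0, 1, 2, 3, 4]))
wrapped.my_list_attribute[3] = 4
list_seen = wrapped.my_list_attribute._p_changed   # the list flagged itself
owner_seen = wrapped._p_changed                    # the owner was left alone
```

In this model the wrapped list is the only object marked as changed, which corresponds to the case where the owner's state does not need to be written again.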
If you have long lists then this can be a big problem.
Very true.
The Persistent classes have special handling to make them more efficient.
Sometimes true, but not in the PersistentList case.
So instead of lists use PersistentLists
If the goal is to save space, generally no, PersistentList won't help with that; on the contrary, a PersistentList's state takes a little more space on disk than a plain list.
and instead of dicts use BTrees,
That one's different: a BTree is really a graph of (potentially _very_) many distinct persistent objects, and BTrees were designed to support space- and time-efficient mutation.
as these may be stored more efficiently in the ZODB.
For BTrees, yes.
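The size remark about list wrappers can be checked roughly with pickle alone. This is a sketch under assumptions: collections.UserList is used here as a stand-in for a wrapper class that keeps its items in a `data` attribute, and the pickle length is taken as an approximation of the stored state size. It does not measure an actual ZODB data record.

```python
import pickle
from collections import UserList

items = list(range(100))

# Pickle the same items once as a plain list and once inside a wrapper object.
plain_size = len(pickle.dumps(items, protocol=2))
wrapped_size = len(pickle.dumps(UserList(items), protocol=2))

# The wrapper also records its class and an attribute dictionary,
# so it comes out somewhat larger than the bare list.
overhead = wrapped_size - plain_size
print(plain_size, wrapped_size, overhead)
```

The overhead is a fixed cost per wrapper, so it matters most when there are many short lists.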
[Yair Benita]
... suppose I have two different classes and both contain a list of objects from a third class:
class x has the attribute x.elements = [objects of class z]
class y has the attribute y.elements = [objects of class z]
As far as I understand Python, the lists x.elements and y.elements contain pointers to the z objects previously defined.
Yes, Python lists always contain pointers -- even if it's a list of integers, the list actually contains pointers to integer objects. But since that's always true, it's not much help in answering your real question. In general, pointers "make sense" only so long as an object resides in memory.
What I wanted to know is how ZODB handles that (or maybe I should say: how pickle handles that) when saving to a file. Will the pointers be converted to a copy of the z class objects, or will one copy of the z class objects be saved and then x.elements and y.elements will still be lists of pointers?
Persistence has its own rules: if an object is persistent (an instance of a subclass of Persistent), then its current state is stored uniquely in the database, and all references to it just save away (in effect) its persistent object id (oid, usually a 64-bit identifier uniquely assigned to each persistent object, and which retains its value for as long as the database exists). There are no exceptions to this for persistent objects. Oids are effectively a mechanism for building "persistent pointers", and apply only to persistent objects.
If an object is not persistent (is not an instance of a subclass of Persistent), it doesn't have an oid, and then there's very little possibility to share references to it on disk. Instead, on disk a copy of its state will usually get made everywhere it's referenced.
So the answer to your specific question depends mostly on something you didn't reveal: does class z derive from Persistent? If it does, then _every_ reference on disk to an instance z1 is via z1's oid. If z doesn't derive from Persistent, then almost all references on disk to an instance z1 will be via a physically distinct copy of z1's full state.
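The two cases can be imitated with the standard pickle module, whose persistent_id hook lets a pickler write an identifier in place of an object. This is a toy sketch, not ZODB's storage code: RefPickler, dump_record, the dicts standing in for object state, and the oid value are all made up for illustration.

```python
import io
import pickle


class RefPickler(pickle.Pickler):
    """Pickler that writes an oid in place of any object listed in `oids`."""

    def __init__(self, file, oids):
        super().__init__(file)
        self._oids = oids               # maps id(obj) -> hypothetical oid

    def persistent_id(self, obj):
        # Returning None tells pickle to serialize obj inline (a copy).
        return self._oids.get(id(obj))


def dump_record(state, oids):
    """Pickle one object's state as a separate record."""
    buf = io.BytesIO()
    RefPickler(buf, oids).dump(state)
    return buf.getvalue()


# Dicts stand in for object state; z1 is shared by x and y.
z1 = {"payload": "A" * 1000}
x_state = {"elements": [z1]}
y_state = {"elements": [z1]}

# Case 1: z is not persistent, so it has no oid and is copied into each record.
x_copy = dump_record(x_state, oids={})
y_copy = dump_record(y_state, oids={})

# Case 2: z is persistent, so each record holds only its oid.
oids = {id(z1): b"\x00" * 7 + b"\x01"}   # made-up 8-byte (64-bit) oid
x_ref = dump_record(x_state, oids)
y_ref = dump_record(y_state, oids)
z_record = dump_record(z1, oids={})      # z1's own state, stored once

print(len(x_copy), len(y_copy), len(x_ref), len(y_ref), len(z_record))
```

In the first case the 1000-character payload appears in both the x and y records; in the second it appears only in z1's own record.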
As always: clear, detailed and to the point. Thanks, Tim. Actually, the z class isn't a subclass of Persistent because it just holds data (has no methods) and never changes. The same goes for the lists of x and y: they tend to hold a few elements and also never change. The X and Y classes are more complex and are stored using BTrees. Reading this answer, I understand that anything I store should be persistent, even if it's a list I don't plan to edit. I was under the impression that a subclass of Persistent will be larger in size for storage, so I avoided it in the cases mentioned. Is this true?
Thanks again for the help, Yair
(In reply to Tim Peters's message of Jun 20, 2005, 4:00.)
[Yair Benita]
... Reading this answer I understand that anything I store should be persistent, even if it's a list I don't plan to edit.
I wouldn't say that. For example, for _most_ applications it would be foolish to create a subclass of Persistent to store an integer, as opposed to just storing an integer directly. I can conceive of (unlikely!) applications where there may be advantages to storing integers as persistent objects, though. In the same vein, if there aren't multiple references to a single small list that doesn't change, there seems little (if any) point to making that a PersistentList.
Note that there are other tradeoffs here too. For example, an attribute whose value is persistent is not loaded into RAM when its parent is loaded into RAM, but the full state of non-persistent attributes is loaded into RAM at the time their parent is loaded into RAM. That can have a major effect on time and memory demands, and in opposing directions. Or it may not -- it depends on details of the application's object access patterns.
I was under the impression that a subclass of persistent will be larger in size for storage, so I avoided it in the cases mentioned. Is this true?
Create a specific class definition, and it's easy to measure. It depends on the class. Certainly it costs more space to create a persistent version of a builtin Python type, and for the same reason it costs more space too to create any user-defined subclass of a builtin Python type. But for an object of a user-defined class, a persistent version takes more RAM when it's in memory (because it has to store info like the oid, and _p_changed, that non-persistent objects don't have), but the on-disk size is at worst roughly the same (e.g., the values of persistent attributes like _p_changed and _p_state don't get stored to disk, they only exist while the persistent object is in RAM).
If I were you, I'd spend some quality time with fsdump, and figure out where the bulk of your space is going. Details matter more than "general principles" here. If you use the fsdump.py from ZODB 3.4 (which can be used with .fs files created by ZODB 3.1 and 3.2 too), it displays the byte size of data records, which can be a real help in such analysis.
Tim Peters wrote:
[Yair Benita]
... Reading this answer I understand that anything I store should be persistent, even if it's a list I don't plan to edit.
I wouldn't say that. For example, for _most_ applications it would be foolish to create a subclass of Persistent to store an integer, as opposed to just storing an integer directly. I can conceive of (unlikely!) applications where there may be advantages to storing integers as persistent objects, though.
As, for instance, where the integer changes much more frequently than the other attributes, which are large enough that re-storing them just because the integer attribute changed is painful. Making the attribute a persistent sub-object also eliminates the chance of a ConflictError based on changes to the other attributes. This is the use case which drives BTrees.Length, right?
Tres.
--
Tres Seaver, tseaver@palladion.com
Palladion Software, http://palladion.com
[Yair Benita]
Reading this answer I understand that anything I store should be persistent, even if it's a list I don't plan to edit.
[Tim Peters]
I wouldn't say that. For example, for _most_ applications it would be foolish to create a subclass of Persistent to store an integer, as opposed to just storing an integer directly. I can conceive of (unlikely!) applications where there may be advantages to storing integers as persistent objects, though.
[Tres Seaver]
As, for instance, where the integer changes much more frequently than the other attributes, which are large enough that re-storing them just because the integer attribute changed is painful.
Yup, that's a possible reason. Another recently popped up, which I'll exaggerate to make the point: you have 100,000 distinct integer ids, and you have 10,000 objects each with a (Python) list containing 10,000 of those ids. If you load those all into memory, Python will allocate space for 10000*10000 = 100 million integer objects, and that will consume more than a gigabyte of RAM. But if integers are stored as one unique persistent object per unique integer, it can't require more than 100 thousand distinct persistent integers in memory (because that's the total number of distinct integer ids). The RAM difference is a factor of about 1000 (but ignoring that it takes more RAM to hold a persistent wrapper than to hold a straight integer).
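The counts in that example work out as follows. This is just the arithmetic from the paragraph above; the per-object byte size is whatever the running interpreter reports and will differ from the figures that applied when the message was written.

```python
import sys

objects = 10_000          # objects that each hold a list of ids
ids_per_list = 10_000     # ids stored in each list
distinct_ids = 100_000    # distinct id values overall

ints_if_copied = objects * ids_per_list   # one integer object per list slot
ints_if_shared = distinct_ids             # at most one object per distinct id
ratio = ints_if_copied // ints_if_shared

# Rough per-object cost on this interpreter; it varies by build,
# so it is printed rather than relied upon.
bytes_per_int = sys.getsizeof(distinct_ids)
print(ints_if_copied, ints_if_shared, ratio, ints_if_copied * bytes_per_int)
```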
I'll note that IISets avoid this problem via a different route: they hold their integers as raw bits, not as Python integer objects. When you extract an element from an IISet, a Python integer object is created on-the-fly to wrap the bits.
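The raw-bits idea can be illustrated with the standard array module. This is an analogy only: array is not how IISet is implemented, and the sizes depend on the platform. It shows integers held as packed machine values, with an integer object created only when an element is read.

```python
import sys
from array import array

ids = list(range(1000, 2000))     # 1000 sample integer ids

packed = array("i", ids)          # raw machine ints, itemsize bytes each
raw_bytes = packed.itemsize * len(packed)

# In a plain list every element is a separate integer object,
# each with its own per-object overhead.
object_bytes = sum(sys.getsizeof(i) for i in ids)

first = packed[0]                 # an int object is created on extraction
print(raw_bytes, object_bytes, first)
```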
Making the attribute a persistent sub-object also eliminates the chance of a ConflictError based on changes to the other attributes.
I didn't follow that one. If other attributes change, they can trigger conflict errors, right?
This is the use case which drives BTrees.Length, right?
The important part of that is its conflict resolution method, which keeps track of the correct final size of a BTree in the face of concurrent mutations. BTrees don't keep track of their own size because every addition or deletion would have to percolate the change in size back up to the root of the BTree, and we'd get conflict errors on the root object then. As is, most additions and deletions change only the leaf Bucket node where the mutation takes place, giving mutation often-useful spatial locality in the face of concurrent mutations.
I wish we could do better than that, though: from what I see, most people don't realize that len(some_BTree) takes time linear in the number of elements, and sucks the entire BTree into RAM. The rest seem to have trouble, at least at first, using BTrees.Length correctly. I suppose that's what you get when a scheme is driven by pragmatic implementation compromises instead of by semantic necessity. Given enough pain, it should be possible to hide the BTrees.Length strategy under the covers, although I'm not sure the increase in storage size could be justified to users who have mastered the details of doing it manually (the problem being that many uses for BTrees never care to ask for the size, so wouldn't want to pay extra overheads for keeping track of size efficiently).
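The conflict resolution idea for a size counter can be sketched in a few lines. The function name resolve_length is hypothetical, and the merge rule shown (apply both changes relative to the common starting value) is an assumption about how such a counter can be reconciled; I believe it matches what BTrees.Length does, but that is not confirmed here.

```python
def resolve_length(old, committed, new):
    """Merge two concurrent updates to a counter that both started from `old`."""
    return committed + new - old


# Two transactions both start from a stored size of 10.
old = 10
committed = old + 2     # the first transaction added two keys and committed
new = old - 1           # the second transaction removed one key
merged = resolve_length(old, committed, new)
print(merged)
```

Because the merge only needs the three counter values, concurrent additions and deletions elsewhere in the tree do not have to conflict on a shared size stored in the root.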
Tim Peters wrote:
[Tres Seaver]
Making the attribute a persistent sub-object also eliminates the chance of a ConflictError based on changes to the other attributes.
I didn't follow that one. If other attributes change, they can trigger conflict errors, right?
Imagine object A with attributes 'foo' (a string), 'bar' (a normal Python int), and 'baz' (a hypothetical persistent int). Assigning directly to 'baz' would still conflict with assigning to 'foo' or 'bar'; however, the "persistent int" object might have an update protocol which made its value changeable without needing to rebind another PI into its parent.
This is the use case which drives BTrees.Length, right?
The important part of that is its conflict resolution method, which keeps track of the correct final size of a BTree in the face of concurrent mutations. BTrees don't keep track of their own size because every addition or deletion would have to percolate the change in size back up to the root of the BTree, and we'd get conflict errors on the root object then. As is, most additions and deletions change only the leaf Bucket node where the mutation takes place, giving mutation often-useful spatial locality in the face of concurrent mutations.
I wish we could do better than that, though: from what I see, most people don't realize that len(some_BTree) takes time linear in the number of elements, and sucks the entire BTree into RAM. The rest seem to have trouble, at least at first, using BTrees.Length correctly. I suppose that's what you get when a scheme is driven by pragmatic implementation compromises instead of by semantic necessity. Given enough pain, it should be possible to hide the BTrees.Length strategy under the covers, although I'm not sure the increase in storage size could be justified to users who have mastered the details of doing it manually (the problem being that many uses for BTrees never care to ask for the size, so wouldn't want to pay extra overheads for keeping track of size efficiently).
OK, cool.
Tres.