Hi, I have a problem with performance and memory consumption when trying to do some statistics, using the following code:

    docs = container.portal_catalog(meta_type='Document', ...)
    for doc in docs:
        obj = doc.getObject()
        value = obj.attr

With about 10,000 documents this Python script takes 10 minutes and more than 500MB of memory; after that I had to restart Zope. I am running Zope 2.6.1 + Plone 1.0 on Windows 2000, Xeon P4 with 1GB RAM. What's wrong with this code? Any suggestion is appreciated.

Nguyen Quan Son
Nguyen Quan Son wrote:
Hi, I have a problem with performance and memory consumption when trying to do some statistics, using the following code:

    docs = container.portal_catalog(meta_type='Document', ...)
    for doc in docs:
        obj = doc.getObject()
        value = obj.attr
it's not the catalog that's slow:

    t1 = time.time()
    docs = container.portal_catalog(meta_type='Document', ...)
    t2 = time.time()
    for doc in docs:
        obj = doc.getObject()
        value = obj.attr
    t3 = time.time()

Print out the times and you'll see that the catalog query is fast. The problem is that you are inflating each and every document one after another, and that takes time.

Romain Slootmaekers
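Romain's timing suggestion can be written as a small self-contained harness. This is only a sketch outside Zope: `timed_phases`, `query`, and `inflate` are all made-up names, standing in for the catalog call and the per-document getObject()/attribute step (here a cheap range and a deliberately slow lambda).

```python
import time

def timed_phases(query, inflate):
    """Time the 'query' phase separately from the per-item 'inflate' phase."""
    t1 = time.time()
    docs = query()                            # stand-in for portal_catalog(...)
    t2 = time.time()
    values = [inflate(doc) for doc in docs]   # stand-in for doc.getObject().attr
    t3 = time.time()
    return values, t2 - t1, t3 - t2

# Hypothetical stand-ins: a cheap query and a deliberately slow inflation.
vals, query_s, inflate_s = timed_phases(
    lambda: range(50),
    lambda doc: (time.sleep(0.001), doc)[1],
)
```

Run against a real catalog, the same split should show the query finishing quickly while the inflation loop dominates.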
_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Nguyen Quan Son wrote:
With about 10,000 documents this Python script takes 10 minutes and more than 500MB of memory; after that I had to restart Zope.
Most likely you are filling the memory of your server so that you are swapping to disk. Try cutting the query into smaller pieces so that the memory doesn't get filled up.

regards Max M
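Max M's "smaller pieces" idea can be sketched as a plain-Python batching loop. This is an illustration, not a Zope API: `process_in_batches`, `handle`, and `batch_size` are all hypothetical names.

```python
def process_in_batches(docs, handle, batch_size=500):
    """Walk catalog results in fixed-size slices so that only one
    batch worth of objects needs to be activated at a time."""
    results = []
    for start in range(0, len(docs), batch_size):
        for doc in docs[start:start + batch_size]:
            results.append(handle(doc))
        # Between batches a real Zope script could release memory,
        # e.g. by deactivating the objects just visited.
    return results
```

Slicing catalog brains is cheap, since brains are small records; only the objects actually touched inside `handle` get activated.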
Max M wrote:
Most likely you are filling the memory of your server so that you are swapping to disk.
Try cutting the query into smaller pieces so that the memory doesn't get filled up.
If you can't use catalog metadata as Seb suggests (eg. you are actually accessing many attributes, large values, etc.), and if indeed memory is the problem (which seems likely), then you can ghostify the objects that were ghosts to begin with, and it will save memory (unless all those objects are already in cache).

The problem with this strategy, though, is that the doc.getObject() method used in your code activates the object, and hence you won't know if it was a ghost already or not. To get around this you can shortcut this method and do something like:

    docs = container.portal_catalog(meta_type='Document', ...)
    for doc in docs:
        obj = doc.aq_parent.unrestrictedTraverse(doc.getPath())
        was_ghost = obj._p_changed is None
        value = obj.attr
        if was_ghost:
            obj._p_deactivate()

You can test this by running your code on a freshly restarted server and checking the number of objects in cache. The number shouldn't change much after running the above method, but will increase dramatically if you just use 'obj = doc.getObject()' instead, or don't deactivate the objects. The lower number of objects in your cache should in turn keep your memory usage down, prevent your computer paging through the request, and hence speed things up considerably!

Another option would be to reduce the size of your cache so that the amount of memory your Zope instance consumes doesn't cause your computer to swap. Doing the above code changes will also help keep your cache filled with the 'right' objects, which in turn will further help the performance of subsequent requests.

Cheers, JB.
John Barratt wrote:
...if indeed memory is the problem (which seems likely) then you can ghostify the objects that were ghosts to begin with, and it will save memory (unless all those objects are already in cache).
This is rather interesting, but I don't quite follow what's happening. If you can say a little more, or suggest a doc reference, I'm all ears.
Simon Michael wrote:
John Barratt wrote:
...if indeed memory is the problem (which seems likely) then you can ghostify the objects that were ghosts to begin with, and it will save memory (unless all those objects are already in cache).
This is rather interesting, but I don't quite follow what's happening. If you can say a little more, or suggest a doc reference, I'm all ears.
In general, when an object is first loaded from the ZODB it is in a 'ghost' state: it is only a shell and has no attributes. When you access (almost) any attribute on that object (eg. do: value = ob.attr), it gets activated: the contents are loaded automatically, and then the value is returned. This is when the real memory usage takes place. So if you get an object from the ZODB and don't access any attributes, it will remain in a ghosted state. Some core Python attributes *don't* cause it to activate, such as accessing __dict__, and also clearly the reserved persistent _p_* attributes.

If you look at the Cache Parameters tab of your Database in the Control Panel (at least with Zope 2.6.2, perhaps 2.6.1) you can see how many objects are in memory, and how many are just 'ghosts'. I think ghosts are only 'removed' after a restart, and essentially just contain a _p_oid that references the object in the ZODB, ready for re-activation.

A general reference for the ZODB that explains more can be found here:

http://www.python.org/workshops/2000-01/proceedings/papers/fulton/zodb3.html

An example use (and good discussion) that is similar can be found at the link below. I found this after having problems with objects not de-ghostifying properly when just accessing __dict__:

http://aspn.activestate.com/ASPN/Mail/Message/zodb-dev/913762

Also, a grep through the Zope source code & some products will find many examples of 'deactivating' objects after a 'walk'. Eg. from OFS.ObjectManager:

    def manage_afterAdd(self, item, container):
        for object in self.objectValues():
            try:
                s = object._p_changed
            except:
                s = 0
            if hasattr(aq_base(object), 'manage_afterAdd'):
                object.manage_afterAdd(item, container)
            if s is None:
                object._p_deactivate()

A change to my example code that would be advisable is wrapping the _p_changed test in a try/except, in case the object is None or for some reason isn't persistent, and hence doesn't have a _p_changed. I hope this helps & makes sense!
Cheers, JB.
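JB's hardened loop can be simulated outside Zope. Nothing below is real Zope API: `FakePersistent` is a made-up stand-in that models only the ghost protocol he describes (`_p_changed` is None while ghosted, attribute access activates, `_p_deactivate()` re-ghosts), so the loop's bookkeeping can be checked in isolation.

```python
class FakePersistent:
    """Stand-in for a ZODB object; starts life as a ghost."""
    def __init__(self):
        self._p_changed = None     # None means 'ghost'
        self._attr = 'value'

    @property
    def attr(self):
        self._p_changed = False    # attribute access activates the object
        return self._attr

    def _p_deactivate(self):
        self._p_changed = None     # drop loaded state, back to a ghost

def read_attr_preserving_ghosts(objects):
    """JB's pattern: note the ghost state, read the attribute, then
    re-ghost only the objects that were ghosts to begin with."""
    values = []
    for obj in objects:
        try:
            was_ghost = obj._p_changed is None
        except AttributeError:     # not persistent: nothing to re-ghost
            was_ghost = False
        values.append(obj.attr)
        if was_ghost:
            obj._p_deactivate()
    return values
```

After the loop, every object that started as a ghost is a ghost again, which is exactly the property JB suggests checking via the cache counts on a real instance.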
John - your post and the links helped a lot. Thanks!
I've added catalog metadata as Seb suggested and it works fine. Thank you very much. Nguyen Quan Son
John Barratt wrote:
    docs = container.portal_catalog(meta_type='Document', ...)
    for doc in docs:
        obj = doc.aq_parent.unrestrictedTraverse(doc.getPath())
        was_ghost = obj._p_changed is None
        value = obj.attr
        if was_ghost:
            obj._p_deactivate()

Bear in mind though, that you can only do this in an external method...

Chris
Chris Withers wrote:
John Barratt wrote:
    docs = container.portal_catalog(meta_type='Document', ...)
    for doc in docs:
        obj = doc.aq_parent.unrestrictedTraverse(doc.getPath())
        was_ghost = obj._p_changed is None
        value = obj.attr
        if was_ghost:
            obj._p_deactivate()

Bear in mind though, that you can only do this in an external method...

Why can you only do this in an external method? This idea (deactivating objects) is used quite extensively in many parts of core Zope, such as OFS.ObjectManager, as I mentioned in another post, and we use it in our product code quite a bit as well.
JB.
John Barratt wrote:
Chris Withers wrote:
Bear in mind though, that you can only do this in an external method...
..or product code ;-)
Why can you only do this in an external method? This idea (deactivating objects) is used quite extensively in many parts of core Zope, such as OFS.ObjectManager, as I mentioned in another post, and we use it in our product code quite a bit as well.
Nguyen was asking about a Python Script; you can't do these things there, as the necessary methods don't have security declarations, and the methods start with '_', which I think the Zope security policy denies access to...

Chris
Nguyen Quan Son wrote:
    docs = container.portal_catalog(meta_type='Document', ...)
    for doc in docs:
        obj = doc.getObject()
        value = obj.attr
With getObject(), you're loading entire objects into memory in order to grab a single attribute. This is very wasteful. Try putting the attribute into the metadata for the catalog and grabbing it from there. Then you can do:

    for doc in docs:
        value = doc.attr

seb
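Seb's point can be simulated outside Zope to see why metadata is so much cheaper. `Brain` and `FullDocument` below are made-up stand-ins: the brain carries only the indexed metadata column, while `getObject()` materialises a deliberately bulky full object.

```python
class FullDocument:
    """Stand-in for a stored object; creating it simulates a full ZODB load."""
    def __init__(self, attr):
        self.attr = attr
        self.body = 'x' * 50_000   # the bulk you don't need for statistics

class Brain:
    """Stand-in for a catalog brain holding one metadata column."""
    loaded = 0                      # counts full-object loads

    def __init__(self, attr):
        self.attr = attr            # metadata: already stored in the catalog

    def getObject(self):
        Brain.loaded += 1
        return FullDocument(self.attr)

docs = [Brain(i) for i in range(1000)]

# Metadata route: no full objects are ever loaded.
values = [doc.attr for doc in docs]
assert Brain.loaded == 0

# getObject() route: every single document is materialised.
values2 = [doc.getObject().attr for doc in docs]
assert Brain.loaded == 1000
```

Both loops return the same values; only the second pays the cost of loading (and caching) every document, which is the behaviour Son was seeing.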
participants (8)

- Chris Withers
- John Barratt
- Max M
- Nguyen Quan Son
- Romain Slootmaekers
- Seb Bacon
- Simon Michael
- Toby Dickenson