Very severe memory leak
Hi,

For a long time now, one of our clients, running Zope 2.5.1, has been experiencing memory leaks. For a while this was relieved by restarting Zope every day at 4am. Lately this has not been enough, however: Zope started taking more and more memory, to the point that it frequently required more than one restart during the day.

The machine this is running on is a Xeon 500 MHz with 512 MB of memory. Zope is a very memory-intensive application, but 512 MB used to be enough for a medium-sized portal.

This Zope site makes very heavy use of both ZClasses and ZCatalog, and it appears to me that this worsening of performance and memory consumption was caused by the increase in content, which increased the size of the ZCatalog and the number of ZClass instances. A total of 6796 ZClass instances representing News Items are cataloged. There are other ZCatalogs and ZClass instances representing calendar events and other things (this is a very old Zope portal coded by hand in ZClasses; no CMF, no Plone, not even Pagina1, our ZClass-based CMS).

Fiddling with the cache parameters in the control panel showed that while we could keep memory consumption at a point where the daily restart would be enough (say, with 10k objects per thread), we would get constant thrashing of cache objects, especially DateTime objects from the ZCatalog queries, and machine performance would be close to intolerable; whereas if the cache parameters were set for fast performance (50k objects per thread), the machine would run out of memory in 3 to 4 hours. Needless to say, this was with heavy use of RAMCacheManagers, not counting the accelerator proxy in front of it. Without the RAMCaches, the machine would go down in under 5 minutes of work-hour load. Even with the caches on, the load would never drop below 2.0 during work hours.

Last Tuesday we decided to wait no longer for 2.6.2 and migrated the site to 2.6.1.
We dealt with the ObjectManager-based-ZClass issue, reworked the ZCatalogs to replace the DateTime FieldIndexes with DateTimeIndexes, and then had a working testing environment, which we stress-tested lightly without detecting any problems and quickly moved to production. This was late at night.

The next morning we were surprised to see the machine run out of memory very quickly. The memory leak was *far more severe* than before. Zope needed a restart every 15 minutes or so before it would send the machine into heavy swapping.

On a very counter-intuitive hunch I suggested we shut down all RAMCaches and, amazingly enough, this made the situation a bit more manageable. We're now restarting every 45 minutes. To our relief, disabling the RAMCaches had only a barely noticeable effect on performance. The site kept churning out pages really fast, a testament to the optimization job done in the 2.6 series. The load on the machine is rarely above 0.8, except when it goes into swap :-)

The number of DateTime refcounts in the Control_Panel, although much smaller than in Zope 2.5.1, is very high and, most importantly, constantly increasing, as far as I can tell. After 12 minutes of uptime, the top refcounts are:

  DateTime.DateTime.DateTime: 96168
  BTrees._IOBTree.IOBucket: 43085
  BTrees._IIBTree.IIBTree: 40400
  BTrees._IIBTree.IIBucket: 23696
  OFS.DTMLDocument.DTMLDocument: 23190
  BTrees.OIBTree.OIBucket: 14582
  BTrees._IIBTree.IISet: 12479
  BTrees._IIBTree.IITreeSet: 10823
  BTrees.OOBTree.OOBucket: 7088
  OFS.Image.Image: 6860
  OFS.DTMLMethod.DTMLMethod: 5894
  DocumentTemplate.DT_Util.Eval: 3250
  OFS.Image.File: 2796
  BTrees._IOBTree.IOBTree: 2761
  ZClasses.Method.MWp: 1592

In time, the DateTime refcount eventually dwarfs the second place by an order of magnitude. I think this is related to the fact that DateTime instances are stored as metadata, even though the date indexes have been converted to DateTime indexes. The question is, why aren't those instances being released? What is holding on to them?
I tried installing the LeakFinder product but discovered it didn't work, before stumbling on a message in the archives that told me exactly that :-) The RefCounts view in the LeakFinder object fails with the following traceback:

  Traceback (innermost last):
    [...]
    Module DocumentTemplate.DT_Util, line 201, in eval
     - __traceback_info__: REQUEST
    Module <string>, line 0, in ?
    Module Products.LeakFinder.LeakFinder, line 240, in manage_getSample
    Module Products.LeakFinder.LeakFinder, line 163, in getControlledRefcounts
    Module Products.LeakFinder.LeakFinder, line 188, in resetCache
  TypeError: function takes at most 2 arguments (3 given)

The code in question is:

  def resetCache(c, gc):
      cache = c._cache
      if gc:
          cache.minimize(3)  # The minimum that actually performs gc.
      # Replace it with an empty cache.
      c._cache = PickleCache(c, cache.cache_size, cache.cache_age)

PickleCache is imported from ZODB.cPickleCache, which is (surprise!) in C, and I couldn't figure out how I should fix the call above to work.

In an ExternalMethod, I consulted the Python module gc and checked that gc *was* enabled and *no* uncollectable cycles were found.

In trying to narrow down the causes of the leak, we detected what looks like a minor one. If you insert a very simple DTML method in the Zope root and bang on it at a rate of, say, 100 requests per second, Zope will increase its memory footprint very slowly, say, a couple of hundred KB every 10 minutes (both RSS and virtual size). I figure this has something to do with memory fragmentation or some other low-level stuff. But this is definitely not our main concern, which is to find out how Zope is leaking DateTimes so heavily on our site, and whether it is really all those DateTimes that are hogging our memory.

We need help, and we need it desperately. If anyone wants any other information we'd be happy to provide it.

--
Leonardo Rochael Almeida <leo@hiper.com.br>
Hiperlogica informatica e educacao ltda.
Well, to at least reduce the immediate pressure, why don't you throw more RAM into that server? Memory is cheap. jens On Friday, Aug 22, 2003, at 17:38 US/Eastern, Leonardo Rochael Almeida wrote:
Leonardo,

If you're using Linux, you might be able to take some of the pressure off by using AutoLance to automatically restart the server when memory grows above a certain threshold.

http://zope.org/Members/mcdonc/Products/AutoLance/view

On Sat, 2003-08-23 at 15:17, Jens Vagelpohl wrote:
Well, to at least reduce the immediate pressure, why don't you throw more RAM into that server? Memory is cheap.
jens
Leonardo Rochael Almeida wrote at 2003-8-22 18:38 -0300:
... I tried installing the LeakFinder product but discovered it didn't work, before stumbling on a message in the archives that told me exactly that :-) The RefCounts view in the LeakFinder object fails with the following traceback:
  Traceback (innermost last):
    [...]
    Module DocumentTemplate.DT_Util, line 201, in eval
     - __traceback_info__: REQUEST
    Module <string>, line 0, in ?
    Module Products.LeakFinder.LeakFinder, line 240, in manage_getSample
    Module Products.LeakFinder.LeakFinder, line 163, in getControlledRefcounts
    Module Products.LeakFinder.LeakFinder, line 188, in resetCache
  TypeError: function takes at most 2 arguments (3 given)
  ...
      c._cache = PickleCache(c, cache.cache_size, cache.cache_age)
"PickleCache" now has dropped the "cache_age" parameter. Try: "PickleCache(c,cache.cache_size)" Dieter
On 08/22/2003 05:38 PM, Leonardo Rochael Almeida wrote:
In time, the DateTime refcount eventually dwarfs the second place by an order of magnitude. I think this is related to the fact that DateTime instances are stored as metadata, even though the date indexes have been converted to DateTime indexes. The question is, why aren't those instances being released? What is holding on to them?
When you flush the cache, those DateTimes should disappear. If they don't, the leak is keeping them.
I tried installing the LeakFinder product but discovered it didn't work, before stumbling on a message in the archives that told me exactly that :-) The RefCounts view in the LeakFinder object fails with the following traceback:
LeakFinder is an early attempt to share some of the techniques I use for finding leaks. Unfortunately, those techniques aren't very useful until you've already searched for weeks, which means LeakFinder isn't very good for emergencies.

Here are the things you should look at first:

1) The ZODB cache size. The meaning of this number changed dramatically in 2.6. Before 2.6 it was a very vague number. In 2.6 it's a target number of objects that Zope actually tries to maintain. Before 2.6 it might have made sense to set the ZODB cache size to some arbitrarily high number like 100,000; in 2.6 you want to start at about 2000 and adjust from there. There are tools in 2.6 for helping you adjust the number.

2) The number of ImplicitAcquisitionWrappers present in the system. I have found it to be a reliable indicator of whether you have a leak or not. Expect this number to stay under 400 or so. If it grows gradually, there's a leak. Watch the refcounts screen.

3) Is Python compiled with cyclic garbage collection enabled? 2.4 and above absolutely require cyclic garbage collection.

4) Don't use 2.6.1. Use 2.6.2, which has fixes for known leaks. It is actually already tagged in CVS as "Zope-2_6_2", and it's what zope.org is now running. Various unrelated things prevented a formal release this week.

If all else fails, grep all Python modules for "sys._getframe()" and "sys.exc_info()". These are the primary causes of memory leaks in Python 2.1 and below. If you're brave, you can just run Zope under Python 2.2, which fixes those particular leaks AFAIK.

Finally, there's always hope. :-) The latest thing I've been doing is running Zope in a debug build of Python. A debug build makes a magical "sys._getobjects()" available, allowing you to inspect all live objects through a remote console. Since debug builds aren't much slower than standard builds, you can even run a debug build in production for short periods of time.
I've been building a small library of functions for working in this mode, and if you need them, I'll pass them along. I'd have to warn you that they are anything but intuitive in their purpose and use, though. Shane
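The refcount-watching Shane recommends can be roughly approximated outside Zope with the stdlib gc module. This is a hedged sketch, not a Zope API: `live_counts` is a made-up name, and gc.get_objects() only sees GC-tracked objects, so the numbers won't match the Control_Panel screen exactly.

```python
import gc
from collections import Counter

def live_counts():
    """Tally live, GC-tracked objects by class name -- a crude
    stdlib approximation of Zope's Control_Panel refcounts screen.
    Sample it periodically; classes whose counts only ever grow
    (like the DateTime instances in this thread) are suspects."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())
```

Calling `live_counts().most_common(15)` twice, a few minutes apart, and diffing the results is the same "watch the refcounts screen" exercise in script form.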
On Sat, 2003-08-23 at 22:18, Shane Hathaway wrote:
On 08/22/2003 05:38 PM, Leonardo Rochael Almeida wrote:
In time, the DateTime refcount eventually dwarfs the second place by an order of magnitude. I think this is related to the fact that DateTime instances are stored as metadata, even though the date indexes have been converted to DateTime indexes. The question is, why aren't those instances being released? What is holding on to them?
When you flush the cache, those DateTimes should disappear. If they don't, the leak is keeping them.
They are disappearing. Too bad they return immediately afterward, as they're coming from a very heavy catalog query result.
LeakFinder is an early attempt to share some of the techniques I use for finding leaks. Unfortunately, those techniques aren't very useful until you've already searched for weeks, which means LeakFinder isn't very good for emergencies.
Well, I have been searching for close to a week now :-) BTW, removing the third parameter in the PickleCache call makes the RefCounts tab work again. I'll see if I can get some tracebacks out of object creations now.
Here are the things you should look at first:
1) The ZODB cache size. The meaning of this number changed dramatically in 2.6. Before 2.6 it was a very vague number. In 2.6 it's a target number of objects that Zope actually tries to maintain. Before 2.6 it might have made sense to set the ZODB cache size to some arbitrarily high number like 100,000; in 2.6 you want to start at about 2000 and adjust from there. There are tools in 2.6 for helping you adjust the number.
We're keeping it at 15,000. Most threads keep around 12k and 13k. One of them sometimes exaggerates at 100k or so. Hmm, funny... when I look at them now, they're all under 5k...
2) The number of ImplicitAcquisitionWrappers present in the system. I have found it to be a reliable indicator of whether you have a leak or not. Expect this number to stay under 400 or so. If it grows gradually, there's a leak. Watch the refcounts screen.
It keeps under 200. At least when I'm looking :-)
3) Is Python compiled with cyclic garbage collection enabled? 2.4 and above absolutely require cyclic garbage collection.
Yes, I mentioned in my message that I checked it. I also checked that the gc didn't find any uncollectables, so it's definitely *not* a cycle leak. It's a please-release-my-reference-and-lemme-die-in-peace leak.
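For reference, the check Leonardo describes running from an ExternalMethod boils down to a few stdlib gc calls; this is a generic sketch (the function name is illustrative), not his actual ExternalMethod:

```python
import gc

def check_cycle_collector():
    """Is cyclic GC enabled, and did a full collection leave
    anything uncollectable behind?"""
    enabled = gc.isenabled()
    unreachable = gc.collect()        # cycles found (and normally freed)
    uncollectable = len(gc.garbage)   # cycles the collector could not free
    return enabled, unreachable, uncollectable
```

If `enabled` is True and `gc.garbage` stays empty while memory still grows, the leak is plain reference retention, not uncollectable cycles, which is exactly the conclusion drawn here.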
4) Don't use 2.6.1. Use 2.6.2, which has fixes for known leaks. It is actually already tagged in CVS as "Zope-2_6_2", and it's what zope.org is now running. Various unrelated things prevented a formal release this week.
You guys are sure about this? The client is very much against running things from CVS, because they just don't want to go through another upgrade process when the "real" 2.6.2 shows up.
If all else fails, grep all Python modules for "sys._getframe()" and "sys.exc_info()". These are the primary causes of memory leaks in Python 2.1 and below. If you're brave, you can just run Zope under Python 2.2, which fixes those particular leaks AFAIK.
Do I need to be particularly brave to run 2.6.2 (as tagged in CVS) under Python 2.2? Is it still an "unsupported" combination? I know it's 2.7's job to be 2.2-compliant, but I've seen reports of more and more people running Zope 2.6.x under Python 2.2.
Finally, there's always hope. :-) The latest thing I've been doing is running Zope in a debug build of Python. A debug build makes a magical "sys._getobjects()" available,
Where do you discover these things? :-)
allowing you to inspect all live objects through a remote console.
Interesting, which one do you use? does Boa have a remote console? or is it just plain monitor port?
Since debug builds aren't much slower than standard builds, you can even run a debug build in production for short periods of time. I've been building a small library of functions for working in this mode, and if you need them, I'll pass them along. I'd have to warn you that they are anything but intuitive in their purpose and use, though.
I'll let you know if it gets to that :-)

For now we'll be stress testing to try and locate the URLs that cause the most leakage.

Cheers, Leo

PS: to the people who offered other suggestions: Yes, we're already running ZEO to restart faster. No, buying more memory now is not an option. Yes, we're already automatically restarting the process when the available memory on the machine gets to a critical state, but not through AutoLance.

-- Ideas don't stay in some minds very long because they don't like solitary confinement.
Leonardo Rochael Almeida wrote:
On Sat, 2003-08-23 at 22:18, Shane Hathaway wrote:
When you flush the cache, those DateTimes should disappear. If they don't, the leak is keeping them.
They are disappearing. Too bad they return immediately afterward, as they're coming from a very heavy catalog query result.
Hmm, that tends to indicate you really need that much memory.
1) The ZODB cache size. The meaning of this number changed dramatically in 2.6. Before 2.6 it was a very vague number. In 2.6 it's a target number of objects that Zope actually tries to maintain. Before 2.6 it might have made sense to set the ZODB cache size to some arbitrarily high number like 100,000; in 2.6 you want to start at about 2000 and adjust from there. There are tools in 2.6 for helping you adjust the number.
We're keeping it at 15,000. Most threads keep around 12k and 13k. One of them sometimes exaggerates at 100k or so. Hmm, funny... when I look at them now, they're all under 5k...
Maybe the individual ZODB objects are excessively large. A few months ago we found a bug in the new zope.org software that made it store file uploads as a single chunk. A lot of those files were over 1 MB in size. This exhausted 1 GB of RAM pretty quickly. Splitting the files into chunks, like Zope usually does automatically, solved the problem.
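The chunking Shane describes can be illustrated with a minimal, hypothetical sketch. Zope's real mechanism (the Pdata linked-list records in OFS.Image) is more involved; the point here is only that a blob stored as many small pieces never forces the ZODB cache to hold one multi-megabyte string per record:

```python
CHUNK = 64 * 1024  # illustrative chunk size, not Zope's actual value

def split_chunks(data, size=CHUNK):
    """Store a large blob as a list of small chunks rather than one
    giant string, so loading any single record stays cheap."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```

A 1 MB upload stored as one object pins 1 MB in the cache the moment it's touched; stored as ~16 chunks of 64 KB, only the pieces actually read need to be in memory.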
2) The number of ImplicitAcquisitionWrappers present in the system. I have found it to be a reliable indicator of whether you have a leak or not. Expect this number to stay under 400 or so. If it grows gradually, there's a leak. Watch the refcounts screen.
It keeps under 200. At least when I'm looking :-)
That's another strong indication that it's not actually leaking. Or perhaps it's a class of leak I have not yet encountered. ;-)
4) Don't use 2.6.1. Use 2.6.2, which has fixes for known leaks. It is actually already tagged in CVS as "Zope-2_6_2", and it's what zope.org is now running. Various unrelated things prevented a formal release this week.
You guys are sure about this? The client is very much against running things from CVS because they just don't want to go thru another upgrade processes when the "real" 2.6.2 shows up.
Well, 2.6.2b6 has been released (but not announced), and I've been promised there will be no more changes before 2.6.2.
If all else fails, grep all Python modules for "sys._getframe()" and "sys.exc_info()". These are the primary causes of memory leaks in Python 2.1 and below. If you're brave, you can just run Zope under Python 2.2, which fixes those particular leaks AFAIK.
Do I need to be particularly brave to run 2.6.2 (as tagged in the CVS) under Python2.2? is it still an "unsupported" combination? I know it's 2.7's job to be 2.2 compliant, but I've seen reports of more and more people running Zope 2.6.x under Python 2.2.
2.6 is also intended to be compatible with Python 2.2. It's just that here at ZC we haven't gone through the rigorous analysis we've been planning. Hmm, come to think of it, the community could actually do that analysis. It wouldn't be terribly difficult. Any volunteers?
Finally, there's always hope. :-) The latest thing I've been doing is running Zope in a debug build of Python. A debug build makes a magical "sys._getobjects()" available,
Where do you discover these things? :-)
I learned this one from Jeremy Hylton. He posted some info about it on his weblog after he mentioned it. http://www.python.org/~jeremy/weblog/030410.html
allowing you to inspect all live objects through a remote console.
Interesting, which one do you use? does Boa have a remote console? or is it just plain monitor port?
I rolled my own remote console. The monitor port is too buggy and Boa is overkill for this.
Since debug builds aren't much slower than standard builds, you can even run a debug build in production for short periods of time. I've been building a small library of functions for working in this mode, and if you need them, I'll pass them along. I'd have to warn you that they are anything but intuitive in their purpose and use, though.
I'll let you know if it gets to that :-)
For now we'll be stress testing to try and locate the URLs that cause the most leakage.
That's an excellent approach. Shane
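That per-URL stress testing can be approximated with a crude resident-size probe. This is a hedged, Unix-only sketch (stdlib `resource` module; the function names are made up, and ru_maxrss units differ by platform: kilobytes on Linux, bytes on macOS), not what either poster actually ran:

```python
import resource  # Unix-only

def rss_kb():
    """Peak resident set size of the current process
    (kilobytes on Linux; macOS reports bytes)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def probe(fn, hits=100):
    """Call fn() repeatedly and report the growth in peak RSS --
    a crude way to rank request handlers by leakage."""
    before = rss_kb()
    for _ in range(hits):
        fn()
    return rss_kb() - before
```

To rank URLs, point `fn` at something like `lambda: urllib.request.urlopen(url).read()` for each candidate URL and compare the reported growth; a URL whose growth scales with `hits` is a leak suspect.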
On Mon, Aug 25, 2003 at 02:57:50PM -0400, Shane Hathaway wrote:
2.6 is also intended to be compatible with Python 2.2. It's just that here at ZC we haven't gone through the rigorous analysis we've been planning. Hmm, come to think of it, the community could actually do that analysis. It wouldn't be terribly difficult. Any volunteers?
Maybe. I'd really like to run with python 2.2. What is involved in the "rigorous analysis"? Is there a plan I could follow? -- Paul Winkler http://www.slinkp.com Look! Up in the sky! It's THE BABY! (random hero from isometric.spaceninja.com)
Paul Winkler wrote:
On Mon, Aug 25, 2003 at 02:57:50PM -0400, Shane Hathaway wrote:
2.6 is also intended to be compatible with Python 2.2. It's just that here at ZC we haven't gone through the rigorous analysis we've been planning. Hmm, come to think of it, the community could actually do that analysis. It wouldn't be terribly difficult. Any volunteers?
Maybe. I'd really like to run with python 2.2. What is involved in the "rigorous analysis"? Is there a plan I could follow?
We need to follow the same process we used to move to Python 2.1.

http://dev.zope.org/DevHome/Wikis/DevSite/Projects/SupportPython21/FrontPage

This time it should be simpler, though. We first need to produce a migration assessment, similar to this one:

http://dev.zope.org/Wikis/DevSite/Projects/SupportPython21/MigrationAssessme...

Basically, for each significant change made to Python, we need to evaluate its impact on Zope. Simply putting up a wiki page listing all changes between Python 2.1 and 2.2 would be a good start. If you want to, you could even include changes up to 2.3, so we don't have to do this work again to support Python 2.3. Hopefully, the evaluation will lead to no changes to Zope, but to have confidence in the evaluation, we need to be thorough and impartial.

Shane
[Paul Winkler]
Maybe. I'd really like to run with python 2.2. What is involved in the "rigorous analysis"? Is there a plan I could follow?
[Shane Hathaway]
We need to follow the same process we used to move to Python 2.1.
http://dev.zope.org/DevHome/Wikis/DevSite/Projects/SupportPython21/FrontPage
This time it should be simpler, though. We first need to produce a migration assessment, similar to this one:
http://dev.zope.org/Wikis/DevSite/Projects/SupportPython21/MigrationAssessment
Basically, for each significant change made to Python, we need to evaluate its impact on Zope. Simply putting up a wiki page listing all changes between Python 2.1 and 2.2 would be a good start. If you want to, you could even include changes up to 2.3, so we don't have to do this work again to support Python 2.3.
I'd actually recommend skipping 2.2 and going straight to 2.3 -- 2.3 is basically 2.2 + 16 months of bugfixes and speedups, a very solid release. It will create more problems for Zope because 2.3 starts complaining about more deprecated practices that 2.2.3 still lets slide, but they should be minor problems at worst. There are deeper benefits too. For example, the experimental ZODB3 3.3 alpha release requires Python 2.3; because of a new-style class bug that still exists in Python 2.2.3, it's not at all trivial to get ZODB 3.3 to work with Python 2.2.3. I'd dearly love to skip that pointlessly time-consuming exercise.
Tim Peters wrote:
I'd actually recommend skipping 2.2 and going straight to 2.3 -- 2.3 is basically 2.2 + 16 months of bugfixes and speedups, a very solid release. It will create more problems for Zope because 2.3 starts complaining about more deprecated practices that 2.2.3 still lets slide, but they should be minor problems at worst. There are deeper benefits too. For example, the experimental ZODB3 3.3 alpha release requires Python 2.3; because of a new-style class bug that still exists in Python 2.2.3, it's not at all trivial to get ZODB 3.3 to work with Python 2.2.3. I'd dearly love to skip that pointlessly time-consuming exercise.
+1 Shane
On Mon, 2003-08-25 at 15:57, Shane Hathaway wrote:
Leonardo Rochael Almeida wrote:
On Sat, 2003-08-23 at 22:18, Shane Hathaway wrote:
When you flush the cache, those DateTimes should disappear. If they don't, the leak is keeping them.
They are disappearing. Too bad they return immediately afterward, as they're coming from a very heavy catalog query result.
Hmm, that tends to indicate you really need that much memory.
If I just ask the client to put more memory in the machine, he'll rightfully grill me, because the site was working "less worse" before the upgrade. The fact remains that under 2.5.1 the leak would usually allow the site to run for a number of hours before we needed to restart Zope.

But maybe this means that the leak is not related to the DateTime refcounts. It's just that the fast and continual increase in DateTime objects is really bugging me. BTW, what is the usual DateTime refcount in the heavy ZCatalog sites you guys run?

Let me restate something important, because I forget about it myself sometimes when I'm thinking about this problem: *upgrading to 2.6.1 made the situation definitely worse*. The site became a lot faster, but the memory situation became intolerable, so I believe something introduced between 2.5.1 and 2.6.1 is causing our problems. Is there a fix in 2.6.2 for some leak introduced *in* the 2.6 series?

Another thing: we have already established that clearing the caches resets the DateTime refcount, so the DateTimes must be anchored to the caches somehow. So, if the DateTime refcounts are constantly in the 50k-100k range, how come the cache object count for all threads is below 5k? If they're indirectly attached, it means that a lot of DateTimes are anchored in just a few objects, so no matter how much I reduce the target cache size, it might never be enough to throw out the DateTimes...
[...] Maybe the individual ZODB objects are excessively large. A few months ago we found a bug in the new zope.org software that made it store file uploads as a single chunk. A lot of those files were over 1 MB in size. This exhausted 1 GB of RAM pretty quickly. Splitting the files into chunks, like Zope usually does automatically, solved the problem.
Nice to know, but we don't store large files or other objects.
it just plain monitor port?
I rolled my own remote console. The monitor port is too buggy and Boa is overkill for this.
hmm, any chance of this remote console showing up somewhere? :-) I might use it to navigate the thread object cache. -- Ideas don't stay in some minds very long because they don't like solitary confinement.
On 08/25/2003 05:12 PM, Leonardo Rochael Almeida wrote:
But maybe this means that the leak is not related to the DateTime refcounts. It's just that the fast and continual increase in DateTime objects is really bugging me. BTW, what is the usual DateTime refcount in heavy ZCatalog sites you guys run?
We try to keep it near 0. :-) DateTimes are full of nonsense. Zope's DateTime troubles drove the creation of Python 2.3's new datetime module. From what I can tell, the new datetimes are excellent.
Let me restate something important, because I forget about it myself sometimes when I'm thinking about this problem: *upgrading to 2.6.1 made the situation definitely worse*. The site became a lot faster, but the memory situation became intolerable, so I believe there is something introduced between 2.5.1 and 2.6.1 that is causing our problems. Is there a fix in 2.6.2 for some leak introduced *in* the 2.6 series?
I think there are, actually, but AFAIK they were obscure and rare.
Another thing, we have already established that clearing the caches resets the DateTime refcount, so the DateTimes must be anchored to the caches somehow. So, if the DateTime refcounts are constantly in the 50k-100k range, how come the cache object count for all threads is below 5k?
If the DateTimes are in catalog metadata or indexes, this would make sense.
If they're indirectly attached, it means that a lot of DateTimes are anchored in just a few objects, so no matter how much I reduce the target cache size, it might never be enough to throw out the DateTimes...
They are probably in the ZCatalog. Remove all metadata fields you can. If you have index dates, use DateTimeIndexes rather than the standard indexes. I find it helpful to simply visit the information ZCatalog is storing for just one cataloged object. In the catalog contents view, click one object. Look at each field and consider how ZCatalog might misuse that field for other objects.
hmm, any chance of this remote console showing up somewhere? :-)
I've attached what I threw together. It Works For Me. ;-) I put it in Zope's Extensions directory then made an external method to call the "listenlocal" function. It sets up a TCP listener bound only to localhost, so you have to telnet to port 8765 locally. No readline support. No method is provided for shutting down the listener (except restarting Zope). To exit the session, push ctrl-D then enter.
I might use it to navigate the thread object cache.
It's unfortunate you have to go so deep, but that sounds like a good plan.

Shane

```python
import socket
import sys
import time
from code import InteractiveConsole
from types import IntType, StringType
from thread import start_new_thread, get_ident


class ThreadedObjectProxy:
    """Proxy to different objects based on which thread invoked it."""

    def __init__(self, default):
        self._default = default
        self._alts = {}

    def setAlternative(self, o):
        self._alts[get_ident()] = o

    def delAlternative(self):
        try:
            del self._alts[get_ident()]
        except KeyError:
            pass

    def __getattr__(self, name):
        ob = self._alts.get(get_ident(), self._default)
        return getattr(ob, name)


class RemoteConsole(InteractiveConsole):

    def __init__(self, sock, file, filename=None, locals=None):
        if filename is None:
            filename = str(file)
        self.sock = sock
        self.file = file
        InteractiveConsole.__init__(self, locals=locals, filename=filename)

    def raw_input(self, prompt=''):
        if prompt:
            self.file.write(prompt)
        s = self.file.readline().rstrip()
        if s == '\x04':  # Ctrl-D
            raise EOFError
        return s

    def interactAndClose(self):
        sys.stdout.setAlternative(self.file)
        sys.stderr.setAlternative(self.file)
        sys.stdin.setAlternative(self.file)
        try:
            try:
                self.interact()
            except EOFError:
                pass
        finally:
            sys.stdout.delAlternative()
            sys.stderr.delAlternative()
            sys.stdin.delAlternative()
            self.file.close()
            self.sock.close()


def setupStreams():
    if not hasattr(sys.stdout, 'setAlternative'):
        sys.stdout = ThreadedObjectProxy(sys.stdout)
    if not hasattr(sys.stderr, 'setAlternative'):
        sys.stderr = ThreadedObjectProxy(sys.stderr)
    if not hasattr(sys.stdin, 'setAlternative'):
        sys.stdin = ThreadedObjectProxy(sys.stdin)


def accept_connections(s):
    while 1:
        cs, addr = s.accept()
        f = cs.makefile('w+', 0)
        i = RemoteConsole(cs, f)
        start_new_thread(i.interactAndClose, ())


def listen(addr, locals=None):
    setupStreams()
    if isinstance(addr, StringType):
        t = socket.AF_UNIX
    elif isinstance(addr, IntType):
        t = socket.AF_INET
        addr = ('', addr)
    else:
        t = socket.AF_INET
    s = socket.socket(t, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR,
                 s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) | 1)
    s.bind(addr)
    s.listen(1)
    start_new_thread(accept_connections, (s,))


def listenlocal(port=8765):
    listen(('localhost', port))


if __name__ == '__main__':
    listen(8765)
    while 1:
        time.sleep(1)
```
On Mon, Aug 25, 2003 at 10:35:58PM -0400, Shane Hathaway wrote:
They are probably in the ZCatalog. Remove all metadata fields you can. If you have index dates, use DateTimeIndexes rather than the standard indexes.
Hmm. ZCatalog provides a DateIndex, is this what you mean? Googling for DateTimeIndex & zope turns up exactly two results, in German, which I can't read (the google-supplied translations are pretty much gibberish).

Hmmm again: I notice that out-of-the-box, CMF Default's portal_catalog uses the following:

  Date       FieldIndex
  created    FieldIndex
  effective  FieldIndex
  expires    FieldIndex
  modified   FieldIndex

Why are these FieldIndex rather than DateTime index?

-- 
Paul Winkler
http://www.slinkp.com
Paul Winkler wrote:
On Mon, Aug 25, 2003 at 10:35:58PM -0400, Shane Hathaway wrote:
They are probably in the ZCatalog. Remove all metadata fields you can. If you have index dates, use DateTimeIndexes rather than the standard indexes.
Hmm. ZCatalog provides a DateIndex, is this what you mean? Googling for DateTimeIndex & zope turns up exactly two results, in German which I can't read. (the google-supplied translations are pretty much gibberish.)
Yes, I meant DateIndex.
Hmmm again: I notice that out-of-the-box, CMF Default's portal_catalog uses the following:
  Date       FieldIndex
  created    FieldIndex
  effective  FieldIndex
  expires    FieldIndex
  modified   FieldIndex
Why are these FieldIndex rather than DateTime index?
No one has corrected them, probably out of fear that something might break. Shane
On Mon, 2003-08-25 at 23:35, Shane Hathaway wrote:
On 08/25/2003 05:12 PM, Leonardo Rochael Almeida wrote:
Let me restate something important, because I forget about it myself sometimes when I'm thinking about this problem: *upgrading to 2.6.1 made the situation definitely worse*. The site became a lot faster, but the memory situation became intolerable, so I believe there is something introduced between 2.5.1 and 2.6.1 that is causing our problems. Is there a fix in 2.6.2 for some leak introduced *in* the 2.6 series?
I think there are, actually, but AFAIK they were obscure and rare.
That would make the likelihood of our problems being caused by them very small. Is there any way to know if we're hitting them?

Let's assume we're not hitting a 2.6 bug; we could be hitting a 2.6 feature :-) I've read Casey's article about optimizing ZCatalog and I know there has been a lot of ZCatalog optimization work. Have there been any changes in 2.6 (in ZCatalog, the thread cache, or anywhere else) that traded memory for speed? If there was such a change, it would explain why we had a slow site that did fit into memory while now we have a fast one that doesn't.
Another thing, we have already established that clearing the caches resets the DateTime refcount, so the DateTimes must be anchored to the caches somehow. So, if the DateTime refcounts are constantly in the 50k-100k range, how come the cache object count for all threads is below 5k?
If the DateTimes are in catalog metadata or indexes, this would make sense.
If they're indirectly attached, it means that a lot of DateTimes are anchored in just a few objects, so no matter how much I reduce the target cache size, it might never be enough to throw out the DateTimes...
They are probably in the ZCatalog.
You mean that one ZCatalog object is in the cache, and (indirectly) attached to it are thousands of DateTimes? Then we have a huge problem. I was counting on the "objects in cache" parameter to put an upper bound in Zope memory consumption, even if this upper bound was a little fuzzy.
Remove all metadata fields you can. If you have index dates, use DateTimeIndexes rather than the standard indexes.
Well, this is a classic ZCatalog+ZClass site. Almost all ZClasses are CatalogAware and they're all cataloged in the monster ZCatalog named "Catalog" in the folder that represents the root of the site (as served by Apache), because that's how CatalogAware ZClasses work out of the box (I know the catalog name can be changed). It has a huge amount of Indexes and Metadata. The client grew the site organically, creating ZClasses and hanging them in the same ZCatalog as he was needing them. It's gonna take a lot of time to untangle it all and we're working on it, but for now we need to deal with the emergency that is the fact that migrating to 2.6.1 put us in the very hot spot of not having a way to put an upper bound on Zope memory consumption.
I find it helpful to simply visit the information ZCatalog is storing for just one cataloged object. In the catalog contents view, click one object. Look at each field and consider how ZCatalog might misuse that field for other objects.
What kind of misuse do you have in mind?
hmm, any chance of this remote console showing up somewhere? :-)
I've attached what I threw together.
Thanks!
I might use it to navigate the thread object cache.
It's unfortunate you have to go so deep, but that sounds like a good plan.
Anyone know, off the top of their head, how I get to the object caches?

Cheers, Leo
-- 
Ideas don't stay in some minds very long because they don't like solitary confinement.
Leonardo Rochael Almeida wrote:
Let's assume we're not hitting a 2.6 bug; we could be hitting a 2.6 feature :-) I've read Casey's article about optimizing ZCatalog and I know there has been a lot of ZCatalog optimization work. Have there been any changes in 2.6 (in ZCatalog, the thread cache, or anywhere else) that traded memory for speed?
Not that I know of. To the contrary, I think--we sometimes can't do an optimization because it would consume extra memory.
You mean that one ZCatalog object is in the cache, and (indirectly) attached to it are thousands of DateTimes? Then we have a huge problem.
Why? Although the catalog manages those DateTime objects, the DateTimes are in many ZODB records.
I was counting on the "objects in cache" parameter to put an upper bound in Zope memory consumption, even if this upper bound was a little fuzzy.
In the past, the cache size parameter was very fuzzy. It was not an upper bound. Now, it is not only an upper bound, it also tends to act as a lower bound. (Note that the cache is still allowed to grow indefinitely within the scope of a request, however.)
Remove all metadata fields you can. If you have index dates, use DateTimeIndexes rather than the standard indexes.
Well, this is a classic ZCatalog+ZClass site. Almost all ZClasses are CatalogAware and they're all cataloged in the monster ZCatalog named "Catalog" in the folder that represents the root of the site (as served by Apache) because that's how CatalogAware ZClasses work out of the box (I know the catalog name can be changed). It has a huge amount of Indexes and Metadata. The client grew the site organically, creating ZClasses and hanging them in the same ZCatalog as he was needing them.
It's gonna take a lot of time to untangle it all and we're working on it, but for now we need to deal with the emergency that is the fact that migrating to 2.6.1 put us in the very hot spot of not having a way to put an upper bound on Zope memory consumption.
I would characterize it the other way around; there was no way to establish a hard boundary until now. I suggest you reduce the ZODB cache size to 2000 (or less) and see if the problem goes away.
I find it helpful to simply visit the information ZCatalog is storing for just one cataloged object. In the catalog contents view, click one object. Look at each field and consider how ZCatalog might misuse that field for other objects.
What kind of misuse you have in mind?
Large strings, DateTimes, and mutable objects are all bad for the catalog.
Anyone know off the top of their head how do I get to the object caches?
app._p_jar._cache will get you started. It is dictionary-like. Shane
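From the remote console, one quick thing to do with that dictionary-like cache is tally which classes dominate it, which is essentially a per-thread version of the Control_Panel refcount listing. A minimal sketch (`count_classes` is a hypothetical helper; it assumes only that the cache supports `.values()` like a dict, and a plain dict stands in for `app._p_jar._cache` here):

```python
from collections import Counter

def count_classes(cache):
    """Tally cached objects by class name, most common first.

    `cache` is any dictionary-like mapping of oid -> object, such as
    (per Shane's hint) app._p_jar._cache, or a plain dict standing in
    for it here.
    """
    return Counter(type(ob).__name__ for ob in cache.values()).most_common()

# Stand-in cache for illustration: two strings and one list under fake oids.
fake_cache = {1: "a", 2: "b", 3: []}
print(count_classes(fake_cache))  # → [('str', 2), ('list', 1)]
```

Comparing such a tally against the global refcounts is one way to see whether the DateTimes are actually resident in a given thread's cache or merely reachable from it.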
Shane Hathaway wrote:
as a lower bound. (Note that the cache is still allowed to grow indefinitely within the scope of a request, however.)
Which is still pretty broken IMHO :-( This is the single biggest cause of Zope becoming unresponsive for me: people doing silly things that drag hordes of objects into memory in a single request, made worse by the fact that Python is extraordinarily bad at giving memory back to the OS after that request has been completed ;-)

cheers, Chris
On Fri, 2003-08-29 at 06:28, Chris Withers wrote:
Shane Hathaway wrote:
as a lower bound. (Note that the cache is still allowed to grow indefinitely within the scope of a request, however.)
Which is still pretty broken IMHO :-(
This is the single biggest cause of Zope becoming unresponsive for me: people doing silly things that drag hordes of objects into memory in a single request,
Now that would be an interesting feature: an upper bound on the number of objects a request is allowed to touch, period. If a request requires more than that, it's rolled back.
made worse by the fact that Python is extraordinarily bad at giving memory back to the OS after that request has been completed ;-)
On Linux (and probably other *nixes), this is a libc (mis)feature. You see, malloc() is a libc abstraction. The memory the OS makes available to each process is actually a contiguous block with a variable upper (or is it lower?) bound. This bound is usually changed through the sbrk() call, which is a wrapper around the brk() call. The latter tells the OS the amount of heap space the process thinks it needs; the former is an increment/decrement wrapper around the latter.

When the process runs out of heap (malloc()) space, it (the libc in the Python interpreter, in our case) calls sbrk() with a positive number (i.e. it asks for more memory). Now, no matter how much memory is later free()d (or, in our case, how many Python objects are deleted), the system never calls sbrk() with a negative value. If it were to do that, it (the libc) would first have to make sure that there are no pointers to allocated space still in that area. I'm not sure, but I think the glibc on Linux just doesn't bother. Defragmentation of the area is not an option because, in C, due to pointer arithmetic, you can never be sure where all the pointers are, to change them to the new locations.

Now, this is all from memory (and a cursory read of the sbrk(2) man page), so the above could be full of, erm..., incorrections :-) Corrections to the above explanation are welcome.

Cheers, Leo
-- 
Ideas don't stay in some minds very long because they don't like solitary confinement.
Leonardo Rochael Almeida wrote:
as a lower bound. (Note that the cache is still allowed to grow indefinitely within the scope of a request, however.)
This is the single biggest cause of Zope becoming unresponsive for me: people doing silly things that drag hordes of objects into memory i na single request,
Now that would be an interesting feature: an upper bound on the number of objects a request is allowed to touch, period. If a request requires more than that, it's rolled back.
Hmm, not so useful, if people just keep retrying the request. What I'd like to see is the cache checking its size on object load. If the object load would cause the cache to go above its maximum number, then boot an object out of the cache in order to make room for the new one. So you'd get slowness because of cache thrashing on THAT PARTICULAR REQUEST, but at least you'd be able to control the amount of memory Zope actually uses, and other requests would stand a chance of being processed normally.

<snip very enlightening explanation of why a Python process using lots of memory, even for a short period of time, is a 'bad thing'>

That was pretty informative, but it gives even more of a good reason why we really need to be able to put a maximum upper bound on the amount of memory Zope can use at any one point...

Chris
On Wednesday 03 September 2003 12:15, Chris Withers wrote:
If the object load would cause the cache to go above its maximum number,
*Number* isn't the right parameter to control here... We need to limit the total amount of RAM. Objects are of variable size, and the largest ZODB objects are very much bigger than the average. One task I would like to find time for is making the cache aware of this. For example, this would fix the problem where the current cache unfairly penalises ZCatalog operations because its BTree nodes are so small.
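A cache bounded by total estimated bytes rather than entry count could be sketched like this (a toy illustration, not Zope's actual cache; `SizedCache` and the per-object `size` argument are hypothetical, assuming the recorded pickle size roughly tracks in-memory size):

```python
from collections import OrderedDict

class SizedCache:
    """Toy LRU cache bounded by total estimated bytes, not entry count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total = 0
        self._items = OrderedDict()  # oid -> (obj, size); oldest entries first

    def load(self, oid, obj, size):
        if oid in self._items:
            self._items.move_to_end(oid)  # mark as most recently used
            return self._items[oid][0]
        self._items[oid] = (obj, size)
        self.total += size
        # Evict least-recently-used entries until under the byte budget,
        # always keeping at least the entry just loaded.
        while self.total > self.max_bytes and len(self._items) > 1:
            _, (_, evicted_size) = self._items.popitem(last=False)
            self.total -= evicted_size
        return obj

cache = SizedCache(max_bytes=100)
cache.load('small', object(), 10)
cache.load('big', object(), 95)  # pushes total to 105, so 'small' is evicted
```

Under this scheme one 95-byte object costs as much as many small BTree nodes, which is exactly the unfairness Toby describes with count-based limits.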
then boot an object out of the cache in order to make room for the new one.
That would have a bad effect on ReadConflictErrors. Cache purging should only happen on transaction boundaries for storages where ReadConflictErrors are possible.
control the amount of memory Zope actually uses and other requests would stand a chance of beind processed normally.
I haven't seen a mention of ulimit or autolance earlier in this thread... They are mostly adequate protection against the worst problems.

-- 
Toby Dickenson
Toby Dickenson wrote:
On Wednesday 03 September 2003 12:15, Chris Withers wrote:
If the object load would cause the cache to go above its maximum number,
*Number* isnt the the right parameter to control here..... We need to limit the total amount of RAM. Objects are of variable size, and the largest ZODB objects are very much bigger than the average.
This is true, but isn't this much harder to do? As we run into the same issue that people have trying to produce folderish objects for Zope that limit the size of their contained objects? Hmmm, maybe not... could we make a note of the pickle's size when the data is loaded and update that size when it's committed? Is this the same as the in-memory size?
then boot an object out of the cache in order to make room for the new one.
That would have a bad effect on ReadConflictErrors.
Don't follow, can you explain?
Cache purging should only happen on transaction boundaries for storages where ReadConflictErrors are possible.
Can you put some brackets in that statement, I don't follow it..
I haven't seen a mention of ulimit or autolance earlier in this thread... They are mostly adequate protection against the worst problems.
Do they kill the thread or the whole Zope? I'm also keen for users not to get MemoryErrors, but to just have their request take much longer (cache thrashing and the like...)

cheers, Chris
On Friday 05 September 2003 14:21, Chris Withers wrote:
Hmmm, maybe not... could we make a note of the pickle's size when the data is loaded and update that size when it's committed? Is this the same as the in-memory size?
Yes, I hope it's close enough for most purposes. It would be nice to have a way for the object to override that in the cases where it is badly wrong.
then boot an object out of the cache in order to make room for the new one.
That would have a bad effect on ReadConflictErrors.
Don't follow, can you explain?
You get a ReadConflictError when loading an object if it has been modified since the start of the transaction. This exception therefore becomes increasingly likely as time progresses since the start of the transaction. Today you can minimise the probability of a read conflict by touching the objects you need (at least the ones likely to change) near the start of the transaction. This works because, once loaded, the objects stay in memory until the end of the transaction. Another problem I didn't mention is that _v_ attributes are supposed to last at least until the end of the transaction.
I haven't seen a mention of ulimit or autolance earlier in this thread... They are mostly adequate protection against the worst problems.
Do they kill the thread or the whole Zope?
Whole Zope. Of course, there isn't a great overhead in using twice as many Zopes, each with half as many publisher threads (assuming you already have ZEO, of course).
I'm also keen for users not to get MemoryErrors, but to just have their request take much longer (cache thrashing and the like...)
Use squid, and it will retry the request.

-- 
Toby Dickenson
Toby Dickenson wrote:
On Friday 05 September 2003 14:21, Chris Withers wrote:
[pickle size = memory usage]
Yes, I hope it's close enough for most purposes. It would be nice to have a way for the object to override that in the cases where it is badly wrong.
Good call, that sounds like an excellent plan...
You get a ReadConflictError when loading an object if it has been modified since the start of the transaction. This exception therefore becomes increasingly likely as time progresses since the start of the transaction.
What's the thing Zope Corp are touting as the long term solution to this problem?
Today you can minimise the probability of a read conflict by touching the objects you need (at least the ones likely to change) near the start of the transaction. This works because, once loaded, the objects stay in memory until the end of the transaction.
Right, but if you need to touch them to avoid Read Conflicts, surely that means the other transaction which was likely to stomp on your objects gets the ReadConflict instead? Is that any better?
Another problem I didn't mention is that _v_ attributes are supposed to last at least until the end of the transaction.
bugger :-( What about subtransactions?
of course there isnt a great overhead in using twice as many zopes, each with half as many publisher threads. (assuming you already have zeo, of course)
Hmmm, not pretty though, is it? cheers, Chris
[Toby Dickenson]
You get a ReadConflictError when loading an object if it has been modified since the start of the transaction. This exception therefore becomes increasingly likely as time progresses since the start of the transaction.
[Chris Withers]
What's the thing Zope Corp are touting as the long term solution to this problem?
"Multiversion concurrency control" (MVCC for short) is the next step. If no other crises intervene, Jeremy and I will start implementing that on the ZODB3 3.2 branch (most likely that branch -- can't swear to it) soon (ditto). You can google on the phrase to get a ton of more-or-less abstract overviews of the concept.

The short course in ZODB is that, when MVCC is in effect, a read will return the state of the object as of the time the current transaction began, even if the object has subsequently been modified by some other transaction. The overriding concern in all schemes is that you don't see inconsistent data. The current ReadConflictError prevents you from seeing inconsistent data by preventing you from loading objects that have changed since your current transaction began. MVCC prevents it by (possibly) delivering non-current object states.

I don't think either can be viewed as a pure win. The ways in which ReadConflictError loses are obvious to people here because they've experienced them. The ways in which MVCC loses will become obvious later <0.9 wink>.
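The read-as-of-transaction-start behaviour can be illustrated with a toy multiversion store (entirely hypothetical names and structure, nothing like ZODB's actual implementation):

```python
import itertools

class MVStore:
    """Toy multiversion store: every write gets a serial number, and a
    transaction reads the newest version not newer than its start serial."""

    def __init__(self):
        self._clock = itertools.count(1)
        self._versions = {}  # key -> [(serial, value), ...] in write order

    def write(self, key, value):
        self._versions.setdefault(key, []).append((next(self._clock), value))

    def begin(self):
        return Txn(self, next(self._clock))

class Txn:
    def __init__(self, store, start):
        self._store, self._start = store, start

    def read(self, key):
        # Scan versions newest-first; return the first one visible to us.
        for serial, value in reversed(self._store._versions.get(key, [])):
            if serial <= self._start:
                return value
        raise KeyError(key)

store = MVStore()
store.write('balance', 100)
t = store.begin()           # transaction starts here
store.write('balance', 0)   # concurrent modification after t began
print(t.read('balance'))    # → 100: t sees the old state, no ReadConflictError
```

Under the ReadConflictError regime, that final read would instead raise, because the object changed after the transaction began; MVCC trades the exception for possibly stale (but consistent) data.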
Tim Peters wrote:
"Multiversion concurrency control" (MVCC for short) is the next step. If no other crises intervene, Jeremy and I will start implementing that on the ZODB3 3.2 branch (most likely that branch -- can't swear to it) soon (ditto).
From Jeremy's recent notes, I think it'll be the Zope 2.6 branch...
You can google on the phrase to get a ton of more-or-less abstract overviews of the concept. The short course in ZODB is that, when MVCC is in effect, a read will return the state of the object as of the time the current transaction began, even if the object has subsequently been modified by some other transaction.
Sounds great :-) If you need any implementations testing, do let me know!
The overriding concern in all schemes is that you don't see inconsistent data.
Yep.
The current ReadConflictError prevents you from seeing inconsistent data by preventing you from loading objects that have changed since your current transaction began.
Indeed, but that could be a lot of things, and often in situations where it wouldn't matter, as long as the data is consistent...
MVCC prevents it by (possibly) delivering non-current object states. I don't think either can be viewed as a pure win. The ways in which ReadConflictError loses are obvious to people here because they've experienced them. The ways in which MVCC loses will become obvious later <0.9 wink>.
Any ideas what they'll be yet? Chris
On Fri, 2003-09-12 at 07:34, Chris Withers wrote:
Tim Peters wrote:
"Multiversion concurrency control" (MVCC for short) is the next step. If no other crises intervene, Jeremy and I will start implementing that on the ZODB3 3.2 branch (most likely that branch -- can't swear to it) soon (ditto).
From Jeremy's recent notes, I think it'll be the Zope 2.6 branch...
Nope. The 'ZODB3-3_1-branch' has retired; all future "ZODB3 3.1.x" work will take place on the 'Zope-2_6-branch'. MVCC, however, is new feature work, and will take place on the 3.2 branch (which may be the Zope-2_7-branch :)
You can google on the phrase to get a ton of more-or-less abstract overviews of the concept. The short course in ZODB is that, when MVCC is in effect, a read will return the state of the object as of the time the current transaction began, even if the object has subsequently been modified by some other transaction.
Sounds great :-) If you need any implementations testing, do let me know!
The overriding concern in all schemes is that you don't see inconsistent data.
Yep.
The current ReadConflictError prevents you from seeing inconsistent data by preventing you from loading objects that have changed since your current transaction began.
Indeed, but that could be a lot of things, and often in situations where it wouldn't matter, as long as the data is consistent...
MVCC prevents it by (possibly) delivering non-current object states. I don't think either can be viewed as a pure win. The ways in which ReadConflictError loses are obvious to people here because they've experienced them. The ways in which MVCC loses will become obvious later <0.9 wink>.
Any ideas what they'll be yet?
Some applications will have no downside at all; for instance, requests that do no writing to the database are pure wins for MVCC (e.g., long-running "report" requests). Requests that write to the database on the basis of stale-but-consistent data *may* still be OK, but that will need to be a policy set by the application.

Tres.
-- 
===============================================================
Tres Seaver        tseaver@zope.com
Zope Corporation   "Zope Dealers"   http://www.zope.com
[Tim]
... The ways in which MVCC loses will become obvious later <0.9 wink>.
[Chris Withers]
Any ideas what they'll be yet?
Probably none for many apps. You'll be working with possibly non-current data, so think of ways your apps could possibly be damaged by that.

For example, you're Bill Gates, using ZODB to track all your assets. A summary report takes hours to generate, and by the time you get it, perhaps a few of your billion-dollar overseas accounts were wiped out in the wee hours by an adverse court judgment, but the total you get adds in the account values as of the time the report-generating transaction began. Oops.

To the extent that MVCC hides that you're working with non-current data, to that extent also does an app relying on current data become vulnerable. When Bill is contemplating fleeing the country during turbulent times, he presumably needs to know how much cash he has right now, not what he had last night. Most apps aren't like that, but a one-size-fits-all policy for long-running transactions (like Bill's) doesn't exist.
Tim Peters wrote:
Probably none for many apps. You'll be working with possibly non-current data, so think of ways your apps could possibly be damaged by that. For example, you're Bill Gates, using ZODB to track all your assets. A summary report takes hours to generate, and by the time you get it, perhaps a few of your billion-dollar overseas accounts were wiped out in the wee hours by an adverse court judgment, but the total you get adds in the account values as of the time the report-generating transaction began. Oops. To the extent that MVCC hides that you're working with non-current data, to that extent also does an app relying on current data become vulnerable. When Bill is contemplating fleeing the country during turbulent times, he presumably needs to know how much cash he has right now, not what he had last night. Most apps aren't like that, but a one-size-fits-all policy for long-running transactions (like Bill's) doesn't exist.
Ah, okay. That all makes sense... Of course, Bill may appreciate having a report that says "based on data no newer than X" where X is the time the transaction to generate the report started, rather than no report at all due to lots of read conflicts ;-) Chris
Tim Peters wrote at 2003-9-11 23:50 -0400:
... You can google on the phrase to get a ton of more-or-less abstract overviews of the concept. The short course in ZODB is that, when MVCC is in effect, a read will return the state of the object as of the time the current transaction began, even if the object has subsequently been modified by some other transaction.
Whoever wants to use it right now: the "no more ReadConflictErrors" patch on <http://www.dieter.handshake.de/pyprojects/zope> does precisely this (for storages that support history information). Dieter
[Tim]
... The short course in ZODB is that, when MVCC is in effect, a read will return the state of the object as of the time the current transaction began, even if the object has subsequently been modified by some other transaction.
[Dieter Maurer]
Whoever wants to use it right now: the "no more ReadConflictErrors" patch on
<http://www.dieter.handshake.de/pyprojects/zope>
does precisely this (for storages that support history information).
How has that been working out for people (not limited to Dieter)? That's indeed what we're after, although Jeremy has in mind deeper/broader changes aimed at being more efficient than digging thru history.
Tim Peters wrote at 2003-9-14 16:40 -0400:
... [Dieter Maurer]
Whoever wants to use it right now: the "no more ReadConflictErrors" patch on
<http://www.dieter.handshake.de/pyprojects/zope>
does precisely this (for storages that support history information).
How has that been working out for people (not limited to Dieter)?
It works fine for us. We are working on a content management system with tens of thousands of large SGML/XML documents. My colleague likes to make parallel imports. Before the patch, cataloguing (using an improved version of Shane's "QueueCatalog") almost did not proceed, due to an incredible number of "ReadConflictError"s. After the patch, there are occasional "WriteConflictError"s (caused by the catalog BTrees performing conflict resolution only at the leaf level), but in general the system is stable. We now use it in production systems as well, without problems.
That's indeed what we're after, although Jeremy has in mind deeper/broader changes aimed at being more efficient than digging thru history.
The history is consulted only rarely (only when the current state is too young). However, in this case, the pickled data is read twice. Of course, I do not mind when it gets more efficient... Dieter
On Thu, 2003-08-28 at 17:13, Shane Hathaway wrote:
Leonardo Rochael Almeida wrote:
Let's assume we're not hitting a 2.6 bug; we could be hitting a 2.6 feature :-) I've read Casey's article about optimizing ZCatalog and I know there has been a lot of ZCatalog optimization work. Have there been any changes in 2.6 (in ZCatalog, the thread cache, or anywhere else) that traded memory for speed?
Not that I know of. To the contrary, I think--we sometimes can't do an optimization because it would consume extra memory.
Well, as long as I can put an upper bound on the consumed memory, and the tradeoff is reasonable, I'd be glad to trade memory for speed.
You mean that one ZCatalog object is in the cache, and (indirectly) attached to it are thousands of DateTimes? Then we have a huge problem.
Why? Although the catalog manages those DateTime objects, the DateTimes are in many ZODB records.
The problem is that there should be a more direct relation between the number of allowed objects in the cache and the number of refcounts. As you say, the DateTimes are in ZODB records, so if I see a DateTime refcount of, say, 120k, I should be seeing at least that many objects in the caches. Instead I was seeing under 3k objects, unless there is something else that could act as a refcount multiplying factor, for instance something keeping a reference to the class for every instance, besides the instance itself.
I was counting on the "objects in cache" parameter to put an upper bound in Zope memory consumption, even if this upper bound was a little fuzzy.
In the past, the cache size parameter was very fuzzy. It was not an upper bound. Now, it is not only an upper bound, it also tends to act as a lower bound. (Note that the cache is still allowed to grow indefinitely within the scope of a request, however.)
[..] but for now we need to deal with the emergency that is the fact that migrating to 2.6.1 put us in the very hot spot of not having a way to put an upper bound to Zope memory consumption.
I would characterize it the other way around; there was no way to establish a hard boundary until now. I suggest you reduce the ZODB cache size to 2000 (or less) and see if the problem goes away.
The first time you suggested this I didn't even bother, because we were seeing our memory go down the drain even while the caches were still at 1600 objects each (we had our limit at 5000). But as we ran out of options, and seeing that flushing the cache rid us of the DateTimes, we decided to test keeping Zope thrashing its cache, even at the expense of performance, to see if it could shed the DateTimes. So we set the cache limit to 1000 objects, and to our amazement, Zope sat comfortably at 150MB RSS, and the performance was very good. Some heavy ZCatalog pages still send Zope for a walk for a few seconds (for those and all other requests), but the site becomes responsive again in under 20 seconds. Not only that, but the DateTime refcount isn't even among the top 5 (the BTrees are winning there now). So we now have breathing space to do the great ZCatalog reform.
Anyone know off the top of their head how do I get to the object caches?
app._p_jar._cache will get you started. It is dictionary-like.
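For the per-class tallies themselves (the kind of breakdown the Control_Panel refcount view shows), the standard-library gc module can produce something similar. This is a generic modern-Python sketch, not the Zope 2 API, and the `instance_counts` helper is a name invented here for illustration:

```python
import gc
from collections import Counter

def instance_counts(limit=10):
    """Tally live GC-tracked instances by class, roughly like
    Zope's Control_Panel refcount view (generic sketch)."""
    counts = Counter()
    for obj in gc.get_objects():
        cls = type(obj)
        counts[f"{cls.__module__}.{cls.__name__}"] += 1
    return counts.most_common(limit)

for name, n in instance_counts():
    print(name, n)
```

Note that `gc.get_objects()` only sees objects tracked by the cycle collector, so atomic types such as ints are not counted; for hunting container-heavy leaks like stray DateTime-holding records, that is usually enough.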
Thanks, as soon as we get more RAM in the machine, I'll see if I can switch the cache to a high number again and see where all those DateTimes are hanging. Cheers, Leo -- Ideas don't stay in some minds very long because they don't like solitary confinement.
Leonardo Rochael Almeida wrote:
Why? Although the catalog manages those DateTime objects, the DateTimes are in many ZODB records.
The problem is that there should be a more direct relation between the number of allowed objects in the cache and the number of refcounts. As you say, the DateTimes are in ZODB records, so if I see a DateTime refcount of, say, 120k, I should be seeing at least that many objects in the caches. Instead I was seeing under 3k objects, unless there is something else that could act as a refcount multiplying factor, for instance something keeping a reference to the class for every instance, besides the instance itself.
Well, DateTimes do not derive from ZODB.Persistent, so they may be part of a ZODB object, but are never an entire ZODB object. If you have 3k objects in the cache and there are 120k DateTimes floating around, you must have an average of 40 DateTimes per loaded object. That's huge... but if you have FieldIndexes that index DateTimes, it wouldn't surprise me. We at Zope Corp. didn't even realize what a problem this could be until recently.
So we set the cache limit to 1000 objects, and to our amazement, Zope sat comfortably at 150MB RSS, and the performance was very good. Some heavy ZCatalog pages still send Zope for a walk for a few seconds (for those and all other requests) but the site becomes responsive again in under 20 seconds. Not only that, but the DateTime refcount isn't even among the top 5 (the BTrees are winning there now).
So we now have breathing space to do the great ZCatalog reform.
Excellent. The average object size sounds very large, though: start with 150 MB, subtract about 15 MB for non-ZODB code and data, divide by 4 (4 threads, each with an independent copy of the same data), and divide by 1000 objects. That comes to 33-35 KB per object, much larger than most Zopes I've seen. BTW, if you're using ZEO, raise the ZEO cache size at the same time you lower the ZODB cache size. The ZEO cache can compensate for the performance loss. The default ZEO cache size is 20 MB, but I prefer 200 MB with today's Zope apps. Shane
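Shane's arithmetic can be spelled out explicitly (the figures are the ones quoted in the thread; the 15 MB non-ZODB overhead is his estimate, not a measured value):

```python
# Back-of-the-envelope estimate of average size per cached ZODB object.
rss_mb = 150        # observed Zope RSS
overhead_mb = 15    # assumed non-ZODB code and data (Shane's estimate)
threads = 4         # each thread has an independent object cache
cache_objects = 1000  # ZODB cache size per thread

per_object_kb = (rss_mb - overhead_mb) * 1024 / threads / cache_objects
print(f"~{per_object_kb:.0f} KB per cached object")  # in the 33-35 KB ballpark
```

The point of the calculation is that 1000 small catalog records should not cost 135 MB; if they do, each cached record must be dragging in a lot of non-persistent baggage (such as dozens of DateTime instances).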
I have got about 70000 DateTime refs on that one CMF site at the moment and can arbitrarily increase this number by serving out pages ;-). They don't "stick" though, but go back to a stable 40000. No cause for alarm here. Did you rebuild the catalog (-indexes) after the upgrade? New Zope versions usually come with important BTree fixes ;-). Wild guess, but do you by any chance run exUserFolder? hth, Stefan --On Montag, 25. August 2003 18:12 -0300 Leonardo Rochael Almeida <leo@hiper.com.br> wrote:
But maybe this means that the leak is not related to the DateTime refcounts. It's just that the fast and continual increase in DateTime objects is really bugging me. BTW, what is the usual DateTime refcount in heavy ZCatalog sites you guys run?
-- The time has come to start talking about whether the emperor is as well dressed as we are supposed to think he is. /Pete McBreen/
On Tue, 2003-08-26 at 12:05, Stefan H. Holek wrote:
[...] Did you rebuild the catalog (-indexes) after the upgrade? New Zope versions usually come with important BTree fixes ;-).
If by rebuilding you mean clicking on the "Update" button, then yes, as part of changing the DateTime FieldIndexes to DateIndexes. If you meant something else, then no.
Wild guess, but do you by any chance run exUserFolder?
As a matter of fact, I do. Is there a known leak problem with exUserFolder? -- Ideas don't stay in some minds very long because they don't like solitary confinement.
--On Dienstag, 26. August 2003 20:02 -0300 Leonardo Rochael Almeida <leo@hiper.com.br> wrote:
On Tue, 2003-08-26 at 12:05, Stefan H. Holek wrote:
[...] Did you rebuild the catalog (-indexes) after the upgrade? New Zope versions usually come with important BTree fixes ;-).
If by rebuilding you mean clicking on the "Update" button, then yes, as part of changing the DateTime FieldIndexes to DateIndexes. If you meant something else, then no.
This should take care of it.
Wild guess, but do you by any chance run exUserFolder?
As a matter of fact, I do. Is there a known leak problem with exUserFolder?
If by "known" you mean I can point you to the code segment causing it, then no. However, I have cured several leaky sites in the past by moving away from XUF, and your excessive DateTime refcounts rang a bell. Do you have the possibility to test the site without XUF? At least make sure you have the latest version. Do you have a large number of users? You could try putting ZSP_OWNEROUS_SKIP=1 in your start script and see whether it makes a difference. If it did, it would indicate a problem related to the user folder and not the catalog per se. Disclaimer: This is guesswork based on events a long time ago. Current versions of XUF may very well work fine. Stefan -- The time has come to start talking about whether the emperor is as well dressed as we are supposed to think he is. /Pete McBreen/
On Friday 22 August 2003 03:38 pm, Leonardo Rochael Almeida wrote:
Hi,
In trying to narrow down the causes of the leak, we detected what looks like a minor one. If you insert a very simple DTML method in the Zope root and bang on it at a rate of, say, 100 requests per second, Zope will increase its memory footprint very slowly, say, a couple of hundred KB every 10 minutes (both RSS and virtual size). I figure this has something to do with memory fragmentation or some other low-level stuff. But this is definitely not our main concern, which is to find out how Zope is leaking DateTimes so heavily on our site, and whether it is really all those DateTimes that are hogging our memory. We need help, and we need it desperately. If anyone wants any other information we'd be happy to provide it.
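The "what is holding on to them?" question can be attacked directly with the standard-library cycle collector, which can enumerate an object's referrers. A minimal modern-Python sketch, using a hypothetical stand-in class in place of DateTime.DateTime (the Zope ZODB cache itself would need `app._p_jar._cache`, as discussed elsewhere in the thread):

```python
import gc

class DateTimeStandIn:
    """Stand-in for DateTime.DateTime (hypothetical; any class would do)."""

# Simulate a catalog metadata record that keeps a DateTime alive.
holder = {"last_modified": DateTimeStandIn()}

# Pick one leaked-looking instance and ask the GC who refers to it.
suspect = holder["last_modified"]
dict_referrers = [r for r in gc.get_referrers(suspect)
                  if isinstance(r, dict) and r is not globals()]
for r in dict_referrers:
    print("held by a dict with keys:", sorted(r))
# prints: held by a dict with keys: ['last_modified']
```

Walking referrer chains a few levels up from a handful of sampled DateTime instances usually reveals which container (an index bucket, a metadata record, a cache) is the one refusing to let go.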
Okay, here are a few things to try based on this information.
1) Take the ZClasses, burn them, and redo them as a Python product instead. From what I have seen and done in the past, this makes a fairly large difference in speed and a small one in memory. And since it is faster, you can get away with smaller caches.
2) Convert DTML objects to Python Scripts where feasible. Python Scripts often run 10-40 times faster than DTML and seem to have a much better memory footprint. With them running a lot faster you can also cut down on your caching. You can write fairly complex scripts that generate ~200 req/s, but I have not gotten even close to that with DTML.
3) If you still have problems, use psyco on the Python product that replaces your ZClasses and bind it to the methods where it makes the most speed difference. Caching is good, but running fast enough not to need caching is often better.
I have a fair number of sites now that range in size from 100 MB to 1 GB, get a fair bit of usage with tens of thousands of objects, and use about 20-45 MB of RAM under Zope 2.6.1. They used to use 80-400 MB, but that was before I fixed a number of things.
Leonardo Rochael Almeida wrote Lately this was not enough, however, as Zope started taking more and more memory, to the point that it frequently required more than one restart during the day.
Note that running with ZEO will at least mean that your server restarts happen a lot faster. -- Anthony Baxter <anthony@interlink.com.au> It's never too late to have a happy childhood.
participants (13)
-
Anthony Baxter -
Chris McDonough -
Chris Withers -
Dieter Maurer -
Jens Vagelpohl -
kosh -
Leonardo Rochael Almeida -
Paul Winkler -
Shane Hathaway -
Stefan H. Holek -
Tim Peters -
Toby Dickenson -
Tres Seaver