RE: [Zope] Zope 2.7.3 Memory Leaks
-----Original Message-----
From: Chris McDonough [mailto:chrism@plope.com]
Sent: Thursday, December 02, 2004 5:17 PM
To: Andy Yates
Cc: zope@zope.org
Subject: Re: [Zope] Zope 2.7.3 Memory Leaks
On Thu, 2004-12-02 at 17:24, Andy Yates wrote:
Now it seems to leak memory (400+ Mb) to the point I have to restart Zope every 3 days.
I've read that memory leaks are almost always caused by the programmer, so I set up some very simple tests that did not involve existing code.
Great idea.
Note that I'm not trying to find a problem with Zope. My plan was to start with something very simple and then add more and more of the things we do on our production site and see when it starts leaking. However, even the simplest tests seem to leak memory.
Next I created an 8Mb page template. This caused the memory usage to climb much faster. After the test the memory never goes back down.
Do these templates access the sessioning machinery?
No. It is just a page template file with 8Mb of random text.
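Such a template is easy to produce with a few lines of Python; the following is only a sketch of how one might generate it (the filename and character set are arbitrary, and any ~8Mb of text would do):

    # Generate roughly 8MB of random text to upload or paste into a
    # Page Template for the test.
    import random
    chars = 'abcdefghijklmnopqrstuvwxyz \n'
    f = open('big_template.pt', 'w')
    for i in range(8 * 1024):                  # 8192 chunks of 1KB each
        f.write(''.join([random.choice(chars) for j in range(1024)]))
    f.close()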
I installed LeakFinder and started running more tests.
I used ab to hit a Python script that stored a 1k string in the session object. The transient object container timeout was set to 1 minute. This also caused Zope to consume memory. LeakFinder said the Products.Transience.TransientObject.TransientObject ref count grew with each request and never went down. Everything else seemed to level off. The debug output of Transience.py showed that the "buckets" seemed to be getting deleted as expected, but the memory usage never goes down. One minute after the test stopped, the transient object container showed that there were no more items in the container. It seems like Zope deletes expired sessions, but the contents of the sessions are not deleted.
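A minimal Script (Python) along these lines might look like the following (a sketch of the kind of test described above, not the original script; the key name 'payload' is arbitrary):

    # Script (Python): store a 1KB string in the session on each request.
    # request.SESSION transparently creates a TransientObject in the
    # transient object container on first access.
    request = container.REQUEST
    request.SESSION['payload'] = 'x' * 1024
    return 'stored 1k in session'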
This is actually probably normal.
TransientObject buckets hang around for some period of time before they are "garbage collected" (deleted from their container). A particular TransientObject isn't garbage collected immediately when it expires; it gets gc'ed much later, along with other TransientObjects that were created around the same time, as a side effect of otherwise exercising the sessioning machinery.
Thanks for the explanation of how the transient gc works. That is exactly what I found when I looked in the source code, and I can see all of this happening when I turn on the debug output. Sessions are created and placed in buckets. The active sessions move to the current bucket, and expired sessions and old empty buckets get removed. Then there is a 1 in 6 chance they will be gc'ed on each request. I see all this happening, but when the buckets are gc'ed the memory usage according to top does not go down.

I ran the test again that puts a 1k string in the session objects. I used ab to run the python script 3000 times with a concurrency of 10. Python's memory went from 22Mb to 54Mb. Transient objects went from 0 to about 900 and stayed steady during the test. After the test the transient objects dropped back down to 0 as they expired. I continued to make occasional Zope requests to plain pages to make sure gc was getting tickled. After 8+ hours of this the memory was never released. Python was still using 54Mb. If I continue to create session objects, the 54Mb is not reused by Python; it just continues to grow.
Note that Python will usually not release memory back to the OS once it has been allocated for its own use, so a large concrete amount of memory consumed by Python may not even be an indicator of a leak! See http://mail.python.org/pipermail/python-dev/2004-October/049480.html for more info. The only solid basis for determining leaks is refcounts, and it's tricky to figure out whether growing refcounts are leaks or if they're just side effects of the ZODB cache holding on to old objects.
I tested that too and python 2.3.4 on RedHat 9 does not seem to behave this way. I wrote a script that creates 100Mb of random strings and places them in a list. Then it waits a few seconds and deletes the list. When I look at the process with top I see memory (RSS) go up by about 100Mb and then a few seconds later it goes back down to just above the level when the script started. Then I put this in a loop. I can watch the memory usage just cycle up and down. This is what I would expect. Thanks! Andy
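The test described above can be sketched roughly as follows (an illustration in Python 2-era code, not the actual script; the sizes and sleep times are arbitrary):

    # Allocate ~100MB of distinct strings, hold them briefly, drop them,
    # and repeat, while watching RSS in top.
    import random, time

    def make_block(chunks=100 * 1024, chunk_size=1024):
        seed = ''.join([chr(random.randrange(32, 127)) for i in range(chunk_size)])
        # distinct string objects totalling roughly 100MB
        return [seed + str(i) for i in xrange(chunks)]

    while 1:
        block = make_block()   # RSS climbs by roughly 100MB
        time.sleep(5)
        del block              # strings this large come from the platform
        time.sleep(5)          # malloc(), so RSS may drop back down here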
On Sat, 2004-12-04 at 23:34, Andy Yates wrote:
Note that I'm not trying to find a problem with Zope. My plan was to start with something very simple and then add more and more of the things we do on our production site and see when it starts leaking. However, even the simplest tests seem to leak memory.
This is a great place to start. I wouldn't be surprised at all if the simplest access leaked memory. It seems easy to create memory leaks with Zope code. There aren't any tests that specifically target leaks in the Zope test suite, so basically any given Zope release may have terrible leaks and we'd never know until it got reported to us or until someone caught it by accident. AFAIK, there has never been a Zope release that anyone has had the time or energy to certify as "leak-free", if that's even possible.
Thanks for the explanation of how the transient gc works. That is exactly what I found when I looked in the source code, and I can see all of this happening when I turn on the debug output. Sessions are created and placed in buckets. The active sessions move to the current bucket, and expired sessions and old empty buckets get removed. Then there is a 1 in 6 chance they will be gc'ed on each request. I see all this happening, but when the buckets are gc'ed the memory usage according to top does not go down.
Gotcha.
I ran the test again that puts a 1k string in the session objects. I used ab to run the python script 3000 times with a concurrency of 10. Python's memory went from 22Mb to 54Mb. Transient objects went from 0 to about 900 and stayed steady during the test. After the test the trans obs dropped back down to 0 as they expired.
Since there are a lot of moving parts to this problem, it's important to be deliberate about the methodology used to provoke it. With that in mind, I'm not sure how you get 900 transient objects out of 3000 requests via ab unless you put the same "_ZopeId" in the query string for 2100 of these requests, or you had a very short TOC timeout, or you had a max number of subobjects of 900. If none of these things is true, I can imagine some conflict errors causing a few to drop out, but you should have ended up somewhere around 3000 transient objects at some point unless the test ran for longer than the timeout period.
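One way to make that methodology explicit is to drive the requests from a small Python 2-era script instead of ab, controlling whether a browser id is reused. This is only a sketch: the URL is made up, and the reused _ZopeId must be an id previously issued by this Zope instance or it may be rejected as invalid.

    import urllib2

    URL = 'http://localhost:8080/leak_test'    # hypothetical test script

    def hit(n, zope_id=None):
        for i in range(n):
            url = URL
            if zope_id is not None:
                # reusing one browser id funnels every request into the
                # same session instead of creating a new one each time
                url = '%s?_ZopeId=%s' % (URL, zope_id)
            urllib2.urlopen(url).read()

    hit(3000)                        # expect on the order of 3000 transient objects
    hit(3000, 'REUSED_BROWSER_ID')   # expect roughly one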
I continued to make occasional Zope requests to plain pages to make sure gc was getting tickled.
You probably know this but GC is only potentially tickled when you actually access a session object, FWIW; "plain page" access has no effect unless it does something to the sessioning machinery. The most reliable way to get gc to run is to call the TOC's "housekeep" method a few times (it statistically does nothing 5/6th of the time, as you noted) within the "plain page" you're accessing.
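A "plain page" that reliably tickles gc could therefore be a Script (Python) along these lines. This is a sketch based on the description above: it assumes the session data container lives at the default /temp_folder/session_data and that the "housekeep" method mentioned here is reachable from restricted code (if it isn't, the same calls can be made from an External Method).

    # Script (Python): nudge transient-object housekeeping a few times.
    toc = context.restrictedTraverse('/temp_folder/session_data')
    for i in range(12):
        # each call statistically does nothing ~5/6 of the time,
        # so call it repeatedly to make a gc pass likely
        toc.housekeep()
    return 'housekeeping tickled'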
After 8+ hours of this the memory was never released. Python was still using 54Mb. If I continue to create session objects the 54Mb is not reused by python it just continues to grow.
Does the refcount of the TransientObject class at that point support your theory that somehow references to these TransientObjects stick around past the time they're gc'ed? If so, do you also see other classes with a similar or related refcount (Acquisition.ImplicitAcquisitionWrapper, ZPublisher.HTTPRequest.HTTPRequest, perhaps)? Leaking an HTTPRequest reference in Zope is bad news because the REQUEST may hold on to references to many other objects. Likewise for acquisition wrappers. It's a real nightmare to try to isolate what the issue is when you see these classes' refcounts growing without bound. FWIW, I've seen these sorts of leaks in the past (in even very simple accesses) and shamefully my first reaction is to do a mental "Colonel Klink" (I see *nuss*ink! ;-) because I find it so difficult and time-consuming to isolate. I have such a distaste for diagnosing these problems that I wrote a product named AutoLance to automatically restart Zope when memory usage goes above some watermark. :-(
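One quick way to watch those class refcounts from trusted code (an External Method or a debug prompt, for instance) is sys.getrefcount on the class objects themselves, since every live instance holds a reference to its class. This is only a sketch and mirrors what LeakFinder and the Control_Panel refcount view already report:

    import sys
    from Products.Transience.TransientObject import TransientObject
    from ZPublisher.HTTPRequest import HTTPRequest
    from Acquisition import ImplicitAcquisitionWrapper

    # Baseline these numbers, run the ab test, and compare afterwards;
    # a count that grows without bound points at leaked instances.
    for klass in (TransientObject, HTTPRequest, ImplicitAcquisitionWrapper):
        print klass, sys.getrefcount(klass)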
Note that Python will usually not release memory back to the OS once it has been allocated for its own use, so a large concrete amount of memory consumed by Python may not even be an indicator of a leak! See http://mail.python.org/pipermail/python-dev/2004-October/049480.html for more info. The only solid basis for determining leaks is refcounts, and it's tricky to figure out whether growing refcounts are leaks or if they're just side effects of the ZODB cache holding on to old objects.
I tested that too and python 2.3.4 on RedHat 9 does not seem to behave this way. I wrote a script that creates 100Mb of random strings and places them in a list. Then it waits a few seconds and deletes the list. When I look at the process with top I see memory (RSS) go up by about 100Mb and then a few seconds later it goes back down to just above the level when the script started. Then I put this in a loop. I can watch the memory usage just cycle up and down. This is what I would expect.
As I understand it, Python has separate memory pools for different kinds of objects, and the policy for managing memory deallocation is different for each pool. I'm no expert about it, I'm just deferring to what I read about it, and what I read seems to indicate that releasing the last reference to an object doesn't always immediately return the memory consumed by that object to the OS. You probably hit upon a case where it does with the above test, but I wouldn't assume it's indicative of all Python memory deallocation behavior.

As a result, the way I think about it anyway, the only reliable way to determine whether an app is leaking or not is to watch refcounts. Concrete memory usage during process lifetime is interesting as a rough indicator of how an app is behaving memory-wise, but AFAICT refcounts tell a much more interesting story for debugging purposes.

FWIW, I'm willing to help try to isolate this problem if we can come up with some sane hypothesis about what sort of resources are leaking and a general "why" backed up by refcount data and repeatable tests. Otherwise I think I'd rather go chew on some tinfoil or burn myself with a cigarette. ;-)

- C
[Chris McDonough]
... As I understand it Python has separate memory pools for different kinds of objects. The policy for managing memory deallocation is different for each pool.
I'm no expert about it, I'm just deferring to what I read about it, and what I read seems to indicate that releasing the last reference to an object doesn't always immediately return the memory consumed by that object to the OS. You probably hit upon a case where it does with the above test but I wouldn't assume it's indicative of all Python memory deallocation behavior.
It's complicated. Any request for an object exceeding 256 bytes goes to the platform malloc(). Python returns such memory via the platform free(). Whether that in turn "returns the memory to the OS" is entirely in the hands of the OS and the platform malloc(). It may or may not. When it may, whether it does depends on internal malloc() implementation details, often including the precise history of malloc() and free() requests. For example:

    p1 = malloc(50000);
    p2 = malloc(50000);
    p3 = malloc(50000);
    q = malloc(999);
    free(p1);
    free(p2);
    free(p3);

When malloc() works by passing out blocks "low address to high address", and can only "give back" contiguous high-address chunks to the OS, then even if malloc() "wants to give memory back" to the OS, a small object (q) still alive at the highest address passed out so far will prevent it from returning anything (p1, p2, p3) to the OS.

Objects <= 256 bytes may or may not be obtained via platform malloc(). Often they're obtained via Python's "small object allocator", which grabs 256KB at a time from the platform malloc() and carves it up itself. These 256KB chunks are currently never given back to the platform free(), so there's no chance they'll be "returned to the OS" either.

There are indeed no easy answers here.
Maybe that's your misconception then. The memory usage according to top is unlikely to go down significantly, if at all. That's because the C heap cannot safely be shrunk most of the time. So I'd say forget the heap usage and watch them refcounts instead.

Stefan

On 5 Dec 2004, at 05:34, Andy Yates wrote:
Thanks for the explanation of how the transient gc works. That is exactly what I found when I looked in the source code, and I can see all of this happening when I turn on the debug output. Sessions are created and placed in buckets. The active sessions move to the current bucket, and expired sessions and old empty buckets get removed. Then there is a 1 in 6 chance they will be gc'ed on each request. I see all this happening, but when the buckets are gc'ed the memory usage according to top does not go down.
-- The time has come to start talking about whether the emperor is as well dressed as we are supposed to think he is. /Pete McBreen/
participants (4):
- Andy Yates
- Chris McDonough
- Stefan H. Holek
- Tim Peters