On Sat, 2004-12-04 at 23:34, Andy Yates wrote:
Note that I'm not trying to find a problem with Zope. My plan was to start with something very simple and then add more and more of the things we do on our production site and see when it starts leaking. However, even the simplest tests seem to leak memory.
This is a great place to start. I wouldn't be surprised at all if the simplest access leaked memory. It seems easy to create memory leaks with Zope code. There aren't any tests that specifically target leaks in the Zope test base so basically any given Zope release may have terrible leaks and we'd never know until it got reported to us or until someone caught it by accident. AFAIK, there has never been a Zope release that anyone has had the time or energy to certify as "leak-free", if that's even possible.
Thanks for the explanation of how the transient gc works. That is exactly what I found when I looked in the source code, and I can see all of it happening when I turn on the debug output. Sessions are created and placed in buckets. Active sessions move to the current bucket, and expired sessions and old empty buckets get removed. Then there is a 1 in 6 chance they will be gc'ed on each request. I see all this happening, but when the buckets are gc'ed the memory usage according to top does not go down.
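The mechanism described above (buckets keyed by time slice, active sessions migrating forward, a 1-in-6 chance of housekeeping per request) can be sketched in a few lines of plain Python. All names and constants here are illustrative, not Zope's actual transience API:

```python
import random
import time

class ToySessionBuckets:
    """Illustrative model of time-sliced session buckets with
    probabilistic garbage collection (not Zope's real implementation)."""

    PERIOD = 20    # seconds covered by one bucket (assumed)
    TIMEOUT = 60   # session lifetime in seconds (assumed)
    GC_ODDS = 6    # housekeeping fires on roughly 1 request in 6

    def __init__(self):
        self.buckets = {}  # time slice -> {session_id: data}

    def _slice(self, t=None):
        if t is None:
            t = time.time()
        return int(t // self.PERIOD)

    def set(self, session_id, data):
        # An active session always moves into the current bucket.
        for bucket in self.buckets.values():
            bucket.pop(session_id, None)
        self.buckets.setdefault(self._slice(), {})[session_id] = data
        self.maybe_housekeep()

    def maybe_housekeep(self):
        # 1-in-6 chance per request: drop buckets older than the timeout,
        # taking their expired sessions with them.
        if random.randrange(self.GC_ODDS) == 0:
            cutoff = self._slice(time.time() - self.TIMEOUT)
            for key in [k for k in self.buckets if k < cutoff]:
                del self.buckets[key]
```

Note that deleting a bucket here releases the Python objects it held, but that does not by itself guarantee the process's resident size shrinks, for the reasons discussed further down in this thread.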
Gotcha.
I ran the test again that puts a 1K string in the session objects. I used ab to run the Python script 3000 times with a concurrency of 10. Python's memory went from 22MB to 54MB. Transient objects went from 0 to about 900 and stayed steady during the test. After the test the transient objects dropped back down to 0 as they expired.
Since there are a lot of moving parts to this problem, it's important to be deliberate about the methodology used to provoke it. With that in mind, I'm not sure how you get 900 transient objects out of 3000 requests via ab unless you put the same "_ZopeId" in the query string for 2100 of those requests, had a very short TOC timeout, or had a maximum subobject limit of 900. If none of these things is true, I can imagine some conflict errors causing a few to drop out, but you should have ended up with somewhere around 3000 transient objects at some point unless the test ran for longer than the timeout period.
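That last case is one plausible way a steady ~900 falls out of 3000 requests: if the run outlasts the session timeout, the live-session count plateaus near (request rate × timeout) instead of climbing to the request total. A back-of-the-envelope sketch, where the rate and timeout figures are made up purely for illustration:

```python
def steady_state_sessions(total_requests, requests_per_sec, timeout_secs):
    """Rough ceiling on concurrently live sessions: each request creates a
    session that lives timeout_secs, so at most rate * timeout coexist."""
    return min(total_requests, requests_per_sec * timeout_secs)

# Hypothetical: 3000 requests at ~15 req/s with a 60-second timeout.
# The run lasts ~200s, well past the timeout, so the live count
# plateaus around 15 * 60 = 900 rather than reaching 3000.
```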
I continued to make occasional Zope requests to plain pages to make sure gc was getting tickled.
You probably know this, but FWIW, gc is only potentially tickled when you actually access a session object; "plain page" access has no effect unless it does something to the sessioning machinery. The most reliable way to get gc to run is to call the TOC's "housekeep" method a few times (it statistically does nothing 5/6ths of the time, as you noted) within the "plain page" you're accessing.
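The reason "a few times" suffices: each call independently skips gc with probability 5/6, so the chance that k consecutive calls all skip is (5/6)**k. A quick check (the helper name is ours):

```python
def chance_gc_never_ran(calls, skip_odds=5.0 / 6.0):
    """Probability that every one of `calls` housekeep invocations skipped gc."""
    return skip_odds ** calls

# After 10 calls the odds that gc never fired drop to about 16%,
# i.e. roughly an 84% chance it ran at least once.
```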
After 8+ hours of this the memory was never released; Python was still using 54MB. If I continue to create session objects, the 54MB is not reused by Python; memory usage just continues to grow.
Does the refcount of the TransientObject class at that point support your theory that references to these TransientObjects somehow stick around past the time they're gc'ed? If so, do you also see other classes with a similar or related refcount (Acquisition.ImplicitAcquisitionWrapper or ZPublisher.HTTPRequest.HTTPRequest, perhaps)? Leaking an HTTPRequest reference in Zope is bad news because the REQUEST may hold references to many other objects. Likewise for acquisition wrappers. It's a real nightmare to try to isolate the issue when you see these classes' refcounts growing without bound. FWIW, I've seen these sorts of leaks in the past (even in very simple accesses) and shamefully my first reaction is to do a mental "Colonel Klink" (I see *nuss*ink! ;-) because I find them so difficult and time-consuming to isolate. I have such a distaste for diagnosing these problems that I wrote a product named AutoLance to automatically restart Zope when memory usage goes above some watermark. :-(
Note that Python will usually not release memory back to the OS once it has been allocated for its own use, so a large concrete amount of memory consumed by Python may not even be an indicator of a leak! See http://mail.python.org/pipermail/python-dev/2004-October/049480.html for more info. The only solid basis for determining leaks is refcounts, and it's tricky to figure out whether growing refcounts are leaks or if they're just side effects of the ZODB cache holding on to old objects.
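On the refcount point: a crude, Zope-independent way to get the same kind of data as Zope's refcount debug view is to snapshot live instance counts per class via the gc module, then diff two snapshots taken around a workload. A minimal sketch (the function names are our own):

```python
import gc
from collections import Counter

def instance_counts():
    """Snapshot of live, gc-tracked instances keyed by class name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())

def growth(before, after):
    """Classes whose live-instance count grew between two snapshots.
    Steady growth across repeated identical workloads is the leak signature."""
    return {name: after[name] - before.get(name, 0)
            for name in after if after[name] > before.get(name, 0)}
```

Take one snapshot, run the ab workload, take another, and look for classes (TransientObject, HTTPRequest, acquisition wrappers) whose counts climb without bound; remember that counts propped up by the ZODB cache are expected, not leaks.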
I tested that too, and Python 2.3.4 on Red Hat 9 does not seem to behave this way. I wrote a script that creates 100MB of random strings and places them in a list. Then it waits a few seconds and deletes the list. When I look at the process with top, I see memory (RSS) go up by about 100MB and then, a few seconds later, go back down to just above the level where the script started. Then I put this in a loop. I can watch the memory usage just cycle up and down. This is what I would expect.
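For reference, a reconstruction of that allocate/hold/release cycle (sizes and pauses are parameters; the original test used roughly 100MB of random strings):

```python
import os
import time

def churn(megabytes=100, hold_secs=2.0):
    """Allocate ~`megabytes` MB of random byte strings in a list, hold them
    briefly, then drop the list. Run this in a loop while watching the
    process RSS in top to see whether memory cycles up and down."""
    blobs = [os.urandom(1 << 20) for _ in range(megabytes)]  # 1 MiB each
    total = sum(len(b) for b in blobs)
    time.sleep(hold_secs)
    del blobs  # last references gone; whether RSS shrinks is up to the allocator/OS
    return total
```

Whether RSS actually drops after `del blobs` depends on the allocator and object sizes: large allocations like these tend to be returned to the OS directly, which may be why this test shrinks while the small-object churn of real session traffic does not.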
As I understand it, Python has separate memory pools for different kinds of objects, and the policy for managing deallocation is different for each pool. I'm no expert on it; I'm just deferring to what I've read, which seems to indicate that releasing the last reference to an object doesn't always immediately return the memory consumed by that object to the OS. You probably hit upon a case where it does with the above test, but I wouldn't assume it's indicative of all Python memory deallocation behavior. As a result, the way I think about it anyway, the only reliable way to determine whether an app is leaking is refcounts. Concrete memory usage during process lifetime is interesting as a rough indicator of how an app is behaving memory-wise, but AFAICT refcounts tell a much more interesting story for debugging purposes. FWIW, I'm willing to help try to isolate this problem if we can come up with some sane hypothesis about what sort of resources are leaking and a general "why" backed up by refcount data and repeatable tests. Otherwise I think I'd rather go chew on some tinfoil or burn myself with a cigarette. ;-) - C