RE: [Zope-dev] Zope Server hanging :-(
I forgot to mention... our neither our z2.log or stupid_log show anything upon freezing. afterwards the stupid_log shows the failed transaction cleanup, but that's it. Tim -----Original Message----- From: Tim McLaughlin Sent: Friday, April 27, 2001 12:41 PM To: 'zope-dev@zope.org'; 'chrisw@nipltd.com' Subject: [Zope-dev] Zope Server hanging :-( Chris (and all), We seem to be having similar crashing issues. And interestingly enough, they don't seem to show up on any of our servers except the 2.3.0 upgraded to 2.3.1. 1. Hangs on anything from Pythonscript or dtml 2. top shows no significant mem or cpu usage 3. threads are all unresponsive 4. restart usually yields a .trX file (an aborted transaction I suppose) Any ideas? Maybe it was something "fixed" in 2.3.1? The causes seem to be somewhat different except that they all modify the ZODB.... Tim
I forgot to mention... our neither our z2.log or stupid_log show anything upon freezing. afterwards the stupid_log shows the failed transaction cleanup, but that's it.
Hmmm... it might be useful to turn on detailed request logging (-M logging) in the start file (see z2.py). Attached is a script that I just checked into the trunk to do analysis of the file generated by the -M log. It can help you figure out if there's a pattern to the hangs (whether it happens on a particular method, whether it happens at heavy load time, whether it happens at a particular time of day, etc.) Do you think you'd be willing to play around with it a little bit to try to discover a pattern?
Tim
-----Original Message----- From: Tim McLaughlin Sent: Friday, April 27, 2001 12:41 PM To: 'zope-dev@zope.org'; 'chrisw@nipltd.com' Subject: [Zope-dev] Zope Server hanging :-(
Chris (and all),
We seem to be having similar crashing issues. And interestingly enough, they don't seem to show up on any of our servers except the 2.3.0 upgraded to 2.3.1.
1. Hangs on anything from Pythonscript or dtml 2. top shows no significant mem or cpu usage 3. threads are all unresponsive 4. restart usually yields a .trX file (an aborted transaction I suppose)
Any ideas? Maybe it was something "fixed" in 2.3.1? The causes seem to
be
somewhat different except that they all modify the ZODB....
Tim
_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
Chris McDonough wrote:
I forgot to mention... our neither our z2.log or stupid_log show anything upon freezing. afterwards the stupid_log shows the failed transaction cleanup, but that's it.
Hmmm... it might be useful to turn on detailed request logging (-M logging) in the start file (see z2.py).
I'll give this a go next time I notice the thing has hung... cheers, Chris
Chris McDonough wrote:
Hmmm... it might be useful to turn on detailed request logging (-M logging) in the start file (see z2.py).
OK, this is done now...
Attached is a script that I just checked into the trunk to do analysis of the file generated by the -M log. It can help you figure out if there's a pattern to the hangs (whether it happens on a particular method, whether it happens at heavy load time, whether it happens at a particular time of day, etc.) Do you think you'd be willing to play around with it a little bit to try to discover a pattern?
Definitely, can you help me with 'reading the runes'? cheers, Chris PS:
4. restart usually yields a .trX file (an aborted transaction I suppose)
I didn't get any .tr? files, and no, I didn't rm them either ;-)
Attached is a script that I just checked into the trunk to do analysis of the file generated by the -M log. It can help you figure out if there's a pattern to the hangs (whether it happens on a particular method, whether it happens at heavy load time, whether it happens at a particular time of day, etc.) Do you think you'd be willing to play around with it a little bit to try to discover a pattern?
Definitely, can you help me with 'reading the runes'?
Absolutely! When you've got some representative data, and you've successfully run requestprofiler against it in various ways, let me know.
Chris McDonough wrote:
Absolutely! When you've got some representative data, and you've successfully run requestprofiler against it in various ways, let me know.
Hmmm, not really sure what I should be looking for :-S What ways should I look to run it and what should I do with the output? cheers, Chris
One of the more important bits is the "active" count when using the --detailed option to requestprofiler. This tells you how many *other* requests were "unfinished" at the end of a particular request. So for instance, if you run requestprofiler like this: python requestprofiler.py log.file.name --detailed And you see a profile line like (header added for reference): start win wout wend tot S osize code act url 2001-04-19T01:19:42 0 12 0 12 E 26024 200 5 /Catalog/checkrep?P This means that: - this request started at 2001-04-19 1:19:42 UTC ("start") - zserver waited 0 secs for input from the client ("win") - zserver waited 12 secs for output from the publisher ("wend") - the total time from start to end for the request was 12 secs ("tot") - the request reached the end ("E") stage ("S", one of "E", "A", "I", "B") - the output size in bytes was 26024 ("osize") - the http code was 200 ("code") - the number of pending requests at the end of this request was 5 ("active") - the URL of the request ("url") Hangs are also interesting. Use detailed reporting to sort by end stage: requestprofiler.py your.log.file --detailed --sort=endstage Any requests that don't show a stage of "E" for their endstage were "hung" requests. You can also use cumulative reporting to sort by "hangs": requestprofiler.py your.log.file --cumulative --sort=hangs This will show the methods that "hung" most frequently. Much of the time, you can find problems this way by seeing methods that enter an infinte loop or cause a segfault. Total time is also interesting: requestprofiler.py your.log.file --detailed --sort=total Shows you individual calls to methods that took a long time. - AND - requestprofiler.py your.log.file --cumulative --sort=total Shows you where most of the request time is spent. The goal is to indentify methods or requests that take a long time, don't return, or are accessed frequently. If you can match this data up with specific problems you've experienced (possibly by way of times/dates), it becomes easier to form postulations about what is happening to Zope under load. Then investigate those postulations by refactoring code or by doing more detailed logging in the places where you think the problems are. - C
Hmmm, not really sure what I should be looking for :-S
What ways should I look to run it and what should I do with the output?
cheers,
Chris
Chris McDonough wrote:
<snip info> Thanks, that's just what I was looking for
The goal is to indentify methods or requests that take a long time, don't return, or are accessed frequently.
Yup
If you can match this data up with specific problems you've experienced (possibly by way of times/dates),
well, I have tail -f of the detailed log constantly open on my desktop ;-)
it becomes easier to form postulations about what is happening to Zope under load.
To be honest, there really isn't any load on this server... maybe one or two people using it :-S
Then investigate those postulations by refactoring code or by doing more detailed logging in the places where you think the problems are.
Hmmm, I know what you mean but it doesn't sound like it's going to be fun :-S well, thanks for the info, I'll do what I can... Chris
Hmmm... saw these in the stupid log, just before the server crashed: ------ 2001-05-03T10:09:34 INFO(0) Z2 CONFLICT Competing writes at, /VirtualHostBase/http/server.nipltd.com:80/VirtualHostRoot/folder/object Traceback (innermost last): File /usr/local/zope/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 171, in publish File /usr/local/zope/Zope-2.3.2-src/lib/python/ZPublisher/mapply.py, line 160, in mapply File /usr/local/zope/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 112, in call_object <snip lots of Python Scripts calling each other and DTML methods> File /usr/local/zope/Zope-2.3.2-src/lib/python/ZODB/Connection.py, line 524, in setstate ConflictError: ("'\\000\\000\\000\\000\\000\\020\\036\\304'", '<extension class Products.AProduct.Utility.AClass at 84ea878>') ------ 2001-05-03T10:09:42 INFO(0) Z2 CONFLICT Competing writes at, /VirtualHostBase/http/server.nipltd.com:80/VirtualHostRoot/folder/object Traceback (innermost last): File /usr/local/zope/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 175, in publish File /usr/local/zope/Zope-2.3.2-src/lib/python/Zope/__init__.py, line 235, in commit File /usr/local/zope/Zope-2.3.2-src/lib/python/ZODB/Transaction.py, line 300, in commit File /usr/local/zope/Zope-2.3.2-src/lib/python/ZODB/Connection.py, line 281, in commit ConflictError: '\000\000\000\000\000\020\037\245' Could this bring it down? Should it? How can I track it down further? cheers, Chris
This probably isn't meaningful... unless it is. ;-) Conflict errors are normal... I wish they wouldn't be so "big" and have an UPPERCASE word in their description. Chris Withers wrote:
Hmmm... saw these in the stupid log, just before the server crashed:
------ 2001-05-03T10:09:34 INFO(0) Z2 CONFLICT Competing writes at, /VirtualHostBase/http/server.nipltd.com:80/VirtualHostRoot/folder/object Traceback (innermost last): File /usr/local/zope/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 171, in publish File /usr/local/zope/Zope-2.3.2-src/lib/python/ZPublisher/mapply.py, line 160, in mapply File /usr/local/zope/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 112, in call_object <snip lots of Python Scripts calling each other and DTML methods> File /usr/local/zope/Zope-2.3.2-src/lib/python/ZODB/Connection.py, line 524, in setstate ConflictError: ("'\\000\\000\\000\\000\\000\\020\\036\\304'", '<extension class Products.AProduct.Utility.AClass at 84ea878>')
------ 2001-05-03T10:09:42 INFO(0) Z2 CONFLICT Competing writes at, /VirtualHostBase/http/server.nipltd.com:80/VirtualHostRoot/folder/object Traceback (innermost last): File /usr/local/zope/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 175, in publish File /usr/local/zope/Zope-2.3.2-src/lib/python/Zope/__init__.py, line 235, in commit File /usr/local/zope/Zope-2.3.2-src/lib/python/ZODB/Transaction.py, line 300, in commit File /usr/local/zope/Zope-2.3.2-src/lib/python/ZODB/Connection.py, line 281, in commit ConflictError: '\000\000\000\000\000\020\037\245'
Could this bring it down? Should it? How can I track it down further?
cheers,
Chris
_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
Good morning, I presume? :-) Chris McDonough wrote:
This probably isn't meaningful... unless it is. ;-)
Huh?
Conflict errors are normal... I wish they wouldn't be so "big" and have an UPPERCASE word in their description.
OK, but I did notice them just before the server died. Maybe now would be a good time to run the analyser script. Unfortunately, I'd probably be expected to 'clean' the URLs of sensitive information. Hurm, d'ya think it'd be worth it? cheers, Chris
Chris Withers wrote:
Good morning, I presume? :-)
Present!
This probably isn't meaningful... unless it is. ;-)
Huh?
This is a Jim Fultonism. In English, it means this is probably not meaningful.
Conflict errors are normal... I wish they wouldn't be so "big" and have an UPPERCASE word in their description.
OK, but I did notice them just before the server died. Maybe now would be a good time to run the analyser script. Unfortunately, I'd probably be expected to 'clean' the URLs of sensitive information.
Well the analyzer script only shows the first 20 or so chars of the URL by default unless you pass it the --verbose argument, so this might serve to do some "cleaning". Also, getting rid of the URL altogether is something you can do by using Unix "cut" or on windows putting the results of the analyzer script into Excel and deleting the column with the URL in it.
Hurm, d'ya think it'd be worth it?
It's up to you, but I think so. - C
Chris McDonough wrote:
You can also use cumulative reporting to sort by "hangs":
requestprofiler.py your.log.file --cumulative --sort=hangs
python requestprofiler.py my.log --cumulative --sort=hangs > out2.txt resulted in: Traceback (innermost last): File "requestprofiler.py", line 586, in ? analyze(open(sys.argv[1]), top, sortf, start, end, mode) File "requestprofiler.py", line 346, in analyze write(dict, top) File "requestprofiler.py", line 360, in write print str(stat)[:78] File "requestprofiler.py", line 226, in __str__ body = ( File "requestprofiler.py", line 273, in median else: return (all[i] + all[i2]) / 2 TypeError: number coercion failed Any ideas? cheers, Chris
Please apply this patch, sorry! *** requestprofiler.py 2001/04/26 02:51:49 1.3 --- requestprofiler.py 2001/05/03 15:12:13 *************** *** 269,276 **** i2 = i + 1 v1 = all[i] v2 = all[i2] ! if v1 == "NA" or v2 == "NA": return "I" ! else: return (all[i] + all[i2]) / 2 def total(self): t = 0 --- 269,276 ---- i2 = i + 1 v1 = all[i] v2 = all[i2] ! if type(v1) is type('') or type(v2) is type(''): return "I" ! else: return (v1 + v2) / 2 def total(self): t = 0 ----- Original Message ----- From: "Chris Withers" <chrisw@nipltd.com> To: "Chris McDonough" <chrism@digicool.com> Cc: "Tim McLaughlin" <tim@BCSWebservices.net>; <zope-dev@zope.org> Sent: Thursday, May 03, 2001 10:35 AM Subject: Re: [Zope-dev] Zope Server hanging :-(
Chris McDonough wrote:
You can also use cumulative reporting to sort by "hangs":
requestprofiler.py your.log.file --cumulative --sort=hangs
python requestprofiler.py my.log --cumulative --sort=hangs > out2.txt
resulted in:
Traceback (innermost last): File "requestprofiler.py", line 586, in ? analyze(open(sys.argv[1]), top, sortf, start, end, mode) File "requestprofiler.py", line 346, in analyze write(dict, top) File "requestprofiler.py", line 360, in write print str(stat)[:78] File "requestprofiler.py", line 226, in __str__ body = ( File "requestprofiler.py", line 273, in median else: return (all[i] + all[i2]) / 2 TypeError: number coercion failed
Any ideas?
cheers,
Chris
Well, it's still happening, this time I was just browsing the management interface :-( The last thing in Z2.log was: 194.193.44.4 - - [02/May/2001:14:35:30 +0100] "GET /VirtualHostBase/http/server.nipltd.com:80/VirtualHostRoot/manage_workspace HTTP/1.0" 302 440 "http://www.cases.nipltd.com/manage_menu" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)" The last thing in the stupid log was: 2001-05-02T13:35:30 ERROR(200) ZODB Couldn't load state for '\000\000\000\000\000\020\030\177' Traceback (innermost last): File /usr/local/zope/Zope-2.3.2-src/lib/python/ZODB/Connection.py, line 508, in setstate AttributeError: 'None' object has no attribute 'load' When I tried to restart I got: 2001-05-02T13:42:00 PANIC(300) z2 Startup exception Traceback (innermost last): File /usr/local/zope/current/z2.py, line 566, in ? File <string>, line 1, in ? File /usr/local/zope/Zope-2.3.2-src/lib/python/Zope/__init__.py, line 110, in ? File /usr/local/zope/Zope-2.3.2-src/lib/python/ZODB/FileStorage.py, line 264, in __init__ File /usr/local/zope/Zope-2.3.2-src/lib/python/ZODB/lock_file.py, line 101, in lock_file StorageSystemError: Could not lock the database file. There must be another process that has opened the file. Big waaaagh! That meant I needed to rm Data.fs.lock, sadly I actually did rm Data.fs due to tab-name completion and stupidity (tab and return do different things, duh!) Would anyone support a plea to have the lock file called something else? Anyway, once I'd got that restored off tape, then next restart yielded: Traceback (innermost last): File /usr/local/zope/current/z2.py, line 623, in ? File /usr/local/zope/Zope-2.3.2-src/ZServer/HTTPServer.py, line 391, in __init__ File /usr/local/zope/Zope-2.3.2-src/ZServer/medusa/http_server.py, line 542, in __init__ File /usr/local/zope/Zope-2.3.2-src/ZServer/medusa/asyncore.py, line 250, in bind error: (98, 'Address already in use') One killall -9 python later, and I now have Zope up and running with -M working. Wierdness with -M, I did -M var/server.dlog, Zope bitched that it couldn't create var/var/server.dlog I took the var out and it's okay. I wonder what gives? Anyway, if anyone's got ideas as to WTF is going on here, I'd love to hear them :-S cheers, Chris
participants (3)
-
Chris McDonough -
Chris Withers -
Tim McLaughlin