Need help : Zope servers hanging.
Hi all,

I need some help here. Over the past few days two different Zope servers have gone into the 'hanging' state, where they don't reply to further requests. When the first event happened I didn't take a 'top'; today I've managed to get one. The process causing the problem is 12482. From previous messages, I believe that the python CPU can go up to 100%; obviously this isn't happening here. This happened when I asked the Zope server to make a MySQL query. The MySQL server is running fine and I can get to it from a command line interface.

Both servers are Zope 2.1.2 source distributions running under Solaris 5.6.

This server is running three different Zope sites using Apache as the backend (ie I'm using pcgi to get to my servers). I can't get to it using the pcgi route (ie a ReWrite Rule from Apache), nor from the ZServer incarnation of the server.

I also can't get to it from the monitor connection (telnet localhost 8099).

I can't let this situation continue as these are live sites. I need to restart the server whenever this happens.

Process list:

load averages:  1.39,  1.11,  0.63                    08:51:59
262 processes: 258 sleeping, 2 zombie, 2 on cpu
CPU states: 74.6% idle, 25.0% user,  0.4% kernel,  0.0% iowait,  0.0% swap
Memory: 512M real, 25M free, 560M swap in use, 736M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
12482 nnle       9  -5    0   38M   20M cpu/0 267:57 24.94% python
23032 nnle       1  23    0 1992K 1456K cpu/2   0:00  0.32% top
 6736 nnle       8  33    0   12M 9360K sleep   3:55  0.00% roxen
15072 nnle       8  33    0   14M   11M sleep   0:59  0.00% python
 1848 nnle       7  33    0   10M 7728K sleep   0:40  0.00% python
15071 nnle       4 -25    0 4240K 1304K sleep   0:00  0.00% python
 1847 nnle       4 -25    0 4240K  856K sleep   0:00  0.00% python
  656 nnle       1 -25    0  928K  512K sleep   0:00  0.00% start
12481 nnle       4  -5    0 4240K  856K sleep   0:00  0.00% python
18305 nnle       1  23    0 2056K 1832K sleep   0:00  0.00% tcsh
19302 nnle       1  23    0 2000K 1040K sleep   0:00  0.00% tcsh
19481 nnle       1  33    0 1000K  672K sleep   0:00  0.00% grep

The only other data I have is that the pcgi for this site is shown as running in the process list quite a few times.

nobody 23059  4659  0 08:54:28 ?   0:00 /home/nnle/MED_DUR_NOTTS/pcgi/pcgi-wrapper /home/nnle/MED_DUR_NOTTS/Zope.cgi
nobody 23135  4716  0 09:03:18 ?   0:00 /home/nnle/MED_DUR_NOTTS/pcgi/pcgi-wrapper /home/nnle/MED_DUR_NOTTS/Zope.cgi
nobody 23069  4574  0 08:57:03 ?   0:00 /home/nnle/MED_DUR_NOTTS/pcgi/pcgi-wrapper /home/nnle/MED_DUR_NOTTS/Zope.cgi
nobody 23068  4753  0 08:56:52 ?   0:00 /home/nnle/MED_DUR_NOTTS/pcgi/pcgi-wrapper /home/nnle/MED_DUR_NOTTS/Zope.cgi
nobody 23144  4694  0 09:04:11 ?   0:00 /home/nnle/MED_DUR_NOTTS/pcgi/pcgi-wrapper /home/nnle/MED_DUR_NOTTS/Zope.cgi
nnle   12481     1  0   Jan 21 ?   0:00 /usr/local/bin/python /home/nnle/MED_DUR_NOTTS/z2.py
nobody 23064  4737  0 08:55:59 ?   0:00 /home/nnle/MED_DUR_NOTTS/pcgi/pcgi-wrapper /home/nnle/MED_DUR_NOTTS/Zope.cgi
nnle   12482 12481 25   Jan 21 ? 280:44 /usr/local/bin/python /home/nnle/MED_DUR_NOTTS/z2.py

*any* help at all on this would be really appreciated.

Tone
------
Dr Tony McDonald, FMCC, Networked Learning Environments Project
http://nle.ncl.ac.uk/
The Medical School, Newcastle University Tel: +44 191 222 5888
Fingerprint: 3450 876D FA41 B926 D3DD F8C3 F2D0 C3B9 8B38 18A2
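
The pile-up of pcgi-wrapper processes noted in the message above is something a small script could watch for. What follows is only a rough sketch, not anything the original poster ran: it assumes the listing has the same column layout as the one shown, and the function names and the threshold are hypothetical, chosen purely for illustration.

```python
import re

# A few lines copied from the process listing in the message above.
PS_SAMPLE = """\
nobody 23059  4659  0 08:54:28 ?   0:00 /home/nnle/MED_DUR_NOTTS/pcgi/pcgi-wrapper /home/nnle/MED_DUR_NOTTS/Zope.cgi
nobody 23135  4716  0 09:03:18 ?   0:00 /home/nnle/MED_DUR_NOTTS/pcgi/pcgi-wrapper /home/nnle/MED_DUR_NOTTS/Zope.cgi
nnle   12481     1  0   Jan 21 ?   0:00 /usr/local/bin/python /home/nnle/MED_DUR_NOTTS/z2.py
nnle   12482 12481 25   Jan 21 ? 280:44 /usr/local/bin/python /home/nnle/MED_DUR_NOTTS/z2.py
"""


def count_pcgi_wrappers(ps_text):
    """Count pcgi-wrapper processes per Zope.cgi path in a ps-style listing."""
    counts = {}
    for line in ps_text.splitlines():
        # Match "<anything>pcgi-wrapper <argument>"; the argument is the .cgi path.
        match = re.search(r"(\S*pcgi-wrapper)\s+(\S+)", line)
        if match:
            cgi_path = match.group(2)
            counts[cgi_path] = counts.get(cgi_path, 0) + 1
    return counts


def wrappers_piling_up(ps_text, threshold=3):
    """Return the .cgi paths with at least `threshold` waiting wrappers.

    The threshold is an arbitrary, hypothetical cut-off; tune it per site.
    """
    counts = count_pcgi_wrappers(ps_text)
    return sorted(path for path, n in counts.items() if n >= threshold)
```

Several wrappers waiting on the same Zope.cgi at once may simply mean a busy site, so a count like this is a hint that the backing Zope process has stopped answering, not proof of it.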
Tone,

It looks as if the problems you have seem to be fixed in the 2.1.3 version.

<quote>
  - A race condition in the logic for managing Zope database connections caused Zope to hang on very busy sites.

  - A bug in the packing code that caused records to be unreadable after:
      o someone did work in a version
      o someone did an (unrelated) undo
      o the version was committed and the database was packed to a time before the work was done in the version.

  - Fixed a bug that caused packing to raise an error in the following situation:
      o someone modifies and then deletes an object in a version.
      o they commit the version
      o the database is packed between the time the object is deleted and the time the version is committed.

  - Fixed a bug that caused Zope to sometimes hang instead of shutting down or restarting when accessed over a fast network.

  - It wasn't possible to use a ZClass instance as a method of a ZClass.
</quote>

Have you tried upgrading? I'd recommend it.

Phil
phil.harris@zope.co.uk

|>-----Original Message-----
|>From: zope-admin@zope.org [mailto:zope-admin@zope.org]On Behalf Of Tony McDonald
|>Sent: Monday, January 31, 2000 9:25 AM
|>To: Zope List
|>Subject: [Zope] Need help : Zope servers hanging.
|>
|>_______________________________________________
|>Zope maillist - Zope@zope.org
|>http://lists.zope.org/mailman/listinfo/zope
|>** No cross posts or HTML encoding! **
|>(Related lists -
|> http://lists.zope.org/mailman/listinfo/zope-announce
|> http://lists.zope.org/mailman/listinfo/zope-dev )
At 10:24 am +0000 31/1/00, Phil Harris wrote:
Tone,
It looks as if the problems you have seem to be fixed in the 2.1.3 version.
Ah, Nice one Phil. I'm setting up a new 'checkout' site with 2.1.3 now...

Tone
I can get this to happen sometimes if I view a dtml-method that does NOT contain all of the standard HTML tags for opening a document (such as <HTML>, <HEAD>, etc). Generally, if the dtml-method just starts with a <table> tag and ends with a </table> tag, it has at least a 50% chance of hanging the server.

--sam
--
Sam Gendler
Chief Technology Officer - Impossible, Inc.
1222 State St. Suite 250 Santa Barbara CA. 93101
w: 805-560-0508 f: 805-560-0608 c: 805-689-1191
e: sgendler@impossible.com
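
One way to check an observation like Sam's without tying up a browser is to request the suspect method with a timeout and see whether any reply comes back. The sketch below is only an illustration: the function name is hypothetical, it uses nothing beyond the Python standard library, and it should be pointed at a test instance, since a request that triggers the hang would take a live site down.

```python
import http.client
import urllib.error
import urllib.request


def responds_within(url, timeout_seconds=10.0):
    """Return True if the server sends any HTTP response before timing out.

    The timeout applies to each blocking socket operation (connect, read),
    so it bounds how long we wait on a silent server.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds):
            return True
    except urllib.error.HTTPError:
        # An error status (404, 500, ...) still proves the server answered.
        return False if False else True
    except (OSError, http.client.HTTPException):
        # Refused connection, timeout, or a dropped or malformed reply.
        return False
```

A call such as `responds_within("http://localhost:8080/some_folder/table_only_method", 5.0)` would then report whether the method answered; the host, port, and path in that call are placeholders, not values from this thread.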
At 1:45 am -0800 1/2/00, Sam Gendler wrote:
I can get this to happen sometimes if I view a dtml-method that does NOT contain all of the standard HTML tags for opening a document (such as <HTML>, <HEAD>, etc). Generally, if the dtml-method just starts with a <table> tag and ends with a </table> tag, it has at least a 50% chance of hanging the server.
--sam
That's *exactly* the situation I have! It's a method that doesn't display any HTML headers (it does some DB stuff though) and displays some HTML (including a table). It's called as a code 'block' from a document I have.

*Many* thanks for pointing that out to me Sam.

DC Folks: Would this constitute a bug? Should I (or Sam!) file a report with the Collector?

Tone
This can, of course, create a huge problem if a catalog search returns one of these methods, and a user clicks on it. I wound up setting up my catalogs to only search through dtml documents...

--sam
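
The idea of keeping DTML Methods out of search results can be illustrated with a plain-Python stand-in. This is not Sam's actual catalog configuration, which the thread does not show, and it filters results after the fact instead of restricting what gets cataloged. The record dictionaries and the function name are hypothetical; the only Zope-specific detail assumed is that objects report their type through `meta_type`, with the values "DTML Document" and "DTML Method".

```python
# meta_type values that are safe to link to from a results page.
# Assumption: DTML Documents render as full pages, DTML Methods may not.
SAFE_META_TYPES = frozenset({"DTML Document"})


def linkable_results(records, allowed=SAFE_META_TYPES):
    """Keep only the search hits whose meta_type is in `allowed`."""
    return [record for record in records if record.get("meta_type") in allowed]


# Hypothetical search hits, standing in for whatever a catalog query returns.
records = [
    {"id": "index_html", "meta_type": "DTML Document"},
    {"id": "results_table", "meta_type": "DTML Method"},
]

for record in linkable_results(records):
    print(record["id"])
```

Filtering on type this way only hides the fragile methods from users; it does nothing about the underlying hang.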