We recently had an event at a customer site where a ZEO process was started with bogus arguments and kept spewing messages into the log file forever: the zdaemon module kept restarting the program, and the program kept failing (without logging a message!). I think we've all gone through this. Killing the rogue program is a bit of a pain because it dies before the pid file is written.
I am thinking of a fairly simple change to zdaemon: if it finds that it is continuously respawning the program more than 10 times in 2 minutes, it assumes there is a fatal error, logs a PANIC level message, and exits. (I took the criterion, but not the response, from init(8).)
Are there any concerns about checking this in? This would go into Zope 2.7 and ZODB 3.1, and could possibly be backported to 2.6 and 2.5.
--Guido van Rossum (home page: http://www.python.org/~guido/)
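The "more than 10 times in 2 minutes" criterion can be read as a sliding-window counter. The following is only a minimal sketch of that reading, not zdaemon's actual code; the class name `RespawnLimiter` and its parameters are hypothetical, and the clock is injectable purely so the behaviour can be checked without waiting.

```python
import time
from collections import deque


class RespawnLimiter:
    """Hypothetical helper: detect respawns that come too fast."""

    def __init__(self, max_respawns=10, window=120.0, clock=time.monotonic):
        self.max_respawns = max_respawns  # assumed from "10 times"
        self.window = window              # assumed from "2 minutes", in seconds
        self.clock = clock
        self.stamps = deque()

    def record(self):
        """Record one respawn; return True once the limit is exceeded."""
        now = self.clock()
        self.stamps.append(now)
        # Forget respawns that have fallen out of the sliding window.
        while self.stamps and now - self.stamps[0] > self.window:
            self.stamps.popleft()
        return len(self.stamps) > self.max_respawns
```

A supervising loop would call `record()` each time it restarts the program and stop (after logging) as soon as it returns True.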
On Saturday 05 Oct 2002 8:29 am, Guido van Rossum wrote:
I am thinking of a fairly simple change to zdaemon: if it finds that it is continuously respawning the program more than 10 times in 2 minutes, it assumes there is a fatal error, logs a PANIC level message, and exits.
(I took the criterion, but not the response, from init(8).)
In this scenario init pauses for a few minutes, rather than aborting. I would like an option to prevent zdaemon from aborting, and I am surprised you don't want it as the default.
I think init uses a simple fixed pause... an exponential backoff would probably be smarter (like how a disconnected ZEO ClientStorage tries to reconnect to its server).
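For comparison with a fixed pause, an exponential backoff could look roughly like the sketch below. This is not how init, zdaemon, or ClientStorage actually compute their delays; the function name `backoff_delays` and the values for `initial`, `factor`, and `maximum` are assumptions chosen only for illustration.

```python
from itertools import islice


def backoff_delays(initial=1.0, factor=2.0, maximum=300.0):
    """Yield an endless series of pauses (seconds) that grow up to a cap."""
    delay = initial
    while True:
        yield delay
        # Grow the pause geometrically, but never beyond the cap.
        delay = min(delay * factor, maximum)


if __name__ == "__main__":
    print(list(islice(backoff_delays(), 5)))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

The cap matters for the production case: without it, a long outage would push the next retry arbitrarily far into the future.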
I am thinking of a fairly simple change to zdaemon: if it finds that it is continuously respawning the program more than 10 times in 2 minutes, it assumes there is a fatal error, logs a PANIC level message, and exits.
(I took the criterion, but not the response, from init(8).)
In this scenario init pauses for a few minutes, rather than aborting. I would like an option to prevent zdaemon from aborting, and I am surprised you don't want it as the default.
I think init uses a simple fixed pause... an exponential backoff would probably be smarter (like how a disconnected ZEO ClientStorage tries to reconnect to its server).
I thought about this, and figured it wasn't necessary. Unlike init, zdaemon only manages one process. When that process doesn't get past its initialization, manual intervention is normally required to make it run again; that manual intervention can include restarting it.
But I have to admit that the use case I've been thinking of is that of starting zeo and finding that it crashes immediately, over and over. There the auto-stop is just what you need (there's no point in filling up the log file while you're thinking about what could have caused this).
There's a different use case where something changes in the environment after the program has run successfully for a while, which causes it to crash and causes subsequent restarts to crash immediately. It is *possible* that the environment fixes itself after a while -- it could be something like a network, DNS or NFS outage -- and then an auto-restart option might be nice.
I'm not sure what should be the default -- as a developer, I prefer that it stops (and I hate that zdaemon is the default at all), but for a production site something different might be in order.
--Guido van Rossum (home page: http://www.python.org/~guido/)
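The two use cases above (stop for a developer, keep retrying for a production site) could be served by one switch. The sketch below is only an illustration under stated assumptions: `supervise`, `on_limit`, `pause`, and `max_runs` are hypothetical names, `run_once` stands in for "start the program and wait for it to exit", and the standard `logging` module's CRITICAL level is used in place of the PANIC level mentioned in the proposal.

```python
import logging
import time

log = logging.getLogger("respawn-sketch")


def supervise(run_once, on_limit="exit", max_respawns=10, window=120.0,
              pause=300.0, clock=time.monotonic, sleep=time.sleep,
              max_runs=None):
    """Run run_once() repeatedly and react when it exits too often.

    on_limit="exit":  log at CRITICAL and give up (the developer case).
    on_limit="pause": wait `pause` seconds, then resume (the production case).
    max_runs is only a safety valve for demos and tests.
    """
    stamps = []
    runs = 0
    while max_runs is None or runs < max_runs:
        run_once()  # returns when the managed program exits
        runs += 1
        now = clock()
        # Keep only the exits that fall inside the sliding window.
        stamps = [t for t in stamps if now - t <= window] + [now]
        if len(stamps) > max_respawns:
            if on_limit == "exit":
                log.critical("respawning too fast, giving up")
                return "gave up"
            log.error("respawning too fast, pausing %.0f s", pause)
            sleep(pause)
            stamps = []  # start counting afresh after the pause
    return "stopped"
```

With `on_limit="pause"` a transient network, DNS or NFS outage costs only some downtime, while `on_limit="exit"` keeps the log file from filling up during a misconfigured start.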
On Sat, 5 Oct 2002, Guido van Rossum wrote:
There's a different use case where something changes in the environment after the program has run successfully for a while, which causes it to crash and causes subsequent restarts to crash immediately. It is *possible* that the environment fixes itself after a while -- it could be something like a network, DNS or NFS outage -- and then an auto-restart option might be nice.
I'm not sure what should be the default -- as a developer, I prefer that it stops (and I hate that zdaemon is the default at all), but for a production site something different might be in order.
Definitely. I have exactly that scenario in a production environment: occasionally we will have too much disk usage, and a nightly cron job will fill up the disk. The job eventually deletes the file. When this happens, currently Zope will crash for a while, and then when the disk is available again will restart successfully. Sometimes, however, it crashes completely and has to be manually restarted. It hasn't happened often enough (we try not to let the disk get that full!) for me to put time into tracking down why, but I'd certainly hate to have a zdaemon crash as the only choice.
I don't have a strong opinion on which should be the default, but since production generally comes after development, and currently (or at least as of 2.4) debug mode is the default, I suspect that crash should be the default.
--RDM
participants (3)
- Guido van Rossum
- R. David Murray
- Toby Dickenson