[Zope-dev] Possible Windows Service improvements.

Wed Aug 4 09:18:37 EDT 2004

On Wed, 2004-08-04 at 07:35, Mark Hammond wrote:
> Hi all,
>   I am starting to venture into the wonderful world of Zope!  With the
> benefit of a complete lack of Zope experience, I have been able to look at
> the Windows service support from a fairly clean slate.  However, I also
> realize this lack of experience means my ideas may be naive - hence I have
> attempted to split them into discrete issues for discrete rejection <wink>.
> 
> 1) startup error redirection.
> I've noticed that the main Zope service driver for Windows seems to work
> fine when everything is set up correctly, but when things go wrong it offers
> no clues as to what.  This is reflected in collector item 1020 ("poor error
> reporting on product initialisation failure under windows").  Issue 1408
> ("Configuration file imports don't see INSTANCE_HOME when running Zope as a
> windows service"), via the referenced thread, has evidence of someone
> burning a day due to this.  It cost me a lot of time too :)

Yes, sorry about that!  (I was the fool who checked in the NameError
referenced in 1408).

> I propose:
> Each time the child process terminates with a non-zero return code, the tail
> x-bytes of the child output be written to the Windows event log, where x~2k.

This is a good idea.  FWIW, I believe the Zope HEAD already has some
work done towards this (in lib/python/nt_svcutils/service.py), although
the child output goes to a logfile instead of the event log.  It would
be nice to make the output go to the event log and then backport this to
2.7.
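
A minimal sketch of the "tail x bytes" part, assuming the child output has
already been captured to a file.  The name tail_bytes and the 2048 default
are placeholders of mine, not anything in nt_svcutils.  Writing the result
to the event log is a separate step (pywin32's servicemanager.LogErrorMsg
would be one candidate) and is not shown here.

```python
import os


def tail_bytes(path, limit=2048):
    """Return at most the last `limit` bytes of the file at `path`."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Start `limit` bytes before the end, or at 0 for short files.
        f.seek(max(0, size - limit))
        return f.read()
```

The service would call this only when the child exits with a non-zero
return code, so a healthy instance never touches the event log.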

> 2) reporting of "successful start" and "backoff" strategy.
> A trivial startup error (eg, PYTHONPATH not set) will cause the Zope service
> to hopelessly retry for a number of minutes, and not respond to shutdown
> requests during a retry.

Yup.  It retry-restarts because the logic is simple and stupid, and it
doesn't respond to shutdown requests during a retry because the service
code sleeps for the backoff interval after an unsuccessful startup.  Any
async requests that arrive in the meantime are blocked until this sleep
ends.  I'm not quite sure how to do that better.
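
One possibility, as an untested sketch rather than a fix: wait on an event
with a timeout instead of sleeping, so a stop request can cut the backoff
short.  The names stop_requested and backoff_wait are made up for
illustration.  In the real service the equivalent would presumably be
win32event.WaitForSingleObject on the service's stop event, but I'm
using a plain threading.Event here to keep it self-contained.

```python
import threading

# Set by the (hypothetical) stop handler when Windows asks us to stop.
stop_requested = threading.Event()


def backoff_wait(seconds):
    """Wait out the backoff interval.

    Returns True if a stop request arrived before the interval ended,
    False if the full interval elapsed.
    """
    return stop_requested.wait(seconds)
```

The retry loop would then bail out whenever backoff_wait returns True
instead of going on to relaunch the child.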

> At the moment, as soon as the service starts it reports "successful startup"
> to Windows.  It then begins an attempt to start the child.  If the child
> immediately fails, the code immediately begins the "backoff" strategy.  This
> strategy appears to have 2 main purposes:
> * Startup may fail due to other 'services' not having yet started, so retry
> in the hope they become available.
> * The process may die due to some obscure error - restart it.

One concrete example of the "obscure error" case: the Zope process exits
when it handles a "restart" request from its Control Panel web interface,
and it is the service that starts it again.

> On windows, assuming we install the service to depend on the "tcpip"
> service, I see no reason that the first reason is valid.  If the process
> fails quickly the first time we attempt to launch it, it is almost certainly
> going to fail every time we try and launch it.
> 
> The current strategy also means that 3rd party services could not themselves
> depend on the Zope service - the Zope service will report successful startup
> before it really has (and therefore the dependent service may itself fail).
> This isn't a known requirement today, but who knows!  "net start" and other
> front ends also fail to detect fatal errors - they all say Zope started OK.
> 
> I propose:
> We insist the child process can be created and continues to run for x
> seconds (where x~5).  If that fails, we report an error (never reporting to
> Windows that we started successfully).  If the child process stays alive for
> this period, we report success to Windows, and then use the existing backoff
> strategy should it die.  If the machine is heavily loaded, this 5 seconds
> may expire before the fatal error is hit in the child - in that case, we are
> simply doing what we do now - using the backoff strategy to hopelessly
> attempt a restart - ie, a win in most cases, and no loss in the others.

That sounds good.
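
Roughly, I'd picture the startup check looking like the sketch below.  It
is only an illustration under my own assumptions: start_child and the
grace parameter are hypothetical names, and it uses the standard
subprocess module rather than whatever the service code does today to
launch the child.

```python
import subprocess


def start_child(cmd, grace=5.0):
    """Start `cmd` and watch it for `grace` seconds.

    Returns (proc, None) if the child is still running after the grace
    period, or (None, returncode) if it exited before then.
    """
    proc = subprocess.Popen(cmd)
    try:
        returncode = proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # Still alive: this is where we would report success to Windows.
        return proc, None
    # Exited early (even with code 0): report a startup failure instead.
    return None, returncode
```

Only the first case would report a successful start to Windows; after
that the existing backoff strategy takes over if the child dies.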

> 
> 3) environment setting
> The service process should set a number of environment variables before
> spawning the child - PYTHONPATH at a minimum, and according to issue #1408,
> INSTANCE_HOME.  It already knows these values thanks to mkzopeinstance.  I
> have yet to determine where these values come from in the binary build,
> but I see no reason not to fix this (and possibly remove whatever magic the
> binary does).
> 
> I propose:
> A few trivial os.environ insertions based on the substitutions done by
> mkzopeinstance, before we create the child process(es).  Alternatively, we
> could create an explicit new environment to pass to CreateProcess, but I
> see no good reason for that.

Note that the Zope Python install also has a sitecustomize.py that
munges sys.path in order to get things set up properly.  Others have
claimed this is unnecessary and that the work that gets done in there
could be done in the service code.  It's a bit of a mess.  At one point
I flailed trying to make the child process inherit its environment from
the parent, and plastered over the problem with various sys.path and
PYTHONPATH and other environment variable settings.  The current
situation is a result.  Some guidance here would be helpful.  
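
To make the discussion concrete, here is a sketch of the "explicit
environment" variant.  Everything in it is an assumption for illustration:
the function name child_environ, the software_home argument (standing in
for wherever the Zope libraries live), and the choice to prepend rather
than replace PYTHONPATH.

```python
import os


def child_environ(instance_home, software_home, base=None):
    """Return a copy of the environment with Zope's variables filled in."""
    env = dict(os.environ if base is None else base)
    env["INSTANCE_HOME"] = instance_home
    # Prepend so the Zope libraries win over anything already on the path.
    existing = env.get("PYTHONPATH")
    if existing:
        env["PYTHONPATH"] = software_home + os.pathsep + existing
    else:
        env["PYTHONPATH"] = software_home
    return env
```

Building a copy leaves the service's own os.environ alone, which is the
main difference from the simpler "insert into os.environ" proposal.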

> 4) Currently, when the process is stopped, we immediately terminate the
> child process.  This seems dangerous.  We should find a way to gracefully
> terminate the child, and try that before we simply kill it.
> 
> I propose:
> That someone help me work out how to do this <wink>.  I've already worked
> out how to do it if the service knows the username/password of a Zope
> administrator, but it doesn't!  Sending a Ctrl_C 'signal' doesn't work
> without hacks to run.py (and I have yet to confirm it will work even with
> such hacks).

I'm a Windows signal idiot.  Is there a way to make the Zope process
capture Windows signals, so that it shuts down "cleanly" when the Windows
equivalent of SIGTERM is sent to it?  This is how it works on UNIX, but on
Windows we skip listening for signals entirely at startup.  There are all
sorts of hooks for "clean" shutdown now that we can co-opt if we can make
the process capture a signal.

Note that the UNIX environment has a lot of additional niceties built on
responses to signals (like logfile rotation) that Windows currently lacks,
which tends to relegate Windows to a second-class platform on which to run
a production Zope instance.

> I welcome any feedback on these issues.  Obviously I am willing to back each
> of these proposals up (except 4!) with code that seems to work :)  I would
> also welcome feedback on the best way to proceed (ie, create a new collector
> for each issue?  thrash it out here?  give up?<wink>, etc)

I can help!  I think.

It would be good to populate the collector to track progress on each one
of these issues.

> Note that none of these issues would require a win32all/pywin32 update.  If
> anyone was really upset by issue 1423 ("Zope 2.7.1 won't run as service
> under NT"), and also able to test, I'd be willing to fix it - but that
> *would* require a pywin32 upgrade.  Tim has already kindly filled me in on
> that background, so it may not be trivial (ie, I would need help!)

If it's easy to fix, it's worth it; otherwise, I wouldn't bother
personally.

- C