Python 2.4.2
Zope 2.8.4
ZODB/ZEO 2.3.4
MySQL 4.0
Dual Athlon processors
Linux RH7.3

We have two recent instances in our production sites where Zope suddenly stops responding. It is not a new problem, but we've now been confronted with two clean examples and nothing to blame them on. The problem appears to be independent of load as both incidents were on lightly loaded machines.

A check of the logs (Linux and Zope) shows nothing obviously amiss, except that the trace log (the old -M log) shows a sudden increase in active requests from the typical 0 or 1 to 1300 or more. In this context an "active request" is the total number of requests pending at the end of this request and is computed by post-processing. We front-end Zope with pound and make heavy use of MySQL. Both show a plethora of incomplete transactions.

Examination of the raw trace log shows that Zope is continuing to accept requests, but nothing is getting done. The raw log date-stamps four internal states for each transaction. The states are Begin (B), Input (I), Action (A), and End (E). Input is gathered between B and I, output is made between A and E. The raw log shows B and I transactions, but apparently no processing is completing. I suspect that nothing is getting scheduled.

I am at a loss as to where to begin to track this one down. The failure is spontaneous and apparently not triggered by any readily distinguishable inputs or pattern of inputs. The behavior smells a bit of resource limits or process synchronization problems, but there is no real evidence for either being the root cause. I am not sure what monitoring I should be doing to help locate the source of the problem.

Has anyone seen a similar problem? Any advice as to how to proceed?
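(For the curious, the post-processing is roughly along these lines. This is a simplified sketch: the field positions are assumptions and would need adjusting to the actual -M trace-log layout.)

    # Rough sketch of the post-processing described above: count how many
    # requests are still pending at each 'E' record.  Assumes each trace-log
    # line starts with the state letter (B/I/A/E) followed by a request id.
    def pending_counts(logfile):
        open_requests = set()
        counts = []
        for line in open(logfile):
            parts = line.split()
            if len(parts) < 2 or parts[0] not in 'BIAE':
                continue
            state, req_id = parts[0], parts[1]
            if state == 'B':
                open_requests.add(req_id)
            elif state == 'E':
                open_requests.discard(req_id)
                counts.append(len(open_requests))  # "active requests" at end of this request
        return counts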
Dennis Allison wrote:
We have two recent instances in our production sites where Zope suddenly stops responding. It is not a new problem, but we've now been confronted with two clean examples and nothing to blame them on. The problem appears to be independent of load as both incidents were on lightly loaded machines.
A check of the logs (Linux and Zope) shows nothing obviously amiss, except that the trace log (the old -M log) shows a sudden increase in active requests from the typical 0 or 1 to 1300 or more. In this context an "active request" is the total number of requests pending at the end of this request and is computed by post-processing. We front-end Zope with pound and make heavy use of MySQL. Both show a plethora of incomplete transactions.
Examination of the raw trace log shows that Zope is continuing to accept requests, but nothing is getting done. The raw log date-stamps four internal states for each transaction. The states are Begin (B), Input (I), Action (A), and End (E). Input is gathered between B and I, output is made between A and E. The raw log shows B and I transactions, but apparently no processing is completing. I suspect that nothing is getting scheduled.
I am at a loss as to where to begin to track this one down. The failure is spontaneous and apparently not triggered by any readily distinguishable inputs or pattern of inputs. The behavior smells a bit of resource limits or process synchronization problems, but there is no real evidence for either being the root cause. I am not sure what monitoring I should be doing to help locate the source of the problem.
Has anyone seen a similar problem? Any advice as to how to proceed?
Threads are hanging. You should install my DeadlockDebugger and track down where the hung threads are blocked. From the description I'd wager that you'll find your threads stuck in a corner of the MySQL DA, in which case you'd have to find out why it deadlocks and fix it.

Florent
--
Florent Guillaume, Nuxeo (Paris, France)   Director of R&D
+33 1 40 33 71 59   http://nuxeo.com   fg@nuxeo.com
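(For readers without the product handy, the core idea is just to dump a traceback for every thread when the server wedges. A minimal sketch of that idea follows; it assumes Python 2.5+ for sys._current_frames(), whereas the real tool, targeting Python 2.4, gets at thread frames via the threadframe extension if I recall correctly.)

    # Minimal sketch of the idea behind a deadlock debugger: dump a Python
    # traceback for every running thread so you can see where they block.
    # Assumes sys._current_frames() is available (Python 2.5+).
    import sys
    import traceback

    def dump_all_threads(out=sys.stderr):
        for thread_id, frame in sys._current_frames().items():
            out.write('\n--- thread %s ---\n' % thread_id)
            traceback.print_stack(frame, file=out)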
Just went through that exercise with Andy and installed a patch to MySQLDA that effectively ignores the 'release unlocked lock' problem that has been plaguing us. I shoulda guessed that is the first place to look.

I'll get and install the DeadlockDebugger forthwith.

Thanks.

On Mon, 28 Nov 2005, Florent Guillaume wrote:
Dennis Allison wrote:
We have two recent instances in our production sites where Zope suddenly stops responding. It is not a new problem, but we've now been confronted with two clean examples and nothing to blame them on. The problem appears to be independent of load as both incidents were on lightly loaded machines.
A check of the logs (Linux and Zope) shows nothing obviously amiss, except that the trace log (the old -M log) shows a sudden increase in active requests from the typical 0 or 1 to 1300 or more. In this context an "active request" is the total number of requests pending at the end of this request and is computed by post-processing. We front-end Zope with pound and make heavy use of MySQL. Both show a plethora of incomplete transactions.
Examination of the raw trace log shows that Zope is continuing to accept requests, but nothing is getting done. The raw log date-stamps four internal states for each transaction. The states are Begin (B), Input (I), Action (A), and End (E). Input is gathered between B and I, output is made between A and E. The raw log shows B and I transactions, but apparently no processing is completing. I suspect that nothing is getting scheduled.
I am at a loss as to where to begin to track this one down. The failure is spontaneous and apparently not triggered by any readily distinguishable inputs or pattern of inputs. The behavior smells a bit of resource limits or process synchronization problems, but there is no real evidence for either being the root cause. I am not sure what monitoring I should be doing to help locate the source of the problem.
Has anyone seen a similar problem? Any advice as to how to proceed?
Threads are hanging. You should install my DeadlockDebugger and track down where the hung threads are blocked.
From the description I'd wager that you'll find your threads stuck in a corner of the MySQL DA. In which case you'd have to find why it deadlocks and find a fix.
Florent
Does this mean that you haven't seen the errors since installing Andy's patch? If so, I'd declare victory and forget about using the deadlock debugger (unless you want to do it for learning purposes only).

On Nov 27, 2005, at 8:46 PM, Dennis Allison wrote:
Just went through that exercise with Andy and installed a patch to MySQLDA that effectively ignores the 'release unlocked lock' problem that has been plaguing us. I shoulda guessed that is the first place to look.
I'll get and install the DeadlockDebugger forthwith.
Thanks.
I have not seen any of the threading error exceptions--but then the patch catches them in the ZMySQLDA adaptor and punts... On Sun, 27 Nov 2005, Chris McDonough wrote:
Does this mean that you haven't seen the errors since installing Andy's patch? If so, I'd declare victory and forget about using the deadlock debugger (unless you want to do it for learning purposes only).
On Nov 27, 2005, at 8:46 PM, Dennis Allison wrote:
Just went through that exercise with Andy and installed a patch to MySQLDA that effectively ignores the 'release unlocked lock' problem that has been plaguing us. I shoulda guessed that is the first place to look.
I'll get and install the DeadlockDebugger forthwith.
Thanks.
On Nov 27, 2005, at 8:13 PM, Dennis Allison wrote:
here to begin to track this one down. The failure is spontaneous and apparently not triggered by any readily distinguishable inputs or pattern of inputs. The behavior smells a bit of resource limits or process synchronization problems, but there is no real evidence for either being the root cause. I am not sure what monitoring I should be doing to help locate the source of the problem.
Has anyone seen a similar problem? Any advice as to how to proceed?
A pool of database connections is kept around to pull from for each ZODB database you have. By default, there are 7 of them per database; this number can be changed by the pool-size parameter in the <zodb_db> section in your config file. These connections are returned to the pool when a connection is closed. This usually happens when a transaction that is using that connection completes.

The symptom you're describing has happened to me in the past. Zope stops serving requests and my CPU is not pegged (it's not hitting the CPU hard via an infinite loop I've made). I usually end up tracking it down to the fact that my instance is somehow leaking database connections. The connection is never closed, so it never gets returned to the pool. This can happen when there is an error during a transaction commit or abort.

Since you've previously posted to the list with commit errors of just this kind, it seems like that would be the logical place to start. You didn't mention seeing any of these errors, though, so if you don't see any of those in the event log, I'm not sure where to go from there other than debugging by binary exclusion, which can be impossible when the failures happen so far apart. :-(

- C
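(For reference, a sketch of where that parameter sits in zope.conf. The ZEO server address and storage name here are made up for illustration; 7 is just the default.)

    <zodb_db main>
        mount-point /
        # pool-size = number of ZODB connections kept per database (default 7);
        # a request that cannot get one waits until a connection is returned
        pool-size 7
        <zeoclient>
            server localhost:8100
            storage 1
        </zeoclient>
    </zodb_db>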
Chris,

A helpful suggestion. The commit errors I've been seeing have to do with the interaction of the ZODB, MySQL, session variables, and conflicts. These particular problems do not appear to be related to the ZODB/session variable/conflict issues, but I cannot completely exclude them since leaks easily cause failures far away from where the fault lies. It certainly is a worthwhile thing to check and monitor.

-d

On Sun, 27 Nov 2005, Chris McDonough wrote:
On Nov 27, 2005, at 8:13 PM, Dennis Allison wrote:
here to begin to track this one down. The failure is spontaneous and apparently not triggered by any readily distinguishable inputs or pattern of inputs. The behavior smells a bit of resource limits or process synchronization problems, but there is no real evidence for either being the root cause. I am not sure what monitoring I should be doing to help locate the source of the problem.
Has anyone seen a similar problem? Any advice as to how to proceed?
A pool of database connections is kept around to pull from for each ZODB database you have. By default, there are 7 of them per database; this number can be changed by the pool-size parameter in the <zodb_db> section in your config file. These connections are returned to the pool when a connection is closed. This usually happens when a transaction that is using that connection completes.
The symptom you're describing has happened to me in the past. Zope stops serving requests and my CPU is not pegged (it's not hitting the CPU hard via an infinite loop I've made). I usually end up tracking it down to the fact that my instance is somehow leaking database connections. The connection is never closed, so it never gets returned to the pool. This can happen when there is an error during a transaction commit or abort.
Since you've previously posted to the list with commit errors of just this kind, it seems like that would be the logical place to start. You didn't mention seeing any of these errors, though, so if you don't see any of those in the event log, I'm not sure where to go from there other than debugging by binary exclusion, which can be impossible when the failures happen so far apart. :-(
- C
Chris,
A helpful suggestion. The commit errors I've been seeing have to do with the interaction of the ZODB, MySQL, session variables, and conflicts.
So the patch that Andy sent over is a fix that prevents the mysql adapter from raising an error when a conflict exception occurs? Do you know if that error only happens after a conflict exception occurs or can it happen without the presence of conflict exceptions? (I'm curious because I also use the adapter and I'd like to know what the patch fixes).
These particular problems do not appear to be related to the ZODB/session variable/conflict issues, but I cannot completely exclude them since leaks easily cause failures far away from where the fault lies.
I'm afraid I can't parse that sentence fully. But I'll try to interpret as best possible. ;-)

I think I've said this before, but in case not... the use of sessions is only one place where conflict errors can be generated. Conflict errors are "normal" in any system that causes writes to a ZODB database. If your application does any writes to a ZODB database at all (besides the writes that occur from use of the sessioning machinery), and the mysql adapter wasn't tolerant of conflict errors, you'd be getting the same result; they'd just probably happen further apart.

That said, I certainly am interested in making fewer conflict errors go unresolved when the sessioning machinery is used. If you were to tell me that you've taken the steps I've already suggested about reducing the potential for conflicts during session usage (use 2.8 with mvcc, turn external housekeeping on, bump up the resolution time, local zodb db for sessions), and you observed that you're still having "too many" conflicts, I'd try to take some action, although to do so I might still need to request your help in providing data about your conflict rates.
It certainly is a worthwhile thing to check and monitor.
What is? Thanks, - C
We had been plagued with a threading error which manifests itself as an exception reporting "release unlocked lock". At Andy's suggestion I added code to catch the exception in the appropriate place in ZMySQLDA. Here's the code from db.py. The code probably should be specific to the unreleased-lock exception since, presumably, the code could throw other exceptions. On the whole, I think this is a band-aid solution; a narrower version is sketched after the quoted exchange below.

    def _abort(self, *ignored):
        try:
            if self._mysql_lock:
                self.db.query("SELECT RELEASE_LOCK('%s')" % self._mysql_lock)
                self.db.store_result()
            if self._transactions:
                self.db.query("ROLLBACK")
                self.db.store_result()
            else:
                LOG('ZMySQLDA', ERROR, "aborting when non-transactional")
        finally:
            # just ignore the exception for "release unlocked lock"
            try:
                self._tlock.release()
            except:
                pass

On Sun, 27 Nov 2005, Chris McDonough wrote:
Chris,
A helpful suggestion. The commit errors I've been seeing have to do with the interaction of the ZODB, MySQL, session variables, and conflicts.
So the patch that Andy sent over is a fix that prevents the mysql adapter from raising an error when a conflict exception occurs? Do you know if that error only happens after a conflict exception occurs or can it happen without the presence of conflict exceptions? (I'm curious because I also use the adapter and I'd like to know what the patch fixes).
These particular problems do not appear to be related to the ZODB/session variable/conflict issues, but I cannot completely exclude them since leaks easily cause failures far away from where the fault lies.
I'm afraid I can't parse that sentence fully. But I'll try to interpret as best possible. ;-)
I think I've said this before, but in case not... the use of sessions is only one place where conflict errors can be generated. Conflict errors are "normal" in any system that causes writes to a ZODB database. If your application does any writes to a ZODB database at all (besides the writes that occur from use of the sessioning machinery), and the mysql adapter wasn't tolerant of conflict errors, you'd be getting the same result; they'd just probably happen further apart.
That said, I certainly am interested in making fewer conflict errors go unresolved when the sessioning machinery is used. If you were to tell me that you've taken the steps I've already suggested about reducing the potential for conflicts during session usage (use 2.8 with mvcc, turn external housekeeping on, bump up the resolution time, local zodb db for sessions), and you observed that you're still having "too many" conflicts, I'd try to take some action, although to do so I might still need to request your help in providing data about your conflict rates.
It certainly is a worthwhile thing to check and monitor.
What is?
Thanks,
- C
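(A possible narrowing of that bare except, as mentioned above: an untested sketch, assuming the stray exception really is the thread.error raised when releasing an unheld lock; it would need "import thread" at the top of db.py.)

        finally:
            # swallow only the specific "release unlocked lock" error;
            # let anything else propagate
            try:
                self._tlock.release()
            except thread.error, e:
                if str(e) != 'release unlocked lock':
                    raise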
Chris McDonough wrote:
The symptom you're describing has happened to me in the past. Zope stops serving requests and my CPU is not pegged (it's not hitting the CPU hard via an infinite loop I've made). I usually end up tracking it down to the fact that my instance is somehow leaking database connections.
I thought Jim solved this for 2.8 by not having a hard limit on the connection pool?

(iirc, you now get errors in the log when you go over the pool limit)

This was 'cos you used to get the crappy situation when running unit tests where, if lots of tests fail, things hang and you get no feedback apart from 7 E's because the connection pool is exhausted.

That said, I don't know for sure that Jim _did_ end up implementing this, I might just be imagining it. Jim?

cheers,

Chris
--
Simplistix - Content Management, Zope & Python Consulting
            - http://www.simplistix.co.uk
[Chris McDonough]
The symptom you're describing has happened to me in the past. Zope stops serving requests and my CPU is not pegged (it's not hitting the CPU hard via an infinite loop I've made). I usually end up tracking it down to the fact that my instance is somehow leaking database connections.
[Chris Withers]
I thought Jim solved this for 2.8 by not having a hard limit on the connection pool?
This was implemented for ZODB 3.4. See: http://www.zope.org/Wikis/ZODB/SimplifyConnectionManagement and ZODB NEWS.
(iirc, you now get errors in the log when you go over the pool limit)
Messages at warning or critical level, depending on how much you've gone over.
This was 'cos you used to get the crappy situation when running unit tests where, if lots of tests fail, things hang and you get no feedback apart from 7 E's because the connection pool is exhausted.
Not really (see the Wiki page referenced above).
That said, I don't know for sure that Jim _did_ end up implementing this, I might just be imagining it.
It was released in ZODB 3.4a1:

"""
What's new in ZODB3 3.4a1?
==========================

Release date: 01-Apr-2005

...

DB
--

- There is no longer a hard limit on the number of connections that
  ``DB.open()`` will create.  In other words, ``DB.open()`` never blocks
  anymore waiting for an earlier connection to close, and ``DB.open()``
  always returns a connection now (while it wasn't documented, it was
  possible for ``DB.open()`` to return ``None`` before).

  ``pool_size`` continues to default to 7, but its meaning has changed:
  if more than ``pool_size`` connections are obtained from ``DB.open()``
  and not closed, a warning is logged; if more than twice ``pool_size``,
  a critical problem is logged.  ``pool_size`` should be set to the
  maximum number of connections from the ``DB`` instance you expect to
  have open simultaneously.

  In addition, if a connection obtained from ``DB.open()`` becomes
  unreachable without having been explicitly closed, when Python's
  garbage collection reclaims that connection it no longer counts
  against the ``pool_size`` thresholds for logging messages.

  The following optional arguments to ``DB.open()`` are deprecated:
  ``transaction``, ``waitflag``, ``force`` and ``temporary``.  If one
  is specified, its value is ignored, and ``DeprecationWarning`` is
  raised.  In ZODB 3.6, these optional arguments will be removed.
"""

And for ZODB 3.6, they were removed.
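(To make the behavior change concrete, a small sketch, assuming ZODB 3.4+ and using an in-memory MappingStorage purely for illustration.)

    # Sketch of the ZODB 3.4+ pool behavior described above: opening more
    # connections than pool_size no longer blocks; it only logs a warning
    # (and a critical message past 2 * pool_size).
    from ZODB.DB import DB
    from ZODB.MappingStorage import MappingStorage

    db = DB(MappingStorage(), pool_size=7)
    conns = [db.open() for i in range(10)]   # connections 8-10 trigger log warnings
    for conn in conns:
        conn.close()                         # closing returns them to the pool
    db.close()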
On Sun, Nov 27, 2005 at 05:13:21PM -0800, Dennis Allison wrote:
Examination of the raw trace log shows that Zope is continuing to accept requests, but nothing is getting done. The raw log date-stamps four internal states for each transaction. The states are Begin (B), Input (I), Action (A), and End (E). Input is gathered between B and I, output is made between A and E. The raw log shows B and I transactions, but apparently no processing is completing. I suspect that nothing is getting scheduled.
I've seen the same symptoms a number of times recently with zope 2.7.x. In our case, it seems to be related to ZEO. Zope seems to have lost its connection to ZEO but doesn't realize it somehow. My theory is that the symptom starts when all worker threads are waiting for objects that aren't in the ZEO client cache, so they're all waiting on ZEO requests. Meanwhile, requests keep piling up in the queue. These are all the B and I lines you noticed.

Eventually (10 minutes or so) one of these ZEO requests times out, and Zope then realizes it's lost its ZEO connection and successfully reconnects. This shows up in the Zope event log as a series of ZEO "Client disconnected" errors, and immediately thereafter we see a successful reconnect in the ZEO server's log.

I have no idea what is causing the loss of connection in the first place, or why it takes Zope so long to realize there's a problem. Nothing in the logs so far gives me any clue. I've bumped up our log levels to BLATHER and hopefully next time this happens I'll get some more to work with and see if there's anything I can do about it.

It may be network issues between our Zope and Zeo boxes, which are on different subnets with a firewall between them, over which I have no control.

-- Paul Winkler  http://www.slinkp.com
Thanks Paul. I checked and did not find any "Client disconnected" entries in the event log so this may not be the cause of what I am seeing. I will bump the log level to ALL (which I think includes BLATHER) when I restart things tonight. We use a remote ZEO but at the moment Zope and ZEO share the same machine. On Sun, 27 Nov 2005, Paul Winkler wrote:
On Sun, Nov 27, 2005 at 05:13:21PM -0800, Dennis Allison wrote:
Examination of the raw trace log shows that Zope is continuing to accept requests, but nothing is getting done. The raw log date-stamps four internal states for each transaction. The states are Begin (B), Input (I), Action (A), and End (E). Input is gathered between B and I, output is made between A and E. The raw log shows B and I transactions, but apparently no processing is completing. I suspect that nothing is getting scheduled.
I've seen the same symptoms a number of times recently with zope 2.7.x. In our case, it seems to be related to ZEO. Zope seems to have lost its connection to ZEO but doesn't realize it somehow. My theory is that the symptom starts when all worker threads are waiting for objects that aren't in the ZEO client cache, so they're all waiting on ZEO requests. Meanwhile, requests keep piling up in the queue. These are all the B and I lines you noticed.
Eventually (10 minutes or so) one of these ZEO requests times out, and Zope then realizes it's lost its ZEO connection and successfully reconnects. This shows up in the Zope event log as a series of ZEO "Client disconnected" errors, and immediately thereafter we see a successful reconnect in the ZEO server's log.
I have no idea what is causing the loss of connection in the first place, or why it takes Zope so long to realize there's a problem. Nothing in the logs so far gives me any clue. I've bumped up our log levels to BLATHER and hopefully next time this happens I'll get some more to work with and see if there's anything I can do about it.
It may be network issues between our Zope and Zeo boxes, which are on different subnets with a firewall between them, over which I have no control.
On Sun, Nov 27, 2005 at 08:08:20PM -0800, Dennis Allison wrote:
Thanks Paul.
I checked and did not find any "Client disconnected" entries in the event log so this may not be the cause of what I am seeing.
OK. If you were having my problem you'd see them for sure when the problem stops (for me it always fixes itself "eventually").
I will bump the log level to ALL (which I think includes BLATHER)
Yep.
when I restart things tonight.
We use a remote ZEO but at the moment Zope and ZEO share the same machine.
You're reasonably unlikely to have my problem then :-) -- Paul Winkler http://www.slinkp.com
We do see occasional instances where things slow down, a backlog builds, and then it clears. But no disconnects. On Sun, 27 Nov 2005, Paul Winkler wrote:
On Sun, Nov 27, 2005 at 08:08:20PM -0800, Dennis Allison wrote:
Thanks Paul.
I checked and did not find any "Client disconnected" entries in the event log so this may not be the cause of what I am seeing.
OK. If you were having my problem you'd see them for sure when the problem stops (for me it always fixes itself "eventually").
I will bump the log level to ALL (which I think includes BLATHER)
Yep.
when I restart things tonight.
We use a remote ZEO but at the moment Zope and ZEO share the same machine.
You're reasonably unlikely to have my problem then :-)
Paul Winkler wrote at 2005-11-27 21:17 -0500:
... I've seen the same symptoms a number of times recently with zope 2.7.x. In our case, it seems to be related to ZEO. Zope seems to have lost its connection to ZEO but doesn't realize it somehow. My theory is that the symptom starts when all worker threads are waiting for objects that aren't in the ZEO client cache, so they're all waiting on ZEO requests.
Do you have a firewall between Zope and ZEO? Usually, the OS can inform both ends of a connection when the connection is torn down. However, some firewalls tear a connection down in a way that the endpoints do not get informed. We had to implement a keep alive mechanism to prevent our firewall from behaving in this nasty way. -- Dieter
On Mon, Nov 28, 2005 at 08:19:23PM +0100, Dieter Maurer wrote:
Paul Winkler wrote at 2005-11-27 21:17 -0500:
... I've seen the same symptoms a number of times recently with zope 2.7.x. In our case, it seems to be related to ZEO. Zope seems to have lost its connection to ZEO but doesn't realize it somehow. My theory is that the symptom starts when all worker threads are waiting for objects that aren't in the ZEO client cache, so they're all waiting on ZEO requests.
Do you have a firewall between Zope and ZEO?
Yes, we do, and it is under control of another part of the company :-(
Usually, the OS can inform both ends of a connection when the connection is torn down. However, some firewalls tear a connection down in a way that the endpoints do not get informed.
I suspected as much... thanks.
We had to implement a keep alive mechanism to prevent our firewall from behaving in this nasty way.
OK. Can you give a high-level summary of what you did? I thought of using heartbeat to detect loss of connection, but I'm not sure what I could do on failure short of restarting Zope. -- Paul Winkler http://www.slinkp.com
Paul Winkler wrote at 2005-11-28 15:37 -0500:
...
We had to implement a keep alive mechanism to prevent our firewall from behaving in this nasty way.
OK. Can you give a high-level summary of what you did? I thought of using heartbeat to detect loss of connection, but I'm not sure what I could do on failure short of restarting Zope.
We knew that our firewall shuts down connections with a timeout of 30 min. Thus, we send our ZEO a keep alive message every 20 min. The code roughly looks like this:

    # imports as they would look in a Zope 2 startup hook (zLOG and
    # App.config locations assumed)
    from os import environ
    from sys import exc_info
    from time import sleep
    from thread import start_new_thread
    from zLOG import LOG, INFO, ERROR, PROBLEM
    from App.config import getConfiguration

    KeepPeriod = int(environ.get('ZEO_KEEP_ALIVE')) * 60
    Storage = getConfiguration().dbtab.getDatabase('/')._storage

    def keepAlive():
        LOG("CustomZODB", INFO, "Keep alive thread started")
        while 1:
            sleep(KeepPeriod)
            if Storage._ready.isSet():
                LOG("CustomZODB", INFO, "Sending keep alive message")
                Storage._load_lock.acquire()
                try:
                    try:
                        Storage._server.get_info()
                        LOG("CustomZODB", INFO, "Sent keep alive message")
                    except:
                        LOG("CustomZODB", ERROR, ".... failed", error=exc_info())
                finally:
                    Storage._load_lock.release()
            else:
                LOG("CustomZODB", PROBLEM, "Connection is down")

    start_new_thread(keepAlive, ())

--
Dieter
Dieter Maurer wrote:
Paul Winkler wrote at 2005-11-28 15:37 -0500:
...
We had to implement a keep alive mechanism to prevent our firewall from behaving in this nasty way.
OK. Can you give a high-level summary of what you did? I thought of using heartbeat to detect loss of connection, but I'm not sure what I could do on failure short of restarting Zope.
We knew that our firewall shuts down connections with a timeout of 30 min. Thus, we send our ZEO a keep alive message every 20 min. The code roughly looks like this:
KeepPeriod= int(environ.get('ZEO_KEEP_ALIVE')) * 60
Storage = getConfiguration().dbtab.getDatabase('/')._storage
    def keepAlive():
        LOG("CustomZODB", INFO, "Keep alive thread started")
        while 1:
            sleep(KeepPeriod)
            if Storage._ready.isSet():
                LOG("CustomZODB", INFO, "Sending keep alive message")
                Storage._load_lock.acquire()
                try:
                    try:
                        Storage._server.get_info()
                        LOG("CustomZODB", INFO, "Sent keep alive message")
                    except:
                        LOG("CustomZODB", ERROR, ".... failed", error=exc_info())
                finally:
                    Storage._load_lock.release()
            else:
                LOG("CustomZODB", PROBLEM, "Connection is down")

    start_new_thread(keepAlive, ())
Why not use the max-disconnect-poll option of the zeoclient section in zope.conf?

Florent
--
Florent Guillaume, Nuxeo (Paris, France)   Director of R&D
+33 1 40 33 71 59   http://nuxeo.com   fg@nuxeo.com
Florent Guillaume wrote at 2005-11-30 01:51 +0100:
... sending keepalive messages to ZEO ...
Why not use the max-disconnect-poll option of the zeoclient section in zope.conf ?
Our solution is quite old. At that time, there was definitely no "max-disconnect-poll" yet.

In addition, "max-disconnect-poll" seems to target a completely different use case: to control the time between connection attempts. In our case, a connection was successfully established. However, the firewall may cut it if it is inactive for too long a period -- in a way not noticed by the endpoints. Thus, we must prevent the connection from being idle too long -- e.g. with a keep alive mechanism.

I think Tim wanted to implement such a keep alive mechanism inside "ClientStorage" (to reliably detect disconnects), but in ZODB 3.4 it seems not to be available yet.

--
Dieter
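(For readers looking for the option being discussed, a sketch of where it lives in zope.conf; the server address and values are illustrative, not recommendations.)

    <zeoclient>
        server zeoserver.example.com:8100
        storage 1
        # max-disconnect-poll: maximum seconds between reconnection attempts
        # after a *noticed* disconnect.  It does not keep an idle but
        # still-"connected" link alive, which is the firewall problem above.
        min-disconnect-poll 5
        max-disconnect-poll 300
    </zeoclient>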
On Wed, Nov 30, 2005 at 08:40:34PM +0100, Dieter Maurer wrote:
Florent Guillaume wrote at 2005-11-30 01:51 +0100:
... sending keepalive messages to ZEO ...
Why not use the max-disconnect-poll option of the zeoclient section in zope.conf ?
Our solution is quite old. At that time, there was definitely no "max-disconnect-poll" yet.
Aside from that, it's not even mentioned anywhere but ZODB/component.xml so I had no idea it existed until now.
In addition, "max-disconnect-poll" seems to target a completely different use case: to control the time between connection attempts.
I see. Thanks very much for your suggestion, Dieter - I'll look into that. It certainly sounds like we have the same symptom. -- Paul Winkler http://www.slinkp.com
[Dieter Maurer]
... I think Tim wanted to implement such a keep alive mechanism inside "ClientStorage" (to reliably detect disconnects), but in ZODB 3.4 it seems not to be available yet.
Right on all counts: I would like to add it, because it's currently possible for ZEO to run "forever" without noticing a connection is dead (when the OS/whatever doesn't inform it of socket death); this is especially damning for clients that normally don't try to commit changes, as they can continue serving stale cached content indefinitely. And it's not in ZODB 3.4. It's not in ZODB 3.5 or 3.6 either -- haven't had time to work on it.
Resolved. Thanks to all that responded.

The symptoms were due to an external method which runs a program (htmldoc) that converts HTML to PDF. The external method uses the "old" subprocess interface to fire off the process and then wait() for it to complete. To fix another problem, we had upgraded to the latest version, but that htmldoc version apparently is confused by some of our HTML input and never completes, so the external method just sits there waiting for the child process to terminate. Multiple invocations of the blocked htmldoc process eventually brought Zope to its knees and no further progress was made.

Confession is good for the soul. The problem took a long time to find because failures were infrequent (we only saw three or four instances) so there was limited data. Moreover, the external method that was failing had been in service for almost two years and had not been a problem. And other very similar external methods which also use htmldoc were known to work with the upgraded htmldoc. It was not until the external method was identified as the only commonality in the failures that I thought to look closely at it. And then I noticed the blocked htmldoc processes, etc. etc.

The fix: our solution was to revert htmldoc to the earlier version. When time permits we'll add a timeout to the external method and try to establish what in the current htmldoc release triggered the failure. We probably should move to the new subprocess module rather than use popen2 in the external method; a rough sketch of that follows the quoted message below.

On Sun, 27 Nov 2005, Dennis Allison wrote:
Python 2.4.2, Zope 2.8.4, ZODB/ZEO 2.3.4, MySQL 4.0, dual Athlon processors, Linux RH7.3
We have two recent instances in our production sites where Zope suddenly stops responding. It is not a new problem, but we've now been confronted with two clean examples and nothing to blame them on. The problem appears to be independent of load as both incidents were on lightly loaded machines.
A check of the logs (Linux and Zope) shows nothing obviously amiss, except that the trace log (the old -M log) shows a sudden increase in active requests from the typical 0 or 1 to 1300 or more. In this context an "active request" is the total number of requests pending at the end of this request and is computed by post-processing. We front-end Zope with pound and make heavy use of MySQL. Both show a plethora of incomplete transactions.
Examination of the raw trace log shows that Zope is continuing to accept requests, but nothing is getting done. The raw log date-stamps four internal states for each transaction. The states are Begin (B), Input (I), Action (A), and End (E). Input is gathered between B and I, output is made between A and E. The raw log shows B and I transactions, but apparently no processing is completing. I suspect that nothing is getting scheduled.
I am at a loss as to where to begin to track this one down. The failure is spontaneous and apparently not triggered by any readily distinguishable inputs or pattern of inputs. The behavior smells a bit of resource limits or process synchronization problems, but there is no real evidence for either being the root cause. I am not sure what monitoring I should be doing to help locate the source of the problem.
Has anyone seen a similar problem? Any advice as to how to proceed?
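(The timeout idea mentioned above could look roughly like this. An untested sketch: run_htmldoc, the timeout value, and the htmldoc flags are illustrative, not our actual external method.)

    # Rough sketch of the planned fix: run htmldoc via the subprocess module
    # and give up after a deadline instead of blocking a Zope worker thread
    # forever.  Python 2.4's subprocess has no timeout argument, so poll by hand.
    import os
    import signal
    import subprocess
    import time

    TIMEOUT = 60  # seconds; illustrative value

    def run_htmldoc(html_path, pdf_path):
        devnull = open(os.devnull, 'w')   # discard output so a full pipe can't block the child
        proc = subprocess.Popen(
            ['htmldoc', '--webpage', '-f', pdf_path, html_path],
            stdout=devnull, stderr=devnull)
        deadline = time.time() + TIMEOUT
        while proc.poll() is None:
            if time.time() > deadline:
                os.kill(proc.pid, signal.SIGKILL)   # don't let it wedge the thread
                raise RuntimeError('htmldoc timed out on %s' % html_path)
            time.sleep(0.5)
        return proc.returncode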
Participants (7):
- Chris McDonough
- Chris Withers
- Dennis Allison
- Dieter Maurer
- Florent Guillaume
- Paul Winkler
- Tim Peters