hello - We recently went production with a high availability zope based cluster of machines for our website. On the front end there is an apache logging server, rewriting requests to a pair of squid caches. The caches in turn rewrite requests to pair of ZEO App servers for authoritative content. The app servers are dual homed onto a private network where The ZEO Database lives. On the backend there is a management server ( Apache and a ZEO Application server) that is used for administration of the content. We also currently have a second 'cluster' consisting of a ZEO DB server and a management server for testing purposes. Since about a week ago we started experiencing excessive delays in accessing the Zope Management Interface on both of the clusters. Accessing the main website is still quick as ever. The delays are happening only in the ZMI and are mostly assosiated with the manage_menu. Nothing was changed on our production cluster save content. I've verified there are no network level problems from my test client to the management server and from the management server to the DB server. tcpdump shows server responses delayed : <snip> 14:59:24.240133 APPSERVER.8881 > CLIENT.1900: P 1:165(164) ack 580 win 6948 <nop,nop,timestamp 67015342 1250100942> (DF) (ttl 63, id 38593, len 216) 14:59:24.240184 CLIENT.1900 > APPSERVER.8881: . [tcp sum ok] ack 165 win 31956 <nop,nop,timestamp 1250100944 67015342> (DF) (ttl 64, id 35138, len 52) 14:59:24.259872 CLIENT.1897 > APPSERVER.8881: . [tcp sum ok] ack 5442 win 31856 <nop,nop,timestamp 1250100946 67015342> (DF) (ttl 64, id 35139, len 52) 15:01:23.971608 CLIENT.1897 > APPSERVER.8881: F [tcp sum ok] 1678:1678(0) ack 5442 win 31856 <nop,nop,timestamp 1250112917 67015342> (DF) (ttl 64, id 35573, len 52) 15:01:23.972539 APPSERVER.8881 > CLIENT.1897: F [tcp sum ok] 5442:5442(0) ack 1679 win 8775 <nop,nop,timestamp 67027314 1250112917> (DF) (ttl 63, id 33341, len 52) <snip> 15:01:24.151872 CLIENT.1900 > APPSERVER.8881: . [tcp sum ok] ack 166 win 32120 <nop,nop,timestamp 1250112935 67027332> (DF) (ttl 64, id 35580, len 52) 15:02:33.506719 APPSERVER.8881 > CLIENT.1898: . 1:1449(1448) ack 516 win 6432 <nop,nop,timestamp 67034266 1250100932> (DF) (ttl 63, id 38059, len 1500) 15:02:33.506800 CLIENT.1898 > APPSERVER.8881: . [tcp sum ok] ack 1449 win 31856 <nop,nop,timestamp 1250119870 67034266> (DF) (ttl 64, id 35712, len 52) 15:02:33.506842 APPSERVER.8881 > CLIENT.1898: . 1449:2897(1448) ack 516 win 6432 <nop,nop,timestamp 67034266 1250100932> (DF) (ttl 63, id 38060, len 1500) <snip>... I enabled detailed logging on the App server and hacked up a quick script to analyze the log for slow responses : /home/zope/var/log/zc# ./analyze.py Mdebug.log 5 request 145319924 took 189.0 seconds for GET /manage_menu request 144738612 took 10.0 seconds for GET /manage_menu request 144716924 took 191.0 seconds for GET /manage_menu request 145131396 took 190.0 seconds for GET /p_/rtab request 145706116 did not finish for GET /manage_page_style.css request 146220612 did not finish for GET /manage_menu request 144733364 took 192.0 seconds for GET /manage_menu request 161581820 took 193.0 seconds for GET /manage_menu request 144670844 took 190.0 seconds for GET /misc_/OFSP/dtmlmethod.gif ... while i see the delay i see the following in strace output on the management server : 7999 0.000059 read(20, "*\353\r\n\234w\273<c\0\0\0\0\4\0\0\0s\222\0\0\0\177\0\0"..., 4096) = 4096 7999 0.000083 fstat64(0x14, 0xbfffa1c0) = 0 7999 0.000036 read(20, "\0\0<select-trigger (loopback) at "..., 4096) = 2038 7999 0.000051 read(20, "", 4096) = 0 7999 0.000369 close(20) = 0 7999 0.000053 munmap(0x4035a000, 4096) = 0 7999 0.000201 close(19) = 0 7999 0.000058 pipe([19, 20]) = 0 7999 0.000118 shmat(19, 0x3, 0x3) = ? 7999 0.000238 shmat(19, 0x4, 0x4) = ? 7999 0.000325 select(20, [12 16 18 19], [], [], {30, 0}) = 1 (in [16], left {24, 170000}) 7999 5.838636 recv(16, "\0\0\0\3eN.", 8096, 0) = 7 7999 0.000255 write(17, "(K\0K\0t.", 7) = 7 7999 0.000080 _llseek(17, 0, [0], SEEK_SET) = 0 7999 0.000075 read(17, "(K\0K\0t.", 4096) = 7 7999 0.000063 close(17) = 0 7999 0.000062 munmap(0x40355000, 4096) = 0 7999 0.000190 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout) 7999 30.005935 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout) 7999 30.014979 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout) 7999 30.014998 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout) 7999 30.015015 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout) 7999 30.014975 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout) and this on the database server : 26044 0.000105 _llseek(8, 0, [0], SEEK_SET) = 0 26044 0.000054 read(8, "FS21\3D\32\301\0342\31\21\0\0\0\0\0\0\1\343p\0\0\0\30\0"..., 4096) = 4096 26044 0.000128 _llseek(8, 4096, [4096], SEEK_SET) = 0 26044 0.000176 gettimeofday({1032902895, 732272}, NULL) = 0 26044 0.000107 write(7, "------\n2002-09-24T21:28:15 TRACE"..., 97) = 97 26044 0.000139 gettimeofday({1032902895, 732518}, NULL) = 0 26044 0.000106 write(7, "------\n2002-09-24T21:28:15 TRACE"..., 80) = 80 26044 0.000109 gettimeofday({1032902895, 732733}, NULL) = 0 26044 0.000104 write(7, "------\n2002-09-24T21:28:15 TRACE"..., 71) = 71 26044 0.000205 select(13, [9 11 12], [12], [], {30, 0}) = 1 (out [12], left {30, 0}) 26044 0.000522 send(12, "\0\0\0\3eN.", 7, 0) = 7 26044 0.000153 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 29.999654 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 30.000272 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 29.999345 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 29.997122 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 30.006647 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 29.994428 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 30.007108 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 30.004073 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 29.999809 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) 26044 30.000932 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout) ... the ZEO_DB server lives in a private network without access to a dns server, but this should not be an issue as there are hosts entries for the machines it needs to talk to. plus it has been working untouched for a couple of weeks. anyhow, to experiment further, today i installed a new ZEO_DB server and ZEO_APP server on the same machine to rule out the networking issue. Oddly enough it does perform quite a bit better but still times out on the same sort of requests : /home/zopetest/log# ./analyze.py Mdebug.log 5 request 148664300 took 6.0 seconds for GET /manage_menu request 145424316 took 11.0 seconds for GET /manage_menu request 149665228 took 7.0 seconds for GET /manage_menu ... ideas anyone ? -- -phil@hostway.net