[Zope] odd slowness in '/manage' in ZMI

Tue, 24 Sep 2002 16:57:21 -0500

hello -

        We recently went production with a high availability zope based
cluster of machines for our website. On the front end there is an apache
logging server, rewriting requests to a pair of squid caches. The caches
in turn rewrite requests to  pair of ZEO App servers for authoritative
content. The app servers are dual homed onto a private network where The
ZEO Database lives. On the backend there is a management server ( Apache
and a ZEO Application server) that is used for administration of the content.
We also currently have a second 'cluster' consisting of a ZEO DB server and
a management server for testing purposes.

        Since about a week ago we started experiencing excessive delays
in accessing the Zope Management Interface on both of the clusters. 
Accessing the main website is still quick as ever. The delays are 
happening only in the ZMI and are mostly assosiated with the manage_menu. 
Nothing was changed on our production cluster save content. I've verified
there are no network level problems from my test client to the management
server and from the management server to the DB server. 

tcpdump shows server responses delayed :

<snip>

14:59:24.240133 APPSERVER.8881 > CLIENT.1900: P 1:165(164) ack 580 win 6948 <nop,nop,timestamp 67015342 1250100942> (DF) (ttl 63, id 38593, len 216)
14:59:24.240184 CLIENT.1900 > APPSERVER.8881: . [tcp sum ok] ack 165 win 31956 <nop,nop,timestamp 1250100944 67015342> (DF) (ttl 64, id 35138, len 52)
14:59:24.259872 CLIENT.1897 > APPSERVER.8881: . [tcp sum ok] ack 5442 win 31856 <nop,nop,timestamp 1250100946 67015342> (DF) (ttl 64, id 35139, len 52)
15:01:23.971608 CLIENT.1897 > APPSERVER.8881: F [tcp sum ok] 1678:1678(0) ack 5442 win 31856 <nop,nop,timestamp 1250112917 67015342> (DF) (ttl 64, id 35573, len 52)
15:01:23.972539 APPSERVER.8881 > CLIENT.1897: F [tcp sum ok] 5442:5442(0) ack 1679 win 8775 <nop,nop,timestamp 67027314 1250112917> (DF) (ttl 63, id 33341, len 52)

<snip>

15:01:24.151872 CLIENT.1900 > APPSERVER.8881: . [tcp sum ok] ack 166 win 32120 <nop,nop,timestamp 1250112935 67027332> (DF) (ttl 64, id 35580, len 52)
15:02:33.506719 APPSERVER.8881 > CLIENT.1898: . 1:1449(1448) ack 516 win 6432 <nop,nop,timestamp 67034266 1250100932> (DF) (ttl 63, id 38059, len 1500)
15:02:33.506800 CLIENT.1898 > APPSERVER.8881: . [tcp sum ok] ack 1449 win 31856 <nop,nop,timestamp 1250119870 67034266> (DF) (ttl 64, id 35712, len 52)
15:02:33.506842 APPSERVER.8881 > CLIENT.1898: . 1449:2897(1448) ack 516 win 6432 <nop,nop,timestamp 67034266 1250100932> (DF) (ttl 63, id 38060, len 1500)

<snip>...

I enabled detailed logging on the App server and hacked up a quick script
to analyze the log for slow responses :

/home/zope/var/log/zc# ./analyze.py Mdebug.log  5
request 145319924 took 189.0 seconds for GET /manage_menu
request 144738612 took 10.0 seconds for GET /manage_menu
request 144716924 took 191.0 seconds for GET /manage_menu
request 145131396 took 190.0 seconds for GET /p_/rtab
request 145706116 did not finish for  GET /manage_page_style.css
request 146220612 did not finish for  GET /manage_menu
request 144733364 took 192.0 seconds for GET /manage_menu
request 161581820 took 193.0 seconds for GET /manage_menu
request 144670844 took 190.0 seconds for GET /misc_/OFSP/dtmlmethod.gif
...

while i see the delay i see the following in strace output on the management
server :

7999       0.000059 read(20, "*\353\r\n\234w\273<c\0\0\0\0\4\0\0\0s\222\0\0\0\177\0\0"..., 4096) = 4096
7999       0.000083 fstat64(0x14, 0xbfffa1c0) = 0
7999       0.000036 read(20, "\0\0<select-trigger (loopback) at "..., 4096) = 2038
7999       0.000051 read(20, "", 4096)  = 0
7999       0.000369 close(20)           = 0
7999       0.000053 munmap(0x4035a000, 4096) = 0
7999       0.000201 close(19)           = 0
7999       0.000058 pipe([19, 20])      = 0
7999       0.000118 shmat(19, 0x3, 0x3) = ?
7999       0.000238 shmat(19, 0x4, 0x4) = ?
7999       0.000325 select(20, [12 16 18 19], [], [], {30, 0}) = 1 (in [16], left {24, 170000})
7999       5.838636 recv(16, "\0\0\0\3eN.", 8096, 0) = 7
7999       0.000255 write(17, "(K\0K\0t.", 7) = 7
7999       0.000080 _llseek(17, 0, [0], SEEK_SET) = 0
7999       0.000075 read(17, "(K\0K\0t.", 4096) = 7
7999       0.000063 close(17)           = 0
7999       0.000062 munmap(0x40355000, 4096) = 0
7999       0.000190 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout)
7999      30.005935 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout)
7999      30.014979 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout)
7999      30.014998 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout)
7999      30.015015 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout)
7999      30.014975 select(20, [12 16 18 19], [], [], {30, 0}) = 0 (Timeout)

and this on the database server :

26044      0.000105 _llseek(8, 0, [0], SEEK_SET) = 0
26044      0.000054 read(8, "FS21\3D\32\301\0342\31\21\0\0\0\0\0\0\1\343p\0\0\0\30\0"..., 4096) = 4096
26044      0.000128 _llseek(8, 4096, [4096], SEEK_SET) = 0
26044      0.000176 gettimeofday({1032902895, 732272}, NULL) = 0
26044      0.000107 write(7, "------\n2002-09-24T21:28:15 TRACE"..., 97) = 97
26044      0.000139 gettimeofday({1032902895, 732518}, NULL) = 0
26044      0.000106 write(7, "------\n2002-09-24T21:28:15 TRACE"..., 80) = 80
26044      0.000109 gettimeofday({1032902895, 732733}, NULL) = 0
26044      0.000104 write(7, "------\n2002-09-24T21:28:15 TRACE"..., 71) = 71
26044      0.000205 select(13, [9 11 12], [12], [], {30, 0}) = 1 (out [12], left {30, 0})
26044      0.000522 send(12, "\0\0\0\3eN.", 7, 0) = 7
26044      0.000153 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     29.999654 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     30.000272 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     29.999345 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     29.997122 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     30.006647 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     29.994428 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     30.007108 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     30.004073 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     29.999809 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
26044     30.000932 select(13, [9 11 12], [], [], {30, 0}) = 0 (Timeout)
...

the ZEO_DB server lives in a private network without access to
a dns server, but this should not be an issue as there are hosts
entries for the machines it needs to talk to. plus it has been
working untouched for a couple of weeks. anyhow, to experiment
further, today i installed a new ZEO_DB server and ZEO_APP server
on the same machine to rule out the networking issue. Oddly enough
it does perform quite a bit better but still times out on the
same sort of requests :

/home/zopetest/log# ./analyze.py Mdebug.log  5
request 148664300 took 6.0 seconds for GET /manage_menu
request 145424316 took 11.0 seconds for GET /manage_menu
request 149665228 took 7.0 seconds for GET /manage_menu
...

ideas anyone ?

-- 

-phil@hostway.net