[Zope] Running more than one instance on windows often block each other

Tim Peters tim.peters at gmail.com
Thu Jul 28 01:38:54 EDT 2005


It's starting to look a lot like the Windows bind() implementation is
unreliable, sometimes (but rarely -- hard to provoke) allowing two
sockets to bind to the same (address, port) pair simultaneously,
instead of raising 'Address already in use' for one of them.  Disaster
ensues.

WRT the last version of the code I posted, on another XP Pro SP2
machine (again after playing registry games to boost the number of
ephemeral ports) I eventually saw all of:  hangs during accept(); the
assertion errors I mentioned last time; and mystery "Connection
refused" errors during connect().

The variant of the code below _only_ tries to use port 19999.  If it
can't bind to that on the first try, socktest111() raises an exception
instead of trying again (or trying a different port number).  Ran two
processes.  After about 15 minutes, both died with assert errors at
about the same time (identical, so far as I could tell by eyeball):

Process A:

Traceback (most recent call last):
  File "socktest.py", line 209, in ?
    assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
AssertionError: ('292739', '821744', ('127.0.0.1', 19999), ('127.0.0.1', 3845))

Process B:

Traceback (most recent call last):
  File "socktest.py", line 209, in ?
    assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
AssertionError: ('821744', '292739', ('127.0.0.1', 19999), ('127.0.0.1', 3846))

So it's again the business where each process is recv'ing the random
string intended to be recv'ed by a socket in the other process. 
Hypothesized timeline:

process A's `a` binds to 19999
process B's `a` binds to 19999 -- according to me, this should be impossible
    in the absence of SO_REUSEADDR (which acts very differently on
    Windows than it does on Linux, BTW -- on Linux this should be impossible
    even in the presence of SO_REUSEADDR; regardless, we're not using
    SO_REUSEADDR here, and the braindead hard-coded

        w.setsockopt(socket.IPPROTO_TCP, 1, 1)

    is actually using the right magic constant for TCP_NODELAY on
    Windows, as it intends).
A and B both listen()
A connect()s, and accidentally gets on B.a's accept queue
B connect()s, and accidentally gets on A.a's accept queue
the rest follows inexorably
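
For reference, a minimal sketch of the behavior the timeline's second step violates: with no SO_REUSEADDR set, a second bind() to an (address, port) pair that is already bound is expected to fail with 'Address already in use'. The helper name second_bind_error is hypothetical, and the code is written for current Python versions rather than the 2.3/2.4 interpreters used in the tests above. It runs in a single process, so it cannot reproduce the suspected race; it only shows the normal outcome.

```python
import errno
import socket

def second_bind_error(host='127.0.0.1'):
    """Bind one socket, then try to bind a second one to the same address.

    Returns the errno of the failure, or None if the second bind succeeded.
    (Hypothetical helper, for illustration only.)
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))          # port 0: let the OS pick a free port
        first.listen(1)
        try:
            # No SO_REUSEADDR on either socket, so this should be refused.
            second.bind(first.getsockname())
        except socket.error as e:
            return e.errno
        return None
    finally:
        first.close()
        second.close()
```

A return value of None would mean both sockets were bound to the same pair at once, which is the situation hypothesized above.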

Note that because this never tries a port number other than 19999, it
can't be a bulletproof workaround simply to hold on to the `a` socket.
If the hypothesized timeline above is right, bind() can't be trusted
on Windows in any situation where two processes may try to bind to the
same hostname:port pair at the same time.  Holding on to `a`, and
cycling through port numbers when bind() failed, would still
potentially leave two processes trying to bind to the same port number
simultaneously (just a port other than 19999).
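
A rough sketch of what "cycling through port numbers when bind() failed" could look like, for illustration only. The helper name bind_cycling, the downward search direction, and the default of 50 tries are all assumptions, not taken from any Zope or ZODB code, and the sketch targets current Python versions. As the paragraph above argues, this relies on bind() reporting the conflict, so it offers no protection if bind() itself wrongly succeeds.

```python
import errno
import socket

# WSAEADDRINUSE (10048) is listed explicitly in case errno.EADDRINUSE
# has a different value on the platform at hand.
ADDR_IN_USE = (errno.EADDRINUSE, 10048)

def bind_cycling(host, first_port, tries=50):
    """Hypothetical helper: bind to first_port, or to the next lower free port.

    Returns (socket, port). Raises RuntimeError if every port tried is taken.
    """
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for port in range(first_port, first_port - tries, -1):
        try:
            a.bind((host, port))
        except socket.error as e:
            if e.errno not in ADDR_IN_USE:
                a.close()
                raise              # some other failure: don't hide it
        else:
            return a, port         # bind() claimed to succeed
    a.close()
    raise RuntimeError('no free port in range')
```

Two processes running this at the same time walk the same sequence of ports, so they can still reach bind() on the same port number simultaneously.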

Ick:  this happens under Pythons 2.3.5 (MSVC 6) and 2.4.1 (MSVC 7.1),
so if it is -- as is looking more and more likely -- an error in MS's
socket implementation, it isn't avoided by switching to a newer MS C
library.

Frankly, I don't see a sane way to worm around this -- it's difficult
for application code to worm around what smells like a missing
critical section in system code.

Using the simpler socket dance from the ZODB 3.4 code, I haven't yet
seen an instance of the assert failure, or a hang.  However, let two
processes run that long enough simultaneously, and it always (so far)
eventually fails with

    socket.error: (10048, 'Address already in use')

in the w.connect() call, even though Windows itself picks the port numbers here!

While that also smells to heaven of a missing critical section in the
Windows socket implementation, an exception is much easier to live
with / worm around.  Alas, we don't have the MS source code, and I
don't have time to try disassembling / reverse-engineering the opcodes
(what EULA <wink>?), so best I can do is run this for many more hours
to try to increase confidence that an exception is the worst that can
occur under the ZODB 3.4 spelling.
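
One way to live with such an exception is to retry. The following is a sketch under stated assumptions, not the ZODB 3.4 code: it assumes the OS picks the port because the listening socket binds to port 0, and the helper name connected_pair and the retry count are hypothetical. It is written for current Python versions.

```python
import errno
import socket

def connected_pair(max_tries=10):
    """Return (r, w), a connected loopback socket pair; the OS picks the port.

    Retries the whole dance if any step reports 'Address already in use'.
    (Hypothetical helper, for illustration only.)
    """
    last_error = None
    for _ in range(max_tries):
        a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Named constant instead of a hard-coded 1 for TCP_NODELAY.
            w.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            a.bind(('127.0.0.1', 0))    # port 0: let the OS choose
            a.listen(1)
            w.connect(a.getsockname())  # connect to whatever port was assigned
            r, addr = a.accept()
            return r, w
        except socket.error as e:
            w.close()
            # 10048 is WSAEADDRINUSE, the code shown in the error above.
            if e.errno not in (errno.EADDRINUSE, 10048):
                raise
            last_error = e              # remember it and try again
        finally:
            a.close()                   # the listening socket is never kept
    raise last_error
```

Retrying only covers the failure that shows up as an exception; it does nothing for the silent cross-connection seen with the port 19999 variant.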

Here's full code for the "only try port 19999" version:

import socket, errno
import time, random
def socktest111():
    """Raise an exception if we can't get 19999.
    """

    a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
    w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

    # set TCP_NODELAY to true to avoid buffering
    w.setsockopt(socket.IPPROTO_TCP, 1, 1)

    # tricky: get a pair of connected sockets
    host = '127.0.0.1'
    port = 19999

    try:
        a.bind((host, port))
    except:
        raise RuntimeError
    else:
        print 'b',

    a.listen (1)
    w.setblocking (0)
    try:
        w.connect ((host, port))
    except:
        pass
    print 'c',
    r, addr = a.accept()
    print 'a',
    a.close()
    print 'c',
    w.setblocking (1)

    return (r, w)

sofar = []
try:
   while 1:
       try:
           stuff = socktest111()
       except RuntimeError:
           print 'x',
           time.sleep(random.random()/10)
           continue
       sofar.append(stuff)
       time.sleep(random.random()/10)
       if len(sofar) == 50:
           tup = sofar.pop(0)
           r, w = tup
           msg = str(random.randrange(1000000))
           w.send(msg)
           msg2 = r.recv(100)
           assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
           for s in tup:
               s.close()
except KeyboardInterrupt:
   for tup in sofar:
       for s in tup:
           s.close()

