Re: [Zope] Running more than one instance on windows often block each other
[Sune Brøndum Wøller]
Thanks for the pointer. I have been debugging select_trigger.py, and have some more info:
The problem is that the call a.accept() sometimes hangs. Apparently a.bind(self.address) allows us to bind to a port that another Zope instance is already bound to.
The code creates the server socket a, and the client socket w, and gets the client socket r by connecting w to a. Then it closes a. a goes out of scope when __init__ terminates, and is probably garbage collected at some point.
Unless you're using a very old Python, `a` is collected before the call returns (where "the call" means the call of the function in which `a` is a local variable). Very old Pythons had an idiotic __del__ method attached to their Windows socket wrapper, which inhibited timely gc.
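The claim about timely collection can be checked with a small sketch. It assumes CPython, where reference counting frees a local object as soon as the function that owns it returns; `Resource` and `make_and_drop` are hypothetical names used only for illustration, and the code targets a current Python 3 rather than the 2.3-era interpreter discussed in this thread.

```python
import weakref


class Resource:
    """Stand-in for the listening socket `a` (hypothetical helper)."""


def make_and_drop():
    a = Resource()          # local variable, like `a` in the trigger code
    return weakref.ref(a)   # a weak reference does not keep `a` alive


ref = make_and_drop()
# On CPython the object is freed when the function returns, so the weak
# reference is expected to be dead by the time we look at it.
print(ref() is None)
```

Other Python implementations without reference counting may collect the object later, so this illustrates CPython behaviour only.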
I tried moving the code to the following standalone script, and I can reproduce the error with that. In the original code w is kept as an instance variable, and r is passed to asyncore.dispatcher.__init__ and probably kept there.
Yes, the socket bound to `r` also gets bound to `self.socket` by this call: asyncore.dispatcher.__init__ (self, r)
I simulate that by returning them, then the caller of socktest can keep them around.
I try to call socktest from different processes A and B (two pythons): (w,r = socktest()) The call in A gets port 19999. The second call, in B, either blocks, or takes over port 19999 (I see the second process taking over the port in a port scanner.)
Sorry, I can't reproduce this -- but you didn't give a test program, just an isolated function, and I'm not sure what you did with it. I called that function in an infinite loop, appending the return value to a global list, with a short (< 0.1 second) sleep between iterations, and closed the returned sockets fifty iterations after they were created. Ran that loop in two processes. No hangs, or any other oddities, for some minutes. It did _eventually_ hang -- and both processes at the same time -- with netstat showing more than 4000 sockets hanging around in TIME_WAIT state then. I assume I bashed into some internal Windows socket resource limit there, which Windows didn't handle gracefully. Attaching to the processes under the MSVC 6 debugger, they were hung inside the MS socket libraries. Repeated this several times (everything appeared to work fine until > 4000 sockets were sitting in TIME_WAIT, and then both processes hung at approximately the same time).
Concretely:
    import time, random

    sofar = []
    try:
        while 1:
            print '.',
            stuff = socktest() # calling your function
            sofar.append(stuff)
            time.sleep(random.random()/10)
            if len(sofar) == 50:
                tup = sofar.pop(0)
                w, r = tup
                msg = str(random.randrange(1000000))
                w.send(msg)
                msg2 = r.recv(100)
                assert msg == msg2, (msg, msg2)
                for s in tup:
                    s.close()
    except KeyboardInterrupt:
        for tup in sofar:
            for s in tup:
                s.close()
Note that there's also a bit of code there to verify that the connected sockets can communicate correctly; the `assert` never triggered.
You haven't said which versions of Windows or Python you're using. I was using XP Pro SP2 and Python 2.3.5. Don't know whether either matters.
It was certainly the case when I ran it that your
    print port
statement needed to display ports less than 19999 at times, meaning that the
    a.bind((host, port))
did raise an exception at times. It never printed a port number less than 19997 for me. Did you ever see it print a port number less than 19999?
a.bind in B does not raise socket.error: (10048, 'Address already in use') as expected, when the server socket in A is closed, even though the port is used by the client socket r in A.
I'm not sure what that's saying, but could be it's an illusion. For example,
    >>> import socket
    >>> s = socket.socket()
    >>> s.bind(('localhost', 19999))
    >>> s.listen(2)
    >>> a1 = socket.socket()
    >>> a2 = socket.socket()
    >>> a1.connect(('localhost', 19999))
    >>> a2.connect(('localhost', 19999))
    >>> b1 = s.accept()
    >>> b2 = s.accept()
    >>> b1[0].getsockname()
    ('127.0.0.1', 19999)
    >>> b2[0].getsockname()
    ('127.0.0.1', 19999)
That is, it's normal for the `r` in
r, addr = a.accept()
to repeat port numbers across multiple `accept()` calls, and indeed to duplicate the port number from the `bind` call. This always confused me (from way back in my Unix days -- it's not "a Windows thing"), and maybe it's not what you're talking about anyway.
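The same observation can be written as a self-contained script. This is a sketch for a current Python 3, and it asks the OS for a free port (port 0) instead of hard-coding 19999, so it does not depend on that port being available; the variable names are illustrative only.

```python
import socket

# Listen on a port chosen by the OS so the sketch runs on any machine.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(2)
listen_port = server.getsockname()[1]

clients = [socket.create_connection(("127.0.0.1", listen_port)) for _ in range(2)]
accepted = [server.accept()[0] for _ in range(2)]

# Every accepted socket reports the listening port as its local port ...
local_ports = [conn.getsockname()[1] for conn in accepted]
# ... and the connections differ only in the peer (client) port.
peer_ports = [conn.getpeername()[1] for conn in accepted]

print(listen_port, local_ports, peer_ports)

for s in clients + accepted + [server]:
    s.close()
```

So a port scanner showing several sockets "on" the listening port is not, by itself, evidence that bind() let two sockets claim it.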
If I remove a.close(), and keep a around (by passing it to the caller), a.bind works as expected - it raises socket.error: (10048, 'Address already in use').
As above, I'm seeing `bind` raise exceptions regardless.
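For reference, the behaviour both posters expect from bind() can be sketched as follows, for a current Python 3. It assumes SO_REUSEADDR is not set (as in the trigger code) and again uses an OS-chosen port rather than 19999. It shows the normal case within one process only and does not attempt to reproduce the race discussed in this thread.

```python
import socket

first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))      # let the OS pick a free port
first.listen(1)
port = first.getsockname()[1]

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # No SO_REUSEADDR here, so binding to the port that `first`
    # is still listening on should be refused.
    second.bind(("127.0.0.1", port))
    bind_error = None
except OSError as exc:
    bind_error = exc
finally:
    second.close()
    first.close()

print(bind_error)
```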
But in the literature on sockets, I read it should be okay to close the server socket and keep using the client sockets.
So, is this a possible bug in bind() ?
Sure feels that way to me, and I'm not seeing it (or don't know how to provoke it). But I'm not a socket expert, and am not sure I've ever met anyone who truly was ;-)
I have tested the new code from Tim Peters; it apparently works, and ports are given out by Windows. But could the same problem with bind occur here, since a is closed (and garbage collected)? (There is far less chance of that since we do not specify port numbers, I know.)
I tried getting a pair of sockets with Tim's code, and then trying to bind a third socket to the same port as a/r. And I got the same problem as above.
Here I'm not sure what "the same problem" means, as you've described more than one problem. Do you mean that you get a hang? Or that you see suspiciously repeated port numbers? Or ...? Seeing concrete code might help.
Last question for now: have you seen a hang on more than one flavor of Windows? Thanks for digging into this!
[and Sune's code]
    import socket, errno

    class BindError(Exception):
        pass

    def socktest():
        """blabla
        """
        address = ('127.9.9.9', 19999)

        a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
        w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

        # set TCP_NODELAY to true to avoid buffering
        w.setsockopt(socket.IPPROTO_TCP, 1, 1)

        # tricky: get a pair of connected sockets
        host='127.0.0.1'
        port=19999

        while 1:
            print port
            try:
                a.bind((host, port))
                break
            except:
                if port <= 19950:
                    raise BindError, 'Cannot bind trigger!'
                port=port - 1

        a.listen (1)
        w.setblocking (0)
        try:
            w.connect ((host, port))
        except:
            pass
        r, addr = a.accept()
        a.close()
        w.setblocking (1)

        #return (a, w, r)
        return (w, r)
        #return w
[Sune B. Woeller]
I will try to recreate the problem on other flavours of Windows asap. I will get back to you later.
I guess my reporting was a bit too quick, sorry. I'm running:
    Python 2.3.5 (installed from the Windows binary)
    Zope 2.7.7 (not necessary for the test scripts)
    Windows XP Home SP2 (blush - my laptop came with that... ;) )
Sune
_______________________________________________ Zope maillist - Zope@zope.org http://mail.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope-dev )
[Sune B. Woeller]
I will try to recreate the problem on other flavours of windows asap. I will get back to you later.
Cool! If you can, posting a self-contained program that demonstrates the problem is the best way to make progress.
I guess my reporting was a bit too quick, sorry:
Not at all -- you did excellent detective work here! It's appreciated. The problem is that English descriptions are nearly always ambiguous, especially when trying to explain something complicated that other people haven't reported. Posting a program removes all that guesswork: it reproduces the problem for other people on other boxes, or it doesn't, and we learn something valuable either way; if it does fail for others, then they can help investigate _why_ it fails. At the start, thoroughly demonstrating a problem exists is more important than guessing at what might be needed to worm around it.
I'm running python 2.3.5, (installed from windows binary). Zope 2.7.7 (not necessary for the test scripts) Windows XP Home SP2 (blush - my laptop came with that... ;) )
Good -- thanks. A pretty vanilla system, then. I've heard that XP Home has "special" limitations on network capabilities, but don't know more than that; it's at least possible they're relevant. I'm not sure that running multiple Zope instances on a laptop is a prime use case for Zope <wink>.
[Sune B. Woeller]
I consider it very useful, and I can see nothing that should cause problems with that (well, apart from the problem mentioned in this thread). Of course, I do not use the laptop as a server, only for development purposes. I have around 10-15 instances for various projects, quite often with 2 or 3 instances running at the same time.
[Tim Peters] ...
.... Ran that loop in two processes. No hangs, or any other oddities, for some minutes. It did _eventually_ hang-- and both processes at the same time --with netstat showing more than 4000 sockets hanging around in TIME_WAIT state then. I assume I bashed into some internal Windows socket resource limit there, which Windows didn't handle gracefully. Attaching to the processes under the MSVC 6 debugger, they were hung inside the MS socket libraries. Repeated this several times (everything appeared to work fine until > 4000 sockets were sitting in TIME_WAIT, and then both processes hung at approximately the same time).
More info on that: since WinXP Pro supplies only about 4000 ephemeral ports by default, and the program kept hanging after about 4000 ephemeral ports were in use (albeit most in their 4-minute TIME_WAIT shutdown state), I tried boosting the # of ephemeral ports:
    http://support.microsoft.com/kb/q196271
After that, I never saw the processes hang again.
BUT, I saw something worse: after about 20 minutes, both processes died with assert errors, in the code I added to verify that the sockets were communicating correctly. The random string created in process A was actually read by a socket in process B (instead of by its pair in process A), and vice versa: the random string created in process B was read in process A, and at approximately the same time process B was reading process A's string. I tried it again, and got a pair of similar assert failures after about 15 minutes.
That's dreadful, and I don't see how it could be anything except a race bug in the Windows socket implementation.
The same program on Linux doesn't run long enough to say anything interesting -- it raises "BindError, 'Cannot bind trigger!'" very quickly every time, because it apparently keeps server port numbers (19999, 19998, ...) reserved for "a long time" after the server socket is closed (where "a long time" just means longer than the few seconds it takes for the program to die on Linux).
All of the above is wrt using socktest1() below. socktest2() below contains the Windows code I already changed ZODB 3.4 to use. I've been running socktest2() in two processes that way on Windows for more than 2 hours now, with no glitches. The same code is running fine on a Linux box too.
So best guess now is that there is a subtle, rare error in the Windows socket code that could cause the Medusa/ZODB3.2 Windows trigger code to screw up.
Complete code:
    import socket, errno
    import time, random

    class BindError(Exception):
        pass

    def socktest1():
        """blabla
        """
        address = ('127.9.9.9', 19999)

        a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
        w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

        # set TCP_NODELAY to true to avoid buffering
        w.setsockopt(socket.IPPROTO_TCP, 1, 1)

        # tricky: get a pair of connected sockets
        host='127.0.0.1'
        port=19999

        while 1:
            if port < 19999:
                print port
            try:
                a.bind((host, port))
                break
            except:
                if port <= 19950:
                    raise BindError, 'Cannot bind trigger!'
                port -= 1

        a.listen (1)
        w.setblocking (0)
        try:
            w.connect ((host, port))
        except:
            pass
        r, addr = a.accept()
        a.close()
        w.setblocking (1)

        #return (a, w, r)
        return (r, w)
        #return w

    def socktest2():
        a = socket.socket()
        w = socket.socket()

        # set TCP_NODELAY to true to avoid buffering
        w.setsockopt(socket.IPPROTO_TCP, 1, 1)

        # Specifying port 0 tells Windows to pick a port for us.
        a.bind(("127.0.0.1", 0))
        connect_address = a.getsockname() # assigned (host, port) pair
        a.listen(1)
        w.connect(connect_address)
        r, addr = a.accept() # r becomes asyncore's (self.)socket
        a.close()

        #return (a, w, r)
        return (r, w)
        #return w

    sofar = []
    try:
        while 1:
            print '.',
            stuff = socktest1()
            sofar.append(stuff)
            time.sleep(random.random()/10)
            if len(sofar) == 50:
                tup = sofar.pop(0)
                r, w = tup
                msg = str(random.randrange(1000000))
                w.send(msg)
                msg2 = r.recv(100)
                assert msg == msg2, (msg, msg2)
                for s in tup:
                    s.close()
    except KeyboardInterrupt:
        for tup in sofar:
            for s in tup:
                s.close()
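For readers on a current interpreter, the port-0 approach of socktest2() might look roughly like this in Python 3. It is a sketch, not the ZODB code itself: the name `connected_pair` is made up for illustration, the named constant `socket.TCP_NODELAY` replaces the hard-coded 1, and the payload is bytes because Python 3 sockets do not accept str.

```python
import socket


def connected_pair():
    """Return (r, w): two connected loopback TCP sockets (illustrative helper)."""
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Disable Nagle's algorithm on the writing side.
        w.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        # Port 0 asks the OS to choose a free port for the listener.
        a.bind(("127.0.0.1", 0))
        a.listen(1)
        w.connect(a.getsockname())
        r, _addr = a.accept()
    except OSError:
        w.close()
        raise
    finally:
        a.close()   # the listener is only needed to make the connection
    return r, w


r, w = connected_pair()
msg = b"292739"
w.sendall(msg)

# Read until the whole message has arrived (TCP may deliver it in pieces).
echoed = b""
while len(echoed) < len(msg):
    chunk = r.recv(100)
    if not chunk:
        break
    echoed += chunk

r.close()
w.close()
print(echoed == msg)
```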
It's starting to look a lot like the Windows bind() implementation is unreliable, sometimes (but rarely -- hard to provoke) allowing two sockets to bind to the same (address, port) pair simultaneously, instead of raising 'Address already in use' for one of them. Disaster ensues.
WRT the last version of the code I posted, on another XP Pro SP2 machine (again after playing registry games to boost the number of ephemeral ports) I eventually saw all of: hangs during accept(); the assertion errors I mentioned last time; and mystery "Connection refused" errors during connect().
The variant of the code below _only_ tries to use port 19999. If it can't bind to that on the first try, socktest111() raises an exception instead of trying again (or trying a different port number). Ran two processes. After about 15 minutes, both died with assert errors at about the same time (identical, so far as I could tell by eyeball):
Process A:
    Traceback (most recent call last):
      File "socktest.py", line 209, in ?
        assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
    AssertionError: ('292739', '821744', ('127.0.0.1', 19999), ('127.0.0.1', 3845))
Process B:
    Traceback (most recent call last):
      File "socktest.py", line 209, in ?
        assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
    AssertionError: ('821744', '292739', ('127.0.0.1', 19999), ('127.0.0.1', 3846))
So it's again the business where each process is recv'ing the random string intended to be recv'ed by a socket in the other process. Hypothesized timeline:
    process A's `a` binds to 19999
    process B's `a` binds to 19999 -- according to me, this should be impossible in the absence of SO_REUSEADDR (which acts very differently on Windows than it does on Linux, BTW -- on Linux this should be impossible even in the presence of SO_REUSEADDR; regardless, we're not using SO_REUSEADDR here, and the braindead hard-coded
        w.setsockopt(socket.IPPROTO_TCP, 1, 1)
    is actually using the right magic constant for TCP_NODELAY on Windows, as it intends).
    A and B both listen()
    A connect()s, and accidentally gets on B.a's accept queue
    B connect()s, and accidentally gets on A.a's accept queue
    the rest follows inexorably
Note that because this never tries a port number other than 19999, it can't be a bulletproof workaround simply to hold on to the `a` socket. If the hypothesized timeline above is right, bind() can't be trusted on Windows in any situation where two processes may try to bind to the same hostname:port pair at the same time. Holding on to `a`, and cycling through port numbers when bind() failed, would still potentially leave two processes trying to bind to the same port number simultaneously (just a port other than 19999).
Ick: this happens under Pythons 2.3.5 (MSVC 6) and 2.4.1 (MSVC 7.1), so if it is -- as is looking more and more likely -- an error in MS's socket implementation, it isn't avoided by switching to a newer MS C library.
Frankly, I don't see a sane way to worm around this -- it's difficult for application code to worm around what smells like a missing critical section in system code.
Using the simpler socket dance from the ZODB 3.4 code, I haven't yet seen an instance of the assert failure, or a hang. However, let two processes run that long enough simultaneously, and it always (so far) eventually fails with
    socket.error: (10048, 'Address already in use')
in the w.connect() call, and despite that Windows picks the port numbers here!
While that also smells to heaven of a missing critical section in the Windows socket implementation, an exception is much easier to live with / worm around. Alas, we don't have the MS source code, and I don't have time to try disassembling / reverse-engineering the opcodes (what EULA <wink>?), so best I can do is run this for many more hours to try to increase confidence that an exception is the worst that can occur under the ZODB 3.4 spelling.
Here's full code for the "only try port 19999" version:
    import socket, errno
    import time, random

    def socktest111():
        """Raise an exception if we can't get 19999.
        """
        a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
        w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

        # set TCP_NODELAY to true to avoid buffering
        w.setsockopt(socket.IPPROTO_TCP, 1, 1)

        # tricky: get a pair of connected sockets
        host = '127.0.0.1'
        port = 19999

        try:
            a.bind((host, port))
        except:
            raise RuntimeError
        else:
            print 'b',

        a.listen (1)
        w.setblocking (0)
        try:
            w.connect ((host, port))
        except:
            pass
        print 'c',
        r, addr = a.accept()
        print 'a',
        a.close()
        print 'c',
        w.setblocking (1)

        return (r, w)

    sofar = []
    try:
        while 1:
            try:
                stuff = socktest111()
            except RuntimeError:
                print 'x',
                time.sleep(random.random()/10)
                continue
            sofar.append(stuff)
            time.sleep(random.random()/10)
            if len(sofar) == 50:
                tup = sofar.pop(0)
                r, w = tup
                msg = str(random.randrange(1000000))
                w.send(msg)
                msg2 = r.recv(100)
                assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
                for s in tup:
                    s.close()
    except KeyboardInterrupt:
        for tup in sofar:
            for s in tup:
                s.close()
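One way to "live with" an exception from the port-0 dance is simply to start over with fresh sockets. The following Python 3 sketch illustrates that idea only. The helper names `try_pair` and `pair_with_retry` and the attempt count of 10 are assumptions made for this example, not something the thread or the ZODB code prescribes, and it does nothing about the cross-talk failure described above.

```python
import socket


def try_pair():
    """One attempt at the port-0 dance; raises OSError on failure."""
    a = socket.socket()
    w = socket.socket()
    try:
        a.bind(("127.0.0.1", 0))       # OS picks the port
        a.listen(1)
        w.connect(a.getsockname())     # the step that sometimes raised 10048
        r, _addr = a.accept()
        return r, w
    except OSError:
        w.close()
        raise
    finally:
        a.close()


def pair_with_retry(attempts=10):
    """Retry try_pair() a bounded number of times (attempt count is arbitrary)."""
    last_error = None
    for _ in range(attempts):
        try:
            return try_pair()
        except OSError as exc:
            last_error = exc           # discard the sockets and start again
    raise RuntimeError("could not create a connected socket pair") from last_error


r, w = pair_with_retry()
# The reader's local address should be the writer's peer address.
connected = r.getsockname() == w.getpeername()
r.close()
w.close()
print(connected)
```

Bounding the number of attempts keeps a persistent failure visible as an error instead of turning it into a silent spin.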
Tim Peters wrote:
It's starting to look a lot like the Windows bind() implementation is unreliable, sometimes (but rarely -- hard to provoke) allowing two sockets to bind to the same (address, port) pair simultaneously, instead of raising 'Address already in use' for one of them. Disaster ensues.
WRT the last version of the code I posted, on another XP Pro SP2 machine (again after playing registry games to boost the number of ephemeral ports) I eventually saw all of: hangs during accept(); the assertion errors I mentioned last time; and mystery "Connection refused" errors during connect().
The variant of the code below _only_ tries to use port 19999. If it can't bind to that on the first try, socktest111() raises an exception instead of trying again (or trying a different port number). Ran two processes. After about 15 minutes, both died with assert errors at about the same time (identical, so far as I could tell by eyeball):
Process A:

    Traceback (most recent call last):
      File "socktest.py", line 209, in ?
        assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
    AssertionError: ('292739', '821744', ('127.0.0.1', 19999), ('127.0.0.1', 3845))

Process B:

    Traceback (most recent call last):
      File "socktest.py", line 209, in ?
        assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
    AssertionError: ('821744', '292739', ('127.0.0.1', 19999), ('127.0.0.1', 3846))
So it's again the business where each process is recv'ing the random string intended to be recv'ed by a socket in the other process. Hypothesized timeline:
    process A's `a` binds to 19999
    process B's `a` binds to 19999 -- according to me, this should be impossible in the absence of SO_REUSEADDR (which acts very differently on Windows than it does on Linux, BTW -- on Linux this should be impossible even in the presence of SO_REUSEADDR; regardless, we're not using SO_REUSEADDR here, and the braindead hard-coded

        w.setsockopt(socket.IPPROTO_TCP, 1, 1)

    is actually using the right magic constant for TCP_NODELAY on Windows, as it intends).
    A and B both listen()
    A connect()s, and accidentally gets on B.a's accept queue
    B connect()s, and accidentally gets on A.a's accept queue
    the rest follows inexorably
This is what I'm experiencing as well. I can narrow it down a bit: I *always* experience one out of two erroneous behaviours, as described below.

I tried to make an even simpler test situation, without connecting sockets 'r' and 'w' to each other in the same process. I try to reproduce the problem in a 'standard' socket use case, where a client in one process connects to a server in another process.

The following two scripts act as a server and a client.

    #***********************
    # sock_server_reader.py
    #***********************
    import socket

    a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

    a.bind(("127.0.0.1", 19999))
    print a.getsockname() # assigned (host, port) pair

    a.listen(1)

    print "a accepting:"
    r, addr = a.accept() # r becomes asyncore's (self.)socket
    print "a accepted: "
    print ' ' + str(r.getsockname()) + ', peer=' + str(r.getpeername())

    a.close()

    msg = r.recv(100)
    print 'msg recieved:', msg

    #***********************
    # sock_client_writer.py
    #***********************
    import socket, random

    w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
    w.setsockopt(socket.IPPROTO_TCP, 1, 1)

    print 'w connecting:'
    w.connect(('127.0.0.1', 19999))
    print 'w connected:'
    print w.getsockname()
    print ' ' + str(w.getsockname()) + ', peer=' + str(w.getpeername())
    msg = str(random.randrange(1000000))
    print 'sending msg: ', msg
    w.send(msg)

There are two possible outcomes [a) and b)] of running two instances of this client/server pair (that is, 4 processes in total, started as follows). (Numbers 1 to 4 are steps executed in chronological order.)

1) python -i sock_server_reader.py

   The server prints:

       ('127.0.0.1', 19999)
       a accepting:

   and waits for a connection.

2) python -i sock_client_writer.py

   The client prints:

       w connecting:
       w connected:
       ('127.0.0.1', 3774)
        ('127.0.0.1', 3774), peer=('127.0.0.1', 19999)
       sending msg:  903848
       >>>

   and the server now accepts the connection and prints:

       a accepted:
        ('127.0.0.1', 19999), peer=('127.0.0.1', 3774)
       msg recieved: 903848
       >>>

This is as it should be. Then let's try to set up a second client/server pair on the same port (19999). The expected outcome is that the bind() call in sock_server_reader.py fails with socket.error: (10048, 'Address already in use').

3) python -i sock_server_reader.py

   The server prints:

       ('127.0.0.1', 19999)
       a accepting:

   The problem occurs already here: bind() is allowed to bind to a port that is in use, in this case by the client socket 'r'. [also on other windows ? Mikkel: yes. Diku:???]

4) python -i sock_client_writer.py

   Now one out of two things happens:

   a) The client prints:

       w connecting:
       Traceback (most recent call last):
         File "c:\pyscripts\sock_client_writer.py", line 7, in ?
           w.connect(('127.0.0.1', 19999))
         File "<string>", line 1, in connect
       socket.error: (10061, 'Connection refused')
       >>>

      The server waits on the call to accept(), still waiting for a connection. (This is the blocking behaviour I reported in my first mail, experienced when running two zope instances. The socket error was swallowed by the unconditional except clause.)

   b) The client connects to the server:

       w connecting:
       w connected:
       ('127.0.0.1', 3865)
        ('127.0.0.1', 3865), peer=('127.0.0.1', 19999)
       sending msg:  119105
       >>>

      and the server now accepts the connection and prints:

       a accepted:
        ('127.0.0.1', 19999), peer=('127.0.0.1', 3865)
       msg recieved: 119105
       >>>

      The second set of client/server processes is now connected on the same port as the first set of client/server processes. In a port scanner the port now belongs to the second server process [3)].

I always get one out of these two possibilities (a and b); I never see bind() raising socket.error: (10048, 'Address already in use').

It is important to realize that both these outcomes are errors.

I tried the same procedure on a Linux system, and there 3) always raises 'Address already in use'.

If case a) occurred, where w.connect raises socket.error: (10061, 'Connection refused'), then trying to run a third client/server pair makes the bind() call raise (10048, 'Address already in use'). The 'a' socket from the second pair of processes is not closed in this case, but is still trying to accept().

In my case bind() always raises (10048, 'Address already in use') when there is an open server socket like 'a' bound to the same port.

To summarize: closing a server socket bound to a given port allows another server socket to bind to the same port, even when there are open client sockets bound to the port.
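The manual steps 1) to 3) can also be folded into a single process for a quick check. What follows is a sketch under assumptions, not the scripts from this mail: it is written for Python 3, `rebind_allowed` is a hypothetical helper name, and it uses an OS-chosen port instead of 19999 so that it does not collide with anything else on the machine. It reports whether a second bind() succeeds after the first listening socket has been closed while its accepted connection is still open.

```python
import socket

def rebind_allowed(host="127.0.0.1"):
    """Return True if bind() to a port succeeds while a connection accepted
    on that port is still open but the listening socket has been closed."""
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    a.bind((host, 0))                 # step 1: first server, OS picks the port
    port = a.getsockname()[1]
    a.listen(1)

    w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    w.connect((host, port))           # step 2: first client connects
    r, _ = a.accept()
    a.close()                         # listening socket closed, r stays open

    b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        b.bind((host, port))          # step 3: second server, same port
    except OSError:
        return False                  # second bind refused
    else:
        return True                   # second bind accepted
    finally:
        for s in (b, r, w):
            s.close()
```

Going by the observations reported in this mail, one would expect `False` on the Linux system and `True` on the Windows machines, but the sketch itself makes no claim about either platform.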
Note that because this never tries a port number other than 19999, it can't be a bulletproof workaround simply to hold on to the `a` socket. If the hypothesized timeline above is right, bind() can't be trusted on Windows in any situation where two processes may try to bind to the same hostname:port pair at the same time. Holding on to `a`, and cycling through port numbers when bind() failed, would still potentially leave two processes trying to bind to the same port number simultaneously (just a port other than 19999).
It would not be enough to keep a reference to 'a'. It would have to be kept open as well. And maybe that is not a problem, since we only accept() once -- only one 'w' client socket could ever be accepted. Normally the reason for closing the server socket is to disallow more connections than those already accepted. (But I'm not so experienced with sockets; I might be wrong.)
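A sketch of the "keep `a` open and cycle through port numbers" idea discussed here, written for Python 3. The class name `SocketPair`, the port range, the peer check and the `close()` method are all invented for illustration. As Tim points out, this still relies on bind() refusing a port that another process holds, so it is not a bulletproof fix if bind() itself misbehaves.

```python
import socket

class SocketPair:
    """Connected (r, w) pair that keeps its listening socket open."""

    def __init__(self, host="127.0.0.1", first_port=19999, tries=20):
        self.a = self._bind_first_free(host, first_port, tries)
        self.a.listen(1)
        self.w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.w.connect(self.a.getsockname())
        self.r, _ = self.a.accept()
        # Deliberately no self.a.close() here: the port stays owned by an
        # open listening socket for as long as the pair is alive.
        if self.r.getpeername() != self.w.getsockname():
            # We accepted somebody else's connection (the mixed-up case).
            self.close()
            raise RuntimeError("accepted a foreign connection")

    @staticmethod
    def _bind_first_free(host, first_port, tries):
        for port in range(first_port, first_port + tries):
            a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                a.bind((host, port))
                return a
            except OSError:
                a.close()             # port taken, try the next one
        raise RuntimeError("no free port in range")

    def close(self):
        for s in (self.w, self.r, self.a):
            s.close()
```

The peer check is a cheap way to notice the crossed-connection case from the hypothesized timeline: if `r` was accepted from some other process's `w`, the two addresses will not match.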
Ick: this happens under Pythons 2.3.5 (MSVC 6) and 2.4.1 (MSVC 7.1), so if it is -- as is looking more and more likely --an error in MS's socket implementation, it isn't avoided by switching to a newer MS C library.
Frankly, I don't see a sane way to worm around this -- it's difficult for application code to worm around what smells like a missing critical section in system code.
Using the simpler socket dance from the ZODB 3.4 code, I haven't yet seen an instance of the assert failure, or a hang. However, let two processes run that long enough simultaneously, and it always (so far) eventually fails with
socket.error: (10048, 'Address already in use')
in the w.connect() call, and despite that Windows picks the port numbers here!
That is exactly what I feared could happen. As shown in my example above, the other thing that might happen is that the port is 'taken over' by the other process.
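For readers who have not seen it, the "simpler socket dance" amounts to letting the OS choose the port. The following Python 3 sketch is one reading of that idea, not a copy of the ZODB 3.4 code; the function name `connected_pair` and its retry count are assumptions. It retries when bind() or connect() raises, since an exception in connect() is the failure mode Tim reports for this spelling.

```python
import socket

def connected_pair(retries=10):
    """Return (r, w), a connected loopback pair on an OS-chosen port."""
    last_error = None
    for _ in range(retries):
        a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            a.bind(("127.0.0.1", 0))   # port 0: the OS picks a free port
            a.listen(1)
            w.connect(a.getsockname())
            r, _ = a.accept()
            return r, w
        except OSError as exc:
            last_error = exc           # e.g. 10048 on Windows; try again
            w.close()
        finally:
            a.close()                  # the listener is never kept around
    raise RuntimeError("could not create socket pair: %s" % last_error)
```

Retrying is only reasonable here because, as argued above, an exception is a failure the application can see and react to; a silent hang or a crossed connection is not.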
I have made two similar test programs in C++, and the problem also occurs there, in exactly the same pattern as with my Python client/server scripts in the mail I am replying to.

But then I stumbled upon this flag in the WinSock documentation: SO_EXCLUSIVEADDRUSE. See the description here: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/win...

It is very interesting reading, especially:

"An important caveat to using the SO_EXCLUSIVEADDRUSE option exists: If one or more connections originating from (or accepted on) a port bound with SO_EXCLUSIVEADDRUSE is active, all bind attempts to that port will fail."

This is just what we want (and I think that is standard behaviour on Linux).

I have tested it with my C++ programs, and when I set that option on the server socket before the bind(), it works: bind() in the second server process fails with WSAEADDRINUSE (bind() failed: 10048.)

There is a python bugfix for this, but only for python 2.4: http://sourceforge.net/tracker/index.php?func=detail&aid=982665&group_id=547...
(It is added to version 1.294 of socketmodule.c)

I run the two test programs from two cmd terminals, like I described for the python versions.

    // link with ws2_32.lib
    //sock_server.cpp
    #include <cstdlib>
    #include <stdio.h>
    #include <conio.h>
    #include "winsock2.h"

    void main() {

        // Initialize Winsock.
        WSADATA wsaData;
        int iResult = WSAStartup( MAKEWORD(2,2), &wsaData );
        if ( iResult != NO_ERROR )
            printf("Error at WSAStartup()\n");

        // Create a socket.
        SOCKET m_socket;
        m_socket = socket( AF_INET, SOCK_STREAM, IPPROTO_TCP );

        if ( m_socket == INVALID_SOCKET ) {
            printf( "Error at socket(): %ld\n", WSAGetLastError() );
            WSACleanup();
            return;
        }

        // try to use SO_EXCLUSIVEADDRUSE
        BOOL bOptVal = TRUE;
        int bOptLen = sizeof(BOOL);
        if (setsockopt(m_socket, SOL_SOCKET, SO_EXCLUSIVEADDRUSE, (char*)&bOptVal, bOptLen) != SOCKET_ERROR) {
            printf("Set SO_EXCLUSIVEADDRUSE: ON\n");
        }

        // Bind the socket.
        sockaddr_in service;

        service.sin_family = AF_INET;
        service.sin_addr.s_addr = inet_addr( "127.0.0.1" );
        service.sin_port = htons( 19990 );

        if ( bind( m_socket, (SOCKADDR*) &service, sizeof(service) ) == SOCKET_ERROR ) {
            printf( "bind() failed: %i.\n", WSAGetLastError() );
            closesocket(m_socket);
            return;
        }

        // Listen on the socket.
        if ( listen( m_socket, 1 ) == SOCKET_ERROR )
            printf( "Error listening on socket.\n");

        // Accept connections.
        SOCKET AcceptSocket;

        printf( "Waiting for a client to connect...\n" );
        while (1) {
            AcceptSocket = SOCKET_ERROR;
            while ( AcceptSocket == SOCKET_ERROR ) {
                AcceptSocket = accept( m_socket, NULL, NULL );
            }
            printf( "Client Connected.\n");
            //m_socket = AcceptSocket;
            break;
        }
        // Close the listening socket; the accepted connection stays open.
        closesocket(m_socket);

        // Send and receive data.
        int bytesRecv = SOCKET_ERROR;
        char recvbuf[32] = "";
        // Read at most 31 bytes so the buffer stays NUL-terminated for %s.
        bytesRecv = recv( AcceptSocket, recvbuf, sizeof(recvbuf) - 1, 0 );
        printf( "Bytes Recv: %ld\n", bytesRecv );
        printf("Received: %s\n", recvbuf);
        printf("press a key to terminate\n");
        getch();

        return;
    }

    //sock_client.cpp
    #include <stdio.h>
    #include <conio.h>
    #include "winsock2.h"

    void main() {

        // Initialize Winsock.
        WSADATA wsaData;
        int iResult = WSAStartup( MAKEWORD(2,2), &wsaData );
        if ( iResult != NO_ERROR )
            printf("Error at WSAStartup()\n");

        // Create a socket.
        SOCKET m_socket;
        m_socket = socket( AF_INET, SOCK_STREAM, IPPROTO_TCP );

        if ( m_socket == INVALID_SOCKET ) {
            printf( "Error at socket(): %ld\n", WSAGetLastError() );
            WSACleanup();
            return;
        }

        // Connect to a server.
        sockaddr_in clientService;

        clientService.sin_family = AF_INET;
        clientService.sin_addr.s_addr = inet_addr( "127.0.0.1" );
        clientService.sin_port = htons( 19990 );

        if ( connect( m_socket, (SOCKADDR*) &clientService, sizeof(clientService) ) == SOCKET_ERROR) {
            printf( "Failed to connect.\n" );
            WSACleanup();
            return;
        }

        // Send and receive data.
        int bytesSent;
        char sendbuf[32] = "";
        printf("Enter string to send (max 30 bytes):\n");
        // Width limit keeps long input from overflowing the 32-byte buffer.
        scanf("%31s", sendbuf );
        printf("Sending: %s\n", sendbuf);

        bytesSent = send( m_socket, sendbuf, strlen(sendbuf), 0 );
        printf( "Bytes Sent: %ld\n", bytesSent );

        printf("press a key to terminate\n");
        getch();

        return;
    }
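On the Python side, the same option can be set before bind(). This is only a sketch, assuming a Python 3 interpreter: the constant `socket.SO_EXCLUSIVEADDRUSE` is defined only on Windows builds, so the helper (a hypothetical name, `exclusive_listener`) checks for it and otherwise binds normally. The default of port 0 is a convenience for trying it out, not part of the C++ test.

```python
import socket

def exclusive_listener(host="127.0.0.1", port=0):
    """Return a listening socket, bound exclusively where Windows allows it."""
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    opt = getattr(socket, "SO_EXCLUSIVEADDRUSE", None)
    if opt is not None:
        # Must be set before bind() to have any effect.
        a.setsockopt(socket.SOL_SOCKET, opt, 1)
    try:
        a.bind((host, port))
        a.listen(1)
    except OSError:
        a.close()                     # do not leak the socket on failure
        raise
    return a
```

Calling `exclusive_listener(port=19990)` from two processes would mirror the C++ test; if Python behaves like the C++ programs described here, the second call should fail with error 10048 on Windows.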
btw, the code is a slightly modified version of the getting started with Winsock example: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/win...
Sune B. Woeller wrote:
Tim Peters wrote:
It's starting to look a lot like the Windows bind() implementation is unreliable, sometimes (but rarely -- hard to provoke) allowing two sockets to bind to the same (address, port) pair simultaneously, instead of raising 'Address already in use' for one of them. Disaster ensues.
WRT the last version of the code I posted, on another XP Pro SP2 machine (again after playing registry games to boost the number of ephemeral ports) I eventually saw all of: hangs during accept(); the assertion errors I mentioned last time; and mystery "Connection refused" errors during connect().
The variant of the code below _only_ tries to use port 19999. If it can't bind to that on the first try, socktest111() raises an exception instead of trying again (or trying a different port number). Ran two processes. After about 15 minutes, both died with assert errors at about the same time (identical, so far as I could tell by eyeball):
Process A:
Traceback (most recent call last): File "socktest.py", line 209, in ? assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname()) AssertionError: ('292739', '821744', ('127.0.0.1', 19999), ('127.0.0.1', 3845))
Process B:
Traceback (most recent call last): File "socktest.py", line 209, in ? assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname()) AssertionError: ('821744', '292739', ('127.0.0.1', 19999), ('127.0.0.1', 3846))
So it's again the business where each process is recv'ing the random string intended to be recv'ed by a socket in the other process. Hypothesized timeline:
process A's `a` binds to 19999 process B's `a` binds to 19999 -- according to me, this should be impossible in the absence of SO_REUSEADDR (which acts very differently on Windows than it does on Linux, BTW -- on Linux this should be impossible even in the presence of SO_REUSEADDR; regardless, we're not using SO_REUSEADDR here, and the braindead hard-coded
w.setsockopt(socket.IPPROTO_TCP, 1, 1)
is actually using the right magic constant for TCP_NODELAY on Windows, as it intends). A and B both listen() A connect()s, and accidentally gets on B.a's accept queue B connect()s, and accidentally gets on A.a's accept queue the rest follows inexorably
This is what I'm experiencing as well. I can narrow it down a bit: I *always* experience one out of two erroneous behaviours, as described below.
I tried to make an even simpler test situation, without binding sockets 'r' and 'w' to each other in the same process. I try to reproduce the problem in a 'standard' socket use case, where a client in one process binds to a server in another process.
The following two scripts acts as a server and a client.
    #***********************
    # sock_server_reader.py
    #***********************
    import socket

    a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

    a.bind(("127.0.0.1", 19999))
    print a.getsockname() # assigned (host, port) pair

    a.listen(1)

    print "a accepting:"
    r, addr = a.accept() # r becomes asyncore's (self.)socket
    print "a accepted: "
    print ' ' + str(r.getsockname()) + ', peer=' + str(r.getpeername())

    a.close()

    msg = r.recv(100)
    print 'msg recieved:', msg

    #***********************
    # sock_client_writer.py
    #***********************
    import socket, random

    w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
    w.setsockopt(socket.IPPROTO_TCP, 1, 1)

    print 'w connecting:'
    w.connect(('127.0.0.1', 19999))
    print 'w connected:'
    print w.getsockname()
    print ' ' + str(w.getsockname()) + ', peer=' + str(w.getpeername())
    msg = str(random.randrange(1000000))
    print 'sending msg: ', msg
    w.send(msg)
There are two possible outcomes [a) and b)] of running two instances of this client/server pair (that is, 4 processes in total like the following). (Numbers 1 to 4 are steps executed in chronological order.)
1) python -i sock_server_reader.py
The server prints:

    ('127.0.0.1', 19999)
    a accepting:

and waits for a connection

2) python -i sock_client_writer.py
The client prints:

    w connecting:
    w connected:
    ('127.0.0.1', 3774)
     ('127.0.0.1', 3774), peer=('127.0.0.1', 19999)
    sending msg:  903848
    >>>

and the server now accepts the connection and prints:

    a accepted:
     ('127.0.0.1', 19999), peer=('127.0.0.1', 3774)
    msg recieved: 903848
    >>>
This is as it should be. Then let's try to set up a second client/server pair, on the same port (19999). The expected outcome of this is that the bind() call in sock_server_reader.py should fail with socket.error: (10048, 'Address already in use').
3) python -i sock_server_reader.py
The server prints:

    ('127.0.0.1', 19999)
    a accepting:
Already here the problem occurs, bind() is allowed to bind to a port that is in use, in this case by the client socket 'r'. [also on other windows ? Mikkel: yes. Diku:???]
4) python -i sock_client_writer.py
Now one of two things happens:

a) The client prints:

    w connecting:
    Traceback (most recent call last):
      File "c:\pyscripts\sock_client_writer.py", line 7, in ?
        w.connect(('127.0.0.1', 19999))
      File "<string>", line 1, in connect
    socket.error: (10061, 'Connection refused')
    >>>

The server waits on the call to accept(), still waiting for a connection. (This is the blocking behaviour I reported in my first mail, experienced when running two zope instances. The socket error was swallowed by the unconditional except clause.)

b) The client connects to the server:

    w connecting:
    w connected:
    ('127.0.0.1', 3865)
     ('127.0.0.1', 3865), peer=('127.0.0.1', 19999)
    sending msg:  119105
    >>>

and the server now accepts the connection and prints:

    a accepted:
     ('127.0.0.1', 19999), peer=('127.0.0.1', 3865)
    msg recieved: 119105
    >>>

The second set of client/server processes are now connected on the same port as the first set of client/server processes. In a port scanner the port now belongs to the second server process [3)].
I always get one out of these two possibilities (a and b), I never see bind() raising socket.error: (10048, 'Address already in use').
It is important to realize that both these outcomes are an error.
I tried the same process as above on a linux system, and 3) always raises (10048, 'Address already in use').
If case a) occurred, where w.connect raises socket.error: (10061, 'Connection refused'), then trying to run a third client/server pair makes the bind() call raise (10048, 'Address already in use'). The 'a'-socket from the second pair of processes is not closed in this case, but is still trying to accept().
In my case bind() always raises (10048, 'Address already in use') when there is an open server socket like 'a' bound to the same port.
To summarize: Closing a server socket bound to a given port allows another server socket to bind to the same port, even when there are open client sockets bound to the port.
Note that because this never tries a port number other than 19999, it can't be a bulletproof workaround simply to hold on to the `a` socket. If the hypothesized timeline above is right, bind() can't be trusted on Windows in any situation where two processes may try to bind to the same hostname:port pair at the same time. Holding on to `a`, and cycling through port numbers when bind() failed, would still potentially leave two processes trying to bind to the same port number simultaneously (just a port other than 19999).
It would not be enough to keep a reference to 'a'. It would have to be kept open as well. And maybe that is not a problem, since we only accept() once - only one 'w' client socket would be able to be accepted. Normally the use case for closing the server socket is to disallow more connections than those already accepted. (But I'm not so experienced with sockets, I might be wrong.)
Ick: this happens under Pythons 2.3.5 (MSVC 6) and 2.4.1 (MSVC 7.1), so if it is -- as is looking more and more likely --an error in MS's socket implementation, it isn't avoided by switching to a newer MS C library.
Frankly, I don't see a sane way to worm around this -- it's difficult for application code to worm around what smells like a missing critical section in system code.
Using the simpler socket dance from the ZODB 3.4 code, I haven't yet seen an instance of the assert failure, or a hang. However, let two processes run that long enough simultaneously, and it always (so far) eventually fails with
socket.error: (10048, 'Address already in use')
in the w.connect() call, and despite that Windows picks the port numbers here!
That is exactly what I feared could happen. As shown in my example above, the other that might happen is that the port is 'taken over' by the other process.
While that also smells to heaven of a missing critical section in the Windows socket implementation, an exception is much easier to live with / worm around. Alas, we don't have the MS source code, and I don't have time to try disassembling / reverse-engineering the opcodes (what EULA <wink>?), so best I can do is run this for many more hours to try to increase confidence that an exception is the worst that can occur under the ZODB 3.4 spelling.
Here's full code for the "only try port 19999" version:
    import socket, errno
    import time, random

    def socktest111():
        """Raise an exception if we can't get 19999.
        """

        a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
        w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

        # set TCP_NODELAY to true to avoid buffering
        w.setsockopt(socket.IPPROTO_TCP, 1, 1)

        # tricky: get a pair of connected sockets
        host = '127.0.0.1'
        port = 19999

        try:
            a.bind((host, port))
        except:
            raise RuntimeError
        else:
            print 'b',

        a.listen (1)
        w.setblocking (0)
        try:
            w.connect ((host, port))
        except:
            pass
        print 'c',
        r, addr = a.accept()
        print 'a',
        a.close()
        print 'c',
        w.setblocking (1)

        return (r, w)

    sofar = []
    try:
        while 1:
            try:
                stuff = socktest111()
            except RuntimeError:
                print 'x',
                time.sleep(random.random()/10)
                continue
            sofar.append(stuff)
            time.sleep(random.random()/10)
            if len(sofar) == 50:
                tup = sofar.pop(0)
                r, w = tup
                msg = str(random.randrange(1000000))
                w.send(msg)
                msg2 = r.recv(100)
                assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
                for s in tup:
                    s.close()
    except KeyboardInterrupt:
        for tup in sofar:
            for s in tup:
                s.close()

_______________________________________________
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )
[Sune B. Woeller]
... This is what I'm experiencing as well. I can narrow it down a bit: I *always* experience one out of two erroneous behaviours, as described below.
I see only one of the behaviors below (the second -- no problems), and don't agree it's in error.
I tried to make an even simpler test situation, without binding sockets 'r' and 'w' to each other in the same process. I try to reproduce the problem in a 'standard' socket use case, where a client in one process binds to a server in another process.
The following two scripts acts as a server and a client.
    #***********************
    # sock_server_reader.py
    #***********************
    import socket

    a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
Note that a = socket.socket() is an easier way to spell the same thing; the Medusa code is ancient.
    a.bind(("127.0.0.1", 19999))
    print a.getsockname() # assigned (host, port) pair

    a.listen(1)

    print "a accepting:"
    r, addr = a.accept() # r becomes asyncore's (self.)socket
    print "a accepted: "
    print ' ' + str(r.getsockname()) + ', peer=' + str(r.getpeername())

    a.close()
Key point: no socket is _listening_ on address ("127.0.0.1", 19999) after this close(). From what comes later, I guess you believe that no socket should be allowed to listen on that address again until all connections made with that `a` also close, but I don't think you'll find anything in socket documentation to support that belief. In the world of socket connections, what needs to be unique is _the connection_, and that's a 4-tuple:

    (side 1 host, side 1 port, side 2 host, side 2 port)

There's no prohibition against seeing either side's address in any number of connections simultaneously, you just can't have two connections simultaneously that match in all 4 positions. It so happens that Windows is happy to allow another socket to bind to a port the instant after a socket that had been listening on it closes (and regardless of whether connections made via the latter are still open), but I don't believe that's a bug.

What I appear to be seeing is that sometimes -- rarely -- Windows allows binding to a port by two sockets simultaneously, not serially as you're showing here. Simultaneous binding (in the absence of SO_REUSEADDR on Windows) is a bug.
    msg = r.recv(100)
    print 'msg recieved:', msg

    #***********************
    # sock_client_writer.py
    #***********************
    import socket, random

    w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
    w.setsockopt(socket.IPPROTO_TCP, 1, 1)

    print 'w connecting:'
    w.connect(('127.0.0.1', 19999))
    print 'w connected:'
    print w.getsockname()
    print ' ' + str(w.getsockname()) + ', peer=' + str(w.getpeername())
    msg = str(random.randrange(1000000))
    print 'sending msg: ', msg
    w.send(msg)
There are two possible outcomes [a) and b)] of running two instances of this client/server pair (that is, 4 processes in total like the following). (Numbers 1 to 4 are steps executed in chronological order.)
1) python -i sock_server_reader.py
So -i keeps the connection open -- these programs never "finish".
The server prints:

    ('127.0.0.1', 19999)
    a accepting:

and waits for a connection

2) python -i sock_client_writer.py
The client prints:

    w connecting:
    w connected:
    ('127.0.0.1', 3774)
     ('127.0.0.1', 3774), peer=('127.0.0.1', 19999)
    sending msg:  903848
    >>>

and the server now accepts the connection and prints:

    a accepted:
     ('127.0.0.1', 19999), peer=('127.0.0.1', 3774)
    msg recieved: 903848
    >>>
This is like it should be.
Agreed so far <wink>.
Then let's try to set up a second client/server pair, on the same port (19999). The expected outcome of this is that the bind() call in sock_server_reader.py should fail with socket.error: (10048, 'Address already in use').
Sorry, I don't expect that. sock_server_reader is no longer listening on port 19999, so there's no reason some other socket can't start listening on it.
3) python -i sock_server_reader.py
The server prints:

    ('127.0.0.1', 19999)
    a accepting:
Already here the problem occurs, bind() is allowed to bind to a port that is in use, in this case by the client socket 'r'. [also on other windows ? Mikkel: yes. Diku:???]
I showed an example before of how you can get any number (well, up to 64K) of sockets simultaneously alive saying they're bound to the same address, on Windows or Linux. The socket returned by a.accept() always duplicates a's (hostname, port) address. That's so that if the peer asks for its peer, it gets back the address it originally connected to. It may be confusing, but that's how it works. Windows and Linux seem to differ in how willing they are to reuse a port after a listening socket is closed, but dollars to doughnuts says Microsoft wouldn't accept a claim that their behavior is "a bug".
4) python -i sock_client_writer.py
Now one of two things happens:

a) The client prints:

    w connecting:
    Traceback (most recent call last):
      File "c:\pyscripts\sock_client_writer.py", line 7, in ?
        w.connect(('127.0.0.1', 19999))
      File "<string>", line 1, in connect
    socket.error: (10061, 'Connection refused')
    >>>
How often do you see this? I haven't seen it yet, but I can't make hours today to do this by hand.
The server waits on the call to accept(), still waiting for a connection. (This is the blocking behaviour I reported in my first mail, experienced when running two zope instances. The socket error was swallowed by the unconditional except clause).
The real reason (and well-hidden it is) the Medusa code puts its connect() call in try/except is because the Medusa code (but not your code here) set w to non-blocking mode before the connect, and w.connect() on a non-blocking socket is always exceptional ("in progress" on Linux, "would block" on Windows). I have no idea why the Medusa code set w to be non-blocking to begin with, and although I haven't mentioned it here before, I saw all the same symptoms when I removed the non-blocking convolutions.
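To make the point about non-blocking connect() concrete, here is a rough Python 3 sketch (not the Medusa code). It treats the "in progress" / "would block" codes as the normal case and then waits with select() for the connection to complete. The helper name `nonblocking_connect` and the `IN_PROGRESS` set are made up for illustration.

```python
import errno
import select
import socket

# Codes that only mean "connection attempt started, not finished yet".
# WSAEWOULDBLOCK exists in the errno module only on Windows, hence getattr.
IN_PROGRESS = {
    0,
    errno.EINPROGRESS,
    errno.EWOULDBLOCK,
    getattr(errno, "WSAEWOULDBLOCK", errno.EWOULDBLOCK),
}


def nonblocking_connect(address, timeout=5.0):
    """Connect a non-blocking socket, wait for completion, return it blocking."""
    w = socket.socket()
    w.setblocking(False)
    code = w.connect_ex(address)  # returns an error code instead of raising
    if code not in IN_PROGRESS:
        w.close()
        raise OSError(code, "connect failed")
    # Success shows up as "writable"; on Windows a failed connect shows up
    # in the exceptional set instead, so watch both.
    _, writable, errored = select.select([], [w], [w], timeout)
    if not writable and not errored:
        w.close()
        raise TimeoutError("connect did not complete in time")
    err = w.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err:
        w.close()
        raise OSError(err, "connect failed")
    w.setblocking(True)
    return w
```

A bare `except: pass` around connect(), as in the Medusa dance, hides real failures such as 'Connection refused' along with the expected code; checking the code explicitly keeps the two apart.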
b) The client connects to the server:

    w connecting:
    w connected:
    ('127.0.0.1', 3865)
     ('127.0.0.1', 3865), peer=('127.0.0.1', 19999)
    sending msg:  119105
    >>>

and the server now accepts the connection and prints:

    a accepted:
     ('127.0.0.1', 19999), peer=('127.0.0.1', 3865)
    msg recieved: 119105
    >>>
This is the outcome I've seen every time I've tried it by hand -- no problems.
The second set of client/server processes are now connected on the same port as the first set of client/server processes.
You can get to a similar end more easily by having a server socket accept more than one connection -- all accept()'ed connections have the same socket address. The connection 4-tuples all differ, though, and that's what matters.
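A small Python 3 sketch of that claim, for anyone who wants to see it on their own box: one listener accepts two connections, both accepted sockets report the listener's address, and only the 4-tuples differ. The function name `accept_two` is invented for this example.

```python
import socket


def accept_two():
    """Accept two connections on one listener.

    Returns (listener address, local addresses of the accepted sockets,
    connection 4-tuples as seen from the accepting side).
    """
    a = socket.socket()
    a.bind(("127.0.0.1", 0))  # let the OS pick a free port
    a.listen(2)
    listen_addr = a.getsockname()

    clients, accepted = [], []
    for _ in range(2):
        w = socket.socket()
        w.connect(listen_addr)
        r, _ = a.accept()
        clients.append(w)
        accepted.append(r)
    a.close()  # the listener is gone; both connections remain open

    local_addrs = [r.getsockname() for r in accepted]
    # (local host, local port, peer host, peer port)
    four_tuples = [r.getsockname() + r.getpeername() for r in accepted]

    for s in clients + accepted:
        s.close()
    return listen_addr, local_addrs, four_tuples
```

The two client sockets get different ephemeral ports, which is what keeps the two 4-tuples distinct even though the accepting side's (host, port) is identical in both.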
In a port scanner the port now belongs to the second server process [3)].
I always get one out of these two possibilities (a and b), I never see bind() raising socket.error: (10048, 'Address already in use').
If you can type very, very quickly <wink>, you should see that on Windows if you manage to try binding to 19999 before the original a.close() manages to complete.
It is important to realize that both these outcomes are an error.
If it were true that outcome #b were in error, Windows would have a trivially easy-to-reproduce gross bug here, of many years' standing. Life's rarely that simple, alas.
I tried the same process as above on a linux system, and 3) always raises (10048, 'Address already in use').
Same here, but I've found nothing in socket docs requiring this behavior, and, indeed, there doesn't appear to be a _logical_ necessity for it. It's in fact somewhat of a pain on Linux, because I continue to get 'Address already in use' even after I close both ends of the socket connection too, presumably waiting for the 4-minute TIME_WAIT shutdown dance to end. ...
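The platform difference is easy to probe. The following Python 3 sketch (the function name is invented here) closes a listening socket while a connection made through it is still open, then tries to bind a fresh socket to the same address. It only reports what the OS does; it makes no claim about what the OS should do.

```python
import socket


def can_rebind_after_listener_close():
    """Return True if a new socket can bind to a just-closed listener's
    address while a connection accepted through it is still open."""
    a = socket.socket()
    a.bind(("127.0.0.1", 0))  # let the OS pick a free port
    addr = a.getsockname()
    a.listen(1)

    w = socket.socket()
    w.connect(addr)
    r, _ = a.accept()
    a.close()  # nothing is listening on addr now; r and w stay connected

    b = socket.socket()
    try:
        b.bind(addr)
        b.listen(1)
        return True
    except OSError:
        # Typically "Address already in use" on systems that refuse.
        return False
    finally:
        for s in (b, r, w):
            s.close()
```

Going by the reports in this thread, one would expect True on Windows and False on Linux, but treat that as something to check rather than a guarantee.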
In my case bind() always raises (10048, 'Address already in use') when there is an open server socket like 'a' bound to the same port.
That's as it should be. Alas, what I believe the program I sent last night shows is that Windows doesn't _always_ raise 'Address already in use' when two server sockets are binding to the same port simultaneously. Instead it sometimes says "OK, you got it" to _both_ of them. This is seriously difficult for me to provoke, BTW: 10-60 minutes per failure, on a 3.4 GHz hyperthreaded box.
To summarize: Closing a server socket bound to a given port allows another server socket to bind to the same port, even when there are open client sockets bound to the port.
And in Windows, I believe that's by design. Indeed, I expect the Medusa Windows code tries such a small number of ports (no more than about 50) precisely because Windows has always allowed reusing a listening port so quickly. Otherwise the code would need to try at least as many ports as "the maximum" number of triggers that could possibly be open simultaneously -- but there's no way to know what the maximum is, and it's expensive to try binding to a large number of ports. It would have to impose an artificial limit. The Medusa Linux code avoids all of this by creating pipes instead; alas, Windows asyncore.py can't work with Windows pipes.
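For comparison, here is roughly what a pipe-based trigger looks like on a POSIX system, as a Python 3 sketch. This is not the Medusa source; the class name `PipeTrigger` and its methods are hypothetical. It shows why no port hunting is needed there: os.pipe() hands back a connected pair directly.

```python
import os
import select


class PipeTrigger:
    """Hypothetical minimal trigger: one byte written to a pipe wakes select()."""

    def __init__(self):
        # A connected (read end, write end) pair; no ports involved.
        self.r, self.w = os.pipe()

    def fileno(self):
        # Lets select() accept this object directly.
        return self.r

    def pull(self):
        os.write(self.w, b"x")

    def drain(self):
        # Read whatever trigger bytes have accumulated.
        return os.read(self.r, 8192)

    def close(self):
        os.close(self.r)
        os.close(self.w)
```

On Windows, select() only accepts sockets, which is why the loopback socket pair is needed there instead.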
[...]
Key point: no socket is _listening_ on address ("127.0.0.1", 19999) after this close(). From what comes later, I guess you believe that no socket should be allowed to listen on that address again until all connections made with that `a` also close, but I don't think you'll find anything in socket documentation to support that belief. In the world of socket connections, what needs to be unique is _the connection_, and that's a 4-tuple:
(side 1 host, side 1 port, side 2 host, side 2 port)
There's no prohibition against seeing either side's address in any number of connections simultaneously, you just can't have two connections simultaneously that match in all 4 positions. It so happens that Windows is happy to allow another socket to bind to a port the instant after a socket that had been listening on it closes (and regardless of whether connections made via the latter are still open), but I don't believe that's a bug.
Hi Tim, You are right, I had things confused quite a bit there. Well, I guess I learned quite a bit about sockets in the last few days :)
What I appear to be seeing is that sometimes-- rarely --Windows allows binding to a port by two sockets simultaneously, not serially as you're showing here. Simultaneous binding (in the absence of SO_REUSEADDR on Windows) is a bug.
So -i keeps the connection open -- these programs never "finish".
By design, to keep the sockets around, and being able to inspect them afterwards. [...]
Then let's try to set up a second client/server pair, on the same port (19999). The expected outcome of this is that the bind() call in sock_server_reader.py should fail with socket.error: (10048, 'Address already in use').
Sorry, I don't expect that. sock_server_reader is no longer listening on port 19999, so there's no reason some other socket can't start listening on it.
point taken. [...]
I showed an example before of how you can get any number (well, up to 64K) of sockets simultaneously alive saying they're bound to the same address, on Windows or Linux. The socket returned by a.accept() always duplicates a's (hostname, port) address. That's so that if the peer asks for its peer, it gets back the address it originally connected to. It may be confusing, but that's how it works.
yes, I was aware of that ;) (I thought it was an error for another _server_ socket to start listening on the same port, even when the first server socket is closed, as long as there were client sockets connected via the port. But I agree it is not.)
Windows and Linux seem to differ in how willing they are to reuse a port after a listening socket is closed, but dollars to doughnuts says Microsoft wouldn't accept a claim that their behavior is "a bug".
4) python -i sock_client_writer.py
Now one of two things happens:

a) The client prints:

    w connecting:
    Traceback (most recent call last):
      File "c:\pyscripts\sock_client_writer.py", line 7, in ?
        w.connect(('127.0.0.1', 19999))
      File "<string>", line 1, in connect
    socket.error: (10061, 'Connection refused')

How often do you see this? I haven't seen it yet, but I can't make hours today to do this by hand.
The problem with a failing connect, and as a consequence a hanging accept: I have tested with your socktest111(), and experience the same thing with that as with my own one-go scripts: the socktest111() test hangs on accept() after a few (1, 2, 3...) cycles, because the w.connect fails with (10061, 'Connection refused').

Well, enough with that - I can only recreate that problem on my own machine. I have run the same test on another Win XP Home box, and on Win 2K, without that problem. (I have tested with your newest script also, see the next mail).

/sune
The attached hasn't failed on my box (Win XP Pro SP2, Python 2.3.5) for about two hours, running it in 3 processes. Was using 2 processes before; discovered it was much easier to provoke problems using 3; but the # of ephemeral ports in use increases too, typically hovering between 7-8 thousand after reaching steady state.

I'll let it run the rest of today, and start changing ZODB code if it still looks good. I hope someone(s) else will then volunteer to port the Windows changes to all the copies of Medusa code in the various active Zope trunks and branches.

This suffers from what I still believe to be bugs in the Windows socket implementation, but there is only one symptom I see with this, and the code uses try/except to implement what appears to be a reliable workaround.

    import socket, errno
    import time, random

    class BindError(Exception):
        pass

    def socktest29():
        w = socket.socket()

        # Disable buffering -- pulling the trigger sends 1 byte,
        # and we want that sent immediately, to wake up asyncore's
        # select() ASAP.
        w.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

        count = 0
        while 1:
            count += 1
            # Bind to a local port; for efficiency, let the OS pick
            # a free port for us.
            # Unfortunately, stress tests showed that we may not
            # be able to connect to that port ("Address already in
            # use") despite that the OS picked it.  This appears
            # to be a race bug in the Windows socket implementation.
            # So we loop until a connect() succeeds (almost always
            # on the first try).
            a = socket.socket()
            a.bind(("127.0.0.1", 0))
            print 'b',
            connect_address = a.getsockname()  # assigned (host, port) pair
            a.listen(1)
            try:
                w.connect(connect_address)
                print 'c',
                break
            except socket.error, detail:
                if detail[0] != errno.WSAEADDRINUSE:
                    # "Address already in use" is the only error
                    # I've seen on two WinXP Pro SP2 boxes, under
                    # Pythons 2.3.5 and 2.4.1.
                    raise
                # (10048, 'Address already in use')
                # assert count <= 2 # never triggered in Tim's tests
                if count >= 10:  # I've never seen it go above 2
                    a.close()
                    w.close()
                    raise BindError("Cannot bind trigger!")
                # Close `a` and try again.  Note:  I originally put a short
                # sleep() here, but it didn't appear to help or hurt.
                print
                print detail, a.getsockname()
                a.close()

        r, addr = a.accept()  # r becomes asyncore's (self.)socket
        print 'a',
        a.close()
        print 'c',

        return (r, w)

    sofar = []
    try:
        while 1:
            print '.',
            stuff = socktest29()
            sofar.append(stuff)
            time.sleep(random.random()/10)
            if len(sofar) == 50:
                tup = sofar.pop(0)
                r, w = tup
                msg = str(random.randrange(1000000))
                w.send(msg)
                msg2 = r.recv(100)
                assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
                r.close()
                w.close()
    except KeyboardInterrupt:
        for tup in sofar:
            for s in tup:
                s.close()
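For readers on Python 3, the same retry loop might look roughly like the sketch below. It is a hedged port and not the code that went into ZODB: the names `connected_pair` and `ADDR_IN_USE` are invented here, the progress prints are dropped, and the retry limit of 10 is simply carried over from socktest29().

```python
import errno
import socket

# WSAEADDRINUSE exists in the errno module only on Windows; fall back to the
# POSIX name elsewhere so the sketch also runs on other platforms.
ADDR_IN_USE = getattr(errno, "WSAEADDRINUSE", errno.EADDRINUSE)


class BindError(Exception):
    pass


def connected_pair(max_tries=10):
    """Return a connected (r, w) loopback socket pair, retrying connect()."""
    w = socket.socket()
    # Send the 1-byte trigger immediately rather than buffering it.
    w.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    for _ in range(max_tries):
        a = socket.socket()
        a.bind(("127.0.0.1", 0))  # let the OS pick a free port
        a.listen(1)
        try:
            w.connect(a.getsockname())
        except OSError as detail:
            a.close()
            if detail.errno != ADDR_IN_USE:
                w.close()
                raise
            continue  # "Address already in use": try another port
        try:
            r, _ = a.accept()
        finally:
            a.close()
        return r, w
    w.close()
    raise BindError("Cannot bind trigger!")
```

As in the original, only "Address already in use" is retried; any other connect() error propagates, so unexpected failures are not silently swallowed.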
Tim Peters wrote:
The attached hasn't failed on my box (Win XP Pro SP2, Python 2.3.5) for about two hours, running it in 3 processes. Was using 2 processes before; discovered it was much easier to provoke problems using 3; but the # of ephemeral ports in use increases too, typically hovering between 7-8 thousand after reaching steady state.
I'll let it run the rest of today, and start changing ZODB code if it still looks good. I hope someone(s) else will then volunteer to port the Windows changes to all the copies of Medusa code in the various active Zope trunks and branches.
I have been running 8 simultaneous processes with socktest29() for 12 hours, on my xp box. No problems except for an occasional "Address already in use".
This suffers from what I still believe to be bugs in the Windows socket implementation, but there is only one symptom I see with this, and the code uses try/except to implement what appears to be a reliable workaround.
[Sune B. Woeller, on socktest29]
I have been running 8 simultaneous processes with socktest29() for 12 hours, on my xp box. No problems except for an occasional "Address already in use".
Which the code expects, and works around. Thank you, Sune! I ran the same code, but with 3 processes, for 24 hours straight on my fastest XP box, and also got no failures. Next I'll change various ZODB branches accordingly.
Tim Peters wrote:
[Sune B. Woeller, on socktest29]
I have been running 8 simultaneous processes with socktest29() for 12 hours, on my xp box. No problems except for an occasional "Address already in use".
Which the code expects, and works around. Thank you, Sune! I ran the same code, but with 3 processes, for 24 hours straight on my fastest XP box, and also got no failures. Next I'll change various ZODB branches accordingly.
No prob, sorry for the confusion wrt socket functionality :) Sounds great with the ZODB changes. What releases will this cover? If more releases are due in the 2.7.x line, it would be great if they could be included.

sune
[Sune B. Woeller]
... Sounds great with the ZODB changes. What releases will this cover ? If more releases are due in the 2.7.x line, it would be great if they could be included.
ZODB and Medusa changes have already been checked in for the 2.7.x line. For the rest, please see http://mail.zope.org/pipermail/zope3-dev/2005-August/015112.html and http://www.zope.org/Collectors/Zope3-dev/432
... [Tim Peters]
How often do you see this? I haven't seen it yet, but I can't make hours today to do this [by] hand.
[Sune B. Woeller]
The problem with a failing connect, and as consequence a hanging accept: I have tested with your socktest111(), and experience the same thing with that as with my own one-go scripts: The socktest111() test hangs on accept() after a few (1, 2, 3...) cycles, because the w.connect fails (10061, 'Connection refused').
Well, enough with that - I can only recreate that problem on my own machine.
I have run the same test on another win xp home, and win 2K without that problem.
OK, so you have one box where it's extraordinarily easy to provoke problems with these test scripts. Since the cause is unlikely to be Windows (despite that all flavors of Windows appear to have bugs here, they're pretty shy on most boxes), I'd suggest looking at 3rd-party software you run on your laptop but not on other boxes. For example, software firewalls are a good candidate, and, e.g., I used to see several "impossible socket problems" only when ZoneAlarm was active (and I stopped seeing that when I stopped ZoneAlarm ;-)). Anti-virus and active anti-spyware programs can also create no end of mysterious problems.
[Sune B. Woeller] ...
But then I stumbled upon this flag in the WinSock documentation: SO_EXCLUSIVEADDRUSE See the description here: <http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/using_so_exclusiveaddruse.asp>
Right, I vaguely <wink> knew about that. Note that the documentation explicitly states that in the absence of SO_EXCLUSIVEADDRUSE, the port may be reused as soon as the socket on which bind was called (that is, the socket the connection was originated on or the listening socket) is closed. IOW, that's your "case #b" from earlier email, and Windows is just doing what's documented there. Believe it or not, I haven't found any Linux-ish docs as clear as these MS docs about the behavior of its bind() in all cases.

There are problems with SO_EXCLUSIVEADDRUSE too, which Google will find. A big one is that many versions of Windows require admin privs to set this option, including many versions of Windows Server, and WinXP through SP1. That was a bug, but it's only recently been fixed (in SP2 for WinXP).

At this point, I wouldn't consider using it unless someone first took the tedious time it needs to demonstrate that when it is used, the thing that _I_ think is a bug here goes away in its presence: the seeming ability of Windows to sometimes permit more than one socket to bind to the same address simultaneously (not serially -- Windows does seem to prevent that reliably).

If you can, I would like you to try the ZODB 3.4 Windows socket dance code, and see if it works for you in practice. I know it's not bulletproof, but it's portable across all flavors of Windows and is much better-behaved in my tests so far than the Medusa Windows socket dance.
It is very interesting reading, especially: "An important caveat to using the SO_EXCLUSIVEADDRUSE option exists: If one or more connections originating from (or accepted on) a port bound with SO_EXCLUSIVEADDRUSE is active, all bind attempts to that port will fail."
Note too that they describe that as an "important caveat" (a warning), not as "a feature". They go on to explain that "active" means all of the ESTABLISHED, FIN_WAIT, FIN_WAIT_2, and LAST_ACK states, meaning the port stays tied up (in reality) for minutes even after the `r` and `w` sockets are closed. That's a 50% increase then in the # of ports each trigger ties up for an arbitrarily long time. ...
There is a python bugfix for this, but only for python 2.4: http://sourceforge.net/tracker/index.php?func=detail&aid=982665&group_id=547...
(It is added to version 1.294 of socketmodule.c)
That's not a real problem; if needed this could easily be done under Python 2.3.5 too (the patch only adds a symbolic name for a fixed integer; the integer could be hard-coded when not hasattr(socket, "SO_EXCLUSIVEADDRUSE") -- much as the current Medusa dance hardcodes 1 instead of using socket.TCP_NODELAY).
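A sketch of that fallback idea, written for Python 3. The hard-coded values are assumptions taken from the Winsock headers as I understand them (SO_EXCLUSIVEADDRUSE defined as ~SO_REUSEADDR, with SO_REUSEADDR being 4 on Windows, and TCP_NODELAY being 1); please verify them against your own headers before relying on this. The helper name `make_exclusive_listener` is invented.

```python
import socket
import sys

# Use the symbolic names when the socket module provides them; otherwise
# fall back to hard-coded integers (assumed values, see the note above).
SO_EXCLUSIVEADDRUSE = getattr(socket, "SO_EXCLUSIVEADDRUSE", ~4)
TCP_NODELAY = getattr(socket, "TCP_NODELAY", 1)


def make_exclusive_listener(host="127.0.0.1", port=0):
    """Create a listening socket; request exclusive address use on Windows."""
    a = socket.socket()
    if sys.platform == "win32":
        # The option only exists on Windows, so skip it everywhere else.
        a.setsockopt(socket.SOL_SOCKET, SO_EXCLUSIVEADDRUSE, 1)
    a.bind((host, port))
    a.listen(1)
    return a
```

The option has to be set before bind() to have any effect, which is why it is applied right after the socket is created.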
[Tim]
... At this point, I wouldn't consider using it [SO_EXCLUSIVEADDRUSE] unless someone first took the tedious time it needs to demonstrate that when it is used, the thing that _I_ think is a bug here goes away in its presence: the seeming ability of Windows to sometimes permit more than one socket to bind to the same address simultaneously (not serially -- Windows does seem to prevent that reliably).
I started, but didn't get that far. The first time I ran a pair of processes with the attached (Python 2.4.1, WinXP Pro SP2), one fell over with

    ...
        w.connect((host, port))
      File "<string>", line 1, in connect
    socket.error: (10048, 'Address already in use')

after about 20 minutes.

So, on the face of it, playing with SO_EXCLUSIVEADDRUSE is no better than the ZODB 3.4 Windows socket dance. Both appear mounds better-behaved than the Medusa Windows socket dance without SO_EXCLUSIVEADDRUSE, though. Since there are fewer other problems associated with the ZODB 3.4 version (see last email), I'd like to repeat this part:
If you can, I would like you to try the ZODB 3.4 Windows socket dance code, and see if it works for you in practice. I know it's not bulletproof, but it's portable across all flavors of Windows and is much better-behaved in my tests so far than the Medusa Windows socket dance.
Bulletproof appears impossible due to what still look like race bugs in the Windows socket implementation.

Here's the code. Note that it changed to try (no more than) 10,000 ports, although I didn't see it need to go through more than 200:

    import socket, errno
    import time, random

    class BindError(Exception):
        pass

    def socktest15():
        """Like socktest1, but w/o pointless blocking games.
        Added SO_EXCLUSIVEADDRUSE to the server socket.
        """

        a = socket.socket()
        w = socket.socket()

        a.setsockopt(socket.SOL_SOCKET, socket.SO_EXCLUSIVEADDRUSE, 1)
        # set TCP_NODELAY to true to avoid buffering
        w.setsockopt(socket.IPPROTO_TCP, 1, 1)
        # tricky: get a pair of connected sockets
        host = '127.0.0.1'
        port = 19999

        while 1:
            try:
                a.bind((host, port))
                break
            except:
                if port <= 10000:
                    raise BindError, 'Cannot bind trigger!'
                port -= 1

        port2count[port] = port2count.get(port, 0) + 1
        a.listen(1)
        w.connect((host, port))
        r, addr = a.accept()
        a.close()

        return (r, w)

    def close(r, w):
        for s in r, w:
            s.close()
        return
        # the fancy stuff below didn't help or hurt
        for s in w, r:
            s.shutdown(socket.SHUT_WR)
        for s in w, r:
            while 1:
                msg = s.recv(10)
                if msg == "":
                    break
                print "eh?!", repr(msg)
        for s in w, r:
            s.close()

    port2count = {}

    def dump():
        print
        items = port2count.items()
        items.sort()
        for pair in items:
            print "%5d %7d" % pair

    sofar = []
    i = 0
    try:
        while 1:
            if i % 1000 == 0:
                dump()
            i += 1
            print '.',
            try:
                stuff = socktest15()
            except RuntimeError:
                raise
            sofar.append(stuff)
            time.sleep(random.random()/10)
            if len(sofar) == 50:
                tup = sofar.pop(0)
                r, w = tup
                msg = str(random.randrange(1000000))
                w.send(msg)
                msg2 = r.recv(100)
                assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
                close(r, w)
    except KeyboardInterrupt:
        for tup in sofar:
            close(*tup)
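For anyone trying this on a current interpreter, here is a rough Python 3 sketch of the core of that dance: a descending port search followed by connect and accept. It is not the ZODB 3.4 code and it leaves out SO_EXCLUSIVEADDRUSE. `connected_pair` and its parameters are hypothetical names, and the port range simply mirrors the test above.

```python
import socket


class BindError(Exception):
    pass


def connected_pair(host="127.0.0.1", first_port=19999, last_port=10000):
    """Return (r, w): two sockets connected to each other over loopback."""
    a = socket.socket()   # temporary listening socket
    w = socket.socket()   # client end, kept by the caller
    # TCP_NODELAY so small writes are sent without buffering
    w.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    port = first_port
    try:
        while True:
            try:
                a.bind((host, port))
                break
            except OSError:
                if port <= last_port:
                    raise BindError("Cannot bind trigger!")
                port -= 1          # try the next lower port
        a.listen(1)
        w.connect((host, port))
        r, addr = a.accept()       # server end of the connection
    except Exception:
        w.close()
        raise
    finally:
        a.close()                  # the listener is no longer needed
    return r, w
```

As in the test loop, the pair can be sanity-checked by sending a short message on `w` and reading it back from `r`. A mismatch would indicate that the two processes had swapped connections.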
[Sune Brøndum Wøller]
I agree that SO_EXCLUSIVEADDRUSE is not interesting - I was too focused on what I thought was an error (see the previous posts).

After your first post I changed select_trigger.py of medusa in my zope installation to work like the ZODB 3.4 code (like the ZODB svn trunk). The problem I had with blocks during accept() is not appearing, and my zopes are running fine.

Tests:

A) socktest111 on win2k:
It runs fine for more than 3 hours, but a few (1 out of 100) exceptions on bind() (a printed x) appear.

B) socktest15 on my winxp home sp2:
With SO_EXCLUSIVEADDRUSE: Has been running for 4 hours without problems. Needs ports down to around 19800.
Without SO_EXCLUSIVEADDRUSE: It fails after very few cycles (just like socktest111), like this:

    . . . .
    Traceback (most recent call last):
      File "peters2.py", line 72, in ?
        stuff = socktest15()
      File "peters2.py", line 33, in socktest15
        w.connect((host, port))
      File "<string>", line 1, in connect
    socket.error: (10061, 'Connection refused')

(The old problem. I can't reproduce it on other machines.)

C) socktest15 on win2k:
With SO_EXCLUSIVEADDRUSE: Runs fine for more than 3 hours, needs ports down to around 19800. I did not get an "Address already in use" in that amount of time.
Without SO_EXCLUSIVEADDRUSE: Runs fine for more than 3 hours, usually only needs to go down to 19998.

I have not yet been able to get access to the machines for a longer period. As said above, socktest111 and socktest15 (without SO_EXCLUSIVEADDRUSE) fail immediately on my own machine, so I have not yet been able to reproduce the processes swapping connections (the failing assert).

D) socktest2 on my winxp home sp2:
Have run it several times. Fails after a varying amount of time (2-15 minutes) with a (10048, 'Address already in use') in connect (just like you experienced). Sometimes both processes fail at the same time, sometimes only one of them.

E) socktest29:
Will try to run that now, and let it run a while.
participants (2): Sune B. Woeller, Tim Peters