Ok, I've got some thoughts. Mainly, I think there are times when a cluster tier should manage its own availability, and times when it should be managed by a forward tier...
Derek Writes:>>>>> Is there a way to use URL Rewriting rules in Apache (with mod_rewrite) to test if a particular box was alive, and only if so, direct traffic there? Maybe have it look if a particular file exists (or some such)? (Also note that Apache will do much of what Squid will do using mod_proxy.) <<<<<
Squid is more flexible here since it uses an external redirector script that can be written in any language you want; a Squid redirector can be as little as 4-5 lines of Python, or an elaborate C program. To do this in Apache, you need to write a Perl module to control mod_rewrite, I believe (a Squid redirector plugin is easier).
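To give a feel for how small such a redirector can be, here is a rough Python sketch. It assumes the classic redirector interface, where Squid writes one request per line on stdin with the URL as the first field and reads the rewritten URL back on stdout; the exact line format differs between Squid versions, so check yours. The host names, the flag-file path, and the idea that some monitor creates or removes that file are all made up for illustration (the flag file is just Derek's "look if a particular file exists" idea).

```python
import os
import sys
from urllib.parse import urlsplit, urlunsplit

# Hypothetical values, for illustration only.
PRIMARY = "zope1.example.com:8080"
BACKUP = "zope2.example.com:8080"
ALIVE_FLAG = "/var/run/zope1.alive"  # assumed to be maintained by a monitor


def rewrite(line, primary_alive):
    """Return the rewritten URL for one redirector request line."""
    fields = line.split()
    if not fields:
        return ""  # a blank reply leaves the request unchanged
    parts = urlsplit(fields[0])  # the URL is the first field on the line
    target = PRIMARY if primary_alive else BACKUP
    return urlunsplit(
        (parts.scheme, target, parts.path, parts.query, parts.fragment)
    )


def main():
    for line in sys.stdin:
        alive = os.path.exists(ALIVE_FLAG)
        sys.stdout.write(rewrite(line, alive) + "\n")
        sys.stdout.flush()  # Squid waits for each answer, so never buffer


if __name__ == "__main__":
    main()
```

The liveness test is deliberately dumb; anything that can answer "is the primary up?" quickly could take the place of the file check.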
Derek Writes:>>>>> I.P. address take-overs are dangerous. What if the Zope processes die, but the O.S. is okay? You'll have an I.P. address conflict unless you can run a script on the primary box that tells it to shut down it's network interface. So what if the hardware locks up/loses all resources/gets into a loop of somekind? The NIC will still respond to its I.P. address, but you can't run the script to disable it. Bad situation--pray you have a watchdog card for those Zope processes. <<<<<
No, you won't have an IP address conflict. IP address takeover via gratuitous/unsolicited ARP prevents an IP conflict, and it's a standard documented in RFCs that switch vendors are supposed to obey. If a node running monitoring software is acting as a backup for a failed node and sees that Zope has died on its peer, it will initiate a takeover with the clustering software. Once this takeover happens, the failed node's NIC will NOT respond to its old IP address, because the switch will NOT be sending Ethernet frames for that address to it in the first place: hosts, routers, and layer-3 switches keep ARP tables, and the gratuitous ARP updates those tables to point the IP at the backup's MAC address. I promise you, this stuff works.
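For anyone who hasn't looked at what a gratuitous ARP actually is, here is a small Python sketch that builds one as raw bytes. It only constructs the frame; actually sending it needs a raw socket and root privileges, and real clustering software does this for you. The function name and the addresses in the usage line are my own inventions for illustration.

```python
import socket
import struct


def build_gratuitous_arp(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request for ip."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: broadcast destination, our MAC, EtherType 0x0806 (ARP).
    ether = broadcast + hw + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + addr            # sender MAC and sender IP
    arp += b"\x00" * 6 + addr   # target MAC unset, target IP == sender IP

    return ether + arp


# Hypothetical addresses: the backup announces that it now owns the service IP.
frame = build_gratuitous_arp("02:00:00:00:00:01", "192.168.0.10")
```

The key detail is that the sender IP and target IP are the same address: nobody is expected to answer, but everyone who hears the broadcast updates their ARP entry for that IP to the sender's MAC.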
Derek Writes:>>>>> MAC address takeovers are somewhat dangerous, because the switch that you are connected into (such as at a Data Center) may not recognize the MAC address takeover if the NIC on the primary box is still responding (as above). <<<<<
Some switches behave better than others in this regard. IP address takeover is easier to deal with than MAC address takeover, and isn't so picky about hardware. Fortunately, most open source clustering software uses IP takeover, not MAC takeover.
Derek Writes:>>>>> I prefer solutions that keep all nodes (primary, backup, or any peer nodes) behind a NAT. Each node gets its own 192.168.0.x I.P. address, and the NAT box does all failover. You've now moved the I.P. takeover problem to the NAT box (with its backup), but since NAT is in the kernel (under Linux, at least) you'd be hard-pressed to find a NAT box that could respond to an ICMP or serial-port ping but not do NAT. If the kernel is running, it's running, and if it's not, it's not. <<<<<
NAT isn't as flexible as proxying, though I imagine it might be faster in some cases; in both cases you can bridge to a private network. Perhaps the problem of reliability of many web nodes in a cluster is best dealt with at the NAT/Proxy/L7 switch level, but there is also validity in IP address takeover. The one disadvantage of an IP address takeover, though, is that a backup server in a load-balanced arrangement will take on twice the load. Where IP takeover mechanisms might be more appropriate is in 2-box clusters for things like db/file/proxy servers.

Toby's ICP patch looks __really, really cool__. My next setup is likely to use IP takeover clustering on a pair of Squid proxy servers, which themselves load-balance ZEO client nodes using ICP, with the backend storage (file/ODB/RDB) tier as a pair of storage servers also using IP takeover. The reliability of the web server nodes would be dealt with by Squid, thanks to ICP, which would remove the need for a complex IP takeover arrangement across all my web servers; each node would only need monitoring for a half-dead Zope server.

I guess what I'm saying is there are places where IP-takeover based clustering is appropriate, sometimes even in conjunction with forward traffic direction.

Sean