[Zope] Static fail-over

Thu, 24 Jan 2002 13:59:34 -0800 (PST)

-> I guess what I'm saying is there are places where IP-takeover based
-> clustering is appropriate, sometimes even in conjunction with forward
-> traffic direction.

	Sean: Thanks for the I.P. Takeover info!

	While we're on the topic, I have a quick 'opinion' question about
clusters.  This applies directly to a Zope cluster I'll be building soon.

	A fully redundant, yet not load-balanced, H.A. system requires
*almost* all the same hardware and software as a 2-node load-balanced
system.  That is, you need to detect a service failure and, if found, make
sure traffic goes to the backup system, not the primary system.  (You'd
also want to send an alert, etc., and if you only have dual redundancy,
you'll also want to monitor the backup to make sure it'll be there when
the failover is needed.)

	In a load balanced system (with homogeneous nodes), you need to
watch all nodes for failure and, if found, fail out that particular node.
But instead of the backup hardware going "wasted", just waiting for a
failover, you've halved the hardware workload by distributing the work to
both machines.  This may result in a faster response to end users.
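
	To make "fail out that particular node" concrete, here is a minimal
Python sketch of a round-robin pool that skips nodes marked as failed.
The class and method names (NodePool, mark_failed, pick) are made up for
illustration; a real load balancer would run its own health checks and
call something like mark_failed itself.

```python
class NodePool:
    """Hypothetical round-robin pool that skips failed nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)      # fixed rotation order
        self.healthy = set(nodes)     # nodes currently receiving traffic
        self._next = 0                # position in the rotation

    def mark_failed(self, node):
        # Fail the node out: it stays in the rotation but is skipped.
        self.healthy.discard(node)

    def mark_recovered(self, node):
        # Put a repaired node back into service.
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        # Try each node at most once per call, in rotation order.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes left")
```

With two nodes and one failed out, every request lands on the survivor,
which is exactly the simple failover case.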

	My question:  Does it ever make sense to set up a redundant system
without load balancing?  After all, plopping in new nodes on an as-needed
basis is a very handy feature.

	The only thing I can think of is this:  Imagine a site that must
serve 1 zillion requests per day (a zillion being a Very Big Number).  If
you use a simple failover system, then you buy two boxes, both capable of
handling 1 zillion requests.  If the site grows more popular so you must
handle 2 zillion requests/day, you just upgrade both servers to handle 2
zillion requests/day.  (This is a thought experiment; ignore the fact that
you should have planned for the growth in the first place :)

	Now imagine those same two (1 zillion/day capable) boxes have been
configured for load balancing.  Immediately, each server is only serving
0.5 zillion requests/day.  As the site grows to its new 2 zillion/day
load, both servers being in use means no hardware upgrade is needed.  BUT
--and this is a big but-- you no longer have an H.A. system.  You've lost
your redundancy.  If one of the servers goes down, about half of your
customers will get an HTTP 503 "Service Unavailable" (overloaded) error
message.  So to keep
full redundancy, you actually need THREE nodes.  In fact, for however many
nodes you want your cluster to be, you need to add one extra "redundant"  
node that would handle the traffic for any failed node (just until that
failed node is repaired).

	So I guess I answered my own question: in a two-node load balanced 
system, the second node would really be nothing more than a backup node 
(even though it's handling traffic), and thus you'd need to upgrade your 
hardware (or rather, just add more nodes) as soon as your traffic exceeded 
the limit of 

(nodecount - 1) * traffic_per_node

	And you'd have to size "traffic_per_node" for peak usage traffic,
i.e. each node must be able to handle at least

total_peak_traffic / (nodecount - 1)
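
	The same arithmetic as a small Python sketch.  The function names
are invented for illustration, and traffic is in whatever unit you like
(zillions of requests/day in the example above); it assumes identical
nodes and only one node failing at a time.

```python
import math


def redundant_capacity(nodecount, traffic_per_node):
    """Traffic the cluster can still carry after losing any one node."""
    return (nodecount - 1) * traffic_per_node


def is_still_redundant(nodecount, traffic_per_node, total_peak_traffic):
    """True if peak traffic fits on the nodes left after one failure."""
    return total_peak_traffic <= redundant_capacity(nodecount, traffic_per_node)


def nodes_needed(total_peak_traffic, traffic_per_node):
    """Nodes required to carry the peak, plus one spare."""
    return math.ceil(total_peak_traffic / traffic_per_node) + 1
```

For the two 1-zillion boxes, is_still_redundant(2, 1, 1) holds, but
is_still_redundant(2, 1, 2) does not, and nodes_needed(2, 1) gives the
THREE nodes mentioned above.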

	In the real world of public websites, however, I think a
load-balanced system may actually offer extra redundancy.  Because no site
will get its peak load on a 24/7 basis, the load balanced system can use
any extra resources (which are free because the cluster is not at 100%
capacity) to fill in for the failed node -- and this would be *in addition
to* your extra failover node.  In a simple failover system, you don't get
this.
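
	A rough way to check this claim, as a Python sketch: given a load
profile over the day, list the hours in which the surviving nodes could
not keep up.  The function name and the load numbers in the usage note
are invented for illustration, not measurements from any real site.

```python
def hours_overloaded(hourly_load, nodecount, traffic_per_node, failed=1):
    """Return the indexes of hours where load exceeds surviving capacity."""
    # Capacity left once `failed` nodes have been taken out of the pool.
    surviving_capacity = (nodecount - failed) * traffic_per_node
    return [hour for hour, load in enumerate(hourly_load)
            if load > surviving_capacity]
```

With three nodes of capacity 1.0 each and a made-up profile of
[0.4, 0.6, 1.2, 2.0, 1.5, 0.8], losing one node overloads no hours at
all, and even losing two nodes only overloads the hours where load is
above 1.0.  The off-peak hours are still served, which is the extra
margin a load-balanced cluster has over plain failover.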

	Any additional comments would be greatly appreciated.


--Derek