Re: [Zope] ZEO and a front end...
On Tue, 18 Jul 2000, ethan mindlace fremen wrote:
Curtis Maloney wrote:
Yes, however, his point is that by having each Zope instance 'predominantly' serving one portion of the site, its cache will contain more relevant objects, and thus be just that little bit faster.
Personally, I find this such a simple idea that it MUST be good. (o8 So much so, in fact, that I've decided to have a crack at writing just such a redirector. I feel the Zope world (and others, most likely) could benefit from a 'preferential' redirector.
The way I would do this is to have

section1.contrived-example.com
section2.contrived-example.com
section3.contrived-example.com
with SiteAccess, and then each Zope would serve it according to its IP (though each "could" serve each site). Then you can use whatever IP/DNS load balancing tool your heart desires.
I think most people seem to be missing the point here. The idea is that ALL servers can serve ALL content. HOWEVER, the 'load balancer' will opt for a certain server for a certain URL, in order to improve cache hits.

So, for www.contrived-example.com/dir1 it will first try server1, but if it's busy (or down) it will try others. This way, the cache on server1 is more likely to contain objects relevant to /dir1 and thus have a higher hit rate, therefore improving performance.

An enforced 'mapping', as you were suggesting, removes ALL redundancy from the site, but would likely provide even better cache hits.
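Since I'm planning to write it anyway, here's roughly the selection logic I have in mind, as a Python sketch (the hostnames, URL prefixes and the is_up() health check are all invented for illustration, not a finished design):

    # Ordered preference lists: every backend can serve every URL;
    # the order only says which cache we would *like* to hit first.
    PREFERENCES = [
        ('/dir1', ['server1', 'server2', 'server3']),
        ('/dir2', ['server2', 'server3', 'server1']),
        ('/dir3', ['server3', 'server1', 'server2']),
    ]

    def pick_backend(path, is_up):
        """Return the first live backend preferred for this path.

        is_up is whatever health/load check the redirector uses;
        how that is implemented is still an open question.
        """
        backends = PREFERENCES[0][1]            # arbitrary default pool
        for prefix, preferred in PREFERENCES:
            if path.startswith(prefix):
                backends = preferred
                break
        for backend in backends:
            if is_up(backend):
                return backend
        raise RuntimeError('no backend available for %s' % path)

So a request for /dir1 always lands on server1 while server1 is healthy, keeping its cache warm with /dir1 objects, yet any other server can still answer when server1 is busy or down.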
a thought,
Have a better one, Curtis
Curtis Maloney wrote:
On Tue, 18 Jul 2000, ethan mindlace fremen wrote:
Curtis Maloney wrote:
Yes, however, his point is that by having each Zope instance 'predominantly' serving one portion of the site, its cache will contain more relevant objects, and thus be just that little bit faster.
Personally, I find this such a simple idea that it MUST be good. (o8 So much so, in fact, that I've decided to have a crack at writing just such a redirector. I feel the Zope world (and others, most likely) could benefit from a 'preferential' redirector.
The way I would do this is to have

section1.contrived-example.com
section2.contrived-example.com
section3.contrived-example.com
with SiteAccess, and then each Zope would serve it according to its IP (though each "could" serve each site). Then you can use whatever IP/DNS load balancing tool your heart desires.
I think most people seem to be missing the point here.
The idea is that ALL servers can serve ALL content. HOWEVER, the 'load balancer' will opt for a certain server for a certain URL, in order to improve cache hits.
So, for www.contrived-example.com/dir1 it will first try server1, but if it's busy (or down) it will try others. This way, the cache on server1 is more likely to contain objects relevant to /dir1 and thus have a higher hit rate, therefore improving performance.
No, I understand what is being discussed, I doubt the problem. :-)

Given an equal distribution*, all the back-end (BE) servers will have fairly consistent cache contents from server to server: you are _equally_ likely to hit a server with that object in cache. The more requests you have for a given object, the greater the odds you'll see it in the caches of all BE servers.

* Now, not all systems are equal, this is true. However, in an intelligent load balancing system, you 'weight' the faster/better performing machines, such that they are hit more often. Since these machines will be used more frequently, they will have the best chance of having what you want in cache already.

I just don't see that the additional effort is worth it. The job is already done, and the additional overhead would seem to outweigh any perceived increases in performance. See below.
An enforced 'mapping', as you were suggesting, removes ALL redundancy from the site, but would likely provide even better cache hits.
How so? http://my.site.com/sec1 is mapped to: sec1.site.com, which is load balanced across as many machines as possible, using ZEO and a load balancing tool. Any of the machines in the pool known as sec1 (nobody said it had to be a single machine) could respond. Since these machines serve out sec1 predominantly (they can also participate in the general site load balancing), they would have a better cache hit rate on sec1 stuff than the primary BE servers.

Perhaps this can help: www.libc.org (real site, fictional setup :) is a ZEO cluster.

o The site's primary ZEO clients number 5.
o My load balancing tool lets me weight some servers over others.
o /Members is a heavily trafficked section, so I want it to be separated out, using a rewrite tool (SiteAccess, Roxen, Apache mod_rewrite, whatever) to send all /Members URLs to members.libc.org.
o I set up two ZEO clients, M1 and M2. These two talk to the same ZSS as the other 5, and respond to members.libc.org.

So, when you go to www.libc.org/Members, you will wind up on either M1 or M2. These machines are set up as low-weighted primary site servers (bringing the total up to 7), so they will have a cache that is biased towards /Members, but can still serve up any part of www.libc.org. If M1 or M2 goes down, you stay up.

For added redundancy, you can add the other 5 primary servers as low-weighted servers for members.libc.org, such that if both M1 and M2 die, or get heavily loaded, one or more of the other 5 can pick up the overage, just as M1 and M2 can for the 5 primary servers of the main site.

Now you have 'preferred' machines, to improve the cache hit rate for certain heavily trafficked sections of your site, and you maintain (or even improve) the overall performance and redundancy of the system. Of course, you still have the ZSS as a SPOF, but even that can be gotten around with good design and planning. :^)

If that isn't enough, you can throw eddieware into the mix, which *already* has the ability to redirect based upon the URL.
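A rough Python sketch of that weighting table (the hostnames Z1-Z5 for the primaries and the weight values are invented to match the fictional setup; a real balancer keeps this in its own config format):

    import random

    # Two pools over the same seven ZEO clients: M1/M2 weighted high
    # for members.libc.org and low for the main site, the five
    # primaries (Z1-Z5) the other way around.
    POOLS = {
        'members.libc.org': {'M1': 4, 'M2': 4,
                             'Z1': 1, 'Z2': 1, 'Z3': 1, 'Z4': 1, 'Z5': 1},
        'www.libc.org':     {'Z1': 4, 'Z2': 4, 'Z3': 4, 'Z4': 4, 'Z5': 4,
                             'M1': 1, 'M2': 1},
    }

    def weighted_pick(host):
        """Pick a backend for a virtual host, odds proportional to weight."""
        pool = POOLS[host]
        point = random.uniform(0, sum(pool.values()))
        for backend, weight in pool.items():
            point -= weight
            if point <= 0:
                return backend
        return backend  # guard against floating point leftovers

With those (made-up) weights, roughly 8 of every 13 /Members requests land on M1 or M2, keeping their caches biased towards /Members, and if both die the five primaries simply absorb the traffic.

And-yes,-MacGyver-is-my-hero-ly y'rs

Bill

-- Do not meddle in the affairs of sysadmins, for they are easy to annoy, and have the root password.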
On Tue, 18 Jul 2000 04:22:16 -0600, Bill Anderson <bill@libc.org> wrote:
I think most people seem to be missing the point here.
The idea is that ALL servers can serve ALL content. HOWEVER, the 'load balancer' will opt for a certain server for a certain URL, in order to improve cache hits.
So, for www.contrived-example.com/dir1 it will first try server1, but if it's busy (or down) it will try others. This way, the cache on server1 is more likely to contain objects relevant to /dir1 and thus have a higher hit rate, therefore improving performance.
No, I understand what is being discussed, I doubt the problem. :-)
You are right, there's no problem in the scenario you described. I'll fill in some more details about the fictional example for which I still can't see an easy solution...

Zope is used to store books. Each book object contains:
1. The text of the book, each page in a separate object.
2. Images and diagrams for the book.
3. A ZCatalog full-text index of the book.

Each book object allows:
1. Searching, viewing pages, etc.
2. Dynamically rendering a range of pages as PDF, PostScript, etc.

The whole database stores 10,000 books, and is served by a cluster of many identical Zope servers.

A typical usage pattern might be:
a. The user searches through a book to find the interesting pages.
b. He browses the PDF version of those pages.
c. He tweaks the page range, and double-checks the PDF version.
d. He then downloads a PostScript version of that page range for printing.

Assume that no one has accessed this book recently, so it's not in any caches. The cache has to be filled at step b. This transfers a lot of data - possibly the whole content of the book - and introduces a noticeable delay.

The possibility for optimisation comes at steps c and d. There is one cache already filled with the right data - if the requests from c and d can be directed to the same server as the original, then the cache-filling delay can be avoided.

This extra delay might not have a great impact on actual site performance, but I've found a catastrophic effect on perceived performance in some usability tests. Users seem happy to accept a delay when they first access their data, but not if it is repeated in a subsequent request.
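To show where the delay lives, here is a toy Python sketch of the per-server cache involved (all names invented; the real rendering machinery is of course far more involved):

    # Each Zope client keeps its own in-memory cache of expensive
    # renderings, keyed by book, page range and output format.
    _render_cache = {}

    def rendered(book_id, first, last, fmt, render):
        """Return a cached rendering, building it with the slow
        render() callable if needed; render() has to pull every
        page object in the range out of the database."""
        key = (book_id, first, last, fmt)
        if key not in _render_cache:
            _render_cache[key] = render(book_id, first, last, fmt)
        return _render_cache[key]

Note that steps c and d use *different* keys (new range, new format), so the real saving is one level down: the page objects fetched at step b are already in that one server's object cache, and render() runs quickly there. On any other server it starts cold and drags the book across again.

Bill wrote...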
http://my.site.com/sec1 is mapped to: sec1.site.com, which is load balanced across as many machines as possible
I might be reading more into his words than was intended, but I think this demonstrates the problem. Distributing multiple requests for one section across multiple servers is (what I consider to be) undesirable. I want to move load balancing up one level of abstraction - distributing sections across machines (rather than connections).
If that isn't enough, you can throw eddieware into the mix, which *already* has the ability to redirect based upon the URL.
I've not seen eddieware before - so it looks like I've got some reading to do. At a first glance it doesn't have any integrated HTTP caching (although it seems to have everything else ;-) and there's no obvious place to hang Squid. In my example above, I really want to be able to cache the rendered PDF files.

Toby Dickenson
tdickenson@geminidataloggers.com
Toby Dickenson wrote:
On Tue, 18 Jul 2000 04:22:16 -0600, Bill Anderson <bill@libc.org> wrote:
I think most people seem to be missing the point here.
The idea is that ALL servers can serve ALL content. HOWEVER, the 'load balancer' will opt for a certain server for a certain URL, in order to improve cache hits.
So, for www.contrived-example.com/dir1 it will first try server1, but if it's busy (or down) it will try others. This way, the cache on server1 is more likely to contain objects relevant to /dir1 and thus have a higher hit rate, therefore improving performance.
No, I understand what is being discussed, I doubt the problem. :-)
You are right, there's no problem in the scenario you described.
I'll fill in some more details about the fictional example for which I still can't see an easy solution...
Zope is used to store books. Each book object contains:
1. The text of the book, each page in a separate object.
2. Images and diagrams for the book.
3. A ZCatalog full-text index of the book.

Each book object allows:
1. Searching, viewing pages, etc.
2. Dynamically rendering a range of pages as PDF, PostScript, etc.
The whole database stores 10,000 books, and is served by a cluster of many identical Zope servers.
A typical usage pattern might be:
a. The user searches through a book to find the interesting pages.
b. He browses the PDF version of those pages.
c. He tweaks the page range, and double-checks the PDF version.
d. He then downloads a PostScript version of that page range for printing.
Assume that no one has accessed this book recently, so it's not in any caches.
The cache has to be filled at step b. This transfers a lot of data - possibly the whole content of the book - and introduces a noticeable delay.
The possibility for optimisation comes at steps c and d. There is one cache already filled with the right data - if the requests from c and d can be directed to the same server as the original, then the cache-filling delay can be avoided.
This extra delay might not have a great impact on actual site performance, but I've found a catastrophic effect on perceived performance in some usability tests. Users seem happy to accept a delay when they first access their data, but not if it is repeated in a subsequent request.
Bill wrote...
http://my.site.com/sec1 is mapped to: sec1.site.com, which is load balanced across as many machines as possible
I might be reading more into his words than was intended, but I think this demonstrates the problem. Distributing multiple requests for one section across multiple servers is (what I consider to be) undesirable.
You can actually do it either way. Curtis (AIUI) complained that the method described meant your site depended upon each of the section's servers being up, that there was no redundancy. So I described a way of doing it with redundancy.
I want to move load balancing up one level of abstraction - distributing sections across machines (rather than connections).
That's easier :) Make sec1.site.com a single machine, and all requests for my.site.com/sec1 go to this machine, thus the cache will have it loaded if it has been accessed at all. The downside, like Curtis mentioned, is that if sec1 dies, you lose that part of the site.
If that isn't enough, you can throw eddieware into the mix, which *already* has the ability to redirect based upon the URL.
I've not seen eddieware before - so it looks like I've got some reading to do.
At a first glance it doesn't have any integrated HTTP caching (although it seems to have everything else ;-) and there's no obvious place to hang Squid. In my example above, I really want to be able to cache the rendered PDF files.
EddieWare does do 'intelligent' caching, allowing you to separate out sections of a site to a server (for example, all images come from this machine, and text from that one, etc.), and it works at the IP address level. You simply plug in Squid wherever, AIUI.

-- Do not meddle in the affairs of sysadmins, for they are easy to annoy, and have the root password.
On Tue, 18 Jul 2000 16:08:48 -0600, Bill Anderson <bill@libc.org> wrote:
I might be reading more into his words than was intended, but I think this demonstrates the problem. Distributing multiple requests for one section across multiple servers is (what I consider to be) undesirable.
You can actually do it either way. Curtis (AIUI) complained that the method described meant your site depended upon each of the section's servers being up, that there was no redundancy. So I described a way of doing it with redundancy.
What you described doesn't scale up to having 1000's of sections (which I was assuming, and I think Curtis was too). If this isn't a problem, then your solution is great.
EddieWare does do 'intelligent' caching
eddieware is on my list of options to try out next month... I'll keep you posted.

Toby Dickenson
tdickenson@geminidataloggers.com
Toby Dickenson wrote:
On Tue, 18 Jul 2000 16:08:48 -0600, Bill Anderson <bill@libc.org> wrote:
I might be reading more into his words than was intended, but I think this demonstrates the problem. Distributing multiple requests for one section across multiple servers is (what I consider to be) undesirable.
You can actually do it either way. Curtis (AIUI) complained that the method described meant your site depended upon each of the section's servers being up, that there was no redundancy. So I described a way of doing it with redundancy.
What you described doesn't scale up to having 1000's of sections (which I was assuming, and I think Curtis was too). If this isn't a problem, then your solution is great.
I don't understand why you think it doesn't. DNS has clearly demonstrated the ability to handle 'thousands', and the entire scalability of a cluster is the addition of machines. You appear to be desirous of having a machine handle a section. Thus, for thousands of sections, you have thousands of machines. Again, with a ZEO cluster the bottleneck/SPOF would be the ZSS, but that _could_ be worked around, and has nothing to do with 'sections' of a website.

Beyond that, your bottleneck would be networking, whether your individual BE servers responded directly to the web browser, or whether they were channeled through one or more front-end servers. The decision to implement a BE->Client vs. a BE->FE->Client topology has not been discussed, as it is irrelevant to the discussion.

In fact, come to think of it, I have noticed many sites redirect a /foo/bar user to a foo.domain.com or bar.domain.com.
EddieWare does do 'intelligent' caching
eddieware is on my list of options to try out next month... I'll keep you posted.
Cool.

-- Do not meddle in the affairs of sysadmins, for they are easy to annoy, and have the root password.
On Thu, 20 Jul 2000, Bill Anderson wrote:
Toby Dickenson wrote:
On Tue, 18 Jul 2000 16:08:48 -0600, Bill Anderson <bill@libc.org> wrote:
I might be reading more into his words than was intended, but I think this demonstrates the problem. Distributing multiple requests for one section across multiple servers is (what I consider to be) undesirable.
You can actually do it either way. Curtis (AIUI) complained that the method described meant your site depended upon each of the section's servers being up, that there was no redundancy. So I described a way of doing it with redundancy.
What you described doesn't scale up to having 1000's of sections (which I was assuming, and I think Curtis was too). If this isn't a problem, then your solution is great.
I don't understand why you think it doesn't. DNS has clearly demonstrated the ability to handle 'thousands', and the entire scalability of a cluster is the addition of machines. You appear to be desirous of having a machine handle a section. Thus, for thousands of sections, you have thousands of machines. Again, with a ZEO cluster the bottleneck/SPOF would be the ZSS, but that _could_ be worked around, and has nothing to do with 'sections' of a website.
Bill,

Whilst the structures you've described are very effective, your example of libc.org required one thing in particular that I'm not sure is available: prior knowledge of which sections will be hit hardest.

Essentially, your setup allows any 'server' to become a 'server cluster' for scaling purposes. Great! So, if from now on we assume 'server' can mean 'single server or cluster of servers'...

The desire isn't for a fixed server<->section relationship. Instead, a 'preference' for that section to go to a particular server, so that the request 'hopefully' goes to the server with the greatest chance of having the relevant objects in cache.

In fact, with the further information provided, what you really want is for requests from a particular client to go to the same server. This would be better served with a redirection to a server-specific domain name (serverN.mysite.com). However, for the initial request, your best choice is to go to the server that last served those pages.

Since dynamically tracking this info would be onerous, by encouraging requests for one section toward a particular server, you improve the chances of it holding the relevant objects in cache, with merely a fraction of the processing/data overheads.
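To sketch that redirection idea in Python (section names, hostnames and the default are all invented; the 302 itself would come from whatever front end does the balancing):

    # First request: pick the preferred server for the section, then
    # redirect the browser to that server's own hostname, so every
    # follow-up request from this client hits the same warm cache.
    SECTION_PREFS = {'/books': 'server1', '/members': 'server2'}
    DEFAULT = 'server3'

    def redirect_location(path, domain='mysite.com'):
        section = '/' + path.lstrip('/').split('/', 1)[0]
        server = SECTION_PREFS.get(section, DEFAULT)
        # e.g. /books/42 -> http://server1.mysite.com/books/42
        return 'http://%s.%s%s' % (server, domain, path)

The cost is one extra round trip, and only on the client's first request.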
Beyond that, your bottleneck would be networking, whether your individual BE servers responded directly to the web browser, or whether they were channeled through one or more front-end servers. The decision to implement a BE->Client vs. a BE->FE->Client topology has not been discussed, as it is irrelevant to the discussion.
Ah, topology. (I'm leaving it there. I really don't have time to get into this fully :)
In fact, come to think of it, I have noticed many sites redirect a /foo/bar user to a foo.domain.com or bar.domain.com.
EddieWare does do 'intelligent' caching
eddieware is on my list of options to try out next month... I'll keep you posted.
Cool.
Have a better one, Curtis <dtml-var standard_work_disclaimer>
Curtis Maloney wrote:
On Thu, 20 Jul 2000, Bill Anderson wrote:
Toby Dickenson wrote:
On Tue, 18 Jul 2000 16:08:48 -0600, Bill Anderson <bill@libc.org> wrote:
I might be reading more into his words than was intended, but I think this demonstrates the problem. Distributing multiple requests for one section across multiple servers is (what I consider to be) undesirable.
You can actually do it either way. Curtis (AIUI) complained that the method described meant your site depended upon each of the section's servers being up, that there was no redundancy. So I described a way of doing it with redundancy.
What you described doesn't scale up to having 1000's of sections (which I was assuming, and I think Curtis was too). If this isn't a problem, then your solution is great.
I don't understand why you think it doesn't. DNS has clearly demonstrated the ability to handle 'thousands', and the entire scalability of a cluster is the addition of machines. You appear to be desirous of having a machine handle a section. Thus, for thousands of sections, you have thousands of machines. Again, with a ZEO cluster the bottleneck/SPOF would be the ZSS, but that _could_ be worked around, and has nothing to do with 'sections' of a website.
Bill,
Whilst the structures you've described are very effective, your example of libc.org required one thing in particular that I'm not sure is available: prior knowledge of which sections will be hit hardest.
You start with the most likely suspects, and then after a given time interval, you adjust as needed. *Most* site admins have a good idea of which sections will be more popular or frequented when the site is built. That is as good a start as any other, if not better.
Essentially, your setup allows any 'server' to become a 'server cluster' for scaling purposes. Great! So, if from now on we assume 'server' can mean 'single server or cluster of servers'...
A logical assumption.
The desire isn't for a fixed server<->section relationship. Instead, a 'preference' for that section to go to a particular server, so that the request 'hopefully' goes to the server with the greatest chance of having the relevant objects in cache.
I see that it may not have been clear, but my example provided just that. A preference is indicated by the weight given to servers and sections. Let us say I have three servers. For the whole site, two get a weight of 2, whilst the third gets a weight of 1. This third one, however, gets a weight of 2 for the members section, whilst the other two get a weight of 1. This provides a preference for server3 to serve up the members section, though it is not a direct-only mapping. How does this not fit the 'hopefully' desire? If you _wanted_ a direct-only mapping, you simply remove servers 1 and 2 from the list for the members section. The really neat thing about this is that it can be done at runtime.
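To put numbers on it (assuming the balancer hands out requests in simple proportion to weight - my reading of the scheme, not a claim about any particular tool): general traffic splits 2:2:1, so server3 sees only 1/5 of it, while the members section splits 1:1:2, so server3 sees 2/4 = half of the /Members requests. Its cache ends up biased towards members objects, yet servers 1 and 2 still answer the other half between them - the preference without losing redundancy.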
In fact, with the further information provided, what you really want is for requests from a particular client to go to the same server. This would be better served with a redirection to a server-specific domain name (serverN.mysite.com). However, for the initial request, your best choice is to go to the server that last served those pages.
Since dynamically tracking this info would be onerous, by encouraging requests for one section toward a particular server, you improve the chances of it holding the relevant objects in cache, with merely a fraction of the processing/data overheads.
Right. I agree that tracking all of this would be onerous, which is why I said I don't think it is worth the effort, and would cost more than it saved. The scenario I described gives a preference for sections to go to a particular server, thus giving you the 'encouragement'. :^)
Beyond that, your bottleneck would be networking, whether your individual BE servers responded directly to the web browser, or whether they were channeled through one or more front-end servers. The decision to implement a BE->Client vs. a BE->FE->Client topology has not been discussed, as it is irrelevant to the discussion.
Ah, topology. (I'm leaving it there. I really don't have time to get into this fully :)
Yeah, topology is where the umm ... electrons hit the wire.

Mebbe I'll post this stuff to the Wiki ... the question is ... which one?

-- Do not meddle in the affairs of sysadmins, for they are easy to annoy, and have the root password.
On Thu, 20 Jul 2000, Bill Anderson wrote:
Curtis Maloney wrote: [snip]
Bill,
Whilst the structures you've described are very effective, your example of libc.org required one thing in particular that I'm not sure is available: prior knowledge of which sections will be hit hardest.
You start with the most likely suspects, and then after a given time interval, you adjust as needed. *Most* site admins have a good idea of which sections will be more popular or frequented when the site is built. That is as good a start as any other, if not better.
Ah... in my revision of this e-mail (scary, but I do that when I'm writing :) I must have dropped out the bit about tuning... (o8
Essentially, your setup allows any 'server' to become a 'server cluster' for scaling purposes. Great! So, if from now on we assume 'server' can mean 'single server or cluster of servers'...
A logical assumption.
The desire isn't for a fixed server<->section relationship. Instead, a 'preference' for that section to go to a particular server, so that the request 'hopefully' goes to the server with the greatest chance of having the relevant objects in cache.
I see that it may not have been clear, but my example provided just that. A preference is indicated by the weight given to servers and sections. Let us say I have three servers. For the whole site, two get a weight of 2, whilst the third gets a weight of 1. This third one, however, gets a weight of 2 for the members section, whilst the other two get a weight of 1. This provides a preference for server3 to serve up the members section, though it is not a direct-only mapping. How does this not fit the 'hopefully' desire?
Ah... well... in your previous e-mails I don't recall you mentioning multiple weightings for a single server. In this case, yes, your solution fits well.
Ah, topology. (I'm leaving it there. I really don't have time to get into this fully :)
Yeah, topology is where the umm ... electrons hit the wire.
hehehe....
Mebbe I'll post this stuff to the Wiki ... the question is ... which one?
Don't look at me... I've never even SEEN a wiki. (o8 Curtis
On Wed, 19 Jul 2000 10:07:30 -0600, Bill Anderson <bill@libc.org> wrote:
Toby Dickenson wrote:
On Tue, 18 Jul 2000 16:08:48 -0600, Bill Anderson <bill@libc.org> wrote:
I might be reading more into his words than was intended, but I think this demonstrates the problem. Distributing multiple requests for one section across multiple servers is (what I consider to be) undesirable.
You can actually do it either way. Curtis (AIUI) complained that the method described meant your site depended upon each of the section's servers being up, that there was no redundancy. So I described a way of doing it with redundancy.
What you described doesn't scale up to having 1000's of sections (which I was assuming, and I think Curtis was too). If this isn't a problem, then your solution is great.
I don't understand why you think it doesn't. DNS has clearly demonstrated the ability to handle 'thousands', and the entire scalability of a cluster is the addition of machines. You appear to be desirous of having a machine handle a section. Thus, for thousands of sections, you have thousands of machines.
DNS scales up to one machine per section, but a typical budget doesn't. Fortunately it doesn't need to. Even if we have 10,000s of sections, I would expect only tens to be active over a period of a few minutes.

Another way of looking at the issue is that it is similar to using in-memory Sessions. You have to ensure that each user's requests are routed to the machine that holds their session. The main difference is that it is a performance issue, not a correctness issue.

I don't want to think about handling Sessions using DNS and one machine per user ;-)
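The Session analogy also suggests the standard trick for avoiding a tracking table: derive the target server deterministically from the request. A Python sketch (pure illustration, not any existing tool's behaviour):

    import zlib

    def sticky_server(section, servers):
        """Map a section to one server, the same one every time,
        with no state kept anywhere - the same idea as routing a
        user back to the box holding his in-memory session."""
        return servers[zlib.crc32(section.encode('utf-8')) % len(servers)]

    # sticky_server('/books/1234', ['s1', 's2', 's3']) always answers
    # the same server until the server list itself changes.

The weakness is the same too: change the server list and the mapping reshuffles. For Sessions that would be a correctness bug; here it is only a cold cache for a while - exactly the performance-versus-correctness distinction.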
EddieWare does do 'intelligent' caching
eddieware is on my list of options to try out next month... I'll keep you posted.
Cool.
Toby Dickenson
tdickenson@geminidataloggers.com
Toby Dickenson wrote:
On Wed, 19 Jul 2000 10:07:30 -0600, Bill Anderson <bill@libc.org> wrote:
Toby Dickenson wrote:
On Tue, 18 Jul 2000 16:08:48 -0600, Bill Anderson <bill@libc.org> wrote:
I might be reading more into his words than was intended, but I think this demonstrates the problem. Distributing multiple requests for one section across multiple servers is (what I consider to be) undesirable.
You can actually do it either way. Curtis (AIUI) complained that the method described meant your site depended upon each of the section's servers being up, that there was no redundancy. So I described a way of doing it with redundancy.
What you described doesn't scale up to having 1000's of sections (which I was assuming, and I think Curtis was too). If this isn't a problem, then your solution is great.
I don't understand why you think it doesn't. DNS has clearly demonstrated the ability to handle 'thousands', and the entire scalability of a cluster is the addition of machines. You appear to be desirous of having a machine handle a section. Thus, for thousands of sections, you have thousands of machines.
DNS scales up to one machine per section, but a typical budget doesn't.
Fortunately it doesn't need to. Even if we have 10,000s of sections, I would expect only tens to be active over a period of a few minutes.
You can have multiple sections per machine, as well. :^) sec1.libc.org and sec2.libc.org can be on the same machine (heck, they _could_ be different ZServers on the same machine!). Real-time analysis of each user's browsing by URL would, IMO, induce more overhead than it would save, compared to doing it based upon overall site usage patterns.
Another way of looking at the issue is that it is similar to using in-memory Sessions. You have to ensure that each user's requests are routed to the machine that holds their session. The main difference is that it is a performance, not correctness issue.
Ah, but if you encoded the session information in the URL, you get no practical differences ;^)
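For instance (URL scheme invented for the sake of argument), if the path carries the id of the server that created the session, any front end can route the follow-up requests statelessly:

    def server_from_url(path, default='www'):
        """Pull a routing hint like /s2/... out of the request path."""
        first = path.lstrip('/').split('/', 1)[0]
        if first.startswith('s') and first[1:].isdigit():
            return 'server' + first[1:]      # '/s2/account' -> 'server2'
        return default                       # no hint: balance as usual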
I don't want to think about handling Sessions using DNS and one machine per user ;-)
Ugh, me either!

-- Do not meddle in the affairs of sysadmins, for they are easy to annoy, and have the root password.
Curtis Maloney wrote:
I think most people seem to be missing the point here.
While I think Bill addressed this, I am not missing your point. By subdomaining areas, you can assign those subdomains an IP address, which can be primarily served by a Zope Client.
The idea is that ALL servers can serve ALL content. HOWEVER, the 'load balancer' will opt for a certain server for a certain URL, in order to improve cache hits.
Because you're using SiteAccess, every node can access the objects that the subdomain-primary serves, so you can do load balancing or failover. There might be some delay as the secondaries draw objects from the storage server.
Have a better one,
My life has been so good lately I'm almost afraid to think of what that would be like.

ethan mindlace fremen
Zopatistas Unite!
participants (4):
- Bill Anderson
- Curtis Maloney
- ethan mindlace fremen
- Toby Dickenson