system down - how to prevent?
Hi all. We have been working with Zope for over a year now and we like it. It has been a long time since I had to call upon you people to help me out. Yesterday, our production server went down the hard way. While someone was editing some objects in the ZMI, everything crashed: not only Zope, but the entire machine.

We are running Red Hat 7.3, Apache and Zope 2.5.1 + CMF 1.3 on a Dell PowerApp 120 (dual PIII 1 GHz, 1 GB RAM, RAID 5), so we thought we would be rather safe, having also had the Data.fs backed up by our hosting provider. After the crash the server would not start up again, indicating 'memory failure'. After several retries nothing works anymore. The machine is still under warranty, so that is no real problem either.

Now comes the funny part: there doesn't seem to exist any backup of the Data.fs. Checking our system logs prior to the incident revealed this:

  Feb 14 02:54:08 piwebserver Retrospect[27997]: FSGetNodeInfo: lstat failed on "/home/zope/2-5-1/var/Data.fs", error 75
  Feb 14 02:54:08 piwebserver Retrospect[27997]: FSGetNodeInfo: lstat failed on "/home/zope/2-5-1/var/Data.fs.old", error 75
  Feb 14 02:54:12 piwebserver Retrospect[27995]: connTcpConnection: invalid code found: 111

Now they tell us their backup program can connect to our server, but the Data.fs file cannot be backed up because it is locked / in use. Our firewall is open to the backup program:

  $IPT -A tcp_inbound -p TCP -s 111.111.111.111 --destination-port 497 -j ACCEPT

How come? We can manually create a copy of the file. Has anyone had these problems, and how did you solve them?

Secondly, we are investigating how to prevent downtime of the production server in the future. I had a quick peek at ZEO but I'm a bit lost there. What is the minimum setup to keep a production site alive (not necessarily with the same specs)? As far as I can tell you need at least three machines to keep your site alive: a 'load-balancer', a 'client' and a 'server'. Could this be narrowed down to two machines?
And what if the actual 'ZEO server' goes down? TIA, WKR, Roel
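[Editor's note] Since the manual copy mentioned above does work, a cron-driven snapshot is the obvious stopgap while the Retrospect problem is sorted out. This is only a sketch: the Data.fs path is taken from the logs above, the backup directory is hypothetical, and because FileStorage is append-only a straight copy of a live file is normally usable up to the last complete transaction, but restores should be verified before relying on it.

```python
#!/usr/bin/env python
# Nightly Data.fs snapshot: a sketch of the "manual copy" workaround.
# DATA_FS is the path from the log messages; BACKUP_DIR is hypothetical.
import os
import shutil
import time

DATA_FS = "/home/zope/2-5-1/var/Data.fs"   # path as seen in the logs
BACKUP_DIR = "/home/zope/backups"          # hypothetical destination

def snapshot(src=DATA_FS, dst_dir=BACKUP_DIR):
    """Copy the storage file to a timestamped backup and return its path."""
    os.makedirs(dst_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = os.path.join(dst_dir, "Data.fs.%s" % stamp)
    shutil.copy2(src, dst)   # copy2 also preserves the mtime for auditing
    return dst

if __name__ == "__main__":
    print(snapshot())
```

Run it from cron shortly before the external backup window, so Retrospect can pick up the (closed, unlocked) snapshot instead of the live file.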
On Wednesday 05 March 2003 12:34 pm, Roel Van den Bergh wrote:
Feb 14 02:54:08 piwebserver Retrospect[27997]: FSGetNodeInfo: lstat failed on "/home/zope/2-5-1/var/Data.fs", error 75
errno 75 is EOVERFLOW, which I think is probably related to 2G file limits in the backup software.
a 'load- balancer', a 'client' and a 'server'. Could this be narrowed down to two machines?
Or down to one machine. You don't have to put the ZEO server and the application servers on separate boxes.
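[Editor's note] For the Zope 2.5.x generation this one-box setup is wired up through a custom_zodb.py in the instance home; the sketch below is illustrative, and the port number is an arbitrary example.

```python
# custom_zodb.py -- sketch of a single-box ZEO setup for the Zope 2.5.x era.
# The ZEO storage server and the Zope client run on the same machine and
# talk over localhost; 9999 is an example port, not a required value.
import ZEO.ClientStorage

Storage = ZEO.ClientStorage.ClientStorage(('localhost', 9999))
```

The storage server has to be started first (in that era, with something along the lines of `python ZEO/start.py -p 9999` from the Zope software home), after which Zope starts normally and opens its database through ZEO instead of directly on Data.fs.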
And what if the actual 'ZEO server' goes down?
Today it is a single point of failure, so make sure that one box is well engineered. Keeping a cold standby ZEO server isn't hard, if you don't mind a disaster losing a few recent transactions and causing maybe an hour of downtime while you swap the boxes. There are several options for maintaining this replica... Is that what you had in mind? -- Toby Dickenson http://www.geminidataloggers.com/people/tdickenson
-----Original Message----- From: Toby Dickenson [mailto:tdickenson@geminidataloggers.com] Sent: Wednesday, March 05, 2003 1:49 PM To: roel@planetinterior.com; zope@zope.org Subject: Re: [Zope] system down - how to prevent?
On Wednesday 05 March 2003 12:34 pm, Roel Van den Bergh wrote:
Feb 14 02:54:08 piwebserver Retrospect[27997]: FSGetNodeInfo: lstat failed on "/home/zope/2-5-1/var/Data.fs", error 75
errno 75 is EOVERFLOW, which I think is probably related to 2G file limits in the backup software.
Yes, our Data.fs is about 4 GB, but shouldn't a backup program like Retrospect be able to handle such files?
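[Editor's note] The diagnosis is easy to confirm from the numbers alone. On Linux, errno 75 is EOVERFLOW, which the 32-bit stat()/lstat() calls return when a file's size does not fit in 31 bits (i.e. 2 GiB or more) and the caller was not built with large-file support. A quick check, with the 2 GiB threshold spelled out (the Data.fs path is the one from the logs and only shown in a comment):

```python
# Confirming the EOVERFLOW / 2 GB diagnosis. On Linux, errno 75 is
# EOVERFLOW, returned by a non-large-file-aware 32-bit lstat() for any
# file of 2 GiB or more -- exactly what a ~4 GB Data.fs would trigger.
import errno
import os

print(errno.errorcode[75])    # 'EOVERFLOW' on Linux

TWO_GIB = 2 ** 31             # the classic 2 GB boundary

def over_2gb(path):
    """True if `path` would trip a non-LFS-aware 32-bit stat()."""
    return os.stat(path).st_size >= TWO_GIB

# On the affected server one would check e.g.:
#   over_2gb("/home/zope/2-5-1/var/Data.fs")
```

So the fix is on the backup-software side: a Retrospect build (or any backup client) with large-file support, or backing up a copy kept under 2 GB.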
a 'load- balancer', a 'client' and a 'server'. Could this be narrowed down to two machines?
Or down to one machine. You don't have to put the ZEO server and the application servers on separate boxes.
This does not help in case of hardware failure, though.
And what if the actual 'ZEO server' goes down?
Today it is a single point of failure, so make sure that one box is well engineered.
Keeping a cold standby ZEO server isn't hard, if you don't mind a disaster losing a few recent transactions and causing maybe an hour of downtime while you swap the boxes. There are several options for maintaining this replica... Is that what you had in mind?
What do you mean by 'cold'? How can you maintain an exact copy of the Data.fs on another machine if that system isn't turned on and connected to the original server? Our hardware is located 30 km from here in a secured place. And yes, swapping boxes is an option I had in mind, like buying two identical high-end servers with hot-swappable disks and, in an emergency, moving the disks from one machine to the other. But my boss hasn't got the money :-(
-- Toby Dickenson http://www.geminidataloggers.com/people/tdickenson
On Wednesday 05 March 2003 1:21 pm, Roel Van den Bergh wrote:
Like buying two identical high-end servers with hot-swappable disks and, in an emergency, moving the disks from one machine to the other. But my boss hasn't got the money :-(
If you want redundancy then you will need two copies of your hardware. The secondary system can be lower spec (in reliability, but not in capacity), because you will only be using it while the main system is down. If you don't have easy physical access to the machines, then I'm not sure hot-swap disks are a great advantage. An online replication scheme would allow you to switch remotely. There are scripts to do this for FileStorage (I/O intensive, and hairy), for DirectoryStorage (still in beta release), and Zope Corp have a commercial product. -- Toby Dickenson http://www.geminidataloggers.com/people/tdickenson
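[Editor's note] The FileStorage replication scripts mentioned above are not reproduced here, but the core idea can be sketched. Between packs a FileStorage only grows by appending, so a standby replica can be kept current by copying just the bytes appended since the last sync, falling back to a full copy whenever the source shrinks (i.e. after a pack). This is an illustration of that idea, not the actual scripts:

```python
# Minimal sketch of incremental FileStorage replication. FileStorage is
# append-only between packs, so the replica only needs the new tail bytes;
# if the source is ever smaller than the replica, it was packed and the
# replica is stale, so we fall back to a full copy.
import os
import shutil

def sync_filestorage(src, dst):
    """Append bytes added to src since the last sync; recopy after a pack."""
    src_size = os.path.getsize(src)
    dst_size = os.path.getsize(dst) if os.path.exists(dst) else 0
    if src_size < dst_size:
        # Source shrank (packed/truncated): replica is stale, copy it whole.
        shutil.copy2(src, dst)
        return
    with open(src, "rb") as s, open(dst, "ab") as d:
        s.seek(dst_size)             # skip bytes the replica already has
        shutil.copyfileobj(s, d)     # append only the new tail
```

Run periodically from the standby (over NFS, rsync-style, or similar), the replica lags by at most one sync interval, which matches the "losing a few recent transactions" trade-off described above.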
participants (2)
- Roel Van den Bergh
- Toby Dickenson