ZODB data loss - help required
Ive just had a very disturbing and odd experience with data loss in the ZODB. Ive been running Zope servers for 15 years and never seen this before until two previous incidents a few months ago. The sequence of events for those incidents were vague so I put it down to user error, but now that its happened a third time on a different site Im getting worried. Plone 4.2.0.1 (4206) CMF 2.2.6 Zope 2.13.15 Python 2.7.3 (default, Feb 6 2013, 01:00:51) [GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] PIL 2.0.0 (Pillow) The site was been working perfectly for 4 months with many updates and perfectly fine performance. Im now faced with the horrible job of explaining this to the client and I dont quite understand it myself. *Sequence of events* The Plone site was launched to production in September and has been running ever since. It has been restarted several times since then with minor product updates. Today I needed to do a yum update and then I restarted Zope using an init script that stops Zope & ZEO and restarts them. When Zope came up the Plone site was now showing data from August 5. Somehow everything between August 5 and December 5 has been lost. A scan of the transactions via the Undo Log or using data.fs tools shows transactions up to August 5 then nothing until December 5
From the logs I can see that the server itself was rebooted on August 5 and there was also a copy of the Data.fs made on that date too.
We have not packed the database since that date as it was not large or growing fast. All backups also show the August 5 data - which means the file system copies of the Data.fs were simply copying the old version of the database. Ive cloned the file system and kept the original untouched although in the first hour or so after the restart I did several file copies and moves as I was unaware of the gravity of the situation. Ive tried extundelete and debugfs and neither of them detect any deleted Data.fs file that can be recovered. *Questions* My main aim is to work out what happened. My best guess is that Zope was somehow connecting to a stale or outdated file pointer and updating that file all along while the Data.fs was pointing to an August 5 copy. But how could this situation eventuate and persist for so long? The odd thing is that we had 2 very similar incidents on 2 different Zope servers a few months ago but both resulted in almost no data loss as the timeframes were shorter and I dismissed them as some odd user error. We have recently moved most of our Zope servers to Linode - could it be their file system? or could it be the new way we setup the buildouts and init scripts? Any clues at all would be welcomed. -- *Tom Cameron* Technical Director *Mooball IT*
Tom Cameron <tom@mooball.net> schrieb:
Ive just had a very disturbing and odd experience with data loss in the
Hi Tom, do you have any Zope / event.log Logs from the timeframe? Just an idea: Did you make the Zope restart BEFORE your system update/-grade (i assume that you mean the Cent OS packet manager "yum")? Did you make shure that all of your Zopes python libs (not managed by your yum system - if you have some) got rebuild after the system upgrade - at least the ones with system library / binary context? 5th September sounds a relative "large" timeframe today, but is possibly not if there was not a lot of changes / operations to ZODB or the volume was small. best regards, Niels. -- Niels Dettenbach Syndicat IT&Internet http://www.syndicat.com
On Don, 2013-12-05 at 15:42 +1000, Tom Cameron wrote:
Questions My main aim is to work out what happened. My best guess is that Zope was somehow connecting to a stale or outdated file pointer and updating that file all along while the Data.fs was pointing to an August 5 copy. But how could this situation eventuate and persist for so long?
for my, it looks exactly like that Zope used an inode, which wasn't the Data.fs you expected. It used that as long as you restarted Zope. your backups were always copying the old Data.fs from the filesystem path, where you expected the correct one to be in. that happens, when the directory with your Data.fs is moved (e.g. $ mv ./var ./var-bak) and a new one copied back to the old place (e.g. $ cp -R ./var-bak ./var). that happens also, when the Data.fs is deleted. there are chances, that the correct Data.fs is still somewhere around (except when it was deleted, then a data recovery service might be able to help). just a wild guess: did your backup script screw this up? best, johannes
The odd thing is that we had 2 very similar incidents on 2 different Zope servers a few months ago but both resulted in almost no data loss as the timeframes were shorter and I dismissed them as some odd user error.
We have recently moved most of our Zope servers to Linode - could it be their file system? or could it be the new way we setup the buildouts and init scripts?
Any clues at all would be welcomed.
-- Tom Cameron Technical Director
Mooball IT
_______________________________________________ Zope maillist - Zope@zope.org https://mail.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - https://mail.zope.org/mailman/listinfo/zope-announce https://mail.zope.org/mailman/listinfo/zope-dev )
Le 05/12/2013 09:43, Johannes Raggam a écrit :
On Don, 2013-12-05 at 15:42 +1000, Tom Cameron wrote:
Questions My main aim is to work out what happened. My best guess is that Zope was somehow connecting to a stale or outdated file pointer and updating that file all along while the Data.fs was pointing to an August 5 copy. But how could this situation eventuate and persist for so long? for my, it looks exactly like that Zope used an inode, which wasn't the Data.fs you expected. It used that as long as you restarted Zope. your backups were always copying the old Data.fs from the filesystem path, where you expected the correct one to be in.
that happens, when the directory with your Data.fs is moved (e.g. $ mv ./var ./var-bak) and a new one copied back to the old place (e.g. $ cp -R ./var-bak ./var).
that happens also, when the Data.fs is deleted.
I'm agree with this description. You can test this python behavior in a python shell: when a file is unlinked in the filesystem (move or deleted) python move the file descriptor into RAM and still use it there without warning. The file descriptor is destroyed when the python process end (Zope stop or restart). Your only way is to find a Data.fs.old from the last pack somewhere or to have a backup. -- Encolpe DEGOUTE http://encolpe.degoute.free.fr/ Logiciels libres, hockey sur glace et autres activités cérébrales
participants (4)
-
Encolpe Degoute -
Johannes Raggam -
Niels Dettenbach (Syndicat IT&Internet) -
Tom Cameron