I've just had a very disturbing and odd experience with data loss in the ZODB. I've been running Zope servers for 15 years and had never seen anything like this until two incidents a few months ago. The sequence of events in those incidents was vague, so I put them down to user error, but now that it has happened a third time, on a different site, I'm getting worried.

Plone 4.2.0.1 (4206)
CMF 2.2.6
Zope 2.13.15
Python 2.7.3 (default, Feb 6 2013, 01:00:51) [GCC 4.4.6 20120305 (Red Hat 4.4.6-4)]
PIL 2.0.0 (Pillow)

The site had been working perfectly for 4 months, with many updates and perfectly fine performance. I'm now faced with the horrible job of explaining this to the client, and I don't quite understand it myself.


Sequence of events
The Plone site was launched to production in September and has been running ever since. It has been restarted several times since then with minor product updates.

Today I needed to do a yum update, and afterwards I restarted Zope using an init script that stops Zope and ZEO and then starts them again.

When Zope came back up, the Plone site was showing data from August 5. Somehow everything between August 5 and December 5 has been lost.

A scan of the transactions via the Undo Log or using Data.fs tools shows transactions up to August 5, then nothing until December 5.

From the logs I can see that the server itself was rebooted on August 5, and a copy of the Data.fs was also made on that date.

We have not packed the database since that date as it was not large or growing fast.

All backups also show the August 5 data - which means the file system copies of the Data.fs were simply copying the old version of the database.

I've cloned the file system and kept the original untouched. However, in the first hour or so after the restart I did several file copies and moves, as I was unaware of the gravity of the situation.

I've tried extundelete and debugfs, and neither of them detects any deleted Data.fs file that can be recovered.

Questions
My main aim is to work out what happened. My best guess is that Zope was somehow connecting to a stale or outdated file pointer and updating that file all along while the Data.fs was pointing to an August 5 copy. But how could this situation eventuate and persist for so long?
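
To make that guess concrete, here is a minimal Python sketch of the mechanism I have in mind. It uses only the standard library and does not touch ZODB or ZEO, so it shows that the situation is possible on a POSIX file system, not that it is what actually happened here. The names handle_matches_path and simulate_stale_handle are made up for illustration, and "Data.fs" is just a scratch file in a temporary directory.

```python
import os
import tempfile


def handle_matches_path(fileobj, path):
    """Return True if the open file and the path refer to the same inode."""
    try:
        on_disk = os.stat(path)
    except OSError:
        return False  # nothing exists at that path any more
    opened = os.fstat(fileobj.fileno())
    return (opened.st_dev, opened.st_ino) == (on_disk.st_dev, on_disk.st_ino)


def simulate_stale_handle(workdir):
    """Reproduce the suspected scenario with a scratch file (hypothetical)."""
    path = os.path.join(workdir, "Data.fs")
    with open(path, "wb") as f:
        f.write(b"august\n")

    # A long-running server opens the file and keeps the handle.
    writer = open(path, "ab")

    # Something replaces the path with a copy (copy, then rename over it).
    replacement = path + ".new"
    with open(path, "rb") as src, open(replacement, "wb") as dst:
        dst.write(src.read())
    os.rename(replacement, path)  # the path now names a different inode

    # The server keeps writing, but only to the old, now unlinked inode.
    writer.write(b"december\n")
    writer.flush()
    same = handle_matches_path(writer, path)

    # Closing the handle (a restart) releases the unlinked inode for good.
    writer.close()
    with open(path, "rb") as f:
        visible = f.read()
    return same, visible


if __name__ == "__main__":
    print(simulate_stale_handle(tempfile.mkdtemp()))
```

In this sketch the running process sees all of its own writes the whole time, while anything that reads the path (including a file system backup) only ever sees the old copy. That would match both the stale backups and the data vanishing at restart.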

The odd thing is that we had 2 very similar incidents on 2 different Zope servers a few months ago but both resulted in almost no data loss as the timeframes were shorter and I dismissed them as some odd user error. 

We have recently moved most of our Zope servers to Linode. Could it be their file system? Or could it be the new way we set up the buildouts and init scripts?
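
One check that could be run on the other servers while Zope and ZEO are still up is to look for processes holding an open file that has already been unlinked. On Linux, the symlinks under /proc/<pid>/fd have " (deleted)" appended to their target in that case. The following is a rough sketch only: find_deleted_open_files is a made-up name, the "Data.fs" fragment is an assumption about the file name, and it needs to run as root (or as the Zope user) to be able to read other processes' fd directories.

```python
import os


def find_deleted_open_files(name_fragment="Data.fs", proc_root="/proc"):
    """List (pid, fd, target) for open files that were unlinked (Linux only)."""
    hits = []
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process has exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if target.endswith(" (deleted)") and name_fragment in target:
                hits.append((int(pid), int(fd), target))
    return sorted(hits)


if __name__ == "__main__":
    for pid, fd, target in find_deleted_open_files():
        print("pid %d fd %d -> %s" % (pid, fd, target))
```

If this reports anything on a running server, the data behind that descriptor should still be readable through /proc/<pid>/fd/<fd> for as long as the process stays up, so it would be worth copying it out before any restart.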

Any clues at all would be welcomed.



--
Tom Cameron
Technical Director

Mooball IT