[ZODB-Dev] What's best to do when there is a failure in the second phase of 2-phase commit on a storage server

Wed Oct 1 13:40:11 EDT 2008

On Oct 1, 2008, at 1:21 PM, Dieter Maurer wrote:

> Jim Fulton wrote at 2008-9-30 18:30 -0400:
>> ...
>>>> c. Close the file storage, causing subsequent reads and writes to
>>>> fail.
>>>
>>> Raise an easily recognizable exception.
>>
>> I raise the original exception.
>
> Sad.
>
> The original exception may have many consequences -- most probably
> harmless. The special exception would express that the consequence was
> very harmful.

The fact that it occurs in this place at all indicates this.

>>> In our error handling we look out for some nasty exceptions and
>>> enforce
>>> a restart in such cases. The exception above might be such a nasty
>>> exception.
>>
>> The critical log entry should be easy enough to spot.
>
> For humans, but I had in mind that software recognizes the exception
> automatically and forces a restart.

I suppose we could define such an exception.  A storage that raises it  
is indicating that it will come back in some sort of consistent state  
after a restart.
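
A minimal sketch of what such an exception might look like. This is not an existing ZODB API: `StorageRestartNeeded` and `run_with_restart_policy` are hypothetical names made up for illustration. The point is only that software can recognize the exception by type and force a restart, without having to intercept log entries.

```python
class StorageRestartNeeded(Exception):
    """Hypothetical marker exception (not part of ZODB).

    A storage that raises it is indicating that it will come back
    in some sort of consistent state after a restart.
    """


def run_with_restart_policy(operation, request_restart):
    """Run operation(); call request_restart() on the marker exception.

    Both arguments are caller-supplied callables (assumptions of this
    sketch). The exception is re-raised so that normal error
    reporting still happens.
    """
    try:
        return operation()
    except StorageRestartNeeded:
        request_restart()
        raise
```

Any other exception passes through untouched, so only errors the storage has explicitly marked would trigger a restart.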

> Or do you have a logger customization in mind that intercepts the
> log entry and then forces a restart?

No
...

>>>> - Have a storage server restart when a tpc_finish call fails.  This
>>>> would work fine for FileStorage, but might be the wrong thing to do
>>>> for another storage.  The server can't know.
>>>
>>> Why do you think that a failing "tpc_finish" is less critical
>>> for some other kind of storage?
>>
>>
>> It's not a question of criticality.  It's a question of whether a
>> restart will fix the problem.  I happen to know that a file storage
>> would be in a reasonable state after a restart.  I don't know this to
>> be the case for some other storage.
>
> But what should an administrator do when this is not the case?
> Either a stop or a restart....

Yes

> It may well be that a restart does not lead to a fully functional
> state (though this would indicate a storage bug)

A failure in tpc_finish already indicates a storage bug.

> but a system that definitely does not work
> is not much better than one that may not be fully functional
> but usually will be, apart from storage bugs.

If the alternative to a non-working system is a system with  
inconsistent data, I'll take the former.
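
To illustrate the trade-off, here is a toy sketch of the "close on failure" behavior discussed above. It is not FileStorage code: `ToyStorage`, its simplified `tpc_finish()` signature, and the `finish` callable are assumptions made for the example. When the second phase fails, the storage logs a critical message, marks itself closed so later reads and writes fail, and re-raises the original exception.

```python
import logging

logger = logging.getLogger("toy_storage")


class ToyStorage:
    """Hypothetical stand-in for a storage; not FileStorage itself."""

    def __init__(self, finish):
        self._finish = finish   # callable doing the second-phase work
        self._closed = False

    def load(self, oid):
        if self._closed:
            raise RuntimeError("storage is closed")
        return "data for %r" % (oid,)

    def tpc_finish(self):
        if self._closed:
            raise RuntimeError("storage is closed")
        try:
            self._finish()
        except Exception:
            # Second phase failed: log loudly, refuse further use,
            # and re-raise the original exception unchanged.
            logger.critical("failure in second phase of commit",
                            exc_info=True)
            self._closed = True
            raise
```

After a failed `tpc_finish()`, every later `load()` fails until the process is restarted with a fresh storage, which is the "non-working system" choice rather than the "inconsistent data" one.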

I can see some benefit from raising a special error to indicate that a  
restart would be beneficial.  If I hadn't already done the proposed  
work, I might even pursue this idea. :)  At this point, I think I've  
reduced the probability of a failure in FileStorage._finish enough  
that further effort, at least by me, isn't warranted.

Jim

--
Jim Fulton
Zope Corporation