[ZODB-Dev] [ zodb-Bugs-547020 ] Weird ZEO error: Aiieee! error code 25

Mon, 22 Apr 2002 12:19:24 -0700

Bugs item #547020, was opened at 2002-04-22 10:47
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=115628&aid=547020&group_id=15628

Category: None
Group: None
Status: Open
Resolution: None
Priority: 9
Submitted By: Chris Withers (fresh)
Assigned to: Jeremy Hylton (jhylton)
Summary: Weird ZEO error: Aiieee! error code 25

Initial Comment:
On a Python ZEO client, we get:

>   File c:\zope\2-4-2_base\lib\python\ZEO\ClientStorage.py, line 426, in tpc_vote
>     (Object: ('x.x.com', xxxx))
>   File c:\zope\2-4-2_base\lib\python\ZEO\zrpc.py, line 228, in __call__
> TypeError: exceptions must be strings, classes, or instances

The matching entry on the Storage Server is:

> 2002-04-22T10:33:43 ERROR(200) zdaemon zdaemon: Mon Apr 22 11:33:43 2002: Aiieee! 2107 exited with error code: 25

After that, the storage server tries to fork, and we got the following entry pattern in the logs:

> ------
> 2002-04-22T10:33:43 INFO(0) zdaemon zdaemon: Mon Apr 22 11:33:43 2002: Houston, we have forked
> ------
> 2002-04-22T10:33:43 INFO(0) zdaemon zdaemon: Mon Apr 22 11:33:43 2002: Hi, I just forked off a kid: 2155
> ------
> 2002-04-22T10:33:43 INFO(0) zdaemon zdaemon: Mon Apr 22 11:33:43 2002: Houston, we have forked

...however, the storage server never accepts connections afterwards until it is manually
stopped and restarted.

During this restart process, we often get the following entries in the storage server logs:

> ------
> 2002-04-21T09:22:22 PROBLEM(100) ZODB FS FS21  warn: /x/Data.fs truncated, possibly due to damaged records at 2147482867
>
> ------
> 2002-04-21T09:22:22 PROBLEM(100) ZODB FS FS21  warn: Writing truncated data from /x/Data.fs to /x/Data.fs.tr14
>
> ------

...but regardless of whether we get those messages or not, the storage server takes an age to 
start and uses 100% CPU the whole time.

This storage server has been completely stable for about a month, and this problem has started
recurring in the last few days.

What's the best way to go about finding out what's going on?

----------------------------------------------------------------------

>Comment By: Jeremy Hylton (jhylton)
Date: 2002-04-22 19:19

Message:
I suspect it is something funky with your environment.  On 
my machine (Mandrake 7.2?) RLIMIT_FSIZE > 10**18.

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2002-04-22 19:01

Message:
The zdaemon error message is probably a little misleading.  
I believe that 25 is the signal sent to the child causing 
it to exit.  If that's right, then you're getting SIGXFSZ, 
which means that it tried to extend a file past the rlimit 
(RLIMIT_FSIZE).
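
One way to check this reading is to decode the number as a raw wait status rather than as an exit code. The sketch below uses only Python's standard os and signal modules. describe_wait_status is a hypothetical helper, not part of zdaemon, and it assumes a POSIX system and that zdaemon logged the unmodified status it got back from os.wait(). Signal numbers differ between platforms, so the name lookup is only indicative.

```python
import os
import signal


def describe_wait_status(status):
    """Interpret a raw status as returned by os.wait() or os.waitpid()."""
    if os.WIFSIGNALED(status):
        signum = os.WTERMSIG(status)
        try:
            # Map the number to a name on the current platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return "killed by signal %d (%s)" % (signum, name)
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    return "other status %d" % status


# The value from the zdaemon log line, treated as a wait status.
print(describe_wait_status(25))
# For comparison, this platform's number for SIGXFSZ.
print(int(signal.SIGXFSZ))
```

If the two numbers agree on the storage server's platform, the signal interpretation looks more likely than a plain exit code of 25.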

Does that sound plausible?  How big is the file?  Do you 
have a custom RLIMIT_FSIZE or know what the default is for 
your OS?
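
To answer these questions on the server, something along the following lines could be run as the same user, and from the same environment, that starts the storage server, since resource limits are inherited per process. This is a sketch that assumes a Unix system where Python's resource module is available. fsize_headroom is a hypothetical helper name, and the path in the usage comment is the anonymised one from the log above.

```python
import os
import resource


def fsize_headroom(path):
    """Return (file size, soft RLIMIT_FSIZE, bytes left before the limit).

    The last two values are None when the soft limit is unlimited.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    size = os.path.getsize(path)
    if soft == resource.RLIM_INFINITY:
        return size, None, None
    return size, soft, soft - size


# Example usage (path is a placeholder):
# print(fsize_headroom("/x/Data.fs"))
```

A small or negative third value would suggest that the file has reached the limit.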

It makes sense that this happens during tpc_vote(), because 
that's when all the data is copied from the tempfile to the 
Data.fs.  If a failure occurs at this point, I'll bet that 
on restart FileStorage ignores the index and recomputes it, 
which would explain why it is slow.
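
The kind of failure described here can be reproduced in isolation. The sketch below is an illustration under stated assumptions, not ZEO or FileStorage code. A forked child lowers its own RLIMIT_FSIZE, writes to a scratch file until it crosses the limit, and the parent inspects the wait status. write_past_limit and its parameters are hypothetical names, and the code assumes a Unix system with os.fork. The child first restores the default SIGXFSZ disposition, in case the interpreter or a parent process has set the signal to be ignored, in which case the write would fail with EFBIG instead of killing the process.

```python
import os
import resource
import signal
import tempfile


def write_past_limit(limit=1024, nbytes=4096):
    """Fork a child that writes nbytes to a file capped at limit bytes.

    Returns the child's raw wait status as seen by the parent.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    pid = os.fork()
    if pid == 0:
        # Child: only this process gets the lowered limit.
        try:
            signal.signal(signal.SIGXFSZ, signal.SIG_DFL)
            # Avoid leaving a core file behind when the signal arrives.
            resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
            _soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
            resource.setrlimit(resource.RLIMIT_FSIZE, (limit, hard))
            out = os.open(path, os.O_WRONLY)
            data = b"x" * nbytes
            while data:
                # A write may be cut short at the limit; keep going.
                written = os.write(out, data)
                data = data[written:]
        finally:
            os._exit(0)
    _pid, status = os.waitpid(pid, 0)
    os.unlink(path)
    return status


status = write_past_limit()
print(os.WIFSIGNALED(status), os.WTERMSIG(status) if os.WIFSIGNALED(status) else None)
```

If the parent reports that the child was signalled, the signal number can be compared with the 25 that zdaemon logged.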

Not sure how we can detect this problem more gracefully.  
Until today, I had never even heard of SIGXFSZ.

----------------------------------------------------------------------
