[Zope-dev] recipe for trapping SIGSEGV and SIGILL signals on solaris

Wed, 12 Dec 2001 11:00:49 -0500

Florent Guillaume wrote:

>>(gdb) print *((PyObject *) gc)->ob_type
>>$1 = {ob_refcnt = 18213696, ob_type = 0x2d70b0, ob_size = 0, 
>>  tp_name = 0x1 "T", tp_basicsize = 1328272, tp_itemsize = 4156348, 
>>  tp_dealloc = 0x125865c, tp_print = 0x3c1b04, tp_getattr = 0, tp_setattr = 0, 
>>  tp_compare = 0x29, tp_repr = 0x3adeb0, tp_as_number = 0xf66198, 
>>  tp_as_sequence = 0xdf3fa0, tp_as_mapping = 0x0, tp_hash = 0x1, 
>>  tp_call = 0x144490 <PyMethod_Type>, tp_str = 0x3f0a1c, 
>>  tp_getattro = 0x125865c, tp_setattro = 0x3c1b04, tp_as_buffer = 0x0,
>>  tp_flags = 158561192, tp_doc = 0x29 "", tp_traverse = 0x4c4f4144, 
>>  tp_clear = 0xd908c0, tp_richcompare = 0x1151300, tp_weaklistoffset = 0}
>>
>[...]
>
>>(gdb) x 0x4c4f4144
>>0x4c4f4144:     Cannot access memory at address 0x4c4f4144.
>>
>
>
>0x4c4f4144 is big-endian ascii for "LOAD". Things were corrupted
>before...
>
>
>Florent
>

Yes, the whole block is bad, so it probably isn't really a Python type 
object.  The refcount is a bit high, the name pointer is far too low 
(0x01!), the basicsize and itemsize are extremely large, and the compare 
and hash function pointers are too low to be real code addresses.  In 
other words, it isn't a type object.
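
Florent's reading of 0x4c4f4144 is easy to check by hand or with a few 
lines of C.  The following is only a sketch; word_to_ascii_be is a name 
I made up for illustration, and it simply prints the four bytes of a 
32-bit word most significant byte first, which is how a big-endian 
machine such as SPARC lays them out in memory.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: copy the four bytes of `word`, most significant
   byte first, into out[0..3] and NUL-terminate the result.
   Bytes outside the printable ASCII range are shown as '.'. */
void word_to_ascii_be(uint32_t word, char out[5])
{
    assert(out != 0);
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)(word >> (24 - 8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[4] = '\0';
}
```

Feeding it 0x4c4f4144 gives the bytes 0x4c 0x4f 0x41 0x44, i.e. "LOAD", 
so the tp_traverse slot holds text rather than a function pointer.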

So, I may have been telling him to get the wrong thing; the source code 
that he faulted in reads:

/* Subtract internal references from gc_refs */
static void
subtract_refs(PyGC_Head *containers)
{
        traverseproc traverse;
        PyGC_Head *gc = containers->gc_next;
        for (; gc != containers; gc=gc->gc_next) {
                /* The next line is the line that was active at the time of his fault */
                traverse = PyObject_FROM_GC(gc)->ob_type->tp_traverse;
                (void) traverse(PyObject_FROM_GC(gc),
                               (visitproc)visit_decref,
                               NULL);
        }
}

And PyObject_FROM_GC(gc) is either (gc) or ((PyObject *)(((PyGC_Head 
*)gc)+1)), depending on whether or not WITH_CYCLE_GC is defined.  I 
took the easy route and asked Joe to assume that the former was true.
If the latter is true, then the type object is shifted upwards in memory 
by three words; the new first three fields are gc_next, gc_prev, and 
gc_refs.
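
To make the "+1" arithmetic concrete, here is a small sketch using 
stand-in types.  These are NOT the real Python definitions: demo_object, 
demo_gc_head and DEMO_FROM_GC are names invented for illustration, and I 
am assuming a header of two pointers plus an int, as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for illustration only (assumed layout). */
typedef struct demo_object {
    long  ob_refcnt;
    void *ob_type;
} demo_object;

typedef struct demo_gc_head {
    struct demo_gc_head *gc_next;
    struct demo_gc_head *gc_prev;
    int                  gc_refs;
} demo_gc_head;

/* Same shape as the WITH_CYCLE_GC form of the macro: step over one
   whole GC header to reach the object that follows it. */
#define DEMO_FROM_GC(g) ((demo_object *)(((demo_gc_head *)(g)) + 1))

/* Distance in bytes from the GC header to the object it precedes. */
ptrdiff_t demo_from_gc_offset(demo_gc_head *g)
{
    assert(g != 0);
    return (char *)DEMO_FROM_GC(g) - (char *)g;
}

/* Size of the GC header measured in pointer-sized words. */
size_t demo_header_words(void)
{
    return sizeof(demo_gc_head) / sizeof(void *);
}
```

The offset is always sizeof(demo_gc_head), so a debugger told to treat 
the gc pointer itself as the object will read the three header words as 
if they were ob_refcnt, ob_type and ob_size.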

That means every value in the dump above is off by three fields 
(assuming the compiler adds no alignment padding), so the real type 
object would read:

gc_next = 0x115eb40
gc_prev = 0x2d70b0
gc_refs = 0
ob_refcnt = 0x1
ob_type = 0x144490 (which we actually know is <PyMethod_Type> -- yay)
ob_size = 0x3f6bbc (which is too large for my comfort)
tp_name = 0x125865c (valid pointer but we don't know what it is)
tp_basicsize=0x3c1b04 (seems high again, but is 0x350b8 less than ob_size)
tp_itemsize = 0
tp_dealloc = 0
tp_print = 0x29 (boo!)
tp_getattr = 0x3adeb0
tp_setattr = 0xf66198
tp_compare = 0xdf3fa0
tp_repr = 0
tp_as_number = 1 (boo!)
tp_as_sequence = 0x144490 <PyMethod_Type> (boo!)

etc...

Even shifting THESE values by 1 (assuming the compiler takes PyGC_Head, 
which is three words, and pads it up to 4 words for alignment) puts 
garbage values like 0x29 in tp_dealloc.
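
The re-reading above can be reproduced mechanically.  This is just a 
sketch: the words are copied from the gdb dump (decimal values converted 
to hex), the field names cover only the first fourteen slots, and 
field_at_shift and print_shifted are names I made up for the purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Words in the order gdb printed them, starting at the address it was
   told to treat as ob_refcnt. */
static const unsigned long dump_words[] = {
    0x115eb40, 0x2d70b0, 0x0, 0x1,      0x144490, 0x3f6bbc,
    0x125865c, 0x3c1b04, 0x0, 0x0,      0x29,     0x3adeb0,
    0xf66198,  0xdf3fa0, 0x0, 0x1,      0x144490
};

static const char *const field_names[] = {
    "ob_refcnt", "ob_type", "ob_size", "tp_name", "tp_basicsize",
    "tp_itemsize", "tp_dealloc", "tp_print", "tp_getattr", "tp_setattr",
    "tp_compare", "tp_repr", "tp_as_number", "tp_as_sequence"
};

#define N_WORDS  (sizeof dump_words / sizeof dump_words[0])
#define N_FIELDS (sizeof field_names / sizeof field_names[0])

/* Value that lands in field number `field` if the object really starts
   `shift` words after the address gdb was given. */
unsigned long field_at_shift(size_t field, size_t shift)
{
    assert(field + shift < N_WORDS);
    return dump_words[field + shift];
}

/* Print every named field for which the dump still has a word. */
void print_shifted(FILE *out, size_t shift)
{
    for (size_t f = 0; f < N_FIELDS && f + shift < N_WORDS; f++)
        fprintf(out, "%-14s = 0x%lx\n", field_names[f],
                field_at_shift(f, shift));
}
```

Calling print_shifted(stdout, 3) reproduces the list above starting at 
ob_refcnt, and print_shifted(stdout, 4) shows the padded case, where 
0x29 ends up in tp_dealloc.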

Ergo, I'm pretty confident that the gc pointer itself is bad.

If I were just a *wee* bit more familiar with how Solaris loads 
segments, I'd be able to glean some more information from the addresses 
(i.e. whether they are code or data segment pointers).  Normally I like 
seeing OSes use the high nybble or byte of an address as a segment 
number to make that sort of diagnosis easier.

It actually looks like page zero is MAPPED on Solaris (I didn't think it 
was), which in my book is a baaad thing since it means a null pointer CAN 
be dereferenced.