[Zope-dev] TALParser barfing on byte-order marked utf8 XML files.

9 Jul 2004

      Yo,

We are using TAL for things other than ZPT, but are having problems with 
files that start with a BOM (byte-order mark) preamble.

The problem is that although the underlying XML parser is capable of 
parsing these kinds of files, TALParser initialises its parent without 
an encoding (XMLParser.__init__(self) in TALParser.py, line 27).
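
For illustration, forwarding an encoding to the parent could look roughly 
like the sketch below. BaseParser and ChildParser are hypothetical 
stand-ins built directly on xml.parsers.expat; they are not the actual TAL 
classes, and the real XMLParser constructor may well differ.

```python
import xml.parsers.expat


class BaseParser(object):
    """Hypothetical stand-in for the parent class (not TAL's XMLParser)."""

    def __init__(self, encoding=None):
        # When given, expat uses this to override the document's own
        # encoding declaration.
        self.parser = xml.parsers.expat.ParserCreate(encoding)
        self.parser.CharacterDataHandler = self.handle_text
        self.chunks = []

    def handle_text(self, data):
        # expat may deliver character data in several pieces.
        self.chunks.append(data)

    def parse_bytes(self, raw):
        # Feed the whole byte string in one go (second argument: final chunk).
        self.parser.Parse(raw, True)
        return u"".join(self.chunks)


class ChildParser(BaseParser):
    """Hypothetical subclass that passes the encoding on to its parent."""

    def __init__(self, encoding=None):
        BaseParser.__init__(self, encoding)


text = ChildParser("utf-8").parse_bytes(b"<a>\xc3\xa4</a>")
```

Here the input is a byte string, so no implicit ASCII conversion takes 
place before the parser sees it.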

Anyway,
I have attached a small example (test.py + test.xml) that illustrates 
the problem with Zope 2.7.1.

Running the test gives:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in 
position 0: ordinal not in range(128)

This is perfectly logical: U+FEFF (the byte-order mark) is not an ASCII 
character.
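
A minimal sketch of what happens here, and of removing the BOM only when 
it is actually present. strip_bom is a hypothetical helper, not part of 
TAL, and the sample document is made up for the example.

```python
import codecs


def strip_bom(raw):
    """Drop a leading UTF-8 BOM from a byte string, if there is one."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Made-up sample: a UTF-8 document with a BOM in front.
raw = codecs.BOM_UTF8 + b"<a>ok</a>"

# The three BOM bytes decode to the single character U+FEFF.
text = raw.decode("utf-8")
first_char = text[0]

# Converting that character to ASCII is what fails in the traceback.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

clean = strip_bom(raw)
```

The BOM is three bytes in the file but a single character once decoded, 
so the slice needed depends on whether bytes or decoded text is being cut.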

Chipping away the preamble (data = data[4:]) gives problems further on in 
the file, as the test example contains some German characters (ä):

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in 
position 50: ordinal not in range(128)

This is also perfectly logical: ä is code point 228 (0xE4), which lies 
outside the ASCII range (0-127).
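
One possible workaround, sketched here with minidom rather than TAL, is 
to hand the parser UTF-8 encoded bytes instead of a unicode string, so 
that nothing gets converted through the ASCII default encoding. Whether 
TALParser.parseString behaves the same way with such input is an 
assumption I have not verified; the sample document is made up.

```python
import codecs
from xml.dom.minidom import parseString

# Made-up sample document containing a non-ASCII character (U+00E4).
text = u'<?xml version="1.0" encoding="utf-8"?><p>B\xe4r</p>'

# Encode explicitly; the parser then decodes the bytes itself.
raw = text.encode("utf-8")
doc = parseString(raw)
value = doc.documentElement.firstChild.data

# The same bytes with a BOM in front are also accepted by expat.
doc_with_bom = parseString(codecs.BOM_UTF8 + raw)
value_with_bom = doc_with_bom.documentElement.firstChild.data
```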

My question is simply: why is TALParser not taking the encoding into 
account? Is this deliberate, or is it an oversight?

Romain Slootmaekers.

#
#
#
from xml.dom.minidom import parseString

import sys
from TAL.TALParser import TALParser
from TAL.TALInterpreter import TALInterpreter
from TAL.DummyEngine import DummyEngine
import StringIO

import codecs

print sys.getdefaultencoding()

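def readData():
    # Read test.xml and decode it from UTF-8 into a unicode string.
    # The BOM, if present, survives as the first character (u'\ufeff').
    f = open('test.xml','r')

    readerClass = codecs.getreader('utf8')
    print readerClass
    reader = readerClass(f)
    data = reader.read()
    f.close()
    print "size = %s" % len(data)
    return data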

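def expand(xml):
    # Compile the template with TALParser and render it with a dummy engine.
    parser = TALParser()
    # Attempt to chip away the BOM preamble. Note: in the decoded string
    # the BOM is the single character u'\ufeff', so this slice also drops
    # the three characters that follow it.
    xml = xml[4:]
    parser.parseString(xml)
    program, macros = parser.getCode()
    engine = DummyEngine(0)
    out = StringIO.StringIO()
    interpreter = TALInterpreter(program, macros, engine, stream=out)
    interpreter()
    result = out.getvalue()

    return result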

data = readData()
expanded = expand(data)
document = parseString(expanded)

print "ok"