Most of this discussion is over my head. But there's one pretty basic misunderstanding exhibited:
I've got some stuff that's in strings, so I guess not unicode, but which is UTF-8 encoded, and I'm wondering how I make sure Zope does "the right thing" here. Are there any docs about?
and
Hmmm, that's interesting. I'd been planning on keeping everything as UTF-8 encoded strings rather than actual unicode. What leads you to suggest storing everything as unicode?
and
Finally, is ZCTextIndex compatible with either unicode or strings that contain UTF-8 encoding?
UTF-8 is one way of encoding the Unicode character set. They are not different things: when you use UTF-8, you are using Unicode. UTF-8 exists to allow systems to migrate gently, because it translates a very large character set into a format that will not normally break file systems which expect 8-bit character data. There are also 16-bit and 32-bit representations of Unicode (UTF-16 and UTF-32).

UTF-8's representation of ASCII is identical to ASCII's, so for applications which internally process only ASCII, the encoding is moot. But if you have user input you need to watch out: browsers on Windows and MacOS can submit UTF-8 in form input, and any non-ASCII character in that input arrives as a multi-byte sequence. This can seriously break old apps that assume one byte per character.

Best,
Mark Barratt
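
P.S. A minimal sketch of the points above, written for present-day Python 3. That is an assumption on my part: the Zope discussed in this thread most likely ran on an older Python whose string types behave differently. The sample strings and the helper name `encoded_sizes` are made up for illustration and are not part of Zope or ZCTextIndex.

```python
def encoded_sizes(text):
    """Return how many bytes `text` occupies in three Unicode encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants are used so that no byte-order mark is added.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


plain = "cafe"          # ASCII only
accented = "caf\u00e9"  # same word with an e-acute (U+00E9)

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# A single non-ASCII character becomes more than one byte in UTF-8,
# so the byte count no longer equals the character count.
accented_bytes = accented.encode("utf-8")

# Decoding the UTF-8 bytes gives back the original Unicode text.
round_trip = accented_bytes.decode("utf-8")

print(same_bytes)
print(len(accented), len(accented_bytes))
print(encoded_sizes(accented))
```

Code that counts or slices bytes while treating them as characters is the kind of old app that breaks on input like the second string.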