[Voiceglue] More on UTF-8 (Simple workaround)
emiliano esposito
emiespo at tiscali.it
Mon Dec 15 13:11:27 EST 2008
Doug Campbell wrote:
>> Why does this happen? Is it an OpenVXI flaw or something else?
>>
>
> This is an excellent question, and I'd like to find an
> answer. Unfortunately I'm very busy right now getting
> a new release of voiceglue out, but I have put this
> issue on the todo list, so hopefully I can get to it soon.
>
> Doug Campbell
I was just looking into this, and I think I found a workaround.
In voiceglue, line 990:
## For some reason, OpenVXI is passing \x90 and \x902x for spaces
$xml_text =~ s/\x902x/ /g;
$xml_text =~ s/\x90/ /g;
I've read a bit about Perl regular expressions and UTF-8 (Wikipedia and here:
http://www.xav.com/perl/lib/Pod/perlre.html), and eventually I dumped
the $xml_text variable out to a file (I learned just enough Perl to
do this :-)). The result is attached (open it with a hex editor).
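For anyone who wants to look at the bytes without a hex editor, here is a minimal sketch of a hex dump in Perl. The helper name hex_dump and the sample string are hypothetical; in voiceglue the input would be $xml_text, and the sample bytes are simply the excerpt quoted in this message.

```perl
use strict;
use warnings;

# Hypothetical helper: render every byte of a string as two hex digits.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Sample data only: 0xC3, literal "02x", 0xA8, literal "02x".
my $sample = "\xC302x\xA802x";
print hex_dump($sample), "\n";
```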
Notice this sequence: C3 30 32 78 A8 30 32 78... The missing character
is "C3A8" (the UTF-8 encoding of "è"), and each of its two bytes sits right
before a bizarre "02x" sequence (30 32 78). So I don't think that 0x90
followed by "2x" (which is what \x902x matches in the regex) stands for
spaces, since the other spaces are encoded normally as 0x20.
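One way to check how these byte sequences behave as UTF-8 is to ask the core Encode module to decode them strictly. This is only a sketch: the helper name is_valid_utf8 is hypothetical, and the three test strings are the clean "C3A8" pair, the damaged excerpt from the dump, and a lone 0x90 for comparison.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: 1 if the byte string is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my @cases = (
    [ 'C3 A8',                   "\xC3\xA8" ],
    [ 'C3 30 32 78 A8 30 32 78', "\xC302x\xA802x" ],
    [ '90',                      "\x90" ],
);

for my $case (@cases) {
    my ($label, $bytes) = @$case;
    my $verdict = is_valid_utf8($bytes) ? 'valid' : 'invalid';
    print "$label: $verdict UTF-8\n";
}
```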
They don't look like anything known to me :). They are not UTF-8 (since you
can't start a sequence with 0x90), nor UTF-16 or UTF-32, since there are both
3- and 4-byte sequences. I also thought of an endianness problem, but that
doesn't seem to be the case. Finally, I simply tried to remove them with:
## same as before: we look for 0x90 followed by "2x"
$xml_text =~ s/\x902x//g;
## now we look for any non-ASCII character followed by "02x"
$xml_text =~ s/(\P{IsASCII})02x/$1/g;
And now it works... but I don't know what sort of encoding that is or
how it gets in.
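To see the effect outside voiceglue, the two substitutions can be tried on a small sample. This is a sketch under assumptions: the wrapper name strip_02x and the sample text are made up for illustration, and only the two regexes come from the workaround itself.

```perl
use strict;
use warnings;

# Hypothetical wrapper: apply both substitutions to a copy of the text.
sub strip_02x {
    my ($text) = @_;
    $text =~ s/\x902x//g;                 # 0x90 followed by "2x"
    $text =~ s/(\P{IsASCII})02x/$1/g;     # any non-ASCII character followed by "02x"
    return $text;
}

# Sample only: "perch", then C3 and A8 each followed by a stray "02x".
my $damaged  = "perch\xC302x\xA802x no";
my $repaired = strip_02x($damaged);

print $repaired eq "perch\xC3\xA8 no" ? "repaired\n" : "still damaged\n";
```

Plain ASCII text is left alone, because the second pattern only fires when the character right before "02x" is non-ASCII.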
With valid UTF-8 input, the substitutions above should give no false
positives, since we expect OpenVXI to have split the encoded bytes like that.
I have no idea what could happen with other languages/encodings, though:
text that for some reason has a literal "02x" right after a non-English
character would lose it. That doesn't make much sense to me, but you never
know :-)))
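The unlikely case can be reproduced in a few lines. This sketch uses a made-up input in which the "02x" is real text that happens to follow a correctly encoded "è"; nothing here comes from actual voiceglue data.

```perl
use strict;
use warnings;

# Hypothetical input: valid UTF-8 for "è" (C3 A8), then a literal "02x"
# that genuinely belongs to the text.
my $legit = "\xC3\xA802x";

# Same second substitution as in the workaround, applied to a copy.
(my $after = $legit) =~ s/(\P{IsASCII})02x/$1/g;

# The literal "02x" is removed even though the input was never damaged.
printf "before: %d bytes, after: %d bytes\n", length($legit), length($after);
```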
It would be better to further investigate this and get rid of that
substitution, but I can't do it now, maybe later.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dump.txt
Url: http://www.voiceglue.org/pipermail/voiceglue/attachments/20081215/5e383696/attachment.txt