[Voiceglue] More on UTF-8 (Simple workaround)

emiliano esposito emiespo at tiscali.it
Mon Dec 15 13:11:27 EST 2008


Doug Campbell wrote:
>> Why does this happen? Is it an OpenVXI flaw or something else?
>>     
>
> This is an excellent question, and I'd like to find an
> answer.  Unfortunately I'm very busy right now getting
> a new release of voiceglue out, but I have put this
> issue on the todo list, so hopefully I can get to it soon.
>
> Doug Campbell

I have just been investigating this, and I think I have found a workaround.

In voiceglue, line 990:

    ##  For some reason, OpenVXI is passing \x90 and \x902x for spaces
    $xml_text =~ s/\x902x/ /g;
    $xml_text =~ s/\x90/ /g;

I read up a bit on Perl regular expressions and UTF-8 (Wikipedia and
http://www.xav.com/perl/lib/Pod/perlre.html), and eventually dumped the
$xml_text variable out to a file (I learned just enough Perl to do
that :-)). The result is attached (open it with a hex editor).
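
For anyone who wants to reproduce the dump, here is a minimal sketch of
how it could be done. The helper names hex_bytes and dump_raw are
hypothetical and are not part of voiceglue, and the sample bytes are
simply the sequence quoted in this message, not the full attachment.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of voiceglue): render a byte string as
# space-separated two-digit hex values, like a hex editor would show it.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

# Hypothetical helper: write the raw bytes to a file, with no encoding
# layer in between, so the file can be inspected with a hex editor.
sub dump_raw {
    my ($path, $bytes) = @_;
    open(my $fh, '>:raw', $path) or die "Cannot open $path: $!";
    print {$fh} $bytes;
    close($fh) or die "Cannot close $path: $!";
}

# The byte sequence quoted in this message (assumed sample data).
my $sample = "\xC3" . "02x" . "\xA8" . "02x";
print hex_bytes($sample), "\n";    # prints: C3 30 32 78 A8 30 32 78
```

Inside voiceglue, something like dump_raw("/tmp/dump.txt", $xml_text)
placed just before the substitutions would presumably produce a
comparable file; the path is only an example.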

Notice this sequence: C3 30 32 78 A8 30 32 78... The missing character
is C3 A8 (the UTF-8 encoding of "è"), and each of its two bytes is
immediately followed by the bizarre "02x" sequence (30 32 78). So I
don't think that 0x90 followed by "2x" (bytes 90 32 78, which is what
\x902x matches in the regex) stands for a space, since the other spaces
are encoded normally as 0x20.

These sequences don't look like any encoding known to me :). They are
not UTF-8 (since 0x90 is a continuation byte and can't start a
sequence), nor UTF-16 or UTF-32, since there are both 3- and 4-byte
sequences. I also considered an endianness problem, but that doesn't
seem to be the case. Finally, I simply tried removing them with:

    ## same as before: we look for 0x90 followed by "2x"
    $xml_text =~ s/\x902x//g;
    ## now we look for any non-ASCII character followed by "02x"
    $xml_text =~ s/(\P{IsASCII})02x/$1/g;

And now it works... but I don't know what sort of encoding that is or
how it gets in.
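
To make the effect concrete, here is a small sketch that applies the
two substitutions to the byte sequence quoted earlier. The wrapper name
strip_stray_sequences is hypothetical, and the input is only the short
sample from this message, so this shows the idea rather than the exact
behaviour inside voiceglue.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper around the two substitutions from this message.
sub strip_stray_sequences {
    my ($xml_text) = @_;
    # Remove 0x90 followed by "2x".
    $xml_text =~ s/\x902x//g;
    # Remove "02x" when it directly follows a non-ASCII character.
    $xml_text =~ s/(\P{IsASCII})02x/$1/g;
    return $xml_text;
}

# C3 30 32 78 A8 30 32 78, as seen in the dump.
my $broken = "\xC3" . "02x" . "\xA8" . "02x";
my $fixed  = strip_stray_sequences($broken);

# Show the repaired bytes, then the code point they decode to as UTF-8.
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $fixed)), "\n";  # C3 A8
printf "U+%04X\n", ord(decode('UTF-8', $fixed));                            # U+00E8
```

The two bytes of the character end up adjacent again, which is why the
text decodes correctly afterwards.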

With valid UTF-8 input, the substitutions above should give no false
positives, since we expect OpenVXI to have split the encoded bytes in
exactly that way. I have no idea what could happen with other
languages/encodings, though, or if for some reason you have a literal
"02x" right after a non-English character. That doesn't make much sense
to me, but you never know :-)))
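
The unlikely case mentioned above can be sketched as follows. The
helper name count_suspect_sequences is hypothetical; it only counts
where the second substitution would fire, so the input could be checked
or logged before anything is removed.

```perl
use strict;
use warnings;

# Hypothetical helper: count the places where the second substitution
# (a non-ASCII character followed by "02x") would fire.
sub count_suspect_sequences {
    my ($text) = @_;
    my $count = () = $text =~ /\P{IsASCII}02x/g;
    return $count;
}

# Damaged input as in the dump: "02x" injected after each byte of C3 A8.
my $damaged = "\xC3" . "02x" . "\xA8" . "02x";

# Intact input: a real C3 A8 that happens to be followed by a literal "02x".
# The substitution cannot tell this apart from damage and would eat the "02x".
my $intact = "\xC3\xA8" . "02x";

# "02x" after an ASCII character (for example in "%02x") is never matched.
my $ascii = '%02x';

printf "damaged: %d, intact: %d, ascii: %d\n",
    count_suspect_sequences($damaged),
    count_suspect_sequences($intact),
    count_suspect_sequences($ascii);
# prints: damaged: 2, intact: 1, ascii: 0
```

So the false positive is limited to a literal "02x" placed directly
after a non-ASCII character, which matches the reasoning above.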

It would be better to investigate this further and get rid of that
substitution altogether, but I can't do it now; maybe later.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dump.txt
Url: http://www.voiceglue.org/pipermail/voiceglue/attachments/20081215/5e383696/attachment.txt 

