[Voiceglue] More on UTF-8

Carlos Alarcón carlos.alarcon at tyven.com
Wed Jan 21 06:39:56 EST 2009


Hi,
Browsing the mailing list, I see that Emiliano was working on getting voiceglue to run with festival using ISO-8859-1 encoding.
I am also using festival, since it supports Spanish, so I will also need ISO-8859-1 encoding (festival does not seem to like anything else).
Does the workaround solve the problem? That is, after applying the workaround, does voiceglue pass the proper ISO-8859-1 characters to voiceglue_gen_tts?

regards



Doug Campbell wrote:
>> Why does this happen? Is it an OpenVXI flaw or something else?
>>
>
> This is an excellent question, and I'd like to find an
> answer. Unfortunately I'm very busy right now getting
> a new release of voiceglue out, but I have put this
> issue on the todo list, so hopefully I can get to it soon.
>
> Doug Campbell

I was investigating this just now. I think I found a workaround.

In voiceglue, line 990:

## For some reason, OpenVXI is passing \x90 and \x902x for spaces
$xml_text =~ s/\x902x/ /g;
$xml_text =~ s/\x90/ /g;

I've read a bit about Perl regular expressions and UTF-8 (Wikipedia and here:
http://www.xav.com/perl/lib/Pod/perlre.html), and I eventually dumped
the $xml_text variable out to a file (I learned just enough Perl to
do this :-)). The result is attached (open it with a hex editor).

Notice this sequence: C3 30 32 78 A8 30 32 78... The missing character
is "C3A8", and each of its two bytes comes right before the bizarre "02x"
sequence. So I don't think that 0x90 0x32 0x78 (=\x902x in the regex) stands
for spaces (the other spaces are coded normally as "0x20").
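
As a quick check of how these bytes fit into UTF-8, here is a minimal sketch. The helper name utf8_byte_role is made up for illustration and is not part of voiceglue. It classifies a byte by its high bits: C3 comes out as a lead byte and A8 as a continuation byte, so C3 A8 is a well-formed two-byte character, while 0x90 can only ever be a continuation byte.

```perl
use strict;
use warnings;

# Hypothetical helper (not voiceglue code): classify one byte value
# by the role it can play in a UTF-8 sequence.
sub utf8_byte_role {
    my ($byte) = @_;
    return 'ascii'        if $byte <= 0x7F;                   # 0xxxxxxx
    return 'continuation' if ($byte & 0xC0) == 0x80;          # 10xxxxxx
    return 'lead'         if $byte >= 0xC2 && $byte <= 0xF4;  # starts a sequence
    return 'invalid';                                         # C0, C1, F5..FF
}

# Bytes taken from the dump and from the original regex.
printf "0x%02X: %s\n", $_, utf8_byte_role($_) for 0xC3, 0xA8, 0x90, 0x20;
```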

They don't look like anything known to me :). They are not UTF-8 (since you
can't start a sequence with 0x90), nor UTF-16 or UTF-32, since there are both
3- and 4-byte sequences. I also thought of an endianness problem, but that
doesn't seem to be the case. Finally, I simply tried to remove them with:

## same as before: we look for 0x90 followed by "2x"
$xml_text =~ s/\x902x//g;
## now we look for any non-ASCII character followed by "02x"
$xml_text =~ s/(\P{IsASCII})02x/$1/g;

And now it works... but I don't know what sort of encoding that is or
how it gets in.
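
For illustration, here is a minimal standalone sketch of the second substitution applied to the byte pattern from the dump. The sample string and the variable names $cleaned, $hex and $chars are assumptions made up for this example; only the regex itself comes from the workaround above.

```perl
use strict;
use warnings;

# Hypothetical sample: ASCII text followed by the bytes seen in the
# dump, i.e. C3 "02x" A8 "02x".
my $xml_text = "perch\xC302x\xA802x";

# Drop the stray "02x" that follows any non-ASCII byte.
(my $cleaned = $xml_text) =~ s/(\P{IsASCII})02x/$1/g;

# Print the result as hex so the repaired C3 A8 pair is visible.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $cleaned;
print "$hex\n";

# Check on a copy whether the cleaned bytes are well-formed UTF-8.
my $chars = $cleaned;
my $valid = utf8::decode($chars);
print 'valid UTF-8: ', ($valid ? 'yes' : 'no'), "\n";
```

On this sample the two non-ASCII bytes end up adjacent again (C3 A8), which is what a UTF-8 decoder expects. It says nothing about where the "02x" comes from in the first place.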

With valid UTF-8 input, the substitution above should give no false positives,
since we expect that OpenVXI has split the encoded bytes like that. I have no
idea what could happen with other languages/encodings if, for some reason,
you have a literal "02x" right after a non-English character. That doesn't
make much sense to me, but you never know :-)))

It would be better to investigate this further and get rid of that
substitution, but I can't do it now; maybe later.

