[Clam-devel] Addressing foreign text encodings: a call for testing

David García Garzón dgarcia at iua.upf.edu
Mon Jul 7 04:24:36 PDT 2008


Oops, we were already taking option 2: using the local 8-bit encoding
internally and converting from and to utf8 when reading and writing XML. It
was just implemented only in the Xerces XML backend (the Linux default). For
the libxml++ backend (the one that works on Windows) no encoding conversion
was taking place.
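
A minimal sketch of that failure mode, in Python rather than CLAM's C++. It
assumes GBK as the local 8-bit encoding, which is only a guess (the real code
page depends on the Windows locale), and the helper parses_as_utf8 is
hypothetical, not part of CLAM, Xerces or libxml++. It shows local bytes being
rejected when read as utf8, which is the kind of error a strict XML parser
reports for a file written without conversion.

```python
# Assumption: GBK stands in for the local 8-bit encoding of the tester's system.
filename = "朋克punk.wav"

local_bytes = filename.encode("gbk")    # what the application holds internally
utf8_bytes = filename.encode("utf-8")   # what the XML file is supposed to contain


def parses_as_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Local bytes written to the XML unconverted are not valid utf8,
# while the properly converted bytes are.
print(parses_as_utf8(local_bytes))
print(parses_as_utf8(utf8_bytes))
```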

I just fixed that: the libxml++ backend now does the same conversions as the
Xerces one. It requires recompiling CLAM, Jun, so I will patiently wait for
your testing. ;-)
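
The option 2 round trip can be sketched like this, again in Python and again
assuming GBK as the local encoding. The names LOCAL_ENCODING, local_to_xml and
xml_to_local are made up for illustration; they are not the functions used in
the CLAM backends.

```python
# Assumption: stand-in for the platform's local 8-bit encoding.
LOCAL_ENCODING = "gbk"


def local_to_xml(local_bytes: bytes) -> bytes:
    """Convert internal local 8-bit bytes to the UTF-8 bytes stored in XML."""
    return local_bytes.decode(LOCAL_ENCODING).encode("utf-8")


def xml_to_local(utf8_bytes: bytes) -> bytes:
    """Convert UTF-8 bytes read from XML back to internal local 8-bit bytes."""
    return utf8_bytes.decode("utf-8").encode(LOCAL_ENCODING)


internal = "../朋克punk.wav".encode(LOCAL_ENCODING)  # value held in memory
stored = local_to_xml(internal)                      # bytes written on save
restored = xml_to_local(stored)                      # bytes obtained on load
```

One limitation of this scheme: a network saved on one machine may contain
characters that the local encoding of another machine cannot represent, and
in this sketch xml_to_local would then raise UnicodeEncodeError.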



On Monday 07 July 2008, David García Garzón wrote:
> That highlights the problem I expected, but a crash was definitely not the
> symptom I expected. When reporting crashes it is important to paste, at
> least, the console output. It could be the error we expected, but it could
> be a different one.
>
> If i load the xml you sent i get, not a crash but an error message stating
> that:
> An occurred while loading the network file.
> XML Parser Errors: Fatal Error at file CLAMParser, line 1, col 39: An
> exception occurred! Type:UTFDataFormatException, Message:invalid byte 2
> (�) of a 2-byte sequence
>
> This is not a crash, but an expected error message. A crash is never a good
> behaviour, an error message is.
>
> From the XML you sent it is clear that the encoding has been mishandled.
> So we have to deal with three encodings: the Qt internal encoding, the
> local 8-bit encoding and utf8.
>
> We have two options:
> - Using utf8 as CLAM's internal encoding. In that case we should convert to
> local 8 bits whenever we pass a string to the C standard library (I guess
> that includes filenames used in Xerces, libxml++, libmad...).
> - Using the local encoding and doing the conversion whenever we store it
> in XML.
>
> The latter is simpler to implement, as we would only need to modify the XML
> formatting, but my few experiences with unicode tell me that using utf8 as
> the lingua franca inside the application is a good option, as I think (not
> sure) that other 8-bit encodings might not be C standard library safe (they
> might use the 0 byte for more than marking the end of a string, for example).
>
> Any opinions? Any unicode experiences?
>
> On Monday 07 July 2008, JunJun wrote:
> > - Recompile the last svn revision of the NetworkEditor
> >    QTDIR=D:/qt/4.3.3/ scons clam_prefix=d:/mingw/local prefix=d:/mingw/local external_dll_path=d:/mingw/local/bin
> >   An unexpected error was shown: "can not locate the program input point
> >   _ZN4...SsEE on the clam_core.dll". I fixed it by copying
> >   NetworkEditor.exe into /mingw/local/lib.
> >
> > - Open it and drop a MonoAudioFileReader into the canvas
> > - Configure it to take a file which has some special characters in your
> > language
> >   ../朋克punk.wav
> > - If the processing is still in red after clicking ok, you got the bug,
> > report
> >   It's not in red after clicking "ok".
> > - Open the configuration dialog again; if the special symbol is being
> > displayed wrongly, report
> >   It displays just fine.
> > - Accept the configuration again, if now red, report
> >   Still no problem.
> > - Save the network and load it again, if now red, report
> >   When I load it again, the NetworkEditor crashes!!
> > - Configure the processing, if the symbol now looks bad, report
> >   TBD...
> > - In any case, send me the network file so i can check the file encoding.
> > If you open it with an encoding aware editor, the symbols should look
> > well in utf8 mode.
> >   No, I think it doesn't look well in utf8 mode.
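
On the 0-byte worry in the quoted message, here is a small check for the
sample filename only; it proves nothing about encodings in general. It
assumes GBK as the local 8-bit encoding and adds UTF-16 purely for contrast
(it is not an 8-bit encoding). The helper has_embedded_nul is hypothetical.

```python
# Sample filename taken from the test report quoted above.
name = "朋克punk.wav"


def has_embedded_nul(text: str, encoding: str) -> bool:
    """Return True if encoding the text yields any 0x00 byte."""
    return b"\x00" in text.encode(encoding)


# utf-8 only produces a 0x00 byte for the character U+0000 itself.
# utf-16-le pads every ASCII character with a 0x00 byte, so a C string
# function would stop at the first one.
for encoding in ("utf-8", "gbk", "utf-16-le"):
    print(encoding, has_embedded_nul(name, encoding))
```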



-- 
David García Garzón
(Work) dgarcia at iua dot upf anotherdot es
http://www.iua.upf.edu/~dgarcia