This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.


writing robust to-unicode converter with iconv


Hi,

I am writing the Unicode-capable console for the GNU/Hurd, and use iconv to
transparently provide support for legacy external encodings.  This works
extremely well, thanks for that!

However, I am stumbling a bit over the handling of malformed sequences.
They are easily produced by applications that are not aware of the encoding
in use, so their handling should be robust and strict.

The file UTF-8-test.txt by Markus Kuhn has a
"UTF-8 decoder capability and stress test", which gives some guidelines and
also quotes some requirements from ISO 10646-1 for how to handle malformed
sequences in UTF-8 encodings.  In particular, malformed sequences should be
transformed into replacement characters (U+FFFD) or something equivalent.

iconv fails with EILSEQ when it encounters invalid or malformed input, so I
can detect such a situation.  Now my question:

What is the best way to deal with an EILSEQ, considering the above
requirements?  The only way I see so far is to write a handler that
implements the recommendations and knows intimately how malformed
sequences should be translated into replacement characters.  The main
question is how many of the not yet decoded bytes in the input buffer (i.e.,
starting from where the error occurred) should be replaced by one replacement
character.  For UTF-8, the answer is given in Kuhn's test guide.  For other
encodings I don't even have an answer, and in any case this seems to
duplicate a lot of what iconv already does for me in the non-error case.
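
To make the UTF-8 case concrete, here is a minimal sketch of such a length
rule.  It is only one reading of the test file (each stray continuation byte
and each 0xFE/0xFF byte counts on its own; a lead byte is grouped with the
continuation bytes that actually follow it, up to the length the lead byte
announces).  The function name utf8_malformed_len is hypothetical and not
part of iconv or glibc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of iconv.  P points at the byte where the
   decoder reported the error and N > 0 undecoded bytes are available.
   Returns how many bytes to cover with a single replacement character.  */
size_t
utf8_malformed_len (const unsigned char *p, size_t n)
{
  size_t expect, len;

  assert (n > 0);

  /* Stray continuation bytes (0x80..0xBF) and 0xFE/0xFF stand alone.
     ASCII bytes should never be reported as errors; treat them alike.  */
  if (p[0] < 0xC0 || p[0] >= 0xFE)
    return 1;

  /* Sequence length announced by the lead byte, including the old
     5- and 6-byte forms that the test file also exercises.  */
  if (p[0] < 0xE0)
    expect = 2;
  else if (p[0] < 0xF0)
    expect = 3;
  else if (p[0] < 0xF8)
    expect = 4;
  else if (p[0] < 0xFC)
    expect = 5;
  else
    expect = 6;

  /* Take the lead byte plus the continuation bytes that really follow.  */
  for (len = 1; len < expect && len < n && (p[len] & 0xC0) == 0x80; len++)
    ;
  return len;
}
```

With this rule the truncated sequence 0xE2 0x82 followed by 'a' yields a
length of 2, so it would be covered by one replacement character instead of
two.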

The simplest way seems to me to just output one replacement character and
skip one byte in the input buffer.  That is what I will probably do if there
is no better way.  In UTF-8, this will output too many replacement characters
for sequences that are just too short, but that seems to be a small price to
pay.
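
The skip-one-byte approach could look roughly like the sketch below.  It
assumes a conversion descriptor opened with iconv_open ("WCHAR_T", ...), and
that wchar_t can hold the value 0xFFFD, as it can with glibc.  The function
name convert_lossy is hypothetical.  Stateful encodings may need more care
than is shown here.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper.  Converts INLEN bytes at IN with descriptor CD into
   OUT, which has room for OUTCAP wide characters.  Every byte that iconv
   rejects with EILSEQ is replaced by one U+FFFD.  Returns the number of
   wide characters written.  */
size_t
convert_lossy (iconv_t cd, const char *in, size_t inlen,
               wchar_t *out, size_t outcap)
{
  char *inp = (char *) in;      /* iconv takes char **, but does not
                                   modify the input bytes */
  char *outp = (char *) out;
  size_t outleft = outcap * sizeof (wchar_t);

  while (inlen > 0)
    {
      if (iconv (cd, &inp, &inlen, &outp, &outleft) != (size_t) -1)
        break;                  /* all remaining input was converted */

      if (errno != EILSEQ)
        break;                  /* E2BIG or EINVAL: leave to the caller */

      if (outleft < sizeof (wchar_t))
        break;                  /* no room for the replacement character */

      /* Emit one replacement character and skip one input byte.  */
      wchar_t repl = 0xFFFD;
      memcpy (outp, &repl, sizeof repl);
      outp += sizeof repl;
      outleft -= sizeof repl;
      inp++;
      inlen--;
    }

  return (outcap * sizeof (wchar_t) - outleft) / sizeof (wchar_t);
}
```

For the UTF-8 input 'a', 0xFF, 'b' this produces 'a', U+FFFD, 'b'.  An
incomplete sequence at the very end of the buffer is reported by iconv as
EINVAL rather than EILSEQ, so the sketch stops there and leaves those bytes
for the caller to handle once more input arrives.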

An alternative would be to have special encoding converters in iconv that
automatically insert replacement characters when going from any encoding to
WCHAR_T.  Is that feasible?

If anybody has pointers on how to properly deal with conversion errors, I
would like to see some example code etc. (as long as it is free software).

Thanks,
Marcus

-- 
`Rhubarb is no Egyptian god.' GNU      http://www.gnu.org    marcus@gnu.org
Marcus Brinkmann              The Hurd http://www.gnu.org/software/hurd/
Marcus.Brinkmann@ruhr-uni-bochum.de
http://www.marcus-brinkmann.de/

