This is the mail archive of the docbook-apps@lists.oasis-open.org mailing list.
Re: Choosing a characterset for DocBook
Christopher R. Maden wrote at 15 Mar 2002 02:06:47 -0800:
> The parser obviously is not aware that you have chosen ISO 8859-1. That is
> the expected error message if an 8859-1 document contains any high bytes
> (128+) and the parser is trying to parse it as UTF-8.
>
> 1) Do all of your entities (i.e., files) have encoding declarations? What
> are they? Remember that UTF-8 is the default unless you explicitly specify
> a different encoding (or use a byte-order mark, in which case UTF-16 is the
> default).
Strictly speaking, it's "or use UTF-16 with a byte-order mark", since
you can have a byte-order mark with UTF-8.
UTF-16 without a byte-order mark (BOM) can be mistaken for a number of
other encodings, hence you need the BOM if you're omitting the
encoding declaration. UTF-16 without the BOM, and each of those
other encodings, needs the encoding declaration so the XML
processor can determine the encoding. UTF-16 with both the BOM and an
encoding declaration is okay, too.
8-bit text without an encoding declaration is expected to be UTF-8.
Hence, if the text isn't UTF-8, you need the encoding declaration.
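A minimal sketch, assuming Python 3, of why an undeclared ISO 8859-1
file trips a parser that defaults to UTF-8: a single high byte (128+) is
not a well-formed UTF-8 sequence. The helper name decodes_as_utf8 and
the sample string are hypothetical, chosen only for illustration.

```python
# Hypothetical sample text: "cafe" with an e-acute (U+00E9) as the last character.
SAMPLE = "caf\u00e9"

def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# In ISO 8859-1 the e-acute is the single byte 0xE9, which is a high byte.
latin1_bytes = SAMPLE.encode("iso-8859-1")
# In UTF-8 the same character is the two-byte sequence 0xC3 0xA9.
utf8_bytes = SAMPLE.encode("utf-8")

print(decodes_as_utf8(latin1_bytes))  # the 8859-1 bytes read as UTF-8
print(decodes_as_utf8(utf8_bytes))    # the UTF-8 bytes read as UTF-8
```

Pure ASCII text passes either way, which is why the problem only shows
up once a document contains a high byte.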
UTF-8 text with the BOM (EF BB BF) and without an encoding declaration
should be recognised as UTF-8. However, using the BOM with UTF-8
wasn't mentioned in the Unicode Standard, Version 2.0 (which was
current when XML 1.0 was published), so some early XML processors
weren't designed to recognise the UTF-8 BOM. The UTF-8 BOM was not
mentioned in Appendix F of XML 1.0, but is mentioned in Appendix F of
XML 1.0 Second Edition (and was mentioned in the version of ISO/IEC
10646 current when XML 1.0 was published).
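The rules above can be sketched roughly as follows, assuming Python 3.
The function sniff_encoding is a hypothetical helper, not the detection
algorithm of any particular XML processor; it looks only at the BOM and
ignores the encoding declaration, and it does not cover UTF-32 or the
other cases a real processor would need to handle.

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity that has no encoding declaration.

    Illustrative only: checks for a BOM and otherwise falls back to the
    UTF-8 default. UTF-32 and BOM-less UTF-16 are out of scope here.
    """
    # UTF-8 BOM: EF BB BF
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    # UTF-16 BOM: FE FF (big-endian) or FF FE (little-endian)
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    # No BOM and no declaration: the text is expected to be UTF-8.
    return "UTF-8"

print(sniff_encoding(codecs.BOM_UTF8 + b"<doc/>"))
print(sniff_encoding(codecs.BOM_UTF16_LE + "<doc/>".encode("utf-16-le")))
print(sniff_encoding(b"<doc/>"))
```

An early processor of the kind described above would in effect lack
the first check and treat EF BB BF as ordinary document content.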
Regards,
Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin mailto:tony.graham@sun.com
Sun Microsystems Ireland Ltd Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3 x(70)19708