This is the mail archive of the docbook-apps@lists.oasis-open.org mailing list.

Re: Choosing a characterset for DocBook

From: Tony Graham <Tony dot Graham at Sun dot COM>
To: docbook-apps at lists dot oasis-open dot org
Date: Fri, 15 Mar 2002 11:52:58 +0000
Subject: Re: DOCBOOK-APPS: Choosing a characterset for DocBook
References: <Pine.LNX.4.44.0203151035570.2207-100000@ares.ddre.dk><5.1.0.14.0.20020315020409.038e95c0@mail.maden.org>

Christopher R. Maden wrote at 15 Mar 2002 02:06:47 -0800:
 > The parser obviously is not aware that you have chosen ISO 8859-1.  That is 
 > the expected error message if an 8859-1 document contains any high bytes 
 > (128+) and the parser is trying to parse it as UTF-8.
 > 
 > 1) Do all of your entities (i.e., files) have encoding declarations?  What 
 > are they?  Remember that UTF-8 is the default unless you explicitly specify 
 > a different encoding (or use a byte-order mark, in which case UTF-16 is the 
 > default).

Strictly speaking, it's "or use UTF-16 with a byte-order mark", since
you can have a byte-order mark with UTF-8.

UTF-16 without a byte-order mark (BOM) can be mistaken for a number of
other encodings, hence you need the BOM if you're omitting the
encoding declaration.  UTF-16 without the BOM, like each of those
other encodings, needs an encoding declaration so the XML processor
can determine the encoding.  UTF-16 with both the BOM and an
encoding declaration is okay, too.
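
As a rough illustration of the BOM check described above, here is a
minimal Python sketch.  The function name utf16_byte_order and the
sample document are made up for this example, and the check is far
simpler than what a real XML processor does (it ignores UCS-4, for
instance).

```python
import codecs

def utf16_byte_order(data: bytes):
    """Return the UTF-16 flavour signalled by a leading BOM, or None.

    Hypothetical helper for illustration only; it does not handle
    UCS-4/UTF-32, whose little-endian BOM also begins with FF FE.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return None                                # no BOM: byte order unknown

doc = "<doc/>"
with_bom = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")
without_bom = doc.encode("utf-16-le")

print(utf16_byte_order(with_bom))      # BOM present, byte order is known
print(utf16_byte_order(without_bom))   # no BOM, so a declaration is needed
```

Without the BOM the function has nothing to go on, which is the case
where the encoding declaration becomes necessary.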

8-bit text without an encoding declaration is expected to be UTF-8.
Hence, if the text isn't UTF-8, you need the encoding declaration.
UTF-8 text with the BOM (EF BB BF) and without an encoding declaration
should be recognised as UTF-8.  However, using the BOM with UTF-8
wasn't mentioned in the Unicode Standard, Version 2.0 (which was
current when XML 1.0 was published), so some early XML processors
weren't designed to recognise the UTF-8 BOM.  The UTF-8 BOM was not
mentioned in Appendix F of XML 1.0, but is mentioned in Appendix F of
XML 1.0 Second Edition (and was mentioned in the version of ISO/IEC
10646 current when XML 1.0 was published).
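
To make the 8-bit case concrete, here is a small Python sketch of the
behaviour described above, assuming the text carries no encoding
declaration.  The function name decode_without_declaration and the
sample strings are invented for this example; it is not how any
particular XML processor is implemented.

```python
import codecs

def decode_without_declaration(data: bytes) -> str:
    """Decode 8-bit text that has no encoding declaration as UTF-8.

    Hypothetical helper: strips an optional UTF-8 BOM (EF BB BF),
    then decodes strictly, so non-UTF-8 input raises an error.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

text = "<para>caf\u00e9</para>"          # contains one non-ASCII character
utf8_bom_bytes = codecs.BOM_UTF8 + text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")  # the e-acute becomes the high byte E9

print(decode_without_declaration(utf8_bom_bytes) == text)

try:
    decode_without_declaration(latin1_bytes)
except UnicodeDecodeError as err:
    # The high byte is not valid UTF-8, so the default assumption fails.
    print("not UTF-8:", err.reason)
```

The second call fails for the reason Christopher gave: an 8859-1 byte
of 128 or above is being read as if it were UTF-8.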

Regards,


Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin                mailto:tony.graham@sun.com
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708
