This is the mail archive of the docbook@lists.oasis-open.org mailing list for the DocBook project.

Re: Language code for FPIs -- not explicit enough?

To: docbook at lists dot oasis-open dot org
Subject: Re: DOCBOOK: Language code for FPIs -- not explicit enough?
From: Tony Graham <tgraham at mulberrytech dot com>
Date: Mon, 27 Sep 1999 12:42:58 -0400 (EST)
References: <19990926180234.A52298@kilt.nothing-going-on.org>
Reply-To: docbook at lists dot oasis-open dot org

At 26 Sep 1999 18:02 +0100, Nik Clayton wrote:
 > You could extend this to French fairly easily; just put the definitions in 
 > another file, and use "//FR" on the end of the FPI instead.
 > 
 > The problems arise when you start using languages with multiple possible
 > encodings.  For example, Chinese can be written in two encodings, EUC, and
 > Big5.  You can't distinguish between them with this scheme, so simply 
 > using "//ZH" in place of "//EN" in the FPI isn't sufficient.
 > 
 > FWIW I think, but could be wrong, that Norm's stylesheets have a similar
 > problem.  Specifying the default language with "<book lang='zh'>" isn't
 > sufficient either, as the various bits of boilerplate emitted by the 
 > stylesheets would need to be in the correct encoding.

Yes, but in principle the text, while it's inside the DSSSL engine,
should be Unicode characters in whatever internal representation the
program cares to use.  Getting it out in the right encoding is a
backend problem.
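
To make the "backend problem" concrete, here is a minimal sketch in
Python (not DSSSL, and not how any particular DSSSL engine works).  The
sample string and the emit() helper are assumptions made up for
illustration; the point is only that one internal Unicode string can be
written out in whichever encoding the output side asks for.

```python
# Sketch: text is held as Unicode internally; the byte encoding is
# chosen only at output time.  The sample string and emit() are
# illustrative assumptions, not part of any stylesheet.
boilerplate = "目錄"  # two traditional Chinese characters, as sample boilerplate


def emit(text: str, encoding: str) -> bytes:
    """Encode internal Unicode text for a particular output backend."""
    return text.encode(encoding)


big5_bytes = emit(boilerplate, "big5")
utf8_bytes = emit(boilerplate, "utf-8")

# Same characters, different bytes on the wire.
print(len(big5_bytes), len(utf8_bytes))
```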

 > Does anyone have any thoughts on the best way to solve this problem?  The
 > most obvious solution is to specify the encoding in the FPI like this
 > 
 > <!ENTITY bookinfo 
 >          PUBLIC "-//FreeBSD//ENTITIES DocBook BookInfo Entities zh_TW.Big5//EN">
 > 
 > i.e., ignore the language specifier in the FPI, and include it in the 
 > description.  This strikes me as being less than optimal because you then 
 > have important meta information inside what is (I think) supposed to 
 > be freeform data.

The text of ISO 8879 says:

   The public text language must be a name, entered with upper-case
   letters.  The name should be the two-character language code from
   ISO 639 that defines the principal natural language used in the
   public text.

The commentary from the SGML Handbook says:

   The word "should" here is an ISO convention that means "strongly
   recommended".  By stopping short of "required", the standard frees
   validating parsers from the burden of checking conformance.  This
   approach also permits user extensibility in an area where it cannot
   compromise the integrity of SGML.

If you think in terms of RFC 1766 language identifiers instead of
locales, it seems to me that this is acceptable:

<!ENTITY bookinfo 
       PUBLIC "-//FreeBSD//ENTITIES DocBook BookInfo Entities//ZH-TW">
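
To show where the language ends up when it stays in its own field, here
is a rough Python sketch that splits an FPI of the form used above on
"//".  The FPI class and parse_fpi() are hypothetical names, and the
sketch assumes the four-part "-//owner//CLASS description//LANGUAGE"
shape only; it does not handle ISO-owned identifiers or a trailing
display version.

```python
from typing import NamedTuple


class FPI(NamedTuple):
    """Fields of a four-part formal public identifier (illustrative)."""
    registration: str  # "-" for unregistered, "+" for registered owners
    owner: str
    text_class: str
    description: str
    language: str


def parse_fpi(fpi: str) -> FPI:
    """Split "-//owner//CLASS description//LANGUAGE" into its fields."""
    parts = fpi.split("//")
    if len(parts) != 4:
        raise ValueError(f"unsupported FPI shape: {fpi!r}")
    registration, owner, class_and_description, language = parts
    # The public text class is the first word; the rest is the description.
    text_class, _, description = class_and_description.partition(" ")
    return FPI(registration, owner, text_class, description, language)


fields = parse_fpi("-//FreeBSD//ENTITIES DocBook BookInfo Entities//ZH-TW")
print(fields.language)
```

With the language kept in its own field, the description stays freeform
and nothing has to be fished out of it.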

The lowercase-UPPERCASE thing for language-COUNTRY is just a
convention for RFC 1766 identifiers.  Everybody reminds you that RFC
1766 defines these identifiers to be case-insensitive right before
they show you a mixed-case example.  In fact, RFC 1766 says:

   The language tag is composed of 1 or more parts: A primary language
   tag and a (possibly empty) series of subtags.

   The syntax of this tag in RFC-822 EBNF is:

    Language-Tag = Primary-tag *( "-" Subtag )
    Primary-tag = 1*8ALPHA
    Subtag = 1*8ALPHA

   Whitespace is not allowed within the tag.

   All tags are to be treated as case insensitive; there exist
   conventions for capitalization of some of them, but these should not
   be taken to carry meaning.
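
As a sanity check on that grammar, here is a small Python sketch that
transcribes the quoted EBNF into a regular expression and compares tags
case-insensitively.  The function names are my own, and the sketch
checks syntax only; it does not look anything up in ISO 639 or any
registry.

```python
import re

# Primary-tag and each Subtag are 1 to 8 ASCII letters, joined by "-".
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def is_language_tag(tag: str) -> bool:
    """True if tag matches the RFC 1766 syntax quoted above."""
    return LANGUAGE_TAG.fullmatch(tag) is not None


def same_language_tag(a: str, b: str) -> bool:
    """Compare two syntactically valid tags, ignoring case."""
    return is_language_tag(a) and is_language_tag(b) and a.lower() == b.lower()


print(is_language_tag("ZH-TW"))             # upper case, as an FPI wants it
print(same_language_tag("zh-TW", "ZH-TW"))  # case carries no meaning
print(is_language_tag("zh_TW.Big5"))        # a locale name, not a language tag
```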

The fact that your text is in Big5 shouldn't be reflected in the FPI.
You say there are two encodings for Chinese -- EUC and Big5 -- but "CJKV
Information Processing" by Ken Lunde lists five other vendor character
sets that are similar to Big5, plus it lists Big5+ and multiple CNS
character sets.  To further confuse things, there are multiple GB/T
character sets from China that use traditional forms for their
characters.  To really confuse things (while actually attempting to
unconfuse them), all of these characters can be represented in
Unicode, which comes in UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, and
UTF-EBCDIC flavours (as well as UTF-7, UCS-2, and UCS-4).

Getting back to the simple case of EUC (EUC-TW, I presume) and Big5,
if you have one boilerplate text but don't include the encoding in its
FPI, then your users can use either an EUC-TW or a Big5 version of the
referenced file (depending on what their system uses) without
suffering pangs of guilt about the FPI indicating the wrong encoding.
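
One way to picture this is a per-system lookup from the same FPI to
differently encoded files.  The Python sketch below is only an
illustration of that idea: the file names, the CATALOGS table, and
resolve() are all hypothetical, and a real installation would do this
mapping in its own catalog files rather than in code.

```python
# Sketch: one FPI, several encoded copies of the referenced file.
# All file names here are made up for illustration.
BOOKINFO_FPI = "-//FreeBSD//ENTITIES DocBook BookInfo Entities//ZH-TW"

CATALOGS = {
    "big5": {BOOKINFO_FPI: "bookinfo.big5.ent"},
    "euc-tw": {BOOKINFO_FPI: "bookinfo.euc-tw.ent"},
}


def resolve(fpi: str, system_encoding: str) -> str:
    """Pick the local file for fpi according to the system's encoding."""
    return CATALOGS[system_encoding.lower()][fpi]


print(resolve(BOOKINFO_FPI, "Big5"))
print(resolve(BOOKINFO_FPI, "EUC-TW"))
```

The FPI names the text and its language; which bytes are on disk is a
local matter.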

Regards,

Tony Graham
======================================================================
Tony Graham                            mailto:tgraham@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

References:
- Language code for FPIs -- not explicit enough?
  - From: Nik Clayton
