This is the mail archive of the docbook@lists.oasis-open.org mailing list for the DocBook project.



Re: Re: bidi override thoughts


Paul Grosso wrote at 22 Aug 2002 09:00:37 -0500:
...
 > The direction property takes values of left-to-right and right-to-left
 > (well, ltr and rtl because this was inherited from CSS--funny time for 
 > CSS to get terse, if you ask me).

'ltr' and 'rtl' may have been borrowed from HTML.

...


The Unicode Bidirectional Algorithm in a nutshell...


Bidirectional types
===================

Unicode characters have a "bidirectional type".  There are many
types, but they're divided into three categories: strong, weak, and
neutral.

Characters with a strong bidirectional type really know their
directionality.  For example, the characters in most alphabets are
"strongly" left-to-right, and the characters in the Hebrew and Arabic
alphabets (and some others) are "strongly" right-to-left.

Characters with a weak bidirectional type determine their
directionality according to their proximity to other characters with
strong directionality.

Characters with a neutral bidirectional type determine their
directionality from either the surrounding strong text or the
embedding level.
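
As a rough illustration of the three categories, the sketch below uses
the bidirectional() function in Python's standard unicodedata module,
which returns a character's bidirectional type as a short code.  The
grouping of codes into strong, weak, and neutral is my assumption
about the usual reading of the Unicode classes, not an exhaustive
table, and the name bidi_category is made up for this example.

```python
import unicodedata

# Assumed grouping of bidirectional type codes into the three categories.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_category(ch):
    """Return 'strong', 'weak', 'neutral', or 'other' for one character."""
    bidi_type = unicodedata.bidirectional(ch)
    if bidi_type in STRONG:
        return "strong"
    if bidi_type in WEAK:
        return "weak"
    if bidi_type in NEUTRAL:
        return "neutral"
    # Explicit formatting codes and unassigned characters end up here.
    return "other"
```

For instance, a Latin or Hebrew letter comes out as strong, a digit or
comma as weak, and a space or exclamation mark as neutral.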


Embedding levels
================

The Unicode Bidirectional Algorithm works in terms of "levels" of
right-to-left text embedded with left-to-right text, and vice versa.

Even levels (0, 2, 4...60) are left-to-right.  Odd levels (1, 3..61)
are right-to-left.

Text at an even level is rendered left-to-right.  Text at an odd level
is rendered right-to-left.

The Unicode Bidirectional Algorithm works on paragraphs, so the first
step is dividing text into paragraphs.  You determine the "paragraph
embedding level" by finding the first character in the paragraph with
a strong bidirectional category.  If the character is strongly
left-to-right, the paragraph embedding level is 0, otherwise (i.e. if
the character is strongly right-to-left), the embedding level is 1.
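
The rule for the paragraph embedding level can be sketched in a few
lines, again using Python's unicodedata module.  The function names
are hypothetical, and falling back to level 0 when a paragraph has no
strong character at all is an assumption on my part; the description
above does not cover that case.

```python
import unicodedata


def paragraph_level(text, default=0):
    """Return 0 or 1 based on the first strong character in the text."""
    for ch in text:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type == "L":
            return 0  # strongly left-to-right
        if bidi_type in ("R", "AL"):
            return 1  # strongly right-to-left (Hebrew, Arabic, ...)
    # No strong character found: assumed fallback, not from the text above.
    return default


def level_direction(level):
    """Even levels are left-to-right, odd levels are right-to-left."""
    return "rtl" if level % 2 else "ltr"
```

Note that leading digits and punctuation are skipped, since they are
weak or neutral: only the first strong character counts.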

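Embedding goes on from there: contained text with the opposite
directionality is at the next embedding level up, and text with the
original directionality that is contained by the text with the
opposite directionality is at the next level up again.  E.g., in a
left-to-right paragraph (level 0), embedded right-to-left text is at
level 1, and left-to-right text embedded within that is at level 2.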


Explicit bidirectional formatting
=================================

Unicode includes characters for fudging the embedding level:

 - RLE, Right-to-Left Embedding, says treat the following text as
   right-to-left.  I.e., it forces the embedding level to the next
   higher odd number: level 0 -> level 1; 1 or 2 -> 3, etc.

 - LRE, Left-to-Right Embedding, says treat the following text as
   left-to-right.  I.e., it forces the embedding level to the next
   higher even number: 0 or 1 -> 2; 2 or 3 -> 4, etc.

 - RLO, Right-to-Left Override, says treat the following characters as
   strong right-to-left characters.  I.e. it forces an odd embedding
   level, but it also sets the "override status" to right-to-left so
   the implementation knows which way to push those neutral types.

 - LRO, Left-to-Right Override, says treat the following characters as
   strong left-to-right characters.  I.e. it forces an even embedding
   level, but it also sets the "override status" to left-to-right so
   the implementation knows which way to push those neutral types.

 - PDF, Pop Directional Format, is the generic "end-tag" for the
   previous RLE, LRE, RLO, or LRO character.

 - RLM, Right-to-Left Mark, is a zero-width (i.e. it doesn't print)
   character that is used as an invisible spot of strong right-to-left
   directionality to coerce neighbouring weak and neutral characters
   into behaving the way you want.  This doesn't change the embedding
   level.

   The example in the Unicode Standard shows RLM being used with an
   exclamation mark (i.e., '!')  that is between some left-to-right
   text and some neutral text, all of which is within some
   right-to-left text.  Without the RLM, the ! is treated as part of
   the span of left-to-right text.  With the RLM between the
   left-to-right text and the !, the ! is treated as part of the
   right-to-left text, which changes on which end of the left-to-right
   text it is rendered.

 - LRM, Left-to-Right Mark, is a zero-width (i.e. it doesn't print)
   character that is used as an invisible spot of strong left-to-right
   directionality to coerce neighbouring weak and neutral characters
   into behaving the way you want.  This doesn't change the embedding
   level.

RLM and LRM are good if you know what you're doing, you probably have
an editor that lets you represent them, and you're worried about
conserving embedding levels.  For the rest of us, the other five
characters represent the brute-strength and ignorance approach that
we're more comfortable with.
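
To make the level arithmetic concrete, here is a simplified sketch of
how the five embedding and override characters could be tracked with a
stack.  It is not the full Unicode algorithm: the function names, the
tuple output, and the overflow counter for codes that would exceed
level 61 are all my own simplifications.  The code points used for
LRE, RLE, PDF, LRO, and RLO are U+202A through U+202E.

```python
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"
MAX_LEVEL = 61  # highest level mentioned in the description above


def next_odd(level):
    """Next higher odd level: 0 -> 1, 1 or 2 -> 3."""
    return (level + 1) | 1


def next_even(level):
    """Next higher even level: 0 or 1 -> 2, 2 or 3 -> 4."""
    return (level + 2) & ~1


def explicit_levels(text, paragraph_level=0):
    """Return (char, level, override) for each non-formatting character."""
    stack = []      # saved (level, override) pairs, restored by PDF
    overflow = 0    # embeddings ignored because they exceeded MAX_LEVEL
    level, override = paragraph_level, None
    result = []
    for ch in text:
        if ch in (RLE, RLO, LRE, LRO):
            new = next_odd(level) if ch in (RLE, RLO) else next_even(level)
            if new <= MAX_LEVEL:
                stack.append((level, override))
                level = new
                # Only the override codes set an override status.
                override = {RLO: "R", LRO: "L"}.get(ch)
            else:
                overflow += 1
        elif ch == PDF:
            if overflow:
                overflow -= 1
            elif stack:
                level, override = stack.pop()
        else:
            # Everything else, including RLM and LRM, keeps the current level.
            result.append((ch, level, override))
    return result
```

Calling explicit_levels("a" + RLE + "b" + LRE + "c" + PDF + PDF + "d")
puts "a" and "d" at level 0, "b" at level 1, and "c" at level 2.  RLM
(U+200F) and LRM (U+200E) pass through as ordinary characters, which
matches the point that they don't change the embedding level.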


Bidirectional conformance
=========================

Systems do not need to support any explicit directional formatting
codes.

The "implicit bidirectional algorithm" can be taken as handling
bidirectionality based solely on embedding levels and the characters'
bidirectionality types and without any overrides.

There isn't an "explicit bidirectional algorithm" as such.  The
explicit codes distort the embedding levels compared to what they
would ordinarily be, but after they've been taken into account, the
"implicit" algorithm, based on embedding levels and characters' types,
is what finally determines which text is rendered in which direction.


Higher-level protocols
======================

The "permissible ways for systems to apply higher-level protocols to
the ordering of bidirectional text" are:

 - Override the paragraph embedding level

 - Override the number handling to use information provided by a
   broader context (Let's not go there.)

 - Replace, supplement, or override the bidirectional overrides or
   embedding codes

 - Override the bidirectional character types assigned to control
   codes to match the interpretation of the control codes used within
   the protocol (Let's not go there either.)

 - Remap the number shapes to match those of another set (Ditto.)

HTML, CSS, and XSL do the first and third only.


HTML
====

HTML has a "dir" attribute for indicating the directionality of text.
The allowed values are RTL and LTR.  I.e., "dir" overrides the
paragraph embedding level and replaces the embedding codes.

HTML also has a <bdo> element that is used for overriding the effects
of the bidirectional algorithm on a span of text.  I.e., it replaces
the override codes.

The HTML Recommendation warns against mixing its controls with
explicit bidirectional override characters.  Hardly surprising.
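
If you want to catch that kind of mixing before it bites, a check
along the following lines could flag explicit embedding and override
characters in text that is also going to be controlled by markup.
This is only a sketch: the function name is hypothetical, and leaving
RLM and LRM out of the flagged set is my assumption, not something the
HTML Recommendation spells out in these terms.

```python
# Explicit embedding/override characters and their usual abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE",
    "\u202b": "RLE",
    "\u202c": "PDF",
    "\u202d": "LRO",
    "\u202e": "RLO",
}


def find_bidi_controls(text):
    """Return (index, name) for each explicit bidi control in the text."""
    return [
        (index, BIDI_CONTROLS[ch])
        for index, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]
```

An empty result means the text contains none of the five characters,
so any directionality control is coming from the markup alone.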


CSS2
====

As Paul noted, CSS has a "direction" property with values "ltr" and
"rtl" (and "inherit").  It specifies "the base writing direction of
blocks and the direction of embeddings and overrides for the Unicode
BIDI algorithm."  I.e., it overrides the paragraph embedding level for
blocks (i.e., for what Unicode considers paragraphs) and it's also
used for replacing the bidirectional overrides and embedding codes.

The "unicode-bidi" property is the other half of how CSS2 replaces the
bidirectional overrides and embedding codes.  The allowed values are
"normal", "embed", "override", and "inherit".

'unicode-bidi: normal' doesn't do anything, which is why 'normal' is
the default value.

'unicode-bidi: embed' is equivalent to RLE (when 'direction: rtl') or
LRE (when 'direction: ltr') at one end of a span of text and a PDF at
the other.

'unicode-bidi: override' is equivalent to RLO (when 'direction: rtl')
or LRO (when 'direction: ltr') at one end of a span of text and a PDF
at the other.
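
The equivalence between the two CSS2 properties and the Unicode
characters can be written down as a small mapping.  The sketch below
is illustrative only: css_to_controls is a made-up name, "inherit" is
deliberately not handled since it depends on the parent element, and
no real CSS implementation is claimed to work this way.

```python
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"


def css_to_controls(text, unicode_bidi="normal", direction="ltr"):
    """Wrap text in the control characters equivalent to the CSS2 values."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if unicode_bidi == "normal":
        return text  # no additional embedding or override
    if unicode_bidi == "embed":
        opener = RLE if direction == "rtl" else LRE
    elif unicode_bidi == "override":
        opener = RLO if direction == "rtl" else LRO
    else:
        raise ValueError("unicode_bidi must be 'normal', 'embed' or 'override'")
    # One control character at the start of the span, a PDF at the end.
    return opener + text + PDF
```

So css_to_controls("abc", "override", "rtl") yields RLO, then "abc",
then PDF, which is the character-level equivalent of the override
case described above.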


XSL
===

XSL has "direction", "unicode-bidi" and "writing-mode" properties,
although they don't all apply to all the same formatting objects.

"writing-mode" applies to the formatting objects that set up a
"reference-area", i.e., to the big-picture formatting objects that
specify the page and the regions within the page, to tables, and to table
cells.  It affects how you sequence blocks of text, but it also
overrides the "paragraph embedding level."

"direction" and "unicode-bidi" apply only to the "bidi-override"
formatting object.  They behave pretty much like in CSS2, except that
the initial value of "direction" is derived from the current
"writing-mode" value rather than being explicitly "ltr".

(Determining the initial value of "direction" this way probably means
fewer surprises when formatting a purely right-to-left document, but
the "direction" description does read like it was written for
"direction" to apply to more formatting objects than just
"bidi-override".)


Conclusion
==========

1. If using markup to control bidirectionality, you need a way to set
   the paragraph embedding level (i.e., set whether the paragraph
   starts out right-to-left or left-to-right) as well as a way to
   override the implicit bidirectional algorithm (the algorithm that
   works w.r.t. the characters' bidirectional types).

2. Markup that overrides the implicit bidirectional algorithm should
   support both overrides (RLO and LRO equivalent) and embeds (RLE and
   LRE equivalent).

3. Include strong words against mixing markup-based bidirectionality
   controls and the explicit bidirectionality characters.

4. Consistency with existing standards is a GOOD THING.  Compatibility
   with the Unicode Bidirectional Algorithm is essential.

5. Work out whether every inline can affect bidirectionality (CSS
   style) or whether there's one special-purpose element (HTML and XSL
   style, although I don't expect XHTML to stick to that and it
   doesn't matter for HTML anyway if you're also using CSS).

6. A politically correct default direction is hard to determine.  CSS2
   uses 'ltr', and XSL lets the XSL processor have a default.

Regards,


Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin                mailto:tony.graham@sun.com
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708

