This is the mail archive of the guile@cygnus.com mailing list for the guile project.



Re: i18n; wide characters; Guile


Jim Blandy <jimb@red-bean.com> writes:
> An Emacs buffer must hold large amounts of text, and must also serve as the
> operand to editing and searching commands.  It is terribly clumsy to
> use a variable-length encoding in buffers.

Why?  It seems terribly elegant to use UTF-8 for buffers.

The problem with variable-length encodings is that character indexing
is not constant-time.  But why would you need constant-time indexing
for buffers?  There is no common user-level operation which
would require this.  ELisp uses buffer indexes, but there is no
inherent reason they need to be character indexes;  they can be byte
indexes or some magic cookie instead.  All most ELisp code
cares about is that buffer indexes are monotonically increasing,
and can be represented as fixnums.  Code that is likely to break
is anything that subtracts buffer indexes, and assumes the
difference is related to the number of characters in the sub-range.
There are probably a fair number of places that would need to be
fixed, but it is certainly a reasonable option.  (And I gather
this is what [FSF] Emacs 20 did.  Perhaps Stallman is (partly)
right after all ...)
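
To make the subtraction pitfall concrete, here is a small sketch in
Python (chosen only for illustration; the helper name is made up and
none of this is Emacs code).  It treats byte offsets into UTF-8 text
as "buffer indexes": ordering is preserved, but a difference of two
indexes counts bytes rather than characters.

```python
# Illustration only: byte offsets into UTF-8 text used as "buffer indexes".
text = "na\u00efve caf\u00e9"          # 10 characters, two of them non-ASCII
data = text.encode("utf-8")

def byte_index_of_char(s, n):
    """Byte offset at which the n-th character of s starts in UTF-8."""
    return len(s[:n].encode("utf-8"))

# One index per character boundary, including the end of the text.
positions = [byte_index_of_char(text, n) for n in range(len(text) + 1)]

# Indexes are strictly increasing, so comparisons such as (< start end)
# keep their meaning.
is_monotonic = all(a < b for a, b in zip(positions, positions[1:]))

# Subtracting indexes yields a byte count; the character count has to be
# obtained by scanning the sub-range.
start, end = positions[0], positions[-1]
byte_span = end - start
char_span = len(data[start:end].decode("utf-8"))
```

Code that only compares or stores positions is unaffected; code that
treats byte_span as a character count is the kind that would need
fixing.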

The other concerns about variable-width-encoded strings do not
apply to buffers using a gap.  If you replace a single-byte character
with one that needs multiple bytes, no problem - this is what we
have a buffer gap for.
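
As a rough sketch of why the gap absorbs the size change, here is a
minimal byte-oriented gap buffer in Python.  The class and method
names are hypothetical, the gap is fixed-size for brevity, and this is
not how Emacs itself is written; it only shows the idea.

```python
class GapBuffer:
    """Minimal byte-oriented gap buffer (hypothetical sketch)."""

    def __init__(self, text, gap_size=16):
        data = text.encode("utf-8")
        # Layout: [text before gap][gap][text after gap]; gap starts at the end.
        self.buf = bytearray(data) + bytearray(gap_size)
        self.gap_start = len(data)
        self.gap_end = len(self.buf)

    def _move_gap(self, pos):
        """Move the gap so that it begins at byte position pos."""
        if pos < self.gap_start:
            n = self.gap_start - pos
            # Shift the n bytes before the gap to just after it.
            self.buf[self.gap_end - n:self.gap_end] = self.buf[pos:self.gap_start]
            self.gap_start -= n
            self.gap_end -= n
        elif pos > self.gap_start:
            n = pos - self.gap_start
            # Shift the n bytes after the gap to just before it.
            self.buf[self.gap_start:self.gap_start + n] = \
                self.buf[self.gap_end:self.gap_end + n]
            self.gap_start += n
            self.gap_end += n

    def replace(self, pos, old_len, new_bytes):
        """Replace old_len bytes at byte position pos with new_bytes."""
        self._move_gap(pos)
        self.gap_end += old_len          # deleting = absorbing bytes into the gap
        if len(new_bytes) > self.gap_end - self.gap_start:
            raise ValueError("gap too small; a real buffer would grow it here")
        self.buf[self.gap_start:self.gap_start + len(new_bytes)] = new_bytes
        self.gap_start += len(new_bytes)

    def text(self):
        """Return the buffer contents, skipping over the gap."""
        return (self.buf[:self.gap_start] + self.buf[self.gap_end:]).decode("utf-8")


g = GapBuffer("cafe")
# Replace the one-byte "e" at byte 3 with the two-byte UTF-8 form of e-acute.
g.replace(3, 1, "\u00e9".encode("utf-8"))
```

The replacement is one byte longer than what it replaces, and the only
cost is that the gap shrinks by one byte; no text outside the gap has
to move once the gap is at the edit position.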

Searching commands work fine on UTF-8.  Plain (non-regexp) searching
works fine with no change, as long as both the buffer and the search
string are UTF-8.  Reg-exp searching would require some hacking,
since single-character patterns might take multiple bytes in the buffer.
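
A short Python illustration of the plain-search case (the function
name is made up for this example).  It relies on a property of UTF-8:
lead bytes and continuation bytes never overlap in value, so a valid
UTF-8 search string can only match starting at a character boundary.

```python
def search_forward(buffer_bytes, needle, start=0):
    """Return the byte index of the first match at or after start, or -1.

    Both the buffer and the needle are UTF-8, so an ordinary byte search
    is enough; no decoding is needed.
    """
    return buffer_bytes.find(needle.encode("utf-8"), start)


buf = "Gr\u00fc\u00dfe aus K\u00f6ln".encode("utf-8")
hit = search_forward(buf, "K\u00f6ln")

# The result is a byte index, not a character index, and it falls on a
# character boundary: the byte there is not a continuation byte (10xxxxxx).
on_boundary = (buf[hit] & 0xC0) != 0x80
```

Here the match is at byte 12 although the word starts at character 10,
which is fine as long as buffer positions are understood to be byte
indexes.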

Note it is still possible to define a buffer as a sequence of *characters*
(not bytes), as the XEmacs folks want, while using a variable-width
encoding, and still allowing buffer indexes to be byte indexes.
Whether this is the right engineering choice for Emacs is not
obvious, but it certainly could work quite well.
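
One way to picture this combination is a pair of helpers that move by
one *character* while positions stay byte indexes.  This is only a
sketch in Python with invented names; it assumes the data is valid
UTF-8 and that pos already sits on a character boundary.

```python
def is_continuation(b):
    """True if byte b is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (b & 0xC0) == 0x80

def next_char(data, pos):
    """Byte index of the character following the one that starts at pos."""
    pos += 1
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos

def prev_char(data, pos):
    """Byte index of the character preceding the boundary at pos."""
    pos -= 1
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos

def char_at(data, pos):
    """The character (not byte) that starts at byte index pos."""
    return data[pos:next_char(data, pos)].decode("utf-8")


# Characters of 1, 2, 3 and 1 bytes respectively.
data = "a\u00e9\u20acb".encode("utf-8")
```

Moving forward or backward by a character costs at most a few byte
inspections, so the buffer still behaves as a sequence of characters
even though every position handed to the caller is a byte index.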

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner