This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.

Re: printing wchar_t*

From: Eli Zaretskii <eliz at gnu dot org>
To: "Jim Blandy" <jimb at red-bean dot com>
Cc: ghost at cs dot msu dot su, gdb at sources dot redhat dot com
Date: Sat, 15 Apr 2006 00:37:40 +0300
Subject: Re: printing wchar_t*
References: <e1lsqg$aml$1@sea.gmane.org> <200604141257.41690.ghost@cs.msu.su> <uu08w1cnf.fsf@gnu.org> <200604141837.26618.ghost@cs.msu.su> <uirpc19u8.fsf@gnu.org> <8f2776cb0604141053v73e512e3o2d1c9086312316bd@mail.gmail.com> <ubqv4108c.fsf@gnu.org> <8f2776cb0604141216m216ba87ch529180cd079ce971@mail.gmail.com>
Reply-to: Eli Zaretskii <eliz at gnu dot org>

> Date: Fri, 14 Apr 2006 12:16:36 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> Cc: ghost@cs.msu.su, gdb@sources.redhat.com
> 
> >  (gdb) print *warray@8
> >   {0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}
> >
> > Except for using up 60-odd characters where you used 21, this is IMHO
> > better, since it doesn't require any code on the FE side: just convert
> > the strings to integers, and you've got Unicode, ready to be used for
> > whatever purposes.
> 
> If you're printing an expression that evaluates to a string, sure. 
> But what if you're printing a value of type struct { wchar *key;
> wchar_t *value }?  What if you're using -stack-list-arguments to show
> values in a stack frame?

Sorry, I don't see the difference.  Perhaps I'm too dense.  Are you
talking about the number of ASCII characters, or something else?

> My point is, MI consumers are already parsing ISO C strings.  They
> just need to parse more of them.

This ``more parsing'' is not magic.  It's a lot of work, in general.

> > For the interactive user, understanding non-ASCII strings in the
> > suggested ASCII encoding might not be easy at all.  For example, for
> > all my knowledge of Hebrew, if someone shows me \x05D2, I will have
> > a hard time recognizing the letter Gimel.
> 
> If the host character set includes Gimel, then GDB won't print it with
> a hex escape.

The host character set has nothing to do, in general, with what
characters can be displayed.  The same host character set can be
displayed on an appropriately localized xterm, but not on a bare-bones
character terminal.  Not every system that runs in the Hebrew locale
has Hebrew-enabled xterm.  Some characters may be missing from a
particular font, especially a Unicode-based font (because there are so
many Unicode characters).  Etc., etc.

Even if I do have a Hebrew-enabled xterm, chances are that it cannot
display characters sent as 16-bit Unicode codepoints; it will want
some byte-oriented encoding, like UTF-8 or maybe ISO 8859-8.

GDB will generally know nothing about these complications, unless we
teach it.  For example, to display Hebrew letters on a UTF-8 enabled
xterm, we (i.e. the user, through appropriate GDB commands) will have
to tell GDB that wchar_t strings should be encoded in UTF-8 by the CLI
output routines.  Sometimes these settings can be gleaned from the
environment variables, but Emacs's experience shows how very
unreliable and error-prone this is.
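
To make the point about output encodings concrete, here is a minimal
Python sketch of what a CLI output routine would have to do once it has
been told the terminal's encoding.  The function name and the
terminal_encoding parameter are hypothetical, standing in for whatever
user setting would carry that information; nothing here is an existing
GDB command or option.

```python
GIMEL = "\u05d2"  # HEBREW LETTER GIMEL

def encode_for_terminal(text, terminal_encoding):
    """Encode text for the terminal, falling back to hex escapes."""
    out = bytearray()
    for ch in text:
        try:
            # Use the terminal's own encoding when it has this character.
            out += ch.encode(terminal_encoding)
        except UnicodeEncodeError:
            # Otherwise emit an ASCII escape (hypothetical fallback format).
            out += ("\\x%04X" % ord(ch)).encode("ascii")
    return bytes(out)

for enc in ("utf-8", "iso8859_8", "ascii"):
    print(enc, encode_for_terminal(GIMEL, enc))
```

The same letter comes out as two bytes for a UTF-8 terminal, as one
byte for an ISO 8859-8 terminal, and as an escape for a plain ASCII
one, so the output routine cannot pick the right form without being
told which terminal it is talking to.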

> > As for the second sentence, ``reliably find the contents of the
> > string'' there obviously doesn't consider the complexities of handling
> > wide characters.  In my experience, for any non-trivial string
> > processing, working with variable-size encoding is much harder than
> > with fixed-size wchar_t arrays, because you need to interpret the
> > bytes as you go, even if all you need is to find the n-th character.
> > Even the simple task of computing the number of characters in the
> > string becomes complicated.
> 
> I don't understand what you mean.  The rules for parsing ISO C string
> literals into arrays of chars and wide string literals into arrays of
> wide characters are straightforward.

You seem to assume here that the target and the front-end's character
sets and their notion of wchar_t are identical.  Otherwise, what was a
valid array of wide characters on the target side will be gibberish on
the host side, and will certainly not display as anything legible.
Unlike GDB core, which just wants to pass the bytes from here to
there, the UI needs to be able to display the string, and for that it
needs to understand how it is encoded, how many glyphs it will produce
on the screen, where it can be broken into several lines if it is too
long, etc.  This is all trivial with 7-bit ASCII (every byte produces
a single glyph, except a few non-printables, whitespace characters
signal possible locations to break the line, etc.), but can get very
complex with other character sets.
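
As a rough illustration of why counting glyphs stops being trivial
outside ASCII, the following Python sketch estimates how many terminal
columns a string occupies.  It is only an approximation built on the
standard unicodedata module (it ignores bidirectional text, font
coverage, and terminal quirks), and display_columns is a name invented
for this example.

```python
import unicodedata

def display_columns(text):
    """Estimate the number of terminal columns text will occupy."""
    cols = 0
    for ch in text:
        # Combining marks and format characters add no column of their own.
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian wide and fullwidth characters usually take two columns.
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols

print(display_columns("abc"))
print(display_columns("e\u0301"))   # 'e' followed by a combining acute accent
print(display_columns("\u4e2d"))    # a CJK ideograph
```

With ASCII the column count equals the character count; with the other
two strings it does not, and a front end has to know that before it can
lay out or wrap the text.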

GDB cannot be asked to know about all of those complications, but I
think it should at least provide a few simple translation services so
that a front end will not have to work too hard to handle and display
strings as mostly readable text.  Passing the characters as fixed-size
codepoints expressed as ASCII hex strings leaves the front-end with
only a very simple job.  What's more, it uses an existing feature: array
printing.
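
A minimal sketch of that "very simple job", in Python, assuming the
front end receives the array exactly in the textual form shown in the
quoted example and that each element is a Unicode codepoint.  The
function name is made up for the illustration.

```python
import re

def parse_gdb_hex_array(text):
    """Turn array output such as '{0x0031, 0x0032}' into a Python str."""
    # Collect every hexadecimal literal; each one is taken to be a codepoint.
    codepoints = [int(tok, 16) for tok in re.findall(r"0x[0-9A-Fa-f]+", text)]
    return "".join(chr(cp) for cp in codepoints)

gdb_output = "{0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}"
decoded = parse_gdb_hex_array(gdb_output)
print(len(decoded), [hex(ord(ch)) for ch in decoded])
```

No escape syntax has to be interpreted: every element has the same
shape, and the number of characters is simply the number of elements.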

> > What you are suggesting is simple for GDB, but IMHO leaves too much
> > complexity to the FE.  I think GDB could do better.  In particular, if
> > I'm sitting at a UTF-8 enabled xterm, I'd be grateful if GDB would
> > show me Unicode characters in their normal glyphs, which would require
> > GDB to output the characters in their UTF-8 encoding (which the
> > terminal will then display in human-readable form).  Your suggestion
> > doesn't allow such a feature, AFAICS, at least not for CLI users.
> 
> When the host character set contains a character, there's no need for
> GDB to use an escape to show it.

Whose host character set? GDB's?  But GDB is not displaying the
strings, the front end is.  And as I wrote above, there is no
guarantee that the host character set can be transparently displayed
on the screen.  This only works for ASCII and some simple single-byte
encodings, mostly Latin ones.  But it doesn't work in general.

And why are you talking about host character set?  The
L"123\x0f04\x0fccxyz" string came from the target, GDB simply
converted it to 7-bit ASCII.  These are characters from the target
character set.  And the target doesn't necessarily talk in the host
locale's character set and language, you could be debugging a program
which talks Farsi with GDB that runs in a German locale.
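
For comparison, here is a Python sketch of what a front end has to do
with the escaped literal above.  It handles only plain characters and
\x hex escapes (no octal escapes, no \n, no embedded quotes), and it
assumes the resulting numbers are Unicode codepoints, which is exactly
the kind of knowledge about the target character set that the front end
would need from somewhere.  The function name is hypothetical.

```python
import re

# A token is either a hex escape (greedy, as in ISO C) or any single character.
_TOKEN = re.compile(r"\\x([0-9A-Fa-f]+)|(.)", re.DOTALL)

def decode_wide_literal(literal):
    """Decode a wide literal made of plain characters and hex escapes."""
    if not (literal.startswith('L"') and literal.endswith('"')):
        raise ValueError("not a wide string literal")
    body = literal[2:-1]
    chars = []
    for hex_digits, plain in _TOKEN.findall(body):
        # Exactly one of the two groups is non-empty for each token.
        chars.append(chr(int(hex_digits, 16)) if hex_digits else plain)
    return "".join(chars)

sample = r'L"123\x0f04\x0fccxyz"'
print([hex(ord(c)) for c in decode_wide_literal(sample)])

# A hex escape swallows every hex digit that follows it.
print(len(decode_wide_literal(r'L"\x41b"')))
```

The second call shows one of the traps: because an ISO C hex escape has
no fixed length, \x41 followed by the letter b reads as the single
character 0x41b, so whoever produces the literal must take care that
the text after an escape cannot be mistaken for more digits.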

> > If wchar_t uses fixed-size characters, not their variable-size
> > encodings, then specifying the CCS will do.
> 
> There is no provision in ISO C for variable-size wchar_t encodings. 
> The portion of the standard I referred to says that wchar_t "...is an
> integer type whose range of values can represent distinct codes for
> all members of the largest extended character set specified among the
> supported locales".

I agree, but Windows and who knows what else violate that.  Of
course, for the BMP, UTF-16 is indistinguishable from Unicode
codepoints, so in practice this might not matter too much.
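
A small Python sketch of where the difference shows up, using the
standard codecs only.  The two sample characters are my own choice for
the illustration.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text encoded as UTF-16."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

bmp = "\u05d2"         # HEBREW LETTER GIMEL, inside the BMP
astral = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_units(bmp)])
print([hex(u) for u in utf16_units(astral)])
```

For the BMP character the single code unit is the codepoint itself; the
character outside the BMP becomes a surrogate pair, so a 16-bit wchar_t
array holding it no longer has one element per character.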

Follow-Ups:
- Re: printing wchar_t*
  - From: Vladimir Prus

References:
- printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Eli Zaretskii
- Re: printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Eli Zaretskii
- Re: printing wchar_t*
  - From: Jim Blandy
- Re: printing wchar_t*
  - From: Eli Zaretskii
- Re: printing wchar_t*
  - From: Jim Blandy
