This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.


Re: printing wchar_t*

From: Eli Zaretskii <eliz at gnu dot org>
To: Vladimir Prus <ghost at cs dot msu dot su>
Cc: jimb at red-bean dot com, gdb at sources dot redhat dot com
Date: Mon, 17 Apr 2006 11:35:10 +0300
Subject: Re: printing wchar_t*
References: <e1lsqg$aml$1@sea.gmane.org> <8f2776cb0604141216m216ba87ch529180cd079ce971@mail.gmail.com> <u64lb25zv.fsf@gnu.org> <200604171036.48833.ghost@cs.msu.su>
Reply-to: Eli Zaretskii <eliz at gnu dot org>

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Mon, 17 Apr 2006 10:36:47 +0400
> Cc: "Jim Blandy" <jimb@red-bean.com>,
>  gdb@sources.redhat.com
> 
> On Saturday 15 April 2006 01:37, Eli Zaretskii wrote:
> 
> > > My point is, MI consumers are already parsing ISO C strings.  They
> > > just need to parse more of them.
> >
> > This ``more parsing'' is not magic.  It's a lot of work, in general.
> 
> I don't quite get it. Say that frontend and gdb somehow agree on the 8-bit 
> encoding used by gdb to print the strings. Then frontend can look at the 
> string and:
>   
>   - If it sees \x, look at the following hex digits and convert it to either
>     code point or code unit
>   - If it sees anything else, convert it from local 8-bit to Unicode

That's what Jim was saying.  He thought (or so it seemed to me) that,
once the ASCII-encoded string has been read by the front end and
converted back to the integer values, the job is done.  That is, in Jim's
example with L"123\x0f04\x0fccxyz", the character `1' is converted to
its code 49 decimal, \x0f04 is converted to the 16-bit code 3844
decimal, `x' is converted to 120 decimal, etc.
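
A minimal sketch of that conversion, in Python for brevity.  The
function name is made up for illustration, and the sketch assumes that
\x escapes are greedy (as in ISO C) and that every literal character is
plain ASCII:

```python
import re

# One token is either a \x escape (greedy hex digits, as in ISO C)
# or any single literal character.
_TOKEN = re.compile(r'\\x([0-9A-Fa-f]+)|(.)', re.DOTALL)

def decode_escaped(body):
    """Turn the body of an escaped string into a list of integer codes.

    Hypothetical helper: literal characters are assumed to be ASCII,
    so ord() gives their code directly.
    """
    values = []
    for match in _TOKEN.finditer(body):
        if match.group(1) is not None:
            values.append(int(match.group(1), 16))   # \x0f04 -> 3844
        else:
            values.append(ord(match.group(2)))       # '1' -> 49
    return values

values = decode_escaped(r"123\x0f04\x0fccxyz")
print(values)
```

This yields the integer codes and nothing more; it says nothing yet
about the character set those integers belong to.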

What I was saying is that this conversion is indeed easy, but it's not
even close to doing what the front end generally would like to do with
the string.  You want to _process_ the string, which means you want to
know its length in characters (not bytes), you want to know what
character set they encode, you want to be able to find the n-th
character in the string, etc.  The encoding suggested by Jim makes
these tasks very hard, much harder than if we send the string as an
array of fixed-length wide characters.
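
To illustrate why the fixed-length form is convenient, here is a rough
sketch in Python.  It assumes the array reaches the front end as a
brace-delimited list of hex values, and that each element is one
complete Unicode codepoint; both are assumptions made for the example,
and the helper name is hypothetical:

```python
def parse_hex_array(text):
    """Parse a brace-delimited, comma-separated list of hex values.

    Hypothetical helper; the input format is an assumption.
    """
    inner = text.strip()
    if not (inner.startswith("{") and inner.endswith("}")):
        raise ValueError("expected a brace-delimited array")
    inner = inner[1:-1].strip()
    if not inner:
        return []
    # int(..., 16) accepts an optional 0x prefix and surrounding blanks.
    return [int(item, 16) for item in inner.split(",")]

codepoints = parse_hex_array(
    "{0x31, 0x32, 0x33, 0xf04, 0xfcc, 0x78, 0x79, 0x7a}")

length = len(codepoints)        # length in characters, not bytes
fourth = chr(codepoints[3])     # the n-th character is a plain index
text = "".join(chr(cp) for cp in codepoints)
```

With one array element per character, the length and the n-th
character fall out of ordinary list operations, with no escape parsing
and no guessing about the encoding of the literal characters.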

> Note that due to charset function interface using 'int', you can't use UTF-8 
> for encoding passed to frontend, but using ASCII + \x is still feasible.

I don't understand why UTF-8 cannot be used (an int can hold an 8-bit
byte just fine), nor can I see why this is an issue.  We are not
discussing addition of UTF-8 encoding to GDB, we are discussing how to
pass to a front end wide-character strings held within the debuggee.
Or at least that's what I thought you were trying to solve.

> There's one nice thing about this approach. If there's new "print array until 
> XX" syntax, I indeed need to special-case processing of values in several 
> contexts -- most notably arguments in stack trace. With "\x" escapes I'd need 
> to write a code to handle them once. In fact, I can add this code right to MI 
> parser (which operates using Unicode-enabled QString class already). That 
> will be more convenient than invoking 'print array' for any wchar_t* I ever 
> see.

I don't think we should optimize GDB for one specific toolkit, even if
that toolkit is Qt.

> I don't quite get. First you say you want \x05D2 to display using Unicode font 
> on console, now you say it's very hard.

No, I said that a GUI front end will be able to display the _binary_
_code_ 0x05D2 with a suitable Unicode font.  Jim suggested that seeing
the _string_ "\x05D2" in GDB's output will allow me to read the text,
to which I replied that it will not be easy at all, since humans
generally don't remember Unicode codepoints by heart, even for their
native languages.

> Now, if you want Unicode display for 
> \x05D2, there should be some method to tell gdb that your console can display 
> Unicode, and if user told that Unicode is supported, what are the problems?

Please read my other messages: the program being debugged might talk
Hebrew in Unicode codepoints, but the locale where we are running GDB
might not support Hebrew on the console.  So, as long as we are
talking about console output (which is different from a GUI front
end), just sending Unicode to the display is not enough.

I suggest that we not mix issues relevant to GUI front ends with those
relevant to text-mode front ends, including the CLI ``front end'' built
into GDB itself.
These are different issues, each one with its own set of complexities.

Jim's L"123\x0f04\x0fccxyz" proposal was (I think) more oriented to
text terminals and the CLI, so the discussion wandered off in that
direction.  I don't think your original problem is related to that.

> > how many glyphs will it produce 
> > on the screen, where it can be broken into several lines if it is too
> > long, etc.  This is all trivial with 7-bit ASCII (every byte produces
> > a single glyph, except a few non-printables, whitespace characters
> > signal possible locations to break the line, etc.), but can get very
> > complex with other character sets.
> 
> Isn't this completely outside of GDB?

No, not completely: the ui_output routines do this for the console
output.  Again, this part was about text-mode output, and the CLI in
particular.

> > GDB cannot be asked to know about all of those complications, but I
> > think it should at least provide a few simple translation services so
> > that a front end will not have to work too hard to handle and display
> > strings as mostly readable text.  Passing the characters as fixed-size
> > codepoints expressed as ASCII hex strings leaves the front-end with
> > only very simple job.  What's more, it uses an existing feature: array
> > printing.
> 
> Using \x escapes, provided they encode *code units*, leaves frontend with the 
> same simple job.

Yes, but GDB will need to generate the code units first, e.g. convert
fixed-size Unicode wide characters into UTF-8.  That's an extra job for
GDB.  (Again, we were originally talking about wchar_t, not multibyte
strings.)
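
For concreteness, this is roughly the extra step involved, sketched in
Python rather than in GDB's own C.  The function names are invented for
the example, and the sketch does not validate its input (it does not
reject surrogates or values above 0x10FFFF):

```python
def utf8_code_units(cp):
    """Split one Unicode codepoint into its UTF-8 code units (bytes)."""
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]

def escape_units(units):
    """Render code units as \\x escapes, two hex digits each."""
    return "".join("\\x%02x" % unit for unit in units)

# The wide character 0x05D2 becomes two code units.
units = utf8_code_units(0x05D2)
escaped = escape_units(units)
```

None of this is hard, but it is work that is not needed at all if the
wide characters are sent as they are.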

> Really, using strings with \x escapes differs from array 
> printing in just one point: some characters are printed not as hex values, 
> but as characters in local 8-bit encoding. Why do you think this is a 
> problem?

Because knowing what is the ``local 8-bit encoding'' is in itself a
huge problem.  Emacs has been trying to solve it since 1996, and it
still hasn't got all the details right in some marginal cases, although we
have people on the Emacs development team who understand more about
i18n than I ever will.  In short, there's no reliable method of
finding out what is the correct 8-bit encoding in which to talk to any
given text-mode display.

And you certainly do NOT want any local 8-bit encodings when you are
going to display the string on a GUI, because that would require the
front end to do the extra job of converting the encoded text back
to what it needs to communicate with the text widgets.

> > And why are you talking about host character set?  The
> > L"123\x0f04\x0fccxyz" string came from the target, GDB simply
> > converted it to 7-bit ASCII.  These are characters from the target
> > character set.  And the target doesn't necessarily talk in the host
> > locale's character set and language, you could be debugging a program
> > which talks Farsi with GDB that runs in a German locale.
> 
> So, characters that happen to exist in German locale are printed as literal 
> chars. Other characters are printed using \x. FE reads the string, and when 
> it sees literal char, it converts it from German locale to Unicode used 
> internally. Where's the problem?

If this conversion is lossless, it's redundant.  It is easier to just
send everything as hex escapes, since no human will see them, only the
FE.  This saves the needless conversion (and potential problems with
an incorrect notion of the current locale and encoding).
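
A small sketch of the problem with an incorrect notion of the encoding,
in Python.  The function name and the sample bytes are made up for the
example; the sketch assumes that \x escapes carry codepoints and that
everything else is in some 8-bit encoding the front end has to guess:

```python
import re

_ESCAPE = re.compile(rb'\\x([0-9A-Fa-f]+)')

def decode_mixed(raw, local_encoding):
    """Decode a byte string mixing \\x escapes with literal 8-bit text.

    Hypothetical helper: escapes are taken as Unicode codepoints, and
    the literal runs are decoded with the encoding the caller guessed.
    """
    out = []
    pos = 0
    for match in _ESCAPE.finditer(raw):
        out.append(raw[pos:match.start()].decode(local_encoding))
        out.append(chr(int(match.group(1), 16)))
        pos = match.end()
    out.append(raw[pos:].decode(local_encoding))
    return "".join(out)

# The byte 0xf6 is literal; \x05d2 is an escape.
raw = b"sch\xf6n \\x05d2"

as_german = decode_mixed(raw, "latin_1")     # 0xf6 read as o-umlaut
as_hebrew = decode_mixed(raw, "iso8859_8")   # 0xf6 read as a Hebrew letter
```

The escaped part comes out the same either way; only the literal byte
depends on the guess, and a wrong guess silently produces different
text.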

But some conversions to ``literal characters'' (i.e. to 8-bit binary
codes) are lossy, because the underlying converter needs state
information to correctly interpret the byte stream.  This state
information is thrown away once the conversion is done, and so the
opposite conversion fails to reconstruct the original codepoints.
This is usually the case with ISO-2022 encodings.
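
One way to see the role of the state, sketched in Python with its
iso2022_jp codec.  This only illustrates the general point; it is not
a description of what any particular converter inside GDB does:

```python
import re

# Designation escapes of the kind ISO-2022-JP uses, e.g. ESC $ B
# (switch to a two-byte Japanese set) and ESC ( B (back to ASCII).
_DESIGNATION = re.compile(rb'\x1b[$(][@-Z]')

original = "\u65e5\u672c"                 # two kanji
encoded = original.encode("iso2022_jp")

# With the escape sequences intact, the round trip works.
round_trip = encoded.decode("iso2022_jp")

# Drop the escape sequences, i.e. lose the shift state: what remains
# is a run of 7-bit bytes that a reader will take for ASCII.
stateless = _DESIGNATION.sub(b"", encoded)
misread = stateless.decode("ascii")
```

The same bytes mean kanji in one state and ASCII punctuation and
letters in another, so a reader that does not have the state cannot
get the original codepoints back.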

So I think on balance it's better to send the original wide characters
as hex, the only downside being that it uses more bytes per character.
(Again, this is about GUI front ends, not about GDB's own CLI output
routines.)

Follow-Ups:
- Re: printing wchar_t*
  - From: Vladimir Prus

References:
- printing wchar_t*
  - From: Vladimir Prus
- Re: printing wchar_t*
  - From: Jim Blandy
- Re: printing wchar_t*
  - From: Eli Zaretskii
- Re: printing wchar_t*
  - From: Vladimir Prus
