This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: support C/C++ identifiers named with non-ASCII characters

From: 張俊芝 <zjz at zjz dot name>
To: gdb-patches at sourceware dot org
Date: Tue, 22 May 2018 01:36:42 +0800
Subject: Re: support C/C++ identifiers named with non-ASCII characters
References: <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>

Simon Marchi 於 2018/5/21 下午10:03 寫道:

On 2018-05-21 04:52 AM, 張俊芝 wrote:

Hi Zhang,

Thanks for the patch, I tested it quickly, it seems to work as expected.

Could you please write a small test case in testsuite/gdb.base with the example
you gave, so we make sure this doesn't get broken later?  If you can write it
in such a way that both clang and gcc understand it would be better, because
most people run the testuite using gcc to compile test programs.

I am not a specialist in lexing and parsing C, so can you explain quickly why
you think this is a good solution?  Quickly, I understand that you change the
identifier recognition algorithm to a blacklist of characters rather than
a whitelist, so bytes that are not recognized (such as those that compose
the utf-8 encoded characters) are not rejected.

Given unlimited time, would the right solution be to use a lib to parse the
string as utf-8, and reject strings that are not valid utf-8?

Here are some not and formatting comments:

+static bool is_identifier_separator (char);


You don't have to forward declare the function if it's not necessary.

+    /* '<' should not be a token separator, because it can be an open angle
+       bracket followed by a nested template identifier in C++. */


Please use two spaces after the final period (...C++.  */).

+  if (is_identifier_separator(c))


Please use a space before the parentheses:

   is_identifier_separator (c)

+  for (c = tokstart[namelen]; !is_identifier_separator(c);)


Here too.

Thanks!

Simon


Thank you for the reply, Simon.

This new diff addresses all the code style issues you mentioned.

Yes, you are right in that the blacklist is limited. Actually,`is_identifier_separator` only blacklists the invalid ASCII charactersfor an identifier, leaving all the invalid non-ASCII characters unchecked.

So seems it would be better if non-ASCII characters were also checked.However, unfortunately, GDB is neither aware of the encoding of theterminal input, nor the encoding of the generated symbolic informationin the executable. So the blacklist is made to restrict to the invalidASCII characters in order to support all ASCII-compliant encodings.

Having said that, I find that it does no harm to the user if we onlycheck the ASCII characters. If the user is trying to print an identifierwhich includes an invalid non-ASCII character, say, p 測】, where 】isinvalid, they will get an error message:


No symbol "測】" in current context.

which doesn't seem any worse than an error message like:

Invalid character "】" in expression.

Perhaps the former might be even more intuitional.

So personally, I think it might be safe enough to use this limitedblacklist method.

Attachment: diff
Description: Text document

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]