This is the mail archive of the
gdb-patches@sourceware.org
mailing list for the GDB project.
Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
On 11/20/2017 04:50 PM, Simon Marchi wrote:
> On 2017-11-20 06:56 AM, Pedro Alves wrote:
>> static std::string
>> make_sort_after_prefix_name (const char *search_name)
>> {
>> /* When looking to complete "func", we find the upper bound of all
>> symbols that start with "func" by looking for where we'd insert
>> the closest string that would follow "func" in lexicographical
>> order. Usually, that's "func"-with-last-character-incremented,
>> i.e. "fund". Mind non-ASCII characters, though. Usually those
>> will be UTF-8 multi-byte sequences, but we can't be certain.
>> Especially mind the 0xff character, which is a valid character in
>> non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't
>> rule out compilers allowing it in identifiers. Note that
>> conveniently, strcmp/strcasecmp are specified to compare
>> characters interpreted as unsigned char. So what we do is treat
>> the whole string as a base 255 number composed of a sequence of
>> base 255 "digits" and add 1 to it. I.e., adding 1 to 0xff wraps
>> to 0, and carries 1 to the following more-significant position.
>> If the very first character carries/overflows, then the upper
>> bound is the end of the list. Also the string after the empty
>> string is also the empty string.
>
> Making an analogy with base-10 arithmetic is actually what made me
> understand it. The number after 149 is not 140, it's 150. We're
> doing the string equivalent of that. Your explanation with base-255
> numbers is very good. It doesn't really work for all-0xff strings,
> because adding one (with carry) to "\xff\xff" would give "\x01\x00\x00",
> but it doesn't really matter for the explanation :).
:-) I've extended the text slightly to hopefully make that
even clearer.
Oh, and I realized that it's based 256, not 255.
Classical off-by-one. :-P
I also fixed the \xffa\ff in the example: we need to split the
string after ff, because otherwise 'a' is interpreted as a hex digit.
Below's what I pushed now.
Thanks for the reviews!
>From e1ef7d7a5166f846b14bea5a77acb0dec76661a8 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 21 Nov 2017 00:02:46 +0000
Subject: [PATCH] 0xff chars in name components table; cp-name-parser lex UTF-8
identifiers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The find-upper-bound-for-completion algorithm in the name components
accelerator table in dwarf2read.c increments a char in a string, and
asserts that it's not incrementing a 0xff char, but that's incorrect.
First, we shouldn't be calling gdb_assert on input.
Then, if "char" is signed, comparing a caracther with "0xff" will
never yield true, which is caught by Clang with:
error: comparison of constant 255 with expression of type '....' (aka 'char') is always true [-Werror,-Wtautological-constant-out-of-range-compare]
gdb_assert (after.back () != 0xff);
~~~~~~~~~~~~~ ^ ~~~~
And then, 0xff is a valid character on non-UTF-8/ASCII character sets.
E.g., it's 'ÿ' in Latin1. While GCC nor Clang support !ASCII &&
!UTF-8 characters in identifiers (GCC supports UTF-8 characters only
via UCNs, see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html),
but other compilers might (Visual Studio?), so it doesn't hurt to
handle it correctly. Testing is covered by extending the
dw2_expand_symtabs_matching unit tests with relevant cases.
However, without further changes, the unit tests still fail... The
problem is that cp-name-parser.y assumes that identifiers are ASCII
(via ISALPHA/ISALNUM). This commit fixes that too, so that we can
unit test the dwarf2read.c changes. (The regular C/C++ lexer in
c-lang.y needs a similar treatment, but I'm leaving that for another
patch.)
While doing this, I noticed a thinko in the computation of the upper
bound for completion in dw2_expand_symtabs_matching_symbol. We're
using std::upper_bound but we should use std::lower_bound. I extended
the unit test with a case that I thought would expose it, this one:
+ /* These are used to check that the increment-last-char in the
+ matching algorithm for completion doesn't match "t1_fund" when
+ completing "t1_func". */
+ "t1_func",
+ "t1_func1",
+ "t1_fund",
+ "t1_fund1",
The algorithm actually returns "t1_fund1" as lower bound, so "t1_fund"
matches incorrectly. But turns out the problem is masked because
later here:
for (;lower != upper; ++lower)
{
const char *qualified = index.symbol_name_at (lower->idx);
if (!lookup_name_matcher.matches (qualified)
the lookup_name_matcher.matches check above filters out "t1_fund"
because that doesn't start with "t1_func".
I'll fix the latent bug in follow up patches, after factoring things
out a bit in a way that allows unit testing the relevant code more
directly.
gdb/ChangeLog:
2017-11-21 Pedro Alves <palves@redhat.com>
* cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New.
(symbol_end): Use cp_ident_is_alnum.
(yylex): Use cp_ident_is_alpha and cp_ident_is_alnum.
* dwarf2read.c (make_sort_after_prefix_name): New function.
(dw2_expand_symtabs_matching_symbol): Use it.
(test_symbols): Add more symbols.
(run_test): Add tests.
---
gdb/ChangeLog | 10 ++++
gdb/cp-name-parser.y | 28 +++++++++--
gdb/dwarf2read.c | 136 +++++++++++++++++++++++++++++++++++++++++++++------
3 files changed, 157 insertions(+), 17 deletions(-)
diff --git a/gdb/ChangeLog b/gdb/ChangeLog
index 31e447a..1a38573 100644
--- a/gdb/ChangeLog
+++ b/gdb/ChangeLog
@@ -1,3 +1,13 @@
+2017-11-21 Pedro Alves <palves@redhat.com>
+
+ * cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New.
+ (symbol_end): Use cp_ident_is_alnum.
+ (yylex): Use cp_ident_is_alpha and cp_ident_is_alnum.
+ * dwarf2read.c (make_sort_after_prefix_name): New function.
+ (dw2_expand_symtabs_matching_symbol): Use it.
+ (test_symbols): Add more symbols.
+ (run_test): Add tests.
+
2017-11-17 Tom Tromey <tom@tromey.com>
* symtab.h (enum symbol_subclass_kind): New.
diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
index 33ecf13..fdfbf15 100644
--- a/gdb/cp-name-parser.y
+++ b/gdb/cp-name-parser.y
@@ -1304,6 +1304,28 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
}
+/* Like ISALPHA, but also returns true for the union of all UTF-8
+ multi-byte sequence bytes and non-ASCII characters in
+ extended-ASCII charsets (e.g., Latin1). I.e., returns true if the
+ high bit is set. Note that not all UTF-8 ranges are allowed in C++
+ identifiers, but we don't need to be pedantic so for simplicity we
+ ignore that here. Plus this avoids the complication of actually
+ knowing what was the right encoding. */
+
+static inline bool
+cp_ident_is_alpha (unsigned char ch)
+{
+ return ISALPHA (ch) || ch >= 0x80;
+}
+
+/* Similarly, but Like ISALNUM. */
+
+static inline bool
+cp_ident_is_alnum (unsigned char ch)
+{
+ return ISALNUM (ch) || ch >= 0x80;
+}
+
/* Find the end of a symbol name starting at LEXPTR. */
static const char *
@@ -1311,7 +1333,7 @@ symbol_end (const char *lexptr)
{
const char *p = lexptr;
- while (*p && (ISALNUM (*p) || *p == '_' || *p == '$' || *p == '.'))
+ while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
p++;
return p;
@@ -1791,7 +1813,7 @@ yylex (void)
return ERROR;
}
- if (!(c == '_' || c == '$' || ISALPHA (c)))
+ if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
{
/* We must have come across a bad character (e.g. ';'). */
yyerror (_("invalid character"));
@@ -1802,7 +1824,7 @@ yylex (void)
namelen = 0;
do
c = tokstart[++namelen];
- while (ISALNUM (c) || c == '_' || c == '$');
+ while (cp_ident_is_alnum (c) || c == '_' || c == '$');
lexptr += namelen;
diff --git a/gdb/dwarf2read.c b/gdb/dwarf2read.c
index 5437d21..96c1393 100644
--- a/gdb/dwarf2read.c
+++ b/gdb/dwarf2read.c
@@ -4188,6 +4188,79 @@ gdb_index_symbol_name_matcher::matches (const char *symbol_name)
return false;
}
+/* Starting from a search name, return the string that finds the upper
+ bound of all strings that start with SEARCH_NAME in a sorted name
+ list. Returns the empty string to indicate that the upper bound is
+ the end of the list. */
+
+static std::string
+make_sort_after_prefix_name (const char *search_name)
+{
+ /* When looking to complete "func", we find the upper bound of all
+ symbols that start with "func" by looking for where we'd insert
+ the closest string that would follow "func" in lexicographical
+ order. Usually, that's "func"-with-last-character-incremented,
+ i.e. "fund". Mind non-ASCII characters, though. Usually those
+ will be UTF-8 multi-byte sequences, but we can't be certain.
+ Especially mind the 0xff character, which is a valid character in
+ non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't
+ rule out compilers allowing it in identifiers. Note that
+ conveniently, strcmp/strcasecmp are specified to compare
+ characters interpreted as unsigned char. So what we do is treat
+ the whole string as a base 256 number composed of a sequence of
+ base 256 "digits" and add 1 to it. I.e., adding 1 to 0xff wraps
+ to 0, and carries 1 to the following more-significant position.
+ If the very first character in SEARCH_NAME ends up incremented
+ and carries/overflows, then the upper bound is the end of the
+ list. The string after the empty string is also the empty
+ string.
+
+ Some examples of this operation:
+
+ SEARCH_NAME => "+1" RESULT
+
+ "abc" => "abd"
+ "ab\xff" => "ac"
+ "\xff" "a" "\xff" => "\xff" "b"
+ "\xff" => ""
+ "\xff\xff" => ""
+ "" => ""
+
+ Then, with these symbols for example:
+
+ func
+ func1
+ fund
+
+ completing "func" looks for symbols between "func" and
+ "func"-with-last-character-incremented, i.e. "fund" (exclusive),
+ which finds "func" and "func1", but not "fund".
+
+ And with:
+
+ funcÿ (Latin1 'ÿ' [0xff])
+ funcÿ1
+ fund
+
+ completing "funcÿ" looks for symbols between "funcÿ" and "fund"
+ (exclusive), which finds "funcÿ" and "funcÿ1", but not "fund".
+
+ And with:
+
+ ÿÿ (Latin1 'ÿ' [0xff])
+ ÿÿ1
+
+ completing "ÿ" or "ÿÿ" looks for symbols between between "ÿÿ" and
+ the end of the list.
+ */
+ std::string after = search_name;
+ while (!after.empty () && (unsigned char) after.back () == 0xff)
+ after.pop_back ();
+ if (!after.empty ())
+ after.back () = (unsigned char) after.back () + 1;
+ return after;
+}
+
/* Helper for dw2_expand_symtabs_matching that works with a
mapped_index instead of the containing objfile. This is split to a
separate function in order to be able to unit test the
@@ -4303,21 +4376,20 @@ dw2_expand_symtabs_matching_symbol
{
if (lookup_name_in.completion_mode ())
{
- /* The string frobbing below won't work if the string is
- empty. We don't need it then, anyway -- if we're
- completing an empty string, then we want to iterate over
- the whole range. */
- if (cplus[0] == '\0')
+ /* In completion mode, we want UPPER to point past all
+ symbols names that have the same prefix. I.e., with
+ these symbols, and completing "func":
+
+ function << lower bound
+ function1
+ other_function << upper bound
+
+ We find the upper bound by looking for the insertion
+ point of "func"-with-last-character-incremented,
+ i.e. "fund". */
+ std::string after = make_sort_after_prefix_name (cplus);
+ if (after.empty ())
return end;
-
- /* In completion mode, increment the last character because
- we want UPPER to point past all symbols names that have
- the same prefix. */
- std::string after = cplus;
-
- gdb_assert (after.back () != 0xff);
- after.back ()++;
-
return std::upper_bound (lower, end, after.c_str (),
lookup_compare_upper);
}
@@ -4493,6 +4565,26 @@ static const char *test_symbols[] = {
"ns::foo<int>",
"ns::foo<long>",
+ /* These are used to check that the increment-last-char in the
+ matching algorithm for completion doesn't match "t1_fund" when
+ completing "t1_func". */
+ "t1_func",
+ "t1_func1",
+ "t1_fund",
+ "t1_fund1",
+
+ /* A UTF-8 name with multi-byte sequences to make sure that
+ cp-name-parser understands this as a single identifier ("função"
+ is "function" in PT). */
+ u8"u8função",
+
+ /* \377 (0xff) is Latin1 'ÿ'. */
+ "yfunc\377",
+
+ /* \377 (0xff) is Latin1 'ÿ'. */
+ "\377",
+ "\377\377123",
+
/* A name with all sorts of complications. Starts with "z" to make
it easier for the completion tests below. */
#define Z_SYM_NAME \
@@ -4551,6 +4643,22 @@ run_test ()
{});
}
+ /* Check that the name matching algorithm for completion doesn't get
+ confused with Latin1 'ÿ' / 0xff. */
+ {
+ static const char str[] = "\377";
+ CHECK_MATCH (str, symbol_name_match_type::FULL, true,
+ EXPECT ("\377", "\377\377123"));
+ }
+
+ /* Check that the increment-last-char in the matching algorithm for
+ completion doesn't match "t1_fund" when completing "t1_func". */
+ {
+ static const char str[] = "t1_func";
+ CHECK_MATCH (str, symbol_name_match_type::FULL, true,
+ EXPECT ("t1_func", "t1_func1"));
+ }
+
/* Check that completion mode works at each prefix of the expected
symbol name. */
{
--
2.5.5