This is the mail archive of the archer@sourceware.org mailing list for the Archer project.



Re: Python pretty-printers and non-ASCII strings do not play well together :-(


Hi,

Here's my try at what's going wrong:

On Tue, 2008-11-04 at 17:39 -0800, Paul Pluzhnikov wrote:
> 258           common_val_print (((value_object *) self)->value, stb, 0, 0, 0,
<snip>
> 266       result = PyUnicode_Decode (s, strlen (s), host_charset (), NULL);

An aside: at first I thought calling host_charset here was wrong and
that it should be target_charset. On second thought, it's right as long
as common_val_print converts the string from target_charset to
host_charset. I think that's the case, but it's hard to follow what GDB
does to print a value. I'll add a comment at the call above explaining
this.

Anyway, what this call is doing is converting the string from GDB's host
charset (probably iso-8859-1 in your case, I think it's GDB's default)
to Unicode. Here, your non-ASCII character isn't a problem because it
exists in ISO-8859-1 and Python knows what to do with it.
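
In Python terms, that decoding step can be sketched roughly as follows.
This is only an illustration, assuming the host charset really is
ISO-8859-1; the sample bytes are made up and it is written for Python 3:

```python
# Byte 0xE1 is U+00E1 in ISO-8859-1. Every byte value maps to a code
# point in that charset, so this decode cannot fail.
raw = b"\xe1rbol"
text = raw.decode("iso-8859-1")

# ascii() keeps the output printable on any terminal.
print(ascii(text))
```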

> 364             if (PyUnicode_Check(res)) {
> (top)
> 366                     str = PyUnicode_AsEncodedString(res, NULL, NULL);
> (top)
> 367                     Py_DECREF(res);
> (top) p str
> $7 = (PyObject *) 0x0

PyUnicode_AsEncodedString encodes a Unicode string into a byte string in
a given charset. Since this call passes NULL as the encoding argument,
Python uses its default charset which is, unfortunately, ASCII. Since
the Unicode string contains a non-ASCII character, the conversion fails
and a UnicodeError exception is raised. That's why str is NULL in your
session above.
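
The same failure can be sketched at the Python level. Note the
assumption: Python 3 defaults to UTF-8 rather than ASCII, so this
sketch names the "ascii" codec explicitly to mimic the behaviour
described here:

```python
text = "\u00e1"  # one non-ASCII character, U+00E1

try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # UnicodeEncodeError is a subclass of UnicodeError.
    is_unicode_error = isinstance(exc, UnicodeError)
    print(exc)
```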

> > What do you want to know?  Both Thiago and I have worked in this area,
> > maybe one of us knows.
> 
> How to turn raw buffer contents with unprintable characters into something
> which will print as "\xef\xcd\xab" :)

Tromey mentioned that if you set host-charset to ASCII, that's what GDB
will do. If I followed correctly what it does to print a value, in
valpy_str the call to common_val_print will convert the string from
target-charset to host-charset (I believe the magic happens in
c_emit_char) and PyUnicode_Decode will receive a pure ASCII string, with
the non-ASCII chars escaped.
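
To illustrate the kind of escaping meant here, a minimal Python 3
sketch follows. escape_bytes is a hypothetical helper written for this
example; it is not GDB's routine and only approximates what an
ASCII host charset would produce:

```python
def escape_bytes(buf):
    """Render bytes as ASCII text, escaping anything not printable ASCII."""
    out = []
    for b in buf:  # iterating over bytes yields ints in Python 3
        if b == 0x5C:
            out.append("\\\\")         # keep literal backslashes unambiguous
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))         # printable ASCII passes through
        else:
            out.append("\\x%02x" % b)  # everything else becomes \xNN
    return "".join(out)

print(escape_bytes(b"\xef\xcd\xab"))
```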

What you hit is a shortcoming in Python itself, due to the fact that it
has ASCII as its default charset. I can reproduce the problem in a
Python interpreter:

% python
Python 2.5.2 (r252:60911, Jun 25 2008, 17:58:32)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "á"
>>> print a
á
>>> print str(a)
á
>>> b = u"á"
>>> print b
á
>>> print str(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
>>> print str(b.encode("utf8"))
á

The lesson to learn here is to never use str on a Unicode string. :-/
This is a known limitation of Python. I talked about this issue in:

http://sourceware.org/ml/gdb/2008-07/msg00037.html
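
One way to follow that advice is to always encode explicitly and pick
an error handler, instead of relying on the default charset. A minimal
sketch, where encode_for_output is a hypothetical helper name and the
choice of "backslashreplace" is an assumption about what output is
acceptable:

```python
def encode_for_output(text, charset="utf-8"):
    """Encode text explicitly; unencodable characters become escapes."""
    return text.encode(charset, "backslashreplace")

print(encode_for_output("\u00e1"))           # encodable in UTF-8
print(encode_for_output("\u00e1", "ascii"))  # falls back to an escape
```

With "backslashreplace" the call never raises for a character the
charset cannot represent, at the cost of showing an escape sequence
instead of the character.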

-- 
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center

