UnicodeDecodeError from bzr version if platform contains non-ASCII

Bug #1195783 reported by Toshio Kuratomi
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
High
Vincent Ladeuil

Bug Description

https://bugzilla.redhat.com/show_bug.cgi?id=979399

The bzr version command can traceback if there are non-ASCII characters in the system's platform information.

bzr: ERROR: exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 66: ordinal not in range(128)

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/bzrlib/commands.py", line 920, in exception_to_return_code
    return the_callable(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/bzrlib/commands.py", line 1131, in run_bzr
    ret = run(*run_argv)
  File "/usr/lib64/python2.7/site-packages/bzrlib/commands.py", line 673, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/usr/lib64/python2.7/site-packages/bzrlib/commands.py", line 695, in run
    return self._operation.run_simple(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/bzrlib/cleanup.py", line 136, in run_simple
    self.cleanups, self.func, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/bzrlib/cleanup.py", line 166, in _do_with_cleanups
    result = func(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/bzrlib/commands.py", line 1148, in ignore_pipe
    result = func(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/bzrlib/builtins.py", line 4215, in run
    show_version(to_file=self.outf)
  File "/usr/lib64/python2.7/site-packages/bzrlib/version.py", line 69, in show_version
    to_file.write(" Platform: %s\n" % platform.platform(aliased=1))
  File "/usr/lib64/python2.7/site-packages/bzrlib/ui/text.py", line 669, in write
    self.wrapped_stream.write(to_write)
  File "/usr/lib64/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 66: ordinal not in range(128)

bzr 2.5.1 on python 2.7.5 (Linux-3.9.6-301.fc19.x86_64-x86_64-with-
    fedora-19-Schrödinger’s_Cat)
arguments: ['/usr/bin/bzr', 'version']
plugins: bash_completion[2.5.1], bzrtools[2.5.0], changelog_merge[2.5.1],
    fastimport[0.13.0], launchpad[2.5.1], netrc_credential_store[2.5.1],
    news_merge[2.5.1], po_merge[2.5.1], weave_fmt[2.5.1]
encoding: 'utf-8', fsenc: 'UTF-8', lang: 'en_US.UTF-8'

*** Bazaar has encountered an internal error. This probably indicates a
    bug in Bazaar. You can help us fix it by filing a bug report at
        https://bugs.launchpad.net/bzr/+filebug
    including this traceback and a description of the problem.
Bazaar (bzr) 2.5.1
  Python interpreter: /usr/bin/python 2.7.5
  Python standard library: /usr/lib64/python2.7

Platform is being taken from the stdlib platform.platform() function. On Fedora this is reading the information from /etc/fedora-release. On Debian, this is read from /etc/debian-release. I'm not sure where it would be read for Ubuntu.

If the name inside of this file contains non-ASCII characters, then a non-ascii byte string (str) is passed into bzrlib.ui.text::TextUIOutputStream.write(). The byte str is then passed to self.wrapped_stream.write(). self.wrapped_stream is a codecs.StreamWriter object. This is a problem because StreamWriter objects can only handle unicode input (in the non-ascii range). [1]_

The easiest fix for this is to make sure to transform the string into unicode before handing it to the StreamWriter. Modifying ui.text.py::TextUIOutputStream.write() like this works:

    def write(self, to_write):
        self.ui_factory.clear_term()
        if isinstance(to_write, str):
            to_write = unicode(to_write, encoding='utf-8', errors='replace')
        self.wrapped_stream.write(to_write)

I'm not submitting this as a patch because most projects have their own helper function that does the unicode transformation and their own policies on what the default encoding should be (utf-8 and the user's locale setting being the top two choices I've seen). The important thing to note is that you have a suitable "errors=" setting so that bytes that are undecodable in the default chosen encoding do not end up causing a traceback when simply printing something for the user to read on the screen.

.. [1]_: http://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-4-now-it-doesn-t-take-byte-strings

Related branches

Revision history for this message
Toshio Kuratomi (toshio) wrote :

Two further comments to this issue:

First, an update to the code that I think is correct: I noticed that the TextUIOutputStream has an encoding attribute. I believe we can use this in the write() method something like this:

    def write(self, to_write):
        self.ui_factory.clear_term()
        if isinstance(to_write, str):
            # Assume ascii if encoding is undefined because it's the only encoding that is
            # guaranteed not to traceback in the StreamWriter.
            encoding = self.encoding or 'ascii'
            to_write = unicode(to_write, encoding=encoding, errors='replace')
        self.wrapped_stream.write(to_write)

You probably noticed the comment about falling back to ascii if encoding wasn't specified. I wasn't sure in what instances encoding wouldn't be set and whether we might be able to be a bit more lenient (and use utf-8 or the user's locale rather than ascii). I tried to run bzr selftests to determine that. Unfortunately, that lead to a second issue:

I ran bzr selftests both with and without the changes I mentioned to ui.text.py::TextUIOutputStream.write(). I found that there are a few selftest failures without the patch applied which point out the issue with using non-ascii byte str here but with the patch there are many more failures. After instrrumenting the code I found information like this:

45.5s write: type(to_write)=<type 'unicode'>
45.5s write: repr(to_write)='u\'"\\u0422\\u0435\\u0441\\u04422"\\n\''
45.5s write: self.wrapped_stream=<bzrlib.tests.StringIOWrapper object>

This is showing that the selftests that are failing with the patch are using a tests.StringIOWrapper instead of a StreamWriter. I believe that this means the test cases are incorrect. cStringIO can't accept unicode strings which are non-ascii:

http://docs.python.org/2/library/stringio.html#cStringIO.StringIO

So StringIOWrapper needs to be given a byte str object while StreamWriter needs to be given unicode type objects. The selftests seem to be incorrectly trying to substitute StringIOWrapper for StreamWriter so the tests themselves are broken, not the code.

Revision history for this message
Toshio Kuratomi (toshio) wrote :

Okay... here's a haaaaack that worksaround the tracebacks. I encountered a problem with the code. The comment in this sample explains the second problem as well. I'm going to apply this to the fedora package under the assumption that text/ui.py only affects the user interface and therefore it's fine to use errors='replace' there. If that's not the case, someone please stop me! I wouldn't want to corrupt someone's branch with replacement characters by mistake. (But bzr version issuing a traceback looks really bad for bzr so I figure it's worth the risk).

You could apply this to bzr if you want but the right way is probably much more invasive:

1) re-evaluate the test cases and switch to using some form of StreamWriter around a StringIO instead of a raw StringIO.
2) Either change your calling code to make sure that you always send byte str to test/ui.py::write() or add something to write() to make sure if the string is unicode it can be encoded in the StreamWriter's encoding.

        try:
            self.wrapped_stream.write(to_write)
        except UnicodeError:
            # Hack around several problems:
            # StreamWriters cannot handle non-ascii byte strs. So we have to
            # make sure that it is getting a unicode string.
            #
            # If we have a unicode string containing characters not available
            # in the stream's encoding, then we'll traceback here. So we have
            # to round trip to bytes and back to unicode, getting rid of the
            # characters we can't handle along the way.
            if isinstance(to_write, unicode):
                to_write = to_write.decode(self.encoding or 'ascii', errors='replace')
            to_write = unicode(to_write, encoding=self.encoding or 'ascii', errors='replace')
            self.wrapped_stream.write(to_write)

Without this patch, bzr selftests (with qbzr installed) results are:

  FAILED (failures=10, errors=29, known_failure_count=62)
  1093 tests skipped

With the patch, bzr selftests results are:
  FAILED (failures=7, errors=25, known_failure_count=62)
  1093 tests skipped

You might also be interested in these functions from the kitchen library:

* getwriter() -- A replacement for codecs.getwriter() that aims to provider a StreamWriter replacement that does not traceback.
  * Docs: http://pythonhosted.org/kitchen/api-text-converters.html#kitchen.text.converters.getwriter
  * Code: http://bzr.fedorahosted.org/bzr/kitchen/devel/annotate/head:/kitchen/text/converters.py#L301
* byte_string_valid_encoding -- Detect whether a byte str is valid in a particular encoding.
  * Docs: http://pythonhosted.org/kitchen/api-text-misc.html#kitchen.text.misc.byte_string_valid_encoding
  * Code: http://bzr.fedorahosted.org/bzr/kitchen/devel/annotate/head:/kitchen/text/misc.py#L343

If you don't want another dependency but would like to use some of that code, it is licensed under the LGPLv2+ so you should be able to copy it. Note that if signing the Canonical Contributor Agreement is a concern for you to do that, I have signed the Agreement but we'd have to look at who contributed to those functions to see whether I was the only contributor.

Revision history for this message
Vincent Ladeuil (vila) wrote :

@Toshi: Thanks a lot for you rhard work here and sorry for the delay !

I think there may be a simpler way to address this issue.

> This is a problem because StreamWriter objects can only handle unicode
> input (in the non-ascii range)

Then I think we should just provide it unicode no ?

We've been trying to use only unicode internally and get away with magical
auto-conversion (there is a long history being this choice, mgz is the most
knowledgeable in this area and will correct me if I'm wrong). In a nutshell,
the policy is to chose the encoding as close as possible to where it's
needed.

So, I think the fix should be to consider the content of the platform file
as being utf-8 and be done.

> I'm not submitting this as a patch because most projects have their own
> helper function that does the unicode transformation and their own
> policies on what the default encoding should be (utf-8 and the user's
> locale setting being the top two choices I've seen).

Right, the locale matters for the output but the root cause here is the
input.

Can you test the following patch:

=== modified file 'bzrlib/version.py'
--- bzrlib/version.py 2011-12-19 13:23:58 +0000
+++ bzrlib/version.py 2013-07-06 08:51:29 +0000
@@ -66,7 +66,8 @@

     to_file.write(" Python standard library:" + ' ')
     to_file.write(os.path.dirname(os.__file__) + '\n')
- to_file.write(" Platform: %s\n" % platform.platform(aliased=1))
+ to_file.write(" Platform: %s\n"
+ % platform.platform(aliased=1).decode('utf8'))

Changed in bzr:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Jeff Licquia (jeff-licquia) wrote :

I can confirm that the patch fixes the bug for me. Tested on Fedora 19 s390x.

Vincent Ladeuil (vila)
Changed in bzr:
assignee: nobody → Vincent Ladeuil (vila)
Vincent Ladeuil (vila)
Changed in bzr:
status: Confirmed → In Progress
Vincent Ladeuil (vila)
Changed in bzr:
milestone: none → 2.6b3
status: In Progress → Fix Released
Vincent Ladeuil (vila)
Changed in bzr:
milestone: 2.6b3 → 2.6.0
Revision history for this message
Toshio Kuratomi (toshio) wrote :

Thanks Vincent! This fix works on my box. Sorry for getting interrupted in getting back to this -- it's been a rough month in Fedora-land. I have some comments on the patch despite it solving the issue:

* It's a bit cleaner to combine unicode strings with unicode strings rather than unicode and byte str:

  - to_file.write(u" Platform: %s\n" % platform.platform(aliased=1).decode('utf8'))
  + to_file.write(u" Platform: %s\n" % platform.platform(aliased=1).decode('utf8'))

* You want to be sure to include an "errors=" parameter in case some distribution stores the platform information in a non-utf-8 encoding:

    platform.platform(aliased=1).decode('utf-8', errors='replace'))

* The other to_file.write() lines that take a variable are also susceptible to this issue as the variables could contain non-ascii characters (they seem to contain file paths which, on unix systems, can be nearly arbitrary bytes.)

* It's still possible to get tracebacks if the platform information contains characters that are not decodable in the encoding that to_file can handle. For instance, if to_file.encoding='ascii' and platform.platform is "Schrödinger's cat". (I believe that this is taken care of for the StreamWriter that handles version.py via the StreamWriter's errors attribute but it isn't cared for by the StringIO used in the unittests. So I'm not sure if this is something to fix in bzrlib.ui.text::TextUIOutputStream.write() (We don't know the encoding of the output stream before this so we can't fix it earlier) or by making the tests use a StreamWriter instead of a raw StringIO.)

* The bzr selftest output points at three more test failures caused by platform.platform() output. It looks like those all have the same line of code causing it though:

/usr/lib64/python2.7/site-packages/bzrlib/tests/__init__.py:report_tests_starting()

I suggest changing it like this:

        self.stream.write(
            u' bzr-%s python-%s %s\n' % (
                    bzrlib.version_string,
                    bzrlib._format_version_tuple(sys.version_info),
                    platform.platform(aliased=1).decode('utf-8', errors='replace'),
                    ))

It looks to me like bzrlib/__init__.py: _format_version_tuple() should also be changed to return unicode strings instead of byte str if it's used here. But that might have an impact on code in other places as well (mixing of unicode and str causes coercions from byte str to unicode string. Most UnicodeErrors are thrown from implicit coercions like this. So changing the output type of _format_version_tuple() might cause tracebacks in other code that uses _format_version_tuple().

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.