subunit

removing \r from streams causes ValueError

Bug #505078 reported by Martin Pool on 2010-01-09

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	subunit	Fix Released	Wishlist	Martin Pool

Bug Description

It turns out that this crash occurs if the linefeeds are missing from the multipart section.

  File "/home/mbp/subunit/trunk/python/subunit/details.py", line 109, in lineReceived
    self._parse_state(line)
  File "/home/mbp/subunit/trunk/python/subunit/details.py", line 92, in _feed_chunks
    residue = self._chunk_parser.write(line)
  File "/home/mbp/subunit/trunk/python/subunit/chunked.py", line 117, in write
    return self.state()
  File "/home/mbp/subunit/trunk/python/subunit/chunked.py", line 90, in _read_length
    self.body_length = int(count_str[:-2], 16)
ValueError: invalid literal for int() with base 16: ''

The root cause is that subunits current default stream depends on 8-bit safe, non-cr/lf altering transports. Changing the stream to be more conservative - e.g. not depend on \r\n characters being unaltered, or even on whitespace being unaltered, would avoid this. Some possibilities are uuencoding, json etc. However we need to be able transport 8-bit whitespace and \r, \n containing attachments, and we need to be able to stream things out - a self delimiting format is essential.

See original description

Related branches

lp:~mbp/subunit/505078-chunked

Merged into lp:~subunit/subunit/trunk at revision 139

Jonathan Lange: Approve on 2011-01-28

Martin Pool: Pending requested 2011-01-28

Revision history for this message

Martin Pool (mbp) wrote on 2010-01-09:

failure-with-traceback.subunit Edit (837 bytes, text/plain)

Revision history for this message

Martin Pool (mbp) wrote on 2010-01-09: Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

The root cause of this is that the subunit protocol and implementation requires the file contain plan carriage returns in some places, and crlf in others. If you lose the linefeeds, you get this error.

In a sense this is user error: I copied the text from a terminal which lost the linefeeds. However requiring this seems to make things unnecessarily difficult for something that's meant to be a human-readable interchange format.

summary:	- crashes with "invalid literal for int() with base 16: ''" + crashes with "invalid literal for int() with base 16: ''" if linefeeds + are missing from the file
description:	updated

Revision history for this message

Robert Collins (lifeless) wrote on 2010-01-09: Re: [Bug 505078] [NEW] crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

On Sat, 2010-01-09 at 08:36 +0000, Martin Pool wrote:
> Public bug reported:
>
> It turns out that this crash occurs if the linefeeds are missing from
> the multipart section.

Well the line feeds are part of the chunked specification - HTTP - that
subunit follows. So, is something stripping them? subunit should be
emitting them.

-Rob

Revision history for this message

Robert Collins (lifeless) wrote on 2010-01-09:

On Sat, 2010-01-09 at 08:36 +0000, Martin Pool wrote:
> Public bug reported:
>
> It turns out that this crash occurs if the linefeeds are missing from
> the multipart section.

Oh also, if something is stripping them and we can reasonably expect
this to happen, then I'm very happy to make subunit's parser be more
tolerant - a subunit stream isn't http itself, after all, and I don't
think a more tolerant parser will lead to security issues in this
context.

-Rob

Revision history for this message

Robert Collins (lifeless) wrote on 2010-01-09: Re: [Bug 505078] Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

On Sat, 2010-01-09 at 08:47 +0000, Martin Pool wrote:
> The root cause of this is that the subunit protocol and implementation
> requires the file contain plan carriage returns in some places, and
> crlf
> in others. If you lose the linefeeds, you get this error.
>
> In a sense this is user error: I copied the text from a terminal which
> lost the linefeeds. However requiring this seems to make things
> unnecessarily difficult for something that's meant to be a human-
> readable interchange format.

So, the following:
3\r\n
0\r\n
0\r\n

(a single chunk consisting of '0\r\n') if copied and pasted will fail to
parse even with a more relaxed parse because it would become:
3\n
0\n
0\n

Now, I appreciate that many attachments will not contain \r and so not
fail in this way - but tabs are also lost by some terminals.

Sadly, the tensions:
- relatively easy for humans to read
- binary attachment safe

have caused simple copy and paste to stop being as robust. Now, I've
documented the multipart stuff as experimental, to permit it to be
changed if needed. We could uuencode the contents of the attachments,
but that would make them considerably less human inspectable IMO.

-Rob

Revision history for this message

Martin Pool (mbp) wrote on 2010-01-11: Re: [Bug 505078] Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

Copying from the gnome terminal into gvim lost the linefeeds. I don't
know whose fault it is, and in a sense it doesn't matter because
similar problems will occur in different situations, such as

* pasting data from a failing test run into an email or bug report
* opening the failing run in an editor, trimming it, and saving it
* pasting from a failed run in a terminal, or mail from pqm, or a
failure in a web page
* new implementations of the subunit protocol
* running existing implementations connected to a file stream that
does eol conversion

I think supporting these things are useful. Obviously there is no end
to how much text can be mangled and it would probably be asking too
much that subunit be robust against programs wrapping the text or
deleting whitespace, though both of those will eventually happen.

A format that specifically requires both \n and \r\n at different
points is pretty perverse: it looks like text but it can't actually be
treated as such. I think the smallest sensible fix would be to
define it to be all unix form, and give a clean error if it's not
transmitted as such.

> So, the following:
> 3\r\n
> 0\r\n
> 0\r\n
>
> (a single chunk consisting of '0\r\n') if copied and pasted will fail to
> parse even with a more relaxed parse because it would become:
> 3\n
> 0\n
> 0\n
>
> Now, I appreciate that many attachments will not contain \r and so not
> fail in this way - but tabs are also lost by some terminals.
>
> Sadly, the tensions:
> - relatively easy for humans to read
> - binary attachment safe
>
> have caused simple copy and paste to stop being as robust. Now, I've
> documented the multipart stuff as experimental, to permit it to be
> changed if needed. We could uuencode the contents of the attachments,
> but that would make them considerably less human inspectable IMO.

You're already partly using mime here, so I think you might as well do
it the rest of the way, by specifying a content-encoding for the
attachments. They could be quoted-printable for things that are
mostly text (like tracebacks) but need to be sent byte-for-byte, plain
text for things like tracebacks, and base64 for tarballs.

--
Martin <http://launchpad.net/~mbp/>

Copying from the gnome terminal into gvim lost the linefeeds.  I don't
know whose fault it is, and in a sense it doesn't matter because
similar problems will occur in different situations, such as

I think supporting these things are useful.  Obviously there is no end
to how much text can be mangled and it would probably be asking too
much that subunit be robust against programs wrapping the text or
deleting whitespace, though both of those will eventually happen.

A format that specifically requires both \n and \r\n at different
points is pretty perverse: it looks like text but it can't actually be
treated as such.   I think the smallest sensible fix would be to
define it to be all unix form, and give a clean error if it's not
transmitted as such.

> So, the following:
> 3\r\n
> 0\r\n
> 0\r\n
>
> (a single chunk consisting of '0\r\n') if copied and pasted will fail to
> parse even with a more relaxed parse because it would become:
> 3\n
> 0\n
> 0\n
>
> Now, I appreciate that many attachments will not contain \r and so not
> fail in this way - but tabs are also lost by some terminals.
>
> Sadly, the tensions:
>  - relatively easy for humans to read
>  - binary attachment safe
>
> have caused simple copy and paste to stop being as robust. Now, I've
> documented the multipart stuff as experimental, to permit it to be
> changed if needed. We could uuencode the contents of the attachments,
> but that would make them considerably less human inspectable IMO.

You're already partly using mime here, so I think you might as well do
it the rest of the way, by specifying a content-encoding for the
attachments.  They could be quoted-printable for things that are
mostly text (like tracebacks) but need to be sent byte-for-byte, plain
text for things like tracebacks, and base64 for tarballs.

-- 
Martin <http://launchpad.net/~mbp/>

Revision history for this message

Robert Collins (lifeless) wrote on 2010-01-11: Re: [Bug 505078] Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

On Mon, 2010-01-11 at 04:48 +0000, Martin Pool wrote:
>
> You're already partly using mime here, so I think you might as well do
> it the rest of the way, by specifying a content-encoding for the
> attachments. They could be quoted-printable for things that are
> mostly text (like tracebacks) but need to be sent byte-for-byte, plain
> text for things like tracebacks, and base64 for tarballs.

Do you mean transfer encodings perhaps? They already do specify a
content encoding?

The thing breaking though, isn't actually the encoded content, its the
wrapper - the http chunking (which is a per-spec implementation of http
chunking, precisely to avoid implementing something new, or something
complex).

This isn't to say that you're wrong, just being clear about where the
failure is occuring.

-Rob

Revision history for this message

Martin Pool (mbp) wrote on 2010-01-11: Re: [Bug 505078] Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

2010/1/11 Robert Collins <email address hidden>:
> On Mon, 2010-01-11 at 04:48 +0000, Martin Pool wrote:
>>
>> You're already partly using mime here, so I think you might as well do
>> it the rest of the way, by specifying a content-encoding for the
>> attachments. They could be quoted-printable for things that are
>> mostly text (like tracebacks) but need to be sent byte-for-byte, plain
>> text for things like tracebacks, and base64 for tarballs.
>
> Do you mean transfer encodings perhaps? They already do specify a
> content encoding?

really? where?

they seem to only specify a content-type at the moment.

istm that what is wanted here is a content-transfer-encoding which
<http://en.wikipedia.org/wiki/MIME#Content-Transfer-Encoding>

> 1. It indicates whether or not a binary-to-text encoding scheme has been used on top of the original encoding as specified within the Content-Type header, and
> 2. If such a binary-to-text encoding method has been used it states which one.

however, this doesn't mark boundaries between attachments.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message

Robert Collins (lifeless) wrote on 2010-01-11: Re: [Bug 505078] Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

On Mon, 2010-01-11 at 07:01 +0000, Martin Pool wrote:

> > Do you mean transfer encodings perhaps? They already do specify a
> > content encoding?
>
> really? where?

sorry yes, c-t only.

> they seem to only specify a content-type at the moment.
>
> istm that what is wanted here is a content-transfer-encoding which
> <http://en.wikipedia.org/wiki/MIME#Content-Transfer-Encoding>

Yes, something like that might work - I was using a conceptually similar
thing in using chunked encoding.

> > 1. It indicates whether or not a binary-to-text encoding scheme has been used on top of the original encoding as specified within the Content-Type header, and
> > 2. If such a binary-to-text encoding method has been used it states which one.
>
> however, this doesn't mark boundaries between attachments.

Currently they are self delimiting so thats not an issue.

I have the following constraints here:
- I want something that doesn't require arbitrary buffering per
attachment.
- It should be dead simple to output. E.g. the shell bindings should be
able to do it. (This is a preference not a hard requirement.)
- Parsing can be a bit harder. I feel strongly that it should be
something that already exists, so that new parsers don't need to be
written (or if they do need to be written they can be reused by other
things).
- It shouldn't add a lot of overhead in the common case: most channels
on the internet today (e.g. launchpad attachments) are 8 bit clean, and
test runs with 10's of thousands of entries could suffer significant
inflation if a large overhead scheme is chosen.

I appreciate that you want to be able to copy and paste segments of a
stream and like the flexability that that will offer: Do you think the
constraints above are reasonable?

-Rob

Revision history for this message

Martin Pool (mbp) wrote on 2010-01-12: Re: [Bug 505078] Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

#10

2010/1/11 Robert Collins <email address hidden>:
> I have the following constraints here:
> - I want something that doesn't require arbitrary buffering per
> attachment.
> - It should be dead simple to output. E.g. the shell bindings should be
> able to do it. (This is a preference not a hard requirement.)
> - Parsing can be a bit harder. I feel strongly that it should be
> something that already exists, so that new parsers don't need to be
> written (or if they do need to be written they can be reused by other
> things).
> - It shouldn't add a lot of overhead in the common case: most channels
> on the internet today (e.g. launchpad attachments) are 8 bit clean, and
> test runs with 10's of thousands of entries could suffer significant
> inflation if a large overhead scheme is chosen.
>
> I appreciate that you want to be able to copy and paste segments of a
> stream and like the flexability that that will offer: Do you think the
> constraints above are reasonable?

I think those are an excellent set of constraints.

I do want to be able to copy-and-paste a stream, or chop it up using
text-based tools. But even more than that, I don't want it to _look_
like I can and actually not. It would in some ways be better if it
was binary garbage than text with invisible but critical markers.

base64 of attachments seems to meet all of these: it streams; shell
programs can use base64 from gnu coreutils, and it is only about 30%
bigger than the input.

The main drawback, which is perhaps not negligible, is that it would
mean you could no longer directly read the attachments. Since the
attachments include the traceback, this would be a fairly severe
problem for eg "bzr selftest --subunit |subunit-filter|less". So the
question is whether there is another constraint that text attachments
should be directly readable in the stream, or whether the raw stream
is only for debugging.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message

Robert Collins (lifeless) wrote on 2010-01-12: Re: [Bug 505078] Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

#11

On Tue, 2010-01-12 at 02:24 +0000, Martin Pool wrote:
> So the
> question is whether there is another constraint that text attachments
> should be directly readable in the stream, or whether the raw stream
> is only for debugging.

I think its nifty to be able to read a raw stream via less but I don't
think its essential.

I'd rather have something you're happy with that you need an interpreter
to use (rather than debug visually) than the converse.

I will ask that we keep the parser for the current chunked attachments
around for a couple of releases, to ease transitions.

-Rob

Revision history for this message

Martin Pool (mbp) wrote on 2010-01-12: Re: [Bug 505078] Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

#12

How would you feel about using mime multipart with separators? That
allows streaming, has little overhead, could even allow things as
unescaped 8bit, and will leave text readable.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message

Robert Collins (lifeless) wrote on 2010-01-12: Re: [Bug 505078] Re: crashes with "invalid literal for int() with base 16: ''" if linefeeds are missing from the file

#13

On Tue, 2010-01-12 at 04:21 +0000, Martin Pool wrote:
> How would you feel about using mime multipart with separators? That
> allows streaming, has little overhead, could even allow things as
> unescaped 8bit, and will leave text readable.

Sounds fine to me as long as the programming API (addDetails,
getDetails, and the Content and ContentType API's used by the details
dict passed to outcomes) remains the same. If it needs to change I may
still be happy, but I'd want to see the changes first :)

-Rob

Robert Collins (lifeless) on 2010-01-20

Changed in subunit:
status:	New → Triaged
importance:	Undecided → Wishlist
summary:	- crashes with "invalid literal for int() with base 16: ''" if linefeeds - are missing from the file + copy and paste or newline mangling can damage streams

Robert Collins (lifeless) on 2010-04-16

description:

updated

Revision history for this message

Martin Pool (mbp) wrote on 2010-04-20: Re: copy and paste or newline mangling can damage streams

#14

Just for other people following this, the problem also came up in bzr that win32 Python defaults to having stdout in text mode, in which case it damages streams in this format. See bug 551332.

Revision history for this message

Martin Pool (mbp) wrote on 2010-06-15:

#15

This also causes pain when subunit output is naively sent across mailing without an encoding that's guaranteed to be binary-safe, as we have at the moment with pqm.

I'm going to see about adding quopri encoding.

Changed in subunit:
assignee:	nobody → Martin Pool (mbp)

Revision history for this message

Martin Pool (mbp) wrote on 2010-06-15:

#16

I propose to add the following to the protocol spec:

=== modified file 'README'
--- README 2009-10-10 03:42:32 +0000
+++ README 2010-06-15 08:21:50 +0000
@@ -159,13 +159,17 @@
tags: [-]TAG ...
time: YYYY-MM-DD HH:MM:SSZ

-DETAILS ::= BRACKETED | MULTIPART
+DETAILS ::= BRACKETED | MULTIPART | QUOPRI_DETAILS
BRACKETED ::= '[' CR lines ']' CR
MULTIPART ::= '[ multipart' CR PART* ']' CR
PART ::= PART_TYPE CR NAME CR PART_BYTES CR
PART_TYPE ::= Content-Type: type/sub-type(;parameter=value,parameter=value)
PART_BYTES ::= (DIGITS CR LF BYTE{DIGITS})* '0' CR LF

+QUOPRI_DETAILS ::= '[ quopri' QP_EOL QP_LINE* QP_EOL ']' CR
+QP_EOL = CR | CR LF | LF
+QP_LINE = *QP_CHAR ['='] QP_EOL
+
unexpected output on stdout -> stdout.
exit w/0 or last test completing -> error

@@ -175,6 +179,14 @@
A '-' before a tag is used to remove tags - e.g. to prevent a global tag
applying to a single test, or to cancel a global tag.

+Quoted-Printable detail sections (QUOPRI_DETAILS) are supported only after
+0.0.5 but are recommended because they are more robust against line-ending
+conversions or other whitespace transformation of the stream. (See
+<https://launchpad.net/bugs/505078>.)
+The format is the same as that of RFC 1521 section 5.1 <http://tools.ietf.org/html/rfc1521.html#page-18>.
+The grammar from RFC 1521 is included by reference, with the clarification that
+any type of line-ending is permitted.
+
The progress directive is used to provide progress information about a stream
so that stream consumer can provide completion estimates, progress bars and so
on. Stream generators that know how many tests will be present in the stream

Revision history for this message

Martin Pool (mbp) wrote on 2010-06-15:

#17

actually that's not quite right because it doesn't include the content-type header - but see the attached branch

Revision history for this message

Robert Collins (lifeless) wrote on 2010-06-16:

#18

Its important that we capture the new facilities the multipart encoding added:
- content type
- attachment name (perhaps generalised to content disposition this time around)
- multiple attachments
- binary safe

I have no attachment to any particular escaping mechanism used, but the quopri proposal you've supplied here seems to be lacking three of the four things I've listed here, while delivering the fourth very nicely :)

Something combining the two would be most excellent.

Revision history for this message

Robert Collins (lifeless) wrote on 2010-06-16:

#19

Oh, and bonus points for:
- streams - handles iso's etc without the presumption of complete buffering
- easy to write C parser for

QP as a substrate seems to match these things which is why I didn't call it out particularly, but if you were to grab something different, they matter too.

Revision history for this message

Martin Pool (mbp) wrote on 2010-06-16: Re: [Bug 505078] Re: copy and paste or newline mangling can damage streams

#20

On 16 June 2010 11:12, Robert Collins <email address hidden> wrote:
> Its important that we capture the new facilities the multipart encoding added:
> - content type
> - attachment name (perhaps generalised to content disposition this time around)
> - multiple attachments
> - binary safe
>
> I have no attachment to any particular escaping mechanism used, but the
> quopri proposal you've supplied here seems to be lacking three of the
> four things I've listed here, while delivering the fourth very nicely :)
>
> Something combining the two would be most excellent.

Actually it did (in the branch) have the content-type, but needs 2 and 3.

Maybe there is a more mime-y way to give the attachment name.

--
Martin

Revision history for this message

Martin Pool (mbp) wrote on 2011-01-10: Re: copy and paste or newline mangling can damage streams

#21

Happy first birthday, bug!

Hooking in a heavyweight mime format is probably possible but turns out to be just a bit difficult, both in terms of defining the grammar, and in hooking it up to Python mime libraries which rather assume they're parsing a mail. It's definitely possible but given I didn't do it in all of last year I probably won't get to it.

So I propose to just do the much smaller fix, which is to tolerate missing \r on the chunk marker lines. The result won't require a change to the grammar and will stay compatible with earlier releases. It doesn't make the stream safe if something is arbitrarily mangling it, but doing so probably requires much stronger encoding (like uuencode) than we really want here. It will make it safer in the case I originally had which is just pasting streams holding plain text attachments into eg temporary files or source files. At any rate I think the current behaviour of crashing with an internal error is not really helping anyone there.

Martin Pool (mbp) on 2011-01-10

summary:	- copy and paste or newline mangling can damage streams + removing \r from streams causes ValueError
Changed in subunit:
status:	Triaged → In Progress