bzr serve doesn't drop idle/dead connections

Bug #824797 reported by Michael J. Flickinger
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
High
John A Meinel

Bug Description

When used with xinetd, hpss won't auto-kill processes with idle sockets.

Other inetd processes, such as IMAPd, will auto terminate idle connections.

On Savannah, for example, I limited the amount of available hpss processes to 40, but idle connections build up and we quickly develop a "denial of service" problem specific to bzr hpss.

I propose a three minute timeout on idle hpss sockets, similar to what's implemented by default with IMAP.

$ time telnet bzr.sv.gnu.org 2401
Trying 140.186.70.72...
Connected to bzr.sv.gnu.org.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

real 60m29.058s
user 0m0.008s
sys 0m0.004s

Tags: hpss serve

Related branches

security vulnerability: yes → no
visibility: private → public
Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 824797] [NEW] hpss hangs on idle connection

there are three related bugs:

bug 819604: reconnect when connection drops
bug 795025: allow admins to ask the server to close the connection
after it finishes the current request
bug 824797: drop connections that have been idle for a while

summary: - hpss hangs on idle connection
+ bzr serve doesn't drop idle/dead connections
tags: added: serve
Changed in bzr:
status: New → Confirmed
importance: Undecided → High
John A Meinel (jameinel)
Changed in bzr:
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
John A Meinel (jameinel) wrote :

I have a branch which seems to add support for it. There are still bits that need to be worked on:

1) I set the default idle timeout to 300s (5 min). Which seems an ok start point.
2) It should be configurable, though.
3) I only timeout waiting for the start of a new RPC. I *think* that is reasonable. The main concern about faulty timeouts is initial push for a lot of ancestry. Sometimes it can take a while for us to work out what needs to be sent.
4) bug #819604 should be a good counterpoint, though. Which will allow clients to attempt to reconnect.
5) Currently my patch seems to work, however it causes a fair amount of headache in the test suite. If you set the timeout low you get timeouts while running (*very bad*), but if you set it high then you seem to get "hung threads" (bad). I have to check the 'close_server()' code to figure out why we're hanging. (If you do select.select([r], [], []) and the far side of r has been closed, would select hang forever?, maybe I need to pass it in the last list as well.)

Changed in bzr:
status: Confirmed → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

Some odd bits:
1) Just running: bzr selftest -s bt.per_branch.test_branch.TestBranch Remote -v
  If I call select.select() with a timeout, then some tests (~randomly) will timeout after say 4s. (If I set the timeout >5s the hung-thread code gets invoked.)
  If I just remove that call, all the tests succeed, without any 'leaking threads' with no test taking more than 400ms (72 tests in 9s total).
2) So *something* about osutils.read_bytes_from_socket (wrapper around socket.recv) seems happy to return an empty string quickly, but the same trigger is causing select.select( ) to hang until timeout.

3) Just closing 'client_sock' (the client-side of the server_sock/client_sock pair) seems to be enough to trigger select to say that it *can* read on the socket (though it returns the empty string.)

AFAICT, even though the socket is set to blocking mode, sock.recv() is returning an empty string, but select(sock) is not returning to indicate the socket has bytes/is closed.

Revision history for this message
John A Meinel (jameinel) wrote :

The tests which fail is *not* deterministic. I don't know what the race condition is. Maybe if the socket is closed after the select started? My test with close() && select() clearly indicated it returned immediately. But maybe select() & close() would fail. Will do a manual test.

Revision history for this message
John A Meinel (jameinel) wrote :

select([c], [], [], 10.0) in terminal A and s.close() in terminal B will cause select to return before the timeout, and s.recv() to return the empty string. Is this a 'not-guaranteed-but-likely' behaviour?

Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 824797] Re: bzr serve doesn't drop idle/dead connections

On 13 September 2011 23:23, John A Meinel <email address hidden> wrote:
> select([c], [], [], 10.0) in terminal A and s.close() in terminal B will
> cause select to return before the timeout, and s.recv() to return the
> empty string. Is this a 'not-guaranteed-but-likely' behaviour?

You very likely want to select on c for errors as well as readability
(ie put it in the third list.)

If the remote end closes while select is waiting I think you are
guaranteed to get that behaviour - select will return communicating
"you can read" and then what you actually read is an EOF.

> AFAICT, even though the socket is set to blocking mode, sock.recv() is
returning an empty string, but select(sock) is not returning to indicate
the socket has bytes/is closed.

I wonder if this is a race where the eof arrives before you do the
select? I can't remember if that can actually happen.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 9/15/2011 2:46 AM, Martin Pool wrote:
> On 13 September 2011 23:23, John A Meinel <email address hidden>
> wrote:
>> select([c], [], [], 10.0) in terminal A and s.close() in
>> terminal B will cause select to return before the timeout, and
>> s.recv() to return the empty string. Is this a
>> 'not-guaranteed-but-likely' behaviour?
>
> You very likely want to select on c for errors as well as
> readability (ie put it in the third list.)
>
> If the remote end closes while select is waiting I think you are
> guaranteed to get that behaviour - select will return communicating
> "you can read" and then what you actually read is an EOF.

Right, it is even documented as such.

What errors could we get, so I can write a test case for it?

>
>> AFAICT, even though the socket is set to blocking mode,
>> sock.recv() is
> returning an empty string, but select(sock) is not returning to
> indicate the socket has bytes/is closed.
>
> I wonder if this is a race where the eof arrives before you do the
> select? I can't remember if that can actually happen.
>

It seems that it is happening after we start select.select() and then
the *local* socket is getting closed by the test suite teardown code.
After which, select doesn't trigger 'closed'. If you run select again,
you actually get EBADF.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk5yeIAACgkQJdeBCYSNAAM+vACggkQ68F7RVTuDAopKmdg023dQ
dDoAnj45SEjw+oYCq+ywMCzHtBVO0eAe
=0dqR
-----END PGP SIGNATURE-----

Revision history for this message
Martin Pool (mbp) wrote :

On 16 September 2011 08:13, John Arbash Meinel <email address hidden> wrote:
>> If the remote end closes while select is waiting I think you are
>> guaranteed to get that behaviour - select will return communicating
>> "you can read" and then what you actually read is an EOF.
>
> Right, it is even documented as such.
>
> What errors could we get, so I can write a test case for it?

The most obvious would be ECONNRESET indicating a network problem,
though this might be a bit hard to actually produce from a test suite,
since it's normally from below the application layer. Possibly EBADF
when the local end of the socket is closed would come through there.

> It seems that it is happening after we start select.select() and then
> the *local* socket is getting closed by the test suite teardown code.
> After which, select doesn't trigger 'closed'. If you run select again,
> you actually get EBADF.

Revision history for this message
Andrew Bennetts (spiv) wrote :

On Tue, Sep 13, 2011 at 12:31:29PM -0000, John A Meinel wrote:
> (If you do select.select([r], [], []) and the far side of r has been closed,
> would select hang forever?, maybe I need to pass it in the last list as well.)

Beware Windows (and diverse platforms in general): the third argument to select
has a subtly different meaning there. I forget the exact details, just that
Linux tends to use the exceptfds set for fewer things (e.g. when a TCP URG bit
was received?), and Windows tends to signal some connection errors there as
well.

Basically, be extra sure to test on both Windows and Linux :)

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 9/16/2011 2:16 AM, Andrew Bennetts wrote:
> On Tue, Sep 13, 2011 at 12:31:29PM -0000, John A Meinel wrote:
>> (If you do select.select([r], [], []) and the far side of r has
>> been closed, would select hang forever?, maybe I need to pass it
>> in the last list as well.)
>
> Beware Windows (and diverse platforms in general): the third
> argument to select has a subtly different meaning there. I forget
> the exact details, just that Linux tends to use the exceptfds set
> for fewer things (e.g. when a TCP URG bit was received?), and
> Windows tends to signal some connection errors there as well.
>
> Basically, be extra sure to test on both Windows and Linux :)
>

So, I can understand the desire to pass the file descriptor to
'errors'. But what do you actually *do* if it returns something there:

rs, _, xs = select.select([fd], [], [fd], timeout)

if xs:
  ????

Do I:

1) Ignore xs, but assume that it makes it more likely that select()
   will raise an exception?

2) Treat xs having a file descriptor as meaning the client has timed
   out, and thus we should just clean up and go home?

3) Try to do something like read on the descriptor in order to figure
   out what is actually wrong with it?

The only place I've been able to find anything about errfds was:
http://www.mkssoftware.com/docs/man3/select.3.asp

Where it says:

If a process is blocked on a select() while waiting for input from a
socket and the sending process closes the socket, the select() notes
this as an exception rather than as data. Therefore, if the select()
is not currently looking for exceptions, it waits indefinitely.

However, my empirical testing disproves this. Specifically, if the
client process closes their end of the socket, select() returns that
read is no longer blocked, and if you recv() on it, it returns there
are no bytes to be consumed (thus you must be closed).

Now maybe select(, , errfds) is very implementation specific. I don't
have any problem using it, but I have no idea what to *do* with the
results.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk5zVMQACgkQJdeBCYSNAAMNHACg2X/pJ4ECkf3yxkTLRu3b8zxM
FJQAn2eNvhRkXxLzSW4At71eYbzOBo9t
=Y73m
-----END PGP SIGNATURE-----

Revision history for this message
Andrew Bennetts (spiv) wrote :

On Fri, Sep 16, 2011 at 03:53:08PM +0200, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 9/16/2011 2:16 AM, Andrew Bennetts wrote:
> > On Tue, Sep 13, 2011 at 12:31:29PM -0000, John A Meinel wrote:
> >> (If you do select.select([r], [], []) and the far side of r has
> >> been closed, would select hang forever?, maybe I need to pass it
> >> in the last list as well.)
> >
> > Beware Windows (and diverse platforms in general): the third
> > argument to select has a subtly different meaning there. I forget
> > the exact details, just that Linux tends to use the exceptfds set
> > for fewer things (e.g. when a TCP URG bit was received?), and
> > Windows tends to signal some connection errors there as well.
> >
> > Basically, be extra sure to test on both Windows and Linux :)
> >
>
> So, I can understand the desire to pass the file descriptor to
> 'errors'. But what do you actually *do* if it returns something there:
>
> rs, _, xs = select.select([fd], [], [fd], timeout)
>
> if xs:
> ????
>
> Do I:
>
> 1) Ignore xs, but assume that it makes it more likely that select()
> will raise an exception?

I don't understand why you'd expect more exceptions to be raised in this
case.

> 2) Treat xs having a file descriptor as meaning the client has timed
> out, and thus we should just clean up and go home?

I don't think that's a good assumption.

> 3) Try to do something like read on the descriptor in order to figure
> out what is actually wrong with it?

That's a reasonable approach.

With either 1 or 3, the difference vs. not passing any fds to the 3rd
arg is that there are fewer cases where select will hang until the
timeout. Precisely which cases those are varies by platform,
unfortunately.

[…]
> Now maybe select(, , errfds) is very implementation specific. I don't
> have any problem using it, but I have no idea what to *do* with the
> results.

It is, as I mentioned in my original mail. e.g. IIRC a failure of a
connect(2) is signalled via exceptfds on Windows but not on Linux.

As for what to do with it: I was responding to your concern that select
might “hang forever” in some cases. If you want select to wake up when
any event happens on a fd, then you need to ask select to notify you of
all events. I haven't taken the time to get familiar with this
particular case, so it's certainly possible that just passing reader_fds
is sufficient. But if you're not sure, I'd err on the side of caution,
because hangs are a very frustrating failure mode.

You could also try looking at Twisted's selectreactor to see how it
interprets the result of select in general and w.r.t. to specific
platforms.

-Andrew.

Revision history for this message
Martin Pool (mbp) wrote :

On 19 September 2011 09:57, Andrew Bennetts
<email address hidden> wrote:
>> So, I can understand the desire to pass the file descriptor to
>> 'errors'. But what do you actually *do* if it returns something there:
>>
>> rs, _, xs = select.select([fd], [], [fd], timeout)
>>
>> if xs:
>>   ????
>>
>> Do I:
>>
>> 1) Ignore xs, but assume that it makes it more likely that select()
>>    will raise an exception?
>
> I don't understand why you'd expect more exceptions to be raised in this
> case.

It doesn't make `select` more likely to raise an exception. However,
it does make it more likely to _return_ when an error has occurred,
rather than just continuing to block.

>> 2) Treat xs having a file descriptor as meaning the client has timed
>>    out, and thus we should just clean up and go home?
>
> I don't think that's a good assumption.

No, I think it only means "there's something for you to know about",
rather than telling you it's necessarily dead.

>> 3) Try to do something like read on the descriptor in order to figure
>>    out what is actually wrong with it?
>
> That's a reasonable approach.
>
> With either 1 or 3, the difference vs. not passing any fds to the 3rd
> arg is that there are fewer cases where select will hang until the
> timeout.  Precisely which cases those are varies by platform,
> unfortunately.

+1

Since you're only passing one fd in, the question of what to do with
xs is a bit moot. If select returns nothing (ie timeout) you do the
timeout case, otherwise the only thing you can do is read from the fd
and then you'll get data, an error, or eof.

You could log it if you get a xs, just in case.

> It is, as I mentioned in my original mail.  e.g. IIRC a failure of a
> connect(2) is signalled via exceptfds on Windows but not on Linux.

I think so too, and that is confirmed by
<http://msdn.microsoft.com/en-us/library/ms740141(v=VS.85).aspx>. I
thought there were other cases where exceptfds could matter, but I
can't find a citation. At any rate I agree with spiv that checking it
seems likely to do no harm and it might help.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

>>> 2) Treat xs having a file descriptor as meaning the client has
>>> timed out, and thus we should just clean up and go home?
>>
>> I don't think that's a good assumption.
>
> No, I think it only means "there's something for you to know
> about", rather than telling you it's necessarily dead.
>
>>> 3) Try to do something like read on the descriptor in order to
>>> figure out what is actually wrong with it?
>>
>> That's a reasonable approach.
>>
>> With either 1 or 3, the difference vs. not passing any fds to the
>> 3rd arg is that there are fewer cases where select will hang
>> until the timeout. Precisely which cases those are varies by
>> platform, unfortunately.
>

Sure. My question was because all that 'man select' tells me is that
'there is a pending error on this handle', but it doesn't tell me how
to actually get an error from it. I didn't know whether I should be
doing something like fctl, or whatever. But I guess just treat it as
though the file was ready to be read, and get the error from there.

Thanks,
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk53P44ACgkQJdeBCYSNAANqPwCeKfGsbnMAKZTn5TD7E04uUPWW
OF0AoIMpp2Rm+OkUJftzlnMq+cDTDOFr
=UY4l
-----END PGP SIGNATURE-----

Revision history for this message
Andrew Bennetts (spiv) wrote :

On Mon, Sep 19, 2011 at 01:11:42PM -0000, John A Meinel wrote:
[…]
> Sure. My question was because all that 'man select' tells me is that
> 'there is a pending error on this handle', but it doesn't tell me how
> to actually get an error from it. I didn't know whether I should be
> doing something like fctl, or whatever. But I guess just treat it as
> though the file was ready to be read, and get the error from there.

Right, just try use it: the mechanism for getting the error is the usual
way errors are reported via read(2) etc.

To be pedantic, on Linux select's exceptfds are for when something
exceptional has happened rather than an error (so the presence of an URG
pointer in a TCP stream), see 'man 7 socket' and search for 'exception'
or 'urg'. It gives the application the opportunity to deal with
“urgent” data first before attending to any other events, if it chooses.

Windows, in it's ineffable wisdom, chooses to interpret exceptfds as
being for certain errors (not sure what if anything it does with the URG
bit).

-Andrew.

Vincent Ladeuil (vila)
Changed in bzr:
milestone: none → 2.5b2
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.