Out of memory issue for websocket client

Bug #1903733 reported by Heather Lemon
This bug affects 1 person
Affects                   Status        Importance  Assigned to    Milestone
python-tornado (Ubuntu)   Fix Released  Undecided   Unassigned
  Xenial                  Fix Released  Low         Heather Lemon
  Bionic                  Fix Released  Low         Heather Lemon
  Focal                   Fix Released  Undecided   Unassigned
  Groovy                  Fix Released  Undecided   Unassigned
  Hirsute                 Fix Released  Undecided   Unassigned

Bug Description

[Impact]
Applications using the python-tornado package at version 5.1.1 or earlier are susceptible to an out-of-memory error related to websockets.

[Other Info]

Upstream commit(s):
https://github.com/tornadoweb/tornado/pull/2351/commits/20becca336caae61cd24f7afba0e177c0a210c70

$ git remote -v
origin https://github.com/tornadoweb/tornado.git (fetch)
origin https://github.com/tornadoweb/tornado.git (push)

$ git describe --contains 20becca3
v5.1.0b1~28^2~1

$ rmadison python3-tornado
 => python3-tornado | 4.2.1-1ubuntu3 | xenial
 python3-tornado | 4.5.3-1 | bionic/universe
 => python3-tornado | 4.5.3-1ubuntu0.1 | bionic-updates/universe
 python3-tornado | 6.0.3+really5.1.1-3 | focal/universe
 python3-tornado | 6.0.4-2 | groovy/universe
 python3-tornado | 6.0.4-3 | hirsute/universe
 python3-tornado | 6.1.0-1 | hirsute-proposed/universe

[Original Description]

Tornado has no 'flow control' for websockets (see [8] for the TCP flow-control concept). A websocket client receives data as fast as it can and stores it in a collections.deque. If that data is not consumed as fast as it is written, the deque grows indefinitely, ultimately leading to a memory error and the process being killed.
The fix is to use a bounded Queue and to read/get messages from the queue on the client side.
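
For illustration, a deliberately slow consumer along these lines is enough to trigger the buildup (a minimal sketch with an assumed URL and delay, not the attached client.py):

# minimal sketch of a slow websocket consumer (assumed URL/delay; not the attached client.py)
from tornado import gen, ioloop, websocket

@gen.coroutine
def slow_client():
    conn = yield websocket.websocket_connect("ws://127.0.0.1:8888/ws")
    while True:
        msg = yield conn.read_message()
        if msg is None:
            break  # connection closed
        # yield to the IOLoop; meanwhile the unpatched client keeps appending
        # incoming frames to its internal collections.deque without bound
        yield gen.sleep(1)

ioloop.IOLoop.current().run_sync(slow_client)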

Patch file [0]
Commit history [1]
GitHub [2]
Issue [3]

[0] https://patch-diff.githubusercontent.com/raw/tornadoweb/tornado/pull/2351.patch
[1] https://github.com/tornadoweb/tornado/pull/2351/commits
[2] https://github.com/tornadoweb/tornado
[3] https://github.com/tornadoweb/tornado/issues/2341

[Test Case]

I will be attaching two Python test files:
client.py
server.py
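
For reference, one plausible shape for server.py, assuming it simply floods each connected client with data and prints whatever the client sends back (a sketch, not necessarily the attached file):

# plausible server.py sketch (assumed behavior; not necessarily the attached file)
import tornado.ioloop
import tornado.web
import tornado.websocket

class FloodHandler(tornado.websocket.WebSocketHandler):
    def open(self):
        self.running = True
        tornado.ioloop.IOLoop.current().add_callback(self.flood)

    def flood(self):
        # keep writing as fast as the IOLoop allows, outpacing a slow client
        if not self.running:
            return
        try:
            self.write_message("x" * 1024)
        except tornado.websocket.WebSocketClosedError:
            return
        tornado.ioloop.IOLoop.current().add_callback(self.flood)

    def on_message(self, message):
        print("message: %s" % message)

    def on_close(self):
        self.running = False

app = tornado.web.Application([(r"/ws", FloodHandler)])
app.listen(8888)
tornado.ioloop.IOLoop.current().start()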

# create an lxc container, limit its memory, and turn off swap
$ sudo apt-get install lxc lxd
$ lxd init
$ lxc launch ubuntu:18.04 bionic-python-tornado

# limit the container to 2 CPUs
$ lxc config set bionic-python-tornado limits.cpu 2
# cap the container's memory at 150MB
$ lxc config set bionic-python-tornado limits.memory 150MB
# disable swap for the container [4]
$ lxc config set bionic-python-tornado limits.memory.swap false

# install dev tools and download the source code
$ lxc exec bionic-python-tornado bash
$ apt-get update
$ apt install ubuntu-dev-tools -y
$ pull-lp-source python-tornado bionic
$ cd python-tornado-4.5.3
$ sudo apt build-dep .

# copy client.py and server.py into ~/python-tornado-4.5.3/demos
# (scp them in, or create them there by hand)

# build code
$ python3 setup.py build
$ python3 setup.py install

# I have 3 terminals open:
# two for running Python (one for the server, one for the client),
# and a third running top to watch memory usage

# run server.py, client.py, and top in separate terminals
$ python3 demos/server.py
$ python3 demos/client.py
$ top

client.py prints the length of the collections.deque.

server.py prints messages like:
message: keep alive

* press ctrl+E in the top terminal to display memory in MB
top shows that swap is off (or nearly unused) and that memory is capped at 150MB

Although I never hit the OOM exception I expected to be thrown in Python,
the kernel OOM killer does kill the process; you can check dmesg:
$ sudo dmesg | grep -i python

The output looks similar to this:
[ 3250.067833] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=lxc.payload.iptest,mems_allowed=0,oom_memcg=/lxc.payload.iptest,task_memcg=/lxc.payload.iptest,task=python3,pid=44889,uid=1000000
[ 3250.067838] Memory cgroup out of memory: Killed process 44889 (python3) total-vm:304616kB, anon-rss:235152kB, file-rss:0kB, shmem-rss:0kB, UID:1000000 pgtables:628kB oom_score_adj:0
[ 3250.075096] oom_reaper: reaped process 44889 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0k

After either applying the patch or using the focal (or later) source:
* pull-lp-source python-tornado focal

we can run the exact same setup again; this time the new queue object only ever has a length of 1.

We have shown that, before the patch, messages were stored in an unbounded deque: "If maxlen is not specified or is None, deques may grow to an arbitrary length over time." [6] After the patch, messages go through a blocking queue of size 1 (Queue(1)), so there is only ever one item in the queue at a time. [7]
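
A minimal standalone illustration of the difference (not tornado's internal code; the item contents are arbitrary):

import collections
from tornado.queues import Queue, QueueFull

# before the patch: an unbounded deque never pushes back on the producer
dq = collections.deque()          # no maxlen, so nothing limits its growth
for i in range(1000000):
    dq.append(b"x" * 100)
print(len(dq))                    # 1000000

# after the patch: a bounded queue is full after a single item
q = Queue(1)
q.put_nowait("frame 0")           # fine; the queue is now full
try:
    q.put_nowait("frame 1")       # a second item cannot be queued
except QueueFull:
    print("producer must wait until the consumer calls get()")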

[4] https://discuss.linuxcontainers.org/t/limiting-swap-in-lxc-container/6343/4
[5] attached screenshot
[6] https://docs.python.org/3/library/collections.html?highlight=collections%20deque#collections.deque
[7] https://www.tornadoweb.org/en/stable/queues.html
[8] https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control

[Where Problems Could Occur]

Potential problems could occur if messages are not being consumed from the queue.
"""
  File "/usr/local/lib/python3.8/dist-packages/tornado-5.1.1-py3.8-linux-x86_64.egg/tornado/queues.py", line 202, in put_nowait
    raise QueueFull
tornado.queues.QueueFull
"""
If messages are not being taken off, the queue stays full and further put_nowait() calls will keep raising QueueFull.
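
For completeness, a sketch of the non-error path: when put() is used instead of put_nowait(), a full queue simply makes the producer wait until a consumer calls get() (standalone example, not tornado's websocket code):

from tornado import gen, ioloop
from tornado.queues import Queue

@gen.coroutine
def demo():
    q = Queue(1)
    yield q.put("first")            # completes immediately; the queue is now full
    pending = q.put("second")       # does not raise; returns an unfinished future
    print(pending.done())           # False: the item is parked on the putters deque
    first = yield q.get()           # consumer drains "first" ...
    yield pending                   # ... which lets the pending put() complete
    second = yield q.get()
    print(first, second)            # first second

ioloop.IOLoop.current().run_sync(demo)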

[Extra Notes]

Although they have addressed the OOM error by using a blocking queue of size one (Queue(1)) [9], built on their own queue implementation [10], problems remain if items are not being consumed from that queue.

They have pushed the issue from the application/library onto the networking side. Dropped data/packets on the network become more likely, since there will only ever be one item in the queue at a time.

" Constructor for a FIFO queue. maxsize is an integer that sets the upperbound limit on the number of items that can be placed in the queue. Insertion will block once this size has been reached, until queue items are consumed. If maxsize is less than or equal to zero, the queue size is infinite." [11]

[9] https://github.com/tornadoweb/tornado/blob/master/tornado/websocket.py#L1376
[10] https://github.com/tornadoweb/tornado/blob/f399f28fecc741667b63b7c20b930d7926d34ac3/tornado/queues.py#L81
[11] https://docs.python.org/3/library/queue.html

description: updated
summary: - Backport out of memory issue for websocket client
+ Out of memory issue for websocket client
Changed in python-tornado (Ubuntu):
assignee: nobody → Heather Lemon (hypothetical-lemon)
Eric Desrochers (slashd)
tags: added: seg sts
Eric Desrochers (slashd)
description: updated
Changed in python-tornado (Ubuntu Hirsute):
status: New → Fix Released
Changed in python-tornado (Ubuntu Groovy):
status: New → Fix Released
Changed in python-tornado (Ubuntu Focal):
status: New → Fix Released
Changed in python-tornado (Ubuntu Hirsute):
assignee: Heather Lemon (hypothetical-lemon) → nobody
Changed in python-tornado (Ubuntu Bionic):
assignee: nobody → Heather Lemon (hypothetical-lemon)
Changed in python-tornado (Ubuntu Xenial):
assignee: nobody → Heather Lemon (hypothetical-lemon)
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

python-tornado bionic debdiff

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

python tornado xenial

description: updated
Changed in python-tornado (Ubuntu Xenial):
status: New → In Progress
Changed in python-tornado (Ubuntu Bionic):
status: New → In Progress
Revision history for this message
Eric Desrochers (slashd) wrote :

[sts-sponsor][Bionic]

Is this your final debdiff ?

* The bionic debdiff is incomplete as it only contains d/changelog.
* Version in d/changelog should be "4.5.3-1ubuntu0.2" instead of "4.5.3-1ubuntu0.1.1" since the previous version was "4.5.3-1ubuntu0.1"

Revision history for this message
Eric Desrochers (slashd) wrote :

[sts-sponsor][xenial]

d/changelog mentioned 3 patches:
  * d/p/0001-read-queue-of-1-message.patch
  * d/p/0001-Remove-unused-import.patch
  * d/p/0001-test-Skip-test_source_port_fail-when-running-as-root.patch

but d/p/series only has 2 of them ?
+0001-read-queue-of-1-message.patch
+0001-Remove-unused-import.patch

Having all your patches start with 0001 can be confusing for future reference. Since it's a patchset fixing the same bug, I would suggest ordering them and including the bug number, so folks know it is a patchset fixing this bug, as follows:

0001-lp1903733-read-queue-of-1-message.patch
0002-lp1903733-Remove-unused-import.patch
and so on ....

description: updated
description: updated
Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :
description: updated
Changed in python-tornado (Ubuntu Xenial):
importance: Undecided → Low
Changed in python-tornado (Ubuntu Bionic):
importance: Undecided → Low
Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

attached screenshot of terminal setup and running

description: updated
Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

updated bionic python-tornado package

description: updated
Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :
Revision history for this message
Eric Desrochers (slashd) wrote :

[sts-sponsors][Review]

Bionic debdiff nitpicks:

* Change 4.5.3-1ubuntu0.1.1 -> 4.5.3-1ubuntu0.2
* Change (LP#:1903733) to (LP: #1903733)

Xenial debdiff nitpicks:
* Change (LP#:1903733) to (LP: #1903733)

Revision history for this message
Eric Desrochers (slashd) wrote :

[sts-sponsors][Review]

Quilt header for X and B:

* Add the commit link, not the PR please.
* Clean-up the launchpad.net bug -> https://launchpad.net/bugs/1903733

+Origin: upstream, https://github.com/tornadoweb/tornado/pull/2351/commits/20becca336caae61cd24f7afba0e177c0a210c70
+Bug-Ubuntu: https://bugs.launchpad.net/ubuntu/+source/python-tornado/+bug/1903733

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

ah ok i see, I will fix today. Thanks!

Dan Streetman (ddstreet)
tags: added: sts-sponsor-slashd
Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

v2 of python-tornado bionic lp#1903733

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

v2 of xenial python-tornado package upload

Revision history for this message
Eric Desrochers (slashd) wrote :

[sts-sponsors][xenial]

Sponsored.

Thanks for your contribution Heather !

- Eric

Revision history for this message
Eric Desrochers (slashd) wrote :

[sts-sponsors][bionic]

Sponsored.

Thanks for your contribution Heather !

- Eric

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Ok, the patches *seem* fine (especially that they're cherry-picks), but this SRU seems to be missing the regression potential analysis (per the [Where problems could occur] section). I don't know the code enough to feel if and which parts might regress after we land this change. Could you update the bug description with such information? Thanks!

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Heather, or anyone else affected,

Accepted python-tornado into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-tornado/4.5.3-1ubuntu0.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in python-tornado (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
Changed in python-tornado (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed-xenial
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello Heather, or anyone else affected,

Accepted python-tornado into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-tornado/4.2.1-1ubuntu3.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

description: updated
description: updated
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (python-tornado/4.5.3-1ubuntu0.2)

All autopkgtests for the newly accepted python-tornado (4.5.3-1ubuntu0.2) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

ipykernel/4.8.2-2 (i386)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#python-tornado

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

description: updated
Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :
Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

Kicked off the autopkgtests again; they passed the 2nd time around.

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

I don't believe that implementing a Queue(1) solves the client OOM errors.
It doesn't actually solve the problem, because the underlying implementation still uses collections.deque(), which was the issue to begin with.

websocket.py:
    self.read_queue = Queue(1)

queues.py (these deques are set up in Queue's initializer):
    self._getters = collections.deque([])  # Futures.
    self._putters = collections.deque([])  # Pairs of (item, Future).

Revision history for this message
Dan Streetman (ddstreet) wrote :

I went through the code with @hypothetical-lemon and the patches do appear to fix the issue; with the previous code they were adding new messages into a deque without any throttling, but now they are using a different queue class that, while it still uses deques internally, returns a future object for any queued message. The throttling is then implemented in the websocket's read queue, which delays reading more bytes from the actual low-level socket until after the future object completes. This should restrict the queue in the tornado websocket class to no more than 1 queued message.

Additionally, as this has always been a completely theoretical "problem" (even the upstream bug was entirely theoretical, with nobody involved reproducing the problem), we have no clear and easy reproducer. However, all the package's test cases passed, indicating no regression due to the patches.

Thus, marking this as verified.

tags: added: verification-done verification-done-bionic verification-done-xenial
removed: verification-needed verification-needed-bionic verification-needed-xenial
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Thank you for digging into the verification, Heather, Dan. Seeing the change and reading the verification comments, I agree with the rationale and will proceed with the release.

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for python-tornado has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-tornado - 4.5.3-1ubuntu0.2

---------------
python-tornado (4.5.3-1ubuntu0.2) bionic; urgency=low

  * d/p/0001-lp1903733-read-queue-of-1-message.patch (LP: #1903733)
    - fixes potential oom error with python-tornado client websocket
  * d/p/0002-lp1903733-Remove-unused-import.patch
    - code clean up

 -- Heather Lemon <email address hidden> Tue, 12 Jan 2021 20:57:40 +0000

Changed in python-tornado (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-tornado - 4.2.1-1ubuntu3.1

---------------
python-tornado (4.2.1-1ubuntu3.1) xenial; urgency=low

  * d/p/0001-lp1903733-read-queue-of-1-message.patch (LP: #1903733)
    - fixes potential oom error with python-tornado client websocket
  * d/p/0002-lp1903733-Remove-unused-import.patch
    - code clean up
  * d/control: update-maintainer

 -- Heather Lemon <email address hidden> Tue, 12 Jan 2021 21:32:45 +0000

Changed in python-tornado (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Seth Arnold (seth-arnold) wrote :

I was asked to give this a quick look; do you have a reproducer that demonstrates the problem?

It looks to me like the bounded Queue implementation tries hard to push back against unconstrained resource growth:

    def put(
        [ ... ]
        future = Future() # type: Future[None]
        try:
            self.put_nowait(item)
        except QueueFull:
            self._putters.append((item, future))
            _set_timeout(future, timeout)
        else:
            future.set_result(None)
        return future

    def put_nowait(self, item: _T) -> None:
        [ ... ]
        self._consume_expired()
        if self._getters:
            assert self.empty(), "queue non-empty, why are getters waiting?"
            getter = self._getters.popleft()
            self.__put_internal(item)
            future_set_result_unless_cancelled(getter, self._get())
        elif self.full():
            raise QueueFull
        else:
            self.__put_internal(item)

    def full(self) -> bool:
        if self.maxsize == 0:
            return False
        else:
            return self.qsize() >= self.maxsize

Taken together, these look to me like they should raise an error if the Queue is already full, place the requestor on the _putters list, and keep going.

It's a guess, but I'd hope the size of the _putters list is constrained by per-process open file limits -- eg I'm hoping every _putter represents a socket, and those are usually constrained to 1024 per process, or 10k per process, etc.

Thanks

Revision history for this message
Heather Lemon (hypothetical-lemon) wrote :

No, I was unable to come up with a reproducer. I attached two files (client.py, server.py) to try to demonstrate the OOM error, but was unsuccessful.

| raise an error if the Queue is already full, place the requestor on the _putters list, and keep going.

Yeah I agree.

Regarding the size of the getters and putters lists, this is all I could find. They are unbounded deques:
        self._getters = collections.deque([])  # Futures.
        self._putters = collections.deque([])  # Pairs of (item, Future).

Revision history for this message
Dan Streetman (ddstreet) wrote :

> the size of the _putters list is constrained by per-process open file limits

There is no mechanism inherent in the Queue class to restrict this; it would be up to the caller to use the class with that restriction.

> every _putter represents a socket

no, their design for the Queue is for it to be used by async callers (coroutines) so there would never be more than 1 entry in the getters and/or putters.

In the specific use case modified by this patch, the WebSocketClientConnection class is updated to use the Queue class and, from its on_message method, to return the future object representing the item put into the Queue. If the Queue has room for the object, its future object is the 'gen._null_future' object which is always completed (and whose result is None, but that doesn't matter). If the Queue doesn't have room, it returns a real future object that won't be completed until something calls the Queue's get() to take it off the putters deque.

The caller of WebSocketClientConnection.on_message() is thus responsible for performing the queue throttling by using the returned future object. In this case, that caller is WebSocketProtocol13._handle_message(), which returns the future provided by the call to on_message. The caller of _handle_message() is WebSocketProtocol13._on_frame_data(), that sets its local var handled_future to the future returned from the Queue. It then adds a callback to be called when the future is done, to call self._receive_frame(); in previous behavior, there was no future object returned from on_message (and thus _handle_message also returned None) so _on_frame_data() immediately called self._receive_frame(), which is what could lead to unthrottled filling of the queue.

Now that _on_frame_data() does get a future object and defers self._receive_frame() to execution only after the current message waiting in the Queue._putters deque has been removed, the Queue should never have more than a single message waiting in its putters.
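
As a standalone illustration of that pattern (assumed frame contents and timings; simplified, not tornado's actual websocket code):

from tornado import ioloop
from tornado.queues import Queue

q = Queue(1)                     # same bound the patch uses for the read queue
loop = ioloop.IOLoop.current()
frames_read = 0

def receive_frame():
    """Stand-in for _receive_frame(): 'read' one frame and queue it."""
    global frames_read
    frames_read += 1
    if frames_read > 5:
        loop.stop()
        return
    put_future = q.put("frame %d" % frames_read)   # may already be done, or pending
    # read the next frame only once the queued message has been consumed,
    # i.e. once the future returned by put() completes
    put_future.add_done_callback(lambda f: receive_frame())

def consume():
    if q.qsize():
        print("consumed", q.get_nowait())          # draining unblocks the pending put()
    loop.call_later(0.1, consume)

loop.add_callback(receive_frame)
loop.add_callback(consume)
loop.start()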

Revision history for this message
Dan Streetman (ddstreet) wrote :

I should add that while they claim their design is for async Python code (e.g. coroutines), they don't actually use it from a coroutine in this older version of the code; they manually use the future's add_done_callback() instead of just await-ing the future object (as they do in newer code). However, the manual addition of the callback should work just as well as a real coroutine awaiting the future object.
