Network lock up Windows Azure/Hyper-V

Bug #1200243 reported by Ben Howard
This bug affects 3 people
Affects             Status          Importance   Assigned to        Milestone
Gentoo Linux        Fix Released    Undecided    Unassigned
linux (Ubuntu)
  Saucy             Fix Released    High         Joseph Salisbury

Bug Description

Networking on Windows Azure 13.10 instances just locks up. After diagnosis, it appears that the root cause is networking. MS reports that OpenSUSE 12.3 had the same problem and was patched via:

Signed-off-by: K. Y. Srinivasan <email address hidden>
Reviewed-by: Haiyang Zhang <email address hidden>
Reported-by: Olaf Hering <email address hidden>

Cc: Stable <email address hidden> (V3.8+)
---
 drivers/hv/ring_buffer.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index cafa72f..d6fbb577 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -71,6 +71,7 @@ u32 hv_end_read(struct hv_ring_buffer_info *rbi)

 static bool hv_need_to_signal(u32 old_write, struct hv_ring_buffer_info *rbi)
 {
+ smp_mb();
  if (rbi->ring_buffer->interrupt_mask)
   return false;

affects: ubuntu → linux-meta (Ubuntu)
Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

Confirmed that we have the patch in the kernel source.

Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
tags: added: saucy
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

The patch mentioned is in Saucy as the following commit:
commit 288fa3e022eb85fa151e0f9bcd15caeb81679af6
Author: K. Y. Srinivasan <email address hidden>
Date: Fri Mar 29 14:30:38 2013 -0700

    Drivers: hv: vmbus: Fix a bug in hv_need_to_signal()

This commit is in saucy as of Ubuntu-3.10.0-0. Do you still see this bug with the 3.10.0-0 kernel?

tags: added: kernel-da-key
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

jsalisbury: yes, we are seeing a network lock up on the latest Saucy kernel 3.10.0-2-generic

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

v3.8 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-raring/
v3.9 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-saucy/
v3.10-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.10-rc1-saucy/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!

tags: added: performing-bisect
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It appears this bug was introduced in v3.9-rc1. I started a kernel bisect between v3.8 final and v3.9-rc1. The first test kernel is built up to commit:
b274776c54c320763bc12eb035c0e244f76ccb43

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1200243/
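
For anyone following along in a local kernel tree, the bisect described above can be sketched with plain git commands. This is a minimal sketch, not the exact procedure used to build the test kernels in this report: the helper names start_bisect and mark_bisect are hypothetical, and it assumes git 1.8.5 or later (for git -C) and a clone that contains both refs.

```sh
#!/bin/sh
# Hypothetical helpers wrapping "git bisect"; names are illustrative only.

# Start a bisect between a known-bad and a known-good ref and print the
# commit that git checked out for the first test build.
# Usage: start_bisect <repo-dir> <bad-ref> <good-ref>
start_bisect() {
    repo=$1 bad=$2 good=$3
    # git's own progress messages go to stderr so stdout carries only the hash.
    git -C "$repo" bisect start "$bad" "$good" >&2 || return 1
    git -C "$repo" rev-parse HEAD
}

# Record the verdict for the commit currently checked out and print the
# commit that git selected next.
# Usage: mark_bisect <repo-dir> good|bad
mark_bisect() {
    repo=$1 verdict=$2
    case "$verdict" in
        good|bad) ;;
        *) echo "verdict must be 'good' or 'bad'" >&2; return 2 ;;
    esac
    # When the search is finished, git reports the first bad commit here.
    git -C "$repo" bisect "$verdict" >&2 || return 1
    git -C "$repo" rev-parse HEAD
}
```

With a kernel clone in a directory such as ~/linux (an assumed path), start_bisect ~/linux v3.9-rc1 v3.8 would print the first commit to build, and mark_bisect ~/linux bad or mark_bisect ~/linux good would record each test result and print the next one.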

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Kernel up to commit b274776c54c320763bc12eb035c0e244f76ccb43 is bad.

Next test kernel built up to:
a0b1c42951dd06ec83cc1bc2c9788131d9fefcd8

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This bug appears to have been introduced by the following commit:

commit c2b8e5202cf7670f918d0f7439ed2123cd58e1b7
Author: K. Y. Srinivasan <email address hidden>
Date: Sat Dec 1 06:46:57 2012 -0800

    Drivers: hv: Implement flow management on the send side

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It looks like commit c2b8e5202cf7670f918d0f7439ed2123cd58e1b7 did not introduce the regression, so we'll continue the bisect.

I built the next kernel up to the following commit:
a0b1c42951dd06ec83cc1bc2c9788131d9fefcd8

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next kernel up to the following commit:
06991c28f37ad68e5c03777f5c3b679b56e3dac1

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1200243/

Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

Latest test kernel works without a problem.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next kernel up to the following commit:
771f3eed631be02b08544fc46cdfd2558599cf5d

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1200243/

The kernel filename is: linux-image-3.8.0-030800rc5-generic_3.8.0-030800rc5.201307111858_amd64.deb

Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

jsalisbury: the latest test kernel _FAILED_.

Output:
Linux u-test-r-a6 3.10.0-3-generic #12~lp1177609v201307121249 SMP Fri Jul 12 12:03:16 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
utlemming@u-test-r-a6:~$ curl -o /dev/null http://cdimage.ubuntu.com/source/pending/source/saucy-src-1.iso
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 4567M 100 4567M 0 0 46.3M 0 0:01:38 0:01:38 --:--:-- 52.4M
utlemming@u-test-r-a6:~$ while /bin/true; do curl -o /dev/null http://cdimage.ubuntu.com/source/pending/source/saucy-src-1.iso; done
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 4567M 199 4557M 0 0 48.0M 0 0:01:35 0:01:34 0:00:01 41.3M
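
The reproduction loop above keeps running until someone notices that a transfer has hung. A variant that detects the stall by itself is sketched below. The function name transfer_until_stall and the thresholds of 1024 bytes per second and 60 seconds are assumptions chosen for illustration; --speed-limit and --speed-time are standard curl options.

```sh
#!/bin/sh
# Repeat a large download until a transfer stalls or fails.
# Usage: transfer_until_stall <url> [max-runs]   (max-runs 0 = loop forever)
transfer_until_stall() {
    url=$1 max=${2:-0} n=0
    while [ "$max" -eq 0 ] || [ "$n" -lt "$max" ]; do
        n=$((n + 1))
        # Abort the transfer if it stays below 1024 bytes/s for 60 s,
        # which is how a locked-up network shows up to curl.
        if ! curl -sS -o /dev/null --speed-limit 1024 --speed-time 60 "$url"; then
            echo "run $n: transfer stalled or failed"
            return 1
        fi
        echo "run $n: ok"
    done
    return 0
}
```

It could be called with the same ISO URL as above, for example transfer_until_stall http://cdimage.ubuntu.com/source/pending/source/saucy-src-1.iso, and the run number it stops at gives a rough idea of how much data was moved before the lock up.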

Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

On an interesting side note: apw produced a kernel to enable trim support on Windows Azure/Hyper-V SCSI devices. That kernel is a 3.10-0.3 kernel and it _does not_ exhibit the problem. I transferred 50GB without issue using that kernel.

http://people.canonical.com/~apw/lp1177609-saucy/

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I marked commit 771f3eed631be02b08544fc46cdfd2558599cf5d as good after testing.

The next test kernel is up to the following commit:
21eaab6d19ed43e82ed39c8deb7f192134fb4a0e

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

21eaab6d19ed43e82ed39c8deb7f192134fb4a0e was good.

Next test kernel up to commit:
7ed214ac2095f561a94335ca672b6c42a1ea40ff

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

7ed214ac2095f561a94335ca672b6c42a1ea40ff was bad.

Next test kernel up to commit:
917ea427c78670958488f7f304e4629c325969a4

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1200243/

The kernel filename is: linux-image-3.8.0-030800rc3-generic_3.8.0-030800rc3.201307121210_amd64.deb

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

917ea427c78670958488f7f304e4629c325969a4 was bad.

Next test kernel up to commit:
b5071f2cd89bfd88cc3c3a820cbb9e7d7d9b5c92

Revision history for this message
Milan Dadok (dadok) wrote :

I'm seeing a network lock up on sys-kernel/gentoo-sources-3.10.0 kernel.

But I don't have any lock up on sys-kernel/gentoo-sources-3.9.1, which has been running since 12.5.2013.

Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

Milan Dadok: for 3.9.x I was able to replicate a network lock up after transferring quite a bit of information. Generally I saw the network lock up between 200MB and 1.2GB of transferred data.

Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

@jsalisbury: the latest test kernel is OK. No issues after transferring 40GB.

Revision history for this message
Milan Dadok (dadok) wrote :

3.10-rc1: locks up every time after the SCP server sends the first larger data block (file data or a big folder listing).
(From tcpdump: whenever more than 3 x 1460 data bytes plus one packet with the remaining bytes are sent, only packets 1, 2 and 3 are visible on the wire.)
3.9.9: downloading 2.5GB from the SCP server (many bz2 and gz files) works OK.

I may have hit another bug ...
from my previous test - it was probably in linux-next-20130328 git tree ...

Revision history for this message
Milan Dadok (dadok) wrote :

I bisected my problem and got the same result as Joao Correia: https://lkml.org/lkml/2013/5/29/480
After running, as suggested there,
#ethtool -K eth0 sg off
everything is OK, with no network lock up on the 3.10 kernel.

ec5f061564238892005257c83565a0b58ec79295 is the first bad commit
commit ec5f061564238892005257c83565a0b58ec79295
Author: Pravin B Shelar <email address hidden>
Date: Thu Mar 7 09:28:01 2013 +0000

    net: Kill link between CSUM and SG features.

    Earlier SG was unset if CSUM was not available for given device to
    force skb copy to avoid sending inconsistent csum.
    Commit c9af6db4c11c (net: Fix possible wrong checksum generation)
    added explicit flag to force copy to fix this issue. Therefore
    there is no need to link SG and CSUM, following patch kills this
    link between there two features.

    This patch is also required following patch in series.

    Signed-off-by: Pravin B Shelar <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

:040000 040000 1460da77fe80b11e7a91575dcc8733801c1acf5c dcbb6ff495c41faf92a1c5b122384ff0d39fbfa0 M include
:040000 040000 023ebac2c2f7f8c4fcce5b48808812a4ccd590bb f0025cd5f8d17f33481dbf74450ee144d5edecb5 M net

Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

Testing with "ethtool -K eth0 sg off" returns stability to the images, although throughput is noticeably impacted.

Revision history for this message
Scott Moser (smoser) wrote :

If we're not close to a solution... I'm tempted to suggest creating daily saucy images with 'ethtool -K eth0 sg off' running on ifup, via some injected network config script.

It could look like this:
$ cat /etc/network/if-up.d/bug-1200243
#!/bin/sh
[ "$IFACE" = "eth0" ] || exit 0
kver=$(uname -r)
case "$kver" in
  3.10.0-3*) :;;
  *) exit 0;;
esac
echo "BUG 1200243 workaround: ethtool -K eth0 sg off" | logger
ethtool -K eth0 sg off
$ chmod 755 /etc/network/if-up.d/bug-1200243

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I think I see what may have happened. Commit 288fa3e0 was applied to v3.10-rc1 before commit ec5f0615:

$ git describe --contains 288fa3e022eb85fa151e0f9bcd15caeb81679af6
v3.10-rc1~194^2~38
$ git describe --contains ec5f061564238892005257c83565a0b58ec79295
v3.10-rc1~66^2~545

Commit 288fa3e fixed the Hyper-V bug, but then commit ec5f061 introduced a new, but similar bug. They were both introduced in v3.10-rc1, so we never saw the Hyper-V fix before this new bug came along.
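
One way to double-check which release first carried each commit is git merge-base --is-ancestor, which reports whether a commit is reachable from a given ref. The helper below is a hedged sketch: first_ref_containing is a hypothetical name, and the refs have to be passed oldest first.

```sh
#!/bin/sh
# Print the first ref, in the order given, that contains <commit>.
# Usage: first_ref_containing <repo-dir> <commit> <ref>...
first_ref_containing() {
    repo=$1 commit=$2
    shift 2
    for ref in "$@"; do
        # Exit status 0 means <commit> is an ancestor of (contained in) <ref>.
        if git -C "$repo" merge-base --is-ancestor "$commit" "$ref"; then
            echo "$ref"
            return 0
        fi
    done
    return 1
}
```

In a kernel clone at an assumed path such as ~/linux, this could be run as first_ref_containing ~/linux ec5f061564238892005257c83565a0b58ec79295 v3.9 v3.10-rc1, and likewise for 288fa3e022eb85fa151e0f9bcd15caeb81679af6. Going by the git describe output above, both would be expected to report v3.10-rc1.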

@Milan Dadok, I built a Saucy test kernel with commit ec5f061 reverted. The kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1200243/ec5f061-reverted/

Can you test this kernel and post back if it resolves this bug? You may have to install both the linux-image and linux-image-extra .deb packages.

Changed in linux (Ubuntu):
status: Triaged → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)
Revision history for this message
Milan Dadok (dadok) wrote :

@jsalisbury, I compiled my own kernel from the linux-stable git tree:
git checkout tags/v3.10
git revert ec5f061564238892005257c83565a0b58ec79295

After that I don't see any lock up downloading files from server

ethtool -k eth0 now shows:
scatter-gather: off
        tx-scatter-gather: off [requested on]
        tx-scatter-gather-fraglist: off [fixed]
generic-segmentation-offload: off [requested on]

Before reverting the patch:
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
generic-segmentation-offload: on
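
To compare these settings across kernels without reading the full listing, the state of a single feature can be pulled out of the ethtool -k output. The sketch below assumes the "name: state [note]" line layout shown above; feature_state is a hypothetical helper name.

```sh
#!/bin/sh
# Print the state ("on" or "off") of one offload feature, reading
# "ethtool -k <iface>" output from stdin. Exits 1 if the feature is absent.
# Usage: ethtool -k eth0 | feature_state scatter-gather
feature_state() {
    awk -v name="$1" -F': *' '
        # Strip the indentation used for sub-features before comparing names.
        { key = $1; sub(/^[ \t]+/, "", key) }
        # Keep only the first word of the value, dropping notes like [fixed].
        key == name { split($2, f, " "); print f[1]; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

For example, ethtool -k eth0 | feature_state generic-segmentation-offload prints just the on/off state, which is easy to log before and after a kernel change.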

Revision history for this message
Milan Dadok (dadok) wrote :

I ran a test on a clean 3.10 with
ethtool -K eth0 gso off
and there is no lock up.

There seems to be a problem transferring SG lists generated by GSO (or something like that) to the Hyper-V host.
It may have been there from the beginning; SG was probably not used before.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :
Revision history for this message
Scott Moser (smoser) wrote :

I've just verified on saucy daily images that this seems to be work-aroundable by launching with user-data containing a boothook that runs ethtool -K eth0 gso off.
Example cloud-config:

#cloud-config
apt_update: true
apt_upgrade: true
packages: [ pastebinit ]
output: {all: '| tee -a /var/log/cloud-init-output.log'}
bootcmd:
 - [ ethtool, -K, eth0, gso, "off" ]

So far though, I'm 2 for 2 in reproducing bug 1201567 (300 second scsi timeout before / is mounted RW), which results in over 5 minutes before the instance is reachable.

Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

Confirmed the latest test kernel fixes the problem.

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Saucy):
status: In Progress → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.10.0-4.13

---------------
linux (3.10.0-4.13) saucy; urgency=low

  [ Joseph Salisbury ]

  * SAUCE: (no-up) hyperv: Fix the NETIF_F_SG flag setting in netvsc
    - LP: #1200243

  [ Tim Gardner ]

  * [Config] CONFIG_DYNAMIC_DEBUG=y
    - LP: #1202018

  [ Upstream Kernel Changes ]

  * mfd: rtsx: Add support for RTL8411B
    - LP: #1201321
 -- Tim Gardner <email address hidden> Tue, 16 Jul 2013 06:41:44 -0600

Changed in linux (Ubuntu Saucy):
status: Fix Committed → Fix Released
no longer affects: linux (Ubuntu)
Changed in gentoo:
status: New → Fix Released