OCFS2 intermittently not mountable on a second Node in Ubuntu 20.04.3 LTS

Bug #1959581 reported by Norbert B.
This bug affects 2 people
Affects: ocfs2-tools (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Environment:

Oracle Cloud Infrastructure (OCI)
2 nodes + iSCSI block volume with OCFS2
Operating System: Ubuntu 20.04.3 LTS
Kernel: Linux 5.11.0-1027-oracle
ocfs2-tools: 1.8.6-2ubuntu1

Problem:

OCFS2 is intermittently not mountable on the second node.
Sometimes it works for hours and across many reboots.
Sometimes it hangs for hours, even after a reboot.
In case of error: if OCFS2 is unmounted from the first node, the mount on the second node works.
But sometimes unmounting from the first node is not possible (it hangs, and sometimes the whole current SSH shell hangs).

The iSCSI volume is attached and visible.
O2CB runs without errors.

In older Ubuntu/OCFS2 versions this does not happen, e.g.:
Operating System: Ubuntu 18.04.6 LTS
Kernel: Linux 5.4.0-1061-oracle
ocfs2-tools: 1.8.5-3ubuntu1

Also in current Oracle Linux version this does not happen:
Operating System: Oracle Linux Server 7.9
Kernel: Linux 5.4.17-2136.302.7.2.2.el7uek.x86_64
ocfs2-tools.x86_64: 1.8.6-14.el7

Some error and log messages:

Boot screen with no error:
# console (good, mounted after reboot)
[ OK ] Started Disk Manager.
[ OK ] Finished Load o2cb Modules.
         Starting Mount ocfs2 Filesystems...
[ OK ] Found device BlockVolume ubuntu20-ocfs.
         Mounting /ocfs2volume...

Boot screen on error:
[ OK ] Found device BlockVolume ubuntu20-ocfs.
         Mounting /ocfs2volume...
[FAILED] Failed to mount /ocfs2volume.
See 'systemctl status ocfs2volume.mount' for details.
[DEPEND] Dependency failed for Remote File Systems.
...
node2 login: [ 242.574617] INFO: task mount.ocfs2:1000 blocked for more than 120 seconds.
[ 242.576764] Not tainted 5.11.0-1027-oracle #30~20.04.1-Ubuntu
[ 242.578017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.405225] INFO: task mount.ocfs2:1000 blocked for more than 241 seconds.
[ 363.407645] Not tainted 5.11.0-1027-oracle #30~20.04.1-Ubuntu
[ 363.410559] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.230345] INFO: task mount.ocfs2:1000 blocked for more than 362 seconds.
[ 484.232416] Not tainted 5.11.0-1027-oracle #30~20.04.1-Ubuntu
[ 484.233670] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

# systemctl status ocfs2volume.mount
● ocfs2volume.mount - /ocfs2volume
     Loaded: loaded (/etc/fstab; generated)
     Active: failed (Result: timeout) since Wed 2022-01-19 13:02:04 CET; 8min ago
      Where: /ocfs2volume
       What: /dev/disk/by-uuid/7da60327-fd56-4abd-8f96-4c96260bb2c7
       Docs: man:fstab(5)
             man:systemd-fstab-generator(8)
      Tasks: 1 (limit: 19138)
     Memory: 756.0K
     CGroup: /system.slice/ocfs2volume.mount
             └─1000 /sbin/mount.ocfs2 /dev/sdb /ocfs2volume -o rw,acl,_netdev
Jan 19 13:00:34 node2 systemd[1]: Mounting /ocfs2volume...
Jan 19 13:02:04 node2 systemd[1]: ocfs2volume.mount: Mounting timed out. Terminating.
Jan 19 13:02:04 node2 systemd[1]: ocfs2volume.mount: Mount process exited, code=killed, status=15/TERM
Jan 19 13:02:04 node2 systemd[1]: ocfs2volume.mount: Failed with result 'timeout'.
Jan 19 13:02:04 node2 systemd[1]: Failed to mount /ocfs2volume.

# systemctl status ocfs2
● ocfs2.service - Mount ocfs2 Filesystems
     Loaded: loaded (/lib/systemd/system/ocfs2.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2022-01-19 13:00:31 CET; 11min ago
       Docs: man:ocfs2(7)
             man:mount.ocfs2(8)
    Process: 934 ExecStart=/usr/lib/ocfs2-tools/ocfs2 start (code=exited, status=0/SUCCESS)
   Main PID: 934 (code=exited, status=0/SUCCESS)
Jan 19 13:00:31 node2 systemd[1]: Starting Mount ocfs2 Filesystems...
Jan 19 13:00:31 node2 ocfs2[934]: Starting Oracle Cluster File System (OCFS2)
Jan 19 13:00:31 node2 ocfs2[939]: mount: /ocfs2volume: can't find UUID=7da60327-fd56-4abd-8f96-4c96260bb2c7.
Jan 19 13:00:31 node2 systemd[1]: Finished Mount ocfs2 Filesystems.
Jan 19 13:00:31 node2 ocfs2[934]: Failed
# INFO: UUID=7da60327-fd56-4abd-8f96-4c96260bb2c7 was present
# But errors differ in different runs

# iscsiadm -m session -P 3
************************
Attached SCSI devices:
************************
Host Number: 3 State: running
scsi3 Channel 00 Id 0 Lun: 0
scsi3 Channel 00 Id 0 Lun: 2
  Attached scsi disk sdb State: running

# dmesg
[ 1.337891] scsi 2:0:0:1: Direct-Access ORACLE BlockVolume 1.0 PQ: 0 ANSI: 6
...
[ 7.886896] scsi host3: scsi_eh_3: sleeping
[ 7.897819] scsi host3: iSCSI Initiator over TCP/IP
[ 8.198102] ocfs2: Registered cluster interface o2cb
[ 8.244150] OCFS2 User DLM kernel interface loaded
[ 8.272364] o2hb: Heartbeat mode set to local
[ 8.355866] scsi 3:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
[ 8.358478] scsi 3:0:0:0: Attached scsi generic sg1 type 12
[ 8.360633] scsi 3:0:0:2: Direct-Access ORACLE BlockVolume 1.0 PQ: 0 ANSI: 6
[ 8.361620] sd 3:0:0:2: Attached scsi generic sg2 type 0
[ 8.373937] sd 3:0:0:2: [sdb] 2147483648 512-byte logical blocks: (1.10 TB/1.00 TiB)
[ 8.373941] sd 3:0:0:2: [sdb] 4096-byte physical blocks
[ 8.374114] sd 3:0:0:2: [sdb] Write Protect is off
[ 8.374117] sd 3:0:0:2: [sdb] Mode Sense: 2b 00 10 08
[ 8.374552] sd 3:0:0:2: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 8.374925] sd 3:0:0:2: [sdb] Optimal transfer size 1048576 bytes
[ 8.436917] sd 3:0:0:2: [sdb] Attached SCSI disk
...
[ 14.820725] o2net: Connected to node node1 (num 0) at 192.168.14.50:7777
[ 18.832591] o2dlm: Joining domain 7DA60327FD564ABD8F964C96260BB2C7
[ 18.832595] (
[ 18.832595] 0
[ 18.832596] 1
[ 18.832597] ) 2 nodes
[ 242.574617] INFO: task mount.ocfs2:1000 blocked for more than 120 seconds.
[ 242.576764] Not tainted 5.11.0-1027-oracle #30~20.04.1-Ubuntu
[ 242.578017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.579678] task:mount.ocfs2 state:D stack: 0 pid: 1000 ppid: 1 flags:0x00004000
[ 242.579684] Call Trace:
[ 242.579689] __schedule+0x44c/0x8a0
[ 242.579696] schedule+0x4f/0xc0
[ 242.579697] schedule_timeout+0x202/0x290
[ 242.579700] wait_for_completion+0x94/0x100
[ 242.579705] __ocfs2_cluster_lock.isra.0+0x54b/0x820 [ocfs2]
[ 242.579747] ocfs2_inode_lock_full_nested+0x116/0x470 [ocfs2]
[ 242.579788] ? ocfs2_inode_lock_full_nested+0x116/0x470 [ocfs2]
[ 242.579825] ocfs2_journal_init+0x98/0x340 [ocfs2]
[ 242.579863] ? ocfs2_get_system_file_inode+0x14e/0x310 [ocfs2]
[ 242.579906] ocfs2_check_volume+0x3b/0x4e0 [ocfs2]
[ 242.579942] ? iput+0x7a/0x200
[ 242.579947] ocfs2_mount_volume.isra.0+0x108/0x430 [ocfs2]
[ 242.579986] ocfs2_fill_super+0x961/0xda0 [ocfs2]
[ 242.580026] mount_bdev+0x18d/0x1c0
[ 242.580030] ? ocfs2_initialize_super.isra.0+0x1000/0x1000 [ocfs2]
[ 242.580070] ocfs2_mount+0x15/0x20 [ocfs2]
[ 242.580108] legacy_get_tree+0x2b/0x50
[ 242.580113] vfs_get_tree+0x2a/0xc0
[ 242.580116] ? capable+0x19/0x20
[ 242.580120] path_mount+0x460/0xa50
[ 242.580124] do_mount+0x7c/0xa0
[ 242.580126] __x64_sys_mount+0x8b/0xe0
[ 242.580129] do_syscall_64+0x38/0x90
[ 242.580133] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 242.580136] RIP: 0033:0x7f5d2b9ecdde
[ 242.580138] RSP: 002b:00007ffe1754ab38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 242.580141] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5d2b9ecdde
[ 242.580143] RDX: 000055fb738f40ae RSI: 000055fb74188340 RDI: 000055fb7418f430
[ 242.580145] RBP: 00007ffe1754ace0 R08: 000055fb7418f410 R09: 00007ffe17548570
[ 242.580146] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe1754abd0
[ 242.580147] R13: 00007ffe1754ab50 R14: 000055fb7418f410 R15: 0000000000000000

# @shutdown:
[*** ] A stop job is running for Login to …lt iSCSI targets (2min 47s / 3min)
[ *] A stop job is running for Login to …lt iSCSI targets (5min 24s / 6min)

Additional information:
The bug happened on two different, independent installations with Ubuntu 20.04 in OCI.
This bug was reported to Oracle too, but Oracle declined processing it as an Oracle bug.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Thanks for the bug report, Norbert.

First of all, I noticed that the ocfs2-tools package contains a delta relative to the Debian package. The delta is trivial, though, and is there just to make sure the package is built only on supported architectures (see bug #1745155 for more details). I don't think it is what's causing this bug.

Debian itself carries a few distro-specific patches. Upon inspecting them, I don't think they're directly related to this issue either. So we're left with two possibilities here:

1) This is an upstream bug (which may have been fixed meanwhile), or

2) This is a bug elsewhere (Linux kernel, maybe?)

I did some research on the internet and found some interesting things.

First, I found that it's really hard to obtain the list of patches that have been applied to a package on Oracle [GNU/]Linux! I knew that theoretically already, but now I have seen it for real. Anyway, I was trying to take a look at their package to see if anything strange/interesting came up. No deal.

Then, I decided to do a search for some terms from your bug descriptions. It caught my attention that you were able to provide a stack trace (thanks, by the way!), so I tried searching for one of the functions listed there. And I did find something interesting:

https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2375753_1.html

Both the description of the problem (high I/O wait) and the stack trace provided in the link above match what you're seeing. I don't have an Oracle account so I can't see the rest of the article, but I'm assuming you do, so could you please take a look and see if there is anything useful for you there?

Meanwhile, I'm going to mark this bug as Incomplete (which is the closest we have to a "Waiting" state). When you have more information, please mark it as New again.

Thanks.

Changed in ocfs2-tools (Ubuntu):
status: New → Incomplete
Revision history for this message
Norbert B. (nbpq) wrote (last edit ):

Hello Sergio,

thank you for picking up this bug.
I realize that this is a very difficult problem.

Regarding the possibility of an upstream bug:

Meanwhile there is a new kernel available: Linux 5.11.0-1028-oracle
I will install it and do tests again.
This may take some days, because the bug does not always occur.
I will inform you as soon as I have results.

The offered solution in https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2375753_1.html looks like this:

###
Cause
The server is running with incorrect system configurations and not following Oracle's Best Practices for OCFS2.

Solution
Make sure that /usr/bin/updatedb (aka /usr/bin/locate, slocate) does not run against OCFS2 partitions since updatedb, a file indexer, will reduce OCFS2 file I/O performance.

To prevent updatedb from indexing OCFS2 partitions, add 'ocfs2' to the PRUNEFS= and/or list OCFS2 volumes specifically in the PRUNEPATHS list in /etc/updatedb.conf e.g.

[/etc/updatedb.conf]
PRUNEFS="devpts NFS nfs afs proc smbfs autofs auto iso9660 ocfs2"
PRUNEPATHS="/tmp /usr/tmp /var/tmp /afs /net /ocfs2-data /ocfs2-index /quorum"
export PRUNEFS
export PRUNEPATHS

References
NOTE:237997.1 - Linux OCFS - Best Practices
###

I tried to find the related settings, but:

ll /etc/updatedb.conf
ls: cannot access '/etc/updatedb.conf': No such file or directory

ll /usr/bin/locate
ls: cannot access '/usr/bin/locate': No such file or directory

find / -name "*updatedb*"
/usr/share/vim/vim81/syntax/updatedb.vim
/usr/share/vim/vim81/ftplugin/updatedb.vim

find / -name "*locate*"
=> result contains only entries belonging to "allocate" or "fallocate"

The general settings on the affected systems are the same as on the (working) Ubuntu 18 and Oracle Linux systems.

Thanks for helping me!

Revision history for this message
Norbert B. (nbpq) wrote :

OK, the test with the new kernel went fast.
The problem appeared at the first boot with the new kernel.

Here some console output while trying to reboot:

# @ first shutdown with new kernel
[** ] A stop job is running for Login to …SCSI targets (3min 36s / 4min 30s)
[ 605.284682] INFO: task mount.ocfs2:1036 blocked for more than 483 seconds.
[ 605.285814] Not tainted 5.11.0-1028-oracle #31~20.04.1-Ubuntu
[ 605.287107] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.289340] INFO: task (sd-sync):1828 blocked for more than 120 seconds.
[ 605.292876] Not tainted 5.11.0-1028-oracle #31~20.04.1-Ubuntu
[ 605.295340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.296958] INFO: task sync:2228 blocked for more than 120 seconds.
[ 605.299587] Not tainted 5.11.0-1028-oracle #31~20.04.1-Ubuntu
[ ***] A stop job is running for Login to …iSCSI targets (5min 37s / 6min 1s)
[ 726.113776] INFO: task mount.ocfs2:1036 blocked for more than 604 seconds.
[ 726.114855] Not tainted 5.11.0-1028-oracle #31~20.04.1-Ubuntu
[ 726.115754] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.117146] INFO: task (sd-sync):1828 blocked for more than 241 seconds.
[ 726.121041] Not tainted 5.11.0-1028-oracle #31~20.04.1-Ubuntu
[ 726.122251] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.123522] INFO: task sync:2228 blocked for more than 241 seconds.
[ 726.124481] Not tainted 5.11.0-1028-oracle #31~20.04.1-Ubuntu
[ OK ] Stopped Login to default iSCSI targets.
         Stopping iSCSI initiator daemon (iscsid)...
[ OK ] Stopped iSCSI initiator daemon (iscsid).
[ OK ] Stopped target Network is Online.
[ OK ] Stopped target Network.
[ OK ] Closed Open-iSCSI iscsid Socket.
[ OK ] Stopped target System Initialization.
[ OK ] Stopped target Local Encrypted Volumes.
[ OK ] Stopped Dispatch Password …ts to Console Directory Watch.
[ OK ] Stopped Forward Password Requests to Wall Directory Watch.
[ OK ] Stopped Initial cloud-init…ob (metadata service crawler).
[ OK ] Stopped Wait for Network to be Configured.
         Stopping Network Name Resolution...
         Stopping Network Time Synchronization...
         Stopping Update UTMP about System Boot/Shutdown...
[ OK ] Stopped Network Time Synchronization.
[ OK ] Stopped Update UTMP about System Boot/Shutdown.
[ OK ] Stopped Network Name Resolution.
         Stopping Network Service...
[ OK ] Stopped Create Volatile Files and Directories.
[ OK ] Stopped Network Service.
[ OK ] Stopped target Network (Pre).
[ OK ] Stopped Initial cloud-init job (pre-networking).
         Stopping netfilter persistent configuration...
[ OK ] Stopped Apply Kernel Variables.
[ OK ] Stopped netfilter persistent configuration.
[ OK ] Stopped target Local File Systems.
         Unmounting /boot/efi...
         Unmounting /dlm...
         Unmounting /run/snapd/ns/lxd.mnt...
         Unmounting Mount unit for core18, revision 2253...
         Unmounting Mount unit for core18, revision 2284...
         Unmounting Mount unit for core20, revision 1...


Changed in ocfs2-tools (Ubuntu):
status: Incomplete → New
Revision history for this message
Utkarsh Gupta (utkarsh) wrote :

Hi Norbert,

Hm, interesting. I am not entirely sure what causes this (race?) condition, given that the kernel upgrade didn't fix it, and it is still not clear where the problem lies. As you noted, this is a very difficult problem to run into, and it requires infrastructure that I/we don't have access to. :/

Since Oracle (upstream) declined this, it gives us three plausible causes:
1. Either it happens due to a local issue or misconfiguration of some sort. But this doesn't look very likely at the moment.
2. This is maybe a kernel bug (somehow?), irrespective of upgrading/et al.
3. There is some sort of regression in 20.04, due to updates or something else, that is causing this. To verify that, could you maybe try to reproduce this on 21.10 (Impish Indri), please?

Either way, I'll defer this to Sergio, since he might know more about this than I do. :)

Revision history for this message
Norbert B. (nbpq) wrote :

Since Sergio didn't find any corresponding change in the ocfs2-tools delta, could I imagine that a change in the iSCSI adaptation is the reason, in combination with OCFS2?

Unfortunately Oracle Cloud (OCI) does not offer Impish Indri as cloud image.
They only support the LTS versions 18/20 as oracle build.

What we could offer you is access to a sandbox installation.
You would not have console access, but you would have SSH access.
If it could help, I would build one, and would need public keys for the people who do the tests.
For details like the public IPs we should use a private communication channel (email).

Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Hi Norbert,

It would be nice if we could come up with a minimal reproducer here.

In the meantime, before we discard the issue linked by Sergio, would you mind verifying in which cases (Bionic, Focal, and Oracle Linux) you have locate (or mlocate) installed (dpkg -l | grep locate), and the contents of /etc/updatedb.conf where available (do verify both nodes as well)?

Revision history for this message
Norbert B. (nbpq) wrote (last edit ):

Hello Athos, here is the result:

Ubuntu 20.04 Cluster 1: -
Ubuntu 20.04 Cluster 2: -

Ubuntu 18.04: mlocate 0.26-2ubuntu3.1

PRUNE_BIND_MOUNTS="yes"
# PRUNENAMES=".git .bzr .hg .svn"
PRUNEPATHS="/tmp /var/spool /media /var/lib/os-prober /var/lib/ceph /home/.ecryptfs /var/lib/schroot"
PRUNEFS="NFS nfs nfs4 rpc_pipefs afs binfmt_misc proc smbfs autofs iso9660 ncpfs coda devpts ftpfs devfs devtmpfs fuse.mfs shfs sysfs cifs lustre tmpfs usbfs udf fuse.glusterfs fuse.sshfs curlftpfs ceph fuse.ceph fuse.rozofs ecryptfs fusesmb"

Oracle Linux Server 7.9: mlocate.x86_64 0.26-8.el7

PRUNE_BIND_MOUNTS = "yes"
PRUNEFS = "9p afs anon_inodefs auto autofs bdev binfmt_misc cgroup cifs coda configfs cpuset debugfs devpts ecryptfs exofs fuse fuse.sshfs fusectl gfs gfs2 gpfs hugetlbfs inotifyfs iso9660 jffs2 lustre mqueue ncpfs nfs nfs4 nfsd pipefs proc ramfs rootfs rpc_pipefs securityfs selinuxfs sfs sockfs sysfs tmpfs ubifs udf usbfs fuse.glusterfs ceph fuse.ceph"
PRUNENAMES = ".git .hg .svn"
PRUNEPATHS = "/afs /media /mnt /net /sfs /tmp /udev /var/cache/ccache /var/lib/yum/yumdb /var/spool/cups /var/spool/squid /var/tmp /var/lib/ceph"

The nodes are identical.

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Norbert,

Could you also provide the open-iscsi version you are using? I was checking the difference between open-iscsi in Bionic and Focal: the upstream version is the same, but some patches were applied. For instance, this is a bug reported by an Oracle Cloud user which was fixed in Bionic version 2.0.874-5ubuntu2.9:

https://bugs.launchpad.net/ubuntu/+source/open-iscsi/+bug/1800681

Revision history for this message
Norbert B. (nbpq) wrote :

Thanks @all for not leaving me alone with this issue!

Open-iscsi versions:

Ubuntu 20.04.3 LTS: 2.0.874-7.1ubuntu6.2

Ubuntu 18.04.6 LTS: 2.0.874-5ubuntu2.10

(Oracle Linux Server 7.9: 6.2.0.874-22)

Revision history for this message
Paride Legovini (paride) wrote :

Hi Norbert,

I had a look at the differences between those two versions but found no clue. In any case, I prepared a PPA with Bionic's open-iscsi compiled for Focal:

https://launchpad.net/~paride/+archive/ubuntu/lp1959581

You can use this to verify that the issue isn't a regression introduced between 2.0.874-5ubuntu2.10 and 2.0.874-7.1ubuntu6.2. My expectation here is that you'll still hit the issue, but I'd be happy to be proven wrong.

Note that even with that PPA enabled you'll still have to force the installation of its version of open-iscsi:

  apt install open-iscsi=2.0.874-5ubuntu2.10~focal2

as that's a downgrade wrt Focal's version.
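To keep that downgrade in place across unattended upgrades, an apt pin can be used. This is only a sketch: the file name is arbitrary, and the pinned version string should match whatever `apt-cache policy open-iscsi` reports after the downgrade.

```text
# /etc/apt/preferences.d/open-iscsi-downgrade  (hypothetical file name)
Package: open-iscsi
Pin: version 2.0.874-5ubuntu2.10~focal2
Pin-Priority: 1001
```

A priority above 1000 tells apt to keep this version even though it is older than the archive's.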

Another important bit of information would come from testing newer Ubuntu releases, as Utkarsh mentioned. I understand that you only have LTS images available on OCI, but it *should* still be possible for you to install Focal and then upgrade to the newer releases, reboot to the newer kernel and test again. Knowing that this is fixed in Impish or Jammy would give us a good starting point to actually identify a fix.

I can see that this is "debugging with an axe", but I think there's little more we can do.

Changed in ocfs2-tools (Ubuntu):
status: New → Incomplete
Revision history for this message
Norbert B. (nbpq) wrote :

Hello Paride,

thank you for the effort with the downgrade version.
The problem stays the same in open-iscsi 2.0.874-5ubuntu2.10~focal2.

Should we try a downgrade of ocfs2?

Do you know of a description somewhere of how to upgrade an Oracle Focal LTS installation to Impish Indri?
It would be helpful for me and I would like to try it.

Best regards, Norbert

Changed in ocfs2-tools (Ubuntu):
status: Incomplete → New
Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

I did not find any Oracle-specific doc on how to upgrade an Ubuntu system to Impish, which is kind of expected since they do not support this use case. I'd recommend trying `do-release-upgrade` to do that.

Revision history for this message
Paride Legovini (paride) wrote :

@Norbert, as Lucas suggested please try using `do-release-upgrade`. You'll have to edit

  /etc/update-manager/release-upgrades

and set:

  Prompt=normal

then just running `do-release-upgrade` should upgrade your system to Impish (21.10). If you can still reproduce the issue there, please also try to upgrade to the current development release, via:

  do-release-upgrade -d

This is especially interesting because the current devel release is going to be an LTS release (22.04).
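The two steps above can be scripted; a minimal sketch, assuming the file already contains a Prompt= line (as it does on a default install):

```shell
#!/bin/sh
# Allow non-LTS upgrade targets, then start the release upgrade.
sudo sed -i 's/^Prompt=.*/Prompt=normal/' /etc/update-manager/release-upgrades
grep '^Prompt=' /etc/update-manager/release-upgrades
sudo do-release-upgrade
```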

Changed in ocfs2-tools (Ubuntu):
status: New → Incomplete
Revision history for this message
Norbert B. (nbpq) wrote :

Hello Paride,

thank you for the detailed instruction - very helpful!
I hope it will work in the Oracle environment, because they have their own Oracle repositories.

For the moment I'm testing with current updates and new kernel 5.13.0-1018-oracle.
The error didn't happen immediately after the update (as it did last time).
So I will test for one or two days before trying to upgrade.

I'll post results as soon as I have them.

Revision history for this message
Norbert B. (nbpq) wrote :

Hello Paride & all,

sorry for the late reply, but I was on vacation last week.
I have found a solution/workaround now, but it's a bit strange.

Background: Oracle is offering different shapes of virtual machines.
In the OCFS2 cluster we are using:
VM.Standard.E4.Flex (AMD)
VM.Standard3.Flex (Intel)

These machines use different virtual NICs too.

I did upgrades to kernel Ubuntu-oracle-5.13.0.1018.22~20.04.1 (and some other packages).

After these upgrades the problem was gone in our testing cluster, where both VMs run on VM.Standard.E4.Flex (AMD).

In the (planned) production cluster, which runs on mixed VMs (VM.Standard.E4.Flex / VM.Standard3.Flex), the problem was still present.
So I changed the shape of the 2nd VM from Intel to AMD, the same as in the testing environment, and the problem was gone too.

In summary:
Upgrading to Ubuntu-oracle-5.13.0.1018 solves the problem, but only when all cluster machines have the same VM shape (and/or possibly the same type of NIC).
When the machines run on different VM shapes (and NICs?), the problem remains.

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Thanks for describing your findings Norbert, and good to hear you have a functional environment now. This might be helpful for others. In this case, I do not think there is much we can do here, since this is a very specific case in Oracle Cloud.

Revision history for this message
Norbert B. (nbpq) wrote :

Unfortunately the problem is back in:

Ubuntu 20.04.4 LTS
Kernel: Linux 5.13.0-1027-oracle
ocfs2-tools 1.8.6-2ubuntu1
open-iscsi 2.0.874-7.1ubuntu6.2
... or it was never solved, and the earlier success was just a lucky accident.

Meanwhile I have two servers that can never mount an OCFS2 volume at the same time.
Each single server can mount it when it's not mounted on the other one.
I did tests with all possible virtual machine shapes (AMD/Intel) and network adapters including hardware-assisted (SR-IOV) networking. The problem remained.

The bug appears in Debian bug reports too:
https://groups.google.com/g/linux.debian.bugs.dist/c/0xf2QXZOpE4

I tested the solution offered there and disabled all quotas on system and volume, but it had no effect. The problem remained.

===

A second problem appeared:

It seems that the needed OCFS2 modules are no longer present in the regular Ubuntu kernels for Oracle Cloud.
After apt upgrades, the ocfs2 and o2cb services no longer start:
Apr 27 10:05:06 trans2 modprobe[759]: FATAL: Module ocfs2_stackglue not found in directory /lib/modules/5.13.0-1027-oracle
Apr 27 10:05:06 trans2 modprobe[773]: FATAL: Module ocfs2_dlmfs not found in directory /lib/modules/5.13.0-1027-oracle

Solution:
apt install linux-modules-extra-5.13.0-1027-oracle

Not really good in production environments, where upgrades can run automatically.
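One hedged way to keep the modules in step with the booted kernel is to derive the package name from `uname -r`. This is only a sketch; it assumes a matching linux-modules-extra package exists in the archive for each -oracle kernel, which is worth verifying before relying on it:

```shell
#!/bin/sh
# Install the modules-extra package matching the running kernel,
# which is where ocfs2_stackglue / ocfs2_dlmfs live on these images.
pkg="linux-modules-extra-$(uname -r)"
echo "installing $pkg"
sudo apt-get install -y "$pkg"
```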
Was this done by Oracle or by the Canonical team that builds the Images for Oracle?
As far as I know, Oracle gets ready-made images from Canonical.

Changed in ocfs2-tools (Ubuntu):
status: Incomplete → New
Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Thanks for getting back to us with more information Norbert.

I went through the Debian bug and it seems to be a Linux kernel bug rather than an ocfs2-tools one. The Debian maintainer came to the conclusion that this commit needs to be included in the kernel build:

From de19433423c7bedabbd4f9a25f7dbc62c5e78921 Mon Sep 17 00:00:00 2001
From: Joseph Qi <email address hidden>
Subject: ocfs2: fix crash when mount with quota enabled

From your comment, the ocfs2 modules are available only in linux-modules-extra-5.13.0-1027-oracle. I'd need to investigate that, and I do not know how the Oracle images are built, to be honest. I'll take a look at that and get back to you.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi,

I couldn't look at this bug in more detail, but I checked kernel git repos.

It's not yet in ubuntu generic/oracle 5.13 kernel, but it's coming sometime.

...

The commit was introduced in linux mainline in v5.18, and is present in
linux stable v5.15.33, but not yet in v5.10.x.

~/git/linux$ git describe --contains de19433423c7bedabbd4f9a25f7dbc62c5e78921
v5.18-rc1~32^2~14

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/fs/ocfs2?h=v5.15.33
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/fs/ocfs2?h=v5.10.113

Those 2 versions are pulled in by our Ubuntu 5.13 generic kernel stable maintenance,
which is currently at v5.15.26 (so a few releases to go until v5.15.33).

The ubuntu 5.13 oracle kernel is derived from it (with custom additions), and
so far, hasn't pulled that patch yet (neither via -generic nor custom additions).

[3] https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-oracle/+git/focal/log/fs/ocfs2?h=oracle-5.13-next
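The same containment check works on any branch without `git describe`, via `git merge-base --is-ancestor` (demonstrated here on a throwaway repo; against a real kernel tree the first argument would be the fix commit quoted above):

```shell
#!/bin/sh
# Demonstrate checking whether a given commit is contained in a branch.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=a@b.c -c user.name=t commit -q --allow-empty -m fix
fix=$(git rev-parse HEAD)
git -c user.email=a@b.c -c user.name=t commit -q --allow-empty -m later
# Exit status 0 means $fix is an ancestor of HEAD, i.e. included.
if git merge-base --is-ancestor "$fix" HEAD; then
    echo "fix is included"
fi
```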

...

It's possible to send that patch before it gets in via stable maintenance.

However, I was checking, and it seems that patch fixes a kernel crash
("BUG: kernel NULL pointer dereference"), but this bug observes a hang?

[4] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1007141

Sorry if I missed this through the comments, but could it be confirmed
that the kernel logs show that symptom too?

Thanks!

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Forgot this line,

> Those 2 versions are pulled by our ubuntu 5.13 generic kernel stable maintenance,
> which is currently at v5.15.26 (so a few to go for v5.15.33).

~/git/kernel/ubuntu-impish$ git log --oneline origin/master-next | grep stable | head -n1
ffd44d75c03e UBUNTU: upstream stable to v5.10.103, v5.15.26

Revision history for this message
Norbert B. (nbpq) wrote :

I hope, that this will work.

What irritates me:
I did tests with deactivated quotas on system and disk, but the problem remained.
We don't have the problem with Oracle Linux (RHEL), which uses regular kernels too, afaik.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Norbert,

This would suggest this indeed isn't related to that kernel commit
(different symptom: hang vs. crash; and independently of quotas).

I'd like to check this in more detail on the kernel side, next week.

Could you please confirm whether my understanding to reproduce this is correct?

- There are 2 servers in OCI (apparently the type/shape make no difference),
- with 1 (same) iSCSI volume shared between both,
- that is formatted with OCFS2 (mkfs.ocfs2; which flags?)
- and it mounts just fine on the first node,
- BUT the mount hangs on the second node (intermittently).

If you have any details to add (e.g., mkfs.ocfs2 flags, dlm setup customization),
or even the exact commands you used (if that's easy for you; scripts are fine),
that's very welcome.

And if you could please upload the `dmesg` output of the hang in the 5.13 kernel
(bug description has it for 5.11), that'd be great.
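For reference, a minimal two-node o2cb setup of the kind described above looks roughly like this (node names, IPs, and the cluster name are placeholders, not the reporter's actual values, which are in the attached setup steps):

```text
# /etc/ocfs2/cluster.conf -- identical on both nodes
cluster:
        node_count = 2
        name = ocfs2
node:
        ip_port = 7777
        ip_address = 192.168.14.50
        number = 0
        name = node1
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 192.168.14.51
        number = 1
        name = node2
        cluster = ocfs2
```

The volume would then be formatted once from a single node (e.g. `mkfs.ocfs2 -N 2 /dev/sdb`) and mounted on both.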

Thanks!

Changed in ocfs2-tools (Ubuntu):
status: New → Incomplete
Revision history for this message
Norbert B. (nbpq) wrote :

Hello Mauricio,

thank you for picking up the problem.
I confirm your understanding and attach setup steps and logs.

I can give you more details if needed - just ask me.
I'm in office from Monday to Thursday.

Best regards

Changed in ocfs2-tools (Ubuntu):
status: Incomplete → New
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Last week has been a bit busy and I couldn't get to this, sorry; it's still on my list.

Revision history for this message
Norbert B. (nbpq) wrote (last edit ):

Update:
Problem is still present in:

Ubuntu 20.04.4 LTS
Kernel: Linux 5.13.0-1030-oracle

The OCFS2 modules were gone after the upgrade and had to be reinstalled with
apt install linux-modules-extra-5.13.0-1030-oracle

Problem is (after current upgrades) still NOT present in:

Ubuntu 18.04.6 LTS
Kernel: Linux 5.4.0-1071-oracle

Oracle Linux Server 7.9
Kernel: Linux 5.4.17-2136.307.3.5.el7uek.x86_64

Revision history for this message
Gavin Lu (gguanglu) wrote :

This issue also happens on

Ubuntu 22.04 LTS
Kernel: 5.15.0-40-generic
ocfs2-tools: 1.8.7-1build1

[66083.443026] o2hb: Heartbeat mode set to local
[66106.958102] o2net: Connected to node host1 (num 0) at xxxx:7777
[66113.032068] ocfs2: Registered cluster interface o2cb
[66113.034718] o2dlm: Joining domain 46F2901EB11A4795953C95921232F55A
[66113.034722] (
[66113.034724] 0
[66113.034726] 1
[66113.034727] ) 2 nodes
[66337.770968] INFO: task mount.ocfs2:4663 blocked for more than 120 seconds.
[66337.771051] Not tainted 5.15.0-40-generic #43-Ubuntu
[66337.773821] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[66337.776954] task:mount.ocfs2 state:D stack: 0 pid: 4663 ppid: 4662 flags:0x00004002
[66337.776968] Call Trace:
[66337.776973] <TASK>
[66337.776979] __schedule+0x23d/0x590
[66337.777065] schedule+0x4e/0xb0
[66337.777071] schedule_timeout+0xfb/0x140
[66337.777081] __wait_for_common+0xab/0x150
[66337.777088] ? usleep_range_state+0x90/0x90
[66337.777096] wait_for_completion+0x24/0x30
[66337.777113] __ocfs2_cluster_lock.constprop.0+0x1f9/0x900 [ocfs2]
[66337.777251] ocfs2_inode_lock_full_nested+0x17e/0x480 [ocfs2]
[66337.777343] ? ocfs2_inode_lock_full_nested+0x17e/0x480 [ocfs2]
[66337.777439] ocfs2_journal_init+0x98/0x340 [ocfs2]
[66337.777540] ocfs2_check_volume+0x3b/0x4f0 [ocfs2]
[66337.777682] ? iput+0x74/0x1c0
[66337.777728] ocfs2_mount_volume.isra.0+0x12a/0x460 [ocfs2]
[66337.777857] ocfs2_fill_super+0x57f/0x8d0 [ocfs2]
[66337.778031] ? vsnprintf+0x1df/0x550
[66337.778086] mount_bdev+0x193/0x1c0
[66337.778099] ? ocfs2_mount_volume.isra.0+0x460/0x460 [ocfs2]
[66337.778222] ocfs2_mount+0x15/0x20 [ocfs2]
[66337.778378] legacy_get_tree+0x28/0x50
[66337.778401] vfs_get_tree+0x27/0xc0
[66337.778411] do_new_mount+0x16e/0x2d0
[66337.778422] path_mount+0x1db/0x880
[66337.778431] ? putname+0x55/0x60
[66337.778442] __x64_sys_mount+0x108/0x140
[66337.778453] do_syscall_64+0x59/0xc0
[66337.778462] ? do_readlinkat+0x10f/0x120
[66337.778473] ? exit_to_user_mode_prepare+0x37/0xb0
[66337.778524] ? syscall_exit_to_user_mode+0x27/0x50
[66337.778535] ? __x64_sys_readlink+0x1e/0x30
[66337.778543] ? do_syscall_64+0x69/0xc0
[66337.778547] ? do_syscall_64+0x69/0xc0
[66337.778551] ? sysvec_apic_timer_interrupt+0x4e/0x90
[66337.778557] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
[66337.778568] entry_SYSCALL_64_after_hwframe+0x44/0xae
[66337.778577] RIP: 0033:0x7fe4a7265eae
[66337.778583] RSP: 002b:00007ffc84474158 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[66337.778591] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe4a7265eae
[66337.778595] RDX: 00005625ffe1629d RSI: 00005626002c5370 RDI: 00005626002cc350
[66337.778598] RBP: 00007ffc84474490 R08: 00005626002c5fe0 R09: 00007ffc84471ce0
[66337.778601] R10: 0000000000000000 R11: 0000000000000246 R12: 00005626002c53d0
[66337.778605] R13: 00005626002c5350 R14: 00007ffc844741c0 R15: 0000000000000000
[66337.778610] </TASK>
[66458.603056] INFO: task mount.ocfs2:4663 blocked for more than 241 seconds.
[66458.605733] Not tainted 5.15.0-40-generic #43-Ubuntu
[66458.608015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disable...

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ocfs2-tools (Ubuntu):
status: New → Confirmed