October 23, 2018 · linux bug

This Data Corruption Bug will Shock You

It's not common to hit a severe bug in Ubuntu LTS, and it's even less common for that bug to corrupt data! We are used to the idea that data is stored reliably and safely on our computers. However, for the past week I have been pulling my hair out over a data corruption issue, and I was really surprised to discover the true cause.

I have been creating several Virtual Machines (VMs) to compartmentalize my essential network services (e.g. email is separated from contacts), and I decided to run them as QEMU guests on my Ubuntu 18.04 LTS system. I first provisioned a Windows Server VM, which installed flawlessly on a qcow2 image on an ext4 disk. However, I ran into issues when provisioning a Linux VM with a different virtual hard drive configuration. The libvirt drive definition for the VM was:

<disk type='file' device='disk'>
   <driver name='qemu' type='raw' cache='none'/>
   <source file='/dev/rust/client'/>
   <target dev='sda' bus='sata'/>
</disk>

This disk image was stored on a ZVOL, i.e. a block device on a ZFS filesystem, and not as a file.

After installing Debian with disk encryption, I kept getting a kernel panic on first boot, but only when I used a long encryption key:

[    0.799754] Unpacking initramfs...
[    0.800970] Initramfs unpacking failed: junk in compressed archive
[    0.936387] List of all partitions:
[    0.937590] No filesystem could mount root, tried: [    0.938747] 
[    0.939199] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    0.941199] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-7-amd64 #1 Debian 4.9.110-3+deb9u2
[    0.943290] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[    0.966656] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

The console line "Initramfs unpacking failed: junk in compressed archive" is the kernel saying that its boot file is corrupt! The initramfs is an image file that provides essential modules and drivers during system boot. I decided to mount the guest drive and extract the contents of initrd.img, but I received more error messages from gzip:

gzip: initrd.img-4.9.0-7-amd64: invalid compressed data--crc error
gzip: initrd.img-4.9.0-7-amd64: invalid compressed data--length error
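
The same integrity check can be scripted, which helps when comparing several copies of an image. The following is only a minimal sketch in Python, assuming the image is a single gzip stream (as the gzip errors above suggest). The function name gzip_is_intact and the example path are hypothetical and not part of any tool mentioned here.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end so the trailing CRC32 and length fields are checked.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC mismatches, EOFError covers
        # truncated files, zlib.error covers invalid compressed data.
        return False
    return True


if __name__ == "__main__":
    # Hypothetical path: wherever the guest's /boot happens to be mounted.
    print(gzip_is_intact("/mnt/guest/boot/initrd.img-4.9.0-7-amd64"))
```

A result of False only says that the stream is damaged. It does not say where the damage came from.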

At this point I gave up, thinking that the root cause was either of the following:

I even contemplated submitting a bug report to the initramfs-tools package. However, it turns out I was completely mistaken in my guesses, and the fact that the bug only occurred with long encryption keys was a red herring. A Google search for "qemu corrupt", restricted to results from the past month, turned up a news article detailing the cause of the data corruption bug.

It turns out that my data corruption was due to a kernel bug! This bug only occurs when the virtual drive is set to cache='none' and is located on a non-ext4 file system (mine was ZFS). As in any accident, there was a long chain of events: QEMU cache='none' -> buggy system call on a ZVOL/ZFS file system -> corrupt disk write -> corrupt initramfs -> kernel panic on VM boot.
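
Since cache='none' is the first link in that chain, it can be useful to see which disks of a guest use it. Below is a rough sketch in Python that scans a libvirt domain definition (for example the output of virsh dumpxml) for such disks. The function name disks_with_cache_none and the sample XML are made up for illustration, and the script only reports the setting; it cannot tell whether the host kernel is affected.

```python
import xml.etree.ElementTree as ET


def disks_with_cache_none(domain_xml):
    """Return the source paths of disks whose driver uses cache='none'."""
    root = ET.fromstring(domain_xml)
    affected = []
    for disk in root.iter("disk"):
        driver = disk.find("driver")
        if driver is None or driver.get("cache") != "none":
            continue
        source = disk.find("source")
        path = None
        if source is not None:
            # libvirt uses 'file' for file-backed disks and 'dev' for block devices.
            path = source.get("file") or source.get("dev")
        affected.append(path)
    return affected


if __name__ == "__main__":
    # Illustrative domain fragment modelled on the disk definition above.
    sample = """
    <domain>
      <devices>
        <disk type='file' device='disk'>
          <driver name='qemu' type='raw' cache='none'/>
          <source file='/dev/rust/client'/>
          <target dev='sda' bus='sata'/>
        </disk>
      </devices>
    </domain>
    """
    print(disks_with_cache_none(sample))
```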

Even though I hoped that Linux (and by extension Linus!) would be infallible, and that such a serious data corruption bug would have been a package's fault, this bug is just one of many kernel bugs every year. This is the Ubuntu Bug Fix Report. The bug fix was actually released today and thankfully I did not lose any important data. In conclusion, as silly as it sounds, I learned that Linux isn't perfect.