As a testbed for commercial TCAD software, we will use a standard desktop PC equipped with an i4790 CPU and 32 GB RAM, and an Nvidia 710 GT to connect the monitor via DVI. As a substitute for Redhat Enterprise Linux (RHEL) required by all of the packages we're about to evaluate, I'm going to install CentOS.

I create a bootable USB stick by issuing

dd bs=4M if=CentOS-7-x86_64-Minimal.iso of=/dev/sdc status=progress && sync


The first stick doesn't work, and it takes us perhaps half an hour to realize that it's the fault of the stick, not the system. A second one works right away. I manually partition the disk, as in my installation in a virtual machine, choosing btrfs as filesystem for both system and home partitions. I add a UEFI boot partition as well as a swap partition.

During the first boot from hard disk, the system throws an error message concerning nouveau (the open-source driver for the Nvidia graphics card) and subsequently hangs with a kernel soft lock signified by the repeated message that CPU #i stuck for 120s. After a hard reset, the system boots and behaves as expected, but when I boot again to activate the new kernel installed by yum, the system hangs again. Since the hangup is always preceded by the error message from the nouveau driver, we remove the Nvidia card and reboot. Now the systems hangs without any message. Great! A cold boot (with the Nvidia card installed again) seems to help at first, but later the system hangs repeatedly and finally refuses to boot to the login screen at all.

What I thought to be done within an hour has already taken most of this Friday morning. Since the Nvidia card does not seem to responsible for these problems, I start to suspect the btrfs filesystem, which Redhat still considers to be a technology preview. In the afternoon, I thus reinstall the system, this time choosing the default filesystem XFS. And indeed, while I still get the error message from nouveau, the system boots up without any of the previous symptoms. I go home with the conviction that I've solved the problem.

Saturday morning, curiosity gets the better of me and I decide to check if the system still runs. It does, but htop shows that the uptime is only 49 min instead of the expected 13 h. Weird! I check again Sunday morning, and again the uptime is less than an hour. At the same time, I notice that the filesystem usage seems to increase by roughly 1 GB per day. Aha! An 'll /var/crash' confirms my suspicion: the system crashes and dumps the kernel roughly every 2 or 3 hours.

Core dumps can be analyzed to determine their cause. Following the Redhat tutorial and issuing the command

crash /usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2017-05-01-07\:35\:53/vmcore


I get the following crash report:

KERNEL: /usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2017-05-01-07:35:53/vmcore  [PARTIAL DUMP]
CPUS: 8
DATE: Mon May  1 07:35:42 2017
UPTIME: 08:27:43
NODENAME: testbed.de
RELEASE: 3.10.0-514.16.1.el7.x86_64
VERSION: #1 SMP Wed Apr 12 15:04:24 UTC 2017
MACHINE: x86_64  (3600 Mhz)
MEMORY: 31.9 GB
PANIC: "Kernel panic - not syncing: Hard LOCKUP"
PID: 0
COMMAND: "swapper/6"
CPU: 6

 ob@testbed:~\$ uptime