We have been running our Java system for over 2 years without ever having a system hang. We have 2 physical servers running similar Java software (2 JVMs on each server) to form a cluster. As far as I can tell, the hangs only started after we introduced core pinning and mappedbus.io for shared-memory access between the 2 JVMs on one of the servers. The system hang has only happened 4 times in 2 weeks, and it only ever happens on the machine where we configured the core pinning and memory-mapped file access between the JVMs. We have since disabled that config, so we no longer pin the cores that spin on reading the memory-mapped files, and we no longer pin our primary application thread. Note that when I say "pin", we also busy-spin the thread running on that pinned core.
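For context, here is a simplified sketch of what that pinned busy-spin reader looks like conceptually. This is not our actual code; the affinity library, file path and buffer layout below are illustrative placeholders only.

import net.openhft.affinity.AffinityLock;   // placeholder: OpenHFT Java-Thread-Affinity for pinning

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BusySpinReaderSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("/dev/shm/ipc-demo"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
             AffinityLock lock = AffinityLock.acquireLock()) {   // pin this thread to a dedicated core

            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            long lastSeq = 0;

            while (true) {                        // busy spin: the thread never sleeps or yields
                long seq = buf.getLong(0);        // writer JVM bumps this after publishing a message
                if (seq != lastSeq) {
                    lastSeq = seq;
                    // ... read the message body from the rest of the buffer ...
                } else {
                    Thread.onSpinWait();          // CPU hint only; still 100% busy on the pinned core
                }
            }
        }
    }
}

The point is simply that the reading thread never blocks and never leaves its core while pinning and busy-spinning are enabled.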
That is totally anecdotal, though. Since the system does not hang every day, I cannot say for sure that it has anything to do with core pinning or shared-memory access. However, with pinning (and busy-spinning) disabled, and the shared memory accessed in a loop with LockSupport.parkNanos(5000), we have not seen any system hangs.
Latency is critical for us, so this "non-busy" setup is only a temporary workaround.
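The temporary workaround is essentially the same polling loop, but without pinning and without busy-spinning. Again, this is just a sketch using the same illustrative layout as above, not our actual code.

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.LockSupport;

public class ParkingReaderSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("/dev/shm/ipc-demo"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            long lastSeq = 0;

            // No pinning, no busy spin: park ~5 microseconds between polls so the
            // core is free for other work. This adds latency, but we have not seen
            // a system hang with this configuration.
            while (true) {
                long seq = buf.getLong(0);
                if (seq != lastSeq) {
                    lastSeq = seq;
                    // ... read the message body from the rest of the buffer ...
                } else {
                    LockSupport.parkNanos(5000);
                }
            }
        }
    }
}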
Also, please note that I moved the application to an identical server and reproduced the full system hang there as well, so I don't see this being a hardware failure.
From digging through the logs around the time of a crash, this is what seems relevant to me. There are several of these stacks; I am only posting the first one here (i.e. I don't believe this has anything to do with postgres itself):
kernel: [25738.874778] INFO: task postgres:2155 blocked for more than 120 seconds.
kernel: [25738.874833] Not tainted 5.4.0-050400-generic #201911242031
kernel: [25738.874878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: [25738.874928] postgres D 0 2155 2056 0x00004000
kernel: [25738.874931] Call Trace:
kernel: [25738.874942] __schedule+0x2e3/0x740
kernel: [25738.874948] ? __wake_up_common_lock+0x8a/0xc0
kernel: [25738.874951] schedule+0x42/0xb0
kernel: [25738.874957] jbd2_log_wait_commit+0xaf/0x120
kernel: [25738.874961] ? wait_woken+0x80/0x80
kernel: [25738.874965] jbd2_complete_transaction+0x5c/0x90
kernel: [25738.874969] ext4_sync_file+0x38c/0x3e0
kernel: [25738.874974] vfs_fsync_range+0x49/0x80
kernel: [25738.874977] do_fsync+0x3d/0x70
kernel: [25738.874980] __x64_sys_fsync+0x14/0x20
kernel: [25738.874985] do_syscall_64+0x57/0x190
kernel: [25738.874991] entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: [25738.874993] RIP: 0033:0x7f96dc24b214
kernel: [25738.875002] Code: Bad RIP value.
kernel: [25738.875003] RSP: 002b:00007fffb2abd868 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
kernel: [25738.875006] RAX: ffffffffffffffda RBX: 00007fffb2abd874 RCX: 00007f96dc24b214
kernel: [25738.875007] RDX: 00005635889ba238 RSI: 00005635889a1490 RDI: 0000000000000003
kernel: [25738.875009] RBP: 00007fffb2abd930 R08: 00005635889a1480 R09: 00007f96cc1e1200
kernel: [25738.875010] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
kernel: [25738.875011] R13: 0000000000000000 R14: 000056358899c5a0 R15: 0000000000000001
P.S. This also happened on 16.04 with kernel 4.15. The upgrade to 18.04 and the 5.x kernel was an attempt to resolve the system hang, but it has not made any difference.