Programs that normally use little CPU show surprisingly high kernel CPU usage. The Linux machine alternates between two states. Most of the time, programs execute normally with low CPU usage. During a CPU "surge", programs spend most of their time in the kernel and use 100% of the available CPU.
A sample C program and its output are below.
The machine goes in and out of a weird state roughly every five minutes, during which some, but not all, programs use high kernel CPU. A CPU "surge" might last a minute, then the machine returns to its normal state for another 5-10 minutes. A reboot sometimes helps, but the surges gradually build up over a week until the problem becomes severe enough that another reboot is required. Sometimes a reboot doesn't help, and the only temporary fix is to reboot again.
- CentOS release 6.9
- Dell PowerEdge R630 with 14 CPUs, 32 GB RAM
- Linux 2.6.32-696.30.1.el6.x86_64 x86_64
I was able to reproduce the CPU issue with this sample C program. Each iteration runs a shell script that sleeps for 0.01 second and prints how long that took; the runs below use 10 iterations. It runs quickly when the machine is in a normal state and slowly when the machine is in the abnormal state.
test_system.c
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int i, n;
        char cmd[100];

        if (argc == 2) {
            n = atoi(argv[1]);
        } else {
            n = 1;
        }
        printf("n=%d\n", n);
        for (i=0; i<n; i++) {
            system("ts=$(date +%s%N) ; sleep 0.01 ; tt=$((($(date +%s%N) - $ts)/1000000)) ; echo \"Time taken: $tt milliseconds\"");
        }
    }

Here's the output when the machine is in a normal state. Most of the CPU time is spent in user space.
    $ time test_system 10
    n=10
    Time taken: 12 milliseconds
    Time taken: 12 milliseconds
    Time taken: 12 milliseconds
    Time taken: 12 milliseconds
    Time taken: 12 milliseconds
    Time taken: 12 milliseconds
    Time taken: 12 milliseconds
    Time taken: 12 milliseconds
    Time taken: 12 milliseconds
    Time taken: 12 milliseconds

    real 0m0.210s
    user 0m0.059s
    sys 0m0.015s
    $

Here's the output when the machine is in CPU "surge" mode. I added comments where two long pauses occurred. The delays are due to the machine being CPU overloaded. The run time is 35.6 sec, 170x longer than normal. The kernel CPU usage for this run is 7.2 sec, a 480x increase from the normal run.
    $ time test_system 10
    n=10
    Time taken: 161 milliseconds
    Time taken: 406 milliseconds
    Time taken: 58 milliseconds
    Time taken: 176 milliseconds
    Time taken: 189 milliseconds
    --- approx. 17 sec delay ---
    Time taken: 25 milliseconds
    Time taken: 127 milliseconds
    Time taken: 82 milliseconds
    Time taken: 84 milliseconds
    Time taken: 12 milliseconds
    --- approx. 17 sec delay ---

    real 0m35.641s
    user 0m0.077s
    sys 0m7.233s
    $

This post suggests that too much memory allocated for I/O buffers can cause this problem, because the kernel has to work hard to reclaim memory in order to run programs. But there's no indication of memory swapping or a memory shortage. I ran a separate test that allocates 100 MB of memory and didn't see delays or high CPU, even during a CPU surge.
Any other suggestions on what can cause this behavior?