I am working in a relatively large code base where I am seeing a file descriptor leak: after I run certain programs, processes start complaining that they cannot open files.
Although this normally happens after about 6 days, I can reproduce the problem in 3-4 hours by reducing the value in /proc/sys/fs/file-max to 9000.
There are many processes running at any moment. I have been able to pinpoint a couple of processes that could be causing the leak. However, I don't see any file descriptor leak either through lsof or through /proc/<pid>/fd.
If I kill the processes that I suspect of leaking (they communicate with each other), the leak goes away and the FDs are released.
Running cat /proc/sys/fs/file-nr in a while(1) loop shows the system-wide count growing. However, I don't see a matching increase in any individual process.
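For reference, here is a minimal sketch of that sampling loop. It assumes the usual three-field layout of /proc/sys/fs/file-nr (allocated handles, free handles, maximum) and only looks at the first field. The function names `read_allocated` and `watch_file_nr`, and the optional path argument, are hypothetical and exist only for illustration.

```shell
#!/bin/bash
# Print the first field of file-nr (allocated file handles).
# The path argument is optional so the function can be tried on a sample file.
read_allocated() {
    local path="${1:-/proc/sys/fs/file-nr}"
    awk '{ print $1; exit }' "$path"
}

# Sample the allocated count <count> times, <interval> seconds apart,
# and print the change since the previous sample.
watch_file_nr() {
    local interval="$1" count="$2" path="${3:-/proc/sys/fs/file-nr}"
    local prev cur i=0
    prev=$(read_allocated "$path")
    while [ "$i" -lt "$count" ]; do
        sleep "$interval"
        cur=$(read_allocated "$path")
        echo "$(date '+%Y-%m-%d %H:%M:%S') allocated=$cur delta=$((cur - prev))"
        prev=$cur
        i=$((i + 1))
    done
}

# Example (not run automatically): watch_file_nr 2 5
```

A steadily positive delta over many samples is what I am calling "the leak" here; short-lived spikes from normal activity would show up as positive and negative deltas that cancel out.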
Here is a script I wrote to detect that the leak is happening:

    #!/bin/bash
    if [ "$#" != "2" ]; then
        name=`basename $0`
        echo "Usage : $name <threshold for number of fds> <check_interval>"
        exit 1
    fi

    fd_threshold=$1
    check_interval=$2
    total_num_desc=0

    touch pid_monitor.txt
    nowdate=`date`
    echo "=================================================================================================================================">> pid_monitor.txt
    echo "****************************************MONITORING STARTS AT $nowdate***************************************************">> pid_monitor.txt

    while [ 1 ]
    do
        # Walk every PID reported by ps, skipping the header line
        for x in `ps -ef | awk '{ print $2 }'`
        do
            if [ "$x" != "PID" ]; then
                # Plain ls (not ls -l) so the "total" line is not counted
                num_fd=`ls /proc/$x/fd 2>/dev/null | wc -l`
                # cmdline is NUL-separated; turn the NULs into spaces
                pname=`tr '\0' ' ' < /proc/$x/cmdline 2>/dev/null`
                total_num_desc=`expr $total_num_desc + $num_fd`
                if [ $num_fd -gt $fd_threshold ]; then
                    echo "Process name $pname($x) and number of open descriptors = $num_fd">> pid_monitor.txt
                fi
            fi
        done
        total_nr_desc=`cat /proc/sys/fs/file-nr`
        lsof_desc=`lsof | wc -l`
        nowdate=`date`
        echo "$nowdate : Total number of open file descriptors = $total_num_desc lsof desc: = $lsof_desc file-nr descriptor = $total_nr_desc">> pid_monitor.txt
        total_num_desc=0
        sleep $check_interval
    done

I run it like this:

    ./monitor.fd.sh 500 2 & tail -f pid_monitor.txt

As I mentioned earlier, I don't see any leak in /proc/<pid>/fd for any <pid>, but the leak is definitely happening and the system is running out of file descriptors.
I suspect something in the kernel is leaking. The Linux kernel version is 2.6.23.
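The comparison I am making between the per-process counts and file-nr can be reduced to a single number. The sketch below is only illustrative: the function names `count_fds`, `sum_process_fds` and `report_gap` are hypothetical, and the optional root argument exists only so it can be pointed at a test directory instead of /proc. It needs to run as root to read other users' fd directories.

```shell
#!/bin/bash
# Count the entries in one fd directory (0 if it is missing or unreadable).
count_fds() {
    ls "$1" 2>/dev/null | wc -l | tr -d ' '
}

# Sum fd entries across every numeric directory under a proc-like root.
sum_process_fds() {
    local root="${1:-/proc}" total=0 dir
    for dir in "$root"/[0-9]*; do
        [ -d "$dir/fd" ] || continue
        total=$((total + $(count_fds "$dir/fd")))
    done
    echo "$total"
}

# Print kernel-allocated handles, the per-process sum, and their difference.
report_gap() {
    local root="${1:-/proc}" allocated summed
    allocated=$(awk '{ print $1; exit }' "$root/sys/fs/file-nr")
    summed=$(sum_process_fds "$root")
    echo "allocated=$allocated summed=$summed gap=$((allocated - summed))"
}

# Example (not run automatically): report_gap
```

The absolute gap is not very meaningful, because descriptors shared through fork or dup are counted once per process in the sum but only once by the kernel, so the value can even be negative. What I am watching is the trend: in my case the allocated count keeps climbing while the per-process sum stays roughly flat.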
My questions are as follows:
1. Will 'ls /proc/<pid>/fd' list the descriptors opened by any library linked into the process with that pid? If not, how do I determine whether there is a leak in a library I am linking to?
2. How do I confirm whether the leak is in userspace or in the kernel?
3. If the leak is in the kernel, what tools can I use to debug it?
4. Are there any other tips you can give me?
Thanks for going through the question patiently.
Would really appreciate any help.