[Home]

Summary:ASTERISK-07503: [patch] closing fd twice with FAGI in use under good load causes deadlock
Reporter:Bill Nesbitt (bn999)Labels:
Date Opened:2006-08-09 13:42:01Date Closed:2006-09-06 15:05:11
Priority:BlockerRegression?No
Status:Closed/CompleteComponents:Resources/res_agi
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) debug.info_thread
( 1) debug.messages
( 2) debug.thread_bt
( 3) res_agi.c.patch
Description:In the middle of normal call processing (SIP & FAGI) on a moderately loaded system, the system will stop processing calls.  Show Channles will show a short list of channels with a disproportionate number of calls.  SIP Show Channels will not show anything.  Compiled with DEBUG_THREADS = -DDEBUG_THREADS -DDETECT_DEADLOCKS, the following is shown in the logs:

Aug  9 13:52:40 ERROR[27346] ../include/asterisk/lock.h: chan_sip.c line 11459 (do_monitor): Deadlock? waited 5 sec for mutex '&sip->lock'?
Aug  9 13:52:40 ERROR[27346] ../include/asterisk/lock.h: chan_sip.c line 3039 (sip_read): '&sip->lock' was locked here.
Aug  9 13:52:40 ERROR[27346] ../include/asterisk/lock.h: chan_sip.c line 11459 (do_monitor): Deadlock? waited 5 sec for mutex '&sip->lock'?
Aug  9 13:52:40 ERROR[27346] ../include/asterisk/lock.h: chan_sip.c line 3039 (sip_read): '&sip->lock' was locked here.

which will repeat for thousands of lines.  This can happen with only 50 channels, but in this instance happend with over 200 channels.  The box is 60% idle at this channel load.


****** STEPS TO REPRODUCE ******

Cannot recreate at will, but it will typically happen after a few hours of moderate load.

****** ADDITIONAL INFORMATION ******

Attached are the thread info and backtraces.  I have a log file recorded at debug = 4 available for inspection (260+ MB).
Comments:By: Serge Vecher (serge-v) 2006-08-09 13:51:26

does this only happen with sip channels in use?

By: Bill Nesbitt (bn999) 2006-08-09 13:54:07

SIP is the only channel type that we use.  I have never had it happen on an idle box.

By: Bill Nesbitt (bn999) 2006-08-09 14:27:31

I have attached message that I thought were interested in the debug log leading up to the block.

By: Serge Vecher (serge-v) 2006-08-09 14:29:42

could you be running out of open files?

By: Bill Nesbitt (bn999) 2006-08-09 14:32:58

I have:

ulimit -n 8192

in safe_asterisk.  Is that enough?

By: Jason Parker (jparker) 2006-08-14 12:24:29

With only 50-200 calls, I wouldn't think ulimit to be a problem.

By: Bill Nesbitt (bn999) 2006-08-15 19:04:15

I have invested huge amounts of time trying to track the root cause of this down.  In the end, what I am seeing is file descriptors suddenly going bad.  That is system calls like recvfrom returning EBADF (Bad File Descriptor).  These are fd's that have been open for some time and in use.  I have also had fopen()'s return NULL and set the errno to EBADF.  According to the man pages, this is not possible.  Sometimes a seeming valid >0 fd returned from a socket() call immediately failing with an EBADF when trying to do a fcntl().  The block is actually caused by a fd somehow, all of a sudden, loosing it's O_NONBLOCK flag.

My first thought is that there is some sort of corruption of the glibc's fd list cause by a bug in one of the threads.  Since I have a few custom modules, I found ways to remove them one by one until I was left with almost completely stock source, except for my debug enhancements.  I have also removed every module that I am not explicitly using.  I have ruled a lot out, but am still left with the bulk of the Asterisk code.

Another thought is that there is some sort of kernel bug.  I am running RHEL4U4 2.6.9-42.ELsmp with a 4 way motherboard.  Has anyone had any success with this OS?  Is there a recommended OS to try?

Running valgrind gives me no memory leaks or buffer overflows.  The problem with this is that I cannot run more than a few simualtaneous calls under valgrind as it's single threaded and there is not enough CPU available.

Are there any utilities that might help track down unauthorized writes to system controlled memory (like the fd table)?  Could there be a situation where a fd is closed more than once?  Any ideas on where to look next?

By: Serge Vecher (serge-v) 2006-08-16 08:14:35

bn999: you may want to join the #asterisk-dev channel on IRC to get some "realtime" help with this...

By: Bill Nesbitt (bn999) 2006-08-21 17:51:30

Got it!

This problem only occurs when using FastAGI.  The more load that is put on the box, the more often it will happen.  What I found was that a fd was being closed twice.  When there is enough activity, the second close could happen after another thread was issued the exact same fd number (integer).  This causes all of the strange behavior reported here for any section of code dealing with fds.  The attached patch solves the problem for 1.2.10.  The bug is present all the way back to the initial addition of FastAGI (~ 1.x).

The box is now happily serving over 400 sim calls w/ under 50% CPU utilization.

By: Serge Vecher (serge-v) 2006-08-22 08:16:21

good job, bn999. Could you please use tabs instead of spaces for indenation?

Thanks,

By: Bill Nesbitt (bn999) 2006-08-22 09:57:58

It was a cut and paste.  New attachment 'res_agi.c.patch'.

By: Joshua C. Colp (jcolp) 2006-09-06 15:05:10

file swoops in from out of nowhere and commits this! Boom! In 1.2 as of revision 42148 and trunk as of revision 42149. Thanks!