[Home]

Summary:ASTERISK-16865: deadlock on 1.8.0-rc2 and crash on 1.8.0 with multipule sip channels
Reporter:Kun Liu (knkcn)Labels:
Date Opened:2010-10-25 02:30:43Date Closed:2010-12-16 10:30:28.000-0600
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/Channels
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) backtrace.txt
( 1) crash-backtrace-2010-10-29.txt
( 2) deadlock_gdb.txt
( 3) newdeadlock-backtrace.log
( 4) newdeadlock-coreshowlocks.log
( 5) newdeadlock-message.log
Description:Asterisk 1.8.0-rc2 as a call center server run well with 1 or 2 concurrent lines. We did about 50 concurrent lines lab test with sipp and it run well too. But under a more than 20 lines field trial, the Asterisk deadlocked. After deadlocked I found many rtp channels and udp5060 had a large number on Recv-Q. At the same time "core show channels" showed nothing and many cli commands hang random. I collected some information using "gdb /usr/sbin/asterisk <pid>" but I have not enough time to debug what locked it before restarted it.

After that I upgraded the server to 1.8.0 with "DONT_OPTIMIZE" and "DEBUG_THREADS", then I did the test again. But this time it crashed when 10 phones were calling in.

BTW, after the "DEBUG_THREADS" is enabled, every time the "bridge" application is executed the following error will be reported.
[Oct 25 15:03:04] ERROR[25661] lock.c: features.c line 5551 (bridge_exec): mutex 'current_dest_chan' freed more times than we've loc
ked!
[Oct 25 15:03:04] ERROR[25661] lock.c: features.c line 5551 (bridge_exec): Error releasing mutex: Operation not permitted
Comments:By: Kun Liu (knkcn) 2010-10-25 05:12:08

Log Messeage when crashing:
[Oct 25 14:01:44] ERROR[25148] lock.c: features.c line 5551 (bridge_exec): mutex 'current_dest_chan' freed more times than we've loc
ked!
[Oct 25 14:01:44] ERROR[25148] lock.c: features.c line 5551 (bridge_exec): Error releasing mutex: Operation not permitted
[Oct 25 14:01:44] ERROR[25089] lock.c: chan_local.c line 385 (local_queue_frame): Error obtaining mutex: Invalid argument
[Oct 25 14:01:44] ERROR[25089] lock.c: chan_local.c line 915 (local_hangup): mutex '&p->lock' freed more times than we've locked!
[Oct 25 14:01:44] ERROR[25089] lock.c: chan_local.c line 915 (local_hangup): Error releasing mutex: Invalid argument

By: Kun Liu (knkcn) 2010-10-25 07:37:27

(gdb) f 10
ASTERISK-6 0x00000000004673a2 in ast_cel_report_event (chan=0x2aaad000ff78, event_type=AST_CEL_CHANNEL_END, userdefevname=0x59db15 "",
   extra=0x59db15 "", peer2=0x0) at cel.c:574
574             if (ev && ast_event_queue(ev)) {
(gdb) p chan->name
$1 = (const ast_string_field) 0x2aaad0002666 "Bridge/SIP/inbound-0000002b<ZOMBIE>"

By: SIP:ian@txrx.ca (weezey) 2010-10-25 12:17:36

I experienced the same problem this morning.  Unfortunately it's a production box with less than tolerant users so debugging wasn't an option and I had to roll it back to 1.4

By: Kun Liu (knkcn) 2010-10-26 05:38:37

I can not roll it back to 1.4 or 1.6 because the cel manager events are used to generate some special logs.

By: Kun Liu (knkcn) 2010-10-27 04:09:28

Asterisk 1.8.0 deadlock again - almost every time trial.

By: Kun Liu (knkcn) 2010-10-27 15:10:47

Understand the last deadlock.

Lock thread 2: handle_incoming locks SIP/outbound-00000309's pvt in add_header_max_forwards and trys to lock channels container.

Lock thread 4: bridge_exec for SIP/outbound-0000309 and SIP/inbound-00000306 locks channel SIP/outbound-00000309 and trys to lock SIP/outbound-00000309's pvt.

Lock thread 5: Channel SIP/inbound-00000311 hangup event locks channels container, but the cel_check_retire_linkdid function trys to lock channel SIP/outbound-0000309.

Then it becomes deadlock. But the first deadlock and the crash log have not cel_check_retire_linkdid, all the issues seem have relationship with bridge.

By: Kun Liu (knkcn) 2010-10-27 20:51:48

I think this is realated with https://reviewboard.asterisk.org/r/778/.

By: Kun Liu (knkcn) 2010-10-28 04:55:30

static int add_header_max_forwards(struct sip_pvt *dialog, struct sip_request *req)
{
/*      char clen[10];
       const char *max = NULL;

       while (dialog->owner && ast_channel_trylock(dialog->owner)) {
               sip_pvt_unlock(dialog);
               usleep(1);
               sip_pvt_lock(dialog);
       }

       if (dialog->owner) {
               max = pbx_builtin_getvar_helper(dialog->owner, "SIP_MAX_FORWARDS");
               ast_channel_unlock(dialog->owner);
       }

       if (max == NULL) {
               snprintf(clen, sizeof(clen), "%d", dialog->maxforwards);
       }
       return add_header(req, "Max-Forwards", max != NULL ? max : clen);
*/
       return add_header(req, "Max-Forwards", "70");
}

This works well under 4 times trial today. Should dialog be unlocked before return?

By: Kun Liu (knkcn) 2010-10-30 07:08:07

Crash again when using sipp test.
[Oct 29 14:24:04] ERROR[6217] lock.c: chan_local.c line 385 (local_queue_frame): Error obtaining mutex: Invalid argument
[Oct 29 14:24:04] ERROR[6217] lock.c: chan_local.c line 915 (local_hangup): mutex '&p->lock' freed more times than we've locked!
[Oct 29 14:24:04] ERROR[6217] lock.c: chan_local.c line 915 (local_hangup): Error releasing mutex: Invalid argument

By: Kun Liu (knkcn) 2010-10-31 23:57:44

Crashed again yesterday caused by the same issue. Anybody can tell me how to find the crash reason? I can use simple commands of gdb and valgrind.

By: Leif Madsen (lmadsen) 2010-11-02 09:20:51

knkcn: I understand you're having issues, but the issue tracker is a best effort system and you will not be able to receive immediate attention until the issue has been assigned to a developer.

If you have an immediate need to have this resolved I'd suggest hiring a consultant if you are unable to resolve the issue. If you have some development skills you may be able to get some help on the asterisk-dev mailing list (or #asterisk-dev IRC channel) if you're having an issue with a particular section of the code.

Beyond that, you can use the asterisk-users mailing list (or #asterisk IRC channel), or you'll have to be patient until this issue has been assigned to a developer.

By: Russell Bryant (russell) 2010-11-11 11:25:19.000-0600

Please try the latest code from the 1.8 branch as I think this was just fixed.  Alternatively, you can wait for Asterisk 1.8.1-rc1 and try that.



By: Marcin Kowalczyk (kowalma) 2010-11-11 17:36:47.000-0600

Hi Russell,

Any specific SVN release?

By: Stefan Schmidt (schmidts) 2010-11-14 13:12:18.000-0600

higher than 294640 looks like a good idea to me, cause this also solves a deadlock situation with statechanges.

but you just can try the latest svn rev.

best regards

stefan

By: Marcin Kowalczyk (kowalma) 2010-11-29 04:03:32.000-0600

I'm running SVN-branch-1.8-r295078 for some time and since upgrade I havn't noticed deadlock problem anymore.
I've processed approx 500k+ calls on this load

By: Kun Liu (knkcn) 2010-12-13 00:22:54.000-0600

The 1.8.1-rc1 runs well for me. I think it is fixed.

By: Leif Madsen (lmadsen) 2010-12-16 10:30:27.000-0600

Closed per the reporter.