Summary: | ASTERISK-16865: deadlock on 1.8.0-rc2 and crash on 1.8.0 with multipule sip channels | ||
Reporter: | Kun Liu (knkcn) | Labels: | |
Date Opened: | 2010-10-25 02:30:43 | Date Closed: | 2010-12-16 10:30:28.000-0600 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/Channels |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) backtrace.txt ( 1) crash-backtrace-2010-10-29.txt ( 2) deadlock_gdb.txt ( 3) newdeadlock-backtrace.log ( 4) newdeadlock-coreshowlocks.log ( 5) newdeadlock-message.log | |
Description: | Asterisk 1.8.0-rc2 as a call center server run well with 1 or 2 concurrent lines. We did about 50 concurrent lines lab test with sipp and it run well too. But under a more than 20 lines field trial, the Asterisk deadlocked. After deadlocked I found many rtp channels and udp5060 had a large number on Recv-Q. At the same time "core show channels" showed nothing and many cli commands hang random. I collected some information using "gdb /usr/sbin/asterisk <pid>" but I have not enough time to debug what locked it before restarted it. After that I upgraded the server to 1.8.0 with "DONT_OPTIMIZE" and "DEBUG_THREADS", then I did the test again. But this time it crashed when 10 phones were calling in. BTW, after the "DEBUG_THREADS" is enabled, every time the "bridge" application is executed the following error will be reported. [Oct 25 15:03:04] ERROR[25661] lock.c: features.c line 5551 (bridge_exec): mutex 'current_dest_chan' freed more times than we've loc ked! [Oct 25 15:03:04] ERROR[25661] lock.c: features.c line 5551 (bridge_exec): Error releasing mutex: Operation not permitted | ||
Comments: | By: Kun Liu (knkcn) 2010-10-25 05:12:08 Log Messeage when crashing: [Oct 25 14:01:44] ERROR[25148] lock.c: features.c line 5551 (bridge_exec): mutex 'current_dest_chan' freed more times than we've loc ked! [Oct 25 14:01:44] ERROR[25148] lock.c: features.c line 5551 (bridge_exec): Error releasing mutex: Operation not permitted [Oct 25 14:01:44] ERROR[25089] lock.c: chan_local.c line 385 (local_queue_frame): Error obtaining mutex: Invalid argument [Oct 25 14:01:44] ERROR[25089] lock.c: chan_local.c line 915 (local_hangup): mutex '&p->lock' freed more times than we've locked! [Oct 25 14:01:44] ERROR[25089] lock.c: chan_local.c line 915 (local_hangup): Error releasing mutex: Invalid argument By: Kun Liu (knkcn) 2010-10-25 07:37:27 (gdb) f 10 ASTERISK-6 0x00000000004673a2 in ast_cel_report_event (chan=0x2aaad000ff78, event_type=AST_CEL_CHANNEL_END, userdefevname=0x59db15 "", extra=0x59db15 "", peer2=0x0) at cel.c:574 574 if (ev && ast_event_queue(ev)) { (gdb) p chan->name $1 = (const ast_string_field) 0x2aaad0002666 "Bridge/SIP/inbound-0000002b<ZOMBIE>" By: SIP:ian@txrx.ca (weezey) 2010-10-25 12:17:36 I experienced the same problem this morning. Unfortunately it's a production box with less than tolerant users so debugging wasn't an option and I had to roll it back to 1.4 By: Kun Liu (knkcn) 2010-10-26 05:38:37 I can not roll it back to 1.4 or 1.6 because the cel manager events are used to generate some special logs. By: Kun Liu (knkcn) 2010-10-27 04:09:28 Asterisk 1.8.0 deadlock again - almost every time trial. By: Kun Liu (knkcn) 2010-10-27 15:10:47 Understand the last deadlock. Lock thread 2: handle_incoming locks SIP/outbound-00000309's pvt in add_header_max_forwards and trys to lock channels container. Lock thread 4: bridge_exec for SIP/outbound-0000309 and SIP/inbound-00000306 locks channel SIP/outbound-00000309 and trys to lock SIP/outbound-00000309's pvt. Lock thread 5: Channel SIP/inbound-00000311 hangup event locks channels container, but the cel_check_retire_linkdid function trys to lock channel SIP/outbound-0000309. Then it becomes deadlock. But the first deadlock and the crash log have not cel_check_retire_linkdid, all the issues seem have relationship with bridge. By: Kun Liu (knkcn) 2010-10-27 20:51:48 I think this is realated with https://reviewboard.asterisk.org/r/778/. By: Kun Liu (knkcn) 2010-10-28 04:55:30 static int add_header_max_forwards(struct sip_pvt *dialog, struct sip_request *req) { /* char clen[10]; const char *max = NULL; while (dialog->owner && ast_channel_trylock(dialog->owner)) { sip_pvt_unlock(dialog); usleep(1); sip_pvt_lock(dialog); } if (dialog->owner) { max = pbx_builtin_getvar_helper(dialog->owner, "SIP_MAX_FORWARDS"); ast_channel_unlock(dialog->owner); } if (max == NULL) { snprintf(clen, sizeof(clen), "%d", dialog->maxforwards); } return add_header(req, "Max-Forwards", max != NULL ? max : clen); */ return add_header(req, "Max-Forwards", "70"); } This works well under 4 times trial today. Should dialog be unlocked before return? By: Kun Liu (knkcn) 2010-10-30 07:08:07 Crash again when using sipp test. [Oct 29 14:24:04] ERROR[6217] lock.c: chan_local.c line 385 (local_queue_frame): Error obtaining mutex: Invalid argument [Oct 29 14:24:04] ERROR[6217] lock.c: chan_local.c line 915 (local_hangup): mutex '&p->lock' freed more times than we've locked! [Oct 29 14:24:04] ERROR[6217] lock.c: chan_local.c line 915 (local_hangup): Error releasing mutex: Invalid argument By: Kun Liu (knkcn) 2010-10-31 23:57:44 Crashed again yesterday caused by the same issue. Anybody can tell me how to find the crash reason? I can use simple commands of gdb and valgrind. By: Leif Madsen (lmadsen) 2010-11-02 09:20:51 knkcn: I understand you're having issues, but the issue tracker is a best effort system and you will not be able to receive immediate attention until the issue has been assigned to a developer. If you have an immediate need to have this resolved I'd suggest hiring a consultant if you are unable to resolve the issue. If you have some development skills you may be able to get some help on the asterisk-dev mailing list (or #asterisk-dev IRC channel) if you're having an issue with a particular section of the code. Beyond that, you can use the asterisk-users mailing list (or #asterisk IRC channel), or you'll have to be patient until this issue has been assigned to a developer. By: Russell Bryant (russell) 2010-11-11 11:25:19.000-0600 Please try the latest code from the 1.8 branch as I think this was just fixed. Alternatively, you can wait for Asterisk 1.8.1-rc1 and try that. By: Marcin Kowalczyk (kowalma) 2010-11-11 17:36:47.000-0600 Hi Russell, Any specific SVN release? By: Stefan Schmidt (schmidts) 2010-11-14 13:12:18.000-0600 higher than 294640 looks like a good idea to me, cause this also solves a deadlock situation with statechanges. but you just can try the latest svn rev. best regards stefan By: Marcin Kowalczyk (kowalma) 2010-11-29 04:03:32.000-0600 I'm running SVN-branch-1.8-r295078 for some time and since upgrade I havn't noticed deadlock problem anymore. I've processed approx 500k+ calls on this load By: Kun Liu (knkcn) 2010-12-13 00:22:54.000-0600 The 1.8.1-rc1 runs well for me. I think it is fixed. By: Leif Madsen (lmadsen) 2010-12-16 10:30:27.000-0600 Closed per the reporter. |