Summary: | ASTERISK-25404: segfault/crash in chan_pjsip_hangup ... at chan_pjsip.c | ||
Reporter: | Chet Stevens (cwstevens) | Labels: | |
Date Opened: | 2015-09-19 10:04:34 | Date Closed: | 2015-10-08 13:50:01 |
Priority: | Critical | Regression? | |
Status: | Closed/Complete | Components: | Channels/chan_pjsip |
Versions: | 13.1.0 13.6.0 | Frequency of Occurrence | Frequent |
Related Issues: | |||
Environment: | certified/13.1-cert3-rc1 (snapshot asterisk-20702e0 dated 17 Sep 2015. Also experienced with snapshot asterisk-5b06b6f dated 7 Sep 2015). PJSIP with PJProject 2.4.5 DAHDI 2.10.2 libpri version: 1.4.15 Digium Phone Module for Asterisk Version 13.0_2.2.0 Digium Phone firmware 2_0_1_0_74452 Ubuntu 14.04.2 LTS (GNU/Linux 3.16.0-43-generic x86_64) ProLiant DL380 Gen9. 16 GB memory Wildcard AEX2400: wctdm24xxp+ | Attachments: | ( 0) backtrace.txt ( 1) backtrace2.txt ( 2) backtrace3.txt ( 3) debug.zip ( 4) debug3.zip |
Description: | We are experiencing frequent crashes of Asterisk (6 times on 9/18/15). kern.log shows segfault with chan_pjsip.so:
{noformat} Sep 18 19:15:36 0651-Facilities-Audit-Campus kernel: [6164166.327465] asterisk[18107]: segfault at 0 ip 00007f00a7f3e7f4 sp 00007effaf26eb00 error 4 in chan_pjsip.so[7f00a7f37000+f000] {noformat} A backtrace and debug for the minute of the crash with previous 100k lines will be attached. | ||
Comments: | By: Asterisk Team (asteriskteam) 2015-09-19 10:04:35.604-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Chet Stevens (cwstevens) 2015-09-22 12:26:15.400-0500 I thought these problems were due to some router issues we were having but I was wrong. The router hardware has all been replaced but we have experienced another segfault crash in chan_pjsip.so. By: Rusty Newton (rnewton) 2015-09-24 18:36:47.873-0500 Would you be able to get non-optimized traces on this one? DONT_OPTIMIZE and BETTER_BACKTRACES like usual. Any ideas on reproduction? By: Chet Stevens (cwstevens) 2015-09-25 07:30:47.805-0500 Ok, I have the system running for backtraces with debug now. You might have a better idea but it looks like from the debug log that the last thing that occurred before the segfault was a call inbound via SIP trunk, rang a group of phones, that went to voicemail: {noformat} [Sep 18 19:15:11] VERBOSE[17799][C-00000001] pbx.c: Executing [7027995472@inbound:1] NoOp("PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000", "") in new stack [Sep 18 19:15:11] VERBOSE[17799][C-00000001] pbx.c: Executing [7027995472@inbound:2] Set("PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000", "CDR(userfield)=7027995472") in new stack [Sep 18 19:15:11] VERBOSE[17799][C-00000001] pbx.c: Executing [7027995472@inbound:3] Goto("PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000", "inbound_extensions,7027995472,1") in new stack [Sep 18 19:15:11] VERBOSE[17799][C-00000001] pbx.c: Goto (inbound_extensions,7027995472,1) [Sep 18 19:15:11] VERBOSE[17799][C-00000001] pbx.c: Executing [7027995472@inbound_extensions:1] NoOp("PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000", "") in new stack [Sep 18 19:15:11] VERBOSE[17799][C-00000001] pbx.c: Executing [7027995472@inbound_extensions:2] Set("PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000", "CALLERID(name)=702-799-5472") in new stack [Sep 18 19:15:11] VERBOSE[17799][C-00000001] pbx.c: Executing [7027995472@inbound_extensions:3] Dial("PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000", "PJSIP/5749_line&PJSIP/5750_line&PJSIP/5751_line&PJSIP/5752_line&PJSIP/5753_line&PJSIP/5755_line&PJSIP/5756_line&PJSIP/5761_line&PJSIP/5775_line&PJSIP/5777_line&PJSIP/5782_line&PJSIP/5785_line,20,i") in new stack [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5749_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5750_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5751_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5753_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5755_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5756_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5761_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5775_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5777_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5782_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: Called PJSIP/5785_line [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5751_line-00000003 connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5756_line-00000006 connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5753_line-00000004 connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5750_line-00000002 connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5749_line-00000001 connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5775_line-00000008 connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5755_line-00000005 connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5761_line-00000007 connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5782_line-0000000a connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5777_line-00000009 connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5785_line-0000000b connected line has changed. Saving it until answer for PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5753_line-00000004 is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5756_line-00000006 is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5761_line-00000007 is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5749_line-00000001 is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5750_line-00000002 is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5777_line-00000009 is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5751_line-00000003 is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5782_line-0000000a is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5755_line-00000005 is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5785_line-0000000b is ringing [Sep 18 19:15:11] VERBOSE[17799][C-00000001] app_dial.c: PJSIP/5775_line-00000008 is ringing [Sep 18 19:15:31] VERBOSE[17799][C-00000001] app_dial.c: Nobody picked up in 20000 ms [Sep 18 19:15:31] VERBOSE[17799][C-00000001] pbx.c: Executing [7027995472@inbound_extensions:4] VoiceMail("PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000", "5750@default,u") in new stack [Sep 18 19:15:31] VERBOSE[17799][C-00000001] res_rtp_asterisk.c: 0x7f01342579a0 -- Probation passed - setting RTP source address to 10.250.82.22:10010 [Sep 18 19:15:31] VERBOSE[17799][C-00000001] file.c: <PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000> Playing '/var/spool/asterisk/voicemail/default/5750/unavail.slin' (language 'en') [Sep 18 19:15:31] VERBOSE[18107][C-00000001] pbx.c: Executing [7027995472@inbound_extensions:4] VoiceMail("PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000", "5750@default,u") in new stack [Sep 18 19:15:31] VERBOSE[18107][C-00000001] file.c: <PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000> Playing '/var/spool/asterisk/voicemail/default/5750/unavail.slin' (language 'en') [Sep 18 19:15:34] VERBOSE[17799][C-00000001] res_rtp_asterisk.c: 0x7f01342579a0 -- Probation passed - setting RTP source address to 10.250.82.22:10010 [Sep 18 19:15:36] VERBOSE[17799][C-00000001] pbx.c: Spawn extension (inbound_extensions, 7027995472, 4) exited non-zero on 'PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000' [Sep 18 19:15:36] VERBOSE[18107][C-00000001] pbx.c: Spawn extension (inbound_extensions, 7027995472, 4) exited non-zero on 'PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000' {noformat} Could it be something with how the caller disconnected while in voicemail I wonder? By: Chet Stevens (cwstevens) 2015-09-26 08:48:02.584-0500 I am not sure if this will have anything useful because I am not sure what state the system was in at the time but I am attaching another backtrace from 9/21. {noformat} Sep 21 05:38:01 0651-Facilities-Audit-Campus kernel: [69266.348517] asterisk[29765]: segfault at 8 ip 00007f5fb9d8fa57 sp 00007f5eb9486cb0 error 4 in chan_pjsip.so[7f5fb9d88000+f000] {noformat} By: Chet Stevens (cwstevens) 2015-09-26 08:51:52.322-0500 So far I have not been able to reproduce this in the lab using the same dialplan, versions, etc of the system. By: Chet Stevens (cwstevens) 2015-09-28 15:13:33.321-0500 New backtrace attached. For reference: {noformat} Sep 28 12:54:18 0651-Facilities-Audit-Campus kernel: [629693.178116] asterisk[31615]: segfault at 8 ip 00007feeee8e4a57 sp 00007feee1a24cb0 error 4 in chan_pjsip.so[7feeee8dd000+f000] {noformat} By: Mark Michelson (mmichelson) 2015-09-30 11:41:07.909-0500 It looks like somehow, the {{PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000}} channel has two threads that think they are the controllers of that channel. Looking in debug.txt, you can see the following as a warning indicator that things are about to go south: {noformat} [Sep 18 19:15:31] WARNING[18107] pbx.c: PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000 already has PBX structure?? {noformat} It looks like the problem is that right after the call is connected, both Asterisk and the calling party attempt to send reinvites to each other. The reinvites result in both parties sending 491 responses to each other. Shortly after, the calling party sends another INVITE to Asterisk. Now, when this arrives, we are not properly detecting this reinvite as a reinvite inside of chan_pjsip. Instead, we think it's a new INVITE because the inv_session's state is "CONNECTING" instead of "CONFIRMED". The result is that a second thread is spawned in order to route the call. From this point on, you start seeing messages like the following occasionally: {noformat} [Sep 18 19:15:31] DEBUG[18107][C-00000001] channel.c: Thread 0x7effaf26f700 is blocking 'PJSIP/0110-Instr-Dsgn-Prof-Lrng-Ctr-00000000', already blocked by thread 0x7f01589c3700 in procedure ast_waitfor_nandfds {noformat} Once one of the controlling threads hangs up the channel, then the other attempts to do the same, and unfortunately structures have already been destroyed at that point. It appears that the main problem here is our method of detecting that an incoming INVITE request is a reinvite. We try to use the session state, but apparently the session state can be re-set from CONFIRMED back to CONNECTING when dealing with certain reinvite situations. If we fix the way that we detect reinvites, then we will not have conflicting threads, and we won't have crashes. By: Richard Mudgett (rmudgett) 2015-10-06 16:15:36.728-0500 Patches up on gerrit: https://gerrit.asterisk.org/#/c/1393/ v13.1-cert https://gerrit.asterisk.org/#/c/1394/ v13 https://gerrit.asterisk.org/#/c/1395/ master https://gerrit.asterisk.org/#/c/1398/ testsuite test to check the patch |