Summary:ASTERISK-21737: [patch] - Crash during transfer from DAHDI/SIP to SIP/SIP in ast_format_cap_append called from remote bridge loop
Reporter:Alec Davis (alecdavis)Labels:patch
Date Opened:2013-04-29 22:56:15Date Closed:
Status:Open/NewComponents:Channels/chan_sip/Transfers Core/Bridging
Versions:Frequency of
One Time
duplicatesASTERISK-23411 Asterisk segfault sip_get_codec
is related toASTERISK-25073 Asterisk crash by ast_format_cap_append
Environment:Asterisk SVN-branch-11-r385594 Attachments:( 0) ASTERISK-21737-1.8.diff
( 1) ASTERISK-21737-11.diff
( 2) bug_apr30.diff.txt
( 3) GDB_output_sample_10_01_2014.txt
( 4) gdb.txt
( 5) gdb-apr30.txt
( 6) xivo_workaround_remote_bridge_crash
Description:DAHDI call into queue.
SIP agent answers, then transfer to SIP0007.

Crashed just after/during a transfer.
Comments:By: Alec Davis (alecdavis) 2013-04-29 22:58:04.900-0500

Attached bug_apr30.diff.txt that should avoid segfault.

But doesn't get to the root cause of the issue.

By: Matt Jordan (mjordan) 2013-05-01 21:58:09.285-0500

Wow. There's a number of interesting things going on here.
* The two SIP channels are entering into a remote bridge loop and this blows out while we're looking at the glue callbacks to determine if they're compatible
* Somehow, the following line resulted in a NULL pointer:
ast_format_cap_append(result, ast_format_cap_is_empty(p->peercaps) ? p->caps : p->peercaps);
The freaky part there is that the SIP pvt is always supposed to have something in its {{caps}} pointer - if we don't allocate something in it, we fail creation of the {{sip_pvt}}. I'm not sure how that ended up with NULL.

By: Alec Davis (alecdavis) 2013-06-17 16:54:30.465-0500

We've been running the attached bug_apr30.diff.txt patch for 6 weeks with a slight mod where {{sip_get_codec()}} emits a warning 'Avoided segfault'

+       } else {
+               ast_log(LOG_ERROR, "ALEC Avoided segfault\n");

But never was the 'Avoided segfault' path taken????

By: Etienne Lessard (hexanol) 2013-10-31 12:49:18.855-0500

I've met the same problem between chan_sip and a SCCP channel driver (not chan_skinny, but external to the asterisk project) on asterisk 11.6.0 (and earlier version).

The scenario is the following:

Given a SIP channel SIP/A-1 is bridged (remote bridge, i.e. the pbx thread is running remote_bridge_loop) with a SCCP channel SCCP/B-1
And the SIP channel SIP/A-1 (clone) is masqueraded into another SCCP channel SCCP/B-2 (original)
Then asterisk sometimes segfault in sip_get_codec

It segfaults because sip_get_codec is called with SCCP/B-2<ZOMBIE> channel, and this channel pvt is NOT a sip_pvt *, so this lead to garbage.

As far as my asterisk knowledge goes, it seems to me that there's a race condition between the first and second pbx threads; the remote bridge happens in the first thread, the masquerade (ast_do_masquerade) happens in the second, and if you are not lucky, the masquerade happens between the "/* check if anything changed */" and call to "glueX->get_codec".

Or something like that.

This bug is similar in a certain way to ASTERISK-17378

By: Nuno Borges (nerbos) 2014-01-10 06:17:53.817-0600


Etienne Lessard assumption is correct, i have attached a GDB sample for the same scenario in an asterisk I wonder what is the best approach. In my opinion if the tech_pvt is NULL at this moment we should exit with a AST_BRIDGE_RETRY in remote_bridge_loop, instead of only avoiding this condition.

Best regards.

By: Matt Jordan (mjordan) 2014-02-22 11:08:06.993-0600

I've attached two patches that I believe should resolve the issue - one for 1.8 and one for 11.

The problem here is that the channels are unlocked when get_codec is called. Because they are unlocked, it is possible for a masquerade (from a transfer) to sneak in and blow away the PVT between the check that it is non-NULL and the actual call to get_codec.

I've put the patch up for review here: https://reviewboard.asterisk.org/r/3247/

By: Alec Davis (alecdavis) 2014-02-24 23:36:59.737-0600

now testing the asterisk-11 patch.

Since I reported this, I've had the test in place and been debug logging since (yuk), where it would emit a warning in sip_get_codec() if either was null, but never has one since.

By: Matt Jordan (mjordan) 2014-02-25 06:37:28.924-0600

Usually, the only time I've seen this happening is on a really slow build agent. Even then, I'd say it only happens very rarely - maybe one out of every 100 or so runs of tests that do a REFER transfer. You may never see it again :-)

By: Matt Jordan (mjordan) 2014-02-27 06:33:38.698-0600

Hey Alec - I'm going to go ahead and commit the patch, since I'm fairly confident it won't cause any harm at the least. The locks are in a location where the actions taken by the functions they wrap won't cause a deadlock, as those calls don't lock anything else or invert any locking order. I'll close this issue for now as a result - but if you run into the same crash, we can always reopen this.

By: Etienne Lessard (hexanol) 2014-02-27 14:29:55.054-0600

Sorry for the late reply, but just to let you know that I don't think it this solves my issue (but I must say, I kind of hijacked the original thread).

In my problem, the tech_pvt of the zombie channel is non-NULL. The problem is that glue1 expect a channel with a "SCCP" tech_pvt, but it's instead getting a (zombie) channel with a "SIP" tech_pvt.

I guess I should open another ticket ?

By: Alec Davis (alecdavis) 2014-03-05 05:15:40.618-0600

crash again, after applying patch 'ASTERISK-21737-11.diff', and removing my original patches.

[2014-03-05 14:25:32.600013] DEBUG[32457][C-000013b1] chan_sip.c: SIP transfer: Four channels to handle
[2014-03-05 14:25:32.600061] VERBOSE[32457][C-000013b1] res_musiconhold.c:     -- Stopped music on hold on DAHDI/i1/272763329-90c
[2014-03-05 14:25:32.600157] DEBUG[32457][C-000013b1] channel.c: Set channel DAHDI/i1/272763329-90c to write format alaw
[2014-03-05 14:25:32.600167] DEBUG[29189][C-000013b4] res_rtp_asterisk.c: Remote address is null, most likely RTP has been stopped
[2014-03-05 14:25:32.600195] DEBUG[32457][C-000013b1] channel.c: Scheduling timer at (0 requested / 0 actual) timer ticks per second
[2014-03-05 14:25:32.600273] VERBOSE[32457][C-000013b1] res_musiconhold.c:     -- Stopped music on hold on SIP/GXP0028-000015e9
[2014-03-05 14:25:32.600352] DEBUG[32457][C-000013b1] channel.c: Set channel SIP/GXP0028-000015e9 to write format alaw
[2014-03-05 14:25:32.600382] DEBUG[32457][C-000013b1] channel.c: Scheduling timer at (0 requested / 0 actual) timer ticks per second
[2014-03-05 14:25:32.600425] DEBUG[32457][C-000013b1] channel.c: Planning to masquerade channel DAHDI/i1/272763329-90c into the structure of SIP/GXP0013-000015e8
[2014-03-05 14:25:32.600491] DEBUG[32457][C-000013b1] channel.c: Done planning to masquerade channel DAHDI/i1/272763329-90c into the structure of SIP/GXP0013-000015e8
[2014-03-05 14:25:32.600605] DEBUG[32457][C-000013b1] chan_sip.c: Trying to put 'NOTIFY sip:' onto UDP socket destined for
[2014-03-05 14:25:32.601025] DEBUG[32457][C-000013b1] res_config_mysql.c: MySQL RealTime: Connection okay.
[2014-03-05 14:25:32.601189] DEBUG[32457][C-000013b1] res_config_mysql.c: MySQL RealTime: Insert SQL: INSERT INTO queue_log (`time`, `callid`, `queuename`, `agent`, `event`, `data1`, `data2`, `data3`, `data4`, `data5`) VALUES ('2014-03-05 14:25:32.600896', '1393982652.9197', 'cs_queue1', 'SIP/GXP0013', 'TRANSFER', 's', 'macro-procexten', '5', '46', '1')
[2014-03-05 14:25:32.602483] DEBUG[32457][C-000013b1] res_config_mysql.c: MySQL RealTime: row inserted on table: queue_log, id: 7186371
[2014-03-05 14:25:32.602988] DEBUG[32457][C-000013b1] channel.c: Putting channel DAHDI/i1/272763329-90c in alaw/alaw formats
<b>[2014-03-05 14:29:48.691160] Asterisk SVN-branch-11-r385594M built by root @ astrid on a i686 running Linux on 2013-04-17 23:19:24 UTC</b>

By: Alec Davis (alecdavis) 2014-03-05 05:20:37.621-0600

backtrace attached gdb.txt

Call came in from DAHDI into a queue and handled by agent SIP/GXP0028
who then transferred to SIP/GXP0013

By: Etienne Lessard (hexanol) 2014-03-31 12:03:09.226-0500

I've made a (specific) patch for my problem, I'm attaching it on this ticket although it's (probably) too hackish to be merged in the asterisk code base.

More precisely, it locks the channel (if it is using a "sip" or "sccp" rtp glue) and test that is not a zombie before calling functions related to the glue. This works fine as long as the channel driver implementing the glue callbacks can work with the channel locked (i.e. that it doesn't create new deadlock conditions).

The patch has been put in production 2 weeks ago and there has been no crash since then (and no deadlock neither) which is an improvement in my case.

By: Etienne Lessard (hexanol) 2014-04-10 12:25:16.138-0500

I've attached a newer version of the patch that check that the pvt has not changed instead of checking that the channel is not a zombie. Although I did not understand why, we had a crash recently with a non-zombie channel.