ASTERISK-20929: Core-dump on SIP BYE for an invalid call transaction

[Home]

Summary: ASTERISK-20929: Core-dump on SIP BYE for an invalid call transaction

Reporter: Kristopher Lalletti (kris2k) Labels:

Date Opened: 2013-01-12 11:45:54.000-0600 Date Closed: 2013-02-15 12:47:58.000-0600

Priority: Major Regression? Yes

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: 11.1.0 11.1.1 11.1.2 Frequency of
Occurrence Constant

Related
Issues:

Environment: Linux/2.6.32-279.19.1.el6.i686 #1 SMP Wed Dec 19 04:30:58 UTC 2012 i686 i686 i386 GNU/Linux Attachments: ( 0) ast.core.31716.txt
( 1) ast.log.full.txt
( 2) ast.log.mmlog.txt
( 3) ASTERISK-20929.patch
( 4) core-20130111-01.txt
( 5) sip.proto.pcap.txt

Description: We've been noticing core-dumps on Asterisk since we migrated from 1.8.x to 11.1.x late last December, which caused us to rollback to 1.8.19.1 in order to return into a stable state.

We noticed a very specific (mis)behaviour of our soft-switch to which when it signaled a SIP BYE message to ASterisk with an invalid Call-ID, Asterisk would systematically segfault.

When we backtracked to 1.8.19.1, at least Asterisk will return a "481 Call leg/transaction does not exist" and not die.

Included is the core-dump backtrace showing the SIP BYE transaction causing Asterisk 11.1.1 to segfault.

PS: I also have a system running 11.1.2 which is equally having the same symptoms, but I have yet to capture a tangible backtrace.

Comments: By: Kristopher Lalletti (kris2k) 2013-01-12 11:51:17.674-0600

Output of the following:
gdb -se "asterisk" -ex "bt full" -ex "thread apply all bt" --batch -c /tmp/core > /tmp/output.txt

File: core-20130111-01.txt
By: Matt Jordan (mjordan) 2013-01-13 16:12:33.835-0600

So this is a bit odd. From the backtrace, it appears as if {{sip_queue_hangup_cause}} has an owning channel that has already been reclaimed - the fact that name appears to have garbage in it when it strdup'd the channel name onto the stack if a fairly bad sign.

Could you include a pcap of the SIP message traffic that reproduces this? The fact that it isn't crashing in 1.8 is indicative of something rather subtle in the channel reference counting, and a pcap will help us try to reproduce the exact sequence of events that leads up to this.
By: Kristopher Lalletti (kris2k) 2013-01-13 20:37:49.690-0600

Included is a retake on the bug on my lab environment with Asterisk 11.1.2

The SIP BYE message that segfaults Asterisk comes-in after Jan 13 21:14:48 from 192.x.x.4. The result of this message is from human intervention (on my behalf) when I observe that there is a residual channel left-open following the call that was previously hung-up by both endpoints.

(agreed, there is an interop issue, but this should not be the focus of this bug)

Let me know if you stand on having the binary PCAP of the dump since I converted to ASCII format.

Files:
- ast.core.31716.txt
- ast.log.full.txt
- ast.log.mmlog.txt
- sip.proto.pcap.txt
By: Mark Michelson (mmichelson) 2013-01-24 16:38:36.477-0600

The operation that is causing problems is the transfer using the INVITE with replaces. For brevity's sake, we'll refer to the dialogs here as dialog A and dialog B. In actuality, Dialog A has call-ID MTE3NDcwNzFjYjA1MDcxODQzMGQ2MmEyMDE1MDVhNDI. and Dialog B has call-ID NmM0MjQwOGJlOGFjZjY5MjJhZDc0N2NkZDY2N2QxM2Y.

Dialog A is an incoming call to Asterisk that gets set up with no issues. This includes reinvites to get direct media between Dialog A's endpoint and whoever he is connected to.

Dialog B sends an INVITE with Replaces in order to take the place of Dialog A in the conversation he's having. This causes Asterisk to set up a channel masquerade to place Dialog B's channel information into Dialog A's channel.

Later a BYE arrives on Dialog A and this causes the crash to occur, because it appears as though the owner channel on the dialog has been killed. Presumably, we're accessing freed memory here. You can see in the log the line:

{noformat}
[Jan 13 21:14:13] DEBUG[31759][C-00000000] chan_sip.c: SIP Fixup: New owner for dialogue MTE3NDcwNzFjYjA1MDcxODQzMGQ2MmEyMDE1MDVhNDI.: SIP/pre-3cx-tst-00000000<ZOMBIE> (Old parent: SIP/pre-3cx-tst-00000002)
{noformat}

This means that Dialog A is now pointing at a zombie channel that is about to be freed. The problem though is that the dialog is still alive after this point, so the BYE isn't rejected with a 481. Instead, the BYE is accepted, the channel is accessed, and that causes a meltdown.

I'm going to use your pcap to build a couple of SIPp scenarios that should allow for me to reproduce the scenario. I'll run with some debugging tools and see if I can figure out what the problem is.
By: Mark Michelson (mmichelson) 2013-01-24 17:35:35.224-0600

I can reliably reproduce this issue. Just FYI. Working on understanding it a bit better so I can write a patch for it.
By: Mark Michelson (mmichelson) 2013-01-24 18:34:25.192-0600

I've added ASTERISK-20929.patch. This, in my tests, solves the issue as presented. There is no longer a crash, but you'll need to let me know if the traffic is as you would expect it to be.
By: Kristopher Lalletti (kris2k) 2013-01-24 23:13:26.877-0600

I just ran a comparative scenario between before/after the patch, everything seems to be behaving like 1.8.x

Given the simplicity of the patch, I'll add it to my remaining 11.1.x system and see how it fairs-out after 1-2 weeks of runtime and follow-up if needed.
By: Mark Michelson (mmichelson) 2013-01-28 08:46:45.438-0600

Okay that's fine. Just be sure to let us know how it runs.
By: Kristopher Lalletti (kris2k) 2013-01-31 18:12:11.907-0600

After 1 week; the patch is fairing-out well. No core-dumps, or anomalies found.
By: Kristopher Lalletti (kris2k) 2013-04-04 19:35:59.085-0500

FYI, I noticed some random crashes which potentially pertain to the same call flow pattern, but they were so random and so barely recurring that I haven't given priority to this.

I noticed in 11.3 that there appears to be some significant robustness around channel handling (hopefully breaking the pattern of assuming calls legs exist, when they are are gone, etc..). So I guess this issue might go-away by these robustness enhancements.