Summary:ASTERISK-24471: Crash - assert_fail in libc in pjmedia_sdp_neg_negotiate from /usr/local/lib/libpjmedia.so.2
Reporter:yaron nahum (yaronna)
Labels:Security
Date Opened:2014-10-30 01:54:26
Date Closed:2014-11-20 09:47:31.000-0600
Status:Closed/Complete
Components:pjproject/pjsip Resources/res_pjsip
Versions:12.2.0 12.6.1 Frequency of
is duplicated by ASTERISK-24493 Seg fault in pjsip_inv_send_msg from /usr/local/lib/libpjsip-ua.so.2
Environment:centos 6
Attachments:( 0) backtrace_10_11_18_19.txt
( 1) backtrace_6_11_12_23.txt
( 2) backtrace.txt
( 3) config.log.lab
( 4) config.log.live
( 5) pjsip-disconnected.diff
( 6) yaron.debug.1
Description:We have had several crashes in recent days on servers running 12.2.0rc1. We upgraded one of the servers to 12.6.1 and it also crashed. The crashes occur every couple of days per server.
Our service uses AGIs for external logic. It also uses spandsp for receiving and sending faxes.

*** Yaron -
Going over the debug messages, I can see that the last call before the reboot was an incoming call that was canceled within the same second as the initial INVITE. In between, we activated a certain AGI, and after we got the CANCEL we jumped to the h extension, where we activated another AGI.
We are trying to simulate this scenario with SIPP to see if the reboot occurs.
One more thing I forgot to mention is that we have a lot of Asterisk-related zombie processes running all the time. They might be related to the reboot.
Comments:By: yaron nahum (yaronna) 2014-10-30 01:59:18.951-0500

This is a backtrace done on the core file.

By: yaron nahum (yaronna) 2014-10-30 04:27:48.119-0500

This is debug taken during one of the crashes.

By: yaron nahum (yaronna) 2014-11-05 09:59:51.991-0600

While investigating this issue and comparing all the debug and core files taken during the crashes, I found one scenario that seems to repeat:
- In all the core files the following routine was involved:
#9  0x00007f5205293741 in answer (data=0x7f52118e1708) at chan_pjsip.c:454
       status = <value optimized out>
       packet = 0x7f5210fa9fe8
       session = 0x7f52118e1708
       __PRETTY_FUNCTION__ = "answer"
- In all the debug output taken just before a crash happened, the following message appeared:
Hangup of channel PJSIP/KAMnet_CBS-0000b5d3 detected in answer routine
- Only before the crash was this message generated after a CANCEL message was received (in the other cases a BYE message was received).

It seems to me that there is a bug related to the case where a call is being answered (Answer function in the dialplan), and a CANCEL is received while the answer is being handled (before the 200 OK is sent).
I haven't been able to reproduce this issue.

Hope that someone would look at it soon.

By: Rusty Newton (rnewton) 2014-11-05 17:12:12.380-0600

Linking to ASTERISK-24340 as this issue may duplicate it. Traces look very similar (to my non-expert eyes).

By: Rusty Newton (rnewton) 2014-11-05 17:17:17.380-0600

Yaron - I want to verify - are you able to reproduce this issue?

If not, but you get the crash frequently enough that you expect it to happen again, can you go ahead and recompile pjproject with debug symbols so that we will have that information for your next backtrace?

To investigate further we need you to provide another backtrace, but with debug symbols for pjproject.

When recompiling pjproject, you'll need to specify the CFLAGS param -g in addition to any other configure options or parameters.

./configure CFLAGS='-g'

Be sure to press Enter Feedback or Send Back when replying.

By: yaron nahum (yaronna) 2014-11-06 04:34:51.498-0600

This is a backtrace taken from the lab system.

By: yaron nahum (yaronna) 2014-11-06 04:40:41.609-0600

Hi Rusty,
I have recompiled both PJPROJECT and ASTERISK, with -g for PJPROJECT, and in menuselect I added DONT_OPTIMIZE.
I did it on one of my live servers and on the lab server.
Meanwhile I prepared a SIPP script that just sends an INVITE to my lab system and disconnects the call with CANCEL just before the 200OK should be sent - in order to simulate the scenario I suspect that causes the issue.
It took a while but then the ASTERISK crashed. I have attached the core file.
Meanwhile I am waiting for my live system to crash also - it might take a day or two.
Hope you find the issue.

By: Rusty Newton (rnewton) 2014-11-07 14:26:14.153-0600

The trace you posted appears to be from a different crash. You may want to file a separate issue for it. If it is the same crash, then you may have multiple pjproject installs on the same machine, which is something to watch out for.

There are still no pjproject symbols with your trace. Can you attach your config.log from pjproject? That will show what options you ran configure with.

Additionally - what version of pjproject did you use, and can you show the modified dates of the library files to verify you are using the most recently installed ones?

By: yaron nahum (yaronna) 2014-11-07 23:45:52.023-0600

Hello Rusty,
I might have 2 versions of PJPROJECT. Initially I used version 2.1.0, and after I found the issue I upgraded my lab and one of the live servers to 2.2.1. How do I know if both of them are installed? Which one does Asterisk use?
I will send you the output you requested on Sunday.

By: yaron nahum (yaronna) 2014-11-09 01:56:09.521-0600

These are the config logs of both my lab and my live servers.
I have reinstalled my lab server and removed old pjproject files.
I haven't touched the live server. I would first like to find out whether I have 2 pjprojects installed there. How can I check that?
Meanwhile, the live server hasn't rebooted since I added the compiler flags.
I will try to run the test again on my lab server and if I get another crash I will upload it.

By: yaron nahum (yaronna) 2014-11-10 11:23:37.111-0600

This is a backtrace taken from the live server running PJPROJECT 2.2.1  and ASTERISK 12.6.1.
I have added the -g compilation flag on PJPROJECT and the DONT_OPTIMIZE and BETTER_BACKTRACE compilation flags on ASTERISK.
I have all the information required.

By: yaron nahum (yaronna) 2014-11-10 11:27:54.171-0600

Hi Rusty,
I have attached a backtrace file from the live server that just crashed. I have compiled the applications with the necessary flags, and I hope everything is in place. Please look into it.
As I have already mentioned, the problem seems to be related to a CANCEL being received during the ANSWER routine. I haven't got a debug log this time, but I am quite sure that the scenario is similar.
Hope you find the problem.

By: yaron nahum (yaronna) 2014-11-11 08:30:19.277-0600

Hi Rusty,
We had 2 more crashes on 2 different servers with similar core file.
Going over the code it seems to me that the problem is with line 3158 in sip_transaction.c:
       PJ_ASSERT_RETURN(event->type == PJSIP_EVENT_TX_MSG &&
                        event->body.tx_msg.tdata == tsx->last_tx,

The expression is probably false. The first part of it is fine: in the core file you can see that 'event->type == PJSIP_EVENT_TX_MSG'.
However, I am not sure about the second part. Maybe 'tsx->last_tx' is no longer valid because the transaction has completed, or maybe the whole expression is incorrect.
It seems very simple to fix. Could you please look into it?
Thank you.
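
To make the reasoning above concrete, here is a minimal sketch of how an assert-and-return style check can end in an abort (the assert_fail in libc seen in the summary) when its precondition is false. It is an illustration under assumptions, not pjproject's code: SKETCH_ASSERT_RETURN, sketch_on_tx, the struct names, and the sketch_abort_on_failure switch are all hypothetical, and the real macro's behaviour depends on how pjproject was built.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical switch: 1 mimics a build where a failed check aborts. */
static int sketch_abort_on_failure = 0;
static int sketch_failures = 0;

void sketch_report(const char *expr, const char *file, int line)
{
    sketch_failures++;
    fprintf(stderr, "%s:%d: check failed: %s\n", file, line, expr);
    if (sketch_abort_on_failure) {
        /* This path is the one that would leave a core file behind. */
        assert(!"sketch check failed");
        abort(); /* reached only when assertions are compiled out */
    }
}

/* Hypothetical stand-in for an assert-and-return macro. */
#define SKETCH_ASSERT_RETURN(expr, retval)                \
    do {                                                  \
        if (!(expr)) {                                    \
            sketch_report(#expr, __FILE__, __LINE__);     \
            return (retval);                              \
        }                                                 \
    } while (0)

enum sketch_event_type { SKETCH_EVENT_TX_MSG, SKETCH_EVENT_RX_MSG };

struct sketch_event {
    enum sketch_event_type type;
    const void *tdata;      /* message the event refers to */
};

struct sketch_tsx {
    const void *last_tx;    /* last message the transaction sent */
};

/* Returns 0 when the event matches the transaction, -1 otherwise. */
int sketch_on_tx(const struct sketch_tsx *tsx, const struct sketch_event *event)
{
    SKETCH_ASSERT_RETURN(event->type == SKETCH_EVENT_TX_MSG &&
                         event->tdata == tsx->last_tx, -1);
    return 0;
}
```

With sketch_abort_on_failure left at 0, an event whose tdata differs from tsx->last_tx only makes sketch_on_tx return -1. Setting it to 1 turns the same mismatch into a process abort, which is the kind of failure the backtraces show.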

By: Rusty Newton (rnewton) 2014-11-13 12:49:50.600-0600

Yaron, thanks for all the data. Joshua Colp has looked over the traces and feels we have enough information to investigate the issue. I've opened it up so that a developer can take it on when they are available.

By: Joshua C. Colp (jcolp) 2014-11-17 13:36:16.255-0600

Please try the attached patch. This adds additional checks so we will no longer attempt to do certain things (such as answer) when the session has already been disconnected.
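
The kind of check described here can be sketched roughly as follows. This is not the attached patch: the session model, state names, and functions (sketch_session, sketch_on_cancel, sketch_answer) are hypothetical and only illustrate refusing to answer once a CANCEL has already disconnected the session.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified call states; not the pjproject state machine. */
enum sketch_inv_state {
    SKETCH_INV_INCOMING,      /* INVITE received, not yet answered */
    SKETCH_INV_ANSWERED,      /* 200 OK queued */
    SKETCH_INV_DISCONNECTED   /* CANCEL or BYE already processed */
};

struct sketch_session {
    enum sketch_inv_state state;
    int responses_sent;       /* counts queued 200 OK responses */
};

/* A CANCEL arriving before the answer tears the session down. */
void sketch_on_cancel(struct sketch_session *s)
{
    assert(s != NULL);
    s->state = SKETCH_INV_DISCONNECTED;
}

/* Returns 0 when a 200 OK was queued, -1 when the answer was skipped. */
int sketch_answer(struct sketch_session *s)
{
    assert(s != NULL);
    if (s->state == SKETCH_INV_DISCONNECTED) {
        /* The added check: do nothing on a session that is already gone. */
        fprintf(stderr, "session already disconnected, not answering\n");
        return -1;
    }
    s->responses_sent++;
    s->state = SKETCH_INV_ANSWERED;
    return 0;
}
```

In this sketch, calling sketch_on_cancel before sketch_answer makes the answer return -1 without queuing a response, while the normal order still answers the call.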