[Home]

Summary:ASTERISK-11537: Segmentation fault in chan_sip.c
Reporter:Egbert Groot (wegbert)Labels:
Date Opened:2008-02-28 09:50:17.000-0600Date Closed:2008-03-14 11:50:30
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 12098.patch
( 1) 12098-ok.patch
( 2) 12098-putnopvut.patch
( 3) 12098-putnopvutv2.patch
( 4) 12098-putnopvutv3.patch
( 5) bt_full.txt
( 6) gdb.txt
( 7) gdb2.txt
( 8) info.txt
( 9) locks.txt
(10) log.txt
(11) thread_apply_all_bt.txt
(12) valgrind.txt
(13) valgrind2.txt
Description:Using svn version from yesterday (svn download reported revision 104534, 'show version on cli reports 'SVN-branch-1.4-r104334'), asterisk crashed with an segmentation fault:

Program terminated with signal 11, Segmentation fault.
0 0xb7509bfe in __ast_pthread_mutex_lock (filename=0xb755bdf4 "chan_sip.c", lineno=1900, func=0xb755c318 "retrans_pkt",
   mutex_name=0xb755c324 "&pkt->owner->lock", t=0x0) at /usr/src/asterisk-1.4/include/asterisk/lock.h:302
302 int canlog = strcmp(filename, "logger.c") & t->track;
Comments:By: Joel Vandal (jvandal) 2008-03-10 14:16:56

I have exact same problem using Asterisk 1.4.18 and SVN branches/1.4.

I uploaded a gdb.txt file that include all details (bt, thread all, etc)

Test with "pbx-test-0.1.0" to generate calls to the server, in fact to reproduce the problem I dial 3-4 times to a queue that have about 50 agents (SIP).

By: Joel Vandal (jvandal) 2008-03-10 15:19:15

I will made more intensive test during next hours but for now I suspect that a patch manually applied to Asterisk on my server cause this error.

I've test with a clean SVN branches/1.4 version and all worked, no crash for now but I let you know ASAP about progress.

By: Joel Vandal (jvandal) 2008-03-10 15:36:52

Ok, same problem with unpatched Asterisk, but this time I the new backtrace (gdb2.txt) must contain more detailled informations.

By: Joel Vandal (jvandal) 2008-03-10 19:26:39

Please ignore the 12098.patch, have made a typo on the patch.

If you want to test, please use 12098-ok.patch

The problem, at my understanding, is that the t->track and t->mutex are unset for an unknown reason. My fix is to check if the variable is defined before do any other action.

This fix sound simple but I have very low C knowledge, so I hope I'm on the good way =)



By: Joel Vandal (jvandal) 2008-03-10 20:28:58

Hmmm.... the patch is imcomplete and do nothing .

The problem is probably elsewhere and I dont know how to fix it :(

By: Joel Vandal (jvandal) 2008-03-11 10:23:09

My server crash each 30-45 minutes and always same backtrace. The server have lots of extensions (about 1000 SIP friends) and have test script that generate call to the server 24/7 (thanks to pbx-test).


If you need access to the server, please let me know, I'm open to test any patches, etc....

Does it can be a good thing to run it using valgrind ?

By: Joel Vandal (jvandal) 2008-03-11 16:33:37

I have upload the 'valgrind.txt' output...

valgrind --leak-check=full --log-file=valgrind.txt asterisk -vvvvcg -U scopserv -G scopserv 2>malloc_debug.txt

By: Mark Michelson (mmichelson) 2008-03-11 17:30:24

jvandal: What version of Asterisk were you using when you got that valgrind output? If it was from a specific svn checkout, which one was it? I just would like to know so that the line numbers from the valgrind log will align with what I have checked out locally.

By: Joel Vandal (jvandal) 2008-03-11 17:38:06

Revision: 106860

but have some patch used on this... I retest with latest 1.4 branches without any patches

By: Joel Vandal (jvandal) 2008-03-11 18:10:10

Hi,

New valgrind output uploaded.

Asterisk SVN Revision 107789
URL: http://svn.digium.com/svn/asterisk/branches/1.4

By: Mark Michelson (mmichelson) 2008-03-11 19:05:23

I love valgrind :)

So here's what's happening: Apparently, the scheduler has a packet in it that is queued for retransmission, but the packet has already been freed by chan_sip. When it comes time for the scheduler to act on the packet, it is reading from and writing to freed memory.

I suspect that if you had debugging turned on on your console, you'd probably be seeing a message that says:

"Unable to cancel schedule ID x. This is probably a bug."

Right now I suspect that this is happening due to a race condition. I'll look further into this soon to see what the appropriate fix is.

By: Mark Michelson (mmichelson) 2008-03-12 09:20:52

I've uploaded 12098-putnopvut.patch for testing. I am confident that it will cure the crash you have experienced, but I'm not 100% sure that this is the absolute best way to go about doing this. Test results while using this patch would be very convincing, though :)

By: Joel Vandal (jvandal) 2008-03-12 15:24:43

Path look good, server doesn't crash but have others deadlock issues (see ticket 12196)

By: Mark Michelson (mmichelson) 2008-03-13 14:30:58

Okay, the deadlock issues you reported in 12196 ARE related to this patch. I'll have a new patch up soon which should take care of the problem.

As such, I'm also going to close 12196 since it is caused by the patch here.

By: Mark Michelson (mmichelson) 2008-03-13 14:43:52

I closed issue 12196 and have uploaded a new patch here. Please give this a test. Thanks!

Edit: For clarity, the new patch is called 12098-putnopvutv2.patch



By: Joel Vandal (jvandal) 2008-03-13 15:31:53

I'm testing the new patch and now have tons of locks on SIP channels.

=== Thread ID: 2936552352 (pbx_thread           started at [ 2655] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (chan_sip.c): MUTEX 15728 restart_monitor &monlock 0x3bb600 (1)
=== --- ---> Locked Here: chan_sip.c line 15683 (do_monitor)
=== -------------------------------------------------------------------

(this repeat about 300-400 times, only thread id changes)

By: Mark Michelson (mmichelson) 2008-03-13 16:35:32

The problem with v2 was that there was some recursive locking done which meant that the attempt to yield the lock in __sip_ack didn't actually help, and led to an infinite loop still. I have attempted to eliminate the recursive locking. Please give this new patch a try. Thanks for bearing with us on this one.

By: Joel Vandal (jvandal) 2008-03-13 17:24:37

Excellent, now have a total of 0 locks but have a problem with channels that look zombie, please check 'log.txt'.

I see this error repeat lots of time.

[Mar 13 18:27:20] DEBUG[878]: audiohook.c:233 audiohook_read_frame_both: Failed to get 160 samples from read factory 0xb0f90c0
[Mar 13 18:27:20] DEBUG[32451]: audiohook.c:233 audiohook_read_frame_both: Failed to get 160 samples from read factory 0x987b048
[Mar 13 18:27:20] DEBUG[32126]: audiohook.c:233 audiohook_read_frame_both: Failed to get 160 samples from read factory 0xb2395d50
[Mar 13 18:27:20] DEBUG[31748]: audiohook.c:233 audiohook_read_frame_both: Failed to get 160 samples from read factory 0x98909a8

By: Joel Vandal (jvandal) 2008-03-13 18:33:31

Upload info.txt that contain :

show channels concise
sip show channels
core show threads
core show locks


I have now 47 channels still "Up" but the server that generate calls (asterisk) is shut down, channels must kill if no RTP or whatever, no ? This is why I suspect a "hidden" locking/infinite loop problem.

I can 'soft hangup' the channel but get theses error (must be normal...) .

soft hangup SIP/1584-09ed1d20
Requested Hangup on channel 'SIP/1584-09ed1d20'
[Mar 13 19:35:54] DEBUG[28983]: channel.c:1384 ast_softhangup_nolock: Soft-Hanging up channel 'SIP/1584-09ed1d20'
[Mar 13 19:35:54] DEBUG[1340]: channel.c:3781 ast_generic_bridge: Didn't get a frame from channel: SIP/1584-09ed1d20
[Mar 13 19:35:54] DEBUG[1340]: channel.c:4115 ast_channel_bridge: Bridge stops bridging channels SIP/pbxtest-0a3d1990 and SIP/1584-09ed1d20
[Mar 13 19:35:54] DEBUG[1340]: channel.c:1483 ast_hangup: Hanging up channel 'SIP/1584-09ed1d20'
[Mar 13 19:35:54] DEBUG[1340]: chan_sip.c:3517 sip_hangup: Hangup call SIP/1584-09ed1d20, SIP callid 33f2531f7789ed4e124ff8e72853855b@192.168.99.19)
[Mar 13 19:35:54] DEBUG[1340]: devicestate.c:304 __ast_device_state_changed_literal: Notification of state change to be queued on device/channel SIP/1584-09ed1d20
[Mar 13 19:35:54] DEBUG[1340]: devicestate.c:304 __ast_device_state_changed_literal: Notification of state change to be queued on device/channel SIP/1584
[Mar 13 19:35:54] DEBUG[1340]: rtp.c:1511 ast_rtp_early_bridge: Channel '<unspecified>' has no RTP, not doing anything

By: Joel Vandal (jvandal) 2008-03-14 07:42:27

This morning, all channels are still 'Up' after 13h but all SIP traffic has been dropped.

> show channel SIP/pbxtest-b10235f0

    Name: SIP/pbxtest-b10235f0
          Type: SIP
      UniqueID: 1173826027.4325
     Caller ID: 4215554280
Caller ID Name: Linda Taylor
   DNID Digits: 5551003
         State: Up (6)
         Rings: 0
 NativeFormats: 0x4 (ulaw)
   WriteFormat: 0x4 (ulaw)
    ReadFormat: 0x4 (ulaw)
WriteTranscode: No
 ReadTranscode: No
1st File Descriptor: 189
     Frames in: 17
    Frames out: 2931
Time to Hangup: 0
  Elapsed Time: 13h56m2s
 Direct Bridge: SIP/1611-0a2cc140
Indirect Bridge: SIP/1611-0a2cc140
--   PBX   --
       Context: default-aa-ivr_04
     Extension: 1
      Priority: 2
    Call Group: 0
  Pickup Group: 0
   Application: Dial
          Data: SIP/1600&SIP/1601&SIP/1602&SIP/1603&SIP/1604&SIP/1605&
                SIP/1606&SIP/1607&SIP/1608&SIP/1609&SIP/1610&
                SIP/1611&SIP/1612&SIP/1613|60|o
   Blocking in: ast_waitfor_nandfds

By: Joel Vandal (jvandal) 2008-03-14 08:44:23

putnopvut: Problem I now have on my server is IMHO a different problems and "probably" unrelated to this tickets, does I must create a new ticket or you prefer to continue here ?

If you think that it can have a relation with this ticket, then will try to provide all needed informations.

By: Joel Vandal (jvandal) 2008-03-14 09:22:27

Hi,

To try to see if this is an error on dialplan or whatever, I have made it more simple and here the dialplan I use to made my test.

We call to 5551000 and press 2.

[default]
exten => 5551000,1,Goto(default-aa-ivr_01,s,1)

[default-aa-ivr_01]
exten => s,1,Answer
exten => s,2,Wait(1)
exten => s,3,Set(TIMEOUT(digit)=15)
exten => s,4,Set(TIMEOUT(response)=25)
exten => s,5,Set(CHANNEL(language)=fr)

exten => 2,1,NoOp()
exten => 2,2,Dial(SIP/1121&SIP/1122&SIP/1123&SIP/1124&SIP/1125&SIP/1126&SIP/1127&SIP/1128&SIP/1129&SIP/1130&SIP/1131&SIP/1132,60,o)



By: Joel Vandal (jvandal) 2008-03-14 10:41:17

Ok, I think that 12098 is REALLY fixed and can be closed.

The problem I describe on previous notes occur when I kill the "Call Generator", no packets to "end" call can be sent to Asterisk and all my channels keep open. Will made test w/ SIP rtptimeout, etc.

By: David Brillert (aragon) 2008-03-14 10:53:27

putnopvut:
This is starting to look really good as far as 12098
Do you think jvandal's latest problem with hung SIP channels is another bug?

By: Digium Subversion (svnbot) 2008-03-14 11:39:57

Repository: asterisk
Revision: 108737

U   branches/1.4/channels/chan_sip.c

------------------------------------------------------------------------
r108737 | mmichelson | 2008-03-14 11:39:56 -0500 (Fri, 14 Mar 2008) | 33 lines

Fix a race condition in the SIP packet scheduler which could cause a crash.

chan_sip uses the scheduler API in order to schedule retransmission of reliable
packets (such as INVITES). If a retransmission of a packet is occurring, then the
packet is removed from the scheduler and retrans_pkt is called. Meanwhile, if
a response is received from the packet as previously transmitted, then when we
ACK the response, we will remove the packet from the scheduler and free the packet.

The problem is that both the ACK function and retrans_pkt attempt to acquire the
same lock at the beginning of the function call. This means that if the ACK function
acquires the lock first, then it will free the packet which retrans_pkt is about to
read from and write to. The result is a crash.

The solution:

1. If the ACK function fails to remove the packet from the scheduler and the retransmit
  id of the packet is not -1 (meaning that we have not reached the maximum number of
  retransmissions) then release the lock and yield so that retrans_pkt may acquire the
  lock and operate.

2. Make absolutely certain that the ACK function does not recursively lock the lock in
  question. If it does, then releasing the lock will do no good, since retrans_pkt will
  still be unable to acquire the lock.

(closes issue ASTERISK-11537)
Reported by: wegbert
(closes issue ASTERISK-11529)
Reported by: PTorres
Patches:
     12098-putnopvutv3.patch uploaded by putnopvut (license 60)
Tested by: jvandal


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=108737

By: Digium Subversion (svnbot) 2008-03-14 11:48:57

Repository: asterisk
Revision: 108738

_U  trunk/
U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r108738 | mmichelson | 2008-03-14 11:48:54 -0500 (Fri, 14 Mar 2008) | 41 lines

Merged revisions 108737 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r108737 | mmichelson | 2008-03-14 11:44:08 -0500 (Fri, 14 Mar 2008) | 33 lines

Fix a race condition in the SIP packet scheduler which could cause a crash.

chan_sip uses the scheduler API in order to schedule retransmission of reliable
packets (such as INVITES). If a retransmission of a packet is occurring, then the
packet is removed from the scheduler and retrans_pkt is called. Meanwhile, if
a response is received from the packet as previously transmitted, then when we
ACK the response, we will remove the packet from the scheduler and free the packet.

The problem is that both the ACK function and retrans_pkt attempt to acquire the
same lock at the beginning of the function call. This means that if the ACK function
acquires the lock first, then it will free the packet which retrans_pkt is about to
read from and write to. The result is a crash.

The solution:

1. If the ACK function fails to remove the packet from the scheduler and the retransmit
  id of the packet is not -1 (meaning that we have not reached the maximum number of
  retransmissions) then release the lock and yield so that retrans_pkt may acquire the
  lock and operate.

2. Make absolutely certain that the ACK function does not recursively lock the lock in
  question. If it does, then releasing the lock will do no good, since retrans_pkt will
  still be unable to acquire the lock.

(closes issue ASTERISK-11537)
Reported by: wegbert
(closes issue ASTERISK-11529)
Reported by: PTorres
Patches:
     12098-putnopvutv3.patch uploaded by putnopvut (license 60)
Tested by: jvandal


........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=108738

By: Digium Subversion (svnbot) 2008-03-14 11:50:30

Repository: asterisk
Revision: 108739

_U  branches/1.6.0/
U   branches/1.6.0/channels/chan_sip.c

------------------------------------------------------------------------
r108739 | mmichelson | 2008-03-14 11:50:28 -0500 (Fri, 14 Mar 2008) | 49 lines

Merged revisions 108738 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r108738 | mmichelson | 2008-03-14 11:52:51 -0500 (Fri, 14 Mar 2008) | 41 lines

Merged revisions 108737 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r108737 | mmichelson | 2008-03-14 11:44:08 -0500 (Fri, 14 Mar 2008) | 33 lines

Fix a race condition in the SIP packet scheduler which could cause a crash.

chan_sip uses the scheduler API in order to schedule retransmission of reliable
packets (such as INVITES). If a retransmission of a packet is occurring, then the
packet is removed from the scheduler and retrans_pkt is called. Meanwhile, if
a response is received from the packet as previously transmitted, then when we
ACK the response, we will remove the packet from the scheduler and free the packet.

The problem is that both the ACK function and retrans_pkt attempt to acquire the
same lock at the beginning of the function call. This means that if the ACK function
acquires the lock first, then it will free the packet which retrans_pkt is about to
read from and write to. The result is a crash.

The solution:

1. If the ACK function fails to remove the packet from the scheduler and the retransmit
  id of the packet is not -1 (meaning that we have not reached the maximum number of
  retransmissions) then release the lock and yield so that retrans_pkt may acquire the
  lock and operate.

2. Make absolutely certain that the ACK function does not recursively lock the lock in
  question. If it does, then releasing the lock will do no good, since retrans_pkt will
  still be unable to acquire the lock.

(closes issue ASTERISK-11537)
Reported by: wegbert
(closes issue ASTERISK-11529)
Reported by: PTorres
Patches:
     12098-putnopvutv3.patch uploaded by putnopvut (license 60)
Tested by: jvandal


........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=108739