[Home]

Summary:ASTERISK-11576: Asterisk stops responding after quick hangup between asterisk boxes
Reporter:Kamil Czajko (kactus)Labels:
Date Opened:2008-03-05 04:11:19.000-0600Date Closed:2008-03-17 13:03:39
Priority:BlockerRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20080312__bug12143__2.diff.txt
( 1) 20080312__bug12143.diff.txt
( 2) coreshowlocks.txt
( 3) coreshowlocksDeadlockDebug.txt
( 4) coreshowlocksDeadlockDebugAfterPatch.txt
( 5) database.txt
( 6) debugger_info.txt
( 7) MessageOnRestartingAsterisk.txt
( 8) res_odbc.txt
Description:Hello I've been playing around with a two server dynamic configuration and have been able to repeatedly lockup asterisk with several 1.6 builds beta 4 and svn 104866 (latest at time of testing) so that it does not even accept any further commands at the console or from sip devices. The only way to make asterisk respond after this is by killing the process or restarting the service but even this throws an error before being restarted by safe_asterisk.

What we are doing is probably unconventional but either way it is unlikely that a lockup of asterisk should result. I will attach backtrace information for you as well, but please let me know if you need more. I had collected more but a machine restart over the weekend meant I had to recreate even the information I am posting now.

We have two servers running asterisk 1.6 (voip01 & voip02) and two snom 360 handsets (0040 and  0041) with version 6+ firmware each registering to a different server, natting between the handsets and the servers while the servers sit on a local lan.

0040 ->nat-> SIP -> voip01 -> IAX -> voip02 -> SIP -> nat -> 0041

Normal calls work from 0040 to 0041 when dialing 141 and voice works both ways without issue.

However if you hang up right as the 0041 starts ringing it will continue to ring, when you pick up the line will be dead and from that point onwards asterisk no longer accepts sip requests, calls or even commands such as iax2 show channels until it is killed. Logs on the snom phones after reregister show multiple registration requests but no answers and there are no more screen output to the cli after 'extension is ringing'. You cannot even run shutdown now from the cli and have to kill the process. For the back trace we have done a kill -11 to force a segmentation fault to hopefully give you more information.

This only happens when storing the sip information in the database, we can not replicate this when only using flat files or when running extensions from the database but sip from flat files. This might be due to a timing delay in hanging up on time however as if you hang up too soon or too late the phone behaves normally with out issue. In our test setup if you hang up on the first indial ring on the second phone it causes this issue 99% of the time (about 1 second after initiating the call).


****** ADDITIONAL INFORMATION ******

I was previously able to do something similar with asterisk 1.4.17 but on a much more severe scale as I could get it to hang between two extensions on the same server but did not collect any debug information at the time as we upgraded to asterisk 1.6 to try and get around the issue.

This occurs both when using an IAX trunk or sip trunk between the servers but to remove the possibility of an issue with sip reinvite we are running it sip – iax – sip in this example.

We have also tested it with a different sip table that did not have the [msrepl_tran_version], [regserver], [extension] columns with out success.

We have tested this on both centos 5.1 and freebsd

we have tested with both unixodbc unixODBC 2.2.11 and 2.2.12 and are using the freetds driver v0.64

contents of res_odbc and the database tables will also be attached.
Comments:By: Tilghman Lesher (tilghman) 2008-03-10 12:23:42

Please recompile with DEBUG_THREADS and DONT_OPTIMIZE.  When the problem occurs, obtain the output of 'core show locks' and upload it as a file to the file upload area.

By: Kamil Czajko (kactus) 2008-03-11 22:14:04

Hello sorry it took me so long to get this info to you I was not able to get to this until now.

Hopefully this is the information that you are after.

Please let me know if you need any more information or if you need me to test a different svn, I have not updated since 104866 to prevent adding additional layers of complexity.

Cheers

By: Kamil Czajko (kactus) 2008-03-11 22:37:23

Hi again

The I've also just uploaded the message that apears when restarting asterisk after its hung just incase it is relevant though more than likely it isn't.

I've noted in the chan_sip.c its currently hiding teh deadlock information so I am now going to recompile with the show deadlocks compiler flag and repost the debug information.

Cheers -Kamil

By: Kamil Czajko (kactus) 2008-03-11 23:58:02

Hi Again

Just uploaded coreshowlocksDeadlockDebug.txt that shows the additional information with the debug deadlocks flag set.

Please let me know if you need anything else done.

By: Tilghman Lesher (tilghman) 2008-03-12 00:14:27

Okay, patch uploaded for you to try.

By: Kamil Czajko (kactus) 2008-03-12 01:31:44

Hello Thanks for the quick update

Have tested it and it still locks up but seems to be tighter. Have to hang up as soon as I make the call now where as before I had a seconds grace. Also it seems to be locking now in different bits of the code. coreshowlocksDeadlockDebugAfterPatch.txt has the infomation with "core show locks" part way down. PLease let me now when you want me to test the next bit and if you want me to remove the deadlock debug flag.

Cheers

By: Tilghman Lesher (tilghman) 2008-03-12 01:55:44

Okay, new patch uploaded.

By: Kamil Czajko (kactus) 2008-03-12 02:29:35

Brilliant that looks to do the trick. Tried multiple times and was unable to replicate the previous lockup.

By: Digium Subversion (svnbot) 2008-03-12 02:39:11

Repository: asterisk
Revision: 107998

U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r107998 | tilghman | 2008-03-12 02:39:07 -0500 (Wed, 12 Mar 2008) | 7 lines

Deadlock fixes
(closes issue ASTERISK-11576)
Reported by: kactus
Patches:
      20080312__bug12143__2.diff.txt uploaded by Corydon76 (license 14)
Tested by: kactus

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=107998

By: Digium Subversion (svnbot) 2008-03-12 02:40:24

Repository: asterisk
Revision: 107999

_U  branches/1.6.0/
U   branches/1.6.0/channels/chan_sip.c

------------------------------------------------------------------------
r107999 | tilghman | 2008-03-12 02:40:23 -0500 (Wed, 12 Mar 2008) | 15 lines

Merged revisions 107998 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
r107998 | tilghman | 2008-03-12 02:43:03 -0500 (Wed, 12 Mar 2008) | 7 lines

Deadlock fixes
(closes issue ASTERISK-11576)
Reported by: kactus
Patches:
      20080312__bug12143__2.diff.txt uploaded by Corydon76 (license 14)
Tested by: kactus

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=107999

By: Digium Subversion (svnbot) 2008-03-17 13:03:38

Repository: asterisk
Revision: 109173

_U  team/murf/bug11210/
U   team/murf/bug11210/apps/app_chanspy.c
U   team/murf/bug11210/apps/app_mixmonitor.c
U   team/murf/bug11210/channels/chan_sip.c
U   team/murf/bug11210/funcs/func_config.c
U   team/murf/bug11210/include/asterisk/astobj2.h
U   team/murf/bug11210/include/asterisk/audiohook.h
U   team/murf/bug11210/include/asterisk/pbx.h
U   team/murf/bug11210/main/astobj2.c
U   team/murf/bug11210/main/audiohook.c
U   team/murf/bug11210/main/channel.c
U   team/murf/bug11210/main/pbx.c

------------------------------------------------------------------------
r109173 | murf | 2008-03-17 13:03:22 -0500 (Mon, 17 Mar 2008) | 105 lines

Merged revisions 107998,108032,108034,108084,108137 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r107998 | tilghman | 2008-03-12 01:43:03 -0600 (Wed, 12 Mar 2008) | 7 lines

Deadlock fixes
(closes issue ASTERISK-11576)
Reported by: kactus
Patches:
      20080312__bug12143__2.diff.txt uploaded by Corydon76 (license 14)
Tested by: kactus

................
r108032 | russell | 2008-03-12 11:02:57 -0600 (Wed, 12 Mar 2008) | 12 lines

Merged revisions 108031 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r108031 | russell | 2008-03-12 11:59:07 -0500 (Wed, 12 Mar 2008) | 4 lines

Destroy the channel lock after the channel datastores.

(inspired by issue ASTERISK-11617)

........

................
r108034 | russell | 2008-03-12 11:06:37 -0600 (Wed, 12 Mar 2008) | 4 lines

- Add Tilghman to the copyright info ... he wrote the hard part :)
- Remove some magic in unload_module that isn't needed.  Module use counts
  already ensure that the function isn't going to be in use at this point.

................
r108084 | file | 2008-03-12 12:29:33 -0600 (Wed, 12 Mar 2008) | 12 lines

Merged revisions 108083 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r108083 | file | 2008-03-12 15:26:37 -0300 (Wed, 12 Mar 2008) | 4 lines

Add a trigger mode that triggers on both read and write. The actual function that returns the combined audio frame though will wait until both sides have fed in audio, or until one side stops (such as the case when you call Wait).
(closes issue ASTERISK-11394)
Reported by: xheliox

........

................
r108137 | russell | 2008-03-12 13:59:05 -0600 (Wed, 12 Mar 2008) | 50 lines

Merged revisions 108135 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r108135 | russell | 2008-03-12 14:57:42 -0500 (Wed, 12 Mar 2008) | 40 lines

(closes issue ASTERISK-11617, reported by atis, fixed by me after some brainstorming
on the issue with mmichelson)

- Update copyright info on app_chanspy.

- Fix a race condition that caused app_chanspy to crash.  The issue was that
 the chanspy datastore magic that was used to ensure that spyee channels did
 not disappear out from under the code did not completely solve the problem.
 It was actually possible for chanspy to acquire a channel reference out of
 its datastore to a channel that was in the middle of being destroyed.  That
 was because datastore destruction in ast_channel_free() was done near the
 end.  So, this left the code in app_chanspy accessing a channel that was
 partially, or completely invalid because it was in the process of being free'd
 by another thread.  The following sort of shows the code path where the race
 occurred:

 =============================================================================
 Thread 1 (PBX thread for spyee chan)  ||   Thread 2 (chanspy)
 --------------------------------------||-------------------------------------
 ast_channel_free()                    ||
   - remove channel from channel list  ||
   - lock/unlock the channel to ensure ||
     that no references retrieved from ||
     the channel list exist.           ||
 --------------------------------------||-------------------------------------
                                       || channel_spy()
   - destroy some channel data         ||  - Lock chanspy datastore
                                       ||  - Retrieve reference to channel
                                       ||  - lock channel
                                       ||  - Unlock chanspy datastore
 --------------------------------------||-------------------------------------
    - destroy channel datastores       ||
       - call chanspy datastore d'tor  ||  
         which NULL's out the ds'      ||  - Operate on the channel ...
         reference to the channel      ||    
                                       ||
   - free the channel                  ||
                                       ||
                                       ||  - unlock the channel
 --------------------------------------||-------------------------------------
 =============================================================================

........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=109173