[Home]

Summary:ASTERISK-09438: SIP Transfers To Parking Lot From Grandstream GXP-2000 Locks Up SIP Channel
Reporter:Ken Williams (kenw)Labels:
Date Opened:2007-05-14 14:49:30Date Closed:2007-09-08 09:14:35
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/Transfers
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 052107-console-trimmed.log
( 1) 080307cli-trimmed.txt
( 2) 9724.txt
( 3) 9724-1.2.txt
( 4) asterisk.conf
( 5) console072007.txt
( 6) core_show_locks_081307.txt
( 7) core_show_locks_081307a.txt
( 8) core.072007.tar.gz
( 9) extensions.conf
(10) features.conf
(11) full
(12) full052107a-trimmed.log
(13) full052107-trimmed.log
(14) full062207a-trimmed2.LOG
(15) gbd072307.txt
(16) gdb072707a.txt
(17) gdb072707b.txt
(18) gdb080207.txt
(19) iax.conf
(20) logger.conf
(21) modules.conf
(22) musiconhold.conf
(23) rtp.conf
(24) sip_channels_and_history.txt
(25) sip_console_output.txt
(26) sip_details_from_console.txt
(27) sip.conf
(28) sip072007.conf
(29) trimmed-full072007a.txt
(30) voicemail.conf
(31) zapata.conf
Description:Starting a couple of months ago about once a week Asterisk would stop responding to our SIP phones.  This has increased slowly to 3-5 times a day.  It's taken me a long time to (hopefully) narrow the problem down to something I could report.  I've looked everywhere online for help and I hope I didn't miss this being reported already.

I've been able to reproduce the problem with 1.4.1, 1.4.2, 1.4.4 & the latest SVN as of last week (63519), the version doesn't seem to affect the timing of the problem one way or another.

I've attached logs below including SIP Debug enabled, verbose & debug set to 4.  I've also included a console log of the state of things when it happens.  

We are using the TRNF button on the GXP-2000 phones to transfer to a parking lot.  It seems to be a transfer to the parking lot that causes SIP to stop responding, though I may be mistaken.  Asterisk appears fine but all phones get 'no response from server' when the lock up occurs.

We've got 3 boxes at 3 locations, the other 2 are working fine, this one is having fits, however, this box is also used 10x more then the other 2.  Last week I removed all asterisk folders and reinstalled 1.4.4 but the problem still occurs.

When the lockup occurs if I try to reload chan_sip I get no errors, if I try to unload it I get the following:
[May 14 11:40:47] WARNING[6314]: loader.c:458 ast_unload_resource: Soft unload failed, 'chan_sip.so' has use count 32

The server is running CentOS 4.4, has two TDM-400p cards with 6 FXS & 2 FXO cards.

Comments:By: Ken Williams (kenw) 2007-05-14 15:08:28

The hangup comes from the call answered at 13:27:31 in this particular log.  She answers the call, goes to transfer to park and the only fix is to stop & restart Asterisk.

By: Olle Johansson (oej) 2007-05-15 11:06:58

THank you for providing such a detailed report!

By: Olle Johansson (oej) 2007-05-15 11:08:33

Ok, we have all the information we need. Impressed. Karma!

By: Ken Williams (kenw) 2007-05-15 16:08:32

Woot my first Karma! :)

Unfortunately we've dumped 3 times so far today :(.

In the meantime I've instructed everyone to use blindxfer's within Asterisk instead of relying on the TRNF button from the phones.  It's a bit more cumbersome, but if I don't have this fixed very soon I'll likely be plugging our old PBX in...so please don't make me have to do that :D.

For clarification, our standard was to answer incoming call, hit TRNF followed by one of the BLF buttons on the GXP-2000 that corresponded to our parking lot extension.  This works great at our two other locations and has worked here but is steadily getting worse.

Now I've asked everyone to hit ##200 (our parking lot ext.) to put someone on park.  Obviously it's easier to hit 2 buttons instead of 5, not to mention there's a 3 second delay while res_features makes sure you're not inputting digits (maybe I'll lower this to 2).

Happy to help in anyway I can to get this resolved.

By: alcazar (alcazar) 2007-05-16 09:56:24

I can confirm this problem Asterisk 1.4.4-R.  However, it happens anytime the transfer button is pushed.  Parked calls make no difference.



By: Ken Williams (kenw) 2007-05-17 09:22:36

Using blindxfer through features.conf we had no lock ups yesterday save for two where people still used the TRNF button on the phone (hard to break old habits).  We'd all really prefer to use the TRNF button, but for now we're using blindxfer.

I should also note that every time it's locked up, when I do "SIP SHOW CHANNELS" the one who used the TRNF button shows "Init: REFER" for the Last Message.  Hope that helps as well.

Is it worth having another set of logs from another crash or do you have enough with what I've already supplied?



By: alcazar (alcazar) 2007-05-17 09:52:44

kenw if you get rid of call limits on a channel does this go away for you?  Basically if I have call limits I see all these hanging invites that will never go away.  Channels will show usage when I do a sip show inuse.  However, once I remove the call-limits I can use the trans button until my heart is content, and now more hung invites.  I am trying to find exactly where this bug is happening.  I need this squashed!

By: Ken Williams (kenw) 2007-05-17 10:26:33

I've removed all call-limits and we'll see.

I also have the same problem you're describing in regards to INVITE's that never clear & INUSE that aren't really INUSE.  

The reason I put call-limit in to begin with I thought was to get hints to work, there was a reason that wasn't related at all to call-limit.  I've removed them and hints are still working so I don't know if the original problem was fixed at some time or if I was using call-limit to fix some other problem.

Regardless, everything appears to be working fine right now with call-limits removed, I'll report back in a day or two if the system does not hang.

Thanks for the tip alcazar, mostly happy to have someone else who has experienced this problem :).

By: alcazar (alcazar) 2007-05-17 10:38:59

Well, I am not so sure now.  You will have to do a restart of asterisk for the call limits to go away.  If you do a sip show inuse I bet that it still is keeping tallys.  I have a customer now who is not using call limits and has 1 stuck invite so far.  I am keeping a watchful eye on this, but it does have to do with the stuck invites that are screwing things up.  Hopefully we can get someone to work on this with us.

Things I know for sure:
Stuck invites are screwing up call limits/presence
Transfer is related

By: Ken Williams (kenw) 2007-05-17 14:52:40

Well, it's been more than 24 hours since lock up which the system hasn't done in weeks.  This is since changing from the TRNF button to using ## via features.conf.

I've removed the call-limit and assuming everything works well the rest of the day, tomorrow I'll go back to the TRNF button and see if we're still lock up free.

By: alcazar (alcazar) 2007-05-18 07:50:53

Well, I removed call limits, etc... and we are still getting these stuck SIP invite channels.  It is in no doubt related in how the grandstream is doing transfers.  It does appear that this does not happen when using xfer via features.

By: alcazar (alcazar) 2007-05-18 08:44:39

I believe this is related to BUG ID 9235 take a look at the hanging INVITEs.  Of course I still don't have an answer yet.

By: Ken Williams (kenw) 2007-05-18 09:42:56

They definitely seem related.  Today I've asked everyone to go back to using the TRNF button but I've applied the patch from the other bug and removed call-limits.

I'm now crossing my fingers...

By: alcazar (alcazar) 2007-05-18 09:48:52

I applied the patch about an hour ago, and not to get you too excited but so far it appears that it fixes the problems!

By: Ken Williams (kenw) 2007-05-18 10:07:41

And we have lockup.

Back to ## for now.

By: alcazar (alcazar) 2007-05-18 10:10:01

You didn't happen to have a capture of sip show channels do you?

With verbosity on were you seeing messages like this after the patch?

[May 18 11:11:04] WARNING[13445]: chan_sip.c:2043 __sip_autodestruct: Autodestruct on dialog '77d5aed6146edae1@192.168.12.26' with owner in place (Method: ACK)



By: Ken Williams (kenw) 2007-05-18 10:11:43

Unfortunately I wasn't at my desk and had to reboot as fast as I could by the time I got back, no time to analyze :(

I haven't seen those messages yet.

I should note, I verified it was again when someone tried to use the TRNF button.  So for now we're back to ##.  If someone needs more logs or access to my server or whatever, I'm happy to arrange whatever and I'll just keep using the TRNF button until it crashes or whatever helps!



By: alcazar (alcazar) 2007-05-18 12:12:56

Are you using queues heavily?

By: Ken Williams (kenw) 2007-05-18 12:23:10

Not using queues at all.

My problem seems to be directly related to REFER SIP requests.  I'm going through logs deeper this morning trying to find exactly where the problem is.

Unfortunately I'm going to be out of town this weekend, because I'd like to come in off hours and force a crash over and over again without the extra traffic just by TRNF calls via the GXP-2000 phones (which is where the REFER comes in).

By: Olle Johansson (oej) 2007-05-18 13:09:00

The patch from ASTERISK-8969 will cause issues you don't want. Please do not apply that patch. Instead, update to latest 1.4 from svn and try that code - I've added a few patches.

By: alcazar (alcazar) 2007-05-18 13:29:43

Trying latest SVN now.  I will update you with the results.

By: Ken Williams (kenw) 2007-05-18 13:47:18

Tried to be tricky and use SVN, system dumped on answering of first incoming call from ZAP line.  I'll have to test the SVN later when I have time to find the cause.

By: alcazar (alcazar) 2007-05-18 14:07:23

No go with SVN-65124. I started getting hangs on the invites again.

XXXXXXXXX    XXXXXXXX  68e9ec7a00b  00102/00000  unkn  No       Init: INVITE

By: alcazar (alcazar) 2007-05-18 14:17:10

No go with SVN-65124. I started getting hangs on the invites again.

XXXXXXXXX    XXXXXXXX  68e9ec7a00b  00102/00000  unkn  No       Init: INVITE

By: Olle Johansson (oej) 2007-05-18 14:35:09

Turn on sip history and get me sip history for one of those calls. Are you using an unpatched version of Asterisk?

By: alcazar (alcazar) 2007-05-18 14:56:31

I will have to wait until the weekend to get you that output.  The company has left for the day.

By: alcazar (alcazar) 2007-05-19 09:28:43

It almost appears that when a phone becomes unreachable, or something along those lines this will happen and will never go away.  My sip peer was configged with qualify=no to get around their tempermental NAT.

74.212.30.14     keenan-102  3d97e8b0271  00102/00000  unkn  No       Init: INVITE  
 * SIP Call
1. NewChan         Channel SIP/keenan-102-08352b00 - from 3d97e8b027131a4e496c7791
2. TxReqRel        INVITE / 102 INVITE - -UNKNOWN-
3. ReTx            1000 INVITE sip:keenan-102@74.212.XXXX:1118 SIP/2.0
4. ReTx            2000 INVITE sip:keenan-102@74.212.XXXX:1118 SIP/2.0
5. ReTx            4000 INVITE sip:keenan-102@74.212.XXXX:1118 SIP/2.0
6. ReTx            8000 INVITE sip:keenan-102@74.212.XXXX:1118 SIP/2.0
7. ReTx            16000 INVITE sip:keenan-102@74.212XXXX:1118 SIP/2.0



By: Ken Williams (kenw) 2007-05-21 12:03:40

That revision didn't help unfortunately.  I'm attaching updated logs from this mornings occurence.  The below log shows where someone hits TRNF on the GXP-2000 and from this point on the server ignores incoming SIP packets.

The reason I say incoming is, when the system is in this state a new call can come in and all SIP phones ring, but no one can answer the call.  In addition, anyone currently on the phone is okay until they disconnect.  Hope this helps.

[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 0: REFER sip:6782553661@10.200.26.202 SIP/2.0 (42)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 1: Via: SIP/2.0/UDP 10.200.26.131:5060;branch=z9hG4bK5946ca300c00ffec (66)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 2: From: <sip:731@10.200.26.131:5060>;tag=e23b44165df86673 (55)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 3: To: "EIS INC" <sip:6782553661@10.200.26.202>;tag=as22028614 (59)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 4: Contact: <sip:731@10.200.26.131:5060> (37)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 5: Supported: replaces (19)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 6: Refer-To: <sip:200@10.200.26.202> (33)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 7: Referred-By: <sip:731@10.200.26.202> (36)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 8: Call-ID: 151dba617b76b64146da6ca06d1060f2@10.200.26.202 (55)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 9: CSeq: 34244 REFER (17)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 10: User-Agent: Grandstream GXP2000 1.1.1.14 (40)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 11: Max-Forwards: 70 (16)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 12: Allow: INVITE,ACK,CANCEL,BYE,NOTIFY,REFER,OPTIONS,INFO,SUBSCRIBE,UPDATE,PRACK,MESSAGE (85)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 13: Content-Length: 0     (21)
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Header 14:  (0)
[May 21 08:59:18] VERBOSE[1436] logger.c: --- (14 headers 0 lines) ---
[May 21 08:59:18] DEBUG[1436] chan_sip.c: **** Received REFER (9) - Command in SIP REFER
[May 21 08:59:18] VERBOSE[1436] logger.c: Call 151dba617b76b64146da6ca06d1060f2@10.200.26.202 got a SIP call transfer from caller: (REFER)!
[May 21 08:59:18] VERBOSE[1436] logger.c: SIP transfer to extension 200@from-internal by 731@10.200.26.202
[May 21 08:59:18] DEBUG[1436] chan_sip.c: SIP blind transfer: Transferer channel SIP/731-086a6fc8, transferee channel Zap/7-1
[May 21 08:59:18] DEBUG[1436] chan_sip.c: Got SIP transfer, applying to bridged peer 'Zap/7-1'
[May 21 08:59:18] DEBUG[1436] chan_sip.c: SIP transfer to parking: trying to park Zap/7-1. Parked by SIP/731-086a6fc8
[May 21 08:59:18] DEBUG[1436] channel.c: Planning to masquerade channel Zap/7-1 into the structure of Parking/Zap/7-1
[May 21 08:59:18] DEBUG[1436] channel.c: Done planning to masquerade channel Zap/7-1 into the structure of Parking/Zap/7-1



By: Ken Williams (kenw) 2007-05-21 12:11:12

full052107-trimmed.log = first SIP lockup this morning from TRNF transfer

full052107a-trimmed.log = second SIP lockup this morning, not quite as trimmed - tried to delete but don't have access :(

052107-console-trimmed-log = console log from first lockup, don't have one from second lockup



By: Ken Williams (kenw) 2007-05-21 18:03:33

Below is SIP History from when it just now locked up, I can attach logs for this lockup as well if desired.

1. Rx              INVITE / 2875 INVITE / sip:201@10.200.26.202
2. AuthChal        Auth challenge sent for  - nc 0
3. TxRespRel       SIP/2.0 / 2875 INVITE - 407 Proxy Authentication Required
4. SchedDestroy    32000 ms
5. Rx              ACK / 2875 ACK / sip:201@10.200.26.202
6. Rx              INVITE / 2876 INVITE / sip:201@10.200.26.202
7. CancelDestroy
8. Invite          New call: 5d4639126efd6ac7@10.200.26.121
9. AuthOK          Auth challenge succesful for 721
10. NewChan         Channel SIP/721-b71584d0 - from 5d4639126efd6ac7@10.200.26.121
11. TxResp          SIP/2.0 / 2876 INVITE - 100 Trying
12. TxRespRel       SIP/2.0 / 2876 INVITE - 200 OK
13. Rx              ACK / 2876 ACK / sip:201@10.200.26.202
14. Rx              INVITE / 2877 INVITE / sip:201@10.200.26.202
15. Hold            INVITE
16. ReInv           Re-invite received
17. TxRespRel       SIP/2.0 / 2877 INVITE - 200 OK
18. Rx              ACK / 2877 ACK / sip:201@10.200.26.202
19. Rx              REFER / 2878 REFER / sip:201@10.200.26.202
20. Xfer            REFER to call parking.

By: Ken Williams (kenw) 2007-05-21 18:42:27

A "good" transfer results in the following SIP history, you can see it's getting cut off on the xfer REFER to call parking:

1. NewChan         Channel SIP/717-08b48a38 - from 019c44693130d06d532658cb507e843
2. TxReqRel        INVITE / 102 INVITE - -UNKNOWN-
3. Rx              SIP/2.0 / 102 INVITE / 100 Trying
4. Rx              SIP/2.0 / 102 INVITE / 180 Ringing
5. Rx              SIP/2.0 / 102 INVITE / 200 OK
6. TxReq           ACK / 102 ACK - -UNKNOWN-
7. Rx              INVITE / 51278 INVITE / sip:8019187318@10.200.26.202
8. Hold            INVITE
9. ReInv           Re-invite received
10. TxRespRel       SIP/2.0 / 51278 INVITE - 200 OK
11. Rx              ACK / 51278 ACK / sip:8019187318@10.200.26.202
12. Rx              REFER / 51279 REFER / sip:8019187318@10.200.26.202
13. Xfer            REFER to call parking.
14. Masq            Old channel: SIPPeer/SIP/717-08b48a38<ZOMBIE>
15. Masq (cont)     ...new owner: SIP/717-08b48a38
16. TxResp          SIP/2.0 / 51279 REFER - 202 Accepted
17. SIPpark         Parked call on 201
18. TxReqRel        NOTIFY / 103 NOTIFY - -UNKNOWN-
19. SchedDestroy    32000 ms
20. TxReqRel        BYE / 104 BYE - -UNKNOWN-
21. RTCPaudio       Quality:ssrc=1176748443;themssrc=317473915;lp=0;rxjitter=0.0000
22. Rx              SIP/2.0 / 103 NOTIFY / 200 OK
23. Ignore          Ignoring this retransmit
24. Rx              SIP/2.0 / 104 BYE / 200 OK
25. ReTx            1000 NOTIFY sip:717@10.200.26.117:5060;transport=udp SIP/2.0
26. Rx              SIP/2.0 / 103 NOTIFY / 481 No Such Call
27. Ignore          Ignoring this retransmit
28. ReTx            2000 NOTIFY sip:717@10.200.26.117:5060;transport=udp SIP/2.0
29. Rx              SIP/2.0 / 103 NOTIFY / 481 No Such Call
30. Ignore          Ignoring this retransmit
31. ReTx            4000 NOTIFY sip:717@10.200.26.117:5060;transport=udp SIP/2.0
32. Rx              SIP/2.0 / 103 NOTIFY / 481 No Such Call
33. Ignore          Ignoring this retransmit
34. ReTx            4000 NOTIFY sip:717@10.200.26.117:5060;transport=udp SIP/2.0
35. Rx              SIP/2.0 / 103 NOTIFY / 481 No Such Call
36. Ignore          Ignoring this retransmit
37. ReTx            4000 NOTIFY sip:717@10.200.26.117:5060;transport=udp SIP/2.0
38. Rx              SIP/2.0 / 103 NOTIFY / 481 No Such Call
39. Ignore          Ignoring this retransmit

By: Ken Williams (kenw) 2007-05-23 10:21:52

I've been allowed to offer a bounty to get this fixed, how does that work?  E-mail me directly (ken@intermountainelectronics.com) or call me at 435-637-7160.

Thanks,
Ken

By: Ken Williams (kenw) 2007-05-24 11:08:26

Problem exists in 1.2.18.............................

Back to 1.2.12 (the original version of the software I setup).  

Hoping and praying.

By: Ken Williams (kenw) 2007-05-24 18:26:22

The downgrade (took it to 1.2.13 not .12), knock on a lot of wood, seems to not have the issue.

At this point I'm happy to have a phone system that's been up for 8+ hours under normal load, something I haven't been able to say for over a month.  I won't be able to do any testing for a while as my users are about ready to hang me, but I would very much like to get back to the 1.4 product if you happen to come up with a fix for this issue.

Keep in mind it's not just a 1.4 issue as it happened within 2 hours of running 1.2.18.  If I dared I'd narrow down the exact revision, but if the phones lock up again I've been told we're going back to our old PBX which would make me cry! :/

Thanks for all you guys do and I hope this little issue doesn't come up for anyone else and we're just an odd duck.

By: Ken Williams (kenw) 2007-05-31 11:40:03

Alas, after one week of uptime (barely, 1 week 10 minutes) SIP locked up and stopped responding when someone tried to transfer.

So, while 1.2.13 doesn't have the problem nearly as often as 1.2.18 or 1.4.x, it still occurs.

I didn't have any extra debugging enabled as it seemed to be working just fine, but it was a REFER that hung the channel again, so I'll turn debugging back on and go from there.

By: Ken Williams (kenw) 2007-06-13 09:39:23

This problem still exists in 1.2.13, it just happens MUCH less often.  We're down from locking up 3-5 times a day to once a week.

So, while it's a huge improvement it still isn't acceptable for a phone system to just stop working once a week on a random transfer.  

My offer of a bounty still exists, I need to know how to go about getting this fixed.

By: Ken Williams (kenw) 2007-06-22 10:49:02

I have FINALLY found a way to reproduce this problem.

All this time I've thought it was a random lockup.  Turns out it's happening the same way every time (more than likely) and no one put 2 & 2 together.

Here's the scenario:
1. Outside call comes in and gets TRNF parked.
2. 45 seconds goes by and it rings back to the transferer (the timeout specified in features.conf)
3. Outside call again gets TRNF parked.
4. 45 seconds goes by, it rings back to the transferer
5. Outside call again gets TRNF parked.  Entire SIP channel dies.

Every time the same call is TRNF parked for the 3rd time, SIP dies and the only resolution is a reboot.

I'll be trimming logs of the 1.4.5 build I put on a few days ago and attaching them shortly.

By: Ken Williams (kenw) 2007-06-22 14:16:13

I've uploaded the trimmed log, it took about 1 hour to go through and get a clean log without all the calls in between (amazing how many calls happen in 3 minutes some days).  I've added a couple of notes marked by *******.  Basically you can watch the call come in, watch it get REFER'd to park(201), watch it ring back, get REFER'd again, ring back, REFER'd again and die.  The 3rd REFER hangs up at:

[Jun 22 09:40:28] DEBUG[15793] channel.c: Planning to masquerade channel Zap/2-1 into the structure of Parking/Zap/2-1
[Jun 22 09:40:28] DEBUG[15793] channel.c: Done planning to masquerade channel Zap/2-1 into the structure of Parking/Zap/2-1

Where as the REFER's before that continue on.

Now that I have a reproducable problem if someone would like access to my box (it would have to be after hours though) I'm happy to open it up, just let me know how you want to go about it.

Thanks!

By: Ken Williams (kenw) 2007-07-19 09:26:53

*UPDATE*

So I had tried damn near everything and was about to throw the whole damn system out when it appears I may have finally found the real problem.

First, things I have done between last post and now.

1. Remove all signs of asterisk & reinstall latest 1.4.x - no good
2. Remove all signs of asterisk & reinstall latest 1.2.x - no good
3. Setup new server, clean install of everything, latest 1.4.x, copied config from existing server - no good (this killed me)
4. Redo configuration from scratch - no good (this REALLY killed me)

Then it dawned on me the only common problem (aside from some network infrastructure possibility that was affecting only the phone system) was the moving of the TDM400 cards (I've got 2 of them).  

I had an extra card, I swapped it out with the card that gets the most usage.  I spent a solid hour last night doing nothing but transfers different ways on version 1.4.6 without a lockup.

This is huge, as 1.4.x has always amplified the SIP lockup more.  

Another thing that I noticed when I created the new config, is a call in on a specific Zap channel despite having the proper context & start info was coming into the dialplan saying it couldn't find (from-zaptel,s,1) trying from-zaptel,s,1, can't find it, trying default.  So I may only have a bad FXO card.  Once I swapped cards this problem went away as well.

When it's been a week with no lockups I'll post back, but I'm about 95% sure we're good to go and the problem is coming from a flaky TDM400 card/module.

By: Ken Williams (kenw) 2007-07-19 14:23:36

Ugh!

One of the other servers I setup that had an uptime of 14 weeks called me from a cell phone.  They had transferred a call to park and the system froze.

I logged into the server and sure enough, exact same symptoms on this box.

I just hope it's as easy as hardware, but it doesn't speak very well when 2 of the 4 TDM400 cards I've bought appear to have this issue.

By: Ken Williams (kenw) 2007-07-20 11:05:32

I give.

It died again, twice today.

I've offered a bounty, no response.  I seem to be the only one in the world affected by this.  I've replaced everything, redone everything, reconfigured everything.  Yet this nightmare continues to haunt me.  It's been months and I can't get the system to stay up for more than a few days.

Regardless of whatever is causing the problem, nothing should stop the SIP channel from responding completely, and I can't take it anymore.

Here's a copy of all my configs rewritten, please, someone look and give me a clue!

By: Ken Williams (kenw) 2007-07-20 13:19:43

Okay, so I am attaching once more a console & full log file of the event.  Beings we're on 1.4.6, these are a bit different then previous logs attached and I think (and hope) they offer more insight as to what's going on.  Couple notes of interest:

1.  When system SIP channel locks up I can do a "SIP SHOW CHANNELS" followed by "SIP SHOW HISTORY xxxxxxxx" and get details
2.  Once locked up, if I do SHOW CHANNELS I get a LOT of spam that says the following repeated for hundreds of lines:

�[Kasterisk-price*CLI> SIPPeer/SIP/717-0973 s@from-internal:1    Down    (None)                        
�[Kasterisk-price*CLI> Parking/Zap/3-1      SIP/717@park-dial:1  Down    (None)                        

3. Once locked up and I've done SHOW CHANNELS, any command repeats the spam from above

4.  The log file shows hundreds of lines repeating:

[Jul 20 12:19:44] DEBUG[10870] channel.c: Avoiding deadlock for channel '0x9768088'
[Jul 20 12:19:44] DEBUG[10870] channel.c: Avoiding deadlock for channel '0x9768088'
[Jul 20 12:19:44] DEBUG[10870] channel.c: Avoiding deadlock for channel '0x9768088'
[Jul 20 12:19:44] DEBUG[10870] channel.c: Avoiding deadlock for channel '0x9768088'

Hope this helps!!!

By: Russell Bryant (russell) 2007-07-20 14:31:52

To debug this, we need some information using gdb.  First, build Asterisk with the following options.

$ make menuselect
Compiler Flags
-> DONT_OPTIMIZE
-> DEBUG_THREADS

Hit 'x' to save and quit

$ make
# make install


Then, get the system to deadlock.  Afterwords, run the following command to attach to Asterisk using gdb, and post all of the output here.

# gdb /usr/sbin/asterisk `pidof asterisk`

(gdb) bt
(gdb) bt full

By: Russell Bryant (russell) 2007-07-20 14:54:07

Note that these instructions are only good for Asterisk 1.4.  If someone can find the problem there, it's likely the same problem in 1.2, and we can backport the fix.

By: Ken Williams (kenw) 2007-07-20 15:46:36

Here are two files from it happening today.  The first was an uncontrolled occurence (it just happened to lock up quite quickly after I had reinstalled).

The second is a controlled occurence (the way that I can make it happen).

By: Ken Williams (kenw) 2007-07-21 14:21:26

lol

That's all I'll say....while I upload this file...

By: Russell Bryant (russell) 2007-07-21 14:23:44

Ha ... I'm sorry, I gave you the wrong gdb commands.  Those would have been right if we were dealing with a crash.  :(  The correct ones would be:

(gdb) thread apply all bt
(gdb) thread apply all bt full

By: Ken Williams (kenw) 2007-07-23 18:34:54

Here ya go!

By: Ken Williams (kenw) 2007-07-26 17:07:55

Server has been up on 1.4.6 for nearly 2 days now, a record on the 1.4.x.

I'm 99.9% certain I've found the problem, and I'm extremely embarassed & ashamed at what it was.  When it's still up come Monday I'll post the problem and see how you guys can help me fix it.

By: Ken Williams (kenw) 2007-07-30 10:48:45

Well, the problem was a modification I was making to the parking lot transfers (go figure).  Why or how I never thought of this before last week is beyond me (thus the embarassed & ashamed part).

When we put someone on hold, the way I have the phones setup we don't care which parking lot extension they get transferred to, not to mention the way the Grandstream phones do the TRNF you don't get to here the parking lot extension play back anyway.  All it does on Grandstream phones is take 2-3 seconds to put a call on hold, which is a bit annoying.  So, I had found the source for SAY_DIGITS and remarked it out.  Specifically line 426 in the 1.4.6 version of res_features.c.  

       if (peer && pu->parkingnum != -1) /* Only say number if it's a number */
               ast_say_digits(peer, pu->parkingnum, "", peer->language);

Apparently taking this delay to say the digits out randomly causes a channel lockup.  I'd like to remark this out so our transfers are instant, but for now we're living with the longer transfer for the sake of a phone system that doesn't randomly lock.

So, any thoughts?  Should I open a new request ticket?

(BTW, server up time 5 days on 1.4.6, never happen in the 1.4.x before.)

By: Russell Bryant (russell) 2007-07-30 12:50:23

1) We do not support patched systems.

2) We do not accept feature requests on the bug tracker.  However, if you have a patch you would like to submit, please post it in another report.  If you would like to discuss a change, I would recommend doing it on the asterisk-dev mailing list.

By: Ken Williams (kenw) 2007-08-02 09:46:37

Okay, so I was mistaken.  The server died Monday evening and I was hoping it was a fluke.  I removed all signs of Asterisk & installed 1.4.9 Monday night as a clean install, no modification to source whatsoever.

SIP just locked up again, here is the gdb information, I hope it helps.

By: Russell Bryant (russell) 2007-08-02 10:12:04

Thanks for the update.

I recently made some improvements for deadlocking debugging that should make figuring this out a lot easier.  I'm going to port it to 1.4 right now so that you can use it on your system.  I'll leave a note when it is ready.

By: Russell Bryant (russell) 2007-08-02 11:04:44

Ok, download and install the following branch.  It is just 1.4 with some additional debugging code.

svn co http://svn.digium.com/svn/asterisk/team/russell/debug_threads_1.4

Enable DONT_OPTIMIZE and DEBUG_THREADS
$ make menuselect
-> Compiler Flags
'x' to save and quit

When this happens, grab the output of the "core show locks" CLI command in additional the gdb output you are already getting.

That should tell us exactly which locks are causing the problem, as well as which threads are holding them.

By: Russell Bryant (russell) 2007-08-03 14:25:59

These DEBUG_THREADS changes are now in the 1.4 branch, so you don't have to use my branch anymore.

By: Ken Williams (kenw) 2007-08-03 17:15:13

Attached is the cli from the core show locks from today's lockup.

By: Russell Bryant (russell) 2007-08-03 17:39:45

I can't seem to open that last file you uploaded ...

By: Russell Bryant (russell) 2007-08-03 17:41:02

Nevermind, I got it.  It's just not quite plain text.  :)

By: Russell Bryant (russell) 2007-08-09 11:10:00

I have tried and tried and can't seem to reproduce this ...

By: Ken Williams (kenw) 2007-08-09 11:16:42

Well, I'm happy to arrange something with regards to remote access to the box while I continuously park calls until it locks up.

The 'problem' is it can be up for 5 days or lock up 3 times in a day.  So when I'm trying to make it fail it can take a while.  

I'm willing to do whatever it takes to get it fixed though.

By: Russell Bryant (russell) 2007-08-10 10:41:43

Could you please update, build with DEBUG_THREADS, and get the "core show locks" output again?  I realized that I messed something up which made the output not very useful ...

By: Ken Williams (kenw) 2007-08-10 10:59:34

I only see 1.4.10 for download, I noticed on one of the other bugs similar to mine that you asked to download 1.4.10.1.

By: Russell Bryant (russell) 2007-08-10 11:12:30

Sorry.  I mentioned the release here before finishing the release.  :)  It's posted now ...

By: Ken Williams (kenw) 2007-08-13 16:33:09

Here's one from a lockup we just had, hope it helps.

Followed by one 10 minutes later.



By: Russell Bryant (russell) 2007-08-23 15:52:54

Give the attached patch a try.  I made it against the latest code in the 1.4 branch but it should also apply to the latest release.

By: Ken Williams (kenw) 2007-08-29 14:06:20

Been up since I applied it last Thursday, 6 days is a record in 1.4.x.  I'm going to wait 2-3 weeks before I say it's really fixed, but for now it seems better.

By: callguy (callguy) 2007-08-29 14:34:40

is there any functional difference between the 8/23 and 8/27 patch or is it just formatting? I'm assuming we should test with the latter in any event.

By: callguy (callguy) 2007-08-29 14:43:11

disregard - i see the 2nd patch is for 1.2

By: Ken Williams (kenw) 2007-09-06 15:23:01

Two weeks of no SIP lockups.

I'd call it success.

By: Digium Subversion (svnbot) 2007-09-06 17:10:35

Repository: asterisk
Revision: 81832

------------------------------------------------------------------------
r81832 | russell | 2007-09-06 17:10:34 -0500 (Thu, 06 Sep 2007) | 16 lines

(closes issue ASTERISK-9438, closes issue ASTERISK-10017)
Reported by: kenw
Patches:
     9724.txt uploaded by russell (license 2)
Tested by: kenw, russell

Resolve a deadlock that occurs when doing a SIP transfer to parking.  

I come across this type of deadlock fairly often it seems.  It is very important
to mind the boundary between the channel driver and the core in respect to the
channel lock and the channel-pvt lock.  Channel drivers lock to lock the
pvt and then the channel once it calls into the core, while the core will do
it in the opposite order.  The way this is avoided is by having channel drivers
either release their pvt lock while calling into the core, or such as in this
case, unlocking the pvt just long enough to acquire the channel lock.

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-09-06 17:11:54

Repository: asterisk
Revision: 81839

------------------------------------------------------------------------
r81839 | russell | 2007-09-06 17:11:53 -0500 (Thu, 06 Sep 2007) | 24 lines

Merged revisions 81832 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81832 | russell | 2007-09-06 17:28:57 -0500 (Thu, 06 Sep 2007) | 16 lines

(closes issue ASTERISK-9438, closes issue ASTERISK-10017)
Reported by: kenw
Patches:
     9724.txt uploaded by russell (license 2)
Tested by: kenw, russell

Resolve a deadlock that occurs when doing a SIP transfer to parking.  

I come across this type of deadlock fairly often it seems.  It is very important
to mind the boundary between the channel driver and the core in respect to the
channel lock and the channel-pvt lock.  Channel drivers lock to lock the
pvt and then the channel once it calls into the core, while the core will do
it in the opposite order.  The way this is avoided is by having channel drivers
either release their pvt lock while calling into the core, or such as in this
case, unlocking the pvt just long enough to acquire the channel lock.

........

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-09-08 09:14:35

Repository: asterisk
Revision: 81970

------------------------------------------------------------------------
r81970 | murf | 2007-09-08 09:14:33 -0500 (Sat, 08 Sep 2007) | 260 lines

Merged revisions 81683,81712,81730,81742,81744-81745,81777,81779,81781-81785,81813-81814,81827,81839,81849,81858,81873,81891,81910,81924,81953 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r81683 | mmichelson | 2007-09-06 09:21:45 -0600 (Thu, 06 Sep 2007) | 13 lines

Merged revisions 81682 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81682 | mmichelson | 2007-09-06 10:20:36 -0500 (Thu, 06 Sep 2007) | 5 lines

Fixes a memory leak

(closes issue ASTERISK-10247, reported and patched by Ivan)


........

................
r81712 | rizzo | 2007-09-06 09:43:49 -0600 (Thu, 06 Sep 2007) | 8 lines

various changes to the documentation, and redefinition of
ao2_hash_fn and ao2_callback_fn typedefs, in preparation
to more cleanup of the _search_flags

Please do not merge this change to 1.4 yet - there are no
functional changes anyways.


................
r81730 | mmichelson | 2007-09-06 10:31:30 -0600 (Thu, 06 Sep 2007) | 14 lines

Merged revisions 81713 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81713 | mmichelson | 2007-09-06 11:25:40 -0500 (Thu, 06 Sep 2007) | 6 lines

Fixes an issue where valid DTMF had to be pressed twice to exit a queue if a member's phone
was ringing.

(closes issue ASTERISK-10245, reported by strider2k, patched by me)


........

................
r81742 | mattf | 2007-09-06 10:38:54 -0600 (Thu, 06 Sep 2007) | 1 line

Patch on 10575.  Add support for unequipped CIC (UCIC) message as well as improve some of our CIC flags in chan_zap
................
r81744 | tilghman | 2007-09-06 10:57:20 -0600 (Thu, 06 Sep 2007) | 2 lines

Incorporate the ability to log output of safe_asterisk to syslog (closes issue ASTERISK-9588)

................
r81745 | phsultan | 2007-09-06 11:00:58 -0600 (Thu, 06 Sep 2007) | 9 lines

Merged revisions 81743 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81743 | phsultan | 2007-09-06 18:56:29 +0200 (Thu, 06 Sep 2007) | 1 line

Various string length fixes. Removed an unused variable in aji_client structure (context)
........

................
r81777 | file | 2007-09-06 13:43:20 -0600 (Thu, 06 Sep 2007) | 15 lines

Merged revisions 81776 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81776 | file | 2007-09-06 16:40:37 -0300 (Thu, 06 Sep 2007) | 7 lines

(closes issue ASTERISK-9809)
Reported by: stevefeinstein
Patches:
     meetme-unmute-manager.diff uploaded by qwell (license 4)
Tested by: stevefeinstein
After looking over the code I agree with Qwell. Setting the file descriptor to conference each time just causes a fight back and forth.

........

................
r81779 | qwell | 2007-09-06 14:00:08 -0600 (Thu, 06 Sep 2007) | 10 lines

Merged revisions 81778 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81778 | qwell | 2007-09-06 14:59:07 -0500 (Thu, 06 Sep 2007) | 2 lines

This should fix a build issue that people building against uClibc were seeing with the addition of astobj2

........

................
r81781 | russell | 2007-09-06 14:05:50 -0600 (Thu, 06 Sep 2007) | 2 lines

Fix the syntax of declaring a hint with a name to be compatible with trunk

................
r81782 | file | 2007-09-06 14:16:02 -0600 (Thu, 06 Sep 2007) | 6 lines

(closes issue ASTERISK-10019)
Reported by: mvanbaak
Patches:
     chan_skinny_info.diff uploaded by mvanbaak (license 7)
Add skinny show device, skinny show line, and skinny show settings CLI commands.

................
r81783 | russell | 2007-09-06 14:24:18 -0600 (Thu, 06 Sep 2007) | 4 lines

Merge HINT() dialplan function from my sandbox branch into trunk.  This function
will let you retrieve the list of devices or name associated with a hint.
(inspired by issue ASTERISK-10226)

................
r81784 | russell | 2007-09-06 14:27:26 -0600 (Thu, 06 Sep 2007) | 4 lines

Rename the DEVSTATE() function to DEVICE_STATE() to better conform to how other
functions are named.
(inspired by issue ASTERISK-10226)

................
r81785 | russell | 2007-09-06 14:27:53 -0600 (Thu, 06 Sep 2007) | 2 lines

s/DEVSTATE/DEVICE_STATE/

................
r81813 | russell | 2007-09-06 14:54:07 -0600 (Thu, 06 Sep 2007) | 5 lines

Add EXTENSION_STATE() function that can retrieve the state of an extension that
has a hint.

(closes issue ASTERISK-10226, adamgundy)

................
r81814 | file | 2007-09-06 15:01:10 -0600 (Thu, 06 Sep 2007) | 2 lines

Initialize iax_frames variable to NULL, keeps valgrind happy.

................
r81827 | qwell | 2007-09-06 16:06:17 -0600 (Thu, 06 Sep 2007) | 9 lines

Merged revisions 81826 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81826 | qwell | 2007-09-06 17:05:02 -0500 (Thu, 06 Sep 2007) | 1 line

We added COPTS for ASTCFLAGS additions, but not LDOPTS for ASTLDFLAGS.  This adds LDOPTS
........

................
r81839 | russell | 2007-09-06 16:30:15 -0600 (Thu, 06 Sep 2007) | 24 lines

Merged revisions 81832 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81832 | russell | 2007-09-06 17:28:57 -0500 (Thu, 06 Sep 2007) | 16 lines

(closes issue ASTERISK-9438, closes issue ASTERISK-10017)
Reported by: kenw
Patches:
     9724.txt uploaded by russell (license 2)
Tested by: kenw, russell

Resolve a deadlock that occurs when doing a SIP transfer to parking.  

I come across this type of deadlock fairly often it seems.  It is very important
to mind the boundary between the channel driver and the core in respect to the
channel lock and the channel-pvt lock.  Channel drivers lock to lock the
pvt and then the channel once it calls into the core, while the core will do
it in the opposite order.  The way this is avoided is by having channel drivers
either release their pvt lock while calling into the core, or such as in this
case, unlocking the pvt just long enough to acquire the channel lock.

........

................
r81849 | russell | 2007-09-06 16:32:03 -0600 (Thu, 06 Sep 2007) | 1 line

fix the build ... oops
................
r81858 | file | 2007-09-06 16:34:44 -0600 (Thu, 06 Sep 2007) | 2 lines

Fix memory issue that crept up with Russell's testing. It is *not* proper to free the frame we get in ast_write.

................
r81873 | file | 2007-09-07 06:32:04 -0600 (Fri, 07 Sep 2007) | 2 lines

Don't check for epoll support when cross compiling.

................
r81891 | mmichelson | 2007-09-07 09:29:23 -0600 (Fri, 07 Sep 2007) | 11 lines

Merged revisions 81886 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81886 | mmichelson | 2007-09-07 10:25:19 -0500 (Fri, 07 Sep 2007) | 3 lines

Moving the explanation for joinempty to a more appropriate place


........

................
r81910 | qwell | 2007-09-07 10:13:57 -0600 (Fri, 07 Sep 2007) | 4 lines

Add an optional reason parameter to PauseQueueMember/UnpauseQueueMember applications and manager events.

Issue 8738, patch by rgollent

................
r81924 | qwell | 2007-09-07 13:53:30 -0600 (Fri, 07 Sep 2007) | 14 lines

Merged revisions 81923 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

(closes issue ASTERISK-10255)
........
r81923 | qwell | 2007-09-07 14:48:00 -0500 (Fri, 07 Sep 2007) | 5 lines

Allow the MEMBERINTERFACE variable to be used as the mixmonitor filename.
This moves the setting of the MEMBERINTERFACE variable to before mixmonitor.

Issue 10671, patch by sim.

........

................
r81953 | russell | 2007-09-08 07:45:40 -0600 (Sat, 08 Sep 2007) | 19 lines

Merged revisions 81952 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r81952 | russell | 2007-09-08 08:42:26 -0500 (Sat, 08 Sep 2007) | 11 lines

(closes issue ASTERISK-10256)
Bump the cleancount so that a "make clean" will be forced.  This is needed
because my fix in revision 81599 made a change to a data structure in file.h,
and since file dependency tracking is only on with dev-mode enabled, file
format modules that don't get rebuilt may crash, as is the case with this issue.

This makes me wonder - how much faster does the code build without the file
dependency tracking enabled?  If it doesn't make much of a difference, then it
may be worth just keeping it on all of the time, or perhaps just not in release
tarballs, so that this type of issue is avoided.

........

................

------------------------------------------------------------------------