|Summary:||ASTERISK-06637: SIP channels hang in semi-active state|
|Reporter:||Roy Sigurd Karlsbakk (rkarlsba)||Labels:|
|Date Opened:||2006-03-28 08:26:16.000-0600||Date Closed:||2007-02-07 10:59:46.000-0600|
|Environment:||Attachments:||( 0) oej-sip-2006-06-20.patch|
( 1) res_agi_patch.txt
( 2) sipshowdownstuff.txt
With a setup with ATAs <-sip-> asterisk sip/sip gw <-sip-> sip/pstn gws, we keep seeing hanging SIP channels on the sip/sip gw. Measuring the total channels on the pstn gateways and the total sip channels shows this 'leaks', just a few, 5-10 channels a day or so.
I'm unsure of how to debug this...
|Comments:||By: Andrey S Pankov (casper) 2006-03-28 10:06:59.000-0600|
Can you show at least 'sip show channels'?
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-03-28 10:33:24.000-0600
May I email you that? I don't know what to look for, and I do not want to disclose all that. sorry.
By: Andrey S Pankov (casper) 2006-03-28 11:06:11.000-0600
Wipe out IP addresses from the 1st column and past the output here (attach if the output is too large).
By: Olle Johansson (oej) 2006-03-28 19:03:30.000-0600
...and a "sip show channel <name>" for the hanging channel, possibly also sip history. Thanks.
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-03-31 05:22:05.000-0600
These 'hanging channels' show up as these
SIP/pstngw2-311d (None) Down AppDial((Outgoing Line))
SIP/1005781-5a5f 62424300@default:1 Ring Dial(SIP/pstngw2/62424300)
See attached file sipshowdownstuff.txt for output from 'show channel ...' for these two
By: Olle Johansson (oej) 2006-03-31 17:54:46.000-0600
Hmm. I asked for "sip show channel" and got "show channel"....
By: Olle Johansson (oej) 2006-03-31 17:55:01.000-0600
Both would be excellent!
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-02 06:24:56
I can't see how sip show channels can be relevant when this one shows one hanging channel's details, that being AFAICS the channel pair is hanging on ast_waitfor_nandfds
By: Olle Johansson (oej) 2006-04-03 15:15:08
Roy, that it hangs there does not tell me anything on what happened before or how the SIp channel was configured...
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-11 05:10:42
anymore news here, olle?
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-13 11:04:05
we just had a crash outside working hours (bug 6936), and it seems this no longer happens after the crash. after the crash, however, there has been no app_queue activity since it's been night and national norwegian holiday, so it _could_ be something only happening when sip channels are queued up...
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-24 09:51:00
This is still happening at all times of the day
Are anyone woring on this?
By: Andrey S Pankov (casper) 2006-04-24 11:05:45
Can you upload your extensions.conf file please?
By: Janahan (jv) 2006-04-25 18:37:43
I'm seeing similar behavior. We're using fastAGI to answer a call and record voice. Once in a while the fastAGI request never returns even though the remote user has hung up.
I just applied the patch for 6647, I'm wondering if the root cause of this bug is related to 6647. Unfortunately because of the infrequency of this problem on our system I can't verify for sure.
Roy have you applied 6647 to your system?
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-04-26 07:31:01
casper: extensions.conf shouldn't really be of any relevance. It's a rather massive AGI script that does it all.
However - we had some problems with a database server and the problem is quite less visible after upgrading the db server, so it looks like it's related to db timeouts etc.
By: Andrey S Pankov (casper) 2006-04-26 12:08:42
I mean there may be problems if you do not call Hangup() or use Busy() or Congestion() without a timeout.
By: Andrey S Pankov (casper) 2006-04-26 12:11:53
jv: 6647 is one of my fixes. I'm happy it helped you.. ;)
rkarlsba: it would be nice if when reporting a bug you describe the environment. It may be AGI related, not SIP...
By: Janahan (jv) 2006-04-27 16:48:29
I'm adding this note here because it might be the cause of rkarlsba's issue as well (he's using AGI).
In res/reg_agi.c, the function 'handle_recordfile' invokes 'ast_waitfor' with '-1' as the timeout even if a valid timeout value is passed in. Any ideas why this is?
In my setup, sometimes the SIP 'BYE' doesn't get delievered to me(poor interconnection). At this point the remote end stops sending RTP packets as well. This causes the channel to block forever because there's nothing to kick it out of 'ast_waitfor_nandfds'. While I agree that the root cause is really a problem on the remote end, I don't think it should cause the channel to hang.
One option would be to set the timeout in to ast_waitfor to what ever is passed in by the caller. But I'm not sure if there are any side effects. I tried this on my system and things work well.
P.S. thanks casper....:-)
By: Serge Vecher (serge-v) 2006-05-04 13:40:47
jv: can you please produce a patch with your change to be tried?
By: Joshua C. Colp (jcolp) 2006-05-08 10:44:13
You could also use rtptimeout so that the channel will go away if no RTP traffic is received.
By: Janahan (jv) 2006-05-08 14:14:15
sorry for the delay. I just uploaded the file. plese let me know if the format is incorrect.
joshnet: does a non-zero 'rtptimeout' override the the timeout value used inside of ast_waitfor_nandfds?
By: Andrey S Pankov (casper) 2006-05-08 14:27:12
vechers: if this is an AGI issue, could you please ask junky to comment on the issue (I can't post a reminder from here). He is a big AGI specialist...
By: Serge Vecher (serge-v) 2006-05-08 14:36:30
junky: please take a look at this patch here. Thanks.
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-05-09 03:07:12
joshnet: i'm using rtptimeout, but the leak's still there....
i'll try to see if the patch fixes the problem
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-05-10 07:31:18
* still leaks after applying the res_agi patch....
By: Andrey S Pankov (casper) 2006-05-15 14:52:25
Please add relation to ASTERISK-6911 and change category to AGI in both. Thanks!
By: Serge Vecher (serge-v) 2006-05-19 14:51:22
roy: what is the revision of 1.2 you have tried and did you narrow down the module that leaks memory?
By: Andrey S Pankov (casper) 2006-05-19 16:44:44
> did you narrow down the module that leaks memory?
nothing "leaks memory" here... there are hang channels in semi-active state.
By: Andrey S Pankov (casper) 2006-05-23 19:47:44
It seems like it is not SIP related at all... rather AGI related...
or related to a specific AGI script
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-05-24 08:26:02
In any case, this never happened on 1.0
By: Serge Vecher (serge-v) 2006-05-24 08:30:19
>In any case, this never happened on 1.0
Well, I don't think anyone at this point will try to narrow down the exact "breakage" point by examining daily cvs versions between 1.0 and 1.2. Unless you will.
By: Serge Vecher (serge-v) 2006-06-02 10:29:19
royk: can you please turn sip history on and post the log here. Also, please specify the latest version this is exhitibited on. Thanks.
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-06-19 04:20:22
Currently running 184.108.40.206
This channel is hung
sipgw1 *CLI> show channel SIP/pstngw2-8aba
-- General --
Caller ID: 9593185
Caller ID Name: (N/A)
DNID Digits: (N/A)
State: Down (0)
1st File Descriptor: 412
Frames in: 9817
Frames out: 100
Time to Hangup: 0
Elapsed Time: N/A
Direct Bridge: <none>
Indirect Bridge: <none>
-- PBX --
Call Group: 0
Pickup Group: 0
Data: (Outgoing Line)
Blocking in: ast_waitfor_nandfds
sipgw1 *CLI> sip show history firstname.lastname@example.org
* SIP Call
1. TxReqRel INVITE / 102 INVITE
2. ReTx 2 INVITE sip:email@example.com SIP/2.0
3. ReTx 4 INVITE sip:firstname.lastname@example.org SIP/2.0
4. ReTx 8 INVITE sip:email@example.com SIP/2.0
5. ReTx 16 INVITE sip:firstname.lastname@example.org SIP/2.0
6. ReTx 32 INVITE sip:email@example.com SIP/2.0
7. Rx SIP/2.0 / 102 INVITE
9. Rx SIP/2.0 / 102 INVITE
11. Rx SIP/2.0 / 102 INVITE
13. Rx SIP/2.0 / 102 INVITE
15. Rx SIP/2.0 / 102 INVITE
17. Unhold SIP/2.0
By: Olle Johansson (oej) 2006-06-19 05:15:50
Why is it saying "Unhold" but not "hold"... Weird...
By: Olle Johansson (oej) 2006-06-20 05:35:13
Patched 1.2 and svn trunk in different ways. Please confirm whether this fixes your issue or not.
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-06-20 07:09:21
hi. after patching up with oej-sip-2006-06-20.patch (attached), I still see hanging channels......
* SIP Call
1. TxReqRel INVITE / 102 INVITE
2. Rx SIP/2.0 / 102 INVITE /100 Trying
4. Rx SIP/2.0 / 102 INVITE /183 Session Progress
6. Unhold SIP/2.0
7. Rx SIP/2.0 / 102 INVITE /200 OK
9. Unhold SIP/2.0
10. TxReq ACK / 102 ACK
11. Rx SIP/2.0 / 102 INVITE /200 OK
13. Unhold SIP/2.0
14. TxReq ACK / 102 ACK
15. Rx SIP/2.0 / 102 INVITE /200 OK
17. Unhold SIP/2.0
18. TxReq ACK / 102 ACK
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-06-22 11:31:33
Are these the same patches you sent me?
If not, may I have this as a separate patch, since I'd rather not upgrade production rig to something in svn....
By: Olle Johansson (oej) 2006-06-22 12:47:53
The stuff I committed is the same as what I sent you for testing, apart from the patch that added a lot of history that only you have.
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-06-22 15:34:01
Then the effect is negative, or null
Sorry about that
Same sort of sip history from the dead channels
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-07-07 09:12:14
What more can be done here to debug this further? As I've said, the system's in production, and a full SIP DEBUG will quite probably be impossible.
By: Serge Vecher (serge-v) 2006-08-01 08:15:11
does 7459 fix this issue as well?
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-08-01 09:43:40
Is there a patch available? I really don't want to run svn in production if I can help it, not even 1.2
By: Serge Vecher (serge-v) 2006-08-01 16:19:21
the patch was committed directly to 1.2 and trunk. You can make your own patch from the following information http://lists.digium.com/pipermail/svn-commits/2006-July/015544.html
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-08-04 04:47:56
Patch applied yesterday. The bug usually takes a few days showing, so we'll have to wait a little.
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-08-14 05:36:55
It might look like this is fixed now, so just close this for now. I'll reopen it if the error reappears.
By: Serge Vecher (serge-v) 2006-08-14 09:54:42
presumably fixed by r38611 in 1.2. Please reopen if not the case.
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-08-30 08:56:26
This still seems to be a problem, guys
By: Serge Vecher (serge-v) 2006-09-05 11:05:09
interesting, because 7664 was fixed by 1.2.11 ...
By: Serge Vecher (serge-v) 2006-09-27 13:38:00
rkarlsba: what's the deal with this bug. An issue in 220.127.116.11?
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-09-27 19:35:10
Give me a week or so, and I'll report back. This bug takes a little time tracking
By: Olle Johansson (oej) 2006-10-29 10:40:55.000-0600
...and a week turned into a month.
By: Olle Johansson (oej) 2006-10-29 10:41:20.000-0600
And please test with latest 1.2 in subversion. Thanks.
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-10-29 13:30:27.000-0600
reproduced with 18.104.22.168, yes, but it's a little bit harder now due to ASTERISK-7869 crashing the system regularly. Also, I really do not want to run svn on systems in production.
By: Olle Johansson (oej) 2006-11-24 02:09:18.000-0600
Any updates? Do you still see this on latest 1.2 trunk ?
By: Roy Sigurd Karlsbakk (rkarlsba) 2006-11-24 05:34:23.000-0600
I've seen it with the latest release, although not lately, since asterisk has been crashing regularly, and this takes some time to reproduce. Asterisk now looks stable, so I may see it again in a week's time or so
By: Serge Vecher (serge-v) 2007-01-09 13:07:59.000-0600
please reopen if you are able to reproduce this with 1.2.14 without backported patches.
By: Roy Sigurd Karlsbakk (rkarlsba) 2007-01-09 15:26:46.000-0600
The backported patch (the jitterbuffer submitted to 1.4) does not relate to the issue - how could it?
Also, I obviously have no possibility to remove the jitterbuffer, as that involves losing customers
Have anyone else running a high-volume asterisk server tried monitoring over days/weeks with mrtg as described here? http://karlsbakk.net/asterisk/
By: Serge Vecher (serge-v) 2007-01-09 15:48:51.000-0600
Via a deductive analysis: since you are the only one seeing the issue and since you run a system with complex backported patches, the problem must be caused by the patches. On of the ways to verify that the patches are not the cause of the stuck channels would be to reproduce the problem on a system without the patches, but since you refuse to do that, this report itself is 'stuck' in the cycle of opening and reopening.
By: Roy Sigurd Karlsbakk (rkarlsba) 2007-01-10 13:19:24.000-0600
well, then just close it, and i'll just restart asterisk every now and then. we really cannot afford removing the jitterbuffer, and at the moment we can't upgrade to 1.4, not before it's been more thoroughly tested.
By: Serge Vecher (serge-v) 2007-01-10 13:28:07.000-0600
Thanks. Please reopen the report if new information is available as per note 0057453.
By: Leif Madsen (lmadsen) 2007-02-01 07:42:26.000-0600
Reopened per request of 'burnproof' in IRC.
By: Ronald Chan (loloski) 2007-02-01 07:53:19.000-0600
This is reproducible to me anytime, i'm using SVN-branch-1.2-r52954 with no external or unsupported patches just a plain asterisk, i will post a new data today or tommorow as soon as the call center start operating.
what's weird with this bugs was there's an RTP activity is going on here, even the softphone or hardware phone was offline, the tech guy on the provider is also wondering why we have 1 1/2 call duration on their CDR's and i said the maximum call should be last for 5 to 10 minutes / call.
As soon this bug emerge i will post new data here. for evaluation
By: Ronald Chan (loloski) 2007-02-06 05:57:54.000-0600
Please close this for now, the problem doesn't resurface since i ask to reopen this, the only significant change i made from my end is changing our codec from gsm to ulaw since bandwidth is not a problem according to client.
sorry for wasting your time :)