[Home]

Summary:ASTERISK-08969: [patch] Asterisk leaves lingering UDP ports until no more ports available
Reporter:Bluce Ree (tasker)Labels:
Date Opened:2007-03-08 10:27:37.000-0600Date Closed:2010-03-15 11:16:21
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) history_bad_call
( 1) history_good_call
( 2) hung_bye.diff.txt
( 3) hung_cancel_log.txt
( 4) lingering_udp_chan_sip.patch8-against-1.2.18.txt
( 5) lingering_udp_chan_sip.patch8-against-1.4.4.txt
( 6) sip_show_channels
( 7) sip_show_channels2
Description:After a couple of days with lots of traffic (moderate load), running "netstat -a" shows hundreds of lingering UDP ports, which eventually saturates the available handles and asterisk stops accepting media connections.

This occurs in all 1.4.x versions and versions after 1.2.14. This does not occur in 1.2.13 and prior versions.
Comments:By: Serge Vecher (serge-v) 2007-03-08 11:00:31.000-0600

which channel driver(s) do you use? Do you think you can narrow down the 'breakage' to a specific revision between 1.2.13 and .14 releases?

By: Bluce Ree (tasker) 2007-03-08 12:15:35.000-0600

Started at 1.2.14 and 1.4.0. I have not tried any trunk versions prior to 1.4 but current svn trunk also has the problem.

This is using the SIP channel driver. I have no means to test H.323 or IAX.

By: Serge Vecher (serge-v) 2007-03-08 12:20:58.000-0600

do you have any patches on your system, or these installations are straight from the tarball?

By: Bluce Ree (tasker) 2007-03-08 13:40:16.000-0600

Straight from tarball. No patches.

By: Bluce Ree (tasker) 2007-03-10 20:06:42.000-0600

I just queried netstat -a on a system running 1.2.13 and 1.4.1 after running both in parallel for last 24 hours. Both systems are identical blades running FC4.

Here's the status taken right now with 0 calls online.

The first (1.2.13) has no lingering UDP ports.

The second (1.4.1) has 398 UDP ports lingering.

By: Serge Vecher (serge-v) 2007-03-12 09:40:20

see if you notice any differences in the sip show channels output.

By: Bluce Ree (tasker) 2007-03-13 01:07:21

That's what I did in my last note.

"Here's the status taken right now with 0 calls online.

The first (1.2.13) has no lingering UDP ports.

The second (1.4.1) has 398 UDP ports lingering."

The same 1.4.1 box right now:

bld-6*CLI> show uptime
System uptime: 3 days, 11 hours, 55 minutes, 44 seconds
Last reload: 3 days, 5 hours, 47 minutes, 3 seconds

A "show channels" shows 2 active calls and 4 active channels.

A "netstat -a" shows 854 lingering UDP ports. If I restart Asterisk at this point, the lingering gradually ports disappear.

By: Bluce Ree (tasker) 2007-03-15 13:21:39

*bump*

By: Joshua C. Colp (jcolp) 2007-03-15 14:30:43

What does sip show channels show? It is possible for an Asterisk channel to not be around but a SIP dialog to still be up occupying an RTP and RTCP port. The question would be, why did they not get destroyed.

By: Serge Vecher (serge-v) 2007-03-15 14:34:20

ok, since you've narrowed down the breakage to between the 1.2.13 and .13 releases, the next step is to examine specific revisions between these releases. The releases were done for the following revisions of the 1.2 branch:
1.2.13/ 45413
1.2.14/ 48468

If you check out and test specific revisions (start with changes in chan_sip.c and rtp.c first http://svn.digium.com/view/asterisk/branches/1.2/channels/chan_sip.c?rev=58847&view=log)
that should hopefully provide the revision that "broke it".

By: Bluce Ree (tasker) 2007-03-16 22:21:48

A "sip show channels" shows 35 open dialogs, "netstat -a" shows around 100 open UDP ports and "show channels" shows 0 calls.

By: Anthony LaMantia-2 (anthonyl) 2007-03-17 17:50:59

can you provide any infomation about your current sip setup. do you have many peers? are you sending out many MWI notifcations..

By: Joshua C. Colp (jcolp) 2007-03-19 11:36:01

The actual output and content of sip show channels was what I was looking at... to see what state the SIP dialogs are in, and a sip show channel <name> of one would also be useful. Add as an attachment. Thanks!

By: Anthony LaMantia-2 (anthonyl) 2007-03-19 12:51:16

hi file, quick question.. is sip_alloc() the only place a new udp sockets (it looks like 2 via ast_rtp_new_bindaddr) are bound? .. i was scanning over calls to sip_alloc() to make sure they were either destoryed after the function which called it destoryed the session or sechd'ed it for dectruction and most all of the cases pretty much seem to do so..

By: Bluce Ree (tasker) 2007-03-19 18:58:14

Issuing "sip show channels" currently shows 1164 sip channels. Issuing "show channels" shows only 138 open channels.

The system has no peers. It accepts inbound calls and checks the originating IP in real-time against a database in an AGI application.

Here's partial output:

Peer             User/ANR    Call ID      Seq (Tx/Rx)  Form  Hold     Last Message

216.xx.xx.69   9739681582  12b07824096  00103/00102  unkn  No       Tx: ACK
216.xx.xx.69   9739254877  3fa55777561  00103/00102  unkn  No       Tx: ACK
216.xx.xx.69   6315258681  7e45151d37e  00103/00102  unkn  No       Tx: ACK
216.xx.xx.69   9739311309  7b36bed0537  00103/00103  unkn  No       Tx: ACK
216.xx.xx.69   9144885200  0cd2bef4774  00103/00102  unkn  No       Tx: ACK
216.xx.xx.69   6319682316  4af8fb995be  00103/00102  unkn  No       Tx: ACK
216.xx.xx.69   8626868904  047779d5406  00103/00103  unkn  No       Tx: ACK

Shouldn't these open dialogs time out and get destroyed?



By: Serge Vecher (serge-v) 2007-03-20 08:37:27

Alright, we'll need to see some logs here. Pick one IP and in the instruction below, replace "sip debug" with "sip set debug ip aaa.bbb.ccc.ddd"
1) Prepare test environment (reduce the amount of unrelated traffic on the server);
2) Make sure your logger.conf has the following line:
  console => notice,warning,error,debug
3) restart Asterisk with the following command:
  'asterisk -Tvvvvvdddddngc | tee /tmp/verbosedebug.txt'
4) Enable SIP transaction logging with the following CLI commands (1.4/trunk commands in parenthesis):
set debug 4 (core set debug 4)
set verbose 4 (core set verbose 4)
sip debug (sip set debug)
5) Reproduce the problem
6) Trim startup information and attach verbosedebug.txt to the issue.

By: Serge Vecher (serge-v) 2007-03-20 08:50:47

linking to bug 8036 based on note 0061087 there.

tasker: where you able to work on narrowing down a specific revision as per my note 0060940.

By: Bluce Ree (tasker) 2007-03-20 19:57:22

Unfortunately, this is happening in a production environment. I have both time and resource constraints.

By: Serge Vecher (serge-v) 2007-03-26 12:07:49

tasker: still need to see those logs ...

By: cmaj (cmaj) 2007-03-26 13:48:14

I'm seeing the exact same problems in all 1.4 releases, including 1.4.2.  "sip show channels" shows hundreds of channels with last message as "Init: INVITE"

I tried turning on more intense SIP logging, but it crashed * and this is a production server, so I can't turn the logging back on right now.

Also, I'm running this on a server behind another server running the iptables firewall with SIP connection tracking enabled.



By: Bluce Ree (tasker) 2007-04-07 14:22:07

After 4 days:

"sip show channels - 4153 active sip channels"

Screen also full of these:

"[Apr  7 15:07:55] ERROR[15759]: rtp.c:1913 ast_rtp_new_with_bindaddr: No RTP ports remaining. Can't setup media stream for this call.
[Apr  7 15:07:55] WARNING[15759]: udptl.c:819 ast_udptl_new_with_bindaddr: No UDPTL ports remaining"

Asterisk rejects calls at this point and required a reboot.

I maintain my recommendation that this be bumped to critical.

By: Badalian Vyacheslav (slavon) 2007-04-09 08:53:10

SVN-trunk-r60763 - bug still here

By: cmaj (cmaj) 2007-04-09 09:32:30

I second the move to critical.  I've disabled SIP on my production server because of this problem.

I want to rule out my firewall setup.  To others experiencing this problem -- are you behind a firewall ?  Using SIP connection tracking with iptables ?  Thanks.

By: Serge Vecher (serge-v) 2007-04-09 12:14:46

Guys, if this is an urgent production issue, I would suggest putting a bounty offer together and contacting one of the Asterisk consultants (oej could be one) to resolve this issue, otherwise things will proceed until one of sip developers has a chance to do that.

By: Joel Jacobson (joelj) 2007-04-10 02:26:51

I am having the same problem.

I am willing to contribute with 500 USD to the bounty if this problem can be solved within 5 days.

Thanks.

By: Clod Patry (junky) 2007-04-10 10:07:14

tasker: could ya attach a snip of ur sip show channels (with all totals), same for core show channels please.

Thanks.

By: cmaj (cmaj) 2007-04-10 11:54:19

I've been running 5-10 concurrent SIP calls thru one provider for several hours today after applying the attached file lingering_udp_chan_sip.patch.txt and not seen this problem.  Can somebody try it under a little more load ?  (Patch is against 1.4.2 release tarball). Thanks.



By: cmaj (cmaj) 2007-04-10 17:00:55

It looks like this problem only crops up when you reach more than 10 concurrent SIP calls.  My patch is not solving it.

By: cmaj (cmaj) 2007-04-11 18:04:47

My bad on that patch yesterday.  The 'patch2' is holding up much better today.  I really sat down and cranked thru some logs to get at it.

There's two cases: one with calls stuck in INVITE and the other in an ACK.  Both are the result of retransmission failures.  When the server finally comes back with a response, the call is already hungup.

Problem is in the first case that the 100 TRYING received from the server cancels the destruction of the SIP dialog.  patch2 attempts to fix that.

In the second case, a 487 received from the server fails to schedule an auto-destruction of the SIP dialog.  patch2 also attempts to fix that.

To help test this patch, besides 'netstat -an' output, do 'core set verbose 4' on the CLI then after an hour of heavy use 'cat /var/log/asterisk/messages | grep Retransmitting'.  If you see a lot of retransmit counts of 5, then you are probably going to run into this bug and should try patch2 above.

By: Octavio Ruiz (tacvbo) 2007-04-12 14:00:13

You will obtain a more accurate metric with:

# lsof | grep asterisk | grep UDP

By: cmaj (cmaj) 2007-04-12 17:15:38

Can anybody else test patch2 ?  This is my second day of running it in production.  But it's not on a very busy server.  Only 10-20 concurrent calls.  Still, no signs of hung UDP.

By: daniel g (revolution) 2007-04-13 11:27:12

i can test it on both of our production systems here...

one normally only has between 10 - 20 concurrent calls; while the other usually gets to around 40 - 50 during peak calling time.

I'll apply the patch tonight and let you know how it's working over the weekend.

By: Joel Jacobson (joelj) 2007-04-13 17:39:54

Why is the patch against 1.4.2?
Wouldn't it be better to fix this bug for the latest version?
I tried to find the piece of code to patch and apply the patch manually for 1.4.12, but the block of relevant code is totally missing in 1.4.12?
No special handler for response 101 exists in 1.4.12, it is threated like 183.

/* RFC3261 says we must treat every 1xx response (but not 100)
that we don't recognize as if it was 183.
*/

By: cmaj (cmaj) 2007-04-13 18:19:08

1.4.12 ???  Where did you get that ?  The last tarball I see is 1.4.2

That said, this was broke between 1.2.13 and 1.2.14, and for all 1.4.x

I took a diff between 1.2.13 and 1.2.14 -- the patch2 should apply with a little fuzz on 1.2.x, but I haven't tried.

About that comment in the code you cited, the 2 lines right below that comment are somewhat confusing:

if (resp > 100 && resp < 200 && resp!=101 && resp != 180 && resp != 183)
   resp = 183;

But they are the same as:

if (resp > 101 && resp < 200 && resp != 180 && resp != 183)
   resp = 183;

So 100 and 101 were not handled as 183.  However, both 100 and 101 are handled the same way in the switch statement below that.

I'm attaching patch3 which tries to make it a little clearer and only handles the case of 100, which is a little more honest because I only saw the bug with 100 and never with 101.  Please let me know if that clears things up.

By: cmaj (cmaj) 2007-04-13 18:22:43

Okay, I see now, you pulled 1.2.17

My patch doesn't work against that.

Let me fix it.

By: cmaj (cmaj) 2007-04-13 18:31:47

Alright 2 versions of patch3 are up now, one each for 1.2.17 and 1.4.12, clearly marked.  I haven't tested with 1.2.17, but it compiles.

(The first 3 patches can be deleted by a bug marshal.)



By: Joel Jacobson (joelj) 2007-04-14 07:13:33

OK, thank you.

I have upgraded to 1.4.2 and applied the patch for it now.

This is the result after only a couple of minutes of uptime.

Will retest later today and let you know the result.

(Data such as IP addresses and phone numbers have been removed below.)

voip*CLI> sip show channels
Call ID Seq (Tx/Rx) Form Hold
3435fa7b08d 00102/00000 g729 No Init: INVITE
8D30164E-E9 00101/00101 g729 No Rx: INVITE
3e2fdf392fa 00102/00000 g729 No Tx: ACK
88A1BC4A-E9 00101/00101 g729 No Rx: ACK
427754d3717 00102/00000 g729 No Tx: ACK
7B8003E1-E9 00101/00101 g729 No Rx: ACK
3cba3fbe0da 00102/00000 unkn No Tx: ACK
34f9302e699 00102/00000 unkn No Tx: ACK
584740b81cf 00102/00000 g729 No Tx: ACK
4943A8A0-E9 00101/00101 g729 No Rx: ACK
2527a63c469 00102/00000 g729 No Tx: ACK
4875FA95-E9 00101/00101 g729 No Rx: ACK
39863fb6144 00102/00000 g729 No Tx: ACK
6DCA1A76-E9 00101/00101 g729 No Rx: ACK
7020851f15d 00102/00000 g729 No Tx: ACK
67CBB8D0-E9 00101/00101 g729 No Rx: ACK
5128fd8c15f 00102/00000 g729 No Tx: ACK
635108-3385 00101/00001 g729 No Rx: ACK
636295e8581 00102/00000 g729 No Tx: ACK
4B22480A-E9 00101/00101 g729 No Rx: ACK
20 active SIP channels

voip*CLI> show channels
State Application
Ringing AppDial
Ring Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
18 active channels
9 active calls

root@voip:~# lsof | grep asterisk | grep UDP | wc -l
44

By: cmaj (cmaj) 2007-04-14 09:38:24

Are you bridging SIP-to-SIP ?  If so, then that looks pretty good !  From what I can tell, the SIP dialogs (from "sip show channels") should auto-destruct in at most, 32 seconds after the last message was received.  So even after a call is hungup (not seen in "core show channels"), the open UDP ports will still linger for half a minute.

By: daniel g (revolution) 2007-04-14 13:52:23

OK... after running over night... some DEFINITE improvements are seen... however it seems to still be lingering on Invites...

03eb6fc5215  00102/00000  unkn  No       Init: INVITE
3a8403fe412  00102/00000  unkn  No       Init: INVITE
0f61724419a  00102/00000  unkn  No       Init: INVITE
0451ad5f34c  00102/00000  unkn  No       Init: INVITE
765355af622  00102/00000  unkn  No       Init: INVITE
7160affa282  00102/00000  unkn  No       Init: INVITE
6dbdef3c375  00102/00000  unkn  No       Init: INVITE
7e23a040752  00102/00000  unkn  No       Init: INVITE
7914bfaf6ce  00102/00000  unkn  No       Init: INVITE
456e67c87e1  00102/00000  unkn  No       Init: INVITE
2dcce62b63a  00102/00000  unkn  No       Init: INVITE
1f3e34341f7  00102/00000  unkn  No       Init: INVITE
0678a58e225  00102/00000  unkn  No       Init: INVITE
3893ca636e2  00102/00000  unkn  No       Init: INVITE
284bf2a106e  00102/00000  unkn  No       Init: INVITE

These have remained in tehre for a few minutes at least -- suggestions? ideas?

THAT said... The number of channels reported in "sip show channels" has dropped from around 500+ after a night of usage to closer to 200.

Any additional information you need?

By: Joel Jacobson (joelj) 2007-04-14 13:57:55

revolution: What patch did you use? I am using the patch3 for 1.4.2.

I have been running Asterisk for 7 hours now.

Results below. Look good, I think?

voip*CLI> show channels
State Application
Down AppDial
Ring Dial
Ringing AppDial
Ring Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Bridged
Up Dial
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
Up Bridged
Up Dial
26 active channels
13 active calls

voip*CLI> sip show channels
Call ID Seq (Tx/Rx) Form Hold
198cc12f032 00102/00000 g729 No Init: INVITE
DC27D0DF-E9 00101/00101 g729 No Rx: INVITE
2ccf5965331 00102/00000 unkn No Tx: ACK
2bc546c55cd 00295/00000 unkn No  
04b1ccf1079 00295/00000 unkn No  
0e22c84a14a 00102/00000 g729 No Tx: ACK
CEF946CD-E9 00101/00101 g729 No Rx: ACK
7fd2b2b82d3 00102/00000 g729 No Tx: ACK
62666c29304 00102/00000 unkn No Tx: ACK
CD80B3CA-E9 00101/00101 g729 No Rx: ACK
461ed7e5380 00102/00000 g729 No Tx: ACK
8E1DBA79-E9 00101/00101 g729 No Rx: ACK
2a65fa3800a 00102/00000 g729 No Tx: ACK
661238-3385 00101/00001 g729 No Rx: ACK
0e66e3c4467 00102/00000 g729 No Tx: ACK
1c10f7fe440 00102/00000 g729 No Tx: ACK
6316697B-E9 00101/00101 g729 No Rx: ACK
6307C35A-E9 00101/00101 g729 No Rx: ACK
0c49310b72f 00102/00000 g729 No Tx: ACK
62673ED2-E9 00101/00101 g729 No Rx: ACK
36af22f937b 00102/00000 g729 No Tx: ACK
46BA7D48-E9 00101/00101 g729 No Rx: ACK
229cec5876e 00102/00000 g729 No Tx: ACK
6A9AECF3-E9 00101/00101 g729 No Rx: ACK
6f8093137ea 00102/00000 g729 No Tx: ACK
2790EC0B-E9 00101/00101 g729 No Rx: ACK
0b7bd73407f 00102/00000 g729 No Tx: ACK
660480-3385 00101/00001 g729 No Rx: ACK
5995122e55f 00102/00000 g729 No Tx: ACK
3DF6BF92-E9 00101/00101 g729 No Rx: ACK
26 active SIP channels

root@voip:~# lsof | grep asterisk | grep UDP | wc -l
58



By: cmaj (cmaj) 2007-04-14 14:16:45

revolution: Can you post output similar to joelj, to compare "core show channels", "sip show channels", and "lsof | grep asterisk | grep UDP | wc -l" ?

Also, some output from "sip show channel XXX" would be great.  That will reveal more state information on those stuck channels.

By: daniel g (revolution) 2007-04-14 16:17:12

not being during peak times makes logs easier :)
keep in mind that this is active so it might be a slight change between logs...

lsof | grep asterisk | grep UDP | wc -l

79

core show channels

Up      Bridged Call
Up      Dial
Up      Bridged Call
Up      Dial
Up      Bridged Call
Up      Dial
Up      Bridged Call
Up      Dial
Up      Bridged Call
Up      Dial
10 active channels
5 active calls

sip show channels

740d343e260  00102/00000  ulaw  No       Init: INVITE
52ae679c00c  00101/22215  g729  No       Rx: INVITE
6190c022069  00102/00000  g729  No       Init: INVITE
HOUMGC01200  00101/00001  g729  No       Rx: INVITE
1c939d11180  00102/00000  ulaw  No       Init: INVITE
c9db0f64ac2  00101/51104  g729  No       Rx: INVITE
     00102/00000  g729  No       Tx: ACK
     00101/00001  g729  No       Rx: ACK
     00102/00000  ulaw  No       Tx: ACK
da84f7d617e  00101/37565  g729  No       Rx: ACK
7f9d111918e  00102/00000  ulaw  No       Tx: ACK
4e75ca45bfb  00101/61239  ulaw  No       Rx: ACK
1f3e2d326dd  00102/00000  ulaw  No       Tx: ACK
ORLMGC01200  00101/00001  g729  No       Rx: ACK
7a8fdd36453  00102/00000  ulaw  No       Tx: ACK
 3134758561@  00101/04803  g729  No       Rx: ACK
4bd432a92f2  00102/00000  ulaw  No       Tx: ACK
a9aba496ff2  00101/10089  g729  No       Rx: ACK
 269c95ff595  00102/00000  unkn  No       Init: INVITE
38c9fa9148d  00102/00000  unkn  No       Init: INVITE
03eb6fc5215  00102/00000  unkn  No       Init: INVITE
3a8403fe412  00102/00000  unkn  No       Init: INVITE
0f61724419a  00102/00000  unkn  No       Init: INVITE
0451ad5f34c  00102/00000  unkn  No       Init: INVITE
765355af622  00102/00000  unkn  No       Init: INVITE
7160affa282  00102/00000  unkn  No       Init: INVITE
6dbdef3c375  00102/00000  unkn  No       Init: INVITE
7e23a040752  00102/00000  unkn  No       Init: INVITE
7914bfaf6ce  00102/00000  unkn  No       Init: INVITE
456e67c87e1  00102/00000  unkn  No       Init: INVITE
2dcce62b63a  00102/00000  unkn  No       Init: INVITE
1f3e34341f7  00102/00000  unkn  No       Init: INVITE
0678a58e225  00102/00000  unkn  No       Init: INVITE
3893ca636e2  00102/00000  unkn  No       Init: INVITE
284bf2a106e  00102/00000  unkn  No       Init: INVITE

sip show channel 284bf2a106e

 * SIP Call
 Curr. trans. direction:  Outgoing
 Call-ID:                284bf2a106ebdd176632dd9863c24cc3
 Owner channel ID:       <none>
 Our Codec Capability:   256
 Non-Codec Capability (DTMF):   1
 Their Codec Capability:   0
 Joint Codec Capability:   256
 Format:                 0x0 (nothing)
 MaxCallBR:              384 kbps
 Theoretical Address:    Public IP:5060
 Received Address:       Public IP:5060
 SIP Transfer mode:      closed
 NAT Support:            Always
 Audio IP:               Local IP
 Our Tag:                as1040ea60
 Their Tag:
 SIP User agent:
 Username:              
 Peername:              
 Original uri:           sip:   @192.168.0.12
 Need Destroy:           0
 Last Message:           Init: INVITE
 Promiscuous Redir:      No
 Route:                  N/A
 DTMF Mode:              rfc2833
 SIP Options:            (none)

any thing else you need -- feel free to ask. So far this patch has cut down on a LOT of the lingering channels... in one of the servers (the one running between 10 - 20 concurrent calls typically) -- it seems to have cut them out entirely. These come from a much larger box... typically running around 40 - 50 (and sometimes *rarely* as high as 60) concurrent calls

Patch being used is patch3-against-1.4.2 and it has been running since about 9 P.M. EST last night (which makes it around 20 hours). Progress is definitely being made... thanks for that cmaj.



By: Olle Johansson (oej) 2007-04-16 09:31:49

cmaj: Please confirm that you have a disclaimer on file, so I can check your patches. Thanks.

By: Olle Johansson (oej) 2007-04-16 09:38:35

Ok guys. To debug these channels, we need sip history on them. Turn on siphistory and dumphistory in the configuration file, restart. Then show history on the hung channels so we know what's going on. Thanks.

By: cmaj (cmaj) 2007-04-16 10:05:16

Disclaimer on file.  I have no rights when I submit patches :)

oej: Do you feel that dialogs should be destroyed if they are call-related but have no channel owner ?  That is the idea behind my patches.  These additional cases of hung channels can be solved by applying that idea in more places.

By: daniel g (revolution) 2007-04-16 12:00:55

I can enable *history later tonight (when I don't have 44 active calls running) and will submit either tomorrow or wednesday (whichever will be most useful to you all).

thanks olle/cmaj!

(p.s. my active SIP channels are STILL lower than they were prior to cmaj's patches)

By: cmaj (cmaj) 2007-04-16 14:07:34

patch4 expounds on the idea that you can not cancel SIP dialogs related to a channel if said channel no longer exists.

By: cmaj (cmaj) 2007-04-16 14:15:56

joelj & revolution: Are y'all registering with a SIP server in sip.conf ?  If so, are those stuck channels only with that server and no other SIP endpoints ?

By: Badalian Vyacheslav (slavon) 2007-04-16 14:35:01

- /*! XXX: we should also check here does the other side supports t38 at all !!! XXX */
+ /*! XXX: we should also check here does the other side supports t39 at all !!! XXX */

Intresting change )

By: daniel g (revolution) 2007-04-16 15:28:39

Negative. In my case, these are calls going to off-site ATAs.

I will apply patch4 asap and let you know the results tomorrow when I post the sip history.

Running sip show history <channel> shows me that * attempted (five times) to send an Invite to an ATA which was behind a router (in the case of this channel).

In the case of some other channels, the ATA has been on a public IP but may have gone offline while Asterisk was attempting to send the Invite (or may not have -- just tested a call to an ATA I thought was offline and found out it was not!).

By: Joel Jacobson (joelj) 2007-04-16 16:00:21

I don't register with any SIP peers.

sip.conf:

[general]
port=5060
bindaddr=0.0.0.0
dtmfmode=rfc2833
canreinvite=no
deny=all
allow=g729
nat=yes

[outbound]
fromuser=voip
type=peer
host=x.x.x.x
port=5060
qualify=no

[inbound]
type=peer
accountcode=xxxx
context=xxxx
insecure=very
host=x.x.x.x

extensions.conf:
[general]
writeprotect=yes

[globals]

[default]
exten => s,1,Answer()
exten => s,2,Playback(demo-thanks)
exten => s,3,Hangup()

[xxxx]
exten => _XXXXXXXXXXX,1,AGI(route.agi)
exten => h,1,DeadAGI(hangup.agi|${DIALSTATUS}|${ANSWEREDTIME})

By: Joel Jacobson (joelj) 2007-04-16 16:17:18

After having applied the new patch4, is it sufficient to recompile and just reload asterisk and the sip module? Like this:

rm -rf asterisk-1.4.2
tar xvzf asterisk-1.4.2.tar.gz
cd asterisk-1.4.2
patch -p0 < ../lingering_udp_chan_sip.patch4-against-1.4.2.txt
./configure --prefix=/usr
make
make install
asterisk -r
voip*CLI> reload
voip*CLI> module reload chan_sip.so

Or do I need to do "stop now" and then restart asterisk?

By: cmaj (cmaj) 2007-04-16 16:28:28

In theory, the "reload module XXX" should work, but I always use "restart now".

By: Joel Jacobson (joelj) 2007-04-16 17:00:48

Will active calls be killed if I do a restart now?

By: cmaj (cmaj) 2007-04-16 17:13:26

Yes.  You could do "restart gracefully" to wait for the last call to hangup.  During that time, no new calls will be accepted by the server.

By: daniel g (revolution) 2007-04-17 08:46:24

some history... it seems that the channels on only hanging now on certain peers... but this is an ATA with a public IP that is currently accessible. The other hung channels look fairly similar to this (with only the obvious differences -- IP/Chan ID/etc). Is this what you wanted oej?

sip show history 3f5c72a645dcd822250ea4eb47f204cf@sip._.com

1. NewChan         Channel SIP/0-0829aef0 - from 3f5c72a645dcd822250ea4eb4
2. TxReqRel        INVITE / 102 INVITE - -UNKNOWN-
3. ReTx            1000 INVITE sip:0@aaa.bbb.ccc.ddd:5060 SIP/2.0
4. ReTx            2000 INVITE sip:0@aaa.bbb.ccc.ddd:5060 SIP/2.0
5. ReTx            4000 INVITE sip:0@aaa.bbb.ccc.ddd:5060 SIP/2.0
6. ReTx            8000 INVITE sip:0@aaa.bbb.ccc.ddd:5060 SIP/2.0
7. ReTx            16000 INVITE sip:0@aaa.bbb.ccc.ddd:5060 SIP/2.0

By: cmaj (cmaj) 2007-04-17 09:36:33

I didn't know about "sip history" until this bug.  That is extremely helpful.

That call you posted is never scheduling a destruction of the SIP dialog.  My patch only cancels the cancel of a pending destruction.  It doesn't set one up in the first place.

Back to the drawing board for a bit...

By: daniel g (revolution) 2007-04-17 12:36:13

Understandable. I've found with some peers... an Invite isn't responded to in a timely manner (the peer may've gone off-line mid Invite... or a poorly-designed/too-restrictive firewall may be blocking or any number of things) and this causes a hang in the sip channels.

Progress has DEFINITELY been made as I said before cmaj... I don't really have to worry about restarting * every couple of days now ;)

By: Olle Johansson (oej) 2007-04-18 04:54:13

Are all of these calls *outbound* calls, initiated from the PBX where the INVITE failed and we did not cancel them and clean up?

By: Joel Jacobson (joelj) 2007-04-18 05:58:00

>Are all of these calls *outbound* calls, initiated from the PBX where the INVITE failed and we did not cancel them and clean up?

Yes, all of the "unkn No Init: INVITE" sip channels are outgoing, that is, the peer IP is the IP of the SIP server receiving the call.

I have not observed this problem for incoming channels.

By: Olle Johansson (oej) 2007-04-18 06:37:33

I've been able to reproduce this issue by just placing a call to a server that has nothing on port 5060
with dial(sip/a.b.c.com) - no timeouts or anything. Autocongest kicks in, but the calling channel (inbound) doesn't seem to issue a hangup. Wonder if different inbound channels will behave
differently - I can only test SIP here.

Will continue to dig.

By: daniel g (revolution) 2007-04-18 11:18:13

From my end -- yes these are all outbound. I've not noticed this problem with inbound either...

By: Joel Jacobson (joelj) 2007-04-18 13:43:08

Once again, I see a lot of "unkn No Init: INVITE", but this time I noticed something I didn't think of before.

I have two different groups of SIP peers that I connect two, located at two different colocation sites.

The problems only happens to one of them. Same hardware on both sites. The difference is that one of them is on a quite unreliable ADSL connection, while the other one is on a dedicated T1.

All errors occurs to the SIP peers on the ADSL connection.

I did a ping on the IPs, and they sometimes are offline.

The Asterisk have probably sent a lot of SIP INVITEs to the SIP peers on that connection that sometimes fail because of the unreliable Internet connection.

By: daniel g (revolution) 2007-04-18 15:35:03

I would agree with you joelj... in most cases -- it would seem that the peer has lost connectivity mid-invite (how's that for poor timing?) -- but asterisk should still destroy the channel instead of just leaving it hanging after the 5 invites fail... right?

By: cmaj (cmaj) 2007-04-18 16:15:40

It should totally do that.

By calling an IP that is up but not listening for SIP, I am able to reproduce the error reliably now.  This should speed things up.

Do you see an auto-congest at the end of "sip show history xxx" ?  I am scheduling a dialog destruction in the auto_congest function for my next patch, amongst other places.

By: daniel g (revolution) 2007-04-18 16:33:20

negative... from sip show history xxxx you get exactly what I posted earlier (with the peers on my system at least). It sends an Invite... retransmits a few times... then gives up and leaves the channel hung.

By: cmaj (cmaj) 2007-04-18 17:50:35

patch5 is a drastic change, but it's much simpler than earlier patches.  If it doesn't fix this bug, then I think combining some elements from patch4 will help.

Thanks for the good feed back on those connectivity issues with your endpoints.

By: cmaj (cmaj) 2007-04-18 17:58:20

Hold on with patch 5. It was too simple.  I see a problem with it.  Just a few minutes...

By: cmaj (cmaj) 2007-04-18 18:18:11

I don't understand how something as simple as patch 6 is fixing this problem, but it is.  Please test with just patch 6.

If it doesn't solve everything with patch 6, but it seems to solve the other half of the problems not solved with patch 4, then apply patch 4 on top of patch 6.



By: daniel g (revolution) 2007-04-18 19:02:25

patch applied... I actually saw the active channels get down to around 100 (some of our original ATA had been set to register WAY too frequently thus result in hung registration channels... your patch seems to be clearing most of those out as well...)

I'll let you know the results later tonight or tomorrow morning... so far so good though!!

By: daniel g (revolution) 2007-04-19 07:02:50

OK. Applied *JUST* patch 6 last night... still have some hung channels (but not outbound as far as I can tell -- though I admit to not having checked every single channel... just a few), so I'm going to apply patch 4 right now and test and post the results later today.

edit -- OK, some testing has been done with both patches 4 and 6 applied (only about two hours of uptime thus far) and I see no hung channels. I've tried to recreate random failures (rather than doing it programmatically) by taking the power cord out of the phone... taking the cat5 cable out of the phone... disabling the DS3... etc. and so far I still have no hung sip channels.

Will continue to run with patches 4 and 6 until close of business today and report back (if there are hung channels). If there are no hung channels by C-O-B, i'll check after 24 full hours have elapsed and post then.

(p.s. in case you were wondering, we've already had 30 concurrent calls running this morning... )



By: daniel g (revolution) 2007-04-19 10:27:29

ok the story behind this one... is irrelevent (although still sorta amusing)

1. NewChan         Channel SIP/0a-08288e18 - from 77770544077a33693a9b2abf2
2. TxReqRel        INVITE / 102 INVITE - -UNKNOWN-
3. Rx              SIP/2.0 / 102 INVITE / 100 Trying
4. Rx              SIP/2.0 / 102 INVITE / 180 Ringing
5. Rx              SIP/2.0 / 102 INVITE / 200 OK
6. TxReq           ACK / 102 ACK - -UNKNOWN-
7. Rx              INVITE / 26194 INVITE / sip:+0b@aaa.bbb.ccc.ddd:5060
8. TxResp          SIP/2.0 / 26194 INVITE - 488 Not acceptable here
9. Rx              ACK / 26194 ACK / sip:+0b@aaa.bbb.ccc.ddd:5060
10. Rx              INVITE / 26195 INVITE / sip:+0b@aaa.bbb.ccc.ddd:5060
11. TxResp          SIP/2.0 / 26195 INVITE - 488 Not acceptable here
12. SchedDestroy    32000 ms
13. CancelDestroy

In this case it seems as though a destruction was scheduled until * got the message to cancel it. Now I can fix the situation that caused this (essentially a person was trying to dial their number at another location and the box was automatically switching the codec which was a disallowed codec on the receiving end... thus the 488)... but it still brings up a point of interest as to whether any other situations could possibly cause this same thing.

By: Bluce Ree (tasker) 2007-04-19 14:55:39

How about just eliminating CancelDestroy? Unless there is an active voice stream on a channel I don't see a reason to cancel a destroy.

By: daniel g (revolution) 2007-04-19 15:18:48

I could be wrong and it may not ACTUALLY mean Cancel the Destroy.

It could be Cancel/Destroy... but from looking at the chan_sip.c file, the comment is Cancel destruction of SIP Dialog.

I'm assuming it's there for a reason -- so who can shed some light on this?

By: cmaj (cmaj) 2007-04-19 15:36:59

I can't speak for the original authors, and I think this is my first patch in chan_sip, but SIP is not just for voice.  That seems to be part of the concern in the code.  You might be sending a text message via SIP (SIMPLE) with no voice at all.  Registrations, notifications, etc., never make a sound.

That said, I think the CancelDestroy's -- cancel the destruction of a SIP dialog -- are sprinkled too liberally in the code.  My patches beat those CancelDestroy's with a large stick.  Perhaps it is too large, but for the most part, I agree that the code should err on the side of the voice phone call.

Worst case scenario, my final patches for this bug are modified to only operate if some option in sip.conf like "babysitdeadIPaddresses=yes" is set.

I will add patches 4 & 6, and try to fix the funky voice codec situation in upcoming patch 7, but if there are more corner cases it would be good to find them ASAP.

By: daniel g (revolution) 2007-04-19 16:26:34

Well cmaj I can say this much for your patches -- the ONLY lingering channel thus far is the one I posted earlier. All others have cleared out.

edit : just so you know... I restarted * last night (to clear out the one lingering channel after I changed the codecs to both be ulaw -- which is what they were supposed to be anyways)... and I have NO lingering channels at this point. * has been running for 18 hours now... and all SIP channels have cleared properly.

I feel like working out the CancelDestroy may be the solution to all the rare unique cases. :) Patch7 (which I suppose would be a combo of patches 4 & 6 and maybe something to address the canceldestroy) should hopefully be your last one cmaj.



By: Ronald Chan (loloski) 2007-04-24 09:37:35

cmaj: could you create a backport for 1.2 branch for your patch 7 also :) thanks

By: Joel Jacobson (joelj) 2007-04-24 13:06:00

I have run Asterisk with patch4+patch6 since the 19th of April, with peak 24 concurrent calls. This is the result:

root@voip:~# lsof | grep asterisk | grep UDP | wc -l
46
root@voip:~# asterisk -r
Asterisk 1.4.2, Copyright (C) 1999 - 2006 Digium, Inc. and others.
Created by Mark Spencer <markster@digium.com>
Asterisk comes with ABSOLUTELY NO WARRANTY; type 'core show warranty' for details.
This is free software, with components licensed under the GNU General Public
License version 2 and other licenses; you are welcome to redistribute it under
certain conditions. Type 'core show license' for details.
=========================================================================
Connected to Asterisk 1.4.2 currently running on voip (pid = 25828)
voip*CLI> show channels
Channel              Location             State   Application(Data)
SIP/xxxxxxxxx xxxxxxx@xxxx Down    AppDial((Outgoing Line))
SIP/xxxxxxxxx xxxxxxx@xxxx Ring    Dial(SIP/xxxxxxxxx
SIP/xxxxxxxxx (None)               Up      Bridged Call(OOH323/x.x.x.x
OOH323/x.x.x.x xxxxxxx@xxxx Up      Dial(SIP/xxxxxxxxx
SIP/xxxxxxxxx (None)               Up      Bridged Call(SIP/xxxxxxxxx
SIP/xxxxxxxxx xxxxxxx@xxxx Up      Dial(SIP/xxxxxxxxx
SIP/xxxxxxxxx (None)               Up      Bridged Call(SIP/xxxxxxxxx
SIP/xxxxxxxxx xxxxxxx@xxxx Up      Dial(SIP/xxxxxxxxx
SIP/xxxxxxxxx (None)               Up      Bridged Call(SIP/xxxxxxxxx
SIP/xxxxxxxxx xxxxxxx@xxxx Up      Dial(SIP/xxxxxxxxx
SIP/xxxxxxxxx (None)               Up      Bridged Call(SIP/xxxxxxxxx
SIP/xxxxxxxxx xxxxxxx@xxxx Up      Dial(SIP/xxxxxxxxx
SIP/xxxxxxxxx (None)               Up      Bridged Call(OOH323/x.x.x.x  
OOH323/x.x.x.x xxxxxxx@xxxx Up      Dial(SIP/xxxxxxxxx
SIP/xxxxxxxxx (None)               Up      Bridged Call(OOH323/x.x.x.x  
OOH323/x.x.x.x xxxxxxx@xxxx Up      Dial(SIP/xxxxxxxxx
SIP/xxxxxxxxx (None)               Up      Bridged Call(SIP/xxxxxxxxx
SIP/xxxxxxxxx xxxxxxx@xxxx Up      Dial(SIP/xxxxxxxxx
18 active channels
9 active calls
voip*CLI> sip show channels
Peer             User/ANR    Call ID      Seq (Tx/Rx)  Form  Hold     Last Message
x.x.x.x    xxxxx  0d6379b3780  00102/00000  g729  No       Init: INVITE
x.x.x.x     (None)      4CD78C74-F1  00101/00101  g729  No       Rx: INVITE
x.x.x.x    xxxxx  1124692724e  00102/00000  g729  No       Tx: ACK
x.x.x.x     (None)      47D49F1B-F1  00101/00101  g729  No       Rx: ACK
x.x.x.x    xxxxx  280ce54f593  00102/00000  unkn  No       Tx: ACK
x.x.x.x    xxxxxxxxx  11988b5106b  00102/00000  g729  No       Tx: ACK                  
x.x.x.x    xxxxx  2ab967a202d  00102/00000  unkn  No       Tx: ACK
x.x.x.x    xxxxxxxxx  3c1464c7765  00102/00000  unkn  No       Tx: ACK
x.x.x.x    xxxxx  19bb31190cd  00102/00000  g729  No       Tx: ACK
x.x.x.x     6153158959  810748-3386  00101/00001  g729  No       Rx: ACK
x.x.x.x    xxxxxxxxx  1a7e6d425c7  00102/00000  g729  No       Tx: ACK
x.x.x.x     (None)      E0154D5B-F1  00101/00101  g729  No       Rx: ACK
x.x.x.x    xxxxxxxxx  671ab04d587  00102/00000  g729  No       Tx: ACK
x.x.x.x     (None)      848875CF-F1  00101/00101  g729  No       Rx: ACK
x.x.x.x    xxxxxxxxx  07301e04081  00102/00000  g729  No       Tx: ACK                        
x.x.x.x     (None)      723F30BB-F1  00101/00101  g729  No       Rx: ACK
x.x.x.x    xxxxxxxxx  6d7238376c1  00102/00000  g729  No       Tx: ACK
x.x.x.x    xxxxxxxxx  6f1c1d7c579  00102/00000  g729  No       Tx: ACK
x.x.x.x    xxxxxxxxx  65bd428e781  00102/00000  g729  No       Tx: ACK
x.x.x.x     (None)      17389D67-F1  00101/00101  g729  No       Rx: ACK
20 active SIP channels
voip*CLI>

The problem seems to have disappeared.

By: Joel Jacobson (joelj) 2007-04-24 14:18:58

oej: Btw, would it be possible for you to take a look at my suggested solution to another problem? I have waited many days now, but noone has commented on my idea. Noticed you were involved in the bug/feature. Bug # 4825
(I know this is probably a bad place to discuss other bugs, but I don't know of any way to contact a specific user in the forum, is there a way?)

By: Olle Johansson (oej) 2007-04-27 03:37:01

You never ever auth a CANCEL message, so the latest patch is very bad.

By: cmaj (cmaj) 2007-04-30 10:06:38

I put together two versions of patch 8; against 1.4.4 & 1.2.18.  The 1.4.4 version should apply to 1.4.2 with a little fuzz.  I would guess the 1.2.18 version would apply with similar fuzz for recent 1.2.x.

I have not tested it yet.  It's just an addition of patches 4 & 6 (patch 7 has the t39 typo; ignore it).  With no significant problems reported from the combo of 4 & 6, other than the one odd codec issue experienced by revolution, I didn't see a need to do much else other than back port to 1.2.18

oej: Were you looking at the first patch by accident ?  That is old and should be ignored.

serge-v: Can you delete all patches *except the two patches labeled "patch 8"* ?

joelj & revolution: Any change in status with patches 4 & 6 applied ?  Any other weird SIP issues cropping up that you think might be related to the patches ?  Is it possible to upgrade to 1.4.4 and try that out with patch 8 ?

loloski: Please let me know how the 1.2.18 patch works and/or if it works on another revision of 1.2.x.  I've only just gotten it to compile against 1.2.18 and have no idea if it works in production, nor do I have an environment to test it.  The differences between 1.2.x and 1.4.x are significant, and thus so are the differences between the patches.  It will need thorough testing.

By: daniel g (revolution) 2007-04-30 10:58:13

cmaj -- moved to 1.4.4 on one of the systems earlier, I'll do some testing with that and see how it goes. Other than the odd codec issue, I didn't see any other invites hanging. Give me a couple of days and I'll let you know the results.

By: Ronald Chan (loloski) 2007-05-01 03:18:53

cmaj,

Ok i will do that, i'll let you know how is it going, i experience this mostly with reinvites turn on with Queues without having chan_agent involve via incoming SIP/ZAP call with 20+ members/queue, thanks for the patch anyway i will report my progress here.

By: Olle Johansson (oej) 2007-05-02 01:27:59

cmaj: I think you are on the right track, but change way too many places. The big question is where in the code we cause a CancelDestroy on the sip history posted here. We send a 488 then for some unknown reason cancels destruction after a scheduled destruction - with no incoming data.

Do we have more examples on "sip history" on calls that hang or do they all look like this one?

By: Badalian Vyacheslav (slavon) 2007-05-02 08:08:40

asterisk*CLI> sip show history 0de646f20c113ad364f43f2a75c3b878@87.255.0.218
asterisk*CLI>
 * SIP Call
1. ReTx            1000 INVITE sip:a74956444456-4@10.10.134.6:5061 SIP/2.0
2. ReTx            2000 INVITE sip:a74956444456-4@10.10.134.6:5061 SIP/2.0
3. ReTx            4000 INVITE sip:a74956444456-4@10.10.134.6:5061 SIP/2.0
4. ReTx            8000 INVITE sip:a74956444456-4@10.10.134.6:5061 SIP/2.0

asterisk*CLI> sip show history 7a96d228339b547b3e3cea24102b03fa@87.255.0.218
asterisk*CLI>
 * SIP Call>
1. ReTx            1000 INVITE sip:a74956444456-4@10.10.134.6:5061 SIP/2.0
2. ReTx            2000 INVITE sip:a74956444456-4@10.10.134.6:5061 SIP/2.0
3. ReTx            4000 INVITE sip:a74956444456-4@10.10.134.6:5061 SIP/2.0
4. ReTx            8000 INVITE sip:a74956444456-4@10.10.134.6:5061 SIP/2.0
5. ReTx            16000 INVITE sip:a74956444456-4@10.10.134.6:5061 SIP/2.0

asterisk*CLI> core show version
Asterisk SVN-trunk-r62242 built by root @ asterisk on a i686 running Linux on 2007-04-28 05:52:04 UTC

No patch. Added file sip_show_channels. Maybe help for u work. Thanks

By: cmaj (cmaj) 2007-05-02 08:48:06

oej: Looking at slavon's and revolution's "sip show history" logs, I don't think the Destroy is ever scheduled in a lot of those hung channel situations.  So there's no CancelDestroy to worry about.  But that is not the only reason for the hung channels.

You might be right that this patch is overkill.  The improvements in patch 8 vs. patch 2/3 might be all that is required.  But considering the number of bug reports filed related to this one, and the reports of this patch fixing the problem for all those whom have tried it, why not just apply it now and let future generations worry about re-factoring it to reduce complexity ?

By: Ronald Chan (loloski) 2007-05-03 03:58:01

cmaj: Thanks your patch is working great, even reinvites turn on i don't see any channel now stuck in INVITE,ACK state. thanks

By: Olle Johansson (oej) 2007-05-03 06:04:09

slavon: What's the dial string in the dial plan for these calls? Do you match a peer at all or just dial out?

cmaj: I would really like to find the error in the code, the calls that cause this, not just apply a blind fix that makes me worried that something else might change in this messy code.

By: Badalian Vyacheslav (slavon) 2007-05-03 06:11:23

oej

exten => 6444456,1,ChanIsAvail(SIP/a74956444456-1|j)
exten => 6444456,2,GotoIf($["${SIPPEER(a74956444456-1:status):0:2}" != "OK"]?102:3)
exten => 6444456,3,Dial(SIP/a74956444456-1,9,)
exten => 6444456,4,Goto(102)

exten => 6444456,102,ChanIsAvail(SIP/a74956444456-2|j)
exten => 6444456,103,GotoIf($["${SIPPEER(a74956444456-2:status):0:2}" != "OK"]?203:104)
exten => 6444456,104,Dial(SIP/a74956444456-2,9,)
exten => 6444456,105,Goto(203)

exten => 6444456,203,ChanIsAvail(SIP/a74956444456-3|j)
exten => 6444456,204,GotoIf($["${SIPPEER(a74956444456-3:status):0:2}" != "OK"]?304:205)
exten => 6444456,205,Dial(SIP/a74956444456-3,9,)
exten => 6444456,206,Goto(304)

exten => 6444456,304,ChanIsAvail(SIP/a74956444456-4|j)
exten => 6444456,305,GotoIf($["${SIPPEER(a74956444456-4:status):0:2}" != "OK"]?405:306)
exten => 6444456,306,Dial(SIP/a74956444456-4,9,)
exten => 6444456,307,Goto(405)

exten => 6444456,405,Dial(SIP/a74956444456-1&SIP/a74956444456-2&SIP/a74956444456-3&SIP/a74956444456-4,60,)
exten => 6444456,406,Busy
exten => 6444456,407,Hangup

By: cmaj (cmaj) 2007-05-03 10:19:32

oej: Did you really mean to call my code a "blind fix" ?

Disregarding the fact I hired a team of recently unemployed circus monkeys who squatted for weeks on my keyboard to create the patch: it works, as verified by myself and others.  I moved 10k minutes of SIP yesterday on 1.4.4 + patch8 and woke up to 0 hung channels when previously I would have crashed under that load.

I don't see how I should bear responsibility for all the "messy code" that is chan_sip -- last time I checked, you were the gatekeeper in this area, and I didn't see any notices of pending re-writes.

Considering that this bug is in both the most recent 1.2.x and 1.4.x "stable" versions of Asterisk, but not in the original 1.2.x releases, then it would appear a previous bug fix is at fault.  I did not work on that previous fix.  But I'm assuming that owing to your authority in this area, you probably did, or at least you are more aware of it than I.  I think you should consider that the previous bug fix was perhaps only half of what was necessary.

My point is this: when I see comments in the code like "Do we need a timer here if we don't hear from them at all?" in chan_sip, and I put a timer in, and it works, what is the problem with that ?

I guess this is just another one of those patches I carry around in my store of "Asterisk Survival Supplies".

By: Olle Johansson (oej) 2007-05-03 10:52:27

cmaj: My apologies if that remark hurted. But my way of reasoning, especially for released code, is that we need to know what type of call that causes this before we fix.

I haven't seen anything in this bug report specific enough to tell me what types of calls that cause this issue so far, so for myself, I'm working in the blind. If someone provides me access to a system where it happens frequently or provides me with more detailed information, I will look into it and evaluate all patches here.

There is an issue here that needs to be fixed, definitely.

By: Leif Madsen (lmadsen) 2007-05-03 12:12:53

I can reproduce this for you oej if you wanna login to one of our miami boxes (just co-ordinate a time with me). I'm doing this simply by setting up a bunch of channels via SIPp --> Asterisk (hung channels) --> Asterisk (media generator).

After about 500 simultaneous calls are setup (maybe around 3k-4k worth of calls) you'll see at least 10+ hung channels that don't tear down.

By: Edoardo Serra - WeBRainstorm S.r.l. (webrainstorm) 2007-05-04 09:10:01

oej, if you want I have a machine on which it occurs regurarly
show channels doesn't show up channels but lsof and netstat show a lot of RTP ports opened.

By: Edoardo Serra - WeBRainstorm S.r.l. (webrainstorm) 2007-05-05 01:49:42

Just to inform you that the attached patch works for me.
My 2 cents

By: drrt (drrt) 2007-05-05 07:04:03

actually i can provide with log files which can display behavior of the system and details about a calls which froze sip channels.
i ve been noticed the channels get stucked when any of SIP-clients lost it`s connectivity with asterisk.



By: drrt (drrt) 2007-05-10 02:33:53

2 oej any updates about the issue?

here is CDRs of calls that hangs sip channels. There are 3 strange callswith "no answer" state  in the example below and ordinary one. I tried to snif on client side and found nothing special. The calls was initiated by Asterisk.

59101 | 2007-05-10 10:26:50+03 |            |            | s    | outcome    | SIP/phone86-0841bb88      |          |         |                 |        1 |       0 | NO ANSWER   |        3 |             | 1178782010.7943 |
 59099 | 2007-05-10 10:26:32+03 |            |            | s    | outcome    | SIP/phone93-084877d0      |          |         |                 |        5 |       0 | NO ANSWER   |        3 |             | 1178781992.7940 |
 59098 | 2007-05-10 10:26:32+03 |            |            | s    | outcome    | SIP/phone92-08490c70      |          |         |                 |        5 |       0 | NO ANSWER   |        3 |             | 1178781992.7939 |
 59102 | 2007-05-10 10:26:25+03 | 4012703704 | 4012703704 | 6678 | techqueued | SIP/**.**.**.**-08441b40 | SIP/phone91-084470d0 | Queue   | tech|t|||120    |       59 |      59 | ANSWERED    |        3 |             | 1178781985.7937 |

By: Olle Johansson (oej) 2007-05-11 02:01:45

I would really like access to the systems. I could repeat this yesterday on one of my systems. What I see is that chan_sip send a congestion control frame to APP_DIAL, but app_dial does not react to this. I added a timer so that the inbound channel was hung up, but APP_DIAL did not react to that either. If that's the case on your systems too, this is something that goes deeper than the channel driver or something is wrong in the way the channel driver communicates with the PBX.

By: drrt (drrt) 2007-05-11 02:18:04

Will help you with pleasure. Would you to communicate with my by email(junior_removethis_kanet.ru), jabber(junior_removethis_kanet.ru) or icq(171985179). What is your actual timezone by the way?



By: Edoardo Serra - WeBRainstorm S.r.l. (webrainstorm) 2007-05-11 02:20:37

hi oej, send me an email (edoardo.serra at webrainstorm.it) and we can arrange that

By: Olle Johansson (oej) 2007-05-11 16:34:39

Will mail you during the weekend. I'm in Sweden - GMT+1 :-)

By: drrt (drrt) 2007-05-12 10:14:04

i will look forward for your message. My actual timezone is GMT+3. It makes much easy to cooperate.

By: Edoardo Serra - WeBRainstorm S.r.l. (webrainstorm) 2007-05-12 13:51:05

I'm in GMT+1 too (Italy) ;)

By: Olle Johansson (oej) 2007-05-14 08:09:34

Ok, the weekend was spent on other private issues... Find me on IRC in the #asterisk-bugs channel or send me email so we can localize and fix this bug. THanks.


E-mail: oej@edvina.net

By: Olle Johansson (oej) 2007-05-18 07:52:53

Still haven't got access to any system that has this problem. Thanks drrt for repeated contacts and providing me with information.

Yet another set of questions:
- Do you do any processing in the "h" extension?
- What CDR backend do you use? Do you have CDR batch support on in cdr.conf?

By: Olle Johansson (oej) 2007-05-18 07:55:13

Logs from drrt indicate what I've seen. We do send a hangup from the SIP channel, but after that the call is still hanging around for some reason, we're not getting a call to sip_hangup from the core.

By: Olle Johansson (oej) 2007-05-18 08:02:32

On tuesday, I will look into this with other developers at the dev meeting. This is a highly prioritized bug that I need to get my hands on to analyze. Anything you can add to this bug report is welcome.

By: Olle Johansson (oej) 2007-05-18 08:25:10

Ok, we have two cases with call queues involved. Anyone else? It might not be app_dial, but app_queue being the culprit here.

By: Olle Johansson (oej) 2007-05-18 10:21:13

Tested on one of the systems that where using Queue. Obviously calls that we sent CANCEL on without any owner where hanging. Fixed this in svn 1.2 rev 65075, 1.4 rev 65076 and trunk 65077.

Please test these new versions on your systems - without any other patches. Thank you, webrainstorm, for giving me access to your system.

By: Leif Madsen (lmadsen) 2007-05-18 11:07:24

FYI: I was able to reproduce this by using the following topology:

SIPp --> Asterisk A --> Asterisk B

SIPp calls into Asterisk A.
Asterisk A performs a Dial() over SIP to Asterisk B.
Asterisk B plays back some audio.
Hangup.

oej: you asked if we were doing any post call processing with 'h', and I am. No queuing involved in the above topology.

By: daniel g (revolution) 2007-05-18 12:21:09

oej --

In my case I have one system using Queue and one not using Queue. I will update the system using Queue to svn later tonight (when people aren't using the phone) and post the results in a few days.

By: Olle Johansson (oej) 2007-05-18 14:28:45

Ok, committed a second patch for the case where we are sending 200 OK and not receive an ACK.

Please try latest svn on a system with no other patches applied to chan_sip. Thanks.

If you still have calls hanging, make sure you upload sip history for those calls and information how the call was places (dial or queue) and if you called a local peer registered with Asterisk or something else ( a service provider, a SIP proxy, a pair of red boots). Thanks.

By: Olle Johansson (oej) 2007-05-18 14:28:46

Ok, committed a second patch for the case where we are sending 200 OK and not receive an ACK.

Please try latest svn on a system with no other patches applied to chan_sip. Thanks.

If you still have calls hanging, make sure you upload sip history for those calls and information how the call was places (dial or queue) and if you called a local peer registered with Asterisk or something else ( a service provider, a SIP proxy, a pair of red boots). Thanks.

By: daniel g (revolution) 2007-05-22 08:22:24

1. NewChan         Channel SIP/0-081ec3f8 - from 451829f215a6b4ca3ec7f42e0
2. TxReqRel        INVITE / 102 INVITE - -UNKNOWN-
3. ReTx            1000 INVITE sip:0@aaa.bbb.ccc.ddd:5060 SIP/2.0
4. ReTx            2000 INVITE sip:0@aaa.bbb.ccc.ddd:5060 SIP/2.0
5. ReTx            4000 INVITE sip:0@aaa.bbb.ccc.ddd:5060 SIP/2.0
6. ReTx            8000 INVITE sip:0@aaa.bbb.ccc.ddd:5060 SIP/2.0

Placed using Dial and to a remote peer registered with *. It seems that this peer may have gone offline during the Invite and thus never responded.

By: daniel g (revolution) 2007-05-22 14:28:42

1. NewChan         Channel SIP/0-081e5bf0 - from 4c10fc8c03acbce668e680c451bc18
2. TxReqRel        INVITE / 102 INVITE - -UNKNOWN-
3. ReTx            1000 INVITE sip:0@aaa.bbb.ccc.ddd:19186;rinstance=c9c6c6ac2c62b
4. ReTx            2000 INVITE sip:0@aaa.bbb.ccc.ddd:19186;rinstance=c9c6c6ac2c62b
5. ReTx            4000 INVITE sip:0@aaa.bbb.ccc.ddd:19186;rinstance=c9c6c6ac2c62b
6. ReTx            8000 INVITE sip:0@aaa.bbb.ccc.ddd:19186;rinstance=c9c6c6ac2c62b
7. ReTx            16000 INVITE sip:0@aaa.bbb.ccc.ddd:19186;rinstance=c9c6c6ac2c62
8. ReTx            32000 INVITE sip:0@aaa.bbb.ccc.ddd:19186;rinstance=c9c6c6ac2c62
9. Cong            Auto-congesting (timer)

sip peer is local to *. The network did not go down in this case. (however it IS a softphone). Updated to SVN. Call Definitely came from Dial (in this case, the term/orig provider initated call to Asterisk and customer chose to dial by extension)



By: drrt (drrt) 2007-05-22 16:37:50

2 oej.
i dont use hangup extension in the dialplan. pgsql is my only cdr`s backend.

By: bcoppens (bcoppens) 2007-05-23 05:34:08

Olle - I did sent you traces and logs by email (to 'oej@edvina.net')

The following is still a problem case

SIP INVITE from A -> ASTERISK
SIP INVITE from ASTERISK -> B
B not responding.....
CANCEL from A -> ASTERISK



By: Badalian Vyacheslav (slavon) 2007-05-23 08:57:46

New info. maybe intresting

SVN-trunk-r65395

Subsctibe:
asterisk*CLI> sip show history NTQxNGJiYWIyZjRhMzk2NzA2NjI0YWJlZmYyZjBmNDg.
asterisk*CLI>
 * SIP Call
1. AuthChal        Auth challenge sent for  - nc 0
2. AuthChal        Auth challenge sent for  - nc 0
3. AuthChal        Auth challenge sent for  - nc 0
4. AuthChal        Auth challenge sent for  - nc 0
5. AuthChal        Auth challenge sent for  - nc 0
6. AuthChal        Auth challenge sent for  - nc 0
7. AuthChal        Auth challenge sent for  - nc 0
8. AuthChal        Auth challenge sent for  - nc 0
9. AuthChal        Auth challenge sent for  - nc 0
10. AuthChal        Auth challenge sent for  - nc 0
11. AuthChal        Auth challenge sent for  - nc 0


asterisk*CLI> sip show history 0f6a4b097bb812366839c8053483c0f3@87.255.0.218
asterisk*CLI>
 * SIP Call
1. ReTx            1000 INVITE sip:a74957976827@10.10.17.13:5060 SIP/2.0
2. ReTx            2000 INVITE sip:a74957976827@10.10.17.13:5060 SIP/2.0
3. ReTx            4000 INVITE sip:a74957976827@10.10.17.13:5060 SIP/2.0
4. ReTx            8000 INVITE sip:a74957976827@10.10.17.13:5060 SIP/2.0
5. ReTx            16000 INVITE sip:a74957976827@10.10.17.13:5060 SIP/2.0



Added file sip_show_channels2

By: drrt (drrt) 2007-05-25 04:04:06

the problem is still here for me. i wonder how can i show my debug files(about 200mb) to developers?

By: Olle Johansson (oej) 2007-05-29 03:15:02

slavon: That's the weirdest call. We do challenge something repeatedly but get no incoming request? What started this? Can you try to debug a bit more to find out what's going on?

By: Badalian Vyacheslav (slavon) 2007-05-29 03:36:33

oej how i can put debug info into file in realtime? I have produce system and can't stop service and do 'asterisk -Tvvvvvdddddngc | tee /tmp/verbosedebug.txt'. Can i setup "logger" to generate full debug file and send 1 day full log to you?

I can give to you access to asterisk, but system must work 24/7 (access without experements;) )

By: Olle Johansson (oej) 2007-05-29 04:27:14

slavon: You can't give me any more information on that? Maybe access could help. Please contact me on IRC or via e-mail. Thanks.

By: Olle Johansson (oej) 2007-05-29 04:28:26

slavon: Please confirm exactly which version of Asterisk you are using now. Also, if you are using any modules or patches outside of the general distribution, like asterisk-addons. Thanks.

By: Leif Madsen (lmadsen) 2007-05-29 08:12:18

Here is a pastebin of a bunch of stuck calls:

http://www.pastebin.ca/518836

Those channels hang around (looks like in a BYE state), and are never destroyed.

This was done on revision 66244 (1.4 branch).

FYI: 62005 works perfect for me, so whatever changed after that branch breaks things badly for me.

By: Olle Johansson (oej) 2007-05-29 11:22:13

Ok, please test with latest 1.4 or trunk from the svn repositories, I've added a change that should fix an issue I saw on slavo's system and some extra logging.

Again, make sure you enable sip history and show me the history of the calls that hang. Hopefully this fixed and issue where the CALLER sent a CANCEL and we sent a AST_HANGUP to the outbound INVITE, which failed to happen since we did not even have 100 trying on that channel. That means that the channel hung with a delayed BYE, waiting for at least 100 trying.

We have also fixed an issue on jcmoore's system where we did resend BYE's forever since we did not get any replies.

By: Olle Johansson (oej) 2007-05-29 14:16:48

Added another patch to not issue a BYE on a failed BYE transaction

By: Curt Moore (jcmoore) 2007-05-29 21:50:13

oej: I'm still seeing the occasional hung BYE using SVN-1.4-r66538.  If I use my patch, hung_bye.diff.txt, I do not get hung BYEs.  I will concede that this patch is pretty much a brute force way of tearing down the BYEs and there is probably something else going on which is causing us to queue up a BYE to a BYE.  Any thoughts on where I should look next?

By: Olle Johansson (oej) 2007-05-30 01:25:59

jcmoore: I tried looking at the code for more similar hangups on bye failures. Please show me the SIP history, as always in this bug report.

slavon: Please update to latest version and restart so we can see if I've fixed the issue I noticed on your system.

Note: Will be travelling for a week, hopefully getting Internet access back early next week latest, if not in conference centers and hotels. Can't promise immediate reply during this period.

This bug has consisted of several unrelated issues so far. Let's try to track down the rest of them. Thank you for your cooperation and a special thank you to all of you that have given me access to systems, even production systems.

By: daniel g (revolution) 2007-05-30 07:28:28

Still seeing some hung channels here... though not nearly as many.

This is actually a hung ACK...

sip show history

1. NewChan         Channel SIP/0-08327cc0 - from 129170fb24c30c5948d89b201
2. TxReqRel        INVITE / 102 INVITE - -UNKNOWN-
3. Rx              SIP/2.0 / 102 INVITE / 100 Trying
4. Rx              SIP/2.0 / 102 INVITE / 180 Ringing
5. Rx              SIP/2.0 / 102 INVITE / 200 OK
6. TxReq           ACK / 102 ACK - -UNKNOWN-
7. Rx              INVITE / 1 INVITE / sip:+0@aaa.bbb.ccc.ddd
8. Hold            INVITE
9. ReInv           Re-invite received
10. TxResp          SIP/2.0 / 1 INVITE - 200 OK
11. Rx              ACK / 1 ACK / sip:+0@aaa.bbb.ccc.ddd
12. Rx              INVITE / 2 INVITE / sip:+0@aaa.bbb.ccc.ddd
13. Unhold          INVITE
14. ReInv           Re-invite received
15. TxResp          SIP/2.0 / 2 INVITE - 200 OK
16. Rx              INVITE / 3 INVITE / sip:+0@@aaa.bbb.ccc.ddd
17. TxResp          SIP/2.0 / 3 INVITE - 491 Request Pending
18. Rx              ACK / 3 ACK / sip:+0@aaa.bbb.ccc.ddd
19. Rx              ACK / 3 ACK / sip:+0@aaa.bbb.ccc.ddd
20. SchedDestroy    32000 ms
21. CancelDestroy

sip show channel

 * SIP Call
 Curr. trans. direction:  Incoming
 Call-ID:                129170fb24c30c5948d89b2014181f8d@aaa.bbb.ccc.ddd
 Owner channel ID:       <none>
 Our Codec Capability:   256
 Non-Codec Capability (DTMF):   1
 Their Codec Capability:   256
 Joint Codec Capability:   256
 Format:                 0x0 (nothing)
 MaxCallBR:              384 kbps
 Theoretical Address:    aaa.bbb.ccc.ddd:5060
 Received Address:       aaa.bbb.ccc.ddd:5060
 SIP Transfer mode:      open
 NAT Support:            RFC3581
 Audio IP:               aaa.bbb.ccc.ddd (local)
 Our Tag:                as6aca1e14
 Their Tag:              1413123795
 SIP User agent:        
 Username:               0
 Peername:               0
 Original uri:           sip:0@aaa.bbb.ccc.ddd:5060
 Need Destroy:           0
 Last Message:           Rx: ACK
 Promiscuous Redir:      No
 Route:                  sip:0@aaa.bbb.ccc.ddd:5060
 DTMF Mode:              rfc2833
 SIP Options:            (none)


The only other hung channels are all in regards to the same peer (3 channels hung) and all look similar (i.e. same cause)

1. NewChan         Channel SIP/0-083a1ea8 - from 4febe71f2f0d4b07080fe7646
2. TxReqRel        INVITE / 102 INVITE - -UNKNOWN-
3. Rx              SIP/2.0 / 102 INVITE / 100 Trying
4. Rx              SIP/2.0 / 102 INVITE / 180 Ringing
5. Rx              SIP/2.0 / 102 INVITE / 200 OK
6. TxReq           ACK / 102 ACK - -UNKNOWN-
7. Rx              INVITE / 65454 INVITE / sip:restricted@aaa.bbb.ccc.ddd
8. TxResp          SIP/2.0 / 65454 INVITE - 488 Not acceptable here
9. Rx              ACK / 65454 ACK / sip:restricted@aaa.bbb.ccc.ddd
10. Rx              INVITE / 65455 INVITE / sip:restricted@aaa.bbb.ccc.ddd
11. TxResp          SIP/2.0 / 65455 INVITE - 488 Not acceptable here
12. SchedDestroy    32000 ms
13. CancelDestroy

 Curr. trans. direction:  Incoming
 Call-ID:                7d3b95ea27992d470fc5d82e738064e0@aaa.bbb.ccc.ddd
 Owner channel ID:       <none>
 Our Codec Capability:   256
 Non-Codec Capability (DTMF):   1
 Their Codec Capability:   256
 Joint Codec Capability:   256
 Format:                 0x0 (nothing)
 MaxCallBR:              384 kbps
 Theoretical Address:    aaa.bbb.ccc.ddd:5060
 Received Address:       aaa.bbb.ccc.ddd:5060
 SIP Transfer mode:      open
 NAT Support:            RFC3581
 Audio IP:               aaa.bbb.ccc.ddd (local)
 Our Tag:                as1b8e6fd1
 Their Tag:              ed7814519c165226
 SIP User agent:        
 Username:               0
 Peername:               0
 Original uri:           sip:0@aaa.bbb.ccc.ddd
 Need Destroy:           0
 Last Message:           Rx: INVITE
 Promiscuous Redir:      No
 Route:                  sip:0@aaa.bbb.ccc.ddd
 DTMF Mode:              rfc2833
 SIP Options:            replaces replace


Interesting side-note: While I was getting all of this information together, I noticed two other hung ACKs; however, when I went to view the history on both of those channels -- they had already been cleared out. So obviously progress has been made and we're not too far from a resolution (imo).

p.s. Asterisk SVN-branch-1.4-r66414



By: Leif Madsen (lmadsen) 2007-05-30 23:04:09

oej: some more history for you (this is what I believe to be a hung channel on revision 62005)

http://www.pastebin.ca/524416

By: Olle Johansson (oej) 2007-05-31 05:04:09

revolution: these histories indicate an open call - there's no bye and a proper call setup. In order to have asterisk to destroy these, someone needs to send a bye or you need an absolute timeout or rtptimers - Asterisk can't destroy properly setup calls randomly

By: Olle Johansson (oej) 2007-05-31 05:10:02

blitzrage: where's the SIP history for these calls?

By: daniel g (revolution) 2007-05-31 07:45:07

oej -- in this case, a call was setup and kept open for almost 19 minutes, then it was terminated and cdr was written. Does Asterisk not wait to write CDR until after a BYE is received?

SIP/0-08327cc0,Dial,SIP/0|20,1132,1125,ANSWERED,3,acctcode,0,0,0

sip show history

1. NewChan Channel SIP/0-08327cc0 - from 129170fb24c30c5948d89b201
2. TxReqRel INVITE / 102 INVITE - -UNKNOWN-
3. Rx SIP/2.0 / 102 INVITE / 100 Trying
4. Rx SIP/2.0 / 102 INVITE / 180 Ringing
5. Rx SIP/2.0 / 102 INVITE / 200 OK
6. TxReq ACK / 102 ACK - -UNKNOWN-
7. Rx INVITE / 1 INVITE / sip:+0@aaa.bbb.ccc.ddd
8. Hold INVITE
9. ReInv Re-invite received
10. TxResp SIP/2.0 / 1 INVITE - 200 OK
11. Rx ACK / 1 ACK / sip:+0@aaa.bbb.ccc.ddd
12. Rx INVITE / 2 INVITE / sip:+0@aaa.bbb.ccc.ddd
13. Unhold INVITE
14. ReInv Re-invite received
15. TxResp SIP/2.0 / 2 INVITE - 200 OK
16. Rx INVITE / 3 INVITE / sip:+0@@aaa.bbb.ccc.ddd
17. TxResp SIP/2.0 / 3 INVITE - 491 Request Pending
18. Rx ACK / 3 ACK / sip:+0@aaa.bbb.ccc.ddd
19. Rx ACK / 3 ACK / sip:+0@aaa.bbb.ccc.ddd
20. SchedDestroy 32000 ms
21. CancelDestroy

Is it a possibility that * received the 'BYE' and acknowledged it to write CDR -- but not to close the channel? I will attempt to reproduce these errors and capture the relevant information if I can.

By: Leif Madsen (lmadsen) 2007-05-31 08:24:07

Crap, sorry oej... it was really late, and I was talking to Josh, and those were the things he wanted.

Oh ya! I did make a bad pastebin, and it happens to have the SIP history in it as well:

http://www.pastebin.ca/524402

By: daniel g (revolution) 2007-05-31 10:51:58

oej... I have a capture of the cause of one of these hung channels (specifically it is the "488 Not acceptable here" which is caused by the ATA attempting to switch from 729 to ulaw in order to receive a fax and Asterisk having been set to disallow ulaw on this account).

When the ATA is set to disallow ulaw and allow g729... all involved (Provider, ATA, Test Caller) request a BYE; however, * doesn't send BYE to the ATA (thus resulting in the hung channel I'd suppose?) -- could this be due to the "Not Acceptable Here" msg?

When the ATA is set to allow ulaw and disallow g729... everything functions as expected (i.e. BYE is requested by and sent to all)... faxing is moderately successful.

When ATA is set to allow both ulaw and g729... same results as a: ulaw and d: g729 (BYE requested and sent to all).

So my only question is this -- why is * not sending a BYE after a "488 Not Acceptable Here"

oej if you would like to see the captures... please let me know...

By: Lucas Martinez (martlucas_gtt) 2007-06-13 09:27:05

I have the same problem sending call to a host not listening in 5060 port, is this solved in the svn trunk right now?

By: Leif Madsen (lmadsen) 2007-06-13 09:42:08

martlucas_gtt: no, this is not resolved -- please attach as much debugging information as possible as requested from oej in the bug notes.

Thanks!

By: Russell Bryant (russell) 2007-06-13 11:20:13

Changing from "assigned" back to "new" since it's not actually assigned to anyone.  I'm looking at this one now, too, though ...

By: Russell Bryant (russell) 2007-06-13 15:06:43

jcmoore's hung_bye.diff.txt has been merged into 1.4 and trunk in revisions 69183 and 69187.

The commit message:

Move the logic for destroying a call when no response is received to a BYE
outside of the block that checks for FLAG_FATAL to be set.  This flag is only
set when the packet is transmitted with the reliability set to XMIT_CRITICAL
when the original packet is transmitted.  A BYE is always sent with it set
to XMIT_RELIABLE, meaning this code could never be encountered.  This resulted
in seeing some SIP channels that would never go away with the last packet
sent being a BYE.
(part of issue ASTERISK-8969, patch from jcmoore)

By: Lucas Martinez (martlucas_gtt) 2007-06-13 15:26:11

Could we apply the "hung_bye.diff.txt" patch or we need to do other stuff more? This is talkin about the 1.4.4 stable version



By: Leif Madsen (lmadsen) 2007-06-13 17:27:48

martlucas_gtt: uhhhhh... sure, go ahead and apply to 1.4.4. It may or may not apply depending on what has changed.

By: Curt Moore (jcmoore) 2007-06-14 01:31:32

I've uploaded hung_cancel_log.txt which shows Asterisk sending a CANCEL after an INVITE times out; all is normal there.  The odd thing is that the far end responds with a "200 canceling" and then sends a "180 Ringing"!?!  This confuses Asterisk since upon the receipt of the "180 Ringing" we cancel the destroy of the channel.  Since the destroy has been canceled and there are no further dialogs received, the SIP channel is hung forever after sending a CANCEL to which it received a "180 Ringing".  Is there any scenario in which we would care about anything other than an ACK after we issue a CANCEL?

I haven't quite figured out the best way to go about checking for this so I need some input.

It would appear that in my hung_cancel_log.txt example that we sent a CANCEL in step 5 as a result of the scheduled destroy time expiring in step 4.  This would seem to indicate that we were in sip_hangup().  It appears that we sent the CANCEL in the section of code with the comment "Send a new request: CANCEL".  
 
It would seem that correct thing would be to set SIP_PKT_IGNORE on the 180 which comes in after we've sent the CANCEL.  This way we would not cancel the destroy in the 180 case of handle_response_invite() which was scheduled as a result of our sending of the CANCEL as happens in step 9.  It would seem the proper place to set this flag would be in handle_request() but I'm not exactly sure of the logic to correctly do this.

I hope this makes sense...

By: Leif Madsen (lmadsen) 2007-06-14 16:30:37

I am able to reproduce a hung channel on my test system. I make a call to a remote Asterisk box via a DID number that goes through our softswitch. First call answers fine (DTMF doesn't work... but that's a different bug). Second call gets no audio, and when I hang up, gets hung.

Topology:

Polycom IP501 --> SER_1 (tampa) --> Asterisk (1.4, r69258) --> SER_1 (tampa) --> softswitch --> SER_2 (miami) --> Asterisk (1.4, r62005) --> MeetMe()

I've attached 2 SIP history files:  history (good call), and history (bad call).

If you need more information, or access to the box, let me know as I can reproduce this now.

By: Russell Bryant (russell) 2007-06-15 15:24:30

I just fixed the hung channel that blitzrage just reported in 1.4 in revision 69583.  It was not a problem anymore in trunk.

By: Russell Bryant (russell) 2007-06-15 18:08:19

1.4.5 has been released which contains fixes for a lot of the issues reported here.  Everyone that is having any problems with this, please update to (at least) 1.4.5, or later code from svn and report back if you are still having problems.

Then, we can address the remaining issues from there.

Thanks!

By: daniel g (revolution) 2007-06-18 00:05:42

All involved: I've updated to Ast 1.4.5 as of an hour ago on two boxes (one of which uses Queues to direct calls and the other of which uses Dial). I'll let both of these run until next Saturday before checking for lingering channels and updating everyone.

Thanks to everyone for your work/help on this.

By: drrt (drrt) 2007-06-21 16:42:35

moved my system to latest release(1.4.5). there is no hanging sip channels for about three days. the problem is solved at my point. many thanks to developers  who fixed the bug.

By: Leif Madsen (lmadsen) 2007-06-21 18:15:33

Upgraded to 1.4.5 on the day it was released in one of our production systems, and have yet to see a hung channel. So far so good. Will report back again if I see something.

By: Ronald Chan (loloski) 2007-06-22 11:45:43

I will upgrade my system from SVN branch 1.2 to SVN 1.4 i hope and pray that this issue was finally rest to end :), since this is a production server with 30-40 mixed in/out bound SIP calls during peak time with queues.

I will surely report whatever happen for the next couple of days.


Huge thanks to everyone



By: Bluce Ree (tasker) 2007-06-22 19:02:14

Ditto for me on 1.4.5. Systems are loaded here and no hung channels yet. By this time, i'd be forced to restart due to resource saturation.



By: Andrey Solovyev (corruptor) 2007-06-23 04:35:02

After upgrade to 1.4.5 I don't have hung channels too.

By: Ronald Chan (loloski) 2007-06-23 20:40:09

To All,

Great! so far so good, after passing our peak time there are no more hung channel on ACK and INVITE :)

Thanks again oej and the rest.

By: Russell Bryant (russell) 2007-06-23 23:53:14

Thanks for the feedback, everyone.  I am going to close this one out as fixed.  If you have any further problems, please just create a new report.  Thanks!