[Home]

Summary:ASTERISK-12034: some RTP packets sent to NAT IP instead of public IP; breaks built-in jitterbuffer on some phones
Reporter:jolan (jolan)Labels:
Date Opened:2008-05-16 15:55:00Date Closed:2008-11-19 16:34:57.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/RTP
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 14badcall.pcap
( 1) 14goodcall.pcap
( 2) badnat.pcap
( 3) console_debug_output.txt
( 4) iaxtopolycom.pcap
Description:I am experiencing the same symptoms as ASTERISK-1249566 but with 1.4.20-rc3, not trunk.  The bug was fixed in trunk, but not in 1.4.x.

To reiterate the problem:

Phone 100 calls phone 102.  Phone 102 answers and starts counting "1 2 3 4 5".  Phone 100 doesn't hear anything until "3" or "4".

****** STEPS TO REPRODUCE ******

- phone 100 dials phone 102
- phone 102 rings and we see RTP start flowing back and forth:
   -- SIP/102-089ab718 is ringing
  Got  RTP packet from    76.171.35.171:54729 (type 00, seq 020075, ts 3229712, len 000160)
  Sent RTP packet to      76.171.35.171:54729 (type 00, seq 005419, ts 000480, len 000160)
- phone 102 gets answered and you can see the NAT address is used:
  Peer audio RTP is at port 10.0.0.96:2236
- You start to see packets being sent to the NAT address
 Got  RTP packet from    76.171.35.171:54729 (type 00, seq 020423, ts 3285392, len 000160)
 Sent RTP packet to      10.0.0.96:2236 (type 00, seq 037168, ts 3285392, len 000160)
 Got  RTP packet from    76.171.35.171:54729 (type 00, seq 020424, ts 3285552, len 000160)
 Sent RTP packet to      10.0.0.96:2236 (type 00, seq 037169, ts 3285552, len 000160)
- But somehow it recovers:
 Got  RTP packet from    76.171.35.171:54729 (type 00, seq 020433, ts 3286992, len 000160)
 Sent RTP packet to      10.0.0.96:2236 (type 00, seq 037178, ts 3286992, len 000160)
 Got  RTP packet from    76.171.35.171:54477 (type 00, seq 021866, ts 1917181368, len 000160)
 Sent RTP packet to      76.171.35.171:54729 (type 00, seq 005767, ts 1917181368, len 000160)
 Got  RTP packet from    76.171.35.171:54729 (type 00, seq 020434, ts 3287152, len 000160)
 Sent RTP packet to      76.171.35.171:54477 (type 00, seq 037179, ts 3287152, len 000160)
 Got  RTP packet from    76.171.35.171:54477 (type 00, seq 021867, ts 1917181528, len 000160)
 Sent RTP packet to      76.171.35.171:54729 (type 00, seq 005768, ts 1917181528, len 000160)
 Got  RTP packet from    76.171.35.171:54729 (type 00, seq 020435, ts 3287312, len 000160)

I believe either the early media or the packets to the NAT address trigger a reset/timeout in the jitterbuffer on the phone, hence the silence.

This isn't specific to any phone, but some phones handle it better than others.  The amount of audio lost seems to vary upon:

- Phone model (I've experienced this both with Cisco 7940s and Polycom 601s)
- Codec (I believe since compressed codecs use less of the jitterbuffer's memory for the same amount of audio)
- Time the phone was ringing for (I had problems reproducing this because I get answering the calls right away.  Seems roughly 1 ring = 1 second of audio lost)
Comments:By: jolan (jolan) 2008-05-16 15:58:51

You can see the first packet sent to NAT in badnat.cap; packet number 984 according to wireshark.

Also, I have attached the console debug output log.  The excerpts above are from said log.

By: Joshua C. Colp (jcolp) 2008-05-16 16:00:29

We can't know the IP address to actually send to until the remote party sends us a packet, we then use the source IP address and port on subsequent RTP transmissions. I see nothing wrong in the log you provided, once the phone sent an RTP packet we switched over and sent it to the right place.

By: jolan (jolan) 2008-05-16 16:02:20

Please note that the "Steps to Reproduce" section is filled out and can be see with "View Advanced" toggled.

By: jolan (jolan) 2008-05-16 16:35:24

I'm not sure I quite understand you.  Asterisk is clearly sending early media to the proper address & port when the phone rings. Then when it's answered Asterisk starts sending to the NAT IP.  When those timeout it starts sending to the proper address & port again.

My guess is that sending packets to the NAT address, waiting for a timeout, and then re-transmitting to the public IP introduces enough delay to confuse the built-in jitterbuffer in the phone (which was being filled with media when the phone rang).

I have reproduced this with a Cisco 7940 and a Polycom 601 but the reproducibility percentage is different between the two.

Is there anything else I can check?

By: Joshua C. Colp (jcolp) 2008-05-16 16:41:06

It sends audio to the NATted IP address and port of the phone you called. That phone sent back a 200 OK with an RTP address of 10.0.0.96 port 2236. Once that phone sent a packet to Asterisk and Asterisk saw it's source IP address and port it switched over. Now for the phone that you placed the call from Asterisk never actually changed the IP address and port it was sending to. It was always 76.171.35.171 port 54729

By: Joshua C. Colp (jcolp) 2008-05-16 16:42:51

You will also notice that there is a gap between when the 200 OK answered and when packets actually went to the phone you called from, because the phone you placed a call to did not send any for a short period of time.

By: jolan (jolan) 2008-05-16 16:50:27

Hmmm, I wonder if this is just an interop issue with Polycom phones.  The original submitter for ASTERISK-1249566 was using a Polycom as well.  I have seen this on Cisco phones as well, but come to think of, that may have been when the other phone in the call was a Polycom.

I also just noticed ASTERISK-1254615 which involves DTMF problems with Polycoms and Asterisk 1.4.  I have also been experiencing this issue.

I guess I'll do some comparisons of call captures between Polycom <-> Polycom phones, Cisco <-> Polycom, and Cisco <-> Cisco.

By: Joshua C. Colp (jcolp) 2008-05-16 16:52:55

12566 had nothing to do with NAT, it was an internal locking issue on the Asterisk channel itself.

There's nothing that we can do to solve what you are seeing. Until the other side sends us audio we have NO idea they are behind NAT. The best we can do is send audio to where they told us until then.

By: jolan (jolan) 2008-05-16 17:15:03

Found this in the polycom firmware revision notes:

35228: Phone may have one-way audio when SDP is received with c line below m line

You can definitely see this with packet ASTERISK-969.  I'll upgrade and confirm.  Firmware upgrade may also fix  ASTERISK-1254615.

By: Jason Parker (jparker) 2008-05-19 12:39:13

I know it's only been a few days, but have you had a chance to upgrade the firmware and verify that this (and ASTERISK-11999) were fixed?

If so, this is definitely something useful for us to keep in mind in the future.

By: jolan (jolan) 2008-05-19 12:43:21

I did upgrade the firmware on friday but the issue was still there.

Right now I'm doublechecking my work since config files had to be augmented for the firmware upgrade.  

I'll hopefully post another update later today.

By: jolan (jolan) 2008-06-02 17:44:43

Sorry for the delay.  We had some old Polycom models (300/500) that required extra attention since they do not support the same firmware as other models.

After upgrading 300/500 phones to 2.1.3 and other phones to 2.2.2, I am still experiencing audio loss.

There is a 3.0.2RevC version of firmware but I don't have access to it so I haven't tried it.

BTW, I saw issue ASTERISK-11793 today which is entitled 'lost packets when using Polycom 550 and ulaw codec to outbound sip also using ulaw, if change polycom to alaw no lost packets'.   I tried switching the codec to alaw per the submitters suggestion but I don't see any improvement.  However, he sees audio loss reported by RTCP and I don't.  If he's using a 550 it's quite likely that he is using the Polycom 3.x firmware branch.  I believe that version has improved support for RTCP so that might account for the discrepancy.

By: jolan (jolan) 2008-06-02 17:55:12

Looking at packet number 975 in the badnat.pcap again:

Ethernet II, Src: Cisco_c2:f9:4e
f: "7940 g711" <sip:100@209.242.35.6>;tag=as4f924dfb
t: <sip:102@10.0.0.96:5060>;tag=429ED0C-8F026405
User-Agent: PolycomSoundPointIP-SPIP_601-UA/2.0.1.0313

Am I reading that right?  A packet from the Cisco has the User-Agent set to Polycom?



By: jolan (jolan) 2008-06-03 19:00:18

I've moved the phones/server to my local LAN and the problem still exists.  I attached a captures of both a problem call (14badcall.pcap) and a call that had no audio problems (14goodcall.pcap).

I don't see any obvious time gaps, different packet contents, different packet ordering, etc. but it'd be nice if someone could confirm that.

I'm going to revert to Asterisk 1.2 and get a capture of a call there to see what's different that is freaking these Polycoms out.



By: jolan (jolan) 2008-06-05 17:55:12

I'm pretty sure this is an Asterisk bug related to RTP timestamps.

With Asterisk-1.2 Cisco<->Polycom, the timestamp of the first RTP packet is set to 0.

With Asterisk-1.4 IAX<->Polycom, the timestamp of the first RTP packet is set to 160.

With Asterisk-1.4 Cisco<->Polycom, the timestamp of the first RTP packet is seemingly random.

By: jolan (jolan) 2008-06-05 19:11:00

This seems to workaround the issue for me.

Index: rtp.c
===================================================================
--- rtp.c (revision 120859)
+++ rtp.c (working copy)
@@ -2663,8 +2663,10 @@
if (rtp->lastts > rtp->lastdigitts)
rtp->lastdigitts = rtp->lastts;

+ /* XXX polycoms can't cope with high timestamps on the initial packet
if (ast_test_flag(f, AST_FRFLAG_HAS_TIMING_INFO))
rtp->lastts = f->ts * 8;
+ */

By: Joshua C. Colp (jcolp) 2008-06-10 08:05:50

I find it hard to believe that the Polycom can't handle that... per the RFC:

The initial value of the timestamp SHOULD be random, as for the sequence number.

So it shouldn't matter where the timestamp starts...

By: jolan (jolan) 2008-06-10 08:55:06

I think this is actually more related to ASTERISK-12042 but I haven't tried the diff in that bug yet.

By: jolan (jolan) 2008-06-20 15:21:49

The patch in ASTERISK-12042 lowers the reproducibility rate but I think it's incomplete.

And yes, I know that the RFC recommends a random initial value.

I can reproduce this across the 2.0.x, 2.2.x, and 3.x Polycom firmware branches with every model I tried (300, 301, 500, 501, 601).

I've tried other patches from the bug tracker in addition to my own.  I can confirm it definitely isn't a NAT issue or else this probably would have fixed it:

  if (p->rtp) {                                                                
     if (portno > 0) {                                                        
        sin.sin_port = htons(portno);                                          
+        if (ast_test_flag(&p->flags[0], SIP_NAT)) {                            
+           if (debug)                                                          
+              ast_verbose("NAT active; using %s instea                        
d of %s\n", ast_inet_ntoa(p->recv.sin_addr), ast_inet_ntoa(sin.sin_addr));      
+           sin.sin_addr = p->recv.sin_addr;                                    
+        }                                                                      
        ast_rtp_set_peer(p->rtp, &sin);                                        
        if (debug)                                                            
           ast_verbose("Peer audio RTP is at port %s:%d\n",                    
ast_inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));

But nothing has solved the issue 100% other than the workaround patch I already posted:

Index: rtp.c
===================================================================
--- rtp.c (revision 120859)
+++ rtp.c (working copy)
@@ -2663,8 +2663,10 @@
    if (rtp->lastts > rtp->lastdigitts)
        rtp->lastdigitts = rtp->lastts;

+ /* XXX polycoms can't cope with high timestamps on the initial packet
    if (ast_test_flag(f, AST_FRFLAG_HAS_TIMING_INFO))
        rtp->lastts = f->ts * 8;
+ */

I don't know what else I need to do to prove that this is a real problem.  You don't have a polycom phone to try to reproduce?  This is still a problem for me with stock 1.4.21.

By: Terry Wilson (twilson) 2008-11-19 16:24:50.000-0600

Ok, I've tried to reproduce this with two polycoms running 3.0.1.0032 and I can not reproduce this.  I've tried w/ a NAT, w/o a NAT, adding inbound and outbound latency to the network, and I just cannot reproduce this.  If you run directly off of the 1.4svn branch, do you still get this problem?

Also, unless you have a really good reason to, you shouldn't Answer() before the Dial().  Also, under most circumstances, you don't need to pass the 'r' option to dial either.

By: jolan (jolan) 2008-11-19 16:30:21.000-0600

I now believe this to be a dupe of ASTERISK-12042.  You can close this.

By: Terry Wilson (twilson) 2008-11-19 16:34:56.000-0600

Closing at request of reporter in favor of 12679