[Home]

Summary:ASTERISK-05060: asterisk silently stops accepting IAX calls
Reporter:Matt Ranney (mjrmjr2)Labels:
Date Opened:2005-09-12 10:02:49Date Closed:2011-06-07 14:00:26
Priority:BlockerRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) better-workaround-patch
( 1) chan_iax2.c.negtsdetect.diff
( 2) jbdebug1.txt
( 3) jbdebug2.txt
Description:4 times now over the past week I've observed the following behavior:

1) asterisk appears to be up, CLI is reponsive
2) no IAX traffic has been seen by peers for a while
3) new IAX calls time out due to lack of response

My PRI hardware is a TE411P, all calls flow from IAX into the PRI, or from IAX to IAX.  Jitter buffer is enabled, and g729 is in use.  

What would be useful for me to do the next time this happens?

Comments:By: Matt Ranney (mjrmjr2) 2005-09-12 10:12:52

Some other things to note:

1) The PRI is plugged in to port 4, not port 1.
2) I have debug logging on.  Here's what it looked like right before the crash:


Sep 11 09:37:32 VERBOSE[23715] logger.c:     -- Executing Dial("IAX2/TCC1-OLY1-11", "Zap/r1/1NPANXX1234") i
n new stack
Sep 11 09:37:32 VERBOSE[23715] logger.c:     -- Requested transfer capability: 0x00 - SPEECH
Sep 11 09:37:32 DEBUG[19300] channel.c: Avoiding initial deadlock for 'Zap/94-1'
Sep 11 09:37:32 VERBOSE[23715] logger.c:     -- Called r1/1NPANXX1234
Sep 11 09:37:32 DEBUG[19306] chan_zap.c: Queuing frame from PRI_EVENT_PROCEEDING on channel 0/22 span 4
Sep 11 09:37:32 VERBOSE[23715] logger.c:     -- Zap/94-1 is proceeding passing it to IAX2/TCC1-OLY1-11
Sep 11 09:37:33 DEBUG[19308] chan_iax2.c: Ooh, voice format changed to 256
Sep 11 09:37:35 DEBUG[19306] chan_zap.c: Enabled echo cancellation on channel 94
Sep 11 09:37:35 VERBOSE[23715] logger.c:     -- Zap/94-1 is ringing
Sep 11 09:37:41 NOTICE[19308] chan_iax2.c: Peer 'TCC1-BDD1' is now UNREACHABLE! Time: 680
Sep 11 09:37:48 DEBUG[19306] chan_zap.c: Echo cancellation already on
Sep 11 09:37:48 VERBOSE[23715] logger.c:     -- Zap/94-1 answered IAX2/TCC1-OLY1-11
Sep 11 09:37:55 NOTICE[19308] chan_iax2.c: Peer 'TCC1-ARL1' is now UNREACHABLE! Time: 13
Sep 11 09:37:56 NOTICE[19308] chan_iax2.c: Peer 'TCC1-JPH1' is now UNREACHABLE! Time: 55
Sep 11 09:37:58 NOTICE[19308] chan_iax2.c: Peer 'TCC1-CMH1' is now UNREACHABLE! Time: 60
Sep 11 09:37:58 NOTICE[19308] chan_iax2.c: Peer 'TCC1-TRN1' is now UNREACHABLE! Time: 38
Sep 11 09:38:02 NOTICE[19308] chan_iax2.c: Peer 'TCC1-HQ1' is now UNREACHABLE! Time: 3
Sep 11 09:38:02 NOTICE[19308] chan_iax2.c: Peer 'TCC1-TCC2' is now UNREACHABLE! Time: 3
Sep 11 09:38:04 WARNING[19308] chan_iax2.c: Max retries exceeded to host 1.2.3.4 on IAX2/TCC1-OLY1-
11 (type = 4, subclass = 15, ts=843, seqno=3)
Sep 11 09:38:04 DEBUG[23715] channel.c: Didn't get a frame from channel: IAX2/TCC1-OLY1-11
Sep 11 09:38:04 DEBUG[23715] channel.c: Bridge stops bridging channels IAX2/TCC1-OLY1-11 and Zap/94-1
Sep 11 09:38:04 DEBUG[23715] chan_zap.c: Set option AUDIO MODE, value: ON(1) on Zap/94-1
Sep 11 09:38:04 DEBUG[23715] chan_zap.c: Hangup: channel: 94 index = 0, normal = 103, callwait = -1, thirdc
all = -1
Sep 11 09:38:04 DEBUG[23715] chan_zap.c: Not yet hungup...  Calling hangup once with icause, and clearing c
all
Sep 11 09:38:04 DEBUG[23715] chan_zap.c: disabled echo cancellation on channel 94
Sep 11 09:38:04 DEBUG[23715] chan_zap.c: Set option TDD MODE, value: OFF(0) on Zap/94-1
Sep 11 09:38:04 DEBUG[23715] chan_zap.c: Updated conferencing on 94, with 0 conference users
Sep 11 09:38:04 DEBUG[23715] chan_zap.c: Set option AUDIO MODE, value: OFF(0) on Zap/94-1
Sep 11 09:38:04 DEBUG[23715] chan_zap.c: disabled echo cancellation on channel 94
Sep 11 09:38:04 VERBOSE[23715] logger.c:     -- Hungup 'Zap/94-1'
Sep 11 09:38:04 DEBUG[23715] app_dial.c: Exiting with DIALSTATUS=ANSWER.
Sep 11 09:38:04 VERBOSE[23715] logger.c:   == Spawn extension (from-iax, 91NPANXX1234, 1) exited non-zero o
n 'IAX2/TCC1-OLY1-11'
Sep 11 09:38:04 DEBUG[23715] chan_iax2.c: We're hanging up IAX2/TCC1-OLY1-11 now...
Sep 11 09:38:04 NOTICE[23715] sched.c: Attempted to delete nonexistent schedule entry 75462813!
Sep 11 09:38:04 VERBOSE[23715] logger.c:     -- Hungup 'IAX2/TCC1-OLY1-11'
Sep 11 09:38:08 NOTICE[19308] chan_iax2.c: Peer 'TCC1-OAK1' is now UNREACHABLE! Time: 113
Sep 11 09:38:13 NOTICE[19308] chan_iax2.c: Peer 'TCC1-AH2' is now UNREACHABLE! Time: 36
Sep 11 09:38:33 NOTICE[19308] chan_iax2.c: Peer 'TCC1-BSR1' is now UNREACHABLE! Time: 663
Sep 11 09:38:36 NOTICE[19308] chan_iax2.c: Peer 'TCC1-OLY1' is now UNREACHABLE! Time: 654
Sep 11 09:53:19 VERBOSE[19306] logger.c:     -- B-channel 0/1 successfully restarted on span 4
Sep 11 09:53:19 VERBOSE[19306] logger.c:     -- B-channel 0/2 successfully restarted on span 4
Sep 11 09:53:19 VERBOSE[19306] logger.c:     -- B-channel 0/3 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/4 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/5 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/6 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/7 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/8 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/9 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/10 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/11 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/12 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/13 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/14 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/15 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/16 successfully restarted on span 4
Sep 11 09:53:20 VERBOSE[19306] logger.c:     -- B-channel 0/17 successfully restarted on span 4
[...]

etc.  Then the only messages in the log are the hourly B channel restarts.



By: Mark Spencer (markster) 2005-09-12 22:40:23

Need to do a clean build with "make clean ; make valgrind" and have -DDEBUG_THREADS turned on in the Makefile.  When it gets stuck in this mode, generate a core by attaching gdb, then you can restart if you need to (or leave it in the stuck state) and let me login to try to analyze the core.  It is important that you have very latest CVS head.

By: Matt Ranney (mjrmjr2) 2005-09-13 14:57:49

OK, I've done this and it is now stuck again.  It happened much more quickly this time.

Let me know what else I can do, or email me for access to the machine: mjr@ranney.com.

By: Kevin P. Fleming (kpfleming) 2005-09-14 20:25:40

You will need to contact someone on #asterisk-bugs on FreeNode IRC to help you generate a core dump/backtrace when this problem occurs again (or you can try reading doc/README.backtrace in the source tree, it should have enough info to help you get a backtrace made). Emailing back and forth will probably be too slow, you'll need to get your system restarted before someone will be able to help you make a trace :-)

By: Matt Ranney (mjrmjr2) 2005-09-15 17:26:42

Now that I'm armed with all this information, I'm not sure if I'm happy or sad that it hasn't locked up in well over a day.

By: Russell Bryant (russell) 2005-09-16 13:36:01

Have you done an update since the last lockup?  I'm just wondering about the possibility than an update may have already fixed your problem ...



By: Matt Ranney (mjrmjr2) 2005-09-16 13:47:31

Yes, I did do an update of asterisk, libpri, and zaptel.  I'm also battling a spurious DTMF detection problem with a TE411P on this server.

Its certainly possible that this update fixed the problem.  It is also the case that some site admins are wary of sending traffic through this server due to its  instability.  There has been less traffic on the server for the past couple of days.

By: Russell Bryant (russell) 2005-09-18 23:09:42

Ok, if your system locks up again, be sure to get the backtrace and re-open this bug.  I'm just going to close it out for now pending the backtrace.

As far as the issues with the TE411P go, contact support@digium.com about it.  They should be able to help you resolve any problems with their hardware.

Thank you!

By: Russell Bryant (russell) 2005-09-20 12:36:47

Reopened at the request of the original reporter.

By: Matt Ranney (mjrmjr2) 2005-09-20 12:42:29

It is happening again.  I have noticed that the IAX socket does not seem to be getting serviced:

Proto Recv-Q Send-Q Local Address               Foreign Address             State      
udp   110512      0 0.0.0.0:4569                0.0.0.0:*                              


I can't seem to get a usable core file from gdb though.  I've attached gdb to the leader pid, run gcore, then run gdb /usr/sbin/asterisk core.pid.  When I start gdb with that core file, I get:

Core was generated by `/usr/sbin/asterisk'.
warning: svr4_current_sos: Can't read pathname for load map: Input/output error
Cannot access memory at address 0xb7f73408
#0  0x00542d4e in ?? ()

While gdb is attached though I can set breakpoints and do all the normal gdb things.

Is there something else specifically I could do with gdb to either generate a valid core file or help narrow down what is happening?

By: Mark Spencer (markster) 2005-09-21 08:41:52

In the absense of a useful core file, I think I will likely need to login to the machine when it is in the stuck state.  If you have another system, you can take this one out of service when the problem occurs and put the backup into operation and *leave asterisk running* on the system that has failed...  Please e-mail me your contact info (markster@digium.com).  Thanks for your patience.

By: Matt Ranney (mjrmjr2) 2005-09-24 14:01:06

Update, this machine has been running well for a few days now.  Last time it took about this long to die.  I'm keeping my eye on it, and I'll report back when something changes.

By: Kevin P. Fleming (kpfleming) 2005-09-26 15:49:23

Mark has told me that we now have a report of another user experiencing this symptom, and it appears to be the jitterbuffer entering some sort of odd state and not returning back to chan_iax2 when it should. You may be able to work around the  problem by compiling in the old jitter buffer, although it would be best if you could leave it as is so we can figure out the cause...

By: Matt Ranney (mjrmjr2) 2005-10-10 19:04:05

Has any progress been made on this?  I still get these lockups every few days.  markster has looked at the machine at least once, but he wasn't able to figure out what was happening.

By: Matt Ranney (mjrmjr2) 2005-10-11 11:32:10

I've gathered some other data on the situation.  Some users are reporting that while IAX is working they sometimes get several seconds of broken audio.  The way they describe it, it sounds like a problem with the jitter buffer.  It seems like this might be another symptom of the underlying problem.  Other users of this machine as a PSTN gateway seem to have pretty good performance.  Here is an ASCII flow diagram:  

7940--SIP/g.711 -- A -- IAX2/g.729 --- B --- IAX2/g.711 --- C -- SIP/g.711--7940

A and B are running CVS HEAD.  C is running 1.0.9.  B is the machine with the IAX lockup, and it is running forcejitterbuffer=yes.  7940 behind A reports acceptable quality.  7940 behind C reports occasional 1-2 seconds of choppy audio then things stabilize.

Then there is this:

7940--SIP/g.711 -- A -- IAX2/g.729 --- B --- PRI

Here A and B are the same machines as above, but the parties on the other end of the PRI call never report any dropouts.


So the jitterbuffer seems to be doing a poor job when going from IAX2/g.729 to IAX2/g.711.  Could this be causing the ultimate lockup?

By: gkloepfer (gkloepfer) 2005-10-11 16:18:13

I am noticing a similar problem with -HEAD (Nov 7, 2005) on FreeBSD.  I
thought it was a FreeBSD-related issue.

Do you notice your CPU going to 100% when this happens, and does it appear
to all be in "system" state (as reported by "top")?

I was able to get this to happen using the following case:

SIP (G.711) --> PBX A --- (IAX) ---> PBX B (local recording (gsm))

The problem happens faster using certain SIP phones and when I make multiple
calls at the same time.  Also note that the IAX link is over the Internet,
which would definitely point to a possible jitterbuffer issue.

I just started looking into this, so as soon as I get some more detailed
data, I will report back as well.  Just wanted you to know you're not the
only one with the trouble.

The exact symptom I see is that I will begin getting very broken audio
for several seconds, then all IAX connections will shut down.  The only
way they can be brought back up is to restart Asterisk.  Under netstat,
I can see there are loads of packets in the receive queue for port 4569,
meaning Asterisk simply isn't taking the packets.  The CPU is pegged at 100%,
mostly stuck in system wait state (truss on FreeBSD shows that Asterisk is
doing kse_release over and over again, but this may simply be a strange
FreeBSDism).

By: Mark Spencer (markster) 2005-10-11 16:25:24

The problem appears to be in the jitter buffer itself.  I'd like to again try to make contact with someone who has the server stuck in this state since Steve is here in principle, and maybe i can track him down...

By: gkloepfer (gkloepfer) 2005-10-11 16:41:01

Agreed.  I just set jitterbuffer=no in iax.conf and the problem stopped
happening.

I believe I can get Asterisk in this state pretty consistently.  Is there
something in particular you'd like me to try? (schedule is really really bad
this week, or I'd try my first IRC debugging session with you...)

By: Matt Ranney (mjrmjr2) 2005-10-11 20:26:59

This sounds exactly like my situation.  100% CPU, IAX socket receive buffer full, etc.  I'm running on Linux.

For me the problem only happens every few days, but when it happens again I'll let you know Mark.

By: gkloepfer (gkloepfer) 2005-10-12 07:55:07

I reproduced the problem with "iax2 jb debug" set and put the output from
the console into jbdebug1.txt.

The problem occured when the LLLLLLLLL was printed by the jitterbuffer debugging
routines.

By: gkloepfer (gkloepfer) 2005-10-12 08:47:03

jbdebug2.txt is another instance of the failure occuring with
DEEP_DEBUG turned on in jitterbuf.c.

It looks like the problem is happening when the two timestamps in jb_put
are negative numbers.

If I understand the new jitterbuf code correctly, it looks like there is
a thread created that actually handles the jitter buffer on a channel.
What appears to happen is that the thread that handles the jitterbuf for IAX2
eventually dies and thus no more IAX2 packets can be sent/received
until Asterisk is restarted.

One thing I can say is that the timestamps in the iax_frame are unsigned ints and seem to have a very specific format.  The timestamps in jb_put are longs.  So if the high-order bit of the IAX2 timestamps are set, that's going to create a negative number in jb_put.

By: gkloepfer (gkloepfer) 2005-10-12 09:33:10

Okay, that was it.  The timestamps in jb_put cannot be negative.  If I
put the following hack into chan_iax2.c:

Change:
       ret = jb_put(iaxs[fr->callno]->jb, fr, type, len, fr->ts,
                       calc_rxstamp(iaxs[fr->callno],fr->ts));

To:
       ret = jb_put(iaxs[fr->callno]->jb, fr, type, len, fr->ts & 0x7fffffff,
                       calc_rxstamp(iaxs[fr->callno],fr->ts) & 0x7fffffff);

I now cannot get the jitterbuffer thread to die, like it was before.  In
fact, I had 5 telephones calling over the IAX link during this test, and
while the audio broke-up (because the link was saturated) and the 5th phone
usually got Congestion (because the link was congested), everything
recovered as soon as the jitter got back to a reasonable level.

So this problem looks like a type mismatch between an unsigned int and a long.
Is the new jitterbuffer code really supposed to be handling negative
timestamps, or should the jitterbuffer code be using unsigned longs instead
of longs?

By: stevekstevek (stevekstevek) 2005-10-24 13:49:27

mjrmjr2: have you tried gkloepfer's patch?

As far as dealing with negative timestamps -- It seems there must be a bug somewhere along the line to even get a negative value for fr->ts, right?  Why would a timestamp ever be generated like that?

Also, it would be good to see whether it's fr->ts that's negative, or calc_rxstamp() that's returning negative, etc.

In both these cases, timestamps are coming originally from some SIP equipment, and then being massaged by chan_sip, then chan_iax2 on output, before being received by the box running the JB code.  On the chan_iax2 side, timestamps are defined as being unsigned, but if it gets converted to a signed number and becomes negative, either more than 2^31 ms (~600 hours or 24 days) should pass.

So, I think there ought to be two things to do (and _both_ should be done):

1) Trace back the packets in question with ethereal or something, and find out what the packet's timestamp is when it's sent from the SIP equipment, and what it looks like on the IAX link in-between boxes.  This should help you determine whether it's something going wrong with the SIP equipment originally, or the box doing SIP<->IAX bridging.
a) If it's the SIP equipment, then chan_sip probably ought to sanitize it.
b) If it's in the bridging box, then you need to determine if the bug is in chan_sip or chan_iax2 (probably one of those places), and fix that.

2) Finally, it would be nice to make the JB more resilient to this.  There's already a lot of code in there to cause it to "resync" when it gets discontinuous timestamps, I'm not sure why it doesn't work when things get negative.  gkloepfer's patch might be the right thing for this.

By: Matt Ranney (mjrmjr2) 2005-10-29 20:17:25

I've been running for 4 days after applying gkloepfer's patch.  This is generally longer than I could stay up without an IAX lockup.  In another 4 days I'll know for sure.

I wish I had a reliable way to reproduce this issue, but I have no idea what is causing it.  Each time it locks up there seem to be a different set of sites making calls.

By: Andrew Kohlsmith (akohlsmith) 2005-10-29 20:20:46

I have been having this problem since the start of October.  My setup is very simple.  PRI (TE406) to colocated *, 1-hop SDSL IAX2 connection to the office *, which is PRI (TE405) to the Norstar MICS.

I have a couple other systems which I route DIDs to but by far the bulk of calls come in from the PRI and go over IAX2 to my office * system.

This seems to occur every 3-5 days with me.  I have not applied any patches or anything odd.  Straight CVS HEAD on both machines.

A simple "restart when convenient" works for another while.  I'll try a reload chan_iax2.so next time.



By: Mark Spencer (markster) 2005-10-30 02:11:58.000-0600

Which of the two "&" things seems to be the necessary one?

By: Andrew Kohlsmith (akohlsmith) 2005-10-30 06:59:04.000-0600

I'm sorry -- which two '&' things?

By: Serge Vecher (serge-v) 2005-10-30 10:41:15.000-0600

akohlsmith: I think markster meant "&" in gkloepfer's patch note (0034757 10-12-05 09:33)
       ret = jb_put(iaxs[fr->callno]->jb, fr, type, len, fr->ts & 0x7fffffff,
                       calc_rxstamp(iaxs[fr->callno],fr->ts) & 0x7fffffff);

By: Andrew Kohlsmith (akohlsmith) 2005-10-30 11:10:54.000-0600

I've created a small patch which attempts to identify where the -ve timestamp's coming from.  It's just a quick hack, but it zeroes the timestamp if negative, instead of making it positive.  Was there a reason why you chose to flip the sign instead of nail it to a rail?

By: Andrew Kohlsmith (akohlsmith) 2005-10-30 11:20:11.000-0600

oh, your spurious DTMF issue with the TE411 -- It happens on some systems (like mine, with a TE406).  Mark's response was to disable the VPM DTMF support and that's certainly killed it, but isn't a real fix.  :-)

By: gkloepfer (gkloepfer) 2005-10-30 14:12:30.000-0600

Please DO NOT take my hack and call it a fix.  It is NOT a fix.
It was simply to show where the issue was happening, and to show
how to begin addressing it.  I've actually had issues with that
"patch" causing strange audio problems.  The best temporary workaround
is to turn off the jitterbuffer in iax.conf until the issue is resolved.

The real problem is that there is a type mismatch between the arg to
jb_put (long) and the type of fr->ts (unsigned int).  The "negative" number
is happening because the high-order bit of fr->ts is getting set and the
jitterbuffer routines are calling them negative (because a "long" is signed
and is the same size as an int on 32 bit machines).

I can't tell for sure what would make the IAX2 timestamp set the high-order
bit of an unsigned int.  However, looking at chan_iax2.c, there are various
pieces of code that look like this:

  fr->ts = ( (p1->last & 0xFFFF0000) + 0x10000) | (fr->ts & 0xFFFF);

I haven't quite figured out exactly how the timestamp is being created
and what the expected range may be.  It is possible that when a frame is
dropped, the upper order bits of some timestamp are set to 0xFFFF.
I don't know.  I was hoping that someone who was more familiar with
chan_iax2.c would know.

I guess in any case, passing an unsigned int to a signed long should be
forbidden anyway (especially when the surrounding code starts playing
with the upper order bits).

By: Andrew Kohlsmith (akohlsmith) 2005-11-02 13:29:06.000-0600

I have output!

Nov  1 18:08:42 ERROR[936] chan_iax2.c: call -18: calc_rxstamp returned -ve, zeroing!

Now I screwed up... iaxs[fr->callno]->peer is not a string, but my patch output has proven that fr->ts is not negative, at least it never has been in the hundreds of lines of output in my system... it's always calc_rxstamp() that's returning a negative value.

Unfortunately I don't have this patch applied on my home system and I get the odd negative timestamp there too but since I'm not zeroing it there, I can't tell if zeroing is the right thing to do :-)

Obviously the right RIGHT thing to do is to find out why calc_rxstamp thinks a -ve answer is right...

By: stevekstevek (stevekstevek) 2005-11-02 14:16:41.000-0600

A couple of comments:

<quote>
I can't tell for sure what would make the IAX2 timestamp set the high-order
bit of an unsigned int. However, looking at chan_iax2.c, there are various
pieces of code that look like this:

  fr->ts = ( (p1->last & 0xFFFF0000) + 0x10000) | (fr->ts & 0xFFFF);

I haven't quite figured out exactly how the timestamp is being created
and what the expected range may be. It is possible that when a frame is
dropped, the upper order bits of some timestamp are set to 0xFFFF.
I don't know. I was hoping that someone who was more familiar with
chan_iax2.c would know.
</quote>

In IAX2, timestamps sent are sent over the wire as either 32-bit (full) or 16-bit (mini) values.  The mini-values are sent by "mini" frames, and represent only the lower 16bits of the full timestamp.

In general, IAX2 endpoints reconstruct the full 32-bit timestamp by taking the top 16bits of the last FULL 32-bit received timestamp, and ORing it with the 16bit mini frame that was received.  Things get a bit tricky, however, if

(A) things arrive out-of-order,
(B) endpoints don't actually send the FULL frame when the top 16 bits change.

So, IAX2 endpoints need to use some heuristics to determine when one of these has happened, and basically add one or subtract one from the top-16 bits used.

Hopefully that explains where lines like that come from.

IAX2 timestamps on the wire should always be >= 0.

Now, to akohlsmith's observation:
calc_rxcode has changed a bunch recently, with all this ast_tvdiff_ms stuff.  If you take out all the debugging, you have this code:

 static unsigned int calc_rxstamp(struct chan_iax2_pvt *p, unsigned int offset)
{
       /* Returns where in "receive time" we are.  That is, how many ms
          since we received (or would have received) the frame with timestamp 0 */
       int ms;

       /* Setup rxcore if necessary */
       if (ast_tvzero(p->rxcore)) {
               p->rxcore = ast_tvnow();
               p->rxcore = ast_tvsub(p->rxcore, ast_samp2tv(offset, 1000));
       }
       ms = ast_tvdiff_ms(ast_tvnow(), p->rxcore);
       return ms;
}

I'm also comparing this to an old rxcore, from 20050621 (which is what I use most of the time):

There's a couple of things worth noting:

1) ms is defined as int, but we're returning unsigned int.  This is unchanged from the previous version.

2) all the new functions from time.h were hand-coded here before.  This is certainly a good change, but we really need to see if some semantics have changed here in some strange way..

So, let's see:
Setting rxcore:

WAS:
               gettimeofday(&p->rxcore, NULL);
               p->rxcore.tv_sec -= offset / 1000;
               p->rxcore.tv_usec -= (offset % 1000) * 1000;
               if (p->rxcore.tv_usec < 0) {
                       p->rxcore.tv_usec += 1000000;
                       p->rxcore.tv_sec -= 1;
               }

NOW:
               p->rxcore = ast_tvnow();
               p->rxcore = ast_tvsub(p->rxcore, ast_samp2tv(offset, 1000));

So, these look the same:

WAS:
       gettimeofday(&tv, NULL);
       ms = (tv.tv_sec - p->rxcore.tv_sec) * 1000 +
               (1000000 + tv.tv_usec - p->rxcore.tv_usec) / 1000 - 1000;

NOW:
       ms = ast_tvdiff_ms(ast_tvnow(), p->rxcore);

Also, looks the same..  (ast_tvdiff_ms seems like it came directly from chan_iax2.c before these changes were made, and just adds an extra set couple of sets of parentheses, which seem to be unnecesary.  it also returns int.

so, hmm..  I'm not sure why it would return a value with the high bit set, at least not until a call was up for a long time?

Maybe I'm missing something here...

By: gkloepfer (gkloepfer) 2005-11-03 08:31:31.000-0600

I've been looking a lot at my second debug session (jbdebug2.txt)
and noted something very interesting.  The large jump in timestamp
occurred on call number 86c2000.  No jb_put calls were done between
the time jb_new was called for 86c2000 and the time the large timestamp
happened.

This means to me (based on stevek**2's good explanation and trying
to carefully follow the code) that either we have something clobbering
the timestamp within the channel or a new call has just been started
and the timestamp within the packet coming from the new call is corrupt.

Now I'm starting to understand why an Ethereal packet trace would be
helpful.  I'm going to try to get that this morning.  Unless I get
pulled away onto something else, I should have something to report
pretty soon.

By: gkloepfer (gkloepfer) 2005-11-03 15:01:55.000-0600

Newsflash:

The big timestamps are being received from the other Asterisk instance.
So while it would be a good thing to better deal with corrupt packets,
aka ones with bad timestamps, the source of the bad packets is Asterisk
itself.

Note that the source Asterisk is running release 1.0.10.

I am investigating the source of the bad timestamps.

By: Matt Ranney (mjrmjr2) 2005-11-03 16:12:17.000-0600

As an update, my server remains up after gkloepfer's fix to force all the timestamps positive, so this is probably the same issue.

In my scenario a very small number of calls are sent to a machine running 1.0.9.   The bulk of the traffic though goes to machines running CVS HEAD.  I just noticed that the 1.0.9 machine doesn't have the ztdummy module loaded.  Perhaps this is related?

By: gkloepfer (gkloepfer) 2005-11-03 17:16:38.000-0600

I have been unable to determine the source of the negative timestamps
from Asterisk 1.0.10.  I'm starting to wonder if this problem hasn't
already been fixed in -HEAD, unrelated to what we're seeing here.

For those who are forcing bit ASTERISK-27 (32nd bit) = 0 with my original "patch" -
please stop using this.  I will upload another workaround in a minute
that is a much better workaround, is cleaner, and less likely to cause
problems.  It may actually be the correct fix.

What I did with was modify unwrap_timestamp() (which is called
to clean-up the timestamp when it wraps around) to effectively wrap
when bit 31 is set.  This will not only prevent a "negative" timestamp
from entering the jitterbuffer code, but will also prevent a rabidly
abnormal timestamp from messing up the jitterbuffer "history".  It
will effectively smooth out the abnormality rather than just clipping it.

In my test environment, I could not get Asterisk to hang, and the
strange noises I heard with the bit-clipping "patch" (hack) didn't
happen.

Again, please remove that bit clipping hack (with the 0x7fffffff in it) and
use the uploaded patch instead if you need a workaround.  This still
isn't a real fix, but is much cleaner than chopping bits off.

Note that the patch is against asterisk/channels/chan_iax2.c (I did not
do the diff from the top-level directory).

By: stevekstevek (stevekstevek) 2005-11-03 17:26:59.000-0600

I'm sorry I haven't had more time to look at this today.  It's been two days this week, that the same two people in my office have done brain-dead things related to trojans and viruses, one of these included plugging the infected machine _back_ into the network after we had unplugged it, etc.  (that machine is now locked in a cabinet).

Anyway, it seems like gkloepfer has a good handle on things for now.  Here's my 2c, for what it's worth:

1) One thing that's particularly hokey with 1.0.x (and 1.2, for that matter, when you're not using trunktimestamps) is trunking.  Lots of timestamp weirdness happens there.

2) Even if you solve this, it still points to an underlying bug in the jitterbuffer itself, where jb_next returns a time less than it should.  I didn't really analyze the behavior of what happens when things wrap; I guess I was just hoping it would resync, and, although there might be an audio glitch every 24 days, it wouldn't cause anything bad to happen.  That might have been a false hope :)

By: ewieling (ewieling) 2005-11-12 09:18:01.000-0600

Throw Asterisk 1.2-rcX on IAXTel.  That should expose any of these issues quickly.

By: Tilghman Lesher (tilghman) 2005-11-12 09:49:43.000-0600

Just wondering if this might have been fixed by ASTERISK-5537 -- it had the same symptoms of stopping all IAX calls, seemingly inexplicably.

By: stevekstevek (stevekstevek) 2005-11-12 10:05:35.000-0600

cordyon: no, that seemed to be a pretty localized bug in the cli handler for iax2 show peer, where it didn't unlock things when it couldn't find a peer..

Unless you actually use iax2 show peer, it wouldn't be reproducable..

By: the_duke (the_duke) 2005-11-21 05:42:38.000-0600

FYI: I have seen this problem on 1.2b1 where the latest patch for this seems to be applied already.

By: Mark Spencer (markster) 2005-12-20 02:46:34.000-0600

So what's the story here?  Is the problem still manifested as 100% CPU utilization and is it still around in svn trunk?

By: Matt Ranney (mjrmjr2) 2005-12-20 09:35:11.000-0600

I have upgraded all the servers in my network to either 1.2 or svn trunk.  I have not seen this problem since I did that.

I believe the problem was caused by an asterisk 1.0 system generating crazy timestamps.  From reading the comments from stevek and gkloepfer above, it seems like there is still at least one serious problem with the jitter buffer.

Thankfully I am not running into it every couple of days anymore.

By: gkloepfer (gkloepfer) 2005-12-20 09:41:58.000-0600

I concur with mjrmjr2.  I have noticed the same behavior since updating everything to 1.2/SVN.  If I have a chance today, I'll beat on the jitterbuffer a little and see if it breaks in the same way, but I suspect it won't.

By: gkloepfer (gkloepfer) 2005-12-20 09:43:43.000-0600

(and now to make my note more coherent)
I have NOT noticed any problem since updating everything to 1.2/SVN trunk.

But I will still check later today to see if I still see this problem.

Argh. Does someone have a nice large cork to stick in my mouth!! ;-)

By: gkloepfer (gkloepfer) 2005-12-20 14:50:55.000-0600

I just did the same tests as I did when it failed with the combination of 1.0.10 and 1.2 beta (only this time with all 1.2.1 branch) and was unable to get it to fail again.

Conclusions:
(1) It looks like 1.2.1 is not spitting out IAX2 packets that have weird timestamps that confuses the jitterbuffer like 1.0.10 appeared to do.  This is a good thing.

(2) I'm still a little concerned that the jitterbuffer issues we saw during the investigation of this bug may still exist if someone crafts an invalid IAX2 packet.

As Asterisk continues to be more widely-deployed, it will quickly become a target for security exploits.  If someone can effectively crash an Asterisk PBX simply by sending it IAX2 packets with bizarre timestamps, this would be considered a highly exploitable DOS.  Imagine what could happen if someone finds a buffer over/underflow situation.

By: Andrew Kohlsmith (akohlsmith) 2005-12-24 10:56:17.000-0600

wu-ast*CLI> show version
Asterisk SVN-trunk-r7469M built by root @ wu-ast.mixdown.ca on a i686 running Linux on 2005-12-13 16:44:31 UTC

Still does this on occassion.  I did notice some "zombie" iax2 channels open but I was unable to keep a copy of the screen.  restart now "fixed" it.

By: Andrew Kohlsmith (akohlsmith) 2005-12-26 16:28:14.000-0600

trying the updated patch (last msg about the crash was WITHOUT the better workaround patch)

By: Matt Ranney (mjrmjr2) 2006-01-09 22:21:07.000-0600

I thought the problem had gone away for me, but it just happened again on one of my busier machines.  It is running SVN-trunk-r7341M and it exhibited the same symptoms as before: 100% CPU, no IAX data flowing, CLI still responsive.

By: Olle Johansson (oej) 2006-02-02 01:33:10.000-0600

Still an issue? Any more information?

/Housekeeping

By: Michael Young (mxyoung) 2006-02-03 14:17:22.000-0600

We have had this same problem today running 1.2.1 -- 100% CPU, CLI responds, no IAX communication. Restart Now "fixes" it. We are going to upgrade to 1.2.4 tonight to see if this resolves issue.

We only have only seen this on one fairly loaded server, with 40 concurrent calls at peak times and g.729 on about 10 of those calls (the rest are ulaw).

By: Matt Ranney (mjrmjr2) 2006-02-03 14:20:52.000-0600

I haven't had this issue since 1/9/2006, but I haven't updated asterisk since then either.  I suspect if I had more traffic running through that machine that the problem would arise again.

By: Olle Johansson (oej) 2006-02-04 03:22:52.000-0600

Ok, closing this report now. Please reopen (or ask a bug marshal to re-open) if this problem appears again.

Thank you!

/O