[Home]

Summary:ASTERISK-06237: Calling any Agi before Dial breaks timeout
Reporter:xrobau (xrobau)Labels:
Date Opened:2006-02-02 06:22:43.000-0600Date Closed:2006-04-06 11:28:36
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) hack-patch.txt
Description:Dialplan example below:
exten => 444,1,NoOp(No agi)
exten => 444,n,Dial(SIP/303|15|tr)
exten => 445,1,NoOp(With agi)
exten => 445,n,Agi(anything.agi)
exten => 445,n,Dial(SIP/303|15|tr)

When you dial 444, it works as normal. It'll ring for 15 seconds and then come back with NOANSWER. However, calling 445 results in an INSTANT return with NOANSWER.

This has broken in the recent updates, it was working in trunk-89xx

****** ADDITIONAL INFORMATION ******

Please note the timestamps, showing no elapsed time between the timeout.
Feb  2 23:14:20 VERBOSE[2809] logger.c:     -- Executing NoOp("SIP/301-45cd", "With agi") in new stack
Feb  2 23:14:20 VERBOSE[2809] logger.c:     -- Executing AGI("SIP/301-45cd", "anything.agi") in new stack
Feb  2 23:14:20 VERBOSE[2809] logger.c:     -- Launched AGI Script /var/lib/asterisk/agi-bin/anything.agi
Feb  2 23:14:20 VERBOSE[2809] logger.c:   ==  invalid.agi: Failed to execute '/var/lib/asterisk/agi-bin/anything.agi': No such file or directory
Feb  2 23:14:20 VERBOSE[2809] logger.c:     -- AGI Script anything.agi completed, returning 0
Feb  2 23:14:20 VERBOSE[2809] logger.c:     -- Executing Dial("SIP/301-45cd", "SIP/303|15|tr") in new stack
Feb  2 23:14:20 DEBUG[2809] chan_sip.c: Setting NAT on RTP to 0
Feb  2 23:14:20 DEBUG[2809] chan_sip.c: Outgoing Call for 303
Feb  2 23:14:20 VERBOSE[2809] logger.c:     -- Called 303
Feb  2 23:14:20 VERBOSE[2809] logger.c:     -- Nobody picked up in 15000 ms


Note, in this I used an AGI that doesn't actually exist, that has no bearing on this problem. It does exactly the same thing wether or not it's valid
Comments:By: Olle Johansson (oej) 2006-02-02 06:36:08.000-0600

Tried repeating this, can't do it.

By: xrobau (xrobau) 2006-02-02 06:49:06.000-0600

I've removed all modules, did a clean make install, still doing it. FWIW, the only difference is oej is on a Powerbook OS/X whilst this is an i386-linux with gcc version 3.4.4 20050721 (Red Hat 3.4.4-2) (eg CentOS 4.2)

By: xrobau (xrobau) 2006-02-02 07:31:24.000-0600

Rolling back versions, I've found that it's broken somewhere between 8835 (works)  and 8883 (broken). Unfortunately, svn-mail was broken at that time, so I'm flying blind with revision numbers. I'll see if I can find where it breaks.

By: xrobau (xrobau) 2006-02-02 07:41:57.000-0600

OK. Found it. r8877, which was:
Merge Rizzo's waitfor update (bug ASTERISK-4468)

Modified:
   trunk/channel.c
   trunk/channels/chan_agent.c
   trunk/channels/chan_features.c
   trunk/include/asterisk/channel.h

When I roll forward to that, it stops working. Before that, it works fine.

By: Jared Smith (jsmith) 2006-02-02 08:48:28.000-0600

I tried repeating this also, but couldn't get it to break in any unexpected ways.  I'm running on an 864_64 box, running SVN revision SVN-branch-1.2-r8905.  In IRC, xrobau mentioned that he's using GCC 3.2, while oej and I are running GCC 4.  I'm not sure if that's relevant, but I thought I'd put it in here for history's sake.

By: Olle Johansson (oej) 2006-02-02 08:53:04.000-0600

I tested on OS/X, svn trunk, with *no* zaptel.

By: Carlo Pires (carlopires) 2006-02-02 09:22:44.000-0600

Confirmed with SVN-trunk-r9057 with gcc 3.4.4-2. This was an Asterisk@Home box (CentOS). I just upgrade from 1.2.3 to SVN-trunk-r9057. I back to revison 8876 and the problem disapeared.

By: xrobau (xrobau) 2006-02-02 09:23:34.000-0600

I'mm too tired to continue with this tonight, but I think my original idea may be correct. In some places in that patch it's comparing a 'long' to a 'time_t'.. If these differ in sizes, could this be causing the problem?
I've got a Suse machine here that needs a bit of work before I can asterisk running on it, but I'll see what I can do over the weekend. I'm building another CentOS 4.2 machine tomorrow night, and I'll ensure that that machine is doing the same thing.
Hardware specs of this machine is a P4 2.4G, 512M with a TDM422. CentOS 4.2 is 2.6.9-22, with GCC 3.2.2. I have tried it with the core centos gcc4 (4.0.1 20050727) to see if that's the problem, and no luck.
I also put a whole pile of debugging in the 8877 patch to see if I could find out where the hangup was being triggered, and I couldn't find it. I'm stumped. If anyone has a spare machine, grab an A@H CD, which will install something close enough to this box, when updated to trunk, to hopefully trigger the error.

By: Jason Parker (jparker) 2006-02-02 17:58:50.000-0600

rizzo, any input on this?

By: xrobau (xrobau) 2006-02-02 18:21:21.000-0600

OK. I've tried it on SuSE and it works fine. Kevin's tried it on Debian, and it works fine. This is looking like a CentOS problem.

By: damin (damin) 2006-02-02 19:39:06.000-0600

Tested against AstLinux 0.3.0 on a Soekris. Works as expected and does not exhibit the bug.

damin-pbx asterisk # uname -a
Linux damin-pbx 2.6.13.3-AstLinux-sc1100b #1 Wed Dec 7 15:15:43 CST 2005 i586 unknown

damin-pbx*CLI> set verbose 3
Verbosity is at least 3
   -- Executing NoOp("SIP/102-d9a9", "No agi") in new stack
   -- Executing Dial("SIP/102-d9a9", "SIP/101|15|tr") in new stack
   -- Called 101
   -- SIP/101-c10f is ringing
   -- SIP/101-c10f is ringing
   -- Nobody picked up in 15000 ms
   -- Executing NoOp("SIP/102-b8fa", "With agi") in new stack
   -- Executing AGI("SIP/102-b8fa", "anything.agi") in new stack
   -- Launched AGI Script /var/lib/asterisk/agi-bin/anything.agi
   -- AGI Script anything.agi completed, returning 0
   -- Executing Dial("SIP/102-b8fa", "SIP/101|15|tr") in new stack
   -- Called 101
   -- SIP/101-e280 is ringing
   -- Nobody picked up in 15000 ms
damin-pbx*CLI> show version
Asterisk SVN-branch-1.2-r7382M built by root @ ls on a i686 running Linux on 2005-12-07 21:29:05 UTC

By: xrobau (xrobau) 2006-02-02 19:48:23.000-0600

damin-pbx*CLI> show version
Asterisk SVN-branch-1.2-r7382M

It's trunk-r8877 that has the problem, not 1.2

By: xrobau (xrobau) 2006-02-02 19:48:53.000-0600

Note, I tested it on SuSE, and it works fine. I'm currently building a new 2.6.14.7 kernel for the CentOS machine to see if it's a kernel issue.

By: damin (damin) 2006-02-02 19:55:30.000-0600

This appears to work fine without error on SVN-branch-1.2-r8905 running on a P4 3.0Ghz and Centos 4.2. Kernel is version 2.6.9-22.0.2.ELsmp

[root@gw1 asterisk]# uname -a
Linux gw1.n2net.net 2.6.9-22.0.2.ELsmp #1 SMP Tue Jan 17 07:10:04 CST 2006 i686 i686 i386 GNU/Linux

gw1*CLI> show version
Asterisk SVN-branch-1.2-r8905 built by root @ gw1.n2net.net on a i686 running Linux on 2006-02-02 03:26:35 UTC

   -- Accepting AUTHENTICATED call from 207.166.192.225:
      > requested format = ulaw,
      > requested prefs = (),
      > actual format = ulaw,
      > host prefs = (),
      > priority = mine
   -- Executing NoOp("IAX2/boehnlein-7", "No agi") in new stack
   -- Executing Dial("IAX2/boehnlein-7", "SIP/broadvox-noct/12166192000|5|tr") in new stack
   -- Called broadvox-noct/12166192000
   -- Nobody picked up in 5000 ms
   -- Hungup 'IAX2/boehnlein-7'

   -- Accepting AUTHENTICATED call from 207.166.192.225:
      > requested format = ulaw,
      > requested prefs = (),
      > actual format = ulaw,
      > host prefs = (),
      > priority = mine
   -- Executing NoOp("IAX2/boehnlein-9", "With agi") in new stack
   -- Executing AGI("IAX2/boehnlein-9", "anything.agi") in new stack
   -- Launched AGI Script /var/lib/asterisk/agi-bin/anything.agi
   -- AGI Script anything.agi completed, returning 0
   -- Executing Dial("IAX2/boehnlein-9", "SIP/broadvox-noct/12166192000|5|tr") in new stack
   -- Called broadvox-noct/12166192000
   -- SIP/broadvox-noct-899f is making progress passing it to IAX2/boehnlein-9
   -- Nobody picked up in 5000 ms
   -- Hungup 'IAX2/boehnlein-9'

By: xrobau (xrobau) 2006-02-02 20:00:12.000-0600

OK, So what's the difference between your box and mine then? And carlopires's? Does that machine have Zaptel/Libpri? (The one that's crashing does)

By: xrobau (xrobau) 2006-02-02 20:10:49.000-0600

Sigh, that was still branch-1.2, I should learn to read. Please test with _trunk_, not with branch.

Use the command
svn -r8852 co http://svn.digium.com/svn/asterisk/trunk /usr/src/asterisk-trunk

Compile and run that. Then do
 cd /usr/src/asterisk-trunk
 svn -r8877 update

Then try that. The second one will fail. (So far, on every CentOS 4.2 machine I've tried 8)

By: xrobau (xrobau) 2006-02-02 21:21:11.000-0600

It's a kernel issue. The standard linux kernel that comes with CentOS 4.2 has an issue with that patch. I upgraded to 2.6.14.7, and it's working perfectly.

So where do we go from here?

By: xrobau (xrobau) 2006-02-02 22:03:40.000-0600

Corydon76 suggested that this bug be closed, and I agree. However, this _is_ going to be a problem for 1.4.0 when it's released, unless a workaround is discovered. The kernel 2.6.9-22xxxx is used in CentOS 4.2 (which includes all the A@H machines), WhiteBox and RHEL4. Whilst this is definately a kernel bug, I'm not trying to deny that, perhaps a check for that kernel version should be done in the asterisk Makefile with an abort and error of that it won't work due to a known bug.

By: Carlo Pires (carlopires) 2006-02-03 23:14:12.000-0600

It's definetly a kernel issue. I upgraded the same CentOS machine to kernel 2.6.15.2, recompiled zaptel, upgraded asterisk to SVN-trunk-r9139 and the problem was gone. This bug can be closed.

By: Mark Monnin (wrmem) 2006-02-09 11:50:21.000-0600

Please don't close.  On RHEL4, it breaks the production model to have an "enterprise linux" with a locally maintained, custom kernel.  There must be something different in rizzo's patch that is tickling the kernel issue.

(and yes, I got hit with this one too)

By: Mark Monnin (wrmem) 2006-02-24 16:11:26.000-0600

Ok.  Narrowed it down.  Changed line 1587 from

              *ms = (errno != EINTR) ? -1 : 0;
to
              if (errno != EINTR) *ms = -1;

At least in my simple test, it fixed the problem.   I don't know if this is the correct solution or not.  People with better understanding of this section of code please check.

Anyone else who had experienced the problem want to verify the fix works?

(yes, disclaimer is on file)

By: Olle Johansson (oej) 2006-03-17 03:05:22.000-0600

xrobau: Did the suggested patch solve your issue?

By: Tim Kelly (timkelly1980) 2006-04-03 16:17:56

I have same problem except that I get an instant return regardless to whether i am calling an AGI program first or not! Not being able to Dial on a zap trunk is pretty high priority for me!!!!!

I have fixed the problem using the one line fix suggested by wrmem:

ie:

1607: if (errno != EINTR) *ms = -1;

instead of:

*ms = (errno != EINTR) ? -1 : 0;

NOTE: Line numbers have changed since wrmem's post.

Using debian2.4.32
ASterisk: SVn-oej-jitterbuffer-r17041M

By: kuj (kuj) 2006-04-05 19:59:48

Confirm that this fixes similar Dial issues I've had for quite a while on trunk. Verified fix on trunk-r17735. Dial used to "time out" ("Nobody picked up in 60 seconds") immediately after dialing, not the alleged 60 seconds later. Channeltype did not matter, behavior was seen on ZAP and SIP channels.

These issues do not appear with 1.2.6, which I'm running now. (17735 had other issues w/ DTMF).

Running on CentOS 4.x, kernel 2.6.9-34.EL

By: kuj (kuj) 2006-04-05 20:22:02

And I should've mentioned that I'm executing a Macro() from within a Perl AGI. The Macro in turn calls Dial.

By: Luigi Rizzo (rizzo) 2006-04-06 03:03:12

committed the fix suggested by wrmem.
please let us know how it works.

By: Luigi Rizzo (rizzo) 2006-04-06 04:26:01

committed to trunk. not sure if it belongs to 1.2.x as well.

By: Olle Johansson (oej) 2006-04-06 04:45:53

Change status to "pending release branch" if you are not sure.

By: Russell Bryant (russell) 2006-04-06 11:28:33

I am closing this out with no change to the 1.2 branch because from reading previous bug notes, it appears as if the problem was only a trunk issue.  If I am mistaken, feel free to reopen this report or email me directly at russell@digium.com.

Thanks everyone!