Summary: ASTERISK-06395: [patch] Asterisk crashes randomly when using manager to generate predictive calls with Zap and SIP
Reporter: Paulo Mendes da Silva (kanelbullar)
Labels:
Date Opened: 2006-02-22 04:25:39.000-0600
Date Closed: 2006-05-30 15:07:34
Versions:
Frequency of
Environment:
Attachments:
( 0) bt.txt
( 1) pbx.diff
( 2) thread_apply_all_bt.txt
Description: We are experiencing random Asterisk crashes after running for a few hours. We are using Asterisk 1.2.4.

Our scenario is a bit complicated, but I will try to explain it.

We are using the Asterisk Manager Interface to implement a predictive dialing application. We use the Originate request to make a call to a given Zap destination, like:

exten = _0.,1,Dial(Zap/g0/${EXTEN:1})

When the destination answers, the call is connected to the following extension:

exten = pred_dev,1,NoOp(pred_dev)
exten = pred_dev,2,Wait(30)
exten = pred_dev,3,Hangup

Our application is monitoring Manager events, so it will know at this point the call has been answered at the destination. It then redirects the call using Redirect to a SIP extension like:

exten = 7100,1,Dial(SIP/7100,30,t)

The Zap channel and the local SIP channel are connected at this point. After a few minutes, the call is disconnected using Manager.
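For illustration, the Manager request sequence described above could be sketched roughly as follows. This is only a sketch under assumptions, not the reporter's application: the helper ami_action, the channel names, the context name and the dialed number are hypothetical. Only the Originate, Redirect and Hangup actions and the pred_dev and 7100 extensions come from the description.

```python
def ami_action(name, **headers):
    """Serialize one Manager action as 'Key: Value' lines ended by a blank line."""
    lines = [f"Action: {name}"]
    lines += [f"{key}: {value}" for key, value in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n"


# Hypothetical channel name; a real application would take it from Manager events.
channel = "Local/012345678@default"

# 1. Originate the outbound call and park the answered leg on pred_dev.
originate = ami_action("Originate", Channel=channel, Context="default",
                       Exten="pred_dev", Priority=1)

# 2. Once the answer is seen in the event stream, redirect to a SIP agent.
redirect = ami_action("Redirect", Channel=channel, Context="default",
                      Exten="7100", Priority=1)

# 3. After a few minutes, disconnect the call through the Manager.
hangup = ami_action("Hangup", Channel=channel)

for message in (originate, redirect, hangup):
    print(message, end="")
```

Each message would be written to the Manager TCP connection after a successful Login action; the connection handling is left out here.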

We have 60 Zap channels available and 40 SIP extensions to take calls.


After analyzing the backtrace outputs from several crashes, we concluded they were the same.
Here is the example we are sending you:

#0  0x0035ec94 in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#1  0x00634c51 in local_hangup (ast=0x8b14e60) at ../include/asterisk/lock.h:592
#2  0x080678b4 in ast_hangup (chan=0x8b14e60) at channel.c:1327
#3  0x0045500e in dial_exec_full (chan=0x8be9170, data=) at app_dial.c:1574
#4  0x004573ed in dial_exec (chan=0xb6f1cdf0, data=0x8b14e60) at app_dial.c:1601
#5  0x0809072d in pbx_extension_helper (c=0x8be9170, con=) at pbx.c:544
#6  0x080919f6 in __ast_pbx_run (c=0x8be9170) at pbx.c:2218
#7  0x0809345c in pbx_thread (data=0xb6f1cdf0) at pbx.c:2505
#8  0x0035d341 in start_thread () from /lib/tls/libpthread.so.0
#9  0x002666fe in clone () from /lib/tls/libc.so.6

We are also including the "thread apply all bt" output.

Please let me know if you need any additional information.
Comments:
By: Paulo Mendes da Silva (kanelbullar) 2006-03-03 05:54:17.000-0600

We keep experiencing this problem. Sometimes it takes several hours to happen, sometimes it takes just 30 minutes. We have increased load in order to check if the problem shows up faster, but it doesn't seem to make much difference.

Latest tests were made with 150 Zap channels and 120 SIP extensions to take calls.

Please do take a look at this problem, as it dramatically affects stability in our predictive call scenario.

By: David James (davidj) 2006-03-04 17:57:07.000-0600

Looks like the caller is hanging up, then Asterisk is hanging up the channel.

From what I understand, you can't hang up/destroy (ast_hangup) a channel that's already hung up.

And because the caller is in AutoService mode, there is no way to detect their hangup (am I wrong?)

I believe the fix is that before the code reaches this:

#2 0x080678b4 in ast_hangup (chan=0x8b14e60) at channel.c:1327

It needs to call ast_check_hangup(chan), then hang up if it returns true; otherwise it should not hang up.

You would need to add ast_check_hangup(chan) to every ast_hangup near app_dial.c:1574

By: Tilghman Lesher (tilghman) 2006-03-04 22:06:51.000-0600

It appears that you're using the Local channel to initiate the call.  Might I suggest dialling the destination directly, i.e.

Action: Originate
Channel: Zap/g0/12345678901
Context: whatever
Exten: pred_dial
Priority: 1

By: Paulo Mendes da Silva (kanelbullar) 2006-03-07 05:51:25.000-0600

Corydon76, it is not feasible from our side to dial the destination directly. We are using the Local channel in order to be able to take advantage of the dial plan. Otherwise, our application would have to be aware of the specific details of the destination numbers, namely, the Zap groups that would have to be used. We need to hide that information from our application and the dial plan is perfect to do so. That is why using the Local channel is important for us.

Should we make the correction suggested by davidj ourselves?

By: Abhay Gupta (agupta) 2006-03-07 09:33:11.000-0600

We are also using Local channels and facing random crashes with 1.2.4. David, can you help us with the suggested correction?

By: Tilghman Lesher (tilghman) 2006-03-07 10:29:05.000-0600

kanelbullar:  My suggestion was a workaround.  There are known issues with using the Local channel.  While we will fix the bugs as we figure out what they are, it's sometimes preferable to have an immediate solution for crashes, rather than being told "we're working on it".

By: Paulo Mendes da Silva (kanelbullar) 2006-03-07 11:44:36.000-0600

Ok, thank you Corydon76. I was just trying to explain why we were using the Local channel instead of the direct channel.

By: Paulo Mendes da Silva (kanelbullar) 2006-03-24 12:36:30.000-0600

We have attempted to implement the changes that were suggested by davidj, but we keep experiencing the crashes.

We have noticed a very odd message in the log files, which may ring a bell for you:
Mar 23 23:30:06 WARNING[14129] channel.c: Hard hangup called by thread -1235805264 on Local/087455@default-207b,1<ZOMBIE>, while fd is blocked by thread -1235805264 in procedure ast_waitfor_nandfds!  Expect a failure

Shortly after this message, Asterisk crashes.
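Since this warning is reported to appear shortly before the crash, a log scan for it might help correlate occurrences across systems. The following is a minimal sketch under assumptions: the function name find_hard_hangups and the regular expression are illustrative and are derived only from the single log line quoted above, so other Asterisk versions may word the message differently.

```python
import re

# Pattern inferred from the one warning line quoted in this report (an assumption).
HARD_HANGUP = re.compile(
    r"Hard hangup called by thread (?P<caller>-?\d+) on (?P<channel>\S+), "
    r"while fd is blocked by thread (?P<blocker>-?\d+) in procedure (?P<procedure>\w+)"
)


def find_hard_hangups(lines):
    """Return one dict per 'Hard hangup' warning found in the given log lines."""
    hits = []
    for line in lines:
        match = HARD_HANGUP.search(line)
        if match:
            hits.append({
                "channel": match.group("channel"),
                "caller": int(match.group("caller")),
                "blocker": int(match.group("blocker")),
                "procedure": match.group("procedure"),
            })
    return hits


sample = [
    "Mar 23 23:30:06 WARNING[14129] channel.c: Hard hangup called by thread "
    "-1235805264 on Local/087455@default-207b,1<ZOMBIE>, while fd is blocked "
    "by thread -1235805264 in procedure ast_waitfor_nandfds!  Expect a failure",
]
hits = find_hard_hangups(sample)
for hit in hits:
    same = hit["caller"] == hit["blocker"]
    print(hit["channel"], hit["procedure"], "same thread" if same else "different threads")
```

In the quoted line, the thread calling the hangup and the thread blocking on the fd carry the same ID, and the channel is a Local channel marked <ZOMBIE>.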

By: xiribitata (xiribitata) 2006-03-27 07:53:52.000-0600

We are experiencing the same problem. After a while (sometimes days, sometimes hours) of making several calls using Local channels, Asterisk crashes with a segmentation fault. The same warning message "Hard hangup called by thread -1235805264 on Local/12345@default-111b,1<ZOMBIE>, while fd is blocked by thread -1235805264 in procedure ast_waitfor_nandfds! Expect a failure" appears before the crash.

It seems to be the same problem described in ASTERISK-5219.

By: Boris Moreno (jupiter) 2006-04-04 13:49:30

I'm having the same problem with versions 1.2.5 and 1.2.6. I'm also making predictive calls using Local channels and the Asterisk Manager. In the same machine we have 2 E1s and a recently added TDM2406p with FCTs. The problems began with the installation of the TDM board. Something interesting is that the FCTs report hangup detection with a 2-second delay, and the crash happens when I'm connecting an agent with the FCT.

Currently I'm working on a workaround...

By: Boris Moreno (jupiter) 2006-04-05 16:16:09

After filling manager.c, pbx.c and channel.c with debug output, I identified that the problem was in ast_async_goto when it calls the ast_channel_masquerade function. After testing many changes in the code, I discovered that when SoftHangup with the a option is used before Dial in the dialplan, the segmentation fault no longer occurs.

By: Boris Moreno (jupiter) 2006-04-17 10:18:09

It happened again. To solve the problem I had to change ast_async_goto to return -1 if the sync_goto is not possible. Sometimes I lose calls, but at least it doesn't crash anymore. I'm going to try Asterisk version 1.2.7; apparently it solves the problem.

By: Serge Vecher (serge-v) 2006-05-02 16:16:30

jupiter: what are the results with 1.2.7-1?

kanelbullar: are you still having this issue with 1.2.7-1?

By: Boris Moreno (jupiter) 2006-05-02 16:56:04

Still happening; the only workaround for me was modifying pbx.c.

By: Serge Vecher (serge-v) 2006-05-03 08:45:38

Can you please produce a patch that fixes the issue? Thanks.

By: Boris Moreno (jupiter) 2006-05-03 09:58:03

I made a pbx.diff with the workaround that worked for me. Download it, test it, and tell me if it works for you.

By: Serge Vecher (serge-v) 2006-05-30 15:07:32

jupiter: your patch did not get good feedback from the development team. As the original poster has not answered about reproducibility, this issue will be closed. If anybody can reproduce this issue in unmodified Asterisk, please reopen the issue or ask the bug marshal to open it for you. Do not forget to attach a backtrace from Asterisk built with make dont-optimize. Thanks.