
Summary: ASTERISK-03660: Asterisk randomly crashs
Reporter: sariabod (sariabod)
Labels:
Date Opened: 2005-03-09 18:57:29.000-0600
Date Closed: 2011-06-07 14:00:21
Priority: Critical
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) btfull.txt
( 1) btfull2.txt
( 2) leadlogs.txt
Description: Every time we upgraded our system we would get random crashes. We went back to an older version (Asterisk CVS-D2004.09.28.03.08.07-02/14/05-16:01:01) which runs like a champ (why upgrade? I would like to know that too!). Now we have purchased a new TE410P and a new system, redone the configs, and still have crashes every 1-2 hours.


****** ADDITIONAL INFORMATION ******

uname -a: Linux voip2 2.6.10 #2 SMP Wed Mar 2 22:50:08 PST 2005 i686 AMD Athlon(tm) MP 2800+ AuthenticAMD GNU/Linux
Attached are btfull and the full logs leading up to the crash. May be related to bug ID0003593 (last time I upgraded). This time it's not a mixed build.

Comments:By: sariabod (sariabod) 2005-03-09 19:00:14.000-0600

Mark, if you still think this is a PRI issue and I should contact digium I will.

By: Mark Spencer (markster) 2005-03-10 02:37:06.000-0600

I'll have to be able to login and look at this.

By: sariabod (sariabod) 2005-03-10 13:52:49.000-0600

doh, was up for 12 hours. How would you like me to give you the login info?

Edit: A little background info for the bulk of the calls

Incoming -> Operator Queue (Cisco 7960, sip) -> 1 of 4 other queues -> picked up by agents (iaxcomm, iax2) -> Sometimes put back into a different queue

edited on: 03-10-05 17:33

By: Fernando Romo (el_pop) 2005-03-10 18:36:03.000-0600

Try removing all the modules in /usr/lib/asterisk/modules/ and recompiling; most of my errors went away. Maybe an old module is causing the problem.

I am using the latest CVS HEAD and it doesn't crash (for now).. :)

By: Mark Spencer (markster) 2005-03-10 19:43:15.000-0600

Reopen if it happens again.  Thanks!

By: sariabod (sariabod) 2005-03-11 09:37:24.000-0600

I never closed it. Asterisk is still crashing.

By: Mark Spencer (markster) 2005-03-11 11:19:05.000-0600

Did you get us a new backtrace?  Also can you confirm there are no patches and that this is running straight up CVS head?

By: Fernando Romo (el_pop) 2005-03-11 11:56:11.000-0600

I tested with a clean CVS HEAD installation and it works fine. On an old installation, updating to CVS HEAD gave me compilation errors. I deleted everything in /usr/lib/asterisk/modules and it still failed, but after I also deleted everything in /usr/include/asterisk and recompiled, it worked fine.

Maybe we need to add file-removal routines to "make clean" to avoid this kind of behavior; I mean removing all the modules, libs, and includes from the previous installation.
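
The cleanup proposed above could be sketched roughly as follows. This is only an illustration of the idea, not part of the Asterisk build system: the helper names `find_stale_files` and `remove_stale_files` are hypothetical, `app_old.so` is a made-up leftover, and the demo runs on a temporary directory instead of /usr/lib/asterisk/modules.

```python
import tempfile
from pathlib import Path


def find_stale_files(install_dir, fresh_names):
    """Return installed files whose names are not part of the fresh build."""
    install_dir = Path(install_dir)
    fresh = set(fresh_names)
    return sorted(p for p in install_dir.iterdir()
                  if p.is_file() and p.name not in fresh)


def remove_stale_files(install_dir, fresh_names, dry_run=True):
    """List stale files; delete them only when dry_run is False."""
    stale = find_stale_files(install_dir, fresh_names)
    if not dry_run:
        for path in stale:
            path.unlink()
    return [p.name for p in stale]


if __name__ == "__main__":
    # Demonstration on a throwaway directory, not on a real installation.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("chan_sip.so", "chan_agent.so", "app_old.so"):
            (Path(tmp) / name).touch()
        # Dry run: only reports what would be removed.
        print(remove_stale_files(tmp, ["chan_sip.so", "chan_agent.so"]))
```

The dry-run default is deliberate: on a real system you would want to review the list before deleting anything from the modules or include directories.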

By: sariabod (sariabod) 2005-03-11 12:30:31.000-0600

Asterisk CVS-HEAD-03/10/05-12:51:00
I pretty much update every night in hopes of it automagically fixing the problem.
It is a clean build. It seems to me that it crashes under load, or that there is some sequence of events that has a higher chance of happening with many calls. Below are the crash times.

Wed Mar  9 09:32:13 PST 2005
Wed Mar  9 10:14:25 PST 2005
Wed Mar  9 11:45:14 PST 2005
Wed Mar  9 11:45:15 PST 2005
Wed Mar  9 12:14:54 PST 2005
Wed Mar  9 12:27:43 PST 2005
Wed Mar  9 15:01:16 PST 2005
Wed Mar  9 16:16:33 PST 2005
Wed Mar  9 16:55:32 PST 2005
Thu Mar 10 11:35:23 PST 2005
Thu Mar 10 12:10:02 PST 2005
Thu Mar 10 13:01:40 PST 2005
Thu Mar 10 13:15:58 PST 2005
Thu Mar 10 14:37:03 PST 2005
Thu Mar 10 17:15:30 PST 2005
Fri Mar 11 10:18:34 PST 2005

It runs fine all night, then starts crashing during "business hours" when we have our highest call volume.
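
One quick way to check the business-hours pattern is to bucket the crash times listed above by hour and by day. The sketch below assumes the timestamps keep the `date`-style layout shown in this comment; the function names are hypothetical.

```python
from collections import Counter

# Crash times copied from the list above.
CRASH_TIMES = """\
Wed Mar  9 09:32:13 PST 2005
Wed Mar  9 10:14:25 PST 2005
Wed Mar  9 11:45:14 PST 2005
Wed Mar  9 11:45:15 PST 2005
Wed Mar  9 12:14:54 PST 2005
Wed Mar  9 12:27:43 PST 2005
Wed Mar  9 15:01:16 PST 2005
Wed Mar  9 16:16:33 PST 2005
Wed Mar  9 16:55:32 PST 2005
Thu Mar 10 11:35:23 PST 2005
Thu Mar 10 12:10:02 PST 2005
Thu Mar 10 13:01:40 PST 2005
Thu Mar 10 13:15:58 PST 2005
Thu Mar 10 14:37:03 PST 2005
Thu Mar 10 17:15:30 PST 2005
Fri Mar 11 10:18:34 PST 2005
"""


def _fields(text):
    """Yield the six whitespace-separated fields of each well-formed line."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6:  # weekday, month, day, time, zone, year
            yield fields


def crashes_by_hour(text):
    """Count crashes per hour of day (0-23)."""
    return Counter(int(f[3].split(":")[0]) for f in _fields(text))


def crashes_by_day(text):
    """Count crashes per calendar day, keyed like 'Wed Mar 9'."""
    return Counter(" ".join(f[:3]) for f in _fields(text))


if __name__ == "__main__":
    by_hour = crashes_by_hour(CRASH_TIMES)
    for hour in sorted(by_hour):
        print(f"{hour:02d}:00  {'#' * by_hour[hour]}")
    print(dict(crashes_by_day(CRASH_TIMES)))
```

For the times above, every crash falls between 09:00 and 17:59, which is consistent with the load-related explanation but does not by itself prove it.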

edited on: 03-11-05 12:30

By: Mark Spencer (markster) 2005-03-12 00:07:17.000-0600

This appears to be the same as bug ASTERISK-3510.  Can you try the steps outlined there on a test box and confirm that is the problem?  I need access to a system exhibiting this problem in order to debug and repair it.

By: Fernando Romo (el_pop) 2005-03-13 18:33:36.000-0600

I think I found the problem:

OK, I tested a new installation of CVS HEAD in a call center, and Asterisk crashed without any message, but on the Asterisk console screen on the server (using Alt+Control+F9) I noticed the following error after the crash:

---------------------------------------------
 == Agent '1001' logged in (format ulaw/slin)
   -- Executing Queue("SIP/1699-09c4", "viajesaltillo|tdr") in new stack
   -- Stopped music on hold on SIP/2001-b6e2
   -- agent_call, call to agent '1001' call on 'SIP/2001-b6e2'
   -- Playing 'beep' (language 'en')
   -- Called Agent/1001
   -- Agent/1001 answered SIP/1699-09c4
   -- Started music on hold, class 'default', on SIP/2001-b6e2
 == Spawn extension (default, 8001, 1) exited non-zero on 'SIP/1699-09c4'
Ouch ... error while writing audio data: : Broken pipe
Warning, flexibel rate not heavily tested!
---------------------------------------------

It seems to me that the pipe to the mpg123 process broke and Asterisk doesn't know how to handle the exception; I can reproduce it more often with any operation that involves music on hold.

I am testing with the "unbuffered" setting in musiconhold.conf instead of the default and Asterisk seems stable; I will continue testing.

The "quietmp3" feature appears to have a problem.

By: Fernando Romo (el_pop) 2005-03-13 19:10:44.000-0600

We also discovered that res_musiconhold.c stops sending audio and emits low-level noise (without music) when an agent waits for a moderate period of time (about 10 minutes); if the agent receives a call and hangs up, the music starts again without noise.

edited on: 03-13-05 19:12

By: Mark Spencer (markster) 2005-03-13 21:35:14.000-0600

Did you test with the procedure of ASTERISK-3510 as I asked?  The mpg123 message is a side effect of the crash and is not causing it.

By: Fernando Romo (el_pop) 2005-03-13 23:44:45.000-0600

Today I tested a production system with few agents. For the first hours we limited the agents and only made administrative calls, and for the first two hours everything worked fine. The agents then started to log on and work, but the problem centered on the log-off phase: an agent stayed listening to MOH for about 5 minutes and hung up, then Asterisk crashed. The time between failures is shorter with more agents logged on.

At first I thought it was a res_musiconhold.c bug, but it must be chan_agent.c: when it gets the agent logoff signal, * crashes and mpg123 reports the broken pipe (anthm mentions this in bug 0003590).

Tomorrow I will test again, but it is not easy to experiment on a production system. I will try to follow the procedure you mention above.

By: sariabod (sariabod) 2005-03-14 10:11:09.000-0600

I could not reproduce the problem, and I'm not sure it is related to this issue. We don't use SIP transfers and we only have 1 SIP agent (the operator); everybody else is using iaxcomm. The system is set up for you to log in to, I just need to know the best way of sending the login info to you.

By: Fernando Romo (el_pop) 2005-03-14 16:32:03.000-0600

I tested again on a production system with CVS HEAD, but I rolled chan_agent.c back to revision 1.121 instead of 1.125.

The system has handled the login and logout of the ACD agents. With version 1.125 Asterisk crashed in 10 ~ 15 minutes; now we have been up and running for more than two hours. I ran cvs diff -r 1.121 -r 1.125 chan_agent.c, and the only suspicious thing I see is:

1365a1366,1368
>                       ast_device_state_changed("Agent/%s", p->agent);
>                       if (persistent_agents)
>                               dump_agents();

I don't know whether these functions provoke the crash, but for the moment I have gone back to the version mentioned and Asterisk is still running. I am still testing.

[Update (6:57 PM -6 GMT)]: Well, four and a half hours with 10 agents and Asterisk is still running; the agents log in and out constantly without a crash.

[Update (11:33 PM -6 GMT)]: Asterisk and the ACD operation are still running, 7 1/2 hours of testing.

edited on: 03-14-05 18:57

edited on: 03-14-05 23:36

By: Fernando Romo (el_pop) 2005-03-15 14:08:30.000-0600

With a full load, after 12 hours and 20 agents, Asterisk crashed when ACD agents logged out; the problem points to chan_agent.c and its handling of logins and logouts.

By: Mark Spencer (markster) 2005-03-15 14:27:10.000-0600

Do you have a new backtrace from the latest crash?

By: sariabod (sariabod) 2005-03-15 14:48:25.000-0600

Mark, you do realize there are 2 different people posting here? My system (sariabod) is not in production anymore. When there isn't any load it doesn't crash. I am just waiting for you to log in to it. I'm not sure if el_pop and I have the same problem. I tried getting it to crash with the SIP transfers and it didn't; I also tried the agent login / wait for 5 minutes (MOH) / hang up, and it still did not crash.

I am on IRC now if you need to get a hold of me for login info.

edited on: 03-15-05 14:49

By: Fernando Romo (el_pop) 2005-03-15 16:47:14.000-0600

sariabod:

These appear to be two different problems, but they lead to the same goal: making * operation stable.

I installed a full CVS HEAD and tried to reproduce your problem. In the process, I saw that the only way to make * crash is under workload. In my test, the agents are using SIP softphones (Xlite), and when they log out (by hanging up the phone), the condition arises and * crashes without any message.

We need to check the agent operation, the SIP hangup notify messages, and how Asterisk processes the request in combination with ACD work.

For me, your report led me to test * under heavy load and take measurements for the implementation of a call center (right now I am deploying one in real operation and the client lets me test); that is how I discovered a situation that compromises my own project.

Excuse me for trying to help you; the process revealed other issues and I only report what I find.

By: sariabod (sariabod) 2005-03-15 16:59:00.000-0600

This is nothing against you el_pop. I welcome the troubleshooting. I just wanted to verify who markster was replying to.

By: Clod Patry (junky) 2005-03-28 02:46:21.000-0600

On ASTERISK-365846, a post from Damin (damin 03-17-05 16:47) said this bug should be fixed on HEAD. Is that the case?

Thanks.

edited on: 03-28-05 02:47

By: lannygodsey (lannygodsey) 2005-03-28 11:41:29.000-0600

Mar 28 09:29:51 WARNING[25349]: Loading module res_features.so failed!
Mar 28 09:29:53 WARNING[25353]: /usr/lib/asterisk/modules/res_features.so: undefined symbol: ast_monitor_stop

Asterisk was crashing on startup, I tried fiddling w/ music on hold but the fix was adding load => res_monitor.so to modules.conf.

/etc/asterisk/modules.conf
...
load => chan_modem.so
load => res_musiconhold.so

----
Not sure if this is the right bug to post this or not, but the startup was identical to that posted in:
el_pop
03-13-05 18:33


load => res_monitor.so
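
A missing-dependency failure like this one can be spotted by scanning the log for "undefined symbol" warnings. The sketch below is a hypothetical helper that assumes the warning layout of the two log lines quoted above; other Asterisk versions may format these messages differently.

```python
import re

# Pattern modeled on the warning quoted above (an assumption about the format).
UNDEFINED_SYMBOL = re.compile(
    r"WARNING\[\d+\]: (?P<path>\S+\.so): undefined symbol: (?P<symbol>\w+)"
)

# Log lines copied from the comment above.
SAMPLE_LOG = """\
Mar 28 09:29:51 WARNING[25349]: Loading module res_features.so failed!
Mar 28 09:29:53 WARNING[25353]: /usr/lib/asterisk/modules/res_features.so: undefined symbol: ast_monitor_stop
"""


def find_undefined_symbols(log_text):
    """Return (module file name, missing symbol) pairs found in the log."""
    results = []
    for line in log_text.splitlines():
        match = UNDEFINED_SYMBOL.search(line)
        if match:
            module = match.group("path").rsplit("/", 1)[-1]
            results.append((module, match.group("symbol")))
    return results


if __name__ == "__main__":
    for module, symbol in find_undefined_symbols(SAMPLE_LOG):
        print(f"{module} needs {symbol}; check that its provider is loaded")
```

The helper only reports which symbol is missing; deciding which module provides it (here, res_monitor.so per the fix described above) is still a manual step.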

edited on: 03-28-05 11:42

By: Mark Spencer (markster) 2005-03-30 01:04:18.000-0600

Is this even still an issue then?

By: sariabod (sariabod) 2005-03-30 11:25:29.000-0600

Yes, this is still an issue. Neil L from Digium support is working on it. I don't know the status.

By: Fernando Romo (el_pop) 2005-04-10 18:28:03

The problem is a combination of factors. The only way to reproduce it is at agent logoff (I have only tested with SIP phones and softphones); has anybody tested with IAX phones?

I deduce that chan_agent receives the logout and then must hang up the phone, but if the SIP hangup signal arrives before chan_agent's, maybe Asterisk receives a double hangup event, can't handle this scenario, and dies.

I can reproduce the fault in a live call center (most noticeably at agent shift changes), but I can't go further there because my client needs to work. In the laboratory, the error reproduces very rarely, and I have started to figure out the possible cause. Reading the code, chan_agent makes an ast_soft_hangup() call, but in chan_sip the hangup sequence goes through a SIP notify message that generates the hangup action in Asterisk. If the two requests (the agent's and the SIP one) arrive out of order or in a tangled state, then Asterisk dies.

Maybe we need to implement a kind of hangup_keeper() or wait_hangup() function that works like a lock and avoids the crash.
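
The lock idea above can be illustrated with a small, language-neutral sketch. This is not Asterisk code and not a confirmed diagnosis: `HangupGuard` and `simulate_double_hangup` are hypothetical names, and the example only shows how a lock plus a "done" flag lets two competing hangup requests tear a channel down exactly once.

```python
import threading


class HangupGuard:
    """Run a channel's teardown at most once, however many hangups arrive."""

    def __init__(self, teardown):
        self._teardown = teardown
        self._lock = threading.Lock()
        self._done = False

    def hangup(self, source):
        """Return True if this call performed the teardown, else False."""
        with self._lock:
            if self._done:
                return False  # a hangup already ran; ignore the duplicate
            self._done = True
            self._teardown(source)
            return True


def simulate_double_hangup():
    """Fire an agent-side and a SIP-side hangup at the same channel."""
    torn_down_by = []
    guard = HangupGuard(torn_down_by.append)
    threads = [threading.Thread(target=guard.hangup, args=(name,))
               for name in ("chan_agent", "chan_sip")]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return torn_down_by


if __name__ == "__main__":
    # Exactly one of the two sources wins; which one depends on scheduling.
    print(simulate_double_hangup())
```

Whichever request arrives second becomes a no-op instead of operating on a channel that is already gone, which is the behavior the proposed hangup_keeper() would need to provide.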

If you can tell me how to do a deep Asterisk debug, I will try to reproduce in the lab the conditions of the live call center (I mean very many calls).

By: nick (nick) 2005-04-10 19:05:36

OK, let's clear some of this up--
sariabod: do you still have a bug? Was Digium able to help you?
el_pop: Open a new bug with the issue you're having so we can close this one if sariabod's is resolved. I don't believe the two of you are having the same issue.

Nick

By: Fernando Romo (el_pop) 2005-04-10 21:51:46

While testing this bug I discovered this behavior; I will open a new report.

By: sariabod (sariabod) 2005-04-11 10:16:23

I have not heard back from digium as of yet. Yes the problem still exists.

By: damin (damin) 2005-05-01 13:05:32

Guys, I just read through this entire bug and I feel like I have split personality disorder. Sariabod, can you confirm that this is still an active problem with the latest CVS HEAD as of 5/1/2005? It is now 20+ days since the last update to this bug, and it appears that no progress is being made. If this is clearly reproducible, then please clarify the EXACT sequence of events that causes the problems, and get a backtrace of the core file and, if possible, SIP and IAX2 debugs. If you suspect a LibPri problem, then perhaps the folks from Digium who were working on the issue can post their findings here.

By: Michael Jerris (mikej) 2005-05-14 21:38:03

We need an update on this or we can not resolve this issue.  We need an updated backtrace and debugs.  

By: sariabod (sariabod) 2005-05-16 13:24:02

The system that was crashing is not in a production environment. We had too many angry customers being dropped after being on hold. I never heard back from Digium Support; I checked the logs and never even saw anybody log in. I don't have the capacity or infrastructure to "simulate" load. We have been running Asterisk CVS-D2004.09.28.03.08.07-02/14/05-16:01:01 without a single crash since. I guess just close the post.

By: Olle Johansson (oej) 2005-06-05 17:30:08

Two issues in one bug report, no live system to find the suspected bug in anymore. Please re-open the issue report when we have the bug in sight again.

/Housekeeping