Summary: ASTERISK-05787: Asterisk randomly crashes when using ACD, agents and SIP channels.
Reporter: Vladimir S. Blazhkun (vovan)
Labels:
Date Opened: 2005-12-06 05:19:13.000-0600
Date Closed: 2006-02-28 09:40:21.000-0600
Versions:
Frequency of
Environment:
Attachments:
( 0) crash.log
( 1) crash-2.log
( 2) crash-3.log
( 3) crash-4.log
( 4) crash-5.log
( 5) crash-6.log
( 6) crash-7.log
( 7) crash-8.log
Description:
Redhat Linux AS 4.0.
Asterisk-1.2 release, compiled with dont-optimize.


See attached logs and backtraces.
Comments:
By: opsys (opsys) 2005-12-11 22:08:54.000-0600

Can you please tell us a little about your environment: phones, number of agents, OS? Config files too, if possible.

By: Vladimir S. Blazhkun (vovan) 2005-12-12 08:25:21.000-0600

Just got another crash on Asterisk-1.2.1 release, compiled with dont-optimize.
Uptime was about 3 days. For details and backtraces please see attached file crash-2.log.
My hardware is an HP DL380. The agents' phones are Xten eyeBeam v1.1.3004w. The number of simultaneous agent calls is approx. 10-12. OS: Redhat Linux AS 4.0.

callerid="CallCenter #1" <1402>


exten => s,1,Dial(SIP/${ARG1},90,tr)
exten => s,2,Hangup
exten => s,102,NoOP(Attention Agent/${ARG1} is busy.)
exten => s,103,Hangup

exten => t,1,Playback(beep)
exten => t,2,Set(QUEUE_PRIO=0)
#include "queue_prio.conf"
exten => t,3,Queue(callcenter|t)
exten => t,4,Hangup

exten => 1402,1,Macro(stdcc,${EXTEN})
exten => 1402,hint,SIP/1402


agent => 1402,,CallCenter #1

music = default
strategy = ringall
leavewhenempty = yes
maxlen = 20
timeout = 15
retry = 5
wrapuptime = 5
announce-frequency = 97
announce-holdtime = yes
announce-round-seconds = 10
queue-callswaiting = queue-callswaiting
queue-holdtime = queue-holdtime
queue-minutes = queue-minutes
queue-seconds = queue-seconds
queue-thankyou = queue-thankyou
queue-thereare  = queue-thereare
queue-youarenext = queue-youarenext
servicelevel = 60
member => Agent/1402

By: Vladimir S. Blazhkun (vovan) 2005-12-12 08:30:10.000-0600

And another crash... Uploaded file crash-3.log with all required details.

By: Vladimir S. Blazhkun (vovan) 2005-12-13 05:53:16.000-0600

It crashed once more. But this time I caught the 'sip debug' output as well as core dumps. All details are in crash-4.log. Can anybody tell me why I'm getting such frequent crashes?

By: Kevin P. Fleming (kpfleming) 2005-12-13 10:02:23.000-0600

It appears that you have major memory corruption issues occurring. We can't do anything to help that; you should run memtest86 or something similar on your system to ensure that the system operates properly before trying to report a bug like this.

By: Kevin P. Fleming (kpfleming) 2005-12-13 10:03:58.000-0600

Suspended pending memory testing.

By: Vladimir S. Blazhkun (vovan) 2005-12-13 14:55:02.000-0600

Tested with Memtest-86 v3.2, 3 times (including one pass of test #5). No errors were found.
P.S. I had 1.0.7 running on this server for 2.5 months without any problems, crashes or restarts. The trouble began after the upgrade to 1.2 and later releases.

     Memtest-86 v3.2       | Pass 66% #########################
Xeon DP (0.13) 3056 Mhz     | Test 94% ####################################
L1 Cache:    8  25051MB/s   | Test #2  [Moving inversions, 32 bit pattern]
L2 Cache:  512K 21372MB/s   | Testing:  108K - 1024M 1024M
Memory  : 1024M  1456MB/s   | Pattern:   40000000
Chipset :

WallTime   Cached  RsvdMem   MemMap   Cache  ECC  Test  Pass  Errors ECC Errs
---------  ------  -------  --------  -----  ---  ----  ----  ------ --------
  4:27:13   1024M     192K  e820-Std    on   off   Std     3       0

By: Vladimir S. Blazhkun (vovan) 2005-12-14 02:55:49.000-0600

New coredumps and logs, files crash-5.log and crash-6.log.

By: Vladimir S. Blazhkun (vovan) 2005-12-14 09:27:00.000-0600

Another crash, file crash-7.log.

By: paradise (paradise) 2005-12-14 09:57:35.000-0600

Do you use hints for extension monitoring?
If yes, disable all the hints and check whether the crash occurs again.

By: Vladimir S. Blazhkun (vovan) 2005-12-14 10:16:51.000-0600

Removed all hints from configuration file.
Watching for stability now.

By: Vladimir S. Blazhkun (vovan) 2005-12-16 05:03:31.000-0600

Turning hints off did not help. Got another crash, crash-8.log.

By: Mark Spencer (markster) 2005-12-20 02:45:09.000-0600

Are you running gdb from within the asterisk source directory?  The backtrace is not giving any line numbers or code from which to work...

By: Mark Berry (markab21) 2006-01-02 10:34:35.000-0600

I have a setup nearly identical to vovan's, and have 'hangs' on a nearly daily basis. They require me to stop the Asterisk server and start it again; otherwise I don't receive any incoming IAX2 channels.

I have 6 SIP based agents.  Problems started on upgrade to 1.2

By: Kenneth Holm (saitech) 2006-01-02 11:47:34.000-0600

I am having a problem similar to vovan's. My production Asterisk server only handles SIP and IAX2, but about every 5-10 minutes Asterisk spams the following error into the debug log:

Dec 29 14:51:02 DEBUG[15354] chan_sip.c: Failed to grab lock, trying again...
Dec 29 14:51:02 DEBUG[15354] chan_sip.c: Failed to grab lock, trying again...
Dec 29 14:51:02 DEBUG[15354] chan_sip.c: Failed to grab lock, trying again...

I keep getting these errors for about 5-10 seconds, and while I see this output, Asterisk seems to hang, so no incoming or outgoing calls are possible.

Furthermore, I get sporadic core dumps. Sometimes I get a core dump every 30 minutes, and in other instances only one every 48 hours.

I'm running Asterisk 1.2.1 on an HP DL360 G3 with a 2.6.14-Gentoo-r5 kernel.

I would really like to know whether "vovan" is using a G3 or G4 server from HP.

I can add that these SIP errors also happened on the same server with Asterisk 1.2 final and CentOS 4.2.

By: opsys (opsys) 2006-01-02 11:53:38.000-0600

markab21 and saitech:

What OS are you using?
What do you have loaded as modules? (lsmod)
Can you also attach a crash log?

By: Mark Berry (markab21) 2006-01-02 12:40:38.000-0600

I am running CentOS release 4.2 (Final)

[root@edna ~]# lsmod
Module                  Size  Used by
loop                   19145  0
parport_pc             27904  1
lp                     15405  0
parport                37641  2 parport_pc,lp
autofs4                22085  0
i2c_dev                14273  0
i2c_core               25921  1 i2c_dev
sunrpc                138789  1
dm_mod                 58949  0
button                 10449  0
battery                12869  0
ac                      8773  0
md5                     8001  1
ipv6                  238817  34
uhci_hcd               32729  0
ehci_hcd               31813  0
tg3                    82373  0
ext3                  118729  3
jbd                    59481  1 ext3
ata_piix               13125  5
libata                 47133  1 ata_piix
sd_mod                 20545  7
scsi_mod              116429  2 libata,sd_mod

I don't have a crash log (not sure how/where to find one, I'll check the FAQ and try to provide this).

By: Mark Berry (markab21) 2006-01-02 12:46:26.000-0600

One more note, just to make it clear: Asterisk doesn't CRASH for me as described above; it just rejects new IAX2 connections. To fix the problem I have to stop Asterisk and start it again, which requires all agents to log in to the system again.

It seems that the system continues to accept SIP connections in this state, as our agents can log in and out.

By: Kenneth Holm (saitech) 2006-01-02 15:34:22.000-0600

I'm using Gentoo 1.4 with 2.6.14 kernel.

I really don't know whether it stops accepting IAX connections; if so, it opens up for them again. It seems to hang only momentarily, because I can't see any output on the console or in any logs except the debug log, where it spams the "Failed to grab lock" message from chan_sip.c.

I don't have to stop and start Asterisk; it functions perfectly after the 5-10 seconds during which it hangs. I think it hangs on all functions; I can't even do a reload. If I try to issue a reload, Asterisk only performs it after it stops hanging.

I don't have a crash log, because it doesn't actually crash. I do get occasional core dumps, though; I haven't had the time to debug them.

I'll try to debug the next core dump, so we can see whether the problem is the same.

Again to "vovan": your HP DL380, is it a G3 or G4 machine? It's rather interesting for me, because my colleague thinks it could make a difference, due to the changed chipset. Maybe an incompatibility or a wrong kernel parameter.

My lsmod looks like this.

Module                  Size  Used by
ipv6                  195040  16
floppy                 49028  0
pcspkr                  3688  0
tg3                    81284  0
dm_mirror              17108  0
ata_piix                7300  0
ahci                    9348  0
sata_qstor              7428  0
sata_vsc                6276  0
sata_uli                5504  0
sata_sis                6144  0
sata_sx4               11012  0
sata_nv                 7044  0
sata_via                6660  0
sata_svw                5892  0
sata_sil                7172  0
sata_promise            8708  0
libata                 30088  12 ata_piix,ahci,sata_qstor,sata_vsc,sata_uli,sata_sis,sata_sx4,sata_nv,sata_via,sata_svw,sata_sil,sata_promise
sbp2                   18564  0
ohci1394               27316  0
ieee1394               60888  2 sbp2,ohci1394
sl811_hcd              10496  0
ohci_hcd               16388  0
uhci_hcd               26000  0
usb_storage            52032  0
usbhid                 30432  0
ehci_hcd               24712  0
usbcore                79360  7 sl811_hcd,ohci_hcd,uhci_hcd,usb_storage,usbhid,ehci_hcd

I have some SATA modules built in the kernel that I have not removed yet, but I don't think they are behind the problem.

By: Kenneth Holm (saitech) 2006-01-03 15:42:49.000-0600

I have been looking at the code of chan_sip.c rev 7335,
line 11026.

The function "sipsock_read()" starts here.
A check in this function determines whether 0 headers were received (NAT keep-alive), 1, or 2 or more. 2 or more headers are considered accepted, while only 1 header triggers the "retrylock".

My problem lies in this retrylock. I can't really find out why I'm having a problem here.

I'm using my Asterisk 1.2.1 together with a CISCO AS5400HPX and a server running Sip Express Server.

Help is really appreciated.

As mentioned before, this message is written multiple times to my debug log, and Asterisk hangs while this message is spammed.
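
For readers unfamiliar with the pattern behind the "Failed to grab lock, trying again..." message, here is a minimal Python sketch of a try-lock with retry. It is not the chan_sip.c code: the names handle_with_retry and demo, the retry limit and the pause length are all assumptions made for illustration. It only shows why the message repeats for as long as another thread owns the lock.

```python
import threading
import time


def handle_with_retry(lock, process, log, max_tries=1000, pause=0.001):
    """Take `lock` without blocking; on failure log, pause briefly and retry."""
    for _ in range(max_tries):
        if lock.acquire(blocking=False):
            try:
                return process()
            finally:
                lock.release()
        # Same wording as the debug message quoted in this thread.
        log("Failed to grab lock, trying again...")
        time.sleep(pause)
    raise TimeoutError("lock still held after %d tries" % max_tries)


def demo():
    """Hold the lock in a second thread until the handler has failed three times."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()
    messages = []

    def holder():
        # Stands in for a thread stuck in slow work while owning the lock.
        with lock:
            held.set()
            release.wait()

    def log(message):
        messages.append(message)
        if len(messages) == 3:
            release.set()  # let the holder finish after three failed attempts

    worker = threading.Thread(target=holder)
    worker.start()
    held.wait()  # make sure the lock is owned before the handler starts
    result = handle_with_retry(lock, lambda: "handled", log)
    worker.join()
    return result, messages
```

In this sketch the handler makes no progress until the holder releases the lock, so a long-running holder shows up as a burst of identical log lines followed by normal operation.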

By: Olle Johansson (oej) 2006-01-26 12:53:03.000-0600

vovan: Does the problem exist in svn trunk as well? I might have found something that prevents this from happening, but am very unsure. If you have a chance, please try again.

By: Kenneth Holm (saitech) 2006-01-26 14:42:43.000-0600

I've found out that the problem is caused by one thread, the one that handles the SIP message, hanging while trying to insert data into a CDR table via cdr_mysql.so; the other Asterisk thread waits for the first to finish, hence the deadlock.

It seems that the CDR table is locked and therefore unable to accept inserts. I have converted my CDR table from MyISAM to InnoDB, and after that the debug message occurs at most 3 times in a row, and it happens about every 20th time I insert into the MySQL server via cdr_mysql.so.

I don't classify that as an error, since it only hangs for 3 ms, which is not noticeable, and that's under a 40% load.

By: Olle Johansson (oej) 2006-01-26 14:44:03.000-0600

Try enabling cdr caching in cdr.conf, that way cdr storage will happen in another thread and not block the sip channel. Please report back if this helps you and solves the issue. Thanks.
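
The idea of moving CDR storage into another thread can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not Asterisk's CDR engine: the class name BatchedCdrWriter, its methods, and the store callback are hypothetical. The call-handling thread only queues a record; a background worker performs the slow writes in batches.

```python
import queue
import threading


class BatchedCdrWriter:
    """Queue records from the caller's thread; write them from a worker thread."""

    def __init__(self, store, batch_size=100):
        self._store = store            # hypothetical callable taking a list of records
        self._batch_size = batch_size
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def post(self, record):
        # Called from the call-handling thread; returns without touching storage.
        self._pending.put(record)

    def close(self):
        # A None sentinel tells the worker to flush what is left and stop.
        self._pending.put(None)
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            record = self._pending.get()
            if record is None:
                break
            batch.append(record)
            if len(batch) >= self._batch_size:
                self._store(batch)
                batch = []
        if batch:
            self._store(batch)
```

A caller would create one writer, call post() for each finished call and close() on shutdown. Note that in this sketch records that are queued but not yet flushed exist only in memory, so they are lost if the process dies before close() runs.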

By: Kenneth Holm (saitech) 2006-01-26 14:51:30.000-0600

I'm not really too happy with the batch=yes option in cdr.conf, due to earlier core dumps. But I had a thought: isn't it possible to cache via a spool file instead of memory? Or, if Asterisk goes down, to dump the CDR cache into a spool file? Or at least to have the option to do so?

By: Vladimir S. Blazhkun (vovan) 2006-02-28 09:31:49.000-0600

Seems fixed in 1.2.4; I have had it running for almost 4 weeks now.

By: Olle Johansson (oej) 2006-02-28 09:39:02.000-0600

Obviously fixed in 1.2.4. Thanks for reporting this finding!