Summary: | ASTERISK-05787: Asterisk randomly crashes when using ACD, agents and SIP channels. | ||
Reporter: | Vladimir S. Blazhkun (vovan) | Labels: | |
Date Opened: | 2005-12-06 05:19:13.000-0600 | Date Closed: | 2006-02-28 09:40:21.000-0600 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) crash.log ( 1) crash-2.log ( 2) crash-3.log ( 3) crash-4.log ( 4) crash-5.log ( 5) crash-6.log ( 6) crash-7.log ( 7) crash-8.log | |
Description: | Redhat Linux AS 4.0. Asterisk-1.2 release, compiled with dont-optimize. ****** ADDITIONAL INFORMATION ****** See attached logs and backtraces. | ||
Comments: |
By: opsys (opsys) 2005-12-11 22:08:54.000-0600
Can you please tell us a little about your environment: phones, number of agents, OS? Config files too, if possible.

By: Vladimir S. Blazhkun (vovan) 2005-12-12 08:25:21.000-0600
Just got another crash on the Asterisk-1.2.1 release, compiled with dont-optimize. Uptime was about 3 days. For details and backtraces please see the attached file crash-2.log. My hardware is an HP DL380. The agents' phones are Xten eyeBeam v1.1.3004w. The number of simultaneous agent calls is approx. 10-12. OS: Redhat Linux AS 4.0.

sip.conf:
[1402]
type=friend
context=pbx
regexten=1402
username=1402
secret=xxx
callerid="CallCenter #1" <1402>
host=dynamic
dtmfmode=rfc2833
nat=no
incominglimit=2
canreinvite=no
disallow=all
allow=alaw

extensions.conf:
[macro-stdcc]
exten => s,1,Dial(SIP/${ARG1},90,tr)
exten => s,2,Hangup
exten => s,102,NoOP(Attention Agent/${ARG1} is busy.)
exten => s,103,Hangup
[default]
exten => t,1,Playback(beep)
exten => t,2,Set(QUEUE_PRIO=0)
#include "queue_prio.conf"
exten => t,3,Queue(callcenter|t)
exten => t,4,Hangup
[pbx]
exten => 1402,1,Macro(stdcc,${EXTEN})
exten => 1402,hint,SIP/1402

agents.conf:
[general]
persistentagents=yes
[agents]
autologoff=15
ackcall=no
wrapuptime=5000
agent => 1402,,CallCenter #1
recordagentcalls=no

queues.conf:
music = default
strategy = ringall
leavewhenempty = yes
maxlen = 20
timeout = 15
retry = 5
wrapuptime = 5
announce-frequency = 97
announce-holdtime = yes
announce-round-seconds = 10
queue-callswaiting = queue-callswaiting
queue-holdtime = queue-holdtime
queue-minutes = queue-minutes
queue-seconds = queue-seconds
queue-thankyou = queue-thankyou
queue-thereare = queue-thereare
queue-youarenext = queue-youarenext
servicelevel = 60
member => Agent/1402

By: Vladimir S. Blazhkun (vovan) 2005-12-12 08:30:10.000-0600
And another crash... Uploaded file crash-3.log with all required details.

By: Vladimir S. Blazhkun (vovan) 2005-12-13 05:53:16.000-0600
Crashed one more time. But this time I caught 'sip debug' output as well as core dumps. All details are in crash-4.log. Can anybody tell me why I'm getting such frequent crashes?

By: Kevin P. Fleming (kpfleming) 2005-12-13 10:02:23.000-0600
It appears that you have major memory corruption issues occurring. We can't do anything to help that; you should run memtest86 or something similar on your system to ensure that the system operates properly before trying to report a bug like this.

By: Kevin P. Fleming (kpfleming) 2005-12-13 10:03:58.000-0600
Suspended pending memory testing.

By: Vladimir S. Blazhkun (vovan) 2005-12-13 14:55:02.000-0600
Tested with Memtest-86 v3.2, 3 times (includes one pass of test ASTERISK-5). No errors were found.
P. S. I had 1.0.7 running on this server for 2.5 months without any problems, crashes or restarts. The trouble began after upgrading to 1.2 and later releases.

Memtest-86 v3.2 | Pass 66% #########################
Xeon DP (0.13) 3056 Mhz | Test 94% ####################################
L1 Cache: 8 25051MB/s | Test ASTERISK-2 [Moving inversions, 32 bit pattern]
L2 Cache: 512K 21372MB/s | Testing: 108K - 1024M 1024M
Memory : 1024M 1456MB/s | Pattern: 40000000
Chipset :
WallTime Cached RsvdMem MemMap Cache ECC Test Pass Errors ECC Errs
--------- ------ ------- -------- ----- --- ---- ---- ------ --------
4:27:13 1024M 192K e820-Std on off Std 3 0
-----------------------------------------------------------------------------

By: Vladimir S. Blazhkun (vovan) 2005-12-14 02:55:49.000-0600
New coredumps and logs, files crash-5.log and crash-6.log.

By: Vladimir S. Blazhkun (vovan) 2005-12-14 09:27:00.000-0600
Another crash, file crash-7.log.

By: paradise (paradise) 2005-12-14 09:57:35.000-0600
Do you use hints for extension monitoring? If yes, disable all the hints and check whether the crash occurs again.

By: Vladimir S. Blazhkun (vovan) 2005-12-14 10:16:51.000-0600
Removed all hints from the configuration file. Watching for stability now.

By: Vladimir S. Blazhkun (vovan) 2005-12-16 05:03:31.000-0600
Turning hints off did not help. Got another crash, crash-8.log.

By: Mark Spencer (markster) 2005-12-20 02:45:09.000-0600
Are you running gdb from within the asterisk source directory? The backtrace is not giving any line numbers or code from which to work...

By: Mark Berry (markab21) 2006-01-02 10:34:35.000-0600
I have a nearly identical setup to vovan's, and have 'hangs' on nearly a daily basis. I have to stop the asterisk server and start it again, or else I don't receive any incoming IAX2 channels. I have 6 SIP based agents. The problems started on upgrade to 1.2.

By: Kenneth Holm (saitech) 2006-01-02 11:47:34.000-0600
I am having a similar problem to vovan's. My production asterisk server is only handling SIP and IAX2, but about every 5-10 minutes asterisk spams an error into the debug log.

Dec 29 14:51:02 DEBUG[15354] chan_sip.c: Failed to grab lock, trying again...
Dec 29 14:51:02 DEBUG[15354] chan_sip.c: Failed to grab lock, trying again...
Dec 29 14:51:02 DEBUG[15354] chan_sip.c: Failed to grab lock, trying again...

I keep getting these errors for about 5-10 secs, and while I see this output asterisk seems to hang, so no incoming or outgoing calls are possible. Furthermore, I get sporadic core dumps. Sometimes I get a core dump every 30 minutes, and in other instances only every 48 hours. I'm running asterisk 1.2.1 on an HP DL360 G3 with a 2.6.14-Gentoo-r5 kernel. I would really like to know whether vovan is using a G3 or G4 server from HP. I can add that these SIP errors also happened on the same server with asterisk 1.2 final and CentOS 4.2.

By: opsys (opsys) 2006-01-02 11:53:38.000-0600
markab21 and saitech: What OS are you using? What do you have loaded as modules? (lsmod) Can you also attach a crash log?
By: Mark Berry (markab21) 2006-01-02 12:40:38.000-0600
I am running CentOS release 4.2 (Final)

[root@edna ~]# lsmod
Module Size Used by
loop 19145 0
parport_pc 27904 1
lp 15405 0
parport 37641 2 parport_pc,lp
autofs4 22085 0
i2c_dev 14273 0
i2c_core 25921 1 i2c_dev
sunrpc 138789 1
dm_mod 58949 0
button 10449 0
battery 12869 0
ac 8773 0
md5 8001 1
ipv6 238817 34
uhci_hcd 32729 0
ehci_hcd 31813 0
tg3 82373 0
ext3 118729 3
jbd 59481 1 ext3
ata_piix 13125 5
libata 47133 1 ata_piix
sd_mod 20545 7
scsi_mod 116429 2 libata,sd_mod

I don't have a crash log (not sure how/where to find one; I'll check the FAQ and try to provide this).

By: Mark Berry (markab21) 2006-01-02 12:46:26.000-0600
One more note I just want to make clear: Asterisk doesn't CRASH as described above for me, it just rejects new IAX2 connections. To fix the problem I have to stop asterisk and start it again, requiring any agents to log in to the system again. It seems that the system will continue to take SIP connections in this state, as our agents can log in and out.

By: Kenneth Holm (saitech) 2006-01-02 15:34:22.000-0600
I'm using Gentoo 1.4 with a 2.6.14 kernel. I really don't know whether it closes for IAX connections; if so, it opens up for them again. It seems to hang only momentarily, because I can't see any output in the console or in any logs except the debug log, where it spams the "Failed to grab lock" message from chan_sip.c. I don't have to stop and start asterisk; it functions perfectly after the 5-10 secs during which it hangs. I think it hangs on all functions, as I can't even do a reload. If I try to issue a reload, asterisk only performs it after it stops hanging. I don't have a crash log, because it doesn't actually crash, though I do get occasional core dumps. I haven't had the time to debug the core dumps. I'll try to debug the next core dump, so we can see whether the problem is the same.

Again to vovan: your HP DL380, is it a G3 or G4 machine? It's kind of interesting for me, because my colleague thinks it could make a difference due to the changed chipset. Maybe an incompatibility or a wrong kernel parameter. My lsmod looks like this.

Module Size Used by
ipv6 195040 16
floppy 49028 0
pcspkr 3688 0
tg3 81284 0
dm_mirror 17108 0
ata_piix 7300 0
ahci 9348 0
sata_qstor 7428 0
sata_vsc 6276 0
sata_uli 5504 0
sata_sis 6144 0
sata_sx4 11012 0
sata_nv 7044 0
sata_via 6660 0
sata_svw 5892 0
sata_sil 7172 0
sata_promise 8708 0
libata 30088 12 ata_piix,ahci,sata_qstor,sata_vsc,sata_uli,sata_sis,sata_sx4,sata_nv,sata_via,sata_svw,sata_sil,sata_promise
sbp2 18564 0
ohci1394 27316 0
ieee1394 60888 2 sbp2,ohci1394
sl811_hcd 10496 0
ohci_hcd 16388 0
uhci_hcd 26000 0
usb_storage 52032 0
usbhid 30432 0
ehci_hcd 24712 0
usbcore 79360 7 sl811_hcd,ohci_hcd,uhci_hcd,usb_storage,usbhid,ehci_hcd

I have some SATA modules built into the kernel that I have not removed yet, but I don't think they are behind the problem.

By: Kenneth Holm (saitech) 2006-01-03 15:42:49.000-0600
I have been looking at the code in chan_sip.c rev 7335, line 11026, where the function "sipsock_read()" starts. A check in this function determines whether 0 headers were received (NAT keep-alive), 1, or 2 or more. Two or more headers are considered accepted, while only 1 header triggers the "retrylock". My problem lies in this retrylock. I can't really find out why I'm having a problem here. I'm using my asterisk 1.2.1 together with a CISCO AS5400HPX and a server running Sip Express Server. Help is really appreciated. As mentioned before, this message is written multiple times in my debug log, and asterisk hangs while this message is spammed.

By: Olle Johansson (oej) 2006-01-26 12:53:03.000-0600
vovan: Does the problem exist in svn trunk as well? I might have found something that prevents this from happening, but am very unsure. If you have a chance, please try again.
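For readers unfamiliar with the pattern behind the "Failed to grab lock, trying again..." message, the following is a minimal Python sketch of a generic try-lock-and-retry loop. It is not the chan_sip.c code; the names `acquire_with_retry`, `on_retry` and `demo`, the retry delay and the attempt limit are all assumptions made for illustration. It shows why the message repeats for as long as another thread keeps holding the lock.

```python
import threading
import time


def acquire_with_retry(lock, on_retry, retry_delay=0.001, max_attempts=5000):
    """Take `lock` without blocking; on failure report, back off and retry.

    Returns the number of failed attempts before the lock was obtained.
    (Hypothetical helper, not an Asterisk API.)
    """
    for attempt in range(max_attempts):
        if lock.acquire(blocking=False):
            return attempt
        on_retry(attempt)        # e.g. write a debug log line
        time.sleep(retry_delay)  # give the holder a chance to finish
    raise TimeoutError("lock still held after %d attempts" % max_attempts)


def demo():
    channel_lock = threading.Lock()
    holding = threading.Event()   # set once the holder owns the lock
    release = threading.Event()   # tells the holder to let go
    messages = []

    def holder():
        # Stands in for a thread that is busy with slow work while holding the lock.
        with channel_lock:
            holding.set()
            release.wait()

    def on_retry(attempt):
        messages.append("Failed to grab lock, trying again...")
        release.set()  # in this demo the first failed attempt unblocks the holder

    worker = threading.Thread(target=holder)
    worker.start()
    holding.wait()
    failed = acquire_with_retry(channel_lock, on_retry)
    channel_lock.release()
    worker.join()
    return failed, messages
```

In this sketch the retrying thread makes no progress until the holder releases the lock, so a holder that is stuck on slow work translates directly into a burst of repeated retry messages.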
By: Kenneth Holm (saitech) 2006-01-26 14:42:43.000-0600
I've found out that the problem is caused by one thread, the one that handles the SIP message, hanging while it tries to insert data into a CDR table via cdr_mysql.so, while the other asterisk thread waits for the first to finish; hence the deadlock. It seems that the CDR table is locked and therefore not able to receive inserts. I have converted my CDR table to InnoDB instead of MyISAM, and since then the debug message occurs at most 3 times in a row, and only about every 20th time I insert into the MySQL server via cdr_mysql.so. I don't classify that as an error, since it only hangs for 3 ms, which is not noticeable, and that is under 40% load.

By: Olle Johansson (oej) 2006-01-26 14:44:03.000-0600
Try enabling cdr caching in cdr.conf; that way cdr storage will happen in another thread and not block the sip channel. Please report back if this helps you and solves the issue. Thanks.

By: Kenneth Holm (saitech) 2006-01-26 14:51:30.000-0600
I'm not really too happy with the batch=yes option in cdr.conf because of the earlier core dumps. I have a thought, though. Isn't it possible to cache via a spool file instead of memory? Or, if asterisk goes down, to dump the cdr cache into a spool file? Or at least to have the option to do so?

By: Vladimir S. Blazhkun (vovan) 2006-02-28 09:31:49.000-0600
Seems fixed in 1.2.4; I have had it running for almost 4 weeks now.

By: Olle Johansson (oej) 2006-02-28 09:39:02.000-0600
Obviously fixed in 1.2.4. Thanks for reporting this finding! |
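The idea behind the CDR caching suggestion above, moving slow record storage out of the call-handling thread, can be sketched roughly as follows. This is a generic Python illustration, not how Asterisk implements batch mode; `BatchedCdrWriter`, `store` and `batch_size` are hypothetical names. Call-handling code only enqueues a record and returns, while a background thread does the slow writes in batches.

```python
import queue
import threading


class BatchedCdrWriter:
    """Queue records in memory and persist them from a background thread.

    `store` is a hypothetical callable that persists a list of records,
    for example by running one database INSERT per batch.
    """

    _STOP = object()  # sentinel that tells the worker to flush and exit

    def __init__(self, store, batch_size=10):
        self._store = store
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, record):
        # Called from the call-handling thread; never waits on the database.
        self._queue.put(record)

    def close(self):
        # Flush whatever is still pending and stop the worker.
        self._queue.put(self._STOP)
        self._thread.join()

    def _run(self):
        batch = []
        while True:
            record = self._queue.get()
            if record is self._STOP:
                break
            batch.append(record)
            if len(batch) >= self._batch_size:
                self._store(batch)
                batch = []
        if batch:
            self._store(batch)
```

A usage sketch: `writer = BatchedCdrWriter(store=my_insert_function)`, then `writer.post(record)` for each finished call and `writer.close()` at shutdown, where `my_insert_function` is whatever persistence routine is available. Because the queue lives only in memory, records that have not yet been stored are lost if the process dies, which is the concern behind the spool-file question in the thread.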