Summary: | ASTERISK-00609: Asterisk hangs and uses up all the cpu time appr. once a week | ||
Reporter: | jjvainio (jjvainio) | Labels: | |
Date Opened: | 2003-12-02 08:43:17.000-0600 | Date Closed: | 2004-09-25 02:52:18 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) debug.beforehang.txt.gz ( 1) event_log.beforehang.txt.gz | |
Description: | Asterisk is used to create conference calls using the /var/spool/asterisk/outgoing -mechanism and app_meetme. Asterisk connects to the participants using SIP via a Vegastream ISDN gateway. About once 1-4 weeks asterisk stops responsing to call requests and starts using all the cpu time. The console keeps working, but "stop now" hangs. ****** ADDITIONAL INFORMATION ****** top: 15:34:23 up 100 days, 3:49, 16 users, load average: 4.25, 1.33, 1.57 190 processes: 187 sleeping, 3 running, 0 zombie, 0 stopped CPU states: 98.0% user 1.9% system 0.0% nice 0.0% iowait 0.0% idle Mem: 513804k av, 476232k used, 37572k free, 0k shrd, 17828k buff 339624k actv, 240k in_d, 7800k in_c Swap: 1044216k av, 280328k used, 763888k free 202576k cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 4938 root 15 0 3376 756 216 S 91.6 0.1 1598m 0 asterisk [root@sme root]# ps axm | grep aster 1010 pts/12 S 0:00 gdb asterisk 4938 4938 pts/0 T 0:00 ./asterisk -vvvc 24676 pts/0 T 522:25 ./asterisk -vvvc 24675 pts/0 T 0:00 ./asterisk -vvvc 24674 pts/0 T 522:35 ./asterisk -vvvc 4953 pts/0 T 0:02 ./asterisk -vvvc 4952 pts/0 T 0:01 ./asterisk -vvvc 4951 pts/0 T 0:00 ./asterisk -vvvc 4950 pts/0 T 0:00 ./asterisk -vvvc 4949 pts/0 T 0:00 ./asterisk -vvvc 4948 pts/0 T 0:00 ./asterisk -vvvc 4947 pts/0 T 0:00 ./asterisk -vvvc 4946 pts/0 T 7:51 ./asterisk -vvvc 4945 pts/0 T 0:00 ./asterisk -vvvc 4944 pts/0 T 22:33 ./asterisk -vvvc 4943 pts/0 T 522:39 ./asterisk -vvvc 4942 pts/0 T 0:00 ./asterisk -vvvc 4941 pts/0 T 0:00 ./asterisk -vvvc 4940 pts/0 T 0:00 ./asterisk -vvvc 4939 pts/0 T 0:00 ./asterisk -vvvc (gdb) info threads 19 Thread 1082883264 (LWP 4939) 0xffffe002 in ?? () 18 Thread 1091271744 (LWP 4940) 0xffffe002 in ?? () 17 Thread 1099684800 (LWP 4941) 0xffffe002 in ?? () 16 Thread 1116941120 (LWP 4942) 0xffffe002 in ?? () 15 Thread 1125329600 (LWP 4943) 0x08052049 in ast_sched_del (con=0x80e7130, id=28) at sched.c:270 14 Thread 1133718080 (LWP 4944) 0xffffe002 in ?? () 13 Thread 1142106560 (LWP 4945) 0xffffe002 in ?? () 12 Thread 1150495040 (LWP 4946) 0xffffe002 in ?? () 11 Thread 1158883520 (LWP 4947) 0xffffe002 in ?? () 10 Thread 1167272000 (LWP 4948) 0xffffe002 in ?? () 9 Thread 1175660480 (LWP 4949) 0xffffe002 in ?? () 8 Thread 1184048960 (LWP 4950) 0xffffe002 in ?? () 7 Thread 1192437440 (LWP 4951) 0xffffe002 in ?? () 6 Thread 1200825920 (LWP 4952) 0xffffe002 in ?? () 5 Thread 1209214400 (LWP 4953) 0xffffe002 in ?? () 4 Thread 1242767552 (LWP 24674) 0x08051fc8 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:177 3 Thread 1225990464 (LWP 24675) 0xffffe002 in ?? () 2 Thread 1276322240 (LWP 24676) 0x08051fa7 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:167 1 Thread 1074494176 (LWP 4938) 0xffffe002 in ?? () (gdb) thread 2 [Switching to thread 2 (Thread 1276322240 (LWP 24676))]#0 0x08051fa7 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:167 167 if (SOONER(s->when, current->when)) (gdb) bt #0 0x08051fa7 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:167 #1 0x41b9bc8e in __sip_reliable_xmit (p=0x81121c8, seqno=102, resp=1070274636, data=0x3fcb184c <Address 0x3fcb184c out of bounds>, len=1102705424) at chan_sip.c:497 #2 0x41b91da8 in send_request (p=0x81121c8, req=0x4c1314c0, reliable=1, seqno=1070274636) at chan_sip.c:617 #3 0x41b901b2 in transmit_request_with_auth (p=0x81121c8, msg=0x9 <Address 0x9 out of bounds>, seqno=102, reliable=1070274636) at chan_sip.c:3137 #4 0x41b9fdb7 in sip_hangup (ast=0x4ca02d78) at chan_sip.c:1097 ASTERISK-1 0x080576c8 in ast_hangup (chan=0x4ca02d78) at channel.c:664 ASTERISK-2 0x08059670 in __ast_request_and_dial (type=0x4cc044a0 "SIP", format=1070274636, data=0x4cc045a0, timeout=1276320636, outstate=0x4c131da4, callerid=0x4ca02d78 "SIP/194.89.155.196-3b17", oh=0x4c131d2c) at channel.c:1593 ASTERISK-3 0x08067bd8 in ast_pbx_outgoing_exten (type=0x4cc044a0 "SIP", format=64, data=0x4cc045a0, timeout=30000, context=0x4cc049a0 "default", exten=0x4cc048a0 "8700", priority=1, reason=0x4c131b7c, sync=1285566032, callerid=0x4ca02d78 "SIP/194.89.155.196-3b17", variable=0x4cc04ba4 "", account=0x3fcb184c <Address 0x3fcb184c out of bounds>) at pbx.c:3803 ASTERISK-4 0x41d892ab in attempt_thread (data=0x4cc04390) at pbx_spool.c:196 ASTERISK-5 0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0 (gdb) thread 4 [Switching to thread 4 (Thread 1242767552 (LWP 24674))]#0 0x08051fc8 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:177 177 con->schedq = s; (gdb) bt #0 0x08051fc8 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:177 #1 0x41b9bc8e in __sip_reliable_xmit (p=0x8113e08, seqno=102, resp=1070274636, data=0x3fcb184c <Address 0x3fcb184c out of bounds>, len=1102705424) at chan_sip.c:497 #2 0x41b91da8 in send_request (p=0x8113e08, req=0x4a1313d0, reliable=1, seqno=1070274636) at chan_sip.c:617 #3 0x41b8fbcb in transmit_invite (p=0x8113e08, cmd=0x80d5e28 "(^\r\b\035", sdp=1, auth=0x8113e08 "", authheader=0x0, vxml_url=0x3fcb184c <Address 0x3fcb184c out of bounds>, distinctive_ring=0x811521c "0T\021\b7T\021\b?\002", init=1) at chan_sip.c:2805 #4 0x41b9fadc in sip_call (ast=0x8118340, dest=0x4cc00690 "+358405204347@194.89.155.196", timeout=0) at chan_sip.c:879 ASTERISK-1 0x08059d1d in ast_call (chan=0x8118340, addr=0x3fcb184c <Address 0x3fcb184c out of bounds>, timeout=1070274636) at channel.c:1717 ASTERISK-2 0x08059482 in __ast_request_and_dial (type=0x4cc00590 "SIP", format=1070274636, data=0x4cc00690, timeout=-1, outstate=0x4a131ca4, callerid=0x8118340 "SIP/194.89.155.196-b0b0", oh=0x4a131c2c) at channel.c:1522 ASTERISK-3 0x08067bd8 in ast_pbx_outgoing_exten (type=0x4cc00590 "SIP", format=64, data=0x4cc00690, timeout=30000, context=0x4cc00a90 "default", exten=0x4cc00990 "8600", priority=1, reason=0xffffffff, sync=135365656, callerid=0x8118340 "SIP/194.89.155.196-b0b0", variable=0x4cc00c94 "", account=0x3fcb184c <Address 0x3fcb184c out of bounds>) at pbx.c:3803 ASTERISK-4 0x41d892ab in attempt_thread (data=0x4cc00480) at pbx_spool.c:196 ASTERISK-5 0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0 (gdb) thread 15 [Switching to thread 15 (Thread 1125329600 (LWP 4943))]#0 0x08052049 in ast_sched_del (con=0x80e7130, id=28) at sched.c:270 270 s = s->next; (gdb) bt #0 0x08052049 in ast_sched_del (con=0x80e7130, id=28) at sched.c:270 #1 0x41bad335 in handle_response (p=0x4cc056b0, resp=100, rest=0x43131ffc "Trying", req=0x43131ddc) at chan_sip.c:578 #2 0x41ba3239 in handle_request (p=0x4cc056b0, req=0x43131ddc, sin=0x43131dcc) at chan_sip.c:5324 #3 0x41b9d944 in sipsock_read (id=0x80e6e48, fd=9, events=1, ignore=0x0) at chan_sip.c:5376 #4 0x080514c0 in ast_io_wait (ioc=0x80e70e8, howlong=28) at io.c:267 ASTERISK-1 0x41b99e44 in do_monitor (data=0x0) at chan_sip.c:5478 ASTERISK-2 0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0 (gdb) thread 4 (gdb) frame 1 #1 0x41b9bc8e in __sip_reliable_xmit (p=0x8113e08, seqno=102, resp=1070274636, data=0x3fcb184c <Address 0x3fcb184c out of bounds>, len=1102705424) at chan_sip.c:497 497 pkt->retransid = ast_sched_add(sched, DEFAULT_RETRANS, retrans_pkt, pkt); (gdb) frame 2 #2 0x41b91da8 in send_request (p=0x8113e08, req=0x4a1313d0, reliable=1, seqno=1070274636) at chan_sip.c:617 617 res = __sip_reliable_xmit(p, seqno, 0, req->data, req->len); (gdb) frame 3 #3 0x41b8fbcb in transmit_invite (p=0x8113e08, cmd=0x80d5e28 "(^\r\b\035", sdp=1, auth=0x8113e08 "", authheader=0x0, vxml_url=0x3fcb184c <Address 0x3fcb184c out of bounds>, distinctive_ring=0x811521c "0T\021\b7T\021\b?\002", init=1) at chan_sip.c:2805 2805 return send_request(p, &req, 1, p->ocseq); (gdb) frame 4 #4 0x41b9fadc in sip_call (ast=0x8118340, dest=0x4cc00690 "+358405204347@194.89.155.196", timeout=0) at chan_sip.c:879 879 transmit_invite(p, "INVITE", 1, NULL, NULL, vxml_url,distinctive_ring, 1); (gdb) frame 5 ASTERISK-1 0x08059d1d in ast_call (chan=0x8118340, addr=0x3fcb184c <Address 0x3fcb184c out of bounds>, timeout=1070274636) at channel.c:1717 1717 res = chan->pvt->call(chan, addr, timeout); (gdb) frame 6 ASTERISK-2 0x08059482 in __ast_request_and_dial (type=0x4cc00590 "SIP", format=1070274636, data=0x4cc00690, timeout=-1, outstate=0x4a131ca4, callerid=0x8118340 "SIP/194.89.155.196-b0b0", oh=0x4a131c2c) at channel.c:1522 1522 if (!ast_call(chan, data, 0)) { (gdb) frame 7 ASTERISK-3 0x08067bd8 in ast_pbx_outgoing_exten (type=0x4cc00590 "SIP", format=64, data=0x4cc00690, timeout=30000, context=0x4cc00a90 "default", exten=0x4cc00990 "8600", priority=1, reason=0xffffffff, sync=135365656, callerid=0x8118340 "SIP/194.89.155.196-b0b0", variable=0x4cc00c94 "", account=0x3fcb184c <Address 0x3fcb184c out of bounds>) at pbx.c:3803 3803 chan = __ast_request_and_dial(type, format, data, timeout, reason, callerid, &oh); ((gdb) frame 8 ASTERISK-4 0x41d892ab in attempt_thread (data=0x4cc00480) at pbx_spool.c:196 196 res = ast_pbx_outgoing_exten(o->tech, AST_FORMAT_SLINEAR, o->dest, o->waittime * 1000, o->context, o->exten, o->priority, &reason, 2 /* wait to finish */, o->callerid, o->variable, o->account); (gdb) frame 9 ASTERISK-5 0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0 (gdb) thread 2 [Switching to thread 2 (Thread 1276322240 (LWP 24676))]#0 0x08051fa7 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:167 167 if (SOONER(s->when, current->when)) (gdb) frame 1 #1 0x41b9bc8e in __sip_reliable_xmit (p=0x81121c8, seqno=102, resp=1070274636, data=0x3fcb184c <Address 0x3fcb184c out of bounds>, len=1102705424) at chan_sip.c:497 497 pkt->retransid = ast_sched_add(sched, DEFAULT_RETRANS, retrans_pkt, pkt); (gdb) frame 2 #2 0x41b91da8 in send_request (p=0x81121c8, req=0x4c1314c0, reliable=1, seqno=1070274636) at chan_sip.c:617 617 res = __sip_reliable_xmit(p, seqno, 0, req->data, req->len); (gdb) frame 3 #3 0x41b901b2 in transmit_request_with_auth (p=0x81121c8, msg=0x9 <Address 0x9 out of bounds>, seqno=102, reliable=1070274636) at chan_sip.c:3137 3137 return send_request(p, &resp, reliable, seqno ? seqno : p->ocseq); (gdb) frame 4 #4 0x41b9fdb7 in sip_hangup (ast=0x4ca02d78) at chan_sip.c:1097 1097 transmit_request_with_auth(p, "CANCEL", p->ocseq, 1); (gdb) frame 5 ASTERISK-1 0x080576c8 in ast_hangup (chan=0x4ca02d78) at channel.c:664 664 res = chan->pvt->hangup(chan); (gdb) frame 6 ASTERISK-2 0x08059670 in __ast_request_and_dial (type=0x4cc044a0 "SIP", format=1070274636, data=0x4cc045a0, timeout=1276320636, outstate=0x4c131da4, callerid=0x4ca02d78 "SIP/194.89.155.196-3b17", oh=0x4c131d2c) at channel.c:1593 1593 ast_hangup(chan); (gdb) frame 7 ASTERISK-3 0x08067bd8 in ast_pbx_outgoing_exten (type=0x4cc044a0 "SIP", format=64, data=0x4cc045a0, timeout=30000, context=0x4cc049a0 "default", exten=0x4cc048a0 "8700", priority=1, reason=0x4c131b7c, sync=1285566032, callerid=0x4ca02d78 "SIP/194.89.155.196-3b17", variable=0x4cc04ba4 "", account=0x3fcb184c <Address 0x3fcb184c out of bounds>) at pbx.c:3803 3803 chan = __ast_request_and_dial(type, format, data, timeout, reason, callerid, &oh); (gdb) frame 8 ASTERISK-4 0x41d892ab in attempt_thread (data=0x4cc04390) at pbx_spool.c:196 196 res = ast_pbx_outgoing_exten(o->tech, AST_FORMAT_SLINEAR, o->dest, o->waittime * 1000, o->context, o->exten, o->priority, &reason, 2 /* wait to finish */, o->callerid, o->variable, o->account); (gdb) frame 9 ASTERISK-5 0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0 (gdb) thread 15 [Switching to thread 15 (Thread 1125329600 (LWP 4943))]#0 0x08052049 in ast_sched_del (con=0x80e7130, id=28) at sched.c:270 270 s = s->next; (gdb) frame 1 #1 0x41bad335 in handle_response (p=0x4cc056b0, resp=100, rest=0x43131ffc "Trying", req=0x43131ddc) at chan_sip.c:578 578 ast_sched_del(sched, cur->retransid); (gdb) frame 2 #2 0x41ba3239 in handle_request (p=0x4cc056b0, req=0x43131ddc, sin=0x43131dcc) at chan_sip.c:5324 5324 handle_response(p, respid, e + len, req); (gdb) frame 3 #3 0x41b9d944 in sipsock_read (id=0x80e6e48, fd=9, events=1, ignore=0x0) at chan_sip.c:5376 5376 handle_request(p, &req, &sin); (gdb) frame 4 #4 0x080514c0 in ast_io_wait (ioc=0x80e70e8, howlong=28) at io.c:267 267 if (!ioc->ior[x].callback(ioc->ior[x].id, ioc->fds[x].fd, ioc->fds[x].revents, ioc->ior[x].data)) { (gdb) frame 5 ASTERISK-1 0x41b99e44 in do_monitor (data=0x0) at chan_sip.c:5478 5478 res = ast_io_wait(io, res); (gdb) frame 6 ASTERISK-2 0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0 | ||
Comments: | By: Brian West (bkw918) 2003-12-02 10:04:19.000-0600 What distro, and what version of CVS are you running? By: jjvainio (jjvainio) 2003-12-02 10:44:17.000-0600 Redhat 9.0, CVS HEAD checked out Nov 19 13:45 GMT Same problem was present also in CVS HEAD that I checked out during summer, probably in July. By: Brian West (bkw918) 2003-12-02 10:57:50.000-0600 Lots of issues with Redhat 9. Try to export LD_ASSUME_KERNEL=2.4.1 before you start * By: jjvainio (jjvainio) 2003-12-02 11:34:37.000-0600 Ok. I did that now, we'll see if the problem persists in a few weeks. What exactly does LD_ASSUME_KERNEL=2.4.1 do? By: Paul Cadach (pcadach) 2003-12-02 14:33:40.000-0600 Better to look for backtrace of thread 1, which uses 98% of CPU time ast pointed by header of top's output. Also, missing lock for schedule() call pointed by "interference" between threads 2, 4 and 15 which both modifies scheduler's task lists within the same scheduling context (0x80e7130)... Does "RH-9 issues" are buggy mutexes? By: Brian West (bkw918) 2003-12-02 17:14:47.000-0600 Also if it happens again please attach a bt. By: Brian West (bkw918) 2003-12-14 11:44:37.000-0600 Please provide an update on this. By: zoa (zoa) 2003-12-15 04:13:12.000-0600 i had a similar problem (the same problem??) Kram made some changes to ast_sched_add and ast_sched_del and the problem went away. Upgrade to a more recent cvs and try again. By: Brian West (bkw918) 2004-01-07 00:09:18.000-0600 Can you tell me if this still happens? By: jjvainio (jjvainio) 2004-01-08 08:51:11.000-0600 Asterisk has been running without restarts or hangs after I set the LD_ASSUME_KERNEL=2.4.1 environment variable Dec 2. By: jjvainio (jjvainio) 2004-04-01 08:36:48.000-0600 Today this happened again. And now I definitely had the LD_ASSUME_KERNEL=2.4.1 set. Symptoms were the same as described above: System load went up to 5.5, with asterisk using it almost all. Most console commands (show channels etc) work, but stop now hung. kill -9 was needed to stop it. I didn't get the gdb stuff this time, because I didn't realize in time the need for LD_ASSUME_KERNEL also in the gdb shell. Without it gdb wouldn't show any info for the threads. One of our phones rang couple of minutes after placing the file in asterisk/outgoing, so it seems that asterisk was doing some processing, but the runaway thread was slowing down everything so much that it seemed like asterisk had hung. By: Mark Spencer (markster) 2004-04-02 02:33:52.000-0600 are there any unusual messages about scheduled events? By: Mark Spencer (markster) 2004-04-02 02:34:49.000-0600 Also, just to be pedantic, please rm -rf /usr/lib/asterisk/modules and do a "make clean ; make install" and also let me know if you have patched this system in any way. By: jjvainio (jjvainio) 2004-04-02 07:05:35.000-0600 There are some messages about scheduling, but I don't really know if they are unusual or not. I uploaded part of the asterisk debug and event_log so you can see if there are any hints in them. I included quite a long bit, because I not really sure when the hang happened. First bunch of calls on april 1 went through, but there was only silence (It should have played a .gsm file and then execute app_meetme). On the second try few hours after the first only one of the phones rang with a delay of minutes. This leads me to suspect that asterisk probably hung on the apr. 1 calls or between them and the previous. I think I will now upgrade asterisk to latest cvs and do the clean up while I am at it. Only patch applied is the following for supporting dtmf tones from a Vegastream gateway (btw feel free to include this in asterisk): --- chan_sip.c.~1.247.~ 2003-11-19 15:44:21.000000000 +0200 +++ chan_sip.c 2003-11-19 15:45:06.000000000 +0200 @@ -4324,6 +4324,11 @@ if (p->owner) { if (strlen(buf)) { + if ((strlen(buf)>=14) && (strncmp(buf, "event DTMF 1 ",13)==0)) { + resp=buf[13]; + if (sipdebug) + ast_verbose("DTMF received: '%c'\n", buf[13]); + } else { if (sipdebug) ast_verbose("DTMF received: '%c'\n", buf[0]); if (buf[0] == '*') @@ -4341,6 +4346,7 @@ } else if (event < 16) { resp = 'A' + (event - 12); } + } memset(&f, 0, sizeof(f)); f.frametype = AST_FRAME_DTMF; f.subclass = resp; By: twisted (twisted) 2004-04-03 00:52:49.000-0600 If you would like your patch considered for inclusion, please post it under a different bug, marked [patch], and include a diff -u. By: Mark Spencer (markster) 2004-04-03 20:38:09.000-0600 Please check out a clean copy of asterisk head CVS and rebuild and rerun. The "167" line number here does not correspond with the mentioned line, leading me to believe this is an extremely old version of asterisk. By: Mark Spencer (markster) 2004-04-04 15:52:35 Okay this new hang you had i think is totally unrelated to the old one. If you can get us an updated backtrace of the thread taking 100% CPU, then open a *new ticket* otherwise we're chasing a phantom. |