[Home]

Summary:ASTERISK-00609: Asterisk hangs and uses up all the cpu time appr. once a week
Reporter:jjvainio (jjvainio)Labels:
Date Opened:2003-12-02 08:43:17.000-0600Date Closed:2004-09-25 02:52:18
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) debug.beforehang.txt.gz
( 1) event_log.beforehang.txt.gz
Description:Asterisk is used to create conference calls using the /var/spool/asterisk/outgoing -mechanism and app_meetme. Asterisk connects to the participants using SIP via a Vegastream ISDN gateway.

About once 1-4 weeks asterisk stops responsing to call requests and starts using all the cpu time. The console keeps working, but "stop now" hangs.

****** ADDITIONAL INFORMATION ******

top:

15:34:23  up 100 days,  3:49, 16 users,  load average: 4.25, 1.33, 1.57
190 processes: 187 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  98.0% user   1.9% system   0.0% nice   0.0% iowait   0.0% idle
Mem:   513804k av,  476232k used,   37572k free,       0k shrd,   17828k buff
                   339624k actv,     240k in_d,    7800k in_c
Swap: 1044216k av,  280328k used,  763888k free                  202576k cached

 PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
4938 root      15   0  3376  756   216 S    91.6  0.1  1598m   0 asterisk


[root@sme root]# ps axm | grep aster
1010 pts/12   S      0:00 gdb asterisk 4938
4938 pts/0    T      0:00 ./asterisk -vvvc
24676 pts/0    T    522:25 ./asterisk -vvvc
24675 pts/0    T      0:00 ./asterisk -vvvc
24674 pts/0    T    522:35 ./asterisk -vvvc
4953 pts/0    T      0:02 ./asterisk -vvvc
4952 pts/0    T      0:01 ./asterisk -vvvc
4951 pts/0    T      0:00 ./asterisk -vvvc
4950 pts/0    T      0:00 ./asterisk -vvvc
4949 pts/0    T      0:00 ./asterisk -vvvc
4948 pts/0    T      0:00 ./asterisk -vvvc
4947 pts/0    T      0:00 ./asterisk -vvvc
4946 pts/0    T      7:51 ./asterisk -vvvc
4945 pts/0    T      0:00 ./asterisk -vvvc
4944 pts/0    T     22:33 ./asterisk -vvvc
4943 pts/0    T    522:39 ./asterisk -vvvc
4942 pts/0    T      0:00 ./asterisk -vvvc
4941 pts/0    T      0:00 ./asterisk -vvvc
4940 pts/0    T      0:00 ./asterisk -vvvc
4939 pts/0    T      0:00 ./asterisk -vvvc

(gdb) info threads
 19 Thread 1082883264 (LWP 4939)  0xffffe002 in ?? ()
 18 Thread 1091271744 (LWP 4940)  0xffffe002 in ?? ()
 17 Thread 1099684800 (LWP 4941)  0xffffe002 in ?? ()
 16 Thread 1116941120 (LWP 4942)  0xffffe002 in ?? ()
 15 Thread 1125329600 (LWP 4943)  0x08052049 in ast_sched_del (con=0x80e7130, id=28) at sched.c:270
 14 Thread 1133718080 (LWP 4944)  0xffffe002 in ?? ()
 13 Thread 1142106560 (LWP 4945)  0xffffe002 in ?? ()
 12 Thread 1150495040 (LWP 4946)  0xffffe002 in ?? ()
 11 Thread 1158883520 (LWP 4947)  0xffffe002 in ?? ()
 10 Thread 1167272000 (LWP 4948)  0xffffe002 in ?? ()
 9 Thread 1175660480 (LWP 4949)  0xffffe002 in ?? ()
 8 Thread 1184048960 (LWP 4950)  0xffffe002 in ?? ()
 7 Thread 1192437440 (LWP 4951)  0xffffe002 in ?? ()
 6 Thread 1200825920 (LWP 4952)  0xffffe002 in ?? ()
 5 Thread 1209214400 (LWP 4953)  0xffffe002 in ?? ()
 4 Thread 1242767552 (LWP 24674)  0x08051fc8 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:177
 3 Thread 1225990464 (LWP 24675)  0xffffe002 in ?? ()
 2 Thread 1276322240 (LWP 24676)  0x08051fa7 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:167
 1 Thread 1074494176 (LWP 4938)  0xffffe002 in ?? ()

(gdb) thread 2
[Switching to thread 2 (Thread 1276322240 (LWP 24676))]#0  0x08051fa7 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c,
   data=0x3fcb184c) at sched.c:167
167                     if (SOONER(s->when, current->when))
(gdb) bt
#0  0x08051fa7 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:167
#1  0x41b9bc8e in __sip_reliable_xmit (p=0x81121c8, seqno=102, resp=1070274636, data=0x3fcb184c <Address 0x3fcb184c out of bounds>, len=1102705424)
   at chan_sip.c:497
#2  0x41b91da8 in send_request (p=0x81121c8, req=0x4c1314c0, reliable=1, seqno=1070274636) at chan_sip.c:617
#3  0x41b901b2 in transmit_request_with_auth (p=0x81121c8, msg=0x9 <Address 0x9 out of bounds>, seqno=102, reliable=1070274636) at chan_sip.c:3137
#4  0x41b9fdb7 in sip_hangup (ast=0x4ca02d78) at chan_sip.c:1097
ASTERISK-1  0x080576c8 in ast_hangup (chan=0x4ca02d78) at channel.c:664
ASTERISK-2  0x08059670 in __ast_request_and_dial (type=0x4cc044a0 "SIP", format=1070274636, data=0x4cc045a0, timeout=1276320636, outstate=0x4c131da4,
   callerid=0x4ca02d78 "SIP/194.89.155.196-3b17", oh=0x4c131d2c) at channel.c:1593
ASTERISK-3  0x08067bd8 in ast_pbx_outgoing_exten (type=0x4cc044a0 "SIP", format=64, data=0x4cc045a0, timeout=30000, context=0x4cc049a0 "default",
   exten=0x4cc048a0 "8700", priority=1, reason=0x4c131b7c, sync=1285566032, callerid=0x4ca02d78 "SIP/194.89.155.196-3b17", variable=0x4cc04ba4 "",
   account=0x3fcb184c <Address 0x3fcb184c out of bounds>) at pbx.c:3803
ASTERISK-4  0x41d892ab in attempt_thread (data=0x4cc04390) at pbx_spool.c:196
ASTERISK-5  0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0

(gdb) thread 4
[Switching to thread 4 (Thread 1242767552 (LWP 24674))]#0  0x08051fc8 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c,
   data=0x3fcb184c) at sched.c:177
177                     con->schedq = s;
(gdb) bt
#0  0x08051fc8 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c, data=0x3fcb184c) at sched.c:177
#1  0x41b9bc8e in __sip_reliable_xmit (p=0x8113e08, seqno=102, resp=1070274636, data=0x3fcb184c <Address 0x3fcb184c out of bounds>, len=1102705424)
   at chan_sip.c:497
#2  0x41b91da8 in send_request (p=0x8113e08, req=0x4a1313d0, reliable=1, seqno=1070274636) at chan_sip.c:617
#3  0x41b8fbcb in transmit_invite (p=0x8113e08, cmd=0x80d5e28 "(^\r\b\035", sdp=1, auth=0x8113e08 "", authheader=0x0,
   vxml_url=0x3fcb184c <Address 0x3fcb184c out of bounds>, distinctive_ring=0x811521c "0T\021\b7T\021\b?\002", init=1) at chan_sip.c:2805
#4  0x41b9fadc in sip_call (ast=0x8118340, dest=0x4cc00690 "+358405204347@194.89.155.196", timeout=0) at chan_sip.c:879
ASTERISK-1  0x08059d1d in ast_call (chan=0x8118340, addr=0x3fcb184c <Address 0x3fcb184c out of bounds>, timeout=1070274636) at channel.c:1717
ASTERISK-2  0x08059482 in __ast_request_and_dial (type=0x4cc00590 "SIP", format=1070274636, data=0x4cc00690, timeout=-1, outstate=0x4a131ca4,
   callerid=0x8118340 "SIP/194.89.155.196-b0b0", oh=0x4a131c2c) at channel.c:1522
ASTERISK-3  0x08067bd8 in ast_pbx_outgoing_exten (type=0x4cc00590 "SIP", format=64, data=0x4cc00690, timeout=30000, context=0x4cc00a90 "default",
   exten=0x4cc00990 "8600", priority=1, reason=0xffffffff, sync=135365656, callerid=0x8118340 "SIP/194.89.155.196-b0b0", variable=0x4cc00c94 "",
   account=0x3fcb184c <Address 0x3fcb184c out of bounds>) at pbx.c:3803
ASTERISK-4  0x41d892ab in attempt_thread (data=0x4cc00480) at pbx_spool.c:196
ASTERISK-5  0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0

(gdb) thread 15
[Switching to thread 15 (Thread 1125329600 (LWP 4943))]#0  0x08052049 in ast_sched_del (con=0x80e7130, id=28) at sched.c:270
270                     s = s->next;
(gdb) bt
#0  0x08052049 in ast_sched_del (con=0x80e7130, id=28) at sched.c:270
#1  0x41bad335 in handle_response (p=0x4cc056b0, resp=100, rest=0x43131ffc "Trying", req=0x43131ddc) at chan_sip.c:578
#2  0x41ba3239 in handle_request (p=0x4cc056b0, req=0x43131ddc, sin=0x43131dcc) at chan_sip.c:5324
#3  0x41b9d944 in sipsock_read (id=0x80e6e48, fd=9, events=1, ignore=0x0) at chan_sip.c:5376
#4  0x080514c0 in ast_io_wait (ioc=0x80e70e8, howlong=28) at io.c:267
ASTERISK-1  0x41b99e44 in do_monitor (data=0x0) at chan_sip.c:5478
ASTERISK-2  0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0

(gdb) thread 4
(gdb) frame 1
#1  0x41b9bc8e in __sip_reliable_xmit (p=0x8113e08, seqno=102, resp=1070274636, data=0x3fcb184c <Address 0x3fcb184c out of bounds>, len=1102705424)
   at chan_sip.c:497
497             pkt->retransid = ast_sched_add(sched, DEFAULT_RETRANS, retrans_pkt, pkt);
(gdb) frame 2
#2  0x41b91da8 in send_request (p=0x8113e08, req=0x4a1313d0, reliable=1, seqno=1070274636) at chan_sip.c:617
617                     res = __sip_reliable_xmit(p, seqno, 0, req->data, req->len);
(gdb) frame 3
#3  0x41b8fbcb in transmit_invite (p=0x8113e08, cmd=0x80d5e28 "(^\r\b\035", sdp=1, auth=0x8113e08 "", authheader=0x0,
   vxml_url=0x3fcb184c <Address 0x3fcb184c out of bounds>, distinctive_ring=0x811521c "0T\021\b7T\021\b?\002", init=1) at chan_sip.c:2805
2805            return send_request(p, &req, 1, p->ocseq);
(gdb) frame 4
#4  0x41b9fadc in sip_call (ast=0x8118340, dest=0x4cc00690 "+358405204347@194.89.155.196", timeout=0) at chan_sip.c:879
879             transmit_invite(p, "INVITE", 1, NULL, NULL, vxml_url,distinctive_ring, 1);
(gdb) frame 5
ASTERISK-1  0x08059d1d in ast_call (chan=0x8118340, addr=0x3fcb184c <Address 0x3fcb184c out of bounds>, timeout=1070274636) at channel.c:1717
1717                            res = chan->pvt->call(chan, addr, timeout);
(gdb) frame 6
ASTERISK-2  0x08059482 in __ast_request_and_dial (type=0x4cc00590 "SIP", format=1070274636, data=0x4cc00690, timeout=-1, outstate=0x4a131ca4,
   callerid=0x8118340 "SIP/194.89.155.196-b0b0", oh=0x4a131c2c) at channel.c:1522
1522                    if (!ast_call(chan, data, 0)) {
(gdb) frame 7
ASTERISK-3  0x08067bd8 in ast_pbx_outgoing_exten (type=0x4cc00590 "SIP", format=64, data=0x4cc00690, timeout=30000, context=0x4cc00a90 "default",
   exten=0x4cc00990 "8600", priority=1, reason=0xffffffff, sync=135365656, callerid=0x8118340 "SIP/194.89.155.196-b0b0", variable=0x4cc00c94 "",
   account=0x3fcb184c <Address 0x3fcb184c out of bounds>) at pbx.c:3803
3803                    chan = __ast_request_and_dial(type, format, data, timeout, reason, callerid, &oh);
((gdb) frame 8
ASTERISK-4  0x41d892ab in attempt_thread (data=0x4cc00480) at pbx_spool.c:196
196                     res = ast_pbx_outgoing_exten(o->tech, AST_FORMAT_SLINEAR, o->dest, o->waittime * 1000, o->context, o->exten, o->priority, &reason, 2 /* wait to finish */, o->callerid, o->variable, o->account);
(gdb) frame 9
ASTERISK-5  0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0


(gdb) thread 2
[Switching to thread 2 (Thread 1276322240 (LWP 24676))]#0  0x08051fa7 in ast_sched_add (con=0x80e7130, when=135093800, callback=0x3fcb184c,
   data=0x3fcb184c) at sched.c:167
167                     if (SOONER(s->when, current->when))
(gdb) frame 1
#1  0x41b9bc8e in __sip_reliable_xmit (p=0x81121c8, seqno=102, resp=1070274636, data=0x3fcb184c <Address 0x3fcb184c out of bounds>, len=1102705424)
   at chan_sip.c:497
497             pkt->retransid = ast_sched_add(sched, DEFAULT_RETRANS, retrans_pkt, pkt);
(gdb) frame 2
#2  0x41b91da8 in send_request (p=0x81121c8, req=0x4c1314c0, reliable=1, seqno=1070274636) at chan_sip.c:617
617                     res = __sip_reliable_xmit(p, seqno, 0, req->data, req->len);
(gdb) frame 3
#3  0x41b901b2 in transmit_request_with_auth (p=0x81121c8, msg=0x9 <Address 0x9 out of bounds>, seqno=102, reliable=1070274636) at chan_sip.c:3137
3137            return send_request(p, &resp, reliable, seqno ? seqno : p->ocseq);
(gdb) frame 4
#4  0x41b9fdb7 in sip_hangup (ast=0x4ca02d78) at chan_sip.c:1097
1097                                    transmit_request_with_auth(p, "CANCEL", p->ocseq, 1);
(gdb) frame 5
ASTERISK-1  0x080576c8 in ast_hangup (chan=0x4ca02d78) at channel.c:664
664                             res = chan->pvt->hangup(chan);
(gdb) frame 6
ASTERISK-2  0x08059670 in __ast_request_and_dial (type=0x4cc044a0 "SIP", format=1070274636, data=0x4cc045a0, timeout=1276320636, outstate=0x4c131da4,
   callerid=0x4ca02d78 "SIP/194.89.155.196-3b17", oh=0x4c131d2c) at channel.c:1593
1593                    ast_hangup(chan);
(gdb) frame 7
ASTERISK-3  0x08067bd8 in ast_pbx_outgoing_exten (type=0x4cc044a0 "SIP", format=64, data=0x4cc045a0, timeout=30000, context=0x4cc049a0 "default",
   exten=0x4cc048a0 "8700", priority=1, reason=0x4c131b7c, sync=1285566032, callerid=0x4ca02d78 "SIP/194.89.155.196-3b17", variable=0x4cc04ba4 "",
   account=0x3fcb184c <Address 0x3fcb184c out of bounds>) at pbx.c:3803
3803                    chan = __ast_request_and_dial(type, format, data, timeout, reason, callerid, &oh);
(gdb) frame 8
ASTERISK-4  0x41d892ab in attempt_thread (data=0x4cc04390) at pbx_spool.c:196
196                     res = ast_pbx_outgoing_exten(o->tech, AST_FORMAT_SLINEAR, o->dest, o->waittime * 1000, o->context, o->exten, o->priority, &reason, 2 /* wait to finish */, o->callerid, o->variable, o->account);
(gdb) frame 9
ASTERISK-5  0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0


(gdb) thread 15
[Switching to thread 15 (Thread 1125329600 (LWP 4943))]#0  0x08052049 in ast_sched_del (con=0x80e7130, id=28) at sched.c:270
270                     s = s->next;
(gdb) frame 1
#1  0x41bad335 in handle_response (p=0x4cc056b0, resp=100, rest=0x43131ffc "Trying", req=0x43131ddc) at chan_sip.c:578
578                                     ast_sched_del(sched, cur->retransid);
(gdb) frame 2
#2  0x41ba3239 in handle_request (p=0x4cc056b0, req=0x43131ddc, sin=0x43131dcc) at chan_sip.c:5324
5324                            handle_response(p, respid, e + len, req);
(gdb) frame 3
#3  0x41b9d944 in sipsock_read (id=0x80e6e48, fd=9, events=1, ignore=0x0) at chan_sip.c:5376
5376                    handle_request(p, &req, &sin);
(gdb) frame 4
#4  0x080514c0 in ast_io_wait (ioc=0x80e70e8, howlong=28) at io.c:267
267                                             if (!ioc->ior[x].callback(ioc->ior[x].id, ioc->fds[x].fd, ioc->fds[x].revents, ioc->ior[x].data)) {
(gdb) frame 5
ASTERISK-1  0x41b99e44 in do_monitor (data=0x0) at chan_sip.c:5478
5478                    res = ast_io_wait(io, res);
(gdb) frame 6
ASTERISK-2  0x4003b2b6 in start_thread () from /lib/tls/libpthread.so.0

Comments:By: Brian West (bkw918) 2003-12-02 10:04:19.000-0600

What distro, and what version of CVS are you running?

By: jjvainio (jjvainio) 2003-12-02 10:44:17.000-0600

Redhat 9.0, CVS HEAD checked out Nov 19 13:45 GMT

Same problem was present also in CVS HEAD that I checked out during summer, probably in July.

By: Brian West (bkw918) 2003-12-02 10:57:50.000-0600

Lots of issues with Redhat 9.  

Try to export LD_ASSUME_KERNEL=2.4.1

before you start *

By: jjvainio (jjvainio) 2003-12-02 11:34:37.000-0600

Ok. I did that now, we'll see if the problem persists in a few weeks. What exactly does LD_ASSUME_KERNEL=2.4.1 do?

By: Paul Cadach (pcadach) 2003-12-02 14:33:40.000-0600

Better to look for backtrace of thread 1, which uses 98% of CPU time ast pointed by header of top's output.

Also, missing lock for schedule() call pointed by "interference" between threads 2, 4 and 15 which both modifies scheduler's task lists within the same scheduling context (0x80e7130)...

Does "RH-9 issues" are buggy mutexes?

By: Brian West (bkw918) 2003-12-02 17:14:47.000-0600

Also if it happens again please attach a bt.

By: Brian West (bkw918) 2003-12-14 11:44:37.000-0600

Please provide an update on this.

By: zoa (zoa) 2003-12-15 04:13:12.000-0600

i had a similar problem (the same problem??) Kram made some changes to ast_sched_add and ast_sched_del and the problem went away.

Upgrade to a more recent cvs and try again.

By: Brian West (bkw918) 2004-01-07 00:09:18.000-0600

Can you tell me if this still happens?

By: jjvainio (jjvainio) 2004-01-08 08:51:11.000-0600

Asterisk has been running without restarts or hangs after I set the LD_ASSUME_KERNEL=2.4.1 environment variable Dec 2.

By: jjvainio (jjvainio) 2004-04-01 08:36:48.000-0600

Today this happened again. And now I definitely had the LD_ASSUME_KERNEL=2.4.1 set. Symptoms were the same as described above: System load went up to 5.5, with asterisk using it almost all. Most console commands (show channels etc) work, but stop now hung. kill -9 was needed to stop it.

I didn't get the gdb stuff this time, because I didn't realize in time the need for LD_ASSUME_KERNEL also in the gdb shell. Without it gdb wouldn't show any info for the threads.

One of our phones rang couple of minutes after placing the file in asterisk/outgoing, so it seems that asterisk was doing some processing, but the runaway thread was slowing down everything so much that it seemed like asterisk had hung.

By: Mark Spencer (markster) 2004-04-02 02:33:52.000-0600

are there any unusual messages about scheduled events?

By: Mark Spencer (markster) 2004-04-02 02:34:49.000-0600

Also, just to be pedantic, please rm -rf /usr/lib/asterisk/modules and do a "make clean ; make install" and also let me know if you have patched this system in any way.

By: jjvainio (jjvainio) 2004-04-02 07:05:35.000-0600

There are some messages about scheduling, but I don't really know if they are unusual or not. I uploaded part of the asterisk debug and event_log so you can see if there are any hints in them. I included quite a long bit, because I not really sure when the hang happened.

First bunch of calls on april 1 went through, but there was only silence (It should have played a .gsm file and then execute app_meetme). On the second try few hours after the first only one of the phones rang with a delay of minutes. This leads me to suspect that asterisk probably hung on the apr. 1 calls or between them and the previous.

I think I will now upgrade asterisk to latest cvs and do the clean up while I am at it.

Only patch applied is the following for supporting dtmf tones from a Vegastream gateway (btw feel free to include this in asterisk):

--- chan_sip.c.~1.247.~ 2003-11-19 15:44:21.000000000 +0200
+++ chan_sip.c  2003-11-19 15:45:06.000000000 +0200
@@ -4324,6 +4324,11 @@

       if (p->owner) {
               if (strlen(buf)) {
+                 if ((strlen(buf)>=14) && (strncmp(buf, "event DTMF 1 ",13)==0)) {
+                   resp=buf[13];
+                       if (sipdebug)
+                               ast_verbose("DTMF received: '%c'\n", buf[13]);
+                 } else {
                       if (sipdebug)
                               ast_verbose("DTMF received: '%c'\n", buf[0]);
                       if (buf[0] == '*')
@@ -4341,6 +4346,7 @@
                        } else if (event < 16) {
                                resp = 'A' + (event - 12);
                        }
+                 }
                        memset(&f, 0, sizeof(f));
                        f.frametype = AST_FRAME_DTMF;
                        f.subclass = resp;

By: twisted (twisted) 2004-04-03 00:52:49.000-0600

If you would like your patch considered for inclusion, please post it under a different bug, marked [patch], and include a diff -u.

By: Mark Spencer (markster) 2004-04-03 20:38:09.000-0600

Please check out a clean copy of asterisk head CVS and rebuild and rerun.  The "167" line number here does not correspond with the mentioned line, leading me to believe this is an extremely old version of asterisk.

By: Mark Spencer (markster) 2004-04-04 15:52:35

Okay this new hang you had i think is totally unrelated to the old one.  If you can get us an updated backtrace of the thread taking 100% CPU, then open a *new ticket* otherwise we're chasing a phantom.