[Home]

Summary:ASTERISK-03591: in do_monitor a call to ast_sched_runq(sched) never returns
Reporter:constfilin (constfilin)Labels:
Date Opened:2005-02-26 01:46:26.000-0600Date Closed:2008-01-15 15:28:32.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) iax_cache_rev0.diff
( 1) sip_cache_rev0.diff
Description:Greetings -

It all started from asterisk not receving any more
SIP commands and not answering to them.

CLI "sip debug" command showed no SIP traffic, whereas
"tcpdump -n -V -X dst port 5060" did show SIP packets coming
to the kernel.

Further reseach revealed that in "do_monitor" function a
call to "ast_sched_runq" never returns and asterisk never calls "ast_io_wait" to process IO packets.

How I got all this information?

I developed a little code that incremented some counters
at all places just before other functions are called from do_monitor and incremented other counters just after the calls returned. Then I developed a CLI command that typed the counters. Here's what I got:

ast-ny*CLI> sip show monitor
Monitor Status:
       entering_loop=163
       searching_to_kill=163
       end_searching_to_kill=163
       doing_the_locks=163
       end_doing_the_locks=163
       testing_cancel=163
       end_testing_cancel=163
       calling_ast_sched_wait=163
       end_calling_ast_sched_wait=163
       calling_ast_io_wait=163
       end_calling_ast_io_wait=163
       calling_ast_mutex_lock_monlock=163
       end_calling_ast_mutex_lock_monlock=163
       calling_ast_shed_runq=163
       end_calling_ast_shed_runq=162
       traversing_the_peers=162
       end_traversing_the_peers=162
       calling_ast_mutex_unlock_monlock=162
       end_calling_ast_mutex_unlock_monlock=162
       leaving_loop=162
Comments:By: Mark Spencer (markster) 2005-02-26 01:50:00.000-0600

You need a thread apply all bt, please read the instructions in the bug guidelines for attaching gdb to a deadlocked asterisk.  Has this code been modified in any way?  If so, can you duplicate the problem on unpatched Asterisk?

By: constfilin (constfilin) 2005-02-26 02:26:17.000-0600

this all started on unpatched asterisk - I am not using patched chan_sip.c
and my own realtime engine anymore.

I will try to attach gdb to asterisk and produce a complete debug report.

By: constfilin (constfilin) 2005-02-26 02:35:49.000-0600

(gdb) info thread
 14 Thread 1084648880 (LWP 6442)  0xffffe002 in ?? ()
 13 Thread 1093070256 (LWP 6443)  0xffffe002 in ?? ()
 12 Thread 1102831024 (LWP 6444)  0xffffe002 in ?? ()
 11 Thread 1111473584 (LWP 6445)  0xffffe002 in ?? ()
 10 Thread 1121267120 (LWP 6446)  0xffffe002 in ?? ()
 9 Thread 1129745840 (LWP 6447)  0xffffe002 in ?? ()
 8 Thread 1138761136 (LWP 6448)  0xffffe002 in ?? ()
 7 Thread 1147153840 (LWP 6449)  0xffffe002 in ?? ()
 6 Thread 1155567024 (LWP 6450)  0xffffe002 in ?? ()
 5 Thread 1164066224 (LWP 6451)  0xffffe002 in ?? ()
 4 Thread 1172458928 (LWP 6452)  0xffffe002 in ?? ()
 3 Thread 1183047088 (LWP 6454)  0xffffe002 in ?? ()
 2 Thread 1191439792 (LWP 6455)  0xffffe002 in ?? ()
 1 Thread 1076253312 (LWP 6440)  0xffffe002 in ?? ()

By: constfilin (constfilin) 2005-02-26 02:37:09.000-0600

(gdb) thread apply all bt

Thread 14 (Thread 1084648880 (LWP 6442)):
#0  0xffffe002 in ?? ()
#1  0x080aad7d in listener (unused=0x0) at asterisk.c:337
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 13 (Thread 1093070256 (LWP 6443)):
#0  0xffffe002 in ?? ()
#1  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 12 (Thread 1102831024 (LWP 6444)):
#0  0xffffe002 in ?? ()
#1  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 11 (Thread 1111473584 (LWP 6445)):
#0  0xffffe002 in ?? ()
#1  0x41bcd6c7 in expire_register (data=0x81067c0) at chan_sip.c:4578
#2  0x08053e89 in ast_sched_runq (con=0x81067c0) at sched.c:379
#3  0x41bda9ad in do_monitor (data=0x0) at chan_sip.c:8764
#4  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 10 (Thread 1121267120 (LWP 6446)):
#0  0xffffe002 in ?? ()
#1  0x080530f4 in ast_io_wait (ioc=0x812f290, howlong=-4) at io.c:258
#2  0x4241a093 in network_thread (ignore=0x0) at chan_iax2.c:7246
#3  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 9 (Thread 1129745840 (LWP 6447)):
#0  0xffffe002 in ?? ()
#1  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 8 (Thread 1138761136 (LWP 6448)):
#0  0xffffe002 in ?? ()
#1  0x4358f634 in pri_dchannel (vpri=0x4359ee20) at chan_zap.c:7568
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 7 (Thread 1147153840 (LWP 6449)):
#0  0xffffe002 in ?? ()
#1  0x4357fd8d in do_monitor (data=0x0) at chan_zap.c:6007
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 6 (Thread 1155567024 (LWP 6450)):
#0  0xffffe002 in ?? ()
#1  0x4460424c in scan_thread (unused=0x0) at pbx_spool.c:327
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 5 (Thread 1164066224 (LWP 6451)):
#0  0xffffe002 in ?? ()
#1  0x080530f4 in ast_io_wait (ioc=0x816a488, howlong=-4) at io.c:258
#2  0x44e12408 in network_thread (ignore=0x0) at pbx_dundi.c:2092
#3  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 4 (Thread 1172458928 (LWP 6452)):
#0  0xffffe002 in ?? ()
#1  0x44e12531 in process_precache (ign=0x0) at pbx_dundi.c:2132
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 3 (Thread 1183047088 (LWP 6454)):
#0  0xffffe002 in ?? ()
#1  0x41bd0d85 in restart_monitor () at chan_sip.c:8811
#2  0x41bddf97 in sip_reload (fd=44, argc=2, argv=0x4683d478) at chan_sip.c:10224
#3  0x0808735a in ast_cli_command (fd=44, s=0xfffffffc <Address 0xfffffffc out of bounds>) at cli.c:1254
#4  0x080ab837 in netconsole (vconsole=0x0) at asterisk.c:298
ASTERISK-1  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 2 (Thread 1191439792 (LWP 6455)):
#0  0xffffe002 in ?? ()
#1  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 1 (Thread 1076253312 (LWP 6440)):
#0  0xffffe002 in ?? ()
#1  0x080a75d1 in main (argc=-1073746696, argv=0x2) at asterisk.c:2017
#2  0x00e95678 in __libc_start_main () from /lib/tls/libc.so.6
#0  0xffffe002 in ?? ()

By: constfilin (constfilin) 2005-02-26 02:48:27.000-0600

The above gdb output was produced on chan_sip.c patched with profiling code that I described in the initial posting. The asterisk process is still running
and I can produce more debug into, if necessary.

The code aroung line 4578 in my chan_sip.c is

   4568 static int expire_register(void *data)
   4569 {
   4570         struct sip_peer *p = data;
   4571         memset(&p->addr, 0, sizeof(p->addr));
   4572         ast_db_del("SIP/Registry", p->name);
   4573         manager_event(EVENT_FLAG_SYSTEM, "PeerStatus", "Peer: SIP/%s\r\nPeerStatus: Unregistered\r\nCause: Expired\r\
   4573 n", p->name);
   4574         register_peer_exten(p, 0);
   4575         p->expire = -1;
   4576         ast_device_state_changed("SIP/%s", p->name);
   4577         if (ast_test_flag(p, SIP_SELFDESTRUCT) || ast_test_flag((&p->flags_page2), SIP_PAGE2_RTAUTOCLEAR)) {
   4578                 ASTOBJ_MARK(p);
   4579                 prune_peers();
   4580         }
   4581         return 0;
   4582 }

Evidently in ASTOBJ_MARK the code can't obtain the write lock on the object.
There is a locking deadlock somewhere.

By: Mark Spencer (markster) 2005-02-26 10:18:45.000-0600

The ASTOBJ_MARK is likely only half the story.  What is going on in thread 3, which is issuing a reload?  Can you build with "make clean ; make valgrind" as well as -DDEBUG_THREADS in the Makefile and I will need to login to the system while it is in this state.  Thanks.

By: constfilin (constfilin) 2005-02-26 13:01:02.000-0600

Mark, theard #3 is unrelevant because when the system deadlocked, I ran "sip reload" from CLI. The system was deadloacked even without it.

I pulled in the latest asterisk CVS, compiled it with -DDEBUG_THREADS and make valgrind and got the system in the deadlocked state.

I am ready to give you the access to the system. Please write to cf@intermedia.net or call (408) 2022793. If you can as bug tracker administrator, please make this note private in the bug tracker or remove
my personal information out of it.

By: constfilin (constfilin) 2005-02-26 13:04:17.000-0600

(gdb) info threads
 14 Thread 1084648880 (LWP 16457)  0xffffe002 in ?? ()
 13 Thread 1093074352 (LWP 16458)  0xffffe002 in ?? ()
 12 Thread 1102847408 (LWP 16460)  0xffffe002 in ?? ()
 11 Thread 1111514544 (LWP 16461)  0xffffe002 in ?? ()
 10 Thread 1122000304 (LWP 16462)  0xffffe002 in ?? ()
 9 Thread 1130393008 (LWP 16463)  0xffffe002 in ?? ()
 8 Thread 1138879920 (LWP 16464)  0xffffe002 in ?? ()
 7 Thread 1147919792 (LWP 16465)  0xffffe002 in ?? ()
 6 Thread 1156312496 (LWP 16466)  0xffffe002 in ?? ()
 5 Thread 1164725680 (LWP 16467)  0xffffe002 in ?? ()
 4 Thread 1173237168 (LWP 16468)  0xffffe002 in ?? ()
 3 Thread 1181629872 (LWP 16469)  0xffffe002 in ?? ()
 2 Thread 1192324528 (LWP 16470)  0xffffe002 in ?? ()
 1 Thread 1076253312 (LWP 16455)  0xffffe002 in ?? ()

By: constfilin (constfilin) 2005-02-26 13:05:56.000-0600

(gdb) thread apply all bt

Thread 14 (Thread 1084648880 (LWP 16457)):
#0  0xffffe002 in ?? ()
#1  0x080aa8b7 in listener (unused=0x0) at asterisk.c:337
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 13 (Thread 1093074352 (LWP 16458)):
#0  0xffffe002 in ?? ()
#1  0x40a6a45e in do_monitor (data=0x0) at chan_modem.c:641
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 12 (Thread 1102847408 (LWP 16460)):
#0  0xffffe002 in ?? ()
#1  0x413bc34e in do_parking_thread (ignore=0x0) at res_features.c:1244
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 11 (Thread 1111514544 (LWP 16461)):
#0  0xffffe002 in ?? ()
#1  0x41bfadcb in __ast_pthread_mutex_lock (filename=0x41bfb21a "chan_sip.c", lineno=4578,
   func=0x41bfda0b "expire_register", mutex_name=0x41bfba47 "&(p)->_lock", t=0x81c261c) at lock.h:218
#2  0x41be1354 in expire_register (data=0x81c25c0) at chan_sip.c:4578
#3  0x080545f8 in ast_sched_runq (con=0x81382c0) at sched.c:379
#4  0x41bf281d in do_monitor (data=0x0) at chan_sip.c:8674
ASTERISK-1  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 10 (Thread 1122000304 (LWP 16462)):
#0  0xffffe002 in ?? ()
#1  0x080aa693 in netconsole (vconsole=0x810cbe0) at asterisk.c:286
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 9 (Thread 1130393008 (LWP 16463)):
#0  0xffffe002 in ?? ()
#1  0x08053889 in ast_io_wait (ioc=0x8135eb0, howlong=1000) at io.c:254
#2  0x42431c2b in network_thread (ignore=0x0) at chan_iax2.c:7246
#3  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 8 (Thread 1138879920 (LWP 16464)):
#0  0xffffe002 in ?? ()
#1  0x4361a31b in do_monitor (data=0x0) at chan_phone.c:967
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 7 (Thread 1147919792 (LWP 16465)):
#0  0xffffe002 in ?? ()
#1  0x43e41f18 in pri_dchannel (vpri=0x43e5a160) at chan_zap.c:7568
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 6 (Thread 1156312496 (LWP 16466)):
#0  0xffffe002 in ?? ()
#1  0x43e3c9ae in do_monitor (data=0x0) at chan_zap.c:6007
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 5 (Thread 1164725680 (LWP 16467)):
#0  0xffffe002 in ?? ()
#1  0x44ec13a1 in scan_thread (unused=0x0) at pbx_spool.c:327
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 4 (Thread 1173237168 (LWP 16468)):
#0  0xffffe002 in ?? ()
#1  0x08053889 in ast_io_wait (ioc=0x8173698, howlong=1000) at io.c:254
#2  0x456cf6db in network_thread (ignore=0x0) at pbx_dundi.c:2092
#3  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 3 (Thread 1181629872 (LWP 16469)):
#0  0xffffe002 in ?? ()
#1  0x456cf90b in process_precache (ign=0x0) at pbx_dundi.c:2132
#2  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 2 (Thread 1192324528 (LWP 16470)):
#0  0xffffe002 in ?? ()
#1  0x4002c355 in start_thread () from /lib/tls/libpthread.so.0

Thread 1 (Thread 1076253312 (LWP 16455)):
#0  0xffffe002 in ?? ()
#1  0x080af257 in main (argc=1, argv=0xbfffe6e4) at asterisk.c:2017
#2  0x00e95678 in __libc_start_main () from /lib/tls/libc.so.6
#0  0xffffe002 in ?? ()

By: Mark Spencer (markster) 2005-02-26 13:21:41.000-0600

This is not, as it turns out, a deadlock, but is occuring because expire is being called on an object which has already been freed.  I have asked him to try turning off the rtcache to see if that makes a difference.

By: constfilin (constfilin) 2005-02-26 13:30:36.000-0600

Mark, I turned RT caching of SIP peers off and asterisk is running for several minutes without deadlocks.

I need caching of realtime peers because without it do_monitor does not send message waiting indicators to the peers (or course, the peerl is empty).

Thanks

-c

By: constfilin (constfilin) 2005-02-26 13:43:59.000-0600

Without RT SIP peers caching asterisk has got so much behind on querying the database and replying on SIP requests that the telephones started re-sending their registrations again and again. This way, 10 minutes after fresh start of asterisk the queue of incoming packets to UDP port 5060 grew to the maximum
of about 64K. Check this out:

[root@ast-ny /usr/src/asterisk]# netstat -nap | grep 5060
udp    63840      0 0.0.0.0:5060            0.0.0.0:*        17898/asterisk

I see asterisk getting the SIP packets (meaning that there is no deadlocking problem anymore) but RT SIP peers caching needs to be turned on.

By: Mark Spencer (markster) 2005-02-27 11:02:33.000-0600

Okay so this is clearly a bug in the SIP caching mechanism then.  Apparently things are being freed without the expire being cancelled.  I'll try to get anthm to take a look since that part is his code :)

By: Kevin P. Fleming (kpfleming) 2005-02-27 15:46:06.000-0600

This is tricky... As I've worked on the code in ASTERISK-3323355, I've been noticing that there are cases in chan_sip where this can happen.

I did at one time convert all the timers to actually ref/unref the objects they are related to, but the code was very ugly and I'm not sure it solved any problems, so I pulled it back out.

As it stands today, chan_sip's code _MUST_ ensure that all timers are removed before an object is freed, because the timers do not hold refs on the object to keep it from being freed. This also means that all timer functions must take extra precautions to ensure that the object they are working with will stay alive while they are running; this could be done by taking a ref when the timer function starts, or taking the object's internal lock. Neither of these are perfect solutions, because the timers can fire while sip_destroy_peer is running.

This needs to be discussed during the developer's conference call this week if at all possible; the solution to this problem will likely have to be well understood by everyone involved, documented, and then replicated in other modules that have similar needs.

By: Mark Spencer (markster) 2005-03-03 14:41:00.000-0600

Can we run this under valgrind on a test system to try to identify if this is a race or if this is something more systemic?

By: Olle Johansson (oej) 2005-03-17 08:43:30.000-0600

kpfleming/anthm: Any updates on this?

By: Kevin P. Fleming (kpfleming) 2005-03-17 09:16:58.000-0600

I'll bring it up during the conference call today to see if we can get moving on it.

By: Kevin P. Fleming (kpfleming) 2005-03-17 17:55:37.000-0600

Nothing from my end... I haven't had even 30 minutes to look at this stuff since I got back from VON yet, but I will have some time soon.

By: Anthony Minessale (anthm) 2005-03-22 16:05:01.000-0600

Try this patch it looks like the sched didnt cancel pending events it just assigned a new one right away.

also see iax verison attached as well.

By: Kevin P. Fleming (kpfleming) 2005-03-22 16:20:27.000-0600

Yes, that does appear to be a valid problem you found there... Nice job :-)

By: Mark Spencer (markster) 2005-03-23 23:08:38.000-0600

Fixed in CVS head.  Feel free to reopen if you still have trouble, but I think anthm nailed it.

By: Digium Subversion (svnbot) 2008-01-15 15:28:32.000-0600

Repository: asterisk
Revision: 5249

U   trunk/channels/chan_iax2.c
U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r5249 | markster | 2008-01-15 15:28:32 -0600 (Tue, 15 Jan 2008) | 2 lines

Fix SIP/IAX caching (bug ASTERISK-3591)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=5249