[Home]

Summary:ASTERISK-11417: tls transport often causes asterisk to lock
Reporter:pj (pj)Labels:
Date Opened:2008-02-11 11:04:27.000-0600Date Closed:2008-07-21 12:01:05
Priority:BlockerRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) asterisk-lock8.txt
( 1) asterisk-locks.txt
( 2) asterisk-locks2.txt
( 3) asterisk-locks3.txt
( 4) asterisk-locks4.txt
( 5) asterisk-locks5.txt
( 6) asterisk-locks6.txt
( 7) asterisk-locks7.txt
( 8) bug11972-1.diff.txt
( 9) bug11972-3.diff.txt
(10) bug11972-4.diff.txt
(11) bug11972-5.diff.txt
(12) bug11972-6.diff.txt
(13) bug11972-7.diff.txt
(14) bug11972-8.diff.txt
(15) console-log.txt
(16) coredebug.txt
(17) coredebug2.txt
(18) coredebug3.txt
(19) coredebug4.txt
(20) coredebug5.txt
(21) gdb.txt
(22) gdb2.txt
(23) gdb3.txt
(24) gdb4.txt
(25) gdb5.txt
(26) gdb6.txt
(27) gdb7.txt
(28) gdb8.txt
(29) gdb9.txt
(30) sip-channels.txt
Description:when asterisk is locked, no call or registration is processed,
this happened several times today and yesterday (tried several trunk revisions)
currently haven't found thing, what causes the locking situation,
it seems, that locking problem start appearing, after more peers start using tls transport
my tls only peers are: asterisk trunk, asterisk beta2 and eyebeam sofphone
+several udp clients.


Comments:By: pj (pj) 2008-02-11 14:28:37.000-0600

another deadlock, uploaded asterisk-locks2.txt and sip-channels.txt (list of frozen sip channels).

By: pj (pj) 2008-02-12 01:07:24.000-0600

asterisk crashes today when issuing 'core show locks',
generated core is probably useless :-\

(gdb) bt full
#0  0xb7ce8d8b in strlen () from /lib/i686/libc.so.6
No symbol table info available.
#1  0xb7cba030 in vfprintf () from /lib/i686/libc.so.6
No symbol table info available.
#2  0x081a2e08 in ?? ()
No symbol table info available.
#3  0x0000001a in ?? ()
No symbol table info available.
#4  0x00000000 in ?? ()
No symbol table info available.

By: James Golovich (jamesgolovich) 2008-02-12 02:04:09.000-0600

I've been trying to reproduce this locally and haven't been able to yet.  

From your first output it looks like you don't have asterisk compiled with DETECT_DEADLOCKS since dialoglist_lock is hiding where its locked from, but it looks to me like somehow sip_pvt_lock is called in find_call without dialoglist_lock being locked.

The second output looks to me like a TCP/TLS session went away during a read so the p->socket.lock never got unlocked.  I'm going to attach a patch that should prevent this from happening.  Can you try the patch out and see if it pops up again?

By: pj (pj) 2008-02-12 03:02:20.000-0600

I have applied your patch and recompiled with don't optimize, debug threads and detect deadlock. I will report, if another deadlock or crash occur. thanks

By: pj (pj) 2008-02-12 05:38:21.000-0600

sad to say, crash again, with folowing last message on console:

[Feb 12 12:38:01] ERROR[22240]: /root/ast14/asterisk-r103313/include/asterisk/lock.h:345 __ast_pthread_mutex_lock: chan_sip.c line 2516 (__sip_xmit): Deadlock? waited 5 sec for mutex 'p->socket.lock'?


(gdb) bt full
#0  0xb7d25d8b in strlen () from /lib/i686/libc.so.6
No symbol table info available.
#1  0xb7cf7030 in vfprintf () from /lib/i686/libc.so.6
No symbol table info available.
#2  0xb7a67d6c in SSL_accept () from /usr/lib/asterisk/modules/chan_sip.so
No symbol table info available.
#3  0x00000000 in ?? ()
No symbol table info available.

By: pj (pj) 2008-02-13 10:56:16.000-0600

another crash with same error (unexpected crash, when no call was processed):

[Feb 13 17:53:12] ERROR[7790]: /root/src/asterisk-r103387/include/asterisk/lock.h:345 __ast_pthread_mutex_lock: chan_sip.c line 2516 (__sip_xmit): Deadlock? waited 5 sec for mutex 'p->socket.lock'?

any way, how to get better debug output? 'bt full' from generated core is probably useless :-\  'bt full' output from crashdump I posted yesterday, same is today.

By: James Golovich (jamesgolovich) 2008-02-13 11:44:49.000-0600

I'm working on this issue.  I figured out a way to reproduce the issue (or at least an issue), by generating tons of calls from one asterisk to another in a short amount of time eventually it will deadlock and sometimes crash.

The crashes that have openssl code in the backtrace are likely happening because 2 threads are acting on one ssl session concurrently.  It seems in some cases the sip_socket is setup without the lock initialized.

By: pj (pj) 2008-02-14 03:19:41.000-0600

I'm sending asterisk-locks3.txt and console-log.txt, that I grabed, today when asterisk locks unexpectively (at this time, no calls was processed).

By: James Golovich (jamesgolovich) 2008-02-15 04:06:54.000-0600

Try out bug11972-3.diff.txt, I suspect that will resolve your deadlock and crashing issues.  It's not all the way done, I know its leaking a bit of memory and there is probably a better way to deal with the cleanup when a connection goes away.  I just wanted to get something out there, and I'll do some more work on it after I get some sleep.

By: pj (pj) 2008-02-15 04:46:28.000-0600

thanks for patch, I'm recompiling right now...

another weird message, that appears today (before your last patch applied):

[Feb 15 11:41:40] ERROR[8570]: /root/src/asterisk-r103387/include/asterisk/lock.h:386 __ast_pthread_mutex_lock: chan_sip.c line 2516 (__sip_xmit): Error obtaining mutex: Invalid argument
[Feb 15 11:41:41] ERROR[8570]: /root/src/asterisk-r103387/include/asterisk/lock.h:469 __ast_pthread_mutex_unlock: chan_sip.c line 2528 (__sip_xmit): mutex 'p->socket.lock' freed more times than we've locked!
[Feb 15 11:41:41] ERROR[8570]: /root/src/asterisk-r103387/include/asterisk/lock.h:486 __ast_pthread_mutex_unlock: chan_sip.c line 2528 (__sip_xmit): Error releasing mutex: Invalid argument
[Feb 15 11:41:41] WARNING[8570]: chan_sip.c:2540 __sip_xmit: sip_xmit of 0xb7972168 (len 587) to 85.216.201.137:49486 returned 0: Success
[Feb 15 11:41:49] NOTICE[8570]: chan_sip.c:18601 sip_poke_noanswer: Peer '726' is now UNREACHABLE!  Last qualify: 15
[Feb 15 11:41:57]     -- Registered SIP '726' at 85.216.201.137 port 50146 expires 200
[Feb 15 11:41:57] NOTICE[8570]: chan_sip.c:14860 handle_response_peerpoke: Peer '726' is now Reachable. (18ms / 4000ms)

By: pj (pj) 2008-02-16 01:55:38.000-0600

crashed again, gdb.txt attached, maybe asterisk doesn't handle correctly, when tls asterisk peer (client) unregisters (but isn't easy reproducible).
last message before crash was...

[Feb 15 23:59:26] WARNING[1378]: chan_sip.c:2532 __sip_xmit: sip_xmit of 0xb798116c (len 587) to 85.216.201.137:41284 returned -1: Success
[Feb 15 23:59:34] NOTICE[1378]: chan_sip.c:18600 sip_poke_noanswer: Peer '726' is now UNREACHABLE!  Last qualify: 18

By: pj (pj) 2008-02-16 05:46:34.000-0600

another crash, gdb2.txt, after...

[Feb 16 11:47:36] WARNING[2899]: chan_sip.c:2532 __sip_xmit: sip_xmit of 0xb797c16c (len 587) to 85.216.201.137:34365 returned -1: Success
[Feb 16 11:47:36] WARNING[2899]: chan_sip.c:4207 __sip_destroy: Trying to destroy "(null)", not found in dialog list?!?!

By: James Golovich (jamesgolovich) 2008-02-17 23:46:55.000-0600

I've uploaded bug11972-4.diff.txt, which is identical to the -3 patch except it removes the code that makes do_monitor remove the dialogs.  Which should definitely lead to some leaked memory but I'd like to make sure its not this code causing your crashes.  

Your gdb2.txt looks like its related to this, but gdb.txt and gdb3.txt look to be related to session timers.  I'll have to dig into session timers more to see how those are handled or find someone else with more knowledge about it to look at those 2 dumps.

By: pj (pj) 2008-02-18 05:25:45.000-0600

now I'm running asterisk with your bug11972-4.diff.txt patch.
seems, that all my previous crashes was related to situation, when tls peer becomes unreachable (asterisk as tls client).
I have all session timers stuff in default values (no ses timer options in my sip.conf).



By: James Golovich (jamesgolovich) 2008-02-18 11:28:10.000-0600

On further checking it looks like the other 2 traces could be related because it was forcing the dialog to be destroyed early (when the TCP/TLS session went away).  

I added a bit of code in __sip_destroy to set p->stimer = NULL after freeing it, so even if somehow code is checking if (p->stimer) it wont segfault.

Also after sending around 3600 calls through 2 asterisks and occasionaly removing network connectivity I was able to produce another crash.

I'm uploading another patch to try out.

By: pj (pj) 2008-02-19 01:32:06.000-0600

James, it seems, that your patch bug11972-5.diff.txt solved my crashes, especially, when tls peer/client becomes unreacheble.
today, I got same messages as before on console, but asterisk doesn't crash. I will continue testing...
Shouldn't be this warnings moved to debug log level?

[Feb 19 00:02:39] NOTICE[9887]: chan_sip.c:18626 sip_poke_noanswer: Peer '726' is now UNREACHABLE!  Last qualify: 14
[Feb 19 00:02:49] WARNING[9887]: chan_sip.c:2531 __sip_xmit: sip_xmit of 0xb799716c (len 587) to 85.216.201.137:53612 returned -1: Success
[Feb 19 00:03:07] WARNING[9887]: chan_sip.c:2531 __sip_xmit: sip_xmit of 0xb799716c (len 587) to 85.216.201.137:53612 returned -1: Success
[Feb 19 00:03:25] WARNING[9887]: chan_sip.c:2531 __sip_xmit: sip_xmit of 0xb799716c (len 587) to 85.216.201.137:53612 returned -1: Success
[Feb 19 00:03:43] WARNING[9887]: chan_sip.c:2531 __sip_xmit: sip_xmit of 0xb799716c (len 587) to 85.216.201.137:53612 returned -1: Success
[Feb 19 00:04:01] WARNING[9887]: chan_sip.c:2531 __sip_xmit: sip_xmit of 0xb799716c (len 587) to 85.216.201.137:53612 returned -1: Success

By: James Golovich (jamesgolovich) 2008-02-19 10:55:18.000-0600

I don't think the log message should be changed to DEBUG, I think it should stay where its at with level WARNING.  One thing I do think I need to do is change the code to set errno to something like EHOSTUNREACH.  I'm going to upload a new patch that changes this but since its simply cosmetic I don't see a reason for you to try this one out instead of what your using now.

By: pj (pj) 2008-02-19 12:15:51.000-0600

do you think, that this warning messages are really important to administrators? should admins do something, if this warnings messages appears on console? I think, that it can rather confuse asterisk administrators.

By: pj (pj) 2008-02-19 14:55:10.000-0600

again partially locked, not crash, uploading asterisk-locks4.txt with show locks, frozen sip channels and console messages...

By: James Golovich (jamesgolovich) 2008-02-19 15:23:47.000-0600

Does this only happen when you 'module unload chan_sip.so'?  Thats what it looks like is happening from the bt

By: pj (pj) 2008-02-20 01:50:38.000-0600

definitively I do not do 'module unload chan_sip.so' in any cases,
my last issue is, that asterisk unexpectivelly locks, even when no calls was processed in that time (after this happen, cli was frozen, can't do 'restart' from cli, etc...), only one tls peer unregisters right before asterisk locks, all debug info I posted in asterisk-locks4.txt

By: James Golovich (jamesgolovich) 2008-02-20 02:10:02.000-0600

I just uploaded another patch, that doesn't address your last issue but addresses a segfault I got when trying to find your issue.  The problem with this was that on an unload of the module the dlclose was happening too quickly, so I put in a usleep to wait after dropping each active tcp session.

Now about your issue, can you catch full debug info and upload it here?

By: pj (pj) 2008-02-20 02:52:28.000-0600

OK, what do you mean with 'catch full debug info '?
next, I can try to catch 'bt full' from gdb attaching to frozen asterisk process, will be this sufficient?

By: James Golovich (jamesgolovich) 2008-02-20 03:20:22.000-0600

I mean to make sure you are either starting asterisk with -ddd or 'core set debug 3' and also making sure the debug log level is going somewhere in logger.conf

By: pj (pj) 2008-02-22 02:17:21.000-0600

Hi James, positive message, almost two days, without crash or locks!
but sometimes, calls fail to establish, with messages below on cli, coredebug.txt attached, next call is mostly successfull. this happen when tried to calling peer over tls (to asterisk client).

[Feb 22 09:06:10]     -- Called bill-gw/355
[Feb 22 09:06:10] ERROR[27296]: chan_sip.c:18009 handle_request_do: We could NOT get the channel lock for SIP/bill-gw-08252878!
[Feb 22 09:06:10] ERROR[27296]: chan_sip.c:18010 handle_request_do: SIP transaction failed: 717f1cbc7f0207b956ce125a740fd529@192.168.38.20

By: James Golovich (jamesgolovich) 2008-02-23 22:29:40.000-0600

Unfortunately I still think that looks like a deadlock, but the code has some deadlock avoidance builtin so it won't behave like a normal deadlock.  Are you able to do a 'core show locks' when this happens?  I'd like to see where p->owner->lock_dont_use was locked and what else was held that was keeping it locked

By: pj (pj) 2008-02-24 01:33:56.000-0600

I last case, I don't saw any "locked here..." entries in 'core show locks', only some 'tried and failed to get lock' entries.

By: pj (pj) 2008-02-24 15:27:59.000-0600

another crash, uploading gdb4.txt, unfortunatelly, I forgot to turn on core debug, after I last restarted asterisk
SVN-trunk-r104045 with bug11972-7.diff.txt

By: James Golovich (jamesgolovich) 2008-02-24 16:02:40.000-0600

Are you able to try switching all these clients over to UDP to see if this still happens?  The last crash was related to session timers which were introduced about the same time as TCP/TLS support.  Maybe someone who knows more about how session timers are working can look into it.  I'll see if I can dig through the code if I get a chance though

By: pj (pj) 2008-02-25 08:29:53.000-0600

uploading asterisk-locks5.txt that shows where is asterisk locked and coredebug2.txt (two minutes around time, when lock was catched), failed call (when asterisk was locked) was at 14:59:09

By: James Golovich (jamesgolovich) 2008-02-25 12:57:03.000-0600

I don't see any mention of a lock in your coredebug2.txt

By: pj (pj) 2008-02-26 01:54:52.000-0600

what should I find in coredebug to supply usefull informations? log is very big, so I can't upload all file. coredebug2.txt was catched around time, when some call was not established successfully (looking to cli), so I issue 'core show locks' and saw some locking state and posted coredebug2 from this time, maybe locks happened before I catch this. but I have log stored, so if you tell me, what should I find, I can post it.
Right now, I'm uploading coredebug3.txt, where you can also see some weird states.
(it was not locked today, 'core show locks' shows only tried and failed entries)

By: pj (pj) 2008-02-26 02:01:44.000-0600

it's probably unrelated to issues with tls, but when looking to coredebug3.txt:

[Feb 25 20:46:01] DEBUG[23606] devicestate.c: Notification of state change to be queued on device/channel SIP/prec
[Feb 25 20:46:01] DEBUG[23606] chan_sip.c: Checking device state for peer prec

I haven't any peer (or user) called 'prec' have only 'prec-gw'

By: pj (pj) 2008-02-27 10:33:41.000-0600

uploading asterisk-locks6.txt and coredebug4.txt,
seems, that today something wrong happened from about Feb 27 17:10:39 ...

By: Russell Bryant (russell) 2008-02-28 13:07:36.000-0600

asterisk-locks 4 and 6 seem to show the same lock up.  I would like to see what the thread that executed __sip_xmit() and has the socket lock held is doing.  It's probably hanging on SSL_write() ...

Can you get a core dump of the process when it is locked up like this and get a thread apply all bt?  Also, make sure you comile without optimizations.

By: pj (pj) 2008-02-28 13:18:29.000-0600

uploading another asterisk-locks7.txt and coredebug5.txt, I have:
cat .asterisk.makeopts
MENUSELECT_CFLAGS=DONT_OPTIMIZE DEBUG_THREADS LOADABLE_MODULES DETECT_DEADLOCKS
please tell me, how should I do 'thread apply all bt', gdb attaching to locked asterisk process? should I still have coredebug enabled? contains it some usefull info?

By: Russell Bryant (russell) 2008-02-28 22:25:42.000-0600

Yeah, gdb attached to the locked process will work.  Alternatively, you can run the ast_grab_core script in contrib/scripts, which will grab a core dump from the locked process.  So, you can immediately restart Asterisk and then go use gdb on the core dump.

By: pj (pj) 2008-02-29 15:17:22.000-0600

uploading asterisk-lock8.txt and gdb5.txt,
I found some lock today, but was not locked totally as before, calls was possible, when this lock happened today.

By: Russell Bryant (russell) 2008-03-04 09:09:39.000-0600

I just uploaded a patch.  Give it a try and see if it improves the situation.

The patch changes the socket mutex to a rwlock, instead.  For a TCP/TLS socket, we know there will only be one thread reading from the socket at a time.  But, we have to be careful to ensure that only one thread will write to it at a time.  That's what the lock was for.

By using a rwlock, the socket read thread will only use a read lock.  Then, other threads writing to it will use a write-lock.  Then, a thread trying to send something won't be blocked by the socket reading thread while it's sleeping waiting for data to arrive on the socket.

That's what I think the problem was from the debug output I've looked through on here ...

By: pj (pj) 2008-03-04 09:40:30.000-0600

OK, I will try your latest patch,
anyway, seems that also help commit to trunk r105410, repairing something in autoservice. I have two days uptime, without any locks or even 'tried and failed' entries in 'core show lock'
currently running SVN-trunk-r105509 + bug11972-7.diff.txt

By: Russell Bryant (russell) 2008-03-04 10:25:36.000-0600

I guess this patch won't apply against James' changes.  I'll go back and look at what he has done and then get my changes merged in with those if it seems appropriate.

By: pj (pj) 2008-03-04 11:18:08.000-0600

crashed right now:( uploading gdb6.txt

last msg on console before crash:

[Mar  4 18:04:25] ERROR[19572]: /root/src/asterisk-r105553/include/asterisk/lock.h:278 __ast_pthread_mutex_destroy: chan_sip.c line 4319 (__sip_destroy): Error: attempt to destroy locked mutex '&p->pvt_lock'.
[Mar  4 18:04:25] ERROR[19572]: /root/src/asterisk-r105553/include/asterisk/lock.h:281 __ast_pthread_mutex_destroy: chan_sip.c line 5853 (find_call): Error: '&p->pvt_lock' was locked here.
[Mar  4 18:04:25] ERROR[19572]: /root/src/asterisk-r105553/include/asterisk/lock.h:288 __ast_pthread_mutex_destroy: chan_sip.c line 4319 (__sip_destroy): Error destroying mutex &p->pvt_lock: Device or resource busy



By: pj (pj) 2008-03-04 13:41:17.000-0600

uff, another crash!


=== -------------------------------------------------------------------
===
=== Thread ID: -1219318896 (ast_make_file_from_fd started at [  154] tcptls.c server_root())
=== ---> Lock #0 (chan_sip.c): MUTEX 2510 __sip_xmit p->socket.ser->lock 0x825f670 (1)
=== -------------------------------------------------------------------


[Mar  4 20:39:36] ERROR[20982]: /root/src/asterisk-r105553/include/asterisk/lock.h:278 __ast_pthread_mutex_destroy: chan_sip.c line 4319 (__sip_destroy): Error: attempt to destroy locked mutex '&p->pvt_lock'.
[Mar  4 20:39:36] ERROR[20982]: /root/src/asterisk-r105553/include/asterisk/lock.h:281 __ast_pthread_mutex_destroy: chan_sip.c line 5853 (find_call): Error: '&p->pvt_lock' was locked here.
[Mar  4 20:39:36] ERROR[20982]: /root/src/asterisk-r105553/include/asterisk/lock.h:288 __ast_pthread_mutex_destroy: chan_sip.c line 4319 (__sip_destroy): Error destroying mutex &p->pvt_lock: Device or resource busy
[Mar  4 20:40:50] ERROR[20982]: /root/src/asterisk-r105553/include/asterisk/lock.h:469 __ast_pthread_mutex_unlock: chan_sip.c line 18038 (handle_request_do): mutex '&p->pvt_lock' freed more times than we've locked!
[Mar  4 20:40:50] ERROR[20982]: /root/src/asterisk-r105553/include/asterisk/lock.h:486 __ast_pthread_mutex_unlock: chan_sip.c line 18038 (handle_request_do): Error releasing mutex: Operation not permitted
[Mar  4 20:41:22] ERROR[20982]: /root/src/asterisk-r105553/include/asterisk/lock.h:278 __ast_pthread_mutex_destroy: chan_sip.c line 4319 (__sip_destroy): Error: attempt to destroy locked mutex '&p->pvt_lock'.
[Mar  4 20:41:22] ERROR[20982]: /root/src/asterisk-r105553/include/asterisk/lock.h:281 __ast_pthread_mutex_destroy: chan_sip.c line 5853 (find_call): Error: '&p->pvt_lock' was locked here.
[Mar  4 20:41:22] ERROR[20982]: /root/src/asterisk-r105553/include/asterisk/lock.h:288 __ast_pthread_mutex_destroy: chan_sip.c line 4319 (__sip_destroy): Error destroying mutex &p->pvt_lock: Device or resource busy
[Mar  4 20:41:52] ERROR[20982]: /root/src/asterisk-r105553/include/asterisk/lock.h:469 __ast_pthread_mutex_unlock: chan_sip.c line 18038 (handle_request_do): mutex '&p->pvt_lock' freed more times than we've locked!





(gdb) bt
#0  0xb7988e6c in __sip_destroy (p=0x82c8c70, lockowner=1, lockdialoglist=1) at chan_sip.c:4275
#1  0xb7989c9c in sip_destroy (p=0x82c8c70) at chan_sip.c:4450
#2  0xb79820d9 in __sip_autodestruct (data=0x82c8c70) at chan_sip.c:2877
#3  0x08131d7a in ast_sched_runq (con=0x81f3610) at sched.c:365
#4  0xb79ccf34 in do_monitor (data=0x0) at chan_sip.c:18352
ASTERISK-1  0x08146219 in dummy_start (data=0x81fdab8) at utils.c:870
ASTERISK-2  0xb7c194e2 in start_thread () from /lib/i686/libpthread.so.0
ASTERISK-3  0xb7d0146e in clone () from /lib/i686/libc.so.6

By: Digium Subversion (svnbot) 2008-03-04 14:26:15.000-0600

Repository: asterisk
Revision: 105731

A   team/group/issue_11972/

------------------------------------------------------------------------
r105731 | russell | 2008-03-04 14:26:11 -0600 (Tue, 04 Mar 2008) | 2 lines

Create a branch for the patches related to issue ASTERISK-11417

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105731

By: James Golovich (jamesgolovich) 2008-03-04 14:29:07.000-0600

I just uploaded a new patch (bug11972-8.diff.txt that integrates Russell's changes with my previous ones.  Sorry I haven't been on this lately, things got busy.  I'll see if I can get back into it and figure anything out

By: Digium Subversion (svnbot) 2008-03-04 14:32:58.000-0600

Repository: asterisk
Revision: 105734

U   trunk/channels/chan_sip.c

------------------------------------------------------------------------
r105734 | russell | 2008-03-04 14:32:31 -0600 (Tue, 04 Mar 2008) | 6 lines

Fix some bugs in the SIP tcp helper thread.
- fix a spot where a lock wouldn't get unlocked in an error condition
- call ast_mutex_destroy() on the lock before freeing its memory

(related to issue ASTERISK-11417)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105734

By: Digium Subversion (svnbot) 2008-03-04 14:51:00.000-0600

Repository: asterisk
Revision: 105737

U   team/group/issue_11972/channels/chan_sip.c

------------------------------------------------------------------------
r105737 | russell | 2008-03-04 14:50:59 -0600 (Tue, 04 Mar 2008) | 12 lines

This patch changes the socket mutex to a rwlock, instead. For a TCP/TLS socket,
we know there will only be one thread reading from the socket at a time. But,
we have to be careful to ensure that only one thread will write to it at a time.
That's what the lock was for.

By using a rwlock, the socket read thread will only use a read lock.
Then, other threads writing to it will use a write-lock. Then, a thread trying
to send something won't be blocked by the socket reading thread while it's
sleeping and waiting for data to arrive on the socket.

(related to issue ASTERISK-11417, patch by me)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105737

By: Digium Subversion (svnbot) 2008-03-04 15:08:00.000-0600

Repository: asterisk
Revision: 105738

U   team/group/issue_11972/channels/chan_sip.c

------------------------------------------------------------------------
r105738 | russell | 2008-03-04 15:07:59 -0600 (Tue, 04 Mar 2008) | 3 lines

revert rev 105737 for now ... need to re-think this ...
(related to issue ASTERISK-11417)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105738

By: Russell Bryant (russell) 2008-03-04 15:21:09.000-0600

deleted my patch, as it doesn't actually make any sense ...

By: Russell Bryant (russell) 2008-03-04 15:31:24.000-0600

James, thanks for taking a look and for updating your patch.  As you probably saw in a previous comment, I ended up throwing them away because they weren't actually useful.  I was thinking about the r/w lock backwards.

I'm currently trying to figure out what you have done in your patch and why.  I'll post some more detailed comments/questions here in a few minutes ...

By: Russell Bryant (russell) 2008-03-04 15:53:13.000-0600

So, the biggest thing I have noticed in your patch is that the server_instance objects aren't getting free'd anywhere.  It's also not clear where that should go, as there are pointers to it being stored in various places ... I'm thinking that some reference counting may be in order here.  I'll look into it.

Also, what was the inspiration for moving the lock to inside the server instance instead of the sip_socket obj?

Basically, any information that you can provide that explains your changes would be really helpful.

By: Russell Bryant (russell) 2008-03-04 15:55:31.000-0600

Actually, I can see why moving the lock into the server_instance would be useful.  I'll go ahead and merge that part in.

By: Digium Subversion (svnbot) 2008-03-04 16:59:26.000-0600

Repository: asterisk
Revision: 105839

U   team/group/issue_11972/channels/chan_sip.c
U   team/group/issue_11972/include/asterisk/tcptls.h
U   team/group/issue_11972/main/http.c
U   team/group/issue_11972/main/manager.c
U   team/group/issue_11972/main/tcptls.c

------------------------------------------------------------------------
r105839 | russell | 2008-03-04 16:59:24 -0600 (Tue, 04 Mar 2008) | 4 lines

First pass at making the server instance object reference counted to fix issues
in chan_sip
(related to issue ASTERISK-11417)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105839

By: Russell Bryant (russell) 2008-03-04 17:01:54.000-0600

pj, can you start testing from this branch, please?

svn co http://svn.digium.com/svn/asterisk/team/group/issue_11972 asterisk-issue_11972

By: James Golovich (jamesgolovich) 2008-03-04 17:32:04.000-0600

Theres a few other changes in here that were necessary to stop deadlocks.  One was to only lock netlock for udp channels and the other was to change the local in the monitor thread from a lock to a trylock.

Also I added a little sleeping when unloading the module to give it enough time for any threads to die.  I was getting a strange crash when unloading if the threads didn't die quick enough.

There were some other crashes happening when the SSL fd would go away but server_write was still being called, so thats why I added the check for ser->fd and set errno and return -1 there.

I'm compiling your branch right now and gonna hammer at it to see if I can reproduce any crash or deadlocks

By: Russell Bryant (russell) 2008-03-04 17:44:30.000-0600

> Theres a few other changes in here that were necessary to stop deadlocks.
> One was to only lock netlock for udp channels and the other was to change
> the local in the monitor thread from a lock to a trylock.

I didn't merge in those netlock changes yet, because I was a bit worried about it.  Basically, my concern is that the netlock currently ensures that only one SIP request is being processed at a time.  Changing that could cause a lot of bugs to come out due to that being an assumption in the code everywhere.  I'd be interested in seeing the deadlock that this avoided, though.  The same goes for the trylock change.  I'd like to see exactly what the deadlock is that it avoids before we put that in.

> Also I added a little sleeping when unloading the module to give it enough
> time for any threads to die. I was getting a strange crash when unloading
> if the threads didn't die quick enough.

Hm, I don't understand what the problem was here, either.  The pthread_join() above your usleep() waits until the thread exist before continuing.

> There were some other crashes happening when the SSL fd would go away but
> server_write was still being called, so thats why I added the check for
> ser->fd and set errno and return -1 there.

Ok, I'm fine with this santiy check.  I'd like to see an ERROR message in there, too, though, since really, this should never happen, and we should try to track down how it does happen.  I'll add this change into the branch.

> I'm compiling your branch right now and gonna hammer at it to see if I
> can reproduce any crash or deadlocks

Sounds great.  If you can reproduce anything, let me know here.  Thanks for the help!

By: Digium Subversion (svnbot) 2008-03-04 17:46:23.000-0600

Repository: asterisk
Revision: 105896

U   team/group/issue_11972/main/tcptls.c

------------------------------------------------------------------------
r105896 | russell | 2008-03-04 17:46:17 -0600 (Tue, 04 Mar 2008) | 3 lines

Add some fd sanity checking to server read/write
(from jamesgolovich's patch on issue ASTERISK-11417)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105896

By: James Golovich (jamesgolovich) 2008-03-04 17:57:12.000-0600

I sent about 100 calls through asterisk then killed the originating asterisk box, and the receiving asterisk box started printing these after a while.

[Mar  4 15:57:39] ERROR[27740]: astobj2.c:115 INTERNAL_OBJ: bad magic number 0x0 for 0x8329eb8
WARNING: Freeing unused memory at 0x86a6580, in ao2_ref of astobj2.c, line 198
[Mar  4 15:58:23] ERROR[27740]: astobj2.c:183 ao2_ref: refcount -1 on object 0x86a6668

*CLI> WARNING: Freeing unused memory at 0x86a6580, in ao2_ref of astobj2.c, line 198
[Mar  4 15:58:51] ERROR[27740]: astobj2.c:183 ao2_ref: refcount -1 on object 0x86a6668
WARNING: Freeing unused memory at 0x86a6580, in ao2_ref of astobj2.c, line 198
WARNING: Freeing unused memory at 0x86a6580, in ao2_ref of astobj2.c, line 198
WARNING: Freeing unused memory at 0x86a6580, in ao2_ref of astobj2.c, line 198
[Mar  4 15:58:52] ERROR[27740]: astobj2.c:183 ao2_ref: refcount -1 on object 0x86a6668

By: James Golovich (jamesgolovich) 2008-03-04 18:16:44.000-0600

I figured out its easy to reproduce the free'ng memory problem, and at least once trying this caused a segfault.  Just startup asterisk, telnet to port 5060, then disconnect.

By: James Golovich (jamesgolovich) 2008-03-04 18:57:16.000-0600

Russell,

I don't know enough about how ao2 works, but in sip_tcp_helper_thread you call ao2_ref(pvt->socket.ser, +1); but pvt will only exist for client connections and not server connections.

Edit: forget what I said.  sip_tcp_worker_fn is used for server connections, not sip_tcp helper_thread



By: Digium Subversion (svnbot) 2008-03-04 19:15:42.000-0600

Repository: asterisk
Revision: 105911

U   team/group/issue_11972/channels/chan_sip.c

------------------------------------------------------------------------
r105911 | russell | 2008-03-04 19:15:41 -0600 (Tue, 04 Mar 2008) | 4 lines

Remove an extra unref that caused a crash

(issue ASTERISK-11417, thanks to jamesgolovich for the testing and info for how to reproduce)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=105911

By: Russell Bryant (russell) 2008-03-04 19:23:49.000-0600

So, the general idea with using astobj2 for this is to make sure that the server instance won't get destroyed until it is no longer possible for it to be referenced anywhere in the code.  So, when a pointer to the server instance is copied into another object, (usually one with a sip_socket in it, like sip_peer, sip_pvt, whatever), then the reference count of the object reflects it.  Then, when that object that contains a pointer to the server instance goes away, it decrements the reference count.  When the ref count hits zero, the object gets free'd.

By: pj (pj) 2008-03-05 02:18:55.000-0600

branch http://svn.digium.com/svn/asterisk/team/group/issue_11972
is not working good, I got two crashes after few minutes,
now I'm running SVN-trunk-r105677 + bug11972-8.diff.txt, fine for now.

some msg from console (branch issue_11972):

[Mar  5 09:10:43]     -- Called bill-gw/xxxx
[Mar  5 09:10:43] ERROR[21429]: chan_sip.c:18041 handle_request_do: We could NOT get the channel lock for SIP/bill-gw-08260420!
[Mar  5 09:10:43] ERROR[21429]: chan_sip.c:18042 handle_request_do: SIP transaction failed: 016b7d6d7fcf33d515fa5a2d7e57b308@xxxxxxx
[Mar  5 09:10:53] WARNING[21429]: chan_sip.c:15209 handle_response: Remote host can't match request CANCEL to call '016b7d6d7fcf33d515fa5a2d7e57b308@xxxxx'. Giving up.



[Mar  5 09:11:02]     -- Executing [xxxx@zamestnanci:3] Dial("SIP/xxx-0825c010", "SIP/bill-gw/xxx") in new stack
[Mar  5 09:11:02]   == Using SIP RTP CoS mark 5
[Mar  5 09:11:02] ERROR[21429]: astobj2.c:115 INTERNAL_OBJ: bad magic number 0x101 for 0x8248cc8
*CLI>
Disconnected from Asterisk server


(gdb) bt full
#0  0xb7c6c4c5 in _start () at rtld.c:788
       preloadlist = Cannot access memory at address 0xb7c86b89

By: Digium Subversion (svnbot) 2008-03-05 18:06:40.000-0600

Repository: asterisk
Revision: 106302

_U  branches/1.6.0/
U   branches/1.6.0/channels/chan_sip.c

------------------------------------------------------------------------
r106302 | russell | 2008-03-05 18:06:39 -0600 (Wed, 05 Mar 2008) | 14 lines

Merged revisions 105734 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
r105734 | russell | 2008-03-04 14:36:16 -0600 (Tue, 04 Mar 2008) | 6 lines

Fix some bugs in the SIP tcp helper thread.
- fix a spot where a lock wouldn't get unlocked in an error condition
- call ast_mutex_destroy() on the lock before freeing its memory

(related to issue ASTERISK-11417)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=106302

By: pj (pj) 2008-03-07 01:29:52.000-0600

crash after about two days uptime and about 200 processed calls, uploading gdb7.txt
SVN-trunk-r105677 + bug11972-8.diff.txt



By: Russell Bryant (russell) 2008-03-07 16:31:07.000-0600

I made another commit to the issue_11972 branch which is making things not crash for me.  Now I'm trying to figure out why the sip_pvt's aren't going away after I hang up the asterisk channel associated with them ...

By: pj (pj) 2008-03-08 03:14:21.000-0600

Russell, contains branch 'issue_11972' all things, that James have done in  bug11972-8.diff.txt? When I last tested your branch it was really unstable in comparison to trunk + James patch.

By: Russell Bryant (russell) 2008-03-08 12:11:16.000-0600

It has some of the things in the patch.  There were some deadlock avoidance pieces that I have not merged yet, because I don't yet understand why they are required.  Perhaps the reasoning is in some old debug attached here ... I'll go looking later.

The biggest thing that the branch has, that the patch does not, is an attempt to fix a memory leak that occurs on every TCP or TLS session.  The "server_instance" struct gets stored in various places, and it's tricky to figure out the right time to free it.

So, if you can make the branch crash, then it probably means that I still don't have the logic right for when to free the memory, and I need to fix it.  If it locks up, then I would like to see the locks output so I can understand the deadlocks we are facing in this code.

By: pj (pj) 2008-03-08 13:36:03.000-0600

sorry Russell, this branch is still useless, it crashes after first call, uploading gdb8.txt

By: pj (pj) 2008-03-17 15:14:27

Hi James and Russel, after chan_sip changes in trunk during some last weeks, your patch fails to apply, can you update? also, if you can, please take a look at bugreports: 12169 and 12170, that are also related to sip/tls.

By: pj (pj) 2008-03-22 02:40:38

gdb9.txt, uploading another backtraces
from SVN-trunk-r105677 + bug11972-8.diff.txt

By: pj (pj) 2008-03-25 11:22:27

as a precursor of crash, I catch folowing locks:

=== Thread ID: -1219798128 (ast_make_file_from_fd started at [  154] tcptls.c server_root())
=== ---> Lock #0 (chan_sip.c): MUTEX 5858 find_call &p->pvt_lock 0x826d464 (1)
=== ---> Lock #1 (chan_sip.c): MUTEX 5858 find_call &p->pvt_lock 0x826bf34 (1)
=== ---> Lock #2 (chan_sip.c): WRLOCK 2510 __sip_xmit p->socket.ser->lock 0x8279098 (1)
=== -------------------------------------------------------------------

last on CLI
[Mar 25 17:16:33] WARNING[24559]: chan_sip.c:2534 __sip_xmit: sip_xmit of 0xb74b0f24 (len 525) to 0.0.0.0:0 returned -1: Connection reset by peer
[Mar 25 17:16:33] WARNING[24559]: chan_sip.c:2896 sip_scheddestroy: Unable to cancel SIP destruction.  Expect bad things.

By: pj (pj) 2008-04-21 16:37:41

Would be possible to update this patch according to current trunk? I'm using asterisk with this patch regularly and its working quite well, no locks, only  sometimes crashes (after about 500 calls, maybe due some memory leaks(?)). Without this patch, tls stuff is useless in clean trunk, it locks or crashes almost immediatelly, so it should be solved before 1.6 will be officially released.
This patch applies cleanly last with trunk rev 105733, next commit makes some changes, that are incompatible with this patch.

By: Olle Johansson (oej) 2008-07-06 09:32:46

james, russell: What's the status here?

By: Brett Bryant (bbryant) 2008-07-07 18:13:39

This issue was fixed with revision 123546 in trunk.

===========================
URL: http://svn.digium.com/view/asterisk?view=rev&rev=123546
Log:
Updates all usages of ast_tcptls_session_instance to be managed by reference counts so that they only get destroyed when all threads are done using
them, and memory does not get free'd causing strange issues with SIP.

This code was originally written by russellb in the team/group/issue_11972/ branch.

Modified:
   trunk/apps/app_externalivr.c
   trunk/channels/chan_sip.c
   trunk/include/asterisk/tcptls.h
   trunk/main/astobj2.c
   trunk/main/http.c
   trunk/main/manager.c
   trunk/main/tcptls.c
===========================

By: Brett Bryant (bbryant) 2008-07-07 18:14:54

If you still have issues that that commit didn't resolve, please open a new bug report.