Summary: | ASTERISK-12433: [patch] sip peer qualified failed, asterisk lock. | ||
Reporter: | Paul Belanger (pabelanger) | Labels: | |
Date Opened: | 2008-07-23 02:32:56 | Date Closed: | 2009-10-22 17:11:36 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Channels/chan_sip/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) 20081110__bug13136.diff.txt ( 1) 20081111__bug13136__2.diff.txt ( 2) 20081111__bug13136.diff.txt ( 3) 20081112__bug13136__2.diff.txt ( 4) 20081112__bug13136.diff.txt ( 5) 20081113__bug13136.diff.txt ( 6) bt.txt ( 7) bt_full.txt ( 8) gdb_trunk_176138.txt ( 9) gdb.2.txt (10) gdb.3.txt (11) gdb.txt (12) lock_trunk_176138.txt (13) locks.txt (14) mylog | |
Description: | We just had asterisk lock on us tonight. Best we can guess is because we lost our SIP PEER (via qualify). See output from 'show core locks'. ****** ADDITIONAL INFORMATION ****** ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <file> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: -1214731376 (do_monitor started at [18523] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 18493 do_monitor &monlock 0xb7a49ea0 (1) === ------------------------------------------------------------------- === === Thread ID: -1222354032 (ss_thread started at [ 7544] chan_zap.c handle_init_event()) === ---> Waiting for Lock #0 (chan_sip.c): MUTEX 18512 restart_monitor &monlock 0xb7a49ea0 (1) === --- ---> Locked Here: chan_sip.c line 18493 (do_monitor) === ------------------------------------------------------------------- === === Thread ID: -1223337072 (ss_thread started at [ 7544] chan_zap.c handle_init_event()) === ---> Waiting for Lock #0 (chan_sip.c): MUTEX 18512 restart_monitor &monlock 0xb7a49ea0 (1) === --- ---> Locked Here: chan_sip.c line 18493 (do_monitor) === ------------------------------------------------------------------- === ======================================================================= | ||
Comments: | By: Mark Michelson (mmichelson) 2008-07-23 09:36:48 Wow, there's something majorly wrong here. chan_sip and chan_zap both have functions called "restart_monitor" and for some reason chan_zap is calling the one defined in chan_sip. That, however, does not explain why the lock which was legitimately acquired in chan_sip has not been unlocked yet. It would be helpful to get a backtrace of that thread to see where it's stuck. If you can provide the output of "thread apply all bt full" from gdb, it would be helpful. Thanks! By: Mark Michelson (mmichelson) 2008-07-23 09:46:46 "Wow, there's something majorly wrong here. chan_sip and chan_zap both have functions called "restart_monitor" and for some reason chan_zap is calling the one defined in chan_sip." Actually, on second thought, I may be jumping to conclusions with regards to this. The backtrace will provide the necessary answers. By: Paul Belanger (pabelanger) 2008-07-23 12:11:41 putnopvut: Thanks for the information, however I'm not sure how to get a backtrace because asterisk did not core. ATM the system is back up and processing calls, but I think I can reproduces this issue with another system. Once it happens again I'll try and jump on IRC and get assistance. By: Mark Michelson (mmichelson) 2008-07-23 12:19:40 In case I'm not on, then you can attach to Asterisk with gdb using # gdb /usr/sbin/asterisk `pidof asterisk` From here, just issue the "thread apply all bt full" command and you will get a backtrace of all the currently running threads. By: Paul Belanger (pabelanger) 2008-07-25 10:27:09 putnopvut: We managed to reproduce the lock in our lab this morning. See attached files. By: Paul Belanger (pabelanger) 2008-07-25 10:28:57 I should note we used 1.6.0 branch (svn 133524). By: Mark Michelson (mmichelson) 2008-07-28 16:46:00 I took a look at the backtrace, and it appears as though you've pretty much hit the nail on the head with your analysis. If you check thread 2 in the backtrace, it's clear that Asterisk is blocking on a call to connect(2). Asterisk has sent a TCP SYN to the far end and is blocking until it receives a SYN/ACK in return. Unfortunately, if the peer is unreachable, Asterisk will wait forever. Unfortunately, this also means that *no* SIP processing is possible since the monlock is currently held by this thread. I haven't looked too closely at the code in question, but I have a suspicion of what is happening. I think that Asterisk detects that the peer in question is unreachable, and so the TCP connection is torn down. Then, when Asterisk attempts to re-connect so it can send another OPTIONS request (to see if the peer is now reachable) it can never connect in the first place. I will look further in-depth to see if my analysis is correct. By: Paul Belanger (pabelanger) 2008-07-28 21:41:50 Hi putnopvut, I would agree with what you said. We found if you create a sip peer to a host:port that is firewall to drop TCP packets, you will reproduce the issue everytime. By: Paul Belanger (pabelanger) 2008-09-18 15:52:42 Anything I can do to re-jump start this issue? By: Tilghman Lesher (tilghman) 2008-11-10 19:21:12.000-0600 Let's try something like this, which should only allow the connect to hang for up to 2 seconds, before it will be cancelled. By: Matt Riddell (zx81) 2008-11-10 21:47:52.000-0600 What do you reckon about making the 2000 a #define PORT_TIMEOUT 2000 or whatever? By: Tilghman Lesher (tilghman) 2008-11-10 22:27:08.000-0600 ZX81: I want to know whether it works FIRST, and only THEN we can discuss those things. By: Matt Riddell (zx81) 2008-11-11 01:59:35.000-0600 :) Fair enough By: Paul Belanger (pabelanger) 2008-11-11 12:02:21.000-0600 Corydon76: We decided to try and reproduce this issue with the latest trunk (155928) [without your patch] and there does not seem to be a problem. The issue still lies in the 1.6.0 (155815) branch. We are unable to cleanly apply your patch against it. Suggestions? By: Paul Belanger (pabelanger) 2008-11-11 12:19:40.000-0600 It seems 1.6.1 branch has also had the problem fixed, cannot reproduce. By: Tilghman Lesher (tilghman) 2008-11-11 13:21:25.000-0600 Updated to apply against 1.6.0. One question: do you get this same issue in 1.4? D'oh, nevermind. TCP support was not in 1.4, so you couldn't have gotten this issue in 1.4. I suspect that the reason it was not fixed in 1.6.0 is that it wasn't a regression over 1.4, it being a new, experimental feature and all. Still, the SIP driver didn't lock up in 1.4, so I think this behavior may still be a regression. By: Paul Belanger (pabelanger) 2008-11-11 13:40:02.000-0600 Corydon76: thanks for the fast backport. We'll get back shortly with the results. Have not tested 1.4, since we are using SIP over TCP. By: Paul Belanger (pabelanger) 2008-11-11 16:06:08.000-0600 Corydon76: No luck with attached patch (20081111__bug13136.diff.txt). Asterisk core's on startup. See attached bt.txt and bt full.txt By: Tilghman Lesher (tilghman) 2008-11-11 16:34:17.000-0600 Okay, silly mistake. New patch uploaded. By: Paul Belanger (pabelanger) 2008-11-12 08:03:56.000-0600 Morning Corydon76: After apply your patch we still see the lock (attached mylog for gdb output). --- ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <file> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: -1228268656 (do_monitor started at [19062] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 19032 do_monitor &monlock 0xb6d32c20 (1) === ------------------------------------------------------------------- === ======================================================================= By: Tilghman Lesher (tilghman) 2008-11-12 11:20:32.000-0600 pabelanger: so even with this patch, Asterisk still locks up? By: Tilghman Lesher (tilghman) 2008-11-12 12:23:37.000-0600 Reading the gdb log, it appears that it is locked up. So, new patch uploaded. By: Paul Belanger (pabelanger) 2008-11-12 12:33:01.000-0600 Sorry Corydon76, same problem. The lock is still there. --- ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <file> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: -1228665968 (do_monitor started at [19062] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 19032 do_monitor &monlock 0xb6cd1c20 (1) === ------------------------------------------------------------------- === ======================================================================= By: Paul Belanger (pabelanger) 2008-11-12 12:46:33.000-0600 I should note it seems to lock will clear after 3mins and 20seconds. By: Tilghman Lesher (tilghman) 2008-11-12 14:13:14.000-0600 Switching to a different signal, on the theory that connect(2) handles SIGURG, because SIGURG is normally supposed to be used with sockets. New patch uploaded. By: Paul Belanger (pabelanger) 2008-11-12 15:23:35.000-0600 lock is still there, uploaded gdb.3.txt for review. By: Matt Riddell (zx81) 2008-11-12 15:40:04.000-0600 Is there not a way to catch all signals and then print out which one you got? Like make an array of signal numbers, loop through, use the same handler and then print out the signal: void sighand(int signo) { printf("Close Server Connections called to close: %d\n", signo); return; } Just to debug. By: Tilghman Lesher (tilghman) 2008-11-12 16:12:40.000-0600 ZX81: the issue is NOT whether we can detect a signal. The issue is whether a signal will INTERRUPT the connect(2) system call. That has been the entire problem in this issue. By: Tilghman Lesher (tilghman) 2008-11-12 16:15:28.000-0600 And, I might add, the thread calling connect(2) is clearly receiving the signal, and also, the connect(2) is not returning, which is a violation of the specification. Where we go from here is troubling. By: Matt Riddell (zx81) 2008-11-12 16:28:20.000-0600 Ah, yeah we had the same problem in SmoothTorque, but it was solved by writing periodically to the socket which caused it to throw the interrupt. Obviously that's not useful in this situation though :) By: Tilghman Lesher (tilghman) 2008-11-13 00:15:57.000-0600 Okay, whole new approach. Instead of waiting for connect(2) to return, we set the socket nonblocking, then wait for the socket to become writable, for a maximum of 2 seconds. We should then be able to fail the socket, immediately. Patch uploaded, ready for testing. By: Leif Madsen (lmadsen) 2008-11-18 15:58:29.000-0600 Was able to reproduce this issue (what a pain that was!) but it seems even with the patch I'm still getting locks: ======================================================================= === Currently Held Locks ============================================== ======================================================================= === === <file> <line num> <function> <lock name> <lock addr> (times locked) === === Thread ID: 1079638336 (do_monitor started at [19062] chan_sip.c restart_monitor()) === ---> Lock #0 (chan_sip.c): MUTEX 19032 do_monitor &monlock 0x2aaac1bb0900 (1) === ---> Lock #1 (chan_sip.c): MUTEX 2593 __sip_xmit &p->socket.ser->lock 0x4da89e0 (1) === ------------------------------------------------------------------- === ======================================================================= By: Tilghman Lesher (tilghman) 2008-11-18 18:40:07.000-0600 Well, unfortunately, I'm out of ideas. Anybody else want to take a stab at this? By: Mark Michelson (mmichelson) 2008-11-18 19:22:38.000-0600 Back in the early days of this bug, when I was looking at it, my idea was to do exactly what you did in your latest patch, Corydon76. That is, set the socket to non-blocking prior to calling connect and then poll/select until the socket is ready for writing. What I found when initially writing this is that there are a bunch of places that you have to check for errors. Every call to fcntl, the call to connect, the call to getsockopt, the call to select/poll, etc. It may be that some of these functions are erroring out for some reason. Adding error checking/logging may point to some portion that is failing to execute properly. Also, I'm not sure about the legality/correctness of this, but my original intention was also to clear the O_NONBLOCK flag on the socket after I had determined that the connection was successful. I don't think this would directly affect the problem encountered, but hey, just throwing something else out there. Another possibility is that the non-blocking connect patch is properly solving the original problem but blitzrage's testing may have uncovered something else wrong. The fact that there are locks held, but there is no contention between threads makes me think that the monitor thread in chan_sip is blocking somewhere, somehow. I would assume that the call to write in ast_tcptls_server_write is now blocking for some reason. Something else to consider is that pabelanger said that the issue is not present in trunk or 1.6.1. It may be worthwhile to see what fixes were made there with regards to TCP support in chan_sip and see if we can identify which revision fixed this issue. By: Tilghman Lesher (tilghman) 2008-11-19 00:19:41.000-0600 I've evaluated most of those commands and determined that most of the errors were not relevant or possible, with the given arguments, but it's possible that I might have missed some. In terms of O_NONBLOCK, the existing code already clears that flag in code following the connect, so it was unnecessary to do that again. In fact, I'm not really sure why it was there in the first place, as the socket would otherwise block, by default. It's possible that the write() blocks, but I don't think that's likely, given that poll() indicated that a write() should not block. As to what exactly is blocking, that can be determined by connecting to Asterisk with gdb and running a 'thread apply all bt'. I suspect it's still blocking in the connect() for some reason that I can't divine, but testing is needed to confirm that. By: Paul Belanger (pabelanger) 2008-11-19 00:45:28.000-0600 Sorry for the lack of testing on my part, just got Internet access back. I'm currently in China for business, and will be back next week. I'll be able to resume testing. By: Mark Michelson (mmichelson) 2008-11-19 09:53:32.000-0600 I based my assumption that the write() was blocking because of the locks which are currently held in blitzrage's latest core show locks output. I'm not sure exactly what blitzrage's test setup is, but if there was something which was previously causing the connect() to block, then I would be willing to bet that in his test setup, the same obstruction is causing the write() to also block. However, if his setup was causing the connect() to block, I would expect that the error returned would have caused the subsequent write() to not occur. Are we certain that the connect() succeeded? blitzrage, just to clarify, could you describe how you set up your tests for this? Also, did you ever see a console error message like the following: "Unable to connect <something> to <something else>:<some reason>" ? By: Leif Madsen (lmadsen) 2008-11-19 11:35:51.000-0600 My test setup includes 3 devices: a Polycom (that has TCP support), Zoiper (which I don't think has TCP support, but seems to respond to the NOTIFY packets regardless...) and X-Lite, which explicitly seems to reject the packets. I believe I was seeing the error you mentioned when trying X-Lite (...some time passes as he gets the exact error) *CLI> [Nov 19 08:29:34] ERROR[26147]: tcptls.c:252 ast_tcptls_client_start: Unable to connect SIP socket to 192.168.128.129:59668: Connection refused And some more if it is useful... [Nov 19 08:29:35] ERROR[26147]: chan_sip.c:10543 register_verify: 'UDP' is not a valid transport for '102'. we only use 'TCP'! ending call. [Nov 19 08:29:35] NOTICE[26147]: chan_sip.c:18293 handle_request_register: Registration from '"X-Lite"<sip:102@192.168.128.50>' failed for '192.168.128.129' - Device not configured to use this transport type The way I reproduce is a bit tricky... Basically I block the incoming SIP signalling port (in my case it is 5430, but most devices would be 5060) in the RH-Firewall-1-INPUT chain. Then after I see the UNREACHABLE message, I then block the connection in the OUTPUT chain of iptables. After a little bit of time (60-120 seconds roughly), then I get the lock showing in 'core show locks'. That seems to be the only way I was able to reliably reproduce it. Sometimes is hit or miss, but that seems to be the most reliable way that I could find. By: Leif Madsen (lmadsen) 2009-01-20 13:35:16.000-0600 Just thought I'd ping this issue again to see if there is anything we can do to move this issue forward? By: Paul Belanger (pabelanger) 2009-01-20 13:57:53.000-0600 I can re-test with the latest 1.6.0 / 1.6.1 branches and trunk to see if there is still an issue. By: Leif Madsen (lmadsen) 2009-01-20 14:30:48.000-0600 Great, please do and let us know if any of them resolve your issue. Thanks! By: Leif Madsen (lmadsen) 2009-02-02 13:36:00.000-0600 Still an issue? By: frank koster (notthematrix) 2009-02-09 17:16:03.000-0600 Dont know if it is the same issue but I somethimes get this problem when I get a lockup asterisk completly locks up all channels when I doe a sip show channels all channals are filled and it can not make new callas anymore We are using a ht-502 and ht-503 from grandstream and using it in TLS mode with the SRTP branch. I could reproduce this bug when I disconect power from my ht-503 and bring it to my niebours reconecting it over HIS connection (difrent provider and public IP) connection both via NAT with nat enabled. It also seems to happen when a provider changes ip on the fly. I suspect this tyo happen with an austrailian friend who is connected via bigpond adsl while my asterisk is in europe. It looks that if I disable session timers by doing session-timers=refuse session-expires=110 session-minse=90 session-refresher=uas The problem seems to be gone but I only tryed it ones with a ht-502 again doing the running to nighbour trick. I cant use the patch provided because of SRTP (srtp branch) But If it is the same bug it looks like tls does not disconect porperly sip show peers Power disconnect. 123456789/123456789 (Unspecified) D N 0 UNKNOWN while this is a normal peer in rest when not yet loged in after restart Normal situation 987654321 (Unspecified) D N 5061 UNKNOWN strange thing is that it defaults to port 5061 while I setted an other default port 443 If this is the same bug I hope this can help since it is a very anoing bug and locks my asterisk too manny times. By: Paul Belanger (pabelanger) 2009-02-16 12:22:17.000-0600 Sorry for the delay on this... I've been traveling over the last few weeks. I retested using 1.6.0.5 and trunk, the problem is still there. I've uploaded the lock and bt against trunk (gdb_trunk_176138.txt and lock_trunk_176138.txt). I'll hold off on uploading anything with 1.6.0 branch until trunk is fixed, or unless somebody asks for it. PB By: Leif Madsen (lmadsen) 2009-02-19 09:28:28.000-0600 Assigned this to Corydon76 to see if the bt or lock files show anything useful. Please reassign should you not be able to move the issue forward. Thanks! By: Tilghman Lesher (tilghman) 2009-02-19 11:40:03.000-0600 blitzrage: No, the lock isn't showing anything helpful, and the bt isn't, either. By: Leif Madsen (lmadsen) 2009-02-19 12:27:13.000-0600 Corydon76: so do you have any idea where to go with this issue? By: frank koster (notthematrix) 2009-02-19 16:00:29.000-0600 Maby it is a deeper problem and it has to do with the kernel tcp stack and synflood protection. when a qualify fails it migt retry to connect and the kernel might see this as somekind of synflood attack? so the kernel might be blocking the asterisk from flooding? and asterisk locks up whit waiting connections? sip show channels Peer User/ANR Call ID Format Hold Last Message Expiry 82.95.212.35 (None) 0bc40c8b3267d88 0x0 (nothing) No Init: OPTIONS 89.231.82.39 (None) 3c3b93d04c9274d 0x0 (nothing) No Init: OPTIONS 194.221.62.198 (None) 7f439a7143a3110 0x0 (nothing) No Init: OPTIONS 89.231.82.45 (None) 40f7adc27820bd4 0x0 (nothing) No Init: OPTIONS 195.243.173.117 (None) 0x0 (nothing) No 89.231.84.195 (None) 47026f666d03812 0x0 (nothing) No Init: OPTIONS 94.208.243.114 (None) 5bf9ba752dc6efe 0x0 (nothing) No Init: OPTIONS 89.231.84.45 (None) 251517d403f5617 0x0 (nothing) No Init: OPTIONS 94.208.243.114 (None) 702438ee49a701a 0x0 (nothing) No Init: OPTIONS 89.231.86.45 (None) 41a9ff0944d5769 0x0 (nothing) No Init: OPTIONS 194.109.4.200 (None) 44b0915742ab4d8 0x0 (nothing) No Init: OPTIONS 89.231.83.45 (None) 4c13d69819c4225 0x0 (nothing) No Init: OPTIONS 191.243.177.117 (None) 4c06c96e2eab6b8 0x0 (nothing) No Init: OPTIONS 13 active SIP dialogs sip show peers Name/username Host Dyn Nat ACL Port Status 31108920309 (Unspecified) D N 5061 UNKNOWN 37418080095/37418080095 94.208.243.114 D N 5060 UNKNOWN 37418080642 (Unspecified) D N 5061 UNKNOWN 37418080966 (Unspecified) D N 5061 UNKNOWN 37418080967 (Unspecified) D N 5061 UNKNOWN 41852788101 (Unspecified) D N 5061 UNKNOWN 41852788102/41852788102 194.109.4.200 D N 5894 UNKNOWN 41852788103 (Unspecified) D N 5061 UNKNOWN 41852788104 (Unspecified) D N 5061 UNKNOWN 41852788105/41852788105 82.95.212.35 D N 62421 UNKNOWN 41852788111 (Unspecified) D N 5061 UNKNOWN 41852788115 (Unspecified) D N 5060 UNKNOWN 41852788116 (Unspecified) D N 5061 UNKNOWN 39710615403/39710615403 94.208.243.114 D N 5060 UNKNOWN 43961124284 (Unspecified) D N 5060 UNKNOWN 560256 195.243.173.117 5060 UNKNOWN 61889214119 (Unspecified) D N 5060 UNKNOWN 89.231.82.39 89.231.82.39 5060 UNKNOWN 89.231.82.45 89.231.82.45 5060 UNKNOWN 89.231.83.45 89.231.83.45 5060 UNKNOWN 89.231.84.195 89.231.84.195 5060 UNKNOWN 89.231.84.45 89.231.84.45 5060 UNKNOWN 89.231.86.45 89.231.86.45 5060 UNKNOWN coscdxbads/coscdxbrads 199.221.69.128 5060 UNKNOWN 24 sip peers [Monitored: 0 online, 24 offline Unmonitored: 0 online, 0 offline] maby it helps it is a very anoying bug. By: Leif Madsen (lmadsen) 2009-02-27 16:37:09.000-0600 Setting this status to new as we're stalled. By: frank koster (notthematrix) 2009-02-27 16:46:55.000-0600 Recently This bug did not acounter anymore. I changed 3 things. 1 set session timers to session-timers=refuse session-expires=110 session-minse=90 session-refresher=uas 2 set my grandstream ht-503 devices so they dont restart evryday when checking for new firmware. it wil be off air for 15 sec and then reconects. 3 I chached qualify to 5000 I noteced that when qualify response gets over 1999 it wil get (too lagged) Is it not that the default in qualify=yes valuw is 2000? maby it conflicts whith (too lagged) in some way? atleast one of theese things must be related to it. maby this info helps By: Miguel Molina (coolmig) 2009-06-24 09:07:13 A few questions about this: 1. Does this happen on SIP through UDP or only on TCP mode? 2. Setting qualify = no for the peers is a valid workaround? 3. What about the notthematrix workarounds, are they valid? (pabelanger tests could be useful) By: David Vossel (dvossel) 2009-08-20 13:34:48 can we get an update on this issue? By: Dan Radio (whys) 2009-09-02 15:19:10 Coolmig: 1. ??? 2. No. Qualify = no, only hides the (my) problem until a call is placed. 3. I tried notthematrix workarounds. They did nothing for me. I am not sure if this is the same problem. For what it's worth, I make a UDP sip channel to an IP that does not exist (doesn't answer) The call takes 33 seconds to timeout. If it's a TCP channel it takes 190 seconds to timeout. If it dial two extensions, where one answers and the other one (as above) can not (tcp), the call never bridges. This behavior is the same on branches 1.6.0, 1.6.1, 1.6.2 By: frank koster (notthematrix) 2009-09-02 15:54:24 What kind of voip hardware doe you use? By: Dan Radio (whys) 2009-09-03 10:44:46 Notthematrix: currently testing on a centos 5.3, 2.6.18-128.2.1.el5, on a VM on a hefty underutilized ESX server. (Very snappy!) - Also mainly testing Cisco VOIP phones 7960's By: David Vossel (dvossel) 2009-09-28 16:12:46 here is what is going on... The TCP connection setup is done while it holds the monitor thread. If the connect() function in ast_tcptls_client_start does not timeout, the monitor lock may be held indefinitely. There is a patch for this up on reviewboard, https://reviewboard.asterisk.org/r/380/. --------- This thread is stuck at the connect() function while holding the monitor lock ----- #0 0xb7f0f410 in __kernel_vsyscall () No symbol table info available. #1 0xb7e91e18 in connect () from /lib/tls/i686/cmov/libc.so.6 No symbol table info available. #2 0x08153933 in ast_tcptls_client_start (desc=0xb6bdb910) at tcptls.c:301 flags = 135607986 ser = (struct ast_tcptls_session_instance *) 0x0 ssu = {tid = 3065899920, schedid = -1} __PRETTY_FUNCTION__ = "ast_tcptls_client_start" __FUNCTION__ = "ast_tcptls_client_start" #3 0xb6c482df in sip_prepare_socket (p=0x8319f28) at chan_sip.c:18733 By: Dan Radio (whys) 2009-09-30 12:48:03 Kudos to dvossel! This fix now allows a trunk to connect to Microsoft Exchange Unified messaging, with it's stealth firewall on. (need to dial two extensions simultaneously, knowing that only one will answer. Previously this call blocked waiting for both extensions to respond, but now it seems to work flawlessly. ) I can't speak for all the issues above, but my tests work. By: David Vossel (dvossel) 2009-09-30 13:52:15 Awesome, I can see how that should work now, but it is still not perfect. Connections that never timeout still never go away, they are just moved into a separate thread. This is at least a step in the right direction though. I plan on following up with a separate patch to address this. By: Digium Subversion (svnbot) 2009-10-22 15:00:20 Repository: asterisk Revision: 225445 U trunk/apps/app_externalivr.c U trunk/channels/chan_sip.c U trunk/include/asterisk/tcptls.h U trunk/main/tcptls.c ------------------------------------------------------------------------ r225445 | dvossel | 2009-10-22 15:00:19 -0500 (Thu, 22 Oct 2009) | 50 lines SIP TCP/TLS: move client connection setup/write into tcp helper thread, various related locking/memory fixes. What this patch fixes 1.Moves sip TCP/TLS connection setup into the TCP helper thread: Connection setup takes awhile and before this it was being done while holding the monitor lock. 2.Moves TCP/TLS writing to the TCP helper thread: Through the use of a packet queue and an alert pipe, the TCP helper thread can now be woken up to write data as well as read data. 3.Locking error: sip_xmit returned an XMIT_ERROR without giving up the tcptls_session lock. This lock has been completely removed from sip_xmit and placed in the new sip_tcptls_write() function. 4.Memory leak: When creating a tcptls_client the tls_cfg was alloced but never freed unless the tcptls_session failed to start. Now the session_args for a sip client are an ao2 object which frees the tls_cfg on destruction. 5.Pointer to stack variable: During sip_prepare_socket the creation of a client's ast_tcptls_session_args was done on the stack and stored as a pointer in the newly created tcptls_session. Depending on the events that followed, there was a slight possibility that pointer could have been accessed after the stack returned. Given the new changes, it is always accessed after the stack returns which is why I found it. Notable code changes 1.I broke tcptls.c's ast_tcptls_client_start() function into two functions. One for creating and allocating the new tcptls_session, and a separate one for starting and handling the new connection. This allowed me to create the tcptls_session, launch the helper thread, and then establish the connection within the helper thread. 2.Writes to a tcptls_session are now done within the helper thread. This is done by using an alert pipe to wake up the thread if new data needs to be sent. The thread's sip_threadinfo object contains the alert pipe as well as the packet queue. 3.Since the threadinfo object contains the alert pipe, it must now be accessed outside of the helper thread for every write (queuing of a packet). For easy lookup, I moved the threadinfo objects from a linked list to an ao2_container. (closes issue ASTERISK-12433) Reported by: pabelanger Tested by: dvossel, whys (closes issue ASTERISK-14765) Reported by: dvossel Tested by: dvossel Review: https://reviewboard.asterisk.org/r/380/ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=225445 By: Digium Subversion (svnbot) 2009-10-22 17:00:16 Repository: asterisk Revision: 225489 _U branches/1.6.2/ U branches/1.6.2/apps/app_externalivr.c U branches/1.6.2/channels/chan_sip.c U branches/1.6.2/include/asterisk/tcptls.h U branches/1.6.2/main/tcptls.c ------------------------------------------------------------------------ r225489 | dvossel | 2009-10-22 17:00:16 -0500 (Thu, 22 Oct 2009) | 56 lines Merged revisions 225445 via svnmerge from https://origsvn.digium.com/svn/asterisk/trunk ........ r225445 | dvossel | 2009-10-22 14:55:51 -0500 (Thu, 22 Oct 2009) | 50 lines SIP TCP/TLS: move client connection setup/write into tcp helper thread, various related locking/memory fixes. What this patch fixes 1.Moves sip TCP/TLS connection setup into the TCP helper thread: Connection setup takes awhile and before this it was being done while holding the monitor lock. 2.Moves TCP/TLS writing to the TCP helper thread: Through the use of a packet queue and an alert pipe, the TCP helper thread can now be woken up to write data as well as read data. 3.Locking error: sip_xmit returned an XMIT_ERROR without giving up the tcptls_session lock. This lock has been completely removed from sip_xmit and placed in the new sip_tcptls_write() function. 4.Memory leak: When creating a tcptls_client the tls_cfg was alloced but never freed unless the tcptls_session failed to start. Now the session_args for a sip client are an ao2 object which frees the tls_cfg on destruction. 5.Pointer to stack variable: During sip_prepare_socket the creation of a client's ast_tcptls_session_args was done on the stack and stored as a pointer in the newly created tcptls_session. Depending on the events that followed, there was a slight possibility that pointer could have been accessed after the stack returned. Given the new changes, it is always accessed after the stack returns which is why I found it. Notable code changes 1.I broke tcptls.c's ast_tcptls_client_start() function into two functions. One for creating and allocating the new tcptls_session, and a separate one for starting and handling the new connection. This allowed me to create the tcptls_session, launch the helper thread, and then establish the connection within the helper thread. 2.Writes to a tcptls_session are now done within the helper thread. This is done by using an alert pipe to wake up the thread if new data needs to be sent. The thread's sip_threadinfo object contains the alert pipe as well as the packet queue. 3.Since the threadinfo object contains the alert pipe, it must now be accessed outside of the helper thread for every write (queuing of a packet). For easy lookup, I moved the threadinfo objects from a linked list to an ao2_container. (closes issue ASTERISK-12433) Reported by: pabelanger Tested by: dvossel, whys (closes issue ASTERISK-14765) Reported by: dvossel Tested by: dvossel Review: https://reviewboard.asterisk.org/r/380/ ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=225489 By: Digium Subversion (svnbot) 2009-10-22 17:11:33 Repository: asterisk Revision: 225490 _U branches/1.6.1/ U branches/1.6.1/apps/app_externalivr.c U branches/1.6.1/channels/chan_sip.c U branches/1.6.1/include/asterisk/tcptls.h U branches/1.6.1/main/tcptls.c ------------------------------------------------------------------------ r225490 | dvossel | 2009-10-22 17:11:33 -0500 (Thu, 22 Oct 2009) | 56 lines Merged revisions 225445 via svnmerge from https://origsvn.digium.com/svn/asterisk/trunk ........ r225445 | dvossel | 2009-10-22 14:55:51 -0500 (Thu, 22 Oct 2009) | 50 lines SIP TCP/TLS: move client connection setup/write into tcp helper thread, various related locking/memory fixes. What this patch fixes 1.Moves sip TCP/TLS connection setup into the TCP helper thread: Connection setup takes awhile and before this it was being done while holding the monitor lock. 2.Moves TCP/TLS writing to the TCP helper thread: Through the use of a packet queue and an alert pipe, the TCP helper thread can now be woken up to write data as well as read data. 3.Locking error: sip_xmit returned an XMIT_ERROR without giving up the tcptls_session lock. This lock has been completely removed from sip_xmit and placed in the new sip_tcptls_write() function. 4.Memory leak: When creating a tcptls_client the tls_cfg was alloced but never freed unless the tcptls_session failed to start. Now the session_args for a sip client are an ao2 object which frees the tls_cfg on destruction. 5.Pointer to stack variable: During sip_prepare_socket the creation of a client's ast_tcptls_session_args was done on the stack and stored as a pointer in the newly created tcptls_session. Depending on the events that followed, there was a slight possibility that pointer could have been accessed after the stack returned. Given the new changes, it is always accessed after the stack returns which is why I found it. Notable code changes 1.I broke tcptls.c's ast_tcptls_client_start() function into two functions. One for creating and allocating the new tcptls_session, and a separate one for starting and handling the new connection. This allowed me to create the tcptls_session, launch the helper thread, and then establish the connection within the helper thread. 2.Writes to a tcptls_session are now done within the helper thread. This is done by using an alert pipe to wake up the thread if new data needs to be sent. The thread's sip_threadinfo object contains the alert pipe as well as the packet queue. 3.Since the threadinfo object contains the alert pipe, it must now be accessed outside of the helper thread for every write (queuing of a packet). For easy lookup, I moved the threadinfo objects from a linked list to an ao2_container. (closes issue ASTERISK-12433) Reported by: pabelanger Tested by: dvossel, whys (closes issue ASTERISK-14765) Reported by: dvossel Tested by: dvossel Review: https://reviewboard.asterisk.org/r/380/ ........ ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=225490 |