[Home]

Summary:ASTERISK-12433: [patch] sip peer qualified failed, asterisk lock.
Reporter:Paul Belanger (pabelanger)Labels:
Date Opened:2008-07-23 02:32:56Date Closed:2009-10-22 17:11:36
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20081110__bug13136.diff.txt
( 1) 20081111__bug13136__2.diff.txt
( 2) 20081111__bug13136.diff.txt
( 3) 20081112__bug13136__2.diff.txt
( 4) 20081112__bug13136.diff.txt
( 5) 20081113__bug13136.diff.txt
( 6) bt.txt
( 7) bt_full.txt
( 8) gdb_trunk_176138.txt
( 9) gdb.2.txt
(10) gdb.3.txt
(11) gdb.txt
(12) lock_trunk_176138.txt
(13) locks.txt
(14) mylog
Description:We just had asterisk lock on us tonight.  Best we can guess is because we lost our SIP PEER (via qualify).

See output from 'show core locks'.

****** ADDITIONAL INFORMATION ******

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: -1214731376 (do_monitor           started at [18523] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 18493 do_monitor &monlock 0xb7a49ea0 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: -1222354032 (ss_thread            started at [ 7544] chan_zap.c handle_init_event())
=== ---> Waiting for Lock #0 (chan_sip.c): MUTEX 18512 restart_monitor &monlock 0xb7a49ea0 (1)
=== --- ---> Locked Here: chan_sip.c line 18493 (do_monitor)
=== -------------------------------------------------------------------
===
=== Thread ID: -1223337072 (ss_thread            started at [ 7544] chan_zap.c handle_init_event())
=== ---> Waiting for Lock #0 (chan_sip.c): MUTEX 18512 restart_monitor &monlock 0xb7a49ea0 (1)
=== --- ---> Locked Here: chan_sip.c line 18493 (do_monitor)
=== -------------------------------------------------------------------
===
=======================================================================
Comments:By: Mark Michelson (mmichelson) 2008-07-23 09:36:48

Wow, there's something majorly wrong here. chan_sip and chan_zap both have functions called "restart_monitor" and for some reason chan_zap is calling the one defined in chan_sip.

That, however, does not explain why the lock which was legitimately acquired in chan_sip has not been unlocked yet. It would be helpful to get a backtrace of that thread to see where it's stuck. If you can provide the output of "thread apply all bt full" from gdb, it would be helpful.

Thanks!

By: Mark Michelson (mmichelson) 2008-07-23 09:46:46

"Wow, there's something majorly wrong here. chan_sip and chan_zap both have functions called "restart_monitor" and for some reason chan_zap is calling the one defined in chan_sip."

Actually, on second thought, I may be jumping to conclusions with regards to this. The backtrace will provide the necessary answers.

By: Paul Belanger (pabelanger) 2008-07-23 12:11:41

putnopvut: Thanks for the information, however I'm not sure how to get a backtrace because asterisk did not core.  ATM the system is back up and processing calls, but I think I can reproduces this issue with another system.  Once it happens again I'll try and jump on IRC and get assistance.

By: Mark Michelson (mmichelson) 2008-07-23 12:19:40

In case I'm not on, then you can attach to Asterisk with gdb using

# gdb /usr/sbin/asterisk `pidof asterisk`

From here, just issue the "thread apply all bt full" command and you will get a backtrace of all the currently running threads.

By: Paul Belanger (pabelanger) 2008-07-25 10:27:09

putnopvut: We managed to reproduce the lock in our lab this morning.  See attached files.

By: Paul Belanger (pabelanger) 2008-07-25 10:28:57

I should note we used 1.6.0 branch (svn 133524).

By: Mark Michelson (mmichelson) 2008-07-28 16:46:00

I took a look at the backtrace, and it appears as though you've pretty much hit the nail on the head with your analysis. If you check thread 2 in the backtrace, it's clear that Asterisk is blocking on a call to connect(2).

Asterisk has sent a TCP SYN to the far end and is blocking until it receives a SYN/ACK in return. Unfortunately, if the peer is unreachable, Asterisk will wait forever. Unfortunately, this also means that *no* SIP processing is possible since the monlock is currently held by this thread.

I haven't looked too closely at the code in question, but I have a suspicion of what is happening. I think that Asterisk detects that the peer in question is unreachable, and so the TCP connection is torn down. Then, when Asterisk attempts to re-connect so it can send another OPTIONS request (to see if the peer is now reachable) it can never connect in the first place.

I will look further in-depth to see if my analysis is correct.



By: Paul Belanger (pabelanger) 2008-07-28 21:41:50

Hi putnopvut, I would agree with what you said.  We found if you create a sip peer to a host:port that is firewall to drop TCP packets, you will reproduce the issue everytime.

By: Paul Belanger (pabelanger) 2008-09-18 15:52:42

Anything I can do to re-jump start this issue?

By: Tilghman Lesher (tilghman) 2008-11-10 19:21:12.000-0600

Let's try something like this, which should only allow the connect to hang for up to 2 seconds, before it will be cancelled.

By: Matt Riddell (zx81) 2008-11-10 21:47:52.000-0600

What do you reckon about making the 2000 a #define PORT_TIMEOUT 2000 or whatever?

By: Tilghman Lesher (tilghman) 2008-11-10 22:27:08.000-0600

ZX81: I want to know whether it works FIRST, and only THEN we can discuss those things.

By: Matt Riddell (zx81) 2008-11-11 01:59:35.000-0600

:) Fair enough

By: Paul Belanger (pabelanger) 2008-11-11 12:02:21.000-0600

Corydon76: We decided to try and reproduce this issue with the latest trunk (155928) [without your patch] and there does not seem to be a problem.

The issue still lies in the 1.6.0 (155815) branch.  We are unable to cleanly apply your patch against it.

Suggestions?

By: Paul Belanger (pabelanger) 2008-11-11 12:19:40.000-0600

It seems 1.6.1 branch has also had the problem fixed, cannot reproduce.

By: Tilghman Lesher (tilghman) 2008-11-11 13:21:25.000-0600

Updated to apply against 1.6.0.  One question:  do you get this same issue in 1.4?

D'oh, nevermind.  TCP support was not in 1.4, so you couldn't have gotten this issue in 1.4.  I suspect that the reason it was not fixed in 1.6.0 is that it wasn't a regression over 1.4, it being a new, experimental feature and all.  Still, the SIP driver didn't lock up in 1.4, so I think this behavior may still be a regression.



By: Paul Belanger (pabelanger) 2008-11-11 13:40:02.000-0600

Corydon76: thanks for the fast backport.  We'll get back shortly with the results.

Have not tested 1.4, since we are using SIP over TCP.



By: Paul Belanger (pabelanger) 2008-11-11 16:06:08.000-0600

Corydon76: No luck with attached patch (20081111__bug13136.diff.txt). Asterisk core's on startup.  See attached bt.txt and bt full.txt

By: Tilghman Lesher (tilghman) 2008-11-11 16:34:17.000-0600

Okay, silly mistake.  New patch uploaded.

By: Paul Belanger (pabelanger) 2008-11-12 08:03:56.000-0600

Morning Corydon76: After apply your patch we still see the lock (attached mylog for gdb output).

---
=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: -1228268656 (do_monitor           started at [19062] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 19032 do_monitor &monlock 0xb6d32c20 (1)
=== -------------------------------------------------------------------
===
=======================================================================

By: Tilghman Lesher (tilghman) 2008-11-12 11:20:32.000-0600

pabelanger: so even with this patch, Asterisk still locks up?

By: Tilghman Lesher (tilghman) 2008-11-12 12:23:37.000-0600

Reading the gdb log, it appears that it is locked up.  So, new patch uploaded.

By: Paul Belanger (pabelanger) 2008-11-12 12:33:01.000-0600

Sorry Corydon76, same problem.  The lock is still there.

---
=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: -1228665968 (do_monitor           started at [19062] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 19032 do_monitor &monlock 0xb6cd1c20 (1)
=== -------------------------------------------------------------------
===
=======================================================================

By: Paul Belanger (pabelanger) 2008-11-12 12:46:33.000-0600

I should note it seems to lock will clear after 3mins and 20seconds.

By: Tilghman Lesher (tilghman) 2008-11-12 14:13:14.000-0600

Switching to a different signal, on the theory that connect(2) handles SIGURG, because SIGURG is normally supposed to be used with sockets.  New patch uploaded.

By: Paul Belanger (pabelanger) 2008-11-12 15:23:35.000-0600

lock is still there, uploaded gdb.3.txt for review.

By: Matt Riddell (zx81) 2008-11-12 15:40:04.000-0600

Is there not a way to catch all signals and then print out which one you got?

Like make an array of signal numbers, loop through, use the same handler and then print out the signal:

void sighand(int signo)
{
       printf("Close Server Connections called to close: %d\n", signo);
       return;
}

Just to debug.

By: Tilghman Lesher (tilghman) 2008-11-12 16:12:40.000-0600

ZX81:  the issue is NOT whether we can detect a signal.  The issue is whether a signal will INTERRUPT the connect(2) system call.  That has been the entire problem in this issue.

By: Tilghman Lesher (tilghman) 2008-11-12 16:15:28.000-0600

And, I might add, the thread calling connect(2) is clearly receiving the signal, and also, the connect(2) is not returning, which is a violation of the specification.  Where we go from here is troubling.

By: Matt Riddell (zx81) 2008-11-12 16:28:20.000-0600

Ah, yeah we had the same problem in SmoothTorque, but it was solved by writing periodically to the socket which caused it to throw the interrupt.

Obviously that's not useful in this situation though :)

By: Tilghman Lesher (tilghman) 2008-11-13 00:15:57.000-0600

Okay, whole new approach.  Instead of waiting for connect(2) to return, we set the socket nonblocking, then wait for the socket to become writable, for a maximum of 2 seconds.  We should then be able to fail the socket, immediately.

Patch uploaded, ready for testing.

By: Leif Madsen (lmadsen) 2008-11-18 15:58:29.000-0600

Was able to reproduce this issue (what a pain that was!) but it seems even with the patch I'm still getting locks:

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 1079638336 (do_monitor           started at [19062] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 19032 do_monitor &monlock 0x2aaac1bb0900 (1)
=== ---> Lock #1 (chan_sip.c): MUTEX 2593 __sip_xmit &p->socket.ser->lock 0x4da89e0 (1)
=== -------------------------------------------------------------------
===
=======================================================================

By: Tilghman Lesher (tilghman) 2008-11-18 18:40:07.000-0600

Well, unfortunately, I'm out of ideas.  Anybody else want to take a stab at this?

By: Mark Michelson (mmichelson) 2008-11-18 19:22:38.000-0600

Back in the early days of this bug, when I was looking at it, my idea was to do exactly what you did in your latest patch, Corydon76. That is, set the socket to non-blocking prior to calling connect and then poll/select until the socket is ready for writing.

What I found when initially writing this is that there are a bunch of places that you have to check for errors. Every call to fcntl, the call to connect, the call to getsockopt, the call to select/poll, etc. It may be that some of these functions are erroring out for some reason. Adding error checking/logging may point to some portion that is failing to execute properly.

Also, I'm not sure about the legality/correctness of this, but my original intention was also to clear the O_NONBLOCK flag on the socket after I had determined that the connection was successful. I don't think this would directly affect the problem encountered, but hey, just throwing something else out there.

Another possibility is that the non-blocking connect patch is properly solving the original problem but blitzrage's testing may have uncovered something else wrong. The fact that there are locks held, but there is no contention between threads makes me think that the monitor thread in chan_sip is blocking somewhere, somehow. I would assume that the call to write in ast_tcptls_server_write is now blocking for some reason.

Something else to consider is that pabelanger said that the issue is not present in trunk or 1.6.1. It may be worthwhile to see what fixes were made there with regards to TCP support in chan_sip and see if we can identify which revision fixed this issue.

By: Tilghman Lesher (tilghman) 2008-11-19 00:19:41.000-0600

I've evaluated most of those commands and determined that most of the errors were not relevant or possible, with the given arguments, but it's possible that I might have missed some.

In terms of O_NONBLOCK, the existing code already clears that flag in code following the connect, so it was unnecessary to do that again.  In fact, I'm not really sure why it was there in the first place, as the socket would otherwise block, by default.

It's possible that the write() blocks, but I don't think that's likely, given that poll() indicated that a write() should not block.  As to what exactly is blocking, that can be determined by connecting to Asterisk with gdb and running a 'thread apply all bt'.  I suspect it's still blocking in the connect() for some reason that I can't divine, but testing is needed to confirm that.

By: Paul Belanger (pabelanger) 2008-11-19 00:45:28.000-0600

Sorry for the lack of testing on my part, just got Internet access back.  I'm currently in China for business, and will be back next week.  I'll be able to resume testing.

By: Mark Michelson (mmichelson) 2008-11-19 09:53:32.000-0600

I based my assumption that the write() was blocking because of the locks which are currently held in blitzrage's latest core show locks output. I'm not sure exactly what blitzrage's test setup is, but if there was something which was previously causing the connect() to block, then I would be willing to bet that in his test setup, the same obstruction is causing the write() to also block.

However, if his setup was causing the connect() to block, I would expect that the error returned would have caused the subsequent write() to not occur. Are we certain that the connect() succeeded?

blitzrage, just to clarify, could you describe how you set up your tests for this? Also, did you ever see a console error message like the following:
"Unable to connect <something> to <something else>:<some reason>" ?

By: Leif Madsen (lmadsen) 2008-11-19 11:35:51.000-0600

My test setup includes 3 devices: a Polycom (that has TCP support), Zoiper (which I don't think has TCP support, but seems to respond to the NOTIFY packets regardless...) and X-Lite, which explicitly seems to reject the packets.

I believe I was seeing the error you mentioned when trying X-Lite

(...some time passes as he gets the exact error)

*CLI> [Nov 19 08:29:34] ERROR[26147]: tcptls.c:252 ast_tcptls_client_start: Unable to connect SIP socket to 192.168.128.129:59668: Connection refused

And some more if it is useful...


[Nov 19 08:29:35] ERROR[26147]: chan_sip.c:10543 register_verify: 'UDP' is not a valid transport for '102'. we only use 'TCP'! ending call.
[Nov 19 08:29:35] NOTICE[26147]: chan_sip.c:18293 handle_request_register: Registration from '"X-Lite"<sip:102@192.168.128.50>' failed for '192.168.128.129' - Device not configured to use this transport type


The way I reproduce is a bit tricky...

Basically I block the incoming SIP signalling port (in my case it is 5430, but most devices would be 5060) in the RH-Firewall-1-INPUT chain. Then after I see the UNREACHABLE message, I then block the connection in the OUTPUT chain of iptables. After a little bit of time (60-120 seconds roughly), then I get the lock showing in 'core show locks'.


That seems to be the only way I was able to reliably reproduce it. Sometimes is hit or miss, but that seems to be the most reliable way that I could find.

By: Leif Madsen (lmadsen) 2009-01-20 13:35:16.000-0600

Just thought I'd ping this issue again to see if there is anything we can do to move this issue forward?

By: Paul Belanger (pabelanger) 2009-01-20 13:57:53.000-0600

I can re-test with the latest 1.6.0 / 1.6.1 branches and trunk to see if there is still an issue.

By: Leif Madsen (lmadsen) 2009-01-20 14:30:48.000-0600

Great, please do and let us know if any of them resolve your issue. Thanks!

By: Leif Madsen (lmadsen) 2009-02-02 13:36:00.000-0600

Still an issue?

By: frank koster (notthematrix) 2009-02-09 17:16:03.000-0600

Dont know if it is the same issue but I somethimes get this problem when I get a lockup asterisk completly locks up all channels when I doe a sip show channels
all channals are filled and it can not make new callas anymore
We are using a ht-502 and ht-503 from grandstream and using it in TLS mode with the SRTP branch.
I could reproduce this bug when I disconect power from my ht-503 and bring it to my
niebours reconecting it over HIS connection (difrent provider and public IP) connection both via NAT with nat enabled.
It also seems to happen when a provider changes ip on the fly.
I suspect this tyo happen with an austrailian friend who is connected via bigpond adsl while my asterisk is in europe.
It looks that if I disable session timers
by doing
session-timers=refuse
session-expires=110
session-minse=90
session-refresher=uas
The problem seems to be gone but I only tryed it ones with a ht-502 again doing the running to nighbour trick.
I cant use the patch provided because of SRTP (srtp branch)
But If it is the same bug it looks like tls does not disconect porperly

sip show peers

Power disconnect.
123456789/123456789    (Unspecified)    D   N      0        UNKNOWN

while this is a normal peer in rest when not yet loged in after restart

Normal situation
987654321              (Unspecified)    D   N      5061     UNKNOWN

strange thing is that it defaults to port 5061 while I setted an other default port 443

If this is the same bug I hope this can help since it is a very anoing bug and locks my asterisk too manny times.



By: Paul Belanger (pabelanger) 2009-02-16 12:22:17.000-0600

Sorry for the delay on this... I've been traveling over the last few weeks.

I retested using 1.6.0.5 and trunk, the problem is still there.  I've uploaded the lock and bt against trunk (gdb_trunk_176138.txt and lock_trunk_176138.txt).

I'll hold off on uploading anything with 1.6.0 branch until trunk is fixed, or unless somebody asks for it.

PB

By: Leif Madsen (lmadsen) 2009-02-19 09:28:28.000-0600

Assigned this to Corydon76 to see if the bt or lock files show anything useful. Please reassign should you not be able to move the issue forward. Thanks!

By: Tilghman Lesher (tilghman) 2009-02-19 11:40:03.000-0600

blitzrage:  No, the lock isn't showing anything helpful, and the bt isn't, either.

By: Leif Madsen (lmadsen) 2009-02-19 12:27:13.000-0600

Corydon76: so do you have any idea where to go with this issue?

By: frank koster (notthematrix) 2009-02-19 16:00:29.000-0600

Maby it is a deeper problem and it has to do with the kernel tcp stack and synflood protection.
when a qualify fails it migt retry to connect and the kernel might see this as somekind of synflood attack?
so the kernel might be blocking the asterisk from flooding?
and asterisk locks up whit waiting connections?


sip show channels
Peer             User/ANR    Call ID          Format           Hold     Last Message    Expiry

82.95.212.35     (None)      0bc40c8b3267d88  0x0 (nothing)    No       Init: OPTIONS
89.231.82.39     (None)      3c3b93d04c9274d  0x0 (nothing)    No       Init: OPTIONS
194.221.62.198   (None)      7f439a7143a3110  0x0 (nothing)    No       Init: OPTIONS
89.231.82.45     (None)      40f7adc27820bd4  0x0 (nothing)    No       Init: OPTIONS
195.243.173.117  (None)                       0x0 (nothing)    No
89.231.84.195    (None)      47026f666d03812  0x0 (nothing)    No       Init: OPTIONS
94.208.243.114   (None)      5bf9ba752dc6efe  0x0 (nothing)    No       Init: OPTIONS
89.231.84.45     (None)      251517d403f5617  0x0 (nothing)    No       Init: OPTIONS
94.208.243.114   (None)      702438ee49a701a  0x0 (nothing)    No       Init: OPTIONS
89.231.86.45     (None)      41a9ff0944d5769  0x0 (nothing)    No       Init: OPTIONS
194.109.4.200    (None)      44b0915742ab4d8  0x0 (nothing)    No       Init: OPTIONS
89.231.83.45     (None)      4c13d69819c4225  0x0 (nothing)    No       Init: OPTIONS
191.243.177.117  (None)      4c06c96e2eab6b8  0x0 (nothing)    No       Init: OPTIONS
13 active SIP dialogs

sip show peers

Name/username              Host            Dyn Nat ACL Port     Status
31108920309                (Unspecified)    D   N      5061     UNKNOWN
37418080095/37418080095    94.208.243.114   D   N      5060     UNKNOWN
37418080642                (Unspecified)    D   N      5061     UNKNOWN
37418080966                (Unspecified)    D   N      5061     UNKNOWN
37418080967                (Unspecified)    D   N      5061     UNKNOWN
41852788101                (Unspecified)    D   N      5061     UNKNOWN
41852788102/41852788102    194.109.4.200    D   N      5894     UNKNOWN
41852788103                (Unspecified)    D   N      5061     UNKNOWN
41852788104                (Unspecified)    D   N      5061     UNKNOWN
41852788105/41852788105    82.95.212.35     D   N      62421    UNKNOWN
41852788111                (Unspecified)    D   N      5061     UNKNOWN
41852788115                (Unspecified)    D   N      5060     UNKNOWN
41852788116                (Unspecified)    D   N      5061     UNKNOWN
39710615403/39710615403    94.208.243.114   D   N      5060     UNKNOWN
43961124284                (Unspecified)    D   N      5060     UNKNOWN
560256                     195.243.173.117             5060     UNKNOWN
61889214119                (Unspecified)    D   N      5060     UNKNOWN
89.231.82.39               89.231.82.39                5060     UNKNOWN
89.231.82.45               89.231.82.45                5060     UNKNOWN
89.231.83.45               89.231.83.45                5060     UNKNOWN
89.231.84.195              89.231.84.195               5060     UNKNOWN
89.231.84.45               89.231.84.45                5060     UNKNOWN
89.231.86.45               89.231.86.45                5060     UNKNOWN
coscdxbads/coscdxbrads     199.221.69.128              5060     UNKNOWN
24 sip peers [Monitored: 0 online, 24 offline Unmonitored: 0 online, 0 offline]

maby it helps it is a very anoying bug.



By: Leif Madsen (lmadsen) 2009-02-27 16:37:09.000-0600

Setting this status to new as we're stalled.

By: frank koster (notthematrix) 2009-02-27 16:46:55.000-0600

Recently This bug did not acounter anymore.
I changed 3 things.

1 set session timers to

session-timers=refuse
session-expires=110
session-minse=90
session-refresher=uas

2 set my grandstream ht-503 devices so they dont restart evryday when checking for new firmware.
it wil be off air for 15 sec and then reconects.

3 I chached qualify to 5000

I noteced that when qualify response gets over 1999 it wil get (too lagged)
Is it not that the default in qualify=yes valuw is 2000?
maby it conflicts whith (too lagged) in some way?

atleast one of theese things must be related to it.
maby this info helps



By: Miguel Molina (coolmig) 2009-06-24 09:07:13

A few questions about this:

1. Does this happen on SIP through UDP or only on TCP mode?
2. Setting qualify = no for the peers is a valid workaround?
3. What about the notthematrix workarounds, are they valid? (pabelanger tests could be useful)

By: David Vossel (dvossel) 2009-08-20 13:34:48

can we get an update on this issue?

By: Dan Radio (whys) 2009-09-02 15:19:10

Coolmig:
1. ???
2. No. Qualify = no, only hides the (my) problem until a call is placed.
3. I tried notthematrix workarounds. They did nothing for me.

I am not sure if this is the same problem. For what it's worth, I make a UDP sip channel to an IP that does not exist (doesn't answer)  The call takes 33 seconds to timeout. If it's a TCP channel it takes 190 seconds to timeout.  If it dial two extensions, where one answers and the other one (as above) can not (tcp), the call never bridges.

This behavior is the same on branches 1.6.0, 1.6.1, 1.6.2

By: frank koster (notthematrix) 2009-09-02 15:54:24

What kind of voip hardware doe you use?

By: Dan Radio (whys) 2009-09-03 10:44:46

Notthematrix: currently testing on a centos 5.3,  2.6.18-128.2.1.el5, on a VM on a hefty underutilized ESX server. (Very snappy!) - Also mainly testing Cisco VOIP phones 7960's



By: David Vossel (dvossel) 2009-09-28 16:12:46

here is what is going on... The TCP connection setup is done while it holds the monitor thread.  If the connect() function in ast_tcptls_client_start does not timeout, the monitor lock may be held indefinitely.  There is a patch for this up on reviewboard, https://reviewboard.asterisk.org/r/380/.


--------- This thread is stuck at the connect() function while holding the monitor lock -----
#0  0xb7f0f410 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7e91e18 in connect () from /lib/tls/i686/cmov/libc.so.6
No symbol table info available.
#2  0x08153933 in ast_tcptls_client_start (desc=0xb6bdb910) at tcptls.c:301
flags = 135607986
ser = (struct ast_tcptls_session_instance *) 0x0
ssu = {tid = 3065899920, schedid = -1}
__PRETTY_FUNCTION__ = "ast_tcptls_client_start"
__FUNCTION__ = "ast_tcptls_client_start"
#3  0xb6c482df in sip_prepare_socket (p=0x8319f28) at chan_sip.c:18733

By: Dan Radio (whys) 2009-09-30 12:48:03

Kudos to dvossel!  This fix now allows a trunk to connect to Microsoft Exchange Unified messaging, with it's stealth firewall on. (need to dial two extensions simultaneously, knowing that only one will answer. Previously this call blocked waiting for both extensions to respond, but now it seems to work flawlessly. )

I can't speak for all the issues above, but my tests work.

By: David Vossel (dvossel) 2009-09-30 13:52:15

Awesome, I can see how that should work now, but it is still not perfect. Connections that never timeout still never go away, they are just moved into a separate thread.  This is at least a step in the right direction though.  I plan on following up with a separate patch to address this.

By: Digium Subversion (svnbot) 2009-10-22 15:00:20

Repository: asterisk
Revision: 225445

U   trunk/apps/app_externalivr.c
U   trunk/channels/chan_sip.c
U   trunk/include/asterisk/tcptls.h
U   trunk/main/tcptls.c

------------------------------------------------------------------------
r225445 | dvossel | 2009-10-22 15:00:19 -0500 (Thu, 22 Oct 2009) | 50 lines

SIP TCP/TLS: move client connection setup/write into tcp helper thread, various related locking/memory fixes.

       What this patch fixes
1.Moves sip TCP/TLS connection setup into the TCP helper thread:
 Connection setup takes awhile and before this it was being
 done while holding the monitor lock.
2.Moves TCP/TLS writing to the TCP helper thread:  Through the
 use of a packet queue and an alert pipe, the TCP helper thread
 can now be woken up to write data as well as read data.
3.Locking error: sip_xmit returned an XMIT_ERROR without giving
 up the tcptls_session lock.  This lock has been completely removed
 from sip_xmit and placed in the new sip_tcptls_write() function.
4.Memory leak:  When creating a tcptls_client the tls_cfg was alloced
 but never freed unless the tcptls_session failed to start.  Now the
 session_args for a sip client are an ao2 object which frees the
 tls_cfg on destruction.
5.Pointer to stack variable: During sip_prepare_socket the creation
 of a client's ast_tcptls_session_args was done on the stack and
 stored as a pointer in the newly created tcptls_session.  Depending
 on the events that followed, there was a slight possibility that
 pointer could have been accessed after the stack returned.  Given
 the new changes, it is always accessed after the stack returns
 which is why I found it.

Notable code changes
1.I broke tcptls.c's ast_tcptls_client_start() function into two
 functions.  One for creating and allocating the new tcptls_session,
 and a separate one for starting and handling the new connection.
 This allowed me to create the tcptls_session, launch the helper
 thread, and then establish the connection within the helper thread.
2.Writes to a tcptls_session are now done within the helper thread.
 This is done by using an alert pipe to wake up the thread if new
 data needs to be sent.  The thread's sip_threadinfo object contains
 the alert pipe as well as the packet queue.
3.Since the threadinfo object contains the alert pipe, it must now be
 accessed outside of the helper thread for every write (queuing of a
 packet).  For easy lookup, I moved the threadinfo objects from a
 linked list to an ao2_container.

(closes issue ASTERISK-12433)
Reported by: pabelanger
Tested by: dvossel, whys

(closes issue ASTERISK-14765)
Reported by: dvossel
Tested by: dvossel

Review: https://reviewboard.asterisk.org/r/380/


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=225445

By: Digium Subversion (svnbot) 2009-10-22 17:00:16

Repository: asterisk
Revision: 225489

_U  branches/1.6.2/
U   branches/1.6.2/apps/app_externalivr.c
U   branches/1.6.2/channels/chan_sip.c
U   branches/1.6.2/include/asterisk/tcptls.h
U   branches/1.6.2/main/tcptls.c

------------------------------------------------------------------------
r225489 | dvossel | 2009-10-22 17:00:16 -0500 (Thu, 22 Oct 2009) | 56 lines

Merged revisions 225445 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r225445 | dvossel | 2009-10-22 14:55:51 -0500 (Thu, 22 Oct 2009) | 50 lines
 
 SIP TCP/TLS: move client connection setup/write into tcp helper thread, various related locking/memory fixes.
 
         What this patch fixes
 1.Moves sip TCP/TLS connection setup into the TCP helper thread:
   Connection setup takes awhile and before this it was being
   done while holding the monitor lock.
 2.Moves TCP/TLS writing to the TCP helper thread:  Through the
   use of a packet queue and an alert pipe, the TCP helper thread
   can now be woken up to write data as well as read data.
 3.Locking error: sip_xmit returned an XMIT_ERROR without giving
   up the tcptls_session lock.  This lock has been completely removed
   from sip_xmit and placed in the new sip_tcptls_write() function.
 4.Memory leak:  When creating a tcptls_client the tls_cfg was alloced
   but never freed unless the tcptls_session failed to start.  Now the
   session_args for a sip client are an ao2 object which frees the
   tls_cfg on destruction.
 5.Pointer to stack variable: During sip_prepare_socket the creation
   of a client's ast_tcptls_session_args was done on the stack and
   stored as a pointer in the newly created tcptls_session.  Depending
   on the events that followed, there was a slight possibility that
   pointer could have been accessed after the stack returned.  Given
   the new changes, it is always accessed after the stack returns
   which is why I found it.
 
 Notable code changes
 1.I broke tcptls.c's ast_tcptls_client_start() function into two
   functions.  One for creating and allocating the new tcptls_session,
   and a separate one for starting and handling the new connection.
   This allowed me to create the tcptls_session, launch the helper
   thread, and then establish the connection within the helper thread.
 2.Writes to a tcptls_session are now done within the helper thread.
   This is done by using an alert pipe to wake up the thread if new
   data needs to be sent.  The thread's sip_threadinfo object contains
   the alert pipe as well as the packet queue.
 3.Since the threadinfo object contains the alert pipe, it must now be
   accessed outside of the helper thread for every write (queuing of a
   packet).  For easy lookup, I moved the threadinfo objects from a
   linked list to an ao2_container.
 
 (closes issue ASTERISK-12433)
 Reported by: pabelanger
 Tested by: dvossel, whys
 
 (closes issue ASTERISK-14765)
 Reported by: dvossel
 Tested by: dvossel
 
 Review: https://reviewboard.asterisk.org/r/380/
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=225489

By: Digium Subversion (svnbot) 2009-10-22 17:11:33

Repository: asterisk
Revision: 225490

_U  branches/1.6.1/
U   branches/1.6.1/apps/app_externalivr.c
U   branches/1.6.1/channels/chan_sip.c
U   branches/1.6.1/include/asterisk/tcptls.h
U   branches/1.6.1/main/tcptls.c

------------------------------------------------------------------------
r225490 | dvossel | 2009-10-22 17:11:33 -0500 (Thu, 22 Oct 2009) | 56 lines

Merged revisions 225445 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r225445 | dvossel | 2009-10-22 14:55:51 -0500 (Thu, 22 Oct 2009) | 50 lines
 
 SIP TCP/TLS: move client connection setup/write into tcp helper thread, various related locking/memory fixes.
 
         What this patch fixes
 1.Moves sip TCP/TLS connection setup into the TCP helper thread:
   Connection setup takes awhile and before this it was being
   done while holding the monitor lock.
 2.Moves TCP/TLS writing to the TCP helper thread:  Through the
   use of a packet queue and an alert pipe, the TCP helper thread
   can now be woken up to write data as well as read data.
 3.Locking error: sip_xmit returned an XMIT_ERROR without giving
   up the tcptls_session lock.  This lock has been completely removed
   from sip_xmit and placed in the new sip_tcptls_write() function.
 4.Memory leak:  When creating a tcptls_client the tls_cfg was alloced
   but never freed unless the tcptls_session failed to start.  Now the
   session_args for a sip client are an ao2 object which frees the
   tls_cfg on destruction.
 5.Pointer to stack variable: During sip_prepare_socket the creation
   of a client's ast_tcptls_session_args was done on the stack and
   stored as a pointer in the newly created tcptls_session.  Depending
   on the events that followed, there was a slight possibility that
   pointer could have been accessed after the stack returned.  Given
   the new changes, it is always accessed after the stack returns
   which is why I found it.
 
 Notable code changes
 1.I broke tcptls.c's ast_tcptls_client_start() function into two
   functions.  One for creating and allocating the new tcptls_session,
   and a separate one for starting and handling the new connection.
   This allowed me to create the tcptls_session, launch the helper
   thread, and then establish the connection within the helper thread.
 2.Writes to a tcptls_session are now done within the helper thread.
   This is done by using an alert pipe to wake up the thread if new
   data needs to be sent.  The thread's sip_threadinfo object contains
   the alert pipe as well as the packet queue.
 3.Since the threadinfo object contains the alert pipe, it must now be
   accessed outside of the helper thread for every write (queuing of a
   packet).  For easy lookup, I moved the threadinfo objects from a
   linked list to an ao2_container.
 
 (closes issue ASTERISK-12433)
 Reported by: pabelanger
 Tested by: dvossel, whys
 
 (closes issue ASTERISK-14765)
 Reported by: dvossel
 Tested by: dvossel
 
 Review: https://reviewboard.asterisk.org/r/380/
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=225490