|Summary:||ASTERISK-05757: [patch] Using qualify=yes - connection eventually reports UNREACHABLE and never recovers|
|Reporter:||Paul Hewlett (paulhewlett)||Labels:|
|Date Opened:||2005-12-02 05:11:57.000-0600||Date Closed:||2011-06-07 14:02:47|
|Environment:||Attachments:||( 0) 20051202__bug5912.diff.txt|
|Description:||[This is my first bug report] I am currently running Asterisk at 5 sites, each connected to the Internet via ADSL. The aim is to assess the feasibility of using IAX to connect PBXs in South Africa. I am using qualify=yes (and not authenticated registration) so that the Asterisk servers POKE each other every minute. The problem is that after some time the 'iax2 show peers' command shows a remote Asterisk site as UNREACHABLE, and it never recovers. Eventually all remote sites become UNREACHABLE. I instrumented chan_iax2.c extensively, adding a new debugging command so that I could switch my diagnostics on or off, and this revealed nothing. The only bug I found is that failure of the POKE command is erroneously handled in 2 places: iax2_poke_noanswer() and attempt_transmit(). Fixing this had no effect: the sendto() call reported that the message was sent, but it never reached the destination. I was unable to track the packets with Ethereal and the like and was (after 2 weeks) suffering borderline insanity when I noticed that chan_iax2.c uses the rand() function. On Linux this function is not threadsafe, so I replaced it with rand_r(). Amazingly, my IAX interconnections are now reliable, whereas before I could guarantee that they would die within 12 hours.|
I have added code that transmits the DynDNS name allocated to each site with the POKE command, and now the IAX interconnections stay REACHABLE even when the ISP changes the IP address, without any DNS lookups.
Now, can anyone explain to me why using rand_r() should fix my original problem?
****** ADDITIONAL INFORMATION ******
There is a bug 5712 about the use of rand() in chan_sip.c. The fix is to use random() which uses an internal mutex. rand_r() may possibly be a better solution as it does not require this overhead.
The Open Group standard states that rand() need not be reentrant or threadsafe. The Linux man page states that it is not threadsafe.
Eventually I will repeat this whole exercise using the latest version of Asterisk - for now I am using 1.0.9.
The five sites will be expanding to eight in the near future, so this may be an opportunity to do some testing.
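For readers following along, the rand() to rand_r() substitution described above can be sketched roughly as below. This is only an illustration under assumptions, not the code from the attached diff: the helper name per_thread_rand() and the seeding scheme are hypothetical, and each thread is given its own seed so that no generator state is shared.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper (not from the attached patch): every thread owns
 * its own seed, so concurrent callers never touch the same state. */
static _Thread_local unsigned int thread_seed;
static _Thread_local int thread_seeded;

int per_thread_rand(void)
{
    if (!thread_seeded) {
        /* Assumed seeding scheme: mix the clock with the address of the
         * thread-local variable so threads started together differ. */
        thread_seed = (unsigned int) time(NULL)
                    ^ (unsigned int) (uintptr_t) &thread_seed;
        thread_seeded = 1;
    }
    /* rand_r() keeps its state in the caller-supplied seed. */
    return rand_r(&thread_seed);
}
```

A call site that previously used rand() would call per_thread_rand() instead; the return value is in the same 0..RAND_MAX range.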
|Comments:||By: Tilghman Lesher (tilghman) 2005-12-02 10:58:11.000-0600|
Solution adds ast_random() API to utils.c. Needs exception for merging to 1.2.
By: Tilghman Lesher (tilghman) 2005-12-02 11:01:25.000-0600
The problem with using rand_r is that we still need a mutex to protect the prior result, otherwise two competing threads could use the same prior result and return exactly the same random number. Protecting the internal value with a mutex is the only practical method to resolve this race.
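The race described in the comment above, and the mutex that closes it, might look roughly like this. It is a minimal sketch assuming a single seed shared by all threads; the name locked_rand() is hypothetical and this is not the ast_random() implementation from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>

/* One seed shared by every thread. Without the lock, two threads could
 * read the same prior value and get the same "random" number back. */
static unsigned int shared_seed = 1;
static pthread_mutex_t seed_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper, for illustration only. */
int locked_rand(void)
{
    int res;

    pthread_mutex_lock(&seed_lock);
    res = rand_r(&shared_seed);   /* read and update the seed atomically */
    pthread_mutex_unlock(&seed_lock);
    return res;
}
```

The lock is held only for the duration of one rand_r() call, so contention should stay low unless random numbers are requested very frequently.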
By: Tilghman Lesher (tilghman) 2005-12-02 11:04:09.000-0600
And if that's not concurrent enough for you, we could also go with a pool of different random sequences, each protected with a mutex, and the pool selected with a series of mutexes, each attempted with ast_mutex_trylock(). This is unlikely to be necessary, unless you have a much more concurrent need for random number generation and the single mutex is slowing you down.
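The pool idea from the comment above could be sketched along these lines. Everything here is an assumption made for illustration: the pool size, the fixed seeds, the name pooled_rand(), and the use of plain pthread_mutex_trylock() in place of Asterisk's ast_mutex_trylock() wrapper.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>

#define POOL_SIZE 4   /* assumed size, chosen only for the example */

/* Each slot is an independent random sequence with its own lock. */
struct rand_slot {
    pthread_mutex_t lock;
    unsigned int seed;
};

static struct rand_slot pool[POOL_SIZE] = {
    { PTHREAD_MUTEX_INITIALIZER, 1 },
    { PTHREAD_MUTEX_INITIALIZER, 2 },
    { PTHREAD_MUTEX_INITIALIZER, 3 },
    { PTHREAD_MUTEX_INITIALIZER, 4 },
};

/* Hypothetical function name, for illustration only. */
int pooled_rand(void)
{
    int i, res;

    /* Use the first slot whose lock can be taken without blocking. */
    for (i = 0; i < POOL_SIZE; i++) {
        if (pthread_mutex_trylock(&pool[i].lock) == 0) {
            res = rand_r(&pool[i].seed);
            pthread_mutex_unlock(&pool[i].lock);
            return res;
        }
    }

    /* Every slot was busy: wait on the first one instead of spinning. */
    pthread_mutex_lock(&pool[0].lock);
    res = rand_r(&pool[0].seed);
    pthread_mutex_unlock(&pool[0].lock);
    return res;
}
```

As the comment notes, this extra machinery is only worth having if the single mutex turns out to be a measurable bottleneck.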
By: Kevin P. Fleming (kpfleming) 2005-12-12 22:12:16.000-0600
I don't see how this patch (or your proposed solution) touches any code paths involved in peer reachability checking. Can you duplicate this problem with 1.2.1 and then provide an 'iax2 debug' of the ping/poke packets being sent out to the peer who is UNREACHABLE?
By: Kevin P. Fleming (kpfleming) 2005-12-13 11:20:27.000-0600
A significant bug in timestamp handling was just corrected today in SVN trunk and branch/1.2. Please retest with that fix and let us know if you still experience this problem.
By: Tilghman Lesher (tilghman) 2005-12-27 00:33:56.000-0600
paulhewlett: we need a response from you to proceed
By: Paul Hewlett (paulhewlett) 2005-12-27 07:16:19.000-0600
Thanks for the feedback. I am currently building a Gentoo box on which I will install the latest Asterisk (1.2.1 and/or HEAD). At that point I will reimplement those changes and report back.
By: Tilghman Lesher (tilghman) 2006-01-09 17:37:58.000-0600
Reopen if the timestamp change did not fix your issue.
By: Clod Patry (junky) 2006-01-12 00:30:58.000-0600
Cory: since that one has not been committed and you made a link to ASTERISK-5806, do you still want it instead of random()?