ASTERISK-16417: user get unreachable after some minutes, deadlock in ao2

[Home]

Summary: ASTERISK-16417: user get unreachable after some minutes, deadlock in ao2_lock

Reporter: Stefan Schmidt (schmidts) Labels:

Date Opened: 2010-07-24 17:30:10 Date Closed: 2010-09-30 09:13:17

Priority: Minor Regression? No

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) backtrace.txt
( 1) backtrace4times.txt
( 2) threads+locks.txt

Description: i´ve tried the second time to upgrade from 1.2.40 to 1.6.2.9 / 10 and after 10 to 15 minutes it looks like asterisk couldnt send any sip packets out.
i thought that the last revision r278465 could also solve this problem so i´ve tried 1.6.2.10 today but i got the same problem.

i have around 3100 peers in static config files and normally around 2100 are registered and reachable in asterisk. there are also around 1300 hints in the extensions.conf and more than 1300 subcribes on that hint.
when i restart asterisk, the peers get the qualify packet and are reachable but after some time, i see more and more of these messages:
== Extension Changed userB[outcust] new state Idle for Notify User userA (queued)
also the phones cant reach asterisk any more and show they are offline, and in asterisk i can see them as unreachable.
with 1.6.2.9 after 1 hour only 100 phones has keep rechable and in 1.6.2.10 only 1000 after 30 mins.

i can also see that their are many (>500) open sip channels which i dont see in 1.2.40
this server is a hosted pbx solution for our customers and when i´ve tested this there is no phone traffic on the server just registrations and subscribes.
with load this server handles > 200 concurrent calls without any problems.

****** ADDITIONAL INFORMATION ******

i have patched asterisk pending on revision 278617 of main/channel.c row 1258 setting _prev=c; instead of prev=c; but this doesnt help either.

backtrace as suggested in doc/backtrace is attached.

Comments: By: rsw686 (rsw686) 2010-07-25 14:22:34

The scheduler was changed for the 1.6.2 branch (https://reviewboard.asterisk.org/r/160/). It had a few issues with the heap structure that were fixed for 1.6.2.8 (https://issues.asterisk.org/view.php?id=17277). I had problems with phones not working in 1.6.2, but the issues were resolved with 1.6.2.8. You might try 1.6.1.20 out of curiosity to see if it is the scheduler.
By: Stefan Schmidt (schmidts) 2010-07-25 16:08:49

i´ve allready tried 1.6.1.20 one month ago but had to get back to 1.2.40, cause i´ve hit several bugs which allready are solved in 1.6.2 branch (like infinite loop in udptl38 and others)
but you are right, in 1.6.1.20 it didnt had this problems the system runned about 8 hours without any problem like this, and i had to go back cause of some deadlock problems which started when daily traffic starts.
By: Stefan Schmidt (schmidts) 2010-07-26 08:32:32

i can reproduce this problem with the following setup:

server A with asterisk 1.6.2.10 and sip.conf of 2500 friends
server B with asterisk 1.6.2.10 which register these 2500 peers to Server A

when server A runs and i start asterisk on server B i see that all of the 2500 users come online and stay reachable until i do a sip reload on server B. Then users starting getting unreachable until only 500 of the 2500 are online. until server B finish the registration server a also held some locks. When its finish everything again is wealthy.

i´ve done 4 backtraces, first when all is ok, second and third during the problem and fourth after everything is ok, again.

have a look at the attached file, this thread is the only one which changes and also has a look situation.
By: Stefan Schmidt (schmidts) 2010-07-26 09:34:25

i am now trying to reproduce it on 1.8.0 beta and it looks like this problem exists in 1.8 too
By: Leif Madsen (lmadsen) 2010-07-29 14:46:10

Could you also provide output of 'core show locks' when this happens? Please enable DEBUG_THREADS in the Compiler Flags menu of menuselect then recompile. Thanks!
By: Leif Madsen (lmadsen) 2010-07-29 15:48:11

Based on some comments from jtodd, I'm marking this as loosely related to ASTERISK-16455
By: Stefan Schmidt (schmidts) 2010-07-30 01:58:49

in the backtrace.txt at the bottow there is allready a core show locks output short after doing the gdb, but i´ve attached a new txt file with core show locks + core show threads.

cause of this problem i´ve started my investigation in the scheduler structure and found this "problem" which is described in 17741. When this happens sending notify packages, makes it even worser but its not a fix for this problem at all.

i´ve allready tried the patch from 17610 but with no improvment of the situation.
By: Stefan Schmidt (schmidts) 2010-08-24 02:33:25

i´found that on my test system (which is old and slow) the astdb_put tooks up to 40ms which cause that peers got unreachable. on another test server which is faster this doesnt happens. Also when i deactivate the astdb->sync in astdb_put everything runs fine on the test system.
i will do another run on the production system to see if i could find the problem i´ve there.
i think the ao2_callback for dialog_needdestroy which could took up to 2 ms, could slow down the system long enough to cause this problems if it runs everytime in the do_monitor loop. i´ve allready written a patch for this, and after i´ve tried this on our production system i will submit this patch here.
By: Russell Bryant (russell) 2010-09-14 15:52:57

Now that we put in the astdb improvement, is this still an issue we should keep open?
By: Stefan Schmidt (schmidts) 2010-09-15 02:36:57

the issue itself could be closed.

there are 2 reviews open which are connected to this also, but the issue itself is solved, cause i know why this happens.

thank you russell.