ASTERISK-18760: Deadlock in SVN 1.8 version 342359

[Home]

Summary: ASTERISK-18760: Deadlock in SVN 1.8 version 342359

Reporter: David Brillert (aragon) Labels: Deadlock

Date Opened: 2011-10-27 11:00:41 Date Closed: 2011-11-08 13:47:52.000-0600

Priority: Minor Regression? No

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: SVN 1.8.8.0 Frequency of
Occurrence Occasional

Related
Issues:
must be completed before resolving ASTERISK-18499 Asterisk 1.8.8.0 Release Blockers

must be completed before resolving ASTERISK-18608 Asterisk 10.0.0-rc1 Blockers

is related to ASTERISK-18747 Deadlock in chan_sip / event on send mwi / unsubscribe

is related to ASTERISK-18663 BLF Subscriptions Causes SIP Deadlock

Environment: CentOS 5.7 Attachments: ( 0) chan_sip_mwi_blf_patch_1_8.diff
( 1) messages
( 2) patched_1.8_ngrep_5060.txt
( 3) rev342539_deadlock.txt
( 4) rev342539resdahdi_deadlock.txt
( 5) sip.conf
( 6) sip_set_debug_on_1.txt
( 7) sip_show_channels.txt

Description: All SIP peers become unreachable during deadlock.
All core show locks and gdb pid backtraces in attachment.
Value looks optimized out but Asterisk was built with debug options enabled.

System was not locking in 1.8.7

Comments: By: David Brillert (aragon) 2011-10-27 11:01:30.877-0500

core show locks and gdb output
By: David Brillert (aragon) 2011-10-27 11:31:35.890-0500

Two deadlocks in 30 minutes; 1.8 SVN is very dangerous.
I have had to downgrade to 1.8.7

By: Leif Madsen (lmadsen) 2011-10-31 13:15:53.077-0500

Looks like you're using res_timing_pthread. You should avoid use of that module. Please disable res_timing_pthread and enable res_timing_dahdi.
By: Leif Madsen (lmadsen) 2011-10-31 13:18:31.039-0500

Requesting replacement of res_timing_pthread with res_timing_dahdi and a re-test.
By: David Brillert (aragon) 2011-11-01 08:29:49.067-0500

I removed the pthread timing module and confirm the usage of res_timing_dahdi.
*CLI> module show like res_timing_dahdi.so
Module Description Use Count
res_timing_dahdi.so DAHDI Timing Interface 1
1 modules loaded

Now a new deadlock with 15 minutes on idle system.

bt attached
By: David Brillert (aragon) 2011-11-01 08:39:25.132-0500

gdb thread apply all bt attachment
By: David Brillert (aragon) 2011-11-01 08:46:53.970-0500

*CLI> module show like res_timing_pthread.so
Module Description Use Count
0 modules loaded

By: David Vossel (dvossel) 2011-11-01 11:06:01.003-0500

I don't see a deadlock here. No threads are waiting for a lock in the "core show locks" output. The monitor lock is held, but there is no contention for that lock taking place. Based on the information provided, it is not obvious what the problem is.

What are the symptoms of this problem? Are calls no longer getting routed? You mentioned that this takes place on a idle system. If the system is idle how is this deadlock detected? Are the sip qualify messages failing resulting in the peer becoming unreachable or is Asterisk completely un-responsive?
By: David Brillert (aragon) 2011-11-01 11:14:05.500-0500

Each time I upgrade this site from 1.8.7.0 to 1.8.8.0 (SVN) all of the SIP peers start becoming unreachable until none are registered.
The CLI is still usable.
core show locks command continues to show the same lock info each time it is entered.
By idle I mean I that no one had come into work yet to use the system and no SIP channels are up.
Luckily the site is in another time zone so I have been able to my 1.8.8.0 testing before the system gets used.
By: Matt Jordan (mjordan) 2011-11-01 14:00:21.120-0500

Can you attach the output of a sip show channels while the behavior is occurring? Are there any CLI notices, warnings, or errors that occur while this is happening?
By: Matt Jordan (mjordan) 2011-11-01 14:23:20.623-0500

While I'm thinking about it, can you also provide the configuration of one (or more) of your sip peers that goes unreachable?
By: David Brillert (aragon) 2011-11-01 14:32:43.665-0500

Since I have been trying to debug some segfaults on 1.8.7.0 I installed this 1.8.8.0 SVN version with DEBUG_THREADS and DONT_OPTIMIZE options. However this seems to be very expensive and I believe is either the root cause of this problem or a great contributor. I'm shocked at how much these options slow down Asterisk. Can I get away with DONT_OPTIMIZE and not check DEBUG_THREADS and still get access to good segfault data and core show locks output?
I used to be able to compile with both of these options and still run in production but now it appears this is not possible :(
By: David Brillert (aragon) 2011-11-01 14:35:17.196-0500

Matt Jordan - there were no weird notices, warning, or errors during the registration breakdown. Debug options were not enabled during my maintenance/test window.
By: David Brillert (aragon) 2011-11-01 14:39:27.345-0500

For example with DEBUG_THREAD and DONT_OPTIMIZE I get CLI output during a small paging application that I don't see without those options:
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a
[2011-11-01 15:36:43] WARNING[15051]: app_meetme.c:3678 conf_run: Unable to write frame to channel SIP/6057-0000001a

and

[2011-11-01 15:36:39] WARNING[15006]: channel.c:1475 __ast_queue_frame: Exceptionally long voice queue length queuing to Local/6002@default-local-paging-9c7d;1

By: Matt Jordan (mjordan) 2011-11-01 14:49:03.540-0500

aragon - with regards to DEBUG_THREADS and DONT_OPTIMIZE - no, they're pretty much needed if you want a useful 'core show locks' and backtrace. You can get away without DEBUG_THREADS, maybe, but I'd keep both of them in. While they might slow things down a bit, I don't think the issue you're seeing has anything to do with these two options. Your later problems of frames not being able to be written to SIP channels and frames getting left on the queue in channel.c indicate much larger problems.

The WARNING messages on the CLI indicate that something is preventing those channel's queues from being serviced, so they're more of a symptom of the problem rather than whats actually causing it. It would be very helpful to get an idea as to how your SIP peers are set up and what is going on in Asterisk with regards to them during start up. To that end, during your next maintenance / test window, your sip.conf and a debug log are probably going to be needed. If you could also captures 'SIP show channels' when this is occurring, that would be really helpful. In the meantime, we could at least peak at your sip.conf so we can try to configure our test setup here to mimic those settings and try to reproduce it.

Thanks!
By: David Brillert (aragon) 2011-11-01 15:32:24.861-0500

scrubbed sip.conf with bogus secrets
By: Matt Jordan (mjordan) 2011-11-01 16:34:28.342-0500

No luck reproducing so far. I'm using a configuration based on your provided sip.conf and can't reproduce the errors you're seeing. I think we'll have to get a debug log the next time you're able.
By: David Brillert (aragon) 2011-11-02 08:12:26.287-0500

I was logged into CLI this morning and put back 1.8.8.0 SVN with debug options.
DEBUG was set to file level so I am attaching messages file.
CLI managed to capture sip show channels and core show locks and sip show peers.
Uploading.

Also phones are Polycoms and reception monitors all BLF
By: Matt Jordan (mjordan) 2011-11-02 16:00:41.479-0500

Thanks for all of the info. We've been looking through it - is the mwi_from field set in sip.conf?
By: Matt Jordan (mjordan) 2011-11-02 16:44:53.780-0500

A couple of other things to try when you next get a chance.

1. When you first noticed the problem with 1.8.8, did you have DEBUG_THREADS and DONT_OPTIMIZE turned on? Do you still have the problem if you have both of those turned off?

2. From your logs, the responses back for the OPTIONS packet from the phones are exceeding the qualify time set for the peers. What happens if the qualify time is set to something around 4000 ms (or greater)?

3. Are you using outbound MWI subscriptions at all in sip.conf, e.g., mwi => 1234:password@otherserver.com/1234?

If you do get another chance to do a run, can you try to put Wireshark on the network and capture a SIP call flow? If that's not possible, a log with SIP DEBUG ON would also show some of the SIP packets going back and forth. We have some working theories on whats going on, but we haven't nailed down the culprit yet.
By: David Brillert (aragon) 2011-11-03 09:55:06.758-0500

1. When you first noticed the problem with 1.8.8, did you have DEBUG_THREADS and DONT_OPTIMIZE turned on? Do you still have the problem if you have both of those turned off?
-I think I saw the problem without those options in the beginning but I am not 100% certain.

2. From your logs, the responses back for the OPTIONS packet from the phones are exceeding the qualify time set for the peers. What happens if the qualify time is set to something around 4000 ms (or greater)?
-I don't want to mess with qualify settings right now although I see where you are going with this.

3. Are you using outbound MWI subscriptions at all in sip.conf, e.g., mwi => 1234:password@otherserver.com/1234?
-No (full sip.conf is attached to report)

If you do get another chance to do a run, can you try to put Wireshark on the network and capture a SIP call flow? If that's not possible, a log with SIP DEBUG ON would also show some of the SIP packets going back and forth. We have some working theories on whats going on, but we haven't nailed down the culprit yet.
-cli with debug and sip set debug on attached 'sip set debug on 1.txt'
By: David Brillert (aragon) 2011-11-03 09:56:32.039-0500

debug with SIP output file attached. Domain and names are scrubbed.
By: Matt Jordan (mjordan) 2011-11-03 17:11:01.472-0500

Attaching potential fix for BLF / MWI subscription issues
By: Matt Jordan (mjordan) 2011-11-03 17:11:12.890-0500

aragon -

Thanks for attaching the sip output - that pointed us in some possibly useful directions. Attached is a patch that may resolve the situation you're seeing. Hopefully it will resolve the issue outright - but if not, can you get a pcap trace (or tcpdump) showing the SIP traffic the next time you test? That might actually be more useful then the sip debug output - the sheer number of BLF subscriptions (with the NOTIFY packets flying around) can make it tough to isolate exactly what's happened with one of the sequence of dialog transactions. Plus, the pcap will also rule out some interference on Asterisk's part on exactly when the packets arrive.

Thanks!

Matt
By: David Brillert (aragon) 2011-11-04 09:13:14.324-0500

I can't do pcap since the server is across the country and I have no immediate access.
I did test the patch.
The unreachable SIP peers are much less frequent but can be reproduced on a full reload.
Or when I mass reboot the Polycom phones using check-sync phones become unreachable and then become reachable.
ngrep trace attached of SIP traffic.
By: David Brillert (aragon) 2011-11-04 09:14:09.277-0500

patched 1.8 ngrep 5060.txt is SIP trace with .diff
By: Matt Jordan (mjordan) 2011-11-07 15:19:27.030-0600

We've updated the 1.8 branch with some patches that may help you. Please test the latest 1.8 doing the following:
1. Disable DEBUG_THREADS
2. Disable DONT_OPTIMIZE
3. Make sure SIP DEBUG is off

If there is still a problem, please also test with the current tagged version of 1.8.7.

You should also disable the serverMissedCall setting in your Polycom configuration files. Setting serverMissedCall.x.enabled to a non-zero value is not supported by Asterisk (this was a setting designed by Polycom for a different server) - it will send out SUBSCRIBE packets for an event type of 'missed-call-summary', which is not understood by Asterisk. It isn't going to cause a problem per se, but is additional network traffic that isn't needed.

By: David Brillert (aragon) 2011-11-08 09:01:12.546-0600

It will take a little time to test the 1.8 SVN version with the 3 patches you are referring to.
Meantime I checked my Polycom device.cfg settings and I found this
<serverMissedCall call.serverMissedCall.1.enabled="0"/>

Can you confirm the correct value?
By: Matt Jordan (mjordan) 2011-11-08 09:37:07.365-0600

There should be at least one phone that has that set to 1. Setting it to 1 will send out a SUBSCRIBE to the event 'missed-call-summary' - there aren't a lot of them in your logs, but a few. You could grep through the logs to find those phones.
By: David Brillert (aragon) 2011-11-08 10:11:38.559-0600

I found the missed-call-summary settings and changed them to 0 (thanks).
Tested SVN r343791 in production and cannot reproduce the locking anymore (thanks again).

By: Matt Jordan (mjordan) 2011-11-08 13:39:25.369-0600

No problem - thanks for testing the various patches!
By: David Brillert (aragon) 2011-11-08 13:47:52.527-0600

Fixed by commits in 1.8 SVN