[Home]

Summary:ASTERISK-17655: pri lockup. no futher calls able to be taken
Reporter:Alec Davis (alecdavis)Labels:
Date Opened:2011-04-06 22:00:33Date Closed:2011-05-06 17:30:21
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Channels/chan_dahdi
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) backtrace-apr08.txt
( 1) backtrace-apr7.txt
( 2) backtrace-apr7-2.txt
( 3) core-show-locks.txt
( 4) core-show-locks2.txt
( 5) core-show-locks-apr08.txt
Description:attached backtrace and core show locks

****** ADDITIONAL INFORMATION ******

[2011-04-07 10:58:43.199] VERBOSE[29802] pbx.c:     -- Executing [8320@incoming:9] Dial("DAHDI/i1/5604947-c3b", "DAHDI/g0/033417052:8320,,r") in new stack
[2011-04-07 10:58:43.199] DEBUG[5255] chan_dahdi.c: Enabled echo cancellation on channel 31
[2011-04-07 10:58:43.199] DEBUG[29802] chan_dahdi.c: Using channel 3
[2011-04-07 10:58:43.199] DEBUG[29802] sig_pri.c: sig_pri_request 3
[2011-04-07 10:58:43.199] DEBUG[29802] dsp.c: Setup tone 1100 Hz, 500 ms, block_size=160, hits_required=21
[2011-04-07 10:58:43.199] DEBUG[29802] dsp.c: Setup tone 2100 Hz, 2600 ms, block_size=160, hits_required=116
[2011-04-07 10:58:43.199] DEBUG[29802] devicestate.c: device 'DAHDI/i1/033417052:8320-c3c' state '2'
[2011-04-07 10:58:43.199] ERROR[29802] lock.c: chan_dahdi.c line 2078 (my_unlock_private): mutex '&p->lock' freed more times than we've locked!
[2011-04-07 10:58:43.199] ERROR[29802] lock.c: chan_dahdi.c line 2078 (my_unlock_private): Error releasing mutex: Operation not permitted
Comments:By: Alec Davis (alecdavis) 2011-04-06 22:49:31

message below for 8 minutes!!

[2011-04-07 11:06:07.435] DEBUG[29802] chan_dahdi.c: Dropping frame since I'm still dialing on DAHDI/i1/033417052:8320-c3c...
[2011-04-07 11:06:07.435] DEBUG[29802] chan_dahdi.c: Dropping frame since I'm still dialing on DAHDI/i1/033417052:8320-c3c...

By: Richard Mudgett (rmudgett) 2011-04-07 10:59:46

This may no longer be a problem.  I have recently committed a major patch on the 1.8 branch to fix many B channel shifting issues.  Part of that patch also fixed a deadlock issue I ran into.

By: Alec Davis (alecdavis) 2011-04-07 17:03:23

richard: being a production box, not exactly keen.
It's been stable for weeks, then yesterday 2 pri lockups about an hour apart. I only filed the first.

Are you able to pinpoint the deadlock issue, so that I can apply a focused patch. Was it in chan_dahdi, or sig_pri?

By: Richard Mudgett (rmudgett) 2011-04-07 17:45:44

The particular deadlock in core-show-locks.txt doesn't make sense to me.  I think the holder of the private lock should not still have it.  It looks like someone also tried a pri_grab() without already having the private lock based upon your ERROR messages above.

The deadlock issue I ran into was because the private lock was held by pri_dchannel() while it was trying to create the ast_channel for an incoming call.

Part of the B channel shifting patch added channel allocation protection between incoming and outgoing calls picking the same B channel.  See sig_pri_available() and sig_pri_is_chan_available().

B channel shifting patch is -r312575 and -r312949 for v1.8.

What is the call pattern for this particular box?
Is it in/out calls?
Is it very busy?

Collision mitigation may be possible by using a different channel allocation strategy to avoid both sides being likely to pick the same B channel.

By: Alec Davis (alecdavis) 2011-04-07 18:51:31

core-show-locks2.txt and backtrace-apr7-2.txt

debug log related to the 2nd lockup.

[2011-04-07 13:07:22.125] VERBOSE[10963] pbx.c:     -- Executing [s@macro-dialout:16] Dial("SIP/GXP0018-000000a1", "DAHDI/g0/0800268743#") in new stack
[2011-04-07 13:07:22.125] DEBUG[10963] chan_dahdi.c: Using channel 3
[2011-04-07 13:07:22.125] DEBUG[10963] sig_pri.c: sig_pri_request 3
[2011-04-07 13:07:22.125] DEBUG[10963] dsp.c: Setup tone 1100 Hz, 500 ms, block_size=160, hits_required=21
[2011-04-07 13:07:22.125] DEBUG[10963] dsp.c: Setup tone 2100 Hz, 2600 ms, block_size=160, hits_required=116
[2011-04-07 13:07:22.125] DEBUG[10963] devicestate.c: device 'DAHDI/i1/0800268743#-bc' state '2'
[2011-04-07 13:07:22.124] DEBUG[29965] devicestate.c: device 'DAHDI/i1/5604817-bb' state '2'
[2011-04-07 13:07:22.126] VERBOSE[10964] sig_pri.c:     -- Starting simple switch on 'DAHDI/i1/5604817-bb'
[2011-04-07 13:07:22.126] ERROR[10963] lock.c: chan_dahdi.c line 2078 (my_unlock_private): mutex '&p->lock' freed more times than we've locked!
[2011-04-07 13:07:22.126] ERROR[10963] lock.c: chan_dahdi.c line 2078 (my_unlock_private): Error releasing mutex: Operation not permitted
[2011-04-07 13:07:22.126] VERBOSE[29965] sig_pri.c:     -- Accepting overlap call from '5604817' to '039833199' on channel 0/1, span 1

By: Alec Davis (alecdavis) 2011-04-07 23:00:18

there seems to be a race between hangup and request.

this can be seen in the backtrace-apr7-2 between threads 15 and 18 both trying to lock 0xb637bb40

rmudgett: looking at your chan_dahdi commits, I noticed this <pre>                                /*
                                * We already have the B channel reserved for this call.  We
                                * just need to make sure that dahdi_hangup() has completed
                                * cleaning up before continuing.
                                */
                               ast_mutex_lock(&p->lock);
                               ast_mutex_unlock(&p->lock);</pre>

seems relevant?

By: Alec Davis (alecdavis) 2011-04-08 00:15:34

backtrace-apr08.txt and core-show-locks-apr08.txt

Another lockup today, seems like the 1st lockup.
1 thread with a 1 lock 'LOCK8' waiting on something.

LOCK1, wait for LOCK7
 LOCK7, wait for LOCK3
   LOCK3, wait for LOCK8
     LOCK8, but not sure what is it's problem.

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0xb6dffb90 (do_devstate_changes  started at [  724] devicestate.c ast_device_state_engine_init())
=== ---> Lock #0 (astobj2.c): MUTEX 657 internal_ao2_callback c 0x99586f0 (1)                             LOCK1
=== ---> Waiting for Lock #1 (channel.c): MUTEX 1641 ast_channel_cmp_cb chan 0xaf87cc10 (1)                      LOCK7
=== --- ---> Locked Here: channel.c line 2796 (ast_raw_answer)
=== -------------------------------------------------------------------
===
=== Thread ID: 0xb2cfcb90 (do_monitor           started at [24470] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 23964 handle_request_do &netlock 0xb6b0f6c0 (1)                      LOCK2
=== ---> Waiting for Lock #1 (astobj2.c): MUTEX 493 internal_ao2_link c 0x99586f0 (1)                             LOCK1
=== --- ---> Locked Here: astobj2.c line 657 (internal_ao2_callback)
=== -------------------------------------------------------------------
===
=== Thread ID: 0xaffb7b90 (pri_dchannel         started at [ 7117] sig_pri.c sig_pri_start_pri())
=== ---> Lock #0 (sig_pri.c): MUTEX 4351 pri_dchannel &pri->lock 0xb638a824 (1)                           LOCK3
=== ---> Waiting for Lock #1 (chan_dahdi.c): MUTEX 2072 my_lock_private &p->lock 0xb04f72f0 (1)                  LOCK8
=== --- ---> Locked Here: chan_dahdi.c line 2072 (my_lock_private)
=== -------------------------------------------------------------------
===
=== Thread ID: 0xaff3fb90 (do_monitor           started at [11577] chan_dahdi.c restart_monitor())
=== ---> Lock #0 (chan_dahdi.c): MUTEX 11304 do_monitor &iflock 0xb6386b40 (1)                            LOCK4
=== ---> Waiting for Lock #1 (chan_dahdi.c): MUTEX 11322 do_monitor &i->lock 0xb04f72f0 (1)                      LOCK8
=== --- ---> Locked Here: chan_dahdi.c line 2072 (my_lock_private)
=== -------------------------------------------------------------------
===
=== Thread ID: 0xaff7bb90 (pri_dchannel         started at [ 7117] sig_pri.c sig_pri_start_pri())
=== ---> Lock #0 (chan_dahdi.c): MUTEX 2072 my_lock_private &p->lock 0xafe2ac78 (1)                       LOCK5
=== ---> Waiting for Lock #1 (astobj2.c): MUTEX 493 internal_ao2_link c 0x99586f0 (1)                            LOCK1
=== --- ---> Locked Here: astobj2.c line 657 (internal_ao2_callback)
=== -------------------------------------------------------------------
===
=== Thread ID: 0xafa1eb90 (pbx_thread           started at [ 5038] pbx.c ast_pbx_start())
=== ---> Lock #0 (cdr.c): RDLOCK 1142 post_cdr &(&be_list)->lock 0x8203868 (1)                            LOCK6
=== ---> Waiting for Lock #1 (astobj2.c): MUTEX 657 internal_ao2_callback c 0x99586f0 (1)                        LOCK1
=== --- ---> Locked Here: astobj2.c line 657 (internal_ao2_callback)
=== -------------------------------------------------------------------
===
=== Thread ID: 0xaecb3b90 (pbx_thread           started at [ 5038] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2796 ast_raw_answer chan 0xaf87cc10 (1)                               LOCK7
=== ---> Tried and failed to get Lock #1 (sig_pri.c): MUTEX 288 pri_grab &pri->lock 0xb638a824 (0)               LOCK3
=== -------------------------------------------------------------------
===
=== Thread ID: 0xaf74ab90 (pbx_thread           started at [ 5038] pbx.c ast_pbx_start())
=== ---> Lock #0 (chan_dahdi.c): MUTEX 2072 my_lock_private &p->lock 0xb04f72f0 (1)                       LOCK8
=== -------------------------------------------------------------------
===
=======================================================================



By: Alec Davis (alecdavis) 2011-04-08 00:34:35

[2011-04-08 16:06:08.738] VERBOSE[23914] sig_pri.c:     -- Accepting call from '5604819' to '8322' on channel 0/29, span 1
[2011-04-08 16:06:08.738] DEBUG[23929] app_queue.c: Device 'DAHDI/i1/5604819-246' changed to state '2' (In use) but we don't care because they're not a member of any queue.
[2011-04-08 16:06:08.739] DEBUG[2343] pbx.c: Function result is '5604819'
[2011-04-08 16:06:08.740] DEBUG[2343] pbx.c: Launching 'MYSQL'
[2011-04-08 16:06:08.740] VERBOSE[2343] pbx.c:     -- Executing [8322@incoming:2] MYSQL("DAHDI/i1/5604819-246", "Query ...) in new stack
... MYSQL ...
[2011-04-08 16:06:08.741] DEBUG[2343] pbx.c: Launching 'Dial'
[2011-04-08 16:06:08.741] VERBOSE[2343] pbx.c:     -- Executing [8322@incoming:9] Dial("DAHDI/i1/5604819-246", "DAHDI/g0/033417052:8322,,r") in new stack
[2011-04-08 16:06:08.741] DEBUG[2343] chan_dahdi.c: Using channel 1
[2011-04-08 16:06:08.741] DEBUG[2343] sig_pri.c: sig_pri_request 1
[2011-04-08 16:06:08.739] DEBUG[23914] chan_dahdi.c: Enabled echo cancellation on channel 29
[2011-04-08 16:06:08.741] DEBUG[2343] dsp.c: Setup tone 1100 Hz, 500 ms, block_size=160, hits_required=21
[2011-04-08 16:06:08.741] DEBUG[2343] dsp.c: Setup tone 2100 Hz, 2600 ms, block_size=160, hits_required=116
[2011-04-08 16:06:08.741] DEBUG[2343] devicestate.c: device 'DAHDI/i1/033417052:8322-247' state '2'
[2011-04-08 16:06:08.741] ERROR[2343] lock.c: chan_dahdi.c line 2078 (my_unlock_private): mutex '&p->lock' freed more times than we've locked!
[2011-04-08 16:06:08.741] ERROR[2343] lock.c: chan_dahdi.c line 2078 (my_unlock_private): Error releasing mutex: Operation not permitted

By: Richard Mudgett (rmudgett) 2011-04-08 10:16:58

Look at -r301946.  You are suffering from that bug at least.

By: Richard Mudgett (rmudgett) 2011-05-06 14:21:18

alec:
Did applying the patch of -r301946 take care of your problem?

By: Alec Davis (alecdavis) 2011-05-06 16:09:22

rmudgett: not sure exactly, went to the current 1.8 branch SVN-branch-1.8-r315673M and havn't seen it since.

By: Richard Mudgett (rmudgett) 2011-05-06 17:30:21

From the symptoms the -r301946 bug was the main culprit here.