[Home]

Summary:ASTERISK-15432: [patch] Deadlock on &(&channels)->lock
Reporter:Sergey Tamkovich (sergee)Labels:
Date Opened:2010-01-14 15:04:34.000-0600Date Closed:2011-06-07 14:00:56
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 20100301__issue16608.diff.txt
( 1) 20100414__issue16608.diff.txt
( 2) 20100426__issue16608.diff.txt
( 3) 20100513__issue16608__1.6.2.diff.txt
( 4) 20100913-bt.txt
( 5) 20100913-locks.txt
( 6) btall.txt
( 7) locks.txt
( 8) sergee_issue16608.diff
( 9) xlite.txt
(10) zoiper.txt
Description:I've got 2 deadlocks in 3 days. It happens in a peak hours, with more then a hundred calls in the system.

I've got 2 files 1 with "core show locks" - from the first deadlock (the day before yesterday), second file - with output from gdb (intho thread, thread apply all bt, thread apply all bt full) - from today's deadlock.
Comments:By: Leif Madsen (lmadsen) 2010-01-15 14:15:26.000-0600

When is this actually happening? Is this happening in the shutdown?

By: Sergey Tamkovich (sergee) 2010-01-15 14:46:08.000-0600

It happening during normal work, after i find out that system is not responding to sip requests, i tried to restart it with cli command "restart now"

By: Tilghman Lesher (tilghman) 2010-03-01 17:58:36.000-0600

It appears that a SIP call is no longer receiving RTP packets, which causes recvmsg to hang forever.  Technically, this isn't a deadlock, but it still should not do that.  Patch uploaded that should fix this, by preventing the RTP code from hanging while waiting for a packet to arrive.

By: Tilghman Lesher (tilghman) 2010-03-01 17:59:22.000-0600

Please test and verify that this patch prevents this situation from occurring again.

By: Sergey Tamkovich (sergee) 2010-03-05 05:53:38.000-0600

i will be able to test this on tuesday (9 march)



By: Tilghman Lesher (tilghman) 2010-03-11 15:48:33.000-0600

It is now the 11th of March.  How did your test go?

By: Tilghman Lesher (tilghman) 2010-03-22 16:00:14

sergee: ping.  Have you been able to test this yet?

By: Tilghman Lesher (tilghman) 2010-03-24 11:34:35

No response from reporter.

By: Sergey Tamkovich (sergee) 2010-03-27 16:56:33

Sorry for a long silence, i have just come back home. i applyed this patch to Asterisk 1.6.0 Revision: 237060. - same as before

I will have peak load on monday at 09:00 - 13:00 UTC. I will report about any issues.



By: Sergey Tamkovich (sergee) 2010-03-28 03:45:04

Now i have false-positives for rtp-timeout timers,

but rtp works,

   -- Executing DIAL("SIP/1010-000017a4", "SIP/voipgw4/........")
 == Using SIP RTP CoS mark 5
 == Using SIP VRTP CoS mark 6
 == Using UDPTL CoS mark 5
   -- Called voipgw4/09689262415703
   -- SIP/voipgw4-000017a5 is making progress passing it to SIP/1010-000017a4
   -- SIP/voipgw4-000017a5 is making progress passing it to SIP/1010-000017a4
   -- SIP/voipgw4-000017a5 answered SIP/1010-000017a4
   -- Packet2Packet bridging SIP/1010-000017a4 and SIP/voipgw4-000017a5

Disconnecting call 'SIP/1010-000017a4' for lack of RTP activity in 61 seconds



I've checked "core show channel SIP/1010-000017a4" several times during a call. It seems that Asterisk stops updating packet counters (Frames in, Frames out) after a few seconds of call.



By: Tilghman Lesher (tilghman) 2010-04-09 10:59:37

sergee:  can you verify through wireshark whether RTP has actually stopped flowing or not?

By: Sergey Tamkovich (sergee) 2010-04-13 04:42:21

tilghman, i verifyed that RTP is present with tshark.

Also i fuond an interesting issue. My scheme is very simple:

Zoiper -> asterisk -> as5350

so, there are 3 rtptimeout timers for this scheme: 1 for zoiper, 1 for cisco and last one - general rtptimeout defined in sip.conf to distinct them i assigned 3 different values: 5, 67 and 30 seconds. And now i receive:

[Apr 13 13:28:57] NOTICE[21752]: chan_sip.c:20129 check_rtp_timeout: Disconnecting call 'SIP/1010-00000000' for lack of RTP activity in 31 seconds

So it seems that rtptimeout from general section of sip.conf is the source of false-positives.


Another interesting thing that i found, is that behaviour depends on UAC. For example, if i change Zoiper with X-Lite - problem is gone. If i switch back to Zoiper - problem is present again (fals positives for rtptimeout)



By: Tilghman Lesher (tilghman) 2010-04-14 16:30:13

I don't see any reason why, if the media is flowing through Asterisk, why the rtptimeout would not be updated.  The only thing I can figure is that the RTP may be flowing directly between endpoints.  This might explain why the rtptimeout is triggered.

If you set 'canreinvite=no' for at least one of the related peers in sip.conf, then this problem should go away entirely.  If it does, then we have a code path that we could possibly work on.



By: Tilghman Lesher (tilghman) 2010-04-14 16:45:10

Patch updated that contains this change, as detailed above.

By: Sergey Tamkovich (sergee) 2010-04-15 02:41:49

just a thoughts:

1. without your first patch - rtptimeoput wasn't triggered. I checked it just now - with clean asterisk 1.6.0 r237060, problem with false-positives on rtptimeouts - doesn't exist.

2. i can see RTP from both peers on asterisk-host with wireshark. Also i can see RTP in Asterisk with "rtp debug"

Sent RTP P2P packet to 1.1.1.1:16542 (type 08, len 000160)
Sent RTP P2P packet to 2.2.2.2:8000 (type 08, len 000160)
Sent RTP P2P packet to 1.1.1.1:16542 (type 08, len 000160)
Sent RTP P2P packet to 2.2.2.2:8000 (type 08, len 000160)
Sent RTP P2P packet to 1.1.1.1:16542 (type 08, len 000160)
Sent RTP P2P packet to 2.2.2.2:8000 (type 08, len 000160)



May be it has something to do with bridging technology engaged in this call:

   -- Packet2Packet bridging SIP/1010-00000002 and SIP/voipgw4-00000003

3. canreinvite=no is configured for both peers

By: Sergey Tamkovich (sergee) 2010-04-15 02:47:12

Yes, looks like it has something to do with Packet2Packet bridging,
if i use X-Lite, instead of Zoiper i don't have a string like:

   -- Packet2Packet bridging SIP/1010-00000002 and SIP/voipgw4-00000003

also "rtp debug" looks different:

Sent RTP packet to      1.1.1.1:19524 (type 08, seq 032384, ts 3420640, len 000160)
Got  RTP packet from    1.1.1.1:19524 (type 08, seq 046749, ts 1302925229, len 000160)
Sent RTP packet to      2.2.2.2:9100 (type 00, seq 036859, ts 1302925224, len 000160)
Got  RTP packet from    2.2.2.2:9100 (type 00, seq 004692, ts 3420800, len 000160)
Sent RTP packet to      1.1.1.1:19524 (type 08, seq 032385, ts 3420800, len 000160)
Got  RTP packet from    1.1.1.1:19524 (type 08, seq 046750, ts 1302925389, len 000160)


and no rtptimeouts with X-Lite.



By: Sergey Tamkovich (sergee) 2010-04-15 02:57:01

also i found out that rtp timeout on called party (voipgw4) is triggered.

previous scheme:

zoiper (5) -> asterisk (15) -> cisco (65)

and calls was droped after 16 seconds.
Now i have:

zoiper (5) -> asterisk (70) -> cisco (18)

and now calls are droped after 19 seconds.

By: Tilghman Lesher (tilghman) 2010-04-15 17:02:26

1.  If you have no problems without any of these patches, then we can close this issue with no patches.

2.  If you aren't receiving RTP, that would certainly explain why the RTP timeout is occurring.  This suggests some other interaction is not properly occurring with the Zoiper client.  Please provide SIP traces for both calls.

By: Sergey Tamkovich (sergee) 2010-04-16 05:42:01

> 1. If you have no problems without any of these patches, then we can close this issue with no patches.

With clean Asterisk 1.6.0 r237060 - I don't have false-positives for rtptimeouts. However, i have a deadlock problem - when asterisk under load - freezes (stop processing all calls). So, before tackling "deadlock" problem, i suppose we need to fix "rtptimeout" problem, which was introduced in your patches.

> 2. If you aren't receiving RTP, that would certainly explain why the RTP timeout is occurring.

Yeah, that would be nice, however, as i mentioned several times, i do see RTP flowing through asterisk - with "rtp debug" and wireshark.

> This suggests some other interaction is not properly occurring with the Zoiper client.

Just for the record: the problem is not only with zoiper, but with any software and hardware clients whos calls are bridged with "Packet2Packet bridging".

> Please provide SIP traces for both calls.

Attached.
10.10.10.197 - softphone (zoiper/x-lite)
10.10.10.207 - Asterisk 1.6.0 r237060
10.10.10.219 - Cisco AS5350

scheme:

softphone -> Asterisk -> Cisco



By: Tilghman Lesher (tilghman) 2010-04-16 09:28:46

1)  It's not a deadlock.  It's a resource starvation caused by holding a lock while waiting for RTP to be received.  Since no RTP is received, the lock is held forever.  Now I think you understand why my patch did not introduce an RTP timeout:  it was already inherent in your system.

2)  No, you see RTP _sent_ from Asterisk, but in certain cases (most?), you aren't _receiving_ RTP.  That's what's causing the timeout.

I want to see the SIP traces for calls both where you ARE receiving RTP (X-Lite) as well as calls for which you are not (Zoiper), so I can figure out what is specifically different.

By: Sergey Tamkovich (sergee) 2010-04-17 04:11:39

tilghman,

> 1) It's not a deadlock. It's a resource starvation ..

Oh! i've got it! thanks for your patience :)

> I want to see the SIP traces for calls both

i've attached 2 files: zoiper.txt and xlite.txt - it contains output of "sip debug" for 2 calls. Is that what you want?

By: Sergey Tamkovich (sergee) 2010-04-17 04:23:05

> 2) No, you see RTP _sent_ from Asterisk, but in certain cases (most?), you aren't _receiving_ RTP. That's what's causing the timeout.

Probably i don't understand something again, here are an output from tshark for 2 calls which was droped by asterisk due to false-positive on rtptimeout.


Call-leg 'A' (Zoiper -> Asterisk), 2 seconds before disconnect:

38.194732 10.10.10.197 -> 10.10.10.207 RTP PT=ITU-T G.711 PCMA, SSRC=0xC7E1A4D6, Seq=27074, Time=1632192850
38.212472 10.10.10.197 -> 10.10.10.207 RTP PT=ITU-T G.711 PCMA, SSRC=0xC7E1A4D6, Seq=27075, Time=1632193010
38.215000 10.10.10.207 -> 10.10.10.197 RTP PT=ITU-T G.711 PCMA, SSRC=0x21A9E0DB, Seq=35190, Time=586537724
38.230463 10.10.10.197 -> 10.10.10.207 RTP PT=ITU-T G.711 PCMA, SSRC=0xC7E1A4D6, Seq=27076, Time=1632193170
38.235990 10.10.10.207 -> 10.10.10.197 RTP PT=ITU-T G.711 PCMA, SSRC=0x21A9E0DB, Seq=35191, Time=586537884
38.250200 10.10.10.197 -> 10.10.10.207 RTP PT=ITU-T G.711 PCMA, SSRC=0xC7E1A4D6, Seq=27077, Time=1632193330
38.256975 10.10.10.207 -> 10.10.10.197 RTP PT=ITU-T G.711 PCMA, SSRC=0x21A9E0DB, Seq=35192, Time=586538044
38.273687 10.10.10.197 -> 10.10.10.207 RTP PT=ITU-T G.711 PCMA, SSRC=0xC7E1A4D6, Seq=27078, Time=1632193490

- as far as i understand RTP flows in both directions.


Call-leg 'B' (Asterisk -> Cisco AS5350), 2 seconds before disconnect:


36.079551 10.10.10.219 -> 10.10.10.207 RTP PT=ITU-T G.711 PCMA, SSRC=0x1EB3E0DB, Seq=63721, Time=1467799745
36.100539 10.10.10.219 -> 10.10.10.207 RTP PT=ITU-T G.711 PCMA, SSRC=0x1EB3E0DB, Seq=63722, Time=1467799905
36.104070 10.10.10.207 -> 10.10.10.219 RTP PT=ITU-T G.711 PCMA, SSRC=0xDEBBC5EC, Seq=27226, Time=987187368
36.116559 10.10.10.207 -> 10.10.10.219 RTP PT=ITU-T G.711 PCMA, SSRC=0xDEBBC5EC, Seq=27227, Time=987187528
36.121527 10.10.10.219 -> 10.10.10.207 RTP PT=ITU-T G.711 PCMA, SSRC=0x1EB3E0DB, Seq=63723, Time=1467800065
36.135547 10.10.10.207 -> 10.10.10.219 RTP PT=ITU-T G.711 PCMA, SSRC=0xDEBBC5EC, Seq=27228, Time=987187688
36.142515 10.10.10.219 -> 10.10.10.207 RTP PT=ITU-T G.711 PCMA, SSRC=0x1EB3E0DB, Seq=63724, Time=1467800225


- as far as i understand RTP flows in both directions.

So i have a question - why do you think that Asterisk is not _receiving_ RTP?



By: Sergey Tamkovich (sergee) 2010-04-17 11:30:36

Ok, i found out the reason of false-positives for rtp-timeout. As i said before, false-positives for rtptimeouts happens for calls which are bridged with 'Packet2Packet bridging'.

The reason is your code, added to sip_read():


if (fr == &ast_null_frame) {
sip_pvt_unlock(p);
return fr;
}

it was added right before:

p->lastrtprx = time(NULL);

So, if sip_rtp_read(ast, p, &faxdetected) returns &ast_null_frame - p->lastrtprx is never updated (with your patch).

But, with 'Packet2Packet bridging', sip_rtp_read() always returns &ast_null_frame, here is a code from ast_rtp_read():

       if ((bridged = ast_rtp_get_bridged(rtp)) && !bridge_p2p_rtp_write(rtp, bridged, rtpheader, res, hdrlen)) {
               return &ast_null_frame;
       }


tilghman, i suggest a simple modification of your last patch (patch uploaded) - please review it. If you don't mind against this modification -  we could proceed with original problem - i will organize testing.

By: Tilghman Lesher (tilghman) 2010-04-26 13:28:51

sergee:  good catch.  I have revised my patch, such that I use a special NULL frame just for the purpose.

By: Sergey Tamkovich (sergee) 2010-05-02 03:52:04

tilghman, i will install this patch to the main system on the next friday (07-05-2010).

Is there any way to see (or predict) the original problem before system freeze?



By: Tilghman Lesher (tilghman) 2010-05-04 01:49:32

The original problem is impossible with the patch.

By: Sergey Tamkovich (sergee) 2010-05-08 13:23:34

Patch installed. No problems for now, but there is no peak load either :) Peak loads will happen on 11/05 09:00 GMT and 12/05 09:00 GMT - i will report here if there will be any problems or not.

By: Sergey Tamkovich (sergee) 2010-05-11 06:52:52

got the same problem: server stops processing new calls:

[root@sip ~]# asterisk -rx "core show locks"

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 55450512 (do_monitor           started at [20257] chan_sip.c restart_monitor())
=== ---> Tried and failed to get Lock #0 (chan_sip.c): MUTEX 20197 do_monitor dialog 0xb4666730 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 89893776 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2580 __ast_read (channel lock) 0xa1e6a254 (1)
=== ---> Lock #1 (chan_sip.c): MUTEX 6143 sip_read p 0xb4666730 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 89533328 (netconsole           started at [ 1087] asterisk.c listener())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1215 channel_find_locked (channel lock) 0xa1e6a254 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 68967312 (netconsole           started at [ 1087] asterisk.c listener())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1215 channel_find_locked (channel lock) 0xa1e6a254 (0)
=== -------------------------------------------------------------------
===
=======================================================================


I tryed to restart it properly, with CLI command 'restart now' - but server didn't restart, also output of "core show locks" changed to what i've seen before:

[root@sip ~]# asterisk -rx "core show locks"

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 55450512 (do_monitor           started at [20257] chan_sip.c restart_monitor())
=== ---> Tried and failed to get Lock #0 (chan_sip.c): MUTEX 20197 do_monitor dialog 0xb4666730 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 60648336 (monitor_sig_flags    started at [ 3523] asterisk.c main())
=== ---> Lock #0 (channel.c): RDLOCK 491 ast_begin_shutdown &(&channels)->lock 0x81dd9a8 (1)
=== ---> Waiting for Lock #1 (channel.c): MUTEX 1632 ast_softhangup (channel lock) 0xa1e6a254 (1)
=== --- ---> Locked Here: channel.c line 2580 (__ast_read)
=== -------------------------------------------------------------------
===
=== Thread ID: 89893776 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Lock #0 (channel.c): MUTEX 2580 __ast_read (channel lock) 0xa1e6a254 (1)
=== ---> Lock #1 (chan_sip.c): MUTEX 6143 sip_read p 0xb4666730 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 137595792 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 35421072 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 154131344 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 83032976 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 85695376 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 149318544 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 121359248 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 111782800 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 24923024 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 19151760 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 41601936 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 19889040 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 34929552 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 160054160 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 28179344 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 126606224 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 36658064 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 39009168 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 54868880 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 148708240 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 24677264 (pbx_thread           started at [ 4027] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (channel.c): WRLOCK 1685 ast_hangup &(&channels)->lock 0x81dd9a8 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: 89533328 (netconsole           started at [ 1087] asterisk.c listener())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1215 channel_find_locked (channel lock) 0xa1e6a254 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 68967312 (netconsole           started at [ 1087] asterisk.c listener())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1215 channel_find_locked (channel lock) 0xa1e6a254 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 32533392 (netconsole           started at [ 1087] asterisk.c listener())
=== ---> Lock #0 (channel.c): RDLOCK 491 ast_begin_shutdown &(&channels)->lock 0x81dd9a8 (1)
=== ---> Waiting for Lock #1 (channel.c): MUTEX 1632 ast_softhangup (channel lock) 0xa1e6a254 (1)
=== --- ---> Locked Here: channel.c line 2580 (__ast_read)
=== -------------------------------------------------------------------
===
=======================================================================

By: Sergey Tamkovich (sergee) 2010-05-11 07:12:26

i tryed to execute "core show channels verbose" - 2 times before 1st 'core show locks'

- maybe it is the source of last 2 entities:

=== Thread ID: 89533328 (netconsole started at [ 1087] asterisk.c listener())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1215 channel_find_locked (channel lock) 0xa1e6a254 (0)
=== -------------------------------------------------------------------
===
=== Thread ID: 68967312 (netconsole started at [ 1087] asterisk.c listener())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 1215 channel_find_locked (channel lock) 0xa1e6a254 (0)

By: Sergey Tamkovich (sergee) 2010-05-12 01:46:36

next time  it hangs - i'll dump a core, without calling "restart now".

By: Sergey Tamkovich (sergee) 2010-05-13 05:12:59

It happened again.

tilghman, i want to notice, that with your patch this "not deadlock" happens not as often (once per 2 workdays ~ 110 000 calls) as it was back in january (2-3 times per day ~ 20 000 calls).

Now, after 2 minutes of hanging, i dumped "core show locks" and then killed asterisk with -6 signal, without calling "restart now" CLI command.

Output of "core show locks" and "thread apply all bt" from gdb - attached here.
I'll keep the core file - if you'll need anything else.

By: Tilghman Lesher (tilghman) 2010-05-13 10:23:16

Per the Asterisk maintenance timeline page at http://www.asterisk.org/asterisk-versions maintenance (bug) support for the 1.6.0 and 1.6.1 branches has ended. For continued maintenance support please move to the 1.6.2 branch.

More information on this change can be found in the release announcement: http://www.asterisk.org/node/49924

By: Sergey Tamkovich (sergee) 2010-05-13 13:08:52

tilghman, shoul i apply your patch (20100426__issue16608.diff.txt) to 1.6.2?

By: Tilghman Lesher (tilghman) 2010-05-13 15:49:17

New patch produced for 1.6.2, specifically.

By: Tilghman Lesher (tilghman) 2010-05-13 15:56:09

You might be able to do without this patch at all.  In 1.6.2, unlike 1.6.0, the RTP read code is already nonblocking.

By: Sergey Tamkovich (sergee) 2010-05-16 16:22:21

Just installed 1.6.2 - let's see how it behaves

By: Sergey Tamkovich (sergee) 2010-05-17 13:17:45

1.6.2 doesn't work properly, it crashesh every few minutes

-rw------- 1 root   root   729772032 May 17 09:59 core.14890
-rw------- 1 root   root   125067264 May 17 10:05 core.24352
-rw------- 1 root   root   120496128 May 17 10:35 core.25772
-rw------- 1 root   root   119312384 May 17 10:36 core.30637
-rw------- 1 root   root   115625984 May 17 10:37 core.30971
-rw------- 1 root   root   114724864 May 17 10:38 core.31304
-rw------- 1 root   root   115630080 May 17 10:39 core.31591

Can we proceed with 1.6.0? I believe that if we'll fix this problem for 1.6.0 - we'll fix it for 1.6.1 and 1.6.2 as well.

By: Tilghman Lesher (tilghman) 2010-05-17 13:39:21

As noted above, there is no further support for 1.6.0.  If you cannot reproduce on 1.6.2, then this issue will be closed.

By: Tilghman Lesher (tilghman) 2010-05-20 11:58:21

It would be helpful if you would file those crashes in 1.6.2 as a separate issue.  If it's crashing, I'd like to make sure those get fixed.

By: Sergey Tamkovich (sergee) 2010-05-29 06:33:26

tilghman, i won't be able to reproduce those crashes in the nearest future, because it rises a lot of noise from customers :) i'm sorry.

If it helps, i can say that crashes happened under load (monday morning ~ 120-150 calls). During weekend tests (~ 30-50 calls) everything was fine.



By: Tilghman Lesher (tilghman) 2010-05-29 10:51:54

1.6.0 is no longer supported for bug fixes, so closing.