[Home]

Summary:ASTERISK-06813: An attempt to transfer call into same callid causes deadlock
Reporter:caspy (caspy)Labels:
Date Opened:2006-04-19 08:27:37Date Closed:2006-09-20 10:29:38
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/Transfers
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) asterisk_bt_20060510.txt
( 1) asterisk_bt_20060518_1330.txt
( 2) asterisk_bt_20060518.txt
( 3) asterisk_bt_20060609_1.txt
( 4) asterisk_bt_20060619_2.txt
( 5) asterisk_bt_20060621_1.txt
( 6) asterisk_bt_20060626_1.txt
( 7) asterisk_bt_20060703_1.txt
( 8) debug_20060619_1.txt
( 9) debug_20060619.txt
(10) debug_20060621_1.txt
(11) debug_20060703_1.txt
(12) debugagenttransfer_queuedeadlock_gdb.txt
(13) debugagenttransfer_queuedeadlock_log.txt
(14) debugagenttransfer_segfault_gdb.txt
(15) debugagenttransfer_segfault_log.txt
(16) goakham_backtrace_june_15_2006.txt
Description:After working several hours under medium load, Asterisk become unusable in a strange manner. The main things i can notice are:

1) pri desyncronization. Asterisk thinks, that it has X active channels on zap. Telco station knows only about Y. Y is strictly less than X. So, we can't receive any more incoming calls, cause this error:
WARNING[17934] chan_zap.c: Ring requested on channel 0/1 already in use on span 2. Hanging up owner.

2) Commands, such 'show agents', 'show queues' print nothing at all. But channel's commands (sip show channels, zap show channels, etc..) works.

3) Peoples in queue listening music forever, regardless of free agents online.(so Agents do not receive any more calls)

4) 'show channels' can print nothing too, but sometimes print some information.

5) in log, at time when the problem begins, there is a message:
NOTICE[17930] chan_sip.c: Supervised transfer attempted to transfer into same call id (030099cc351a181c7eee26786cdec9eb@masterhost.ru == 030099cc351a181c7eee26786cdec9eb@masterhost.ru)!

6) sip calls continue to work

7) 'stop now' doesn't stops asterisk
7a) after 'stop now' console works, but it is impossible to open new console
7b) kill to asterisk's pid do not help, only kill -9


****** ADDITIONAL INFORMATION ******

Configuration:
- pri channel, here we receive and send outside calls.
- all incoming call from pri go into Queue.
- all inside phones are SIP-based.
- about 10 Agents, who logs on SIP phones (AgentCallBackLogin) and answers queue.


The main trouble for me - i don't understand location of problem, and don't know what to dig. ;(
Comments:By: Andrey S Pankov (casper) 2006-04-21 09:13:41

I'm afraid you need a consultant with access to your server...

By: Serge Vecher (serge-v) 2006-04-21 09:19:08

caspy:
1) although the console messages are different, I would downgrade to 1.2.4 to rule http://bugs.digium.com/view.php?id=6963 out.
2) since you are using chan_sip, you need to do a full debug of sip transactions when problems occur:
  a) set debug 4
  b) set verbose 4
  c) sip debug
  *) capture the output and _attach_as_a_file_ to this bug note.
3) for immediate help, please follow casper's suggestion.

By: caspy (caspy) 2006-04-21 09:21:16

it seems to be the same trouble:
http://bugs.digium.com/view.php?id=6147

By: Olle Johansson (oej) 2006-04-23 01:24:32

...and why is this categorized as a SIP problem if SIP works?

By: opsys (opsys) 2006-04-29 01:00:14

Its in SIP because everyone LOVES You. Moved to chan_agent and Merged with 6147.
As they both seam to be related.

By: caspy (caspy) 2006-05-02 02:24:13

As i mentioned in 6147, may be it is not a chan_agent problem. Cause today i've got a same trouble after working 4 days without chan_agent.

I still use queues, but agents rewritten to dialplan using a chan_local.

By: Serge Vecher (serge-v) 2006-05-04 13:26:06

caspy: did you try debugging the deadlock as suggested by ferrency in 6147?

"A short update for those who need assistance debugging deadlock issues:

The previous links in this ticket describing how to debug Asterisk seem to be broken, after recent site renovations. Here is a link in voip-info.org which contains the most important information:

http://www.voip-info.org/tiki-index.php?page=Asterisk%20debugging "

By: caspy (caspy) 2006-05-10 02:21:14

It died again.

In log:
May 10 10:59:42 NOTICE[2589] chan_sip.c: Supervised transfer attempted to transfer into same call id (792cf88e7cfe7366714bd4f6194edbad@masterhost.ru == 792cf88e7cfe7366714bd4f6194edbad@masterhost.ru)!
May 10 11:01:15 WARNING[14066] channel.c: Avoided initial deadlock for '0x4141acb0', 10 retries!
May 10 11:01:20 WARNING[2584] channel.c: Avoided initial deadlock for '0x4141acb0', 10 retries!
May 10 11:01:45 WARNING[2593] chan_zap.c: Ring requested on channel 0/1 already in use on span 2.  Hanging up owner.
May 10 11:01:57 WARNING[2593] chan_zap.c: Ring requested on channel 0/1 already in use on span 2.  Hanging up owner.

Backtrace attached.

By: Mark Spencer (markster) 2006-05-11 03:06:13

Can you avoid both chan_agent and chan_local and still make the problem happen?

By: Mark Spencer (markster) 2006-05-11 03:07:20

Also, in your use of chan_local, if you change instances of:

Local/foo@bar

to be

Local/foo@bar/n

That will disable the masquerades and I'd be interested in whether that makes a difference.

By: caspy (caspy) 2006-05-11 03:12:32

markster,
i can't avoid chan_local, cause i need to account my people, and do other call managment.
chan_agent already avoided, even it does not loaded anymore.

what about Local/foo@bar/n, - i'll look if it is acceptable in my enviroment, and try if ok.

By: BJ Weschke (bweschke) 2006-05-11 09:39:39

caspy: are you using MixMonitor in this setup? I've seen some behavior similar to this where the use of callback agents/queue/and MixMonitor is triggering someone to take a channel lock, and then not release it and that process itself eventually goes away. I've tried recompiling with DEBUG_THREADS in hopes of tracking down where in the code the lock is getting claimed that last time, but I haven't been able to come up with the same situation since launching with DEBUG_THREADS a couple days back.

By: caspy (caspy) 2006-05-11 09:50:55

bweschke, yes, i use Monitor() app.
Also, please mention, that in my setup there are no more chan_agent.

By: syslod (syslod) 2006-05-11 19:19:34

I have the same problem after upgrading.

By: syslod (syslod) 2006-05-11 19:21:06

I have no queues.  Calls goto menu, vm, or sip devices.

By: caspy (caspy) 2006-05-22 04:14:17

a couple of new backtraces from weekends.

By: Serge Vecher (serge-v) 2006-05-22 10:20:55

caspy: if you replace the Monitor app with MixMonitor, would that make a difference?

Also, is this on an upatched Asterisk 1.2.7 from the tarball?

By: caspy (caspy) 2006-05-23 00:41:12

I'll try to replace recording to MixMonitor today. Later write here.

Yes, it is pure 1.2.7 from tar-ball, even without any external module.

By: caspy (caspy) 2006-05-23 03:14:53

I have some difficulties with 'b' option of MixMonitor().
I need to use it (and Monitor() works ok), but MixMonitor() produce an emtpy wav files (44 or 364 bytes) for about 90% of my calls. And i can't find any difference between these calls.

By: caspy (caspy) 2006-05-30 01:40:22

(according to 7226, i'll write here)

With yesterday's svn of 1.2 branch, Asterisk become much more UNstable.
Unfortunately, i was away from ssh, and have no any bt. But after second crash (deadlock + "Ring requested on channel..") in less than one hour, we returned 1.2.7.

Also, my peoples was unable to login into the queue with AddQueueMember().

By: caspy (caspy) 2006-05-30 03:15:46

Some additional info about Queue logon\logoff:

while trying to logoff from being queue member i get:
in console:
   -- Executing RemoveQueueMember("SIP/1001-9fa4", "callcenterq|Local/1001@toagent") in new stack

in bt:
Thread 2 (Thread -1220445264 (LWP 16486)):
#0  0xb7da8e03 in strcasecmp () from /lib/tls/libc.so.6
#1  0xb757a810 in remove_from_interfaces (interface=0xb7411518 "Local/1001@toagent") at app_queue.c:657
#2  0xb757e5cd in remove_from_queue (queuename=0xb7411518 "Local/1001@toagent", interface=0xb7411518 "Local/1001@toagent")
   at app_queue.c:2533
#3  0xb7576ac0 in rqm_exec (chan=0x814a570, data=0xb741150c) at app_queue.c:2875
#4  0x0808e585 in pbx_extension_helper (c=0x814a570, con=0xb7417b9c, context=0x814a6c0 "fromoffice", exten=0x814a7b4 "878",
   priority=1, label=0x0, callerid=0x0, action=0) at pbx.c:553
ASTERISK-1  0x0808f12a in __ast_pbx_run (c=0x814a570) at pbx.c:2230
ASTERISK-2  0x0808fe1f in pbx_thread (data=0xb7417b9c) at pbx.c:2517
ASTERISK-3  0xb7f0cb63 in start_thread () from /lib/tls/libpthread.so.0
ASTERISK-4  0xb7e0718a in clone () from /lib/tls/libc.so.6

no other threads locked.

This situation is modulated in test env, without any active load.

By: caspy (caspy) 2006-05-30 03:16:57

... and 'show channels' show me:
*CLI> show channels
Channel              Location             State   Application(Data)
SIP/1001-9fa4        878@fromoffice:1     Ring    RemoveQueueMember(callcenterq|
1 active channel
1 active call

By: caspy (caspy) 2006-05-30 03:30:57

(gdb) frame 0
#0  0xb7da8dd6 in strcasecmp () from /lib/tls/libc.so.6
(gdb) frame 1
#1  0xb757a810 in remove_from_interfaces (interface=0xb7411518 "Local/1001@toagent") at app_queue.c:657
657                             if (!strcasecmp(interface, mem->interface)) {
(gdb) frame 2
#2  0xb757e5cd in remove_from_queue (queuename=0xb7411518 "Local/1001@toagent", interface=0xb7411518 "Local/1001@toagent")
   at app_queue.c:2533
2533                    remove_from_interfaces(interface);



forever loop in interface_exists_global():
==
               while(mem) {
                       if (!strcasecmp(interface, mem->interface)) {
                               ret = 1;
                               break;
                       }
               }
==
??

By: caspy (caspy) 2006-05-30 09:22:11

vechers, about 7242, - it looks like another issue for me. sorry for dup.

By: BJ Weschke (bweschke) 2006-05-30 09:50:40

caspy: your assesment is spot on. this is an error I missed in doing the backport. will fix momentarily. Thanks!

By: caspy (caspy) 2006-05-30 10:38:20

bweschke, thanks. adding\removing of the interfaces works ok.

now watching for 30799 in production. in case of troubles i'll do my best. ;)

By: Andrey S Pankov (casper) 2006-05-30 13:41:16

IMO the issue can be closed safely...

By: BJ Weschke (bweschke) 2006-05-31 07:01:21

caspy: how are we doing with the latest patch?

By: caspy (caspy) 2006-05-31 13:46:13

bweschke,
thanks, rather good, but in my case it's too early to make a decision (before patch it tooks up to 1-2 week to crash).

*CLI> show uptime
System uptime: 1 day, 2 hours, 54 minutes, 49 seconds


During this period i saw a couple of spontaneous messages like:
May 31 15:18:14 WARNING[31734] channel.c: Avoided initial deadlock for '0x41628360', 10 retries!
but there are no any stalled channels or locked threads.
Is it dangerous? Is it possible to get somehow more diagnostics?


Also, i still periodically notice single messages like:
May 31 17:04:31 WARNING[31743] chan_zap.c: Ring requested on channel 0/3 already in use on span 2.  Hanging up owner.
but i'm not sure that this is a really problem, due to no locked channels after. It seems to me, that this can be a result of 'm' option of Monitor(), cause thread is locked while soxmix works (am i right?). Later i'll do some testing in that direction, if the problem will persist.

By: caspy (caspy) 2006-06-09 03:03:11

Asterisk crashed again.

First of all it writes to log:
Jun  9 11:32:33 NOTICE[31739] chan_sip.c: Supervised transfer attempted to transfer into same call id (084c402649479289368ebdd134163831@masterhost.ru == 084c402649479289368ebdd134163831@masterhost.ru)!

After this, 'show channels' begin to show somthing strange:
phone2*CLI> show channels
Channel              Location             State   Application(Data)
[..skip some lines..]
SIP/gatewayTDO-0ec1  (None)               Up      Bridged Call(
<-- SIP read fro
Zap/40-1             7729720@fromoutside: Up      Queue(callcenterq|tw)
9 active channels
12 active calls
(Mention broken line, and parts of debug info there (mem leak?). Also unreal calls count.)

In a few minutes later, Asterisk begin to log
Jun  9 11:36:31 WARNING[31743] chan_zap.c: Ring requested on channel 0/6 already in use on span 2.  Hanging up owner.
on every new call.

So, at that moment, i needed to restart it. backtrace attached (asterisk_bt_20060609_1.txt). 'show channels' just after bt, but before restart:
phone2*CLI> show channels
Channel              Location             State   Application(Data)
Zap/32-1             7729720@fromoutside: Up      Queue(callcenterq|tw)
SIP/1117-3f4f        (None)               Up      Bridged Call(Zap/36-1)
Zap/36-1             1117@toagent:6       Up      Dial(SIP/1117|30|tw)
Zap/33-1             7729720@fromoutside: Up      Queue(callcenterq|tw)
SIP/gatewayTDO-0ec1  (None)               Up      Bridged Call(
<-- SIP read fro
Zap/40-1             7729720@fromoutside: Up      Queue(callcenterq|tw)
6 active channels
10 active calls



By: caspy (caspy) 2006-06-09 03:26:52

Some additional info:
SIP/gatewayTDO -- this is an AudioCodes gateway (MP-104/FXO). We use it very rarely, but today's failure happens when a call was transfered to gateway.

By: Serge Vecher (serge-v) 2006-06-09 08:16:05

caspy, what version is this from. Hopefully 1.2.9.1 ...

By: caspy (caspy) 2006-06-09 08:22:48

Asterisk SVN-branch-1.2-r30770 built on 2006-05-30 15:27:18 UTC
i have not seen any related patches in changlog since that date.

By: Serge Vecher (serge-v) 2006-06-09 09:37:33

How about 31921: app_queue.c return bridge exit logic to what it was before i broke it :-( ?

By: caspy (caspy) 2006-06-09 09:57:27

Hmm.. it doesn't look so relative for me. I'll do upgrade soon.

But what about this:
===
SIP/gatewayTDO-0ec1 (None) Up Bridged Call(
<-- SIP read fro
===
You see this lines now, exactly i saw them in console.
'sip debug' at that moment was already turned off. every time i executed 'show channels' i got this result. This is not a mixed lines. Asterisk really thought, 'that SIP/gatewayTDO' was bridged with "\n<-- SIP read fro".

By: Serge Vecher (serge-v) 2006-06-09 10:38:39

caspy: I just realized you've never provided a sip debug log. Can you please set bebug 4, set verbose 4, sip debug? Also, make sure debug is set for console in logger.conf. Thanks.

By: caspy (caspy) 2006-06-09 10:46:15

yes-yes.
Today, after crash, i tried to correlate and structurize all my previous troubles, and it seems to me, that audiocodes is one of Asterisk killers. Weekend i'll try to force Asterisk's crash by AudioCodes with debug turned on.

Anyway, some feedback will follow.

By: Serge Vecher (serge-v) 2006-06-09 10:55:44

ok, good luck. Don't forget to upgrade too :)

By: Fernando Romo (el_pop) 2006-06-10 14:02:05

The same erractic behaivor witn any 1.2.x version and without the Audiocodes. we use aastra sip phones and the ekiga softphones



By: Fernando Romo (el_pop) 2006-06-10 15:05:13

I forget:

When the CLI stop to respond to commands, the API-Manager stop to respond too.

For example if i send a "show agents" via the command action in the API manager, the respond is empty, even if we try many times.

We test with 3 customers with 2 E1's, 5E1's and 8E1's and see if we have more load, Asterisk stop to respond more faster.

With the 3 customers test with each tarball of the 1.2.x branch, test with the lasted version of the SVN branch and when we receive load, Asterisk become unusable and we need to restart asterisk.

With one call-center, we back to the 1.0.x branch and we could operate stable. seems like the problem belong to 1.2.x versions. We test with trunk versions with 4E1's and average of 100 calls peer minute and Asterisk stop to respond each 5 minutes. In this scenario we use Polycom Phones for administrative purposes, Agent Stations with Linux and the X-Lite softphone (NO Audiocodes), The same are working rigth now with asterisk 1.0.7, but we have a strange ACD problem: If we put 100 agents, we only have 50 simultaneous calls and a long queue even if Agents are avalable (the same occurs in 1.2.x).

My guess is how to the core of asterisk deal with the connections matrix and how priorise the diferents threads.

markster:

We can use one of our customers to make load test (a couple of days), and we test with arround 200 agents and 8E1s, let me know to coordinate and make this kind of stres test happend.

By: Fernando Romo (el_pop) 2006-06-10 17:21:13

Few moments ago, we find one particular symptom before the failure:

One agent try to log (no mather AgentLogin or AgentCallbackLogin) to the PBX, and the login process are refused with the message of "Agent alredy log on", but the agent not are never logged in before.

I think when we have a load in the asterisk system, the number of systems channels has a raised condition and the PBX can't generate any new channel and start to refuse any kind of channel or command.

in each failure this symptom repeat.

By: Michael Toop (mmmmmtoop) 2006-06-12 15:18:42

We are also experiencing exactly the symptoms described above with 1.2.7.1  This only is happening on our servers using app_queue.c with rather high loads. Our quiet box's don't seem to be having these issues. We have also noted rather strange memory usage on the box, sometimes 'free' reports  0 mem left. I know not the best measure...



By: Serge Vecher (serge-v) 2006-06-12 15:22:25

There were a couple of memory-leaks plugged yesterday, including a core pbx.c file. Can you, guys, please update to 1.2 branch r33638 and see if that fixes it? Thanks

By: Fernando Romo (el_pop) 2006-06-13 12:26:05

vechers:

The new version don't work, the error still happen even in trunk.

You can test with real load?

With more tha E1 (30 trunks) Asterisk become very unestable, with one customer we need to reload asterisk every one and half hour. With another we note this each 5 to 10 minutes (5 E1's)

By: Michael Toop (mmmmmtoop) 2006-06-13 14:58:02

Tried running latest 1.2 branch & had same issues.  
Anyone tried to run Valgrind (http://valgrind.org/) on Asterisk & check for mem. leaks etc.?

By: Serge Vecher (serge-v) 2006-06-13 15:04:27

MmmmmToop: please don't use the bug-tracker to ask questions. If you need to help using basic trouble-shooting techniques, please use the #asterisk channel on IRC or seek the help of an Asterisk consultant.

El Pop: you have provided no debugging information to link your problems to this issue. Please review the deadlock section on voip-info.org and upload the gdb output.

By: Geoff Oakham (goakham) 2006-06-14 11:21:50

We've been running Asterisk 1.2.9.1 for three days, and this has happened twice in the past 24h.  The major symptom is #1, desynchronization of the PRI.

By: Serge Vecher (serge-v) 2006-06-14 12:12:33

goakham: please try the latest 1.2 branch, I wonder if r31921 has fixed this (this was committed to 1.2 branch after 1.2.9.1 released). Thanks

By: Geoff Oakham (goakham) 2006-06-14 13:47:05

Um.. are you sure that's the correct revision number?  That appears to be reversing a change back to the 1.2.9.1 state:

$ svn checkout http://svn.digium.com/svn/asterisk/branches/1.2 asterisk-1.2
$ cd asterisk-1.2/
$ svn log -r 31920:31921
r31921 | kpfleming | 2006-06-03 23:43:35 -0400 (Sat, 03 Jun 2006) | 2 lines
return bridge exit logic to what it was before i broke it :-(
$ svn diff -r 31920:31921
Index: apps/app_queue.c
===================================================================
--- apps/app_queue.c    (revision 31920)
+++ apps/app_queue.c    (revision 31922)
@@ -2431,7 +2431,7 @@
               if (bridge != AST_PBX_NO_HANGUP_PEER)
                       ast_hangup(peer);
               update_queue(qe->parent, member);
-               res = bridge ? -1 : 0;
+               res = bridge ? bridge : 1;
       }
out:
       hangupcalls(outgoing, NULL);
$ svn diff --old http://svn.digium.com/svn/asterisk/tags/1.2.9.1/ --new http://svn.digium.com/svn/asterisk/branches/1.2/ apps/app_queue.c
$

By: Serge Vecher (serge-v) 2006-06-14 14:19:22

gokham: yes, you are right -- looks like I need new glasses here -- the fix in r31921 became part of 1.2.9.1 . Thanks for clarification here.

By: Geoff Oakham (goakham) 2006-06-14 14:33:39

I've turned on debug-logging.. do you have any other suggestions?  People are quite concerned here.

By: Serge Vecher (serge-v) 2006-06-14 14:37:28

a bt snapshot of a deadlocked Asterisk, like the ones caspy has produced, would be quite useful. Make sure that Asterisk is built with 'make dont-optimize'

By: Geoff Oakham (goakham) 2006-06-14 14:55:54

Thanks venchers, I'll try to recompile & restart.. it's a production machine though.

Would you consider it safe to run the system with 'DEBUG_THREADS'?

By: Serge Vecher (serge-v) 2006-06-14 15:07:36

not sure about DEBUG_THREADS, but dont-optimize should be safe for production use. Please consult the gurus of #asterisk-dev on DEBUG_THREADS

By: Fernando Romo (el_pop) 2006-06-14 17:08:37

I recompile asterisk with the last version of yesterday (SVN-branch-1.2-r33993) and present one big failure in 6 hours of operation and another in 3 hours.

I try to put debug info later, but is a production system and the persons in that system are mad about it.

By: Geoff Oakham (goakham) 2006-06-15 09:19:14

It happened again this morning.. attached a backtrace.  Unfortunately, I don't seem to have as much debugging information as casper got.

Both yesterday and today, I've noticed that the last action performed by the "hung" channel was Queue()ing a caller.  Here's an exert from today's crash:

Jun 15 08:33:42 VERBOSE[9202] logger.c:     -- Executing Queue("Zap/29-1","helpdesk|t|||270") in new stack
Jun 15 08:33:42 VERBOSE[9202] logger.c:     -- Started music on hold, class 'default', on channel 'Zap/29-1'
Jun 15 08:33:42,call-id 459,itd-helpdesk,NONE,ENTERQUEUE,,+1(613)555-3141 (last entry in queues.conf)
Jun 15 08:34:37 VERBOSE[15082] logger.c:     -- Channel 0/1, span 3 got hangup request
Jun 15 08:34:42 VERBOSE[15082] logger.c:     -- Channel 0/1, span 3 got hangup
Jun 15 08:35:24 VERBOSE[9240] logger.c:   == Spawn extension (macro-dial-iax, s, 1) exited non-zero on 'Zap/32-1' in macro 'dial-iax'
Jun 15 08:35:24 VERBOSE[9240] logger.c:   == Spawn extension (macro-dial-iax, s, 1) exited non-zero on 'Zap/32-1'
Jun 15 08:35:24 VERBOSE[9240] logger.c:     -- Hungup 'Zap/32-1'
Jun 15 08:35:25 WARNING[15082] chan_zap.c: Ring requested on channel 0/1 already in use on span 3.  Hanging up owner.

By: Geoff Oakham (goakham) 2006-06-15 10:38:24

I've gone over the source code changes between 1.2.7.1 and 1.2.9.1 to app_queue.c and noticed there have been some major changes to locking in r30546 and related changes in r30770.

We're serriously considering rolling back those changes to get our PBX stable again.. unless there's a fix.  bweschke.. any ideas?

By: Serge Vecher (serge-v) 2006-06-15 11:05:56



By: Fernando Romo (el_pop) 2006-06-15 21:47:30

SVN-branch-1.2-r34306

with this version produce a lot of warnigs of zap channels in use, but in the customer pbx has 60 trunks, 11 in use and the rest free and asterisk send:

Jun 15 21:41:35 WARNING[23884]: cdr.c:550 ast_cdr_disposition: Cause not handled
   -- Hungup 'Zap/1-1'
   -- B-channel 0/1 successfully restarted on span 1
   -- Requested transfer capability: 0x00 - SPEECH
   -- Channel 0/1, span 1 got hangup
   -- Forcing restart of channel 0/1 on span 1 since channel reported in use
Jun 15 21:41:36 WARNING[23888]: cdr.c:550 ast_cdr_disposition: Cause not handled
   -- Hungup 'Zap/1-1'
   -- B-channel 0/1 successfully restarted on span 1
   -- Requested transfer capability: 0x00 - SPEECH
   -- Channel 0/1, span 1 got hangup
   -- Forcing restart of channel 0/1 on span 1 since channel reported in use
Jun 15 21:41:36 WARNING[23892]: cdr.c:550 ast_cdr_disposition: Cause not handled
   -- Hungup 'Zap/1-1'
   -- B-channel 0/1 successfully restarted on span 1

i try to bak the app_queue to old version (r30546) to see what happend.

By: BJ Weschke (bweschke) 2006-06-16 07:10:41

I've been over the 1.2.X code again this morning and I really cannot see any circumstances from the new code that should cause a deadlock on top of the risks that already existed from prior code. The code introduced in r30546 was put in to stop mutexes from getting locked when app_queue receives state updates about devices that having nothing to do with app_queue.
The bt's attached here already confirm what we know, that there's a deadlock happening, but we don't have any data as to where the original mutex causing the deadlock was locked.
We need some DEBUG/VERBOSE traces up here along with queues.conf and extensions.conf (and agents.conf - if applicable) configurations attached here to try and get a better idea of what's going on.

By: BJ Weschke (bweschke) 2006-06-16 08:05:43

maybe related to 7334? Can anyone take a look and tell me if there are similarities in configs?

By: Fernando Romo (el_pop) 2006-06-16 16:06:09

Ok, i found some workaround stability: I compile SVN-branch-1.2-r34400 but i roll back app_queue.c to version r30546 and i have:

pbx*CLI> show uptime
System uptime: 15 hours, 34 minutes, 10 seconds

it's a record, the operation continue with high load, i supose the problem is the app_queue.c than make pbx.c to measure the call level in wrong way.

By: Fernando Romo (el_pop) 2006-06-17 11:03:19

the system seems stable with the back roll of app_queue.c:

pbx*CLI> show uptime
System uptime: 1 day, 10 hours, 34 minutes, 23 seconds

but ocasionaly i have some PRI HDLC errors, i check with the carrier any anomaly before blame asterisk.

The problem must be in app_queue.c and how deal with the locks.



By: Fernando Romo (el_pop) 2006-06-17 15:32:19

The workaround fail in 1 day 14:35 hours later with the same beaheivor. The back roll of app_queue.c delay the problem, but persist.

By: Vladimir S. Blazhkun (vovan) 2006-06-19 05:21:44

Same problem here, after 2-3 hours of high load asterisk becomes totally locked. For example: show queues doesn't show anything and then console locks. Logs has tons of:

Jun 19 12:55:23 DEBUG[29673] chan_sip.c: = No match Their Call ID: 74174E13-FEA711DA-A0E5B926-5E8B904A@10.0.0.61 Their Tag 3FA6B100-2326 Our tag: as72920e1e

By: caspy (caspy) 2006-06-19 06:51:41

1.2.9.1, crashed again on transfering call to AudioCodes gateway.

look for:

asterisk_bt_20060619_2.txt -- as usual,
debug_20060619.txt -- it a 'rasterisk | tee -a <file>' of asterisk's console with 'sip debug', 'set verbose 4', 'set debug 4'. File cleaned from other debug, and contain my comments.

Some comments:
SIP/1127 (10.10.11.210) -- operator, getting calls via Local/1127,
7677 -- fax number for operator
677@gatewayTDO -- fax machine via AudioCodes

Thanks.

ps: debug_20060619_1.txt is a cleared from special chars copy of debug_20060619.txt



By: Geoff Oakham (goakham) 2006-06-19 08:35:20

el_pop: did you get a backtrace of Asterisk during the last crash?  I curious if you found another bug.

The main reason I'm asking is because we experience this type of crash before our upgrade to 1.2.9.1, but only once in 3 months.  After upgrading to 1.2.9.1, we crashed 3 times in 4 days.  I'm suggesting there are two seperate bugs here.  Thoughts?

By: Geoff Oakham (goakham) 2006-06-19 08:44:16

We've been running on 1.2.9.1 patched to rollback the changes r30770 and r30546, and things appear to be stable so far.  I'll keep you posted--if we make it to Wednesday without a crash, I'll be a happy camper.

Oh, el_pop: I noticed you rolled back app_queue to r30546.. did you mean r30545 (ie, not including r30546)?  Also, r31921 appears to be a fix for r30424--maybe you should consider applying it to your active version.

By: caspy (caspy) 2006-06-19 15:55:39

Heh. It's reproducable. Rather hard to do, but possible. And too much work needed for log cleaning, cause of large amount of debug (i can do this only in production). :(

Scenario:
- call to queue (from zap in my case),
- take a call with agent (agents uses sjphone in my case),
- put call on hold,
- make a call to a _busy_(!) number, which is behind(!) sip\fxo gateway,
- while gateway is trying to dial that busy number, you need to select "transfer" menu item in sjphone for that call,
- and when, while you a navigating menu, this call leg dissapers,
your sjphone will do an invalid(?) transfer.

That moment you'll see a problem. I don't know why, but this invalid transfer attempt will put asterisk in locked state.

debug_20060619_1.txt contains the same scenario, that i got just now.
but, cause is's really annoing to clear debug log, i asks you to analyze debug_20060619_1.txt. Thanks.

By: caspy (caspy) 2006-06-19 16:04:31

also, it seems to me, that AudioCodes is not a problem itself.
here may be any SIP-FXO gateway, cause of delay while gw dialing number from fxo port to telco. this delay (up to 1+ sec) is sufficient to open transfer menu (but no to transfer, only select menu) in sjphone. so, when gw detects busy, and destroy a call from sjphone to it, you will get a dumb situation in sjphone: open menu for transfer for destroed call. and finishing of this transfer, put Asterisk in lock.

By: Serge Vecher (serge-v) 2006-06-19 16:09:58

caspy: I've looked at debug_20060619_1.txt and it does not contain the any of the debug output.

1) Make sure your logger.conf has the following line:
  console => notice,warning,error,debug
2) Restart asterisk
3) Enable SIP transaction logging with the following CLI commands:
set debug 4
set verbose 4
sip debug

Capture the full log and attach it here so this can be properly analysed.

By: Fernando Romo (el_pop) 2006-06-19 18:45:56

i compile with version SVN-branch-1.2-r34655 when i notice a chenge in pbx.c, so far the system has an uptime of 17 hours without a crash. i try to make a bugtrace when the customer end the workshift.

By: Serge Vecher (serge-v) 2006-06-19 21:01:46

el_pop: I don't think we need another backtrace here. caspy's analysis shows something very not kosher happening with a sip transfer out of the queue. If you have the same setup as described by caspy, then, please, produce the verbose sip debug I've requested. If you don't have the same setup, you may have a different issue. Thanks

By: Serge Vecher (serge-v) 2006-06-19 21:13:14

caspy: did you try running trunk at all? If your problems are indeed related to sip transfers, then you will benefit from better code in trunk. Also, trunk should be fairly stable right now, as it is steps away from being in beta 1.4 status.

By: caspy (caspy) 2006-06-20 02:17:19

hmm.. next evening i'll try to crash it again, but only evening (too much other activity).

but, it seems to me, that bug is in attempting to transfer a call into itself:

<-- SIP read from 10.10.11.210:5060:
REFER sip:495XXXXXXX@10.10.11.61 SIP/2.0
Via: SIP/2.0/UDP 10.10.11.210;rport;branch=z9hG4bK0a0a0bd200000175449646a800004c110000059b
Content-Length: 0
Contact: <sip:1127@10.10.11.210:5060>
Call-ID: 13c380d7783e881372ebd560453e866c@masterhost.ru
CSeq: 4 REFER
From: "unknown"<sip:1127@10.10.11.210:5060>;tag=2208271871221
Max-Forwards: 70
Referred-By: <sip:1127@10.10.11.210:5060>
Refer-To: <sip:495XXXXXXX@masterhost.ru?Replaces=13c380d7783e881372ebd560453e866c%40masterhost.ru%3Bto-tag%3Das3886ec4e%3Bfrom-tag%3D2208271871221>
To: <sip:495XXXXXXX@masterhost.ru>;tag=as3886ec4e
User-Agent: SJphone/1.60.289a (SJ Labs)


--- (12 headers 0 lines)---
Transfer to 495XXXXXXX in fromoffice
Transfer from 1127 in fromoffice
Jun 19 10:39:36 NOTICE[4042]: chan_sip.c:6866 get_refer_info: Supervised transfer attempted to transfer into same call id (13c380d7783e881372ebd560453e866c@masterhost.ru == 13c380d7783e881372ebd560453e866c@masterhost.ru)!
Reliably Transmitting (NAT) to 10.10.11.210:5060:
SIP/2.0 404 Not Found
Via: SIP/2.0/UDP 10.10.11.210;branch=z9hG4bK0a0a0bd200000175449646a800004c110000059b;received=10.10.11.210;rport=5060
From: "unknown"<sip:1127@10.10.11.210:5060>;tag=2208271871221
To: <sip:495XXXXXXX@masterhost.ru>;tag=as3886ec4e
Call-ID: 13c380d7783e881372ebd560453e866c@masterhost.ru
CSeq: 4 REFER
User-Agent: Asterisk PBX
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY
Contact: <sip:495XXXXXXX@10.10.11.61>
Accept: application/sdp
Content-Length: 0

---

By: Serge Vecher (serge-v) 2006-06-20 09:40:06

caspy: that's a very good possibility. Can you please attach all traces (no matter how small) as attachments -- mantis is mangling the stuff and turning it into links. Thanks.

By: caspy (caspy) 2006-06-20 20:52:47

asterisk_bt_20060621_1.txt
debug_20060621_1.txt

Thats was a clean Asterisk's start. Debug collected from start to end.

Scenario:
1) call from 6104314 (via Zap) to queue (callcenterq)
2) take a call with sjphone agent (SIP/1211 via Local/1211)
3) put call on hold
4) dial from sjphone to a busy line behind sip/fxp gateway (exten 7677, which dials 677@10.10.10.37)
5) click transfer in sjphone at 'correct' moment, so doing an attempt to transfer call into itself.

6) take a backtrace
7) issue 'show channels' (look at the end of debug_20060621_1)

Reproducibility == always.

Thats all.
Hope this helps.

By: Leonardo Gomes Figueira (sabbathbh) 2006-06-21 17:20:09

I have this lockup problems with queues here too after upgrading to 1.2.9.1 from 1.2.7.1. I had to downgrade back to 1.2.7.1 on production servers. The tests below were made on a devel box.

I uploaded 2 set of logs and backtraces of this 2 different results I get after transfering a call answered by an agent from a queue using SIP transfer:

1- Transfering the call to a Polycom IP300 the call gets "locked" on Polycom/Asterisk and Asterisk segfaults after issuing an "show channels":

debugagenttransfer_segfault_log.txt
debugagenttransfer_segfault_gdb.txt

2- Asterisk in deadlock on queue call

debugagenttransfer_queuedeadlock_log.txt
debugagenttransfer_queuedeadlock_gdb.txt

Both situations can be reproduced on every attempt.

Summary of the situations registered on log files:

1- Segfault situation:

A- SIP/3050 (Soyo) calls queue
B- Agent/3054 - SIP/3054 (Grandstream) answer call
C- SIP/3054 (Grandstream) makes a SIP attended transfer to SIP/3053 (Polycom)
D- SIP/3050 (Soyo) gets a hangup after the transfer
E- Call stay up on SIP/3053 (Polycom)
F- Issuing a "show channels" on cli segfault Asterisk:

pfdesenv*CLI> show channels
Channel              Location             State   Application(Data)
SIP/3053-89ad        (None)               Up      Bridged Call(Agent/3054)
Agent/3054           s@macro-atende:900   Up      Dial(SIP/3053|30|tT)
pfdesenv*CLI>
Disconnected from Asterisk server
Executing last minute cleanups

2- Deadlock situation:

A- SIP/3050 (Soyo) calls queue
B- Agent/3053 - SIP/3053 (Polycom) answer call
C- SIP/3053 (Polycom) makes a SIP attended transfer to SIP/3054 (Grandstream)
D- Both SIP/3050 (Soyo) and SIP/3054 (Grandstream) gets a hangup after the transfer
E- SIP/3050 (Soyo) calls queue again
F- Agent/3054 - SIP/3054 (Grandstream) answer call
G- SIP/3050 (Soyo) hangup the call
H- SIP/3054 (Grandstream) gets a hangup
Everything is "fine" up to now (except by the unsucessuful transfer that results on hangup)
I- SIP/3050 (Soyo) calls queue for the third time
J- Asterisk deadlocks on Queue:

-- Executing Queue("SIP/3050-0fb0", "desenvolvimento|tr|||600") in new stack

Every queue/agent related command stops working on cli. Asterisk must be restarted.

I hope the backtraces are useful, if I missed some info please ask and I will be glad to help tracking this problem.

Thanks,

 Leonardo

By: Leonardo Gomes Figueira (sabbathbh) 2006-06-21 17:33:58

I forgot to add some info:

1- This situations are not load related. On this tests I start asterisk on devel box, make the 2 or 3 calls to queue like said on previous post and the problem occurs. There is no other calls or external services running during this tests.

2- This box has 2 E1 ports, 1 PRI span configured but not used (no cable):

pfdesenv*CLI> zap show status
Description                              Alarms     IRQ        bpviol     CRC4
T2XXP (PCI) Card 0 Span 1                RED        0          0          0
T2XXP (PCI) Card 0 Span 2                RED        0          0          0

So I think the problem it's not related to any hardware/zaptel/pri stuff

3- There is no manager connections during this tests.

4- I'm not using MixMonitor, I'm using Monitor but I just tested disabling all monitor and queue recording and repeated the test: same problem.

By: caspy (caspy) 2006-06-26 02:34:41

vechers, any news? can i help somehow?

By: Serge Vecher (serge-v) 2006-06-26 10:05:48

caspy: see my note in 0048137. Trunk is almost 1.4 beta right now, so we need to know if this bug is present there. Thanks.

By: Olle Johansson (oej) 2006-06-26 10:28:10

Please try with latest svn 1.2. Thanks.

By: Geoff Oakham (goakham) 2006-06-26 11:29:53

An update on my previous post: we've been running 1.2.9.1 build without r30770 and r30546, since June 16th (so 6 business days total) and experienced only two crashes, both on June 21.  So it doesn't appeared to have solved the problem, but might have reduced the frequency.

After analysing the crashes, I found one of them co-incided with a queue call being rejected (busy).. so I've asked the queue agents to not put their phones in 'do not disturb' or hit 'reject' with queue calls.

Unfortunately, I was unable to get decent debug information from either crash, but I'm ready for the next time.

By: Serge Vecher (serge-v) 2006-06-26 11:38:15

goakham, el_pop, sabbathbh.

caspy has narrowed down this issue to a very specific set of events that has been described in note 0048192 in detail. If you have pri desynchronization, but it does not happen the same way is in caspy's case, please open a new bug so those problems/bugs can get better attention. Thanks

If your problems are reproduced in a way similar to caspy's, then please test the latest revision of 1.2 branch, which oej has hopefully fixed with r36043. Thanks

By: caspy (caspy) 2006-06-26 14:26:51

oej,

SVN-branch-1.2-r36043,
the same scenario as in 0048192 comment,
crashed.

Jun 26 23:19:13 NOTICE[7985] chan_sip.c: Supervised transfer attempted to transfer into same call id (0fafbb5a01be1b1206c56cf462f80091@masterhost.ru == 0fafbb5a01be1b1206c56cf462f80091@masterhost.ru)!

phone2*CLI> show channels
Channel              Location             State   Application(Data)
0 active channels
1 active call
Jun 26 23:19:19 WARNING[8028]: channel.c:787 channel_find_locked: Avoided initial deadlock for '0x40623cc0', 10 retries!

and backtrace (asterisk_bt_20060626_1.txt) attached.

By: Leonardo Gomes Figueira (sabbathbh) 2006-06-29 12:55:35

Hello,

repeated my tests described on note 0048236 with last 1.2 branch and both the segfault and deadlock still happens.

Asterisk SVN-branch-1.2-r36254M built by root @ pfdesenv on a i686 running Linux on 2006-06-29 17:29:20 UTC

vechers,

my issues are triggered by SIP transfer and are well described on note 0048236, it's better open a new issue like you said in 0048390 or stick with this ? They are not related to PRI.

 Leonardo

By: Serge Vecher (serge-v) 2006-06-29 14:38:20

sabbathbh: yes, I've reread your description and you definately have a different problem. Please open a new bug report and repost your analysis of the problem, steps to reproduce, as well as sip debug. Thanks.

By: Leonardo Gomes Figueira (sabbathbh) 2006-06-30 07:07:51

Opened 7458 with my especific issues.

http://bugs.digium.com/view.php?id=7458

By: Olle Johansson (oej) 2006-07-02 14:36:28

caspy: Can I get an updated SIP debug with the latest svn 1.2? I need to see what happens now that we don't retransmit the reply. I need a full SIP debug with verbose set to 4 and debug set to 4, make sure you capture all output on the verbose and debug channel. THank you.

By: caspy (caspy) 2006-07-02 16:45:16

SVN-branch-1.2-r36377
asterisk_bt_20060703_1.txt
debug_20060703_1.txt

Please, look for comment 0048192. Here was exactly tha same scenario.

By: Geoff Oakham (goakham) 2006-07-18 13:34:05

FYI, I can confirm this bug is *not* present in Asterisk 1.2.7.1.  Has anyone tried 1.2.10?

By: Serge Vecher (serge-v) 2006-09-01 14:25:58

caspy, goakham: what's the status in 1.2.11 / latest 1.2 branch?

By: Serge Vecher (serge-v) 2006-09-20 10:29:30

Presumed fixed in latest Asterisk release 1.2.12.1. If not the case, please ask the bug-marshall to reopen the issue, thanks.