ASTERISK-10201: 'Unknown' member status in app_queue

Summary: ASTERISK-10201: 'Unknown' member status in app_queue

Reporter: jfitzgibbon (jfitzgibbon)
Labels:
Date Opened: 2007-08-30 08:23:31
Date Closed: 2011-06-07 14:03:09
Priority: Major
Regression? No
Status: Closed/Complete
Components: Applications/app_queue
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) backtrace.txt
(1) core_show_locks.out

Description: Without any obvious trigger, a large number of the members in all of my queues changed to a status of 'Unknown':

cs_billing has 11 calls (max unlimited) in 'rrmemory' strategy (114s holdtime), W:0, C:1771, A:58, SL:93.8% within 60s
Members:
SIP/1405 (dynamic) (Unknown) has taken no calls yet
SIP/1420 (dynamic) (paused) (Not in use) has taken no calls yet
SIP/1442 (dynamic) (paused) (Unknown) has taken 3 calls (last was 1523 secs ago)
SIP/1440 (dynamic) (In use) has taken 4 calls (last was 1262 secs ago)
SIP/1428 (dynamic) (paused) (Not in use) has taken 7 calls (last was 2969 secs ago)
SIP/1404 (dynamic) (paused) (Not in use) has taken 7 calls (last was 2918 secs ago)
SIP/1429 (dynamic) (paused) (Unknown) has taken 17 calls (last was 5 secs ago)
SIP/1432 (dynamic) (Unavailable) has taken 17 calls (last was 965 secs ago)
SIP/1430 (dynamic) (In use) has taken 15 calls (last was 3506 secs ago)
SIP/1435 (dynamic) (In use) has taken 17 calls (last was 1808 secs ago)
SIP/1434 (dynamic) (Unavailable) has taken 19 calls (last was 827 secs ago)
SIP/1424 (dynamic) (In use) has taken 24 calls (last was 1277 secs ago)
SIP/1408 (dynamic) (paused) (Not in use) has taken 22 calls (last was 2770 secs ago)
SIP/1203 (dynamic) (In use) has taken 16 calls (last was 1730 secs ago)
SIP/1410 (dynamic) (Unknown) has taken 20 calls (last was 292 secs ago)
Callers:
1. Zap/60-1 (wait: 8:51, prio: 0)
2. Zap/65-1 (wait: 6:04, prio: 0)
3. Zap/71-1 (wait: 5:50, prio: 0)
4. Zap/69-1 (wait: 5:22, prio: 0)
5. Zap/26-1 (wait: 4:51, prio: 0)
6. Zap/28-1 (wait: 4:14, prio: 0)
7. Zap/27-1 (wait: 3:33, prio: 0)
8. Zap/30-1 (wait: 2:45, prio: 0)
9. Zap/33-1 (wait: 1:58, prio: 0)
10. Zap/34-1 (wait: 1:48, prio: 0)
11. Zap/35-1 (wait: 1:21, prio: 0)
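
One way to notice this condition before callers pile up is to scan the queue listing for members reported as 'Unknown'. The following is a minimal sketch only, assuming the member lines look exactly like the output quoted above; the function names and the list of status labels are hypothetical and are not part of Asterisk.

```python
import re

# Status labels as they appear in the output quoted in this report
# (assumption: other Asterisk versions may print additional labels).
STATUSES = ("Unknown", "Not in use", "In use", "Unavailable")

# Matches member lines such as:
#   SIP/1442 (dynamic) (paused) (Unknown) has taken 3 calls (...)
MEMBER_RE = re.compile(
    r"^\s*(?P<iface>\S+/\S+)\s+(?P<flags>(?:\([^)]*\)\s*)+)has taken"
)


def member_statuses(queue_show_text):
    """Return {interface: status} for every member line found in the text."""
    result = {}
    for line in queue_show_text.splitlines():
        match = MEMBER_RE.match(line)
        if not match:
            continue  # queue header, 'Members:', 'Callers:' and caller lines
        flags = re.findall(r"\(([^)]*)\)", match.group("flags"))
        status = next((f for f in flags if f in STATUSES), None)
        if status is not None:
            result[match.group("iface")] = status
    return result


def unknown_members(queue_show_text):
    """Return a sorted list of interfaces whose status is 'Unknown'."""
    statuses = member_statuses(queue_show_text)
    return sorted(i for i, s in statuses.items() if s == "Unknown")
```

Fed the listing above, unknown_members() would be expected to report the interfaces shown with (Unknown); how the listing is captured from the CLI is left out on purpose.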

This has happened once before (when we were running 1.4.9) just over a month ago. I was unable to reproduce the behaviour in a lab environment.

When this happens, ringinuse=no stops being effective (because 'Unknown' members are considered available to take a call). app_queue starts to dequeue calls to agents who are already on a call. The SIP channels of the agents have a call-limit of 2, so when this happens the log fills up with:

pbxtel-01*CLI>
[Aug 29 16:43:08] ERROR[22762]: chan_sip.c:3169 update_call_counter: Call to peer '1410' rejected due to usage limit of 2
-- Couldn't call SIP/1410

pbxtel-01*CLI>
[Aug 29 16:43:08] ERROR[22762]: chan_sip.c:3169 update_call_counter: Call to peer '1429' rejected due to usage limit of 2
-- Couldn't call SIP/1429

pbxtel-01*CLI>
[Aug 29 16:43:09] ERROR[22851]: chan_sip.c:3169 update_call_counter: Call to peer '1429' rejected due to usage limit of 2
-- Couldn't call SIP/1429

pbxtel-01*CLI>
[Aug 29 16:43:09] ERROR[22851]: chan_sip.c:3169 update_call_counter: Call to peer '1410' rejected due to usage limit of 2
-- Couldn't call SIP/1410

pbxtel-01*CLI>
[Aug 29 16:43:09] ERROR[22712]: chan_sip.c:3169 update_call_counter: Call to peer '1429' rejected due to usage limit of 2
-- Couldn't call SIP/1429

pbxtel-01*CLI>
[Aug 29 16:43:09] ERROR[22712]: chan_sip.c:3169 update_call_counter: Call to peer '1410' rejected due to usage limit of 2
-- Couldn't call SIP/1410
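
The failure mode described above can be illustrated with a small model. This is a simplified sketch of the behaviour as reported here, not the app_queue source: the function name and status constants are hypothetical, and the rule that 'Unknown' counts as available under ringinuse=no is taken from the description in this report.

```python
# Hypothetical status names mirroring the labels in the queue listing.
NOT_IN_USE = "Not in use"
IN_USE = "In use"
UNKNOWN = "Unknown"


def may_ring(status, ringinuse):
    """Simplified model: would the queue attempt to ring this member?"""
    if ringinuse:
        # ringinuse=yes: busy members may be rung as well.
        return True
    # ringinuse=no: only members believed to be idle should be rung.
    # ASSUMPTION (from this report): 'Unknown' is treated as available,
    # so a busy agent whose state is reported as 'Unknown' is still rung.
    return status in (NOT_IN_USE, UNKNOWN)


# An agent on a call is skipped while its state is tracked correctly...
assert may_ring(IN_USE, ringinuse=False) is False
# ...but is rung again once its state degrades to 'Unknown'.
assert may_ring(UNKNOWN, ringinuse=False) is True
```

Under this model, each call offered to a busy 'Unknown' member runs into the SIP call-limit instead of being filtered out by the queue, which matches the rejections logged above.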

Removing and re-adding agents does not fix things - they go back into an Unknown state as soon as they have completed a call.

The only way I could resolve the issue was to restart Asterisk. I killed the running process to generate a core file, which is attached. The tarball also contains a full backtrace and a copy of the asterisk binary, which is from a 'Linux pbxtel-01.comwave 2.6.9-55.ELsmp #1 SMP Wed May 2 14:04:42 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux' system running CentOS 4.5.

There is nothing obvious in the logs preceding the trouble to indicate why agents were marked as 'Unknown'.

****** ADDITIONAL INFORMATION ******

Report marked as private because it contains a full core dump which probably contains passwords. If the core can be removed by whoever this is assigned to, the report can be marked public.

Comments: By: jfitzgibbon (jfitzgibbon) 2007-08-30 08:26:11

The core / binary backtrace is too large to attach, so I've just attached the backtrace file. I'll find a place to put a tarball of the core and binary for download.
By: jfitzgibbon (jfitzgibbon) 2007-08-30 08:28:06

Core and binary can be downloaded from http://carriersupport.comwave.net/asterisk-bugs.core.10605.tar.gz
By: Jason Parker (jparker) 2007-08-30 10:08:56

Why is this marked as private?
By: jfitzgibbon (jfitzgibbon) 2007-08-30 10:19:08

I explained the reason for marking it private in 'Additional Info'
By: jfitzgibbon (jfitzgibbon) 2007-08-30 10:22:14

Corydon76 indicates that the core is useless to you, so I've removed it from the URL I posted. This can be marked public.
By: jfitzgibbon (jfitzgibbon) 2007-09-05 08:33:14

This happened again on Tuesday 09/04, requiring a restart. Nothing in the logs to indicate why the SIP channels went into an unknown state.

I took a snapshot of 'core show locks' before killing * to generate a core dump, which I've attached.

I'm recommending that we go back to 1.4.7.1, which is the last version of Asterisk on which we haven't seen a catastrophic failure of app_queue. It's also the last version before all the work to improve realtime queues was started.
By: jmls (jmls) 2007-09-12 16:40:59

Is this resolved by the large changes to app_queue in 1.4 SVN?
By: Mark Michelson (mmichelson) 2007-10-11 09:55:58

I just asked for a status update on this one, since it appears that this is a chan_sip issue. It appears that this is not resolved yet.
By: jfitzgibbon (jfitzgibbon) 2007-10-11 10:18:54

Management has me locked to 1.4.7.1 (which has not exhibited *any* problems since we rolled back), so I can't tell if things have been fixed in 1.4.12/.13/SVN.

The plan right now is to wait for 1.4 ABE. I know that doesn't help squash this in any way, but since the bug (and others I have with app_queue) only manifests under production load, my hands are tied.

This should probably be closed; if I at some point in the future get permission to try later revisions I'll re-open it or re-file, whatever Mantis allows.

Thanks
By: Mark Michelson (mmichelson) 2007-11-01 14:07:10

Closing at reporter's request.