Summary: | ASTERISK-10201: 'Unknown' member status in app_queue | ||
Reporter: | jfitzgibbon (jfitzgibbon) | Labels: | |
Date Opened: | 2007-08-30 08:23:31 | Date Closed: | 2011-06-07 14:03:09 |
Priority: | Major | Regression? | No |
Status: | Closed/Complete | Components: | Applications/app_queue |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | (0) backtrace.txt (1) core_show_locks.out | |
Description: | Without any obvious trigger, a large number of the members in all of my queues changed to a status of 'Unknown':

cs_billing has 11 calls (max unlimited) in 'rrmemory' strategy (114s holdtime), W:0, C:1771, A:58, SL:93.8% within 60s
   Members:
      SIP/1405 (dynamic) (Unknown) has taken no calls yet
      SIP/1420 (dynamic) (paused) (Not in use) has taken no calls yet
      SIP/1442 (dynamic) (paused) (Unknown) has taken 3 calls (last was 1523 secs ago)
      SIP/1440 (dynamic) (In use) has taken 4 calls (last was 1262 secs ago)
      SIP/1428 (dynamic) (paused) (Not in use) has taken 7 calls (last was 2969 secs ago)
      SIP/1404 (dynamic) (paused) (Not in use) has taken 7 calls (last was 2918 secs ago)
      SIP/1429 (dynamic) (paused) (Unknown) has taken 17 calls (last was 5 secs ago)
      SIP/1432 (dynamic) (Unavailable) has taken 17 calls (last was 965 secs ago)
      SIP/1430 (dynamic) (In use) has taken 15 calls (last was 3506 secs ago)
      SIP/1435 (dynamic) (In use) has taken 17 calls (last was 1808 secs ago)
      SIP/1434 (dynamic) (Unavailable) has taken 19 calls (last was 827 secs ago)
      SIP/1424 (dynamic) (In use) has taken 24 calls (last was 1277 secs ago)
      SIP/1408 (dynamic) (paused) (Not in use) has taken 22 calls (last was 2770 secs ago)
      SIP/1203 (dynamic) (In use) has taken 16 calls (last was 1730 secs ago)
      SIP/1410 (dynamic) (Unknown) has taken 20 calls (last was 292 secs ago)
   Callers:
      1. Zap/60-1 (wait: 8:51, prio: 0)
      2. Zap/65-1 (wait: 6:04, prio: 0)
      3. Zap/71-1 (wait: 5:50, prio: 0)
      4. Zap/69-1 (wait: 5:22, prio: 0)
      5. Zap/26-1 (wait: 4:51, prio: 0)
      6. Zap/28-1 (wait: 4:14, prio: 0)
      7. Zap/27-1 (wait: 3:33, prio: 0)
      8. Zap/30-1 (wait: 2:45, prio: 0)
      9. Zap/33-1 (wait: 1:58, prio: 0)
      10. Zap/34-1 (wait: 1:48, prio: 0)
      11. Zap/35-1 (wait: 1:21, prio: 0)

This has happened once before (when we were running 1.4.9) just over a month ago. I was unable to reproduce the behaviour in a lab environment.
When this happens, ringinuse=no stops being effective, because 'Unknown' members are considered available to take a call. app_queue starts to dequeue calls to agents who are already on a call. The SIP channels of the agents have a call-limit of 2, so when this happens the log fills up with:

pbxtel-01*CLI> [Aug 29 16:43:08] ERROR[22762]: chan_sip.c:3169 update_call_counter: Call to peer '1410' rejected due to usage limit of 2
    -- Couldn't call SIP/1410
pbxtel-01*CLI> [Aug 29 16:43:08] ERROR[22762]: chan_sip.c:3169 update_call_counter: Call to peer '1429' rejected due to usage limit of 2
    -- Couldn't call SIP/1429
pbxtel-01*CLI> [Aug 29 16:43:09] ERROR[22851]: chan_sip.c:3169 update_call_counter: Call to peer '1429' rejected due to usage limit of 2
    -- Couldn't call SIP/1429
pbxtel-01*CLI> [Aug 29 16:43:09] ERROR[22851]: chan_sip.c:3169 update_call_counter: Call to peer '1410' rejected due to usage limit of 2
    -- Couldn't call SIP/1410
pbxtel-01*CLI> [Aug 29 16:43:09] ERROR[22712]: chan_sip.c:3169 update_call_counter: Call to peer '1429' rejected due to usage limit of 2
    -- Couldn't call SIP/1429
pbxtel-01*CLI> [Aug 29 16:43:09] ERROR[22712]: chan_sip.c:3169 update_call_counter: Call to peer '1410' rejected due to usage limit of 2
    -- Couldn't call SIP/1410

Removing and re-adding agents does not fix things: they go back into an 'Unknown' state as soon as they have completed a call. The only way I could resolve the issue was to restart Asterisk.

I killed the running process to generate a core file, which is attached. The tarball also contains a full backtrace and a copy of the asterisk binary, which is from a 'Linux pbxtel-01.comwave 2.6.9-55.ELsmp #1 SMP Wed May 2 14:04:42 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux' system running CentOS 4.5. There is nothing obvious in the logs preceding the trouble to indicate why agents were marked as 'Unknown'.
****** ADDITIONAL INFORMATION ****** Report marked as private because it contains a full core dump which probably contains passwords. If the core can be removed by whoever this is assigned to, the report can be marked public. | ||
Comments: | By: jfitzgibbon (jfitzgibbon) 2007-08-30 08:26:11
The core / binary backtrace is too large to attach, so I've just attached the backtrace file. I'll find a place to put a tarball of the core and binary for download.

By: jfitzgibbon (jfitzgibbon) 2007-08-30 08:28:06
Core and binary can be downloaded from http://carriersupport.comwave.net/asterisk-bugs.core.10605.tar.gz

By: Jason Parker (jparker) 2007-08-30 10:08:56
Why is this marked as private?

By: jfitzgibbon (jfitzgibbon) 2007-08-30 10:19:08
I explained the reason for marking it private in 'Additional Info'

By: jfitzgibbon (jfitzgibbon) 2007-08-30 10:22:14
Corydon76 indicates that the core is useless to you, so I've removed it from the URL I posted. This can be marked public.

By: jfitzgibbon (jfitzgibbon) 2007-09-05 08:33:14
This happened again on Tuesday 09/04, requiring a restart. Nothing in the logs to indicate why the SIP channels went into an unknown state. I took a snapshot of 'core show locks' before killing * to generate a core dump, which I've attached. I'm recommending that we go back to 1.4.7.1, which is the last version of Asterisk that we haven't seen catastrophic failure of app_queue on. It's also the last version before all the work to improve realtime queues was started.

By: jmls (jmls) 2007-09-12 16:40:59
is this resolved by the large changes to app_queue in 1.4 svn ?

By: Mark Michelson (mmichelson) 2007-10-11 09:55:58
I just asked for a status update on this one, since it appears that this is a chan_sip issue. It appears that this is not resolved yet.

By: jfitzgibbon (jfitzgibbon) 2007-10-11 10:18:54
Management has me locked to 1.4.7.1 (which has not exhibited *any* problems since we rolled back), so I can't tell if things have been fixed in 1.4.12/.13/SVN. The plan right now is to wait for 1.4 ABE. I know that doesn't help squash this in any way, but since the bug (and others I have with app_queue) only manifest under production load, my hands are tied.
This should probably be closed; if I at some point in the future get permission to try later revisions I'll re-open it or re-file, whatever Mantis allows. Thanks

By: Mark Michelson (mmichelson) 2007-11-01 14:07:10
Closing at reporter's request. |
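
Since the failure left no trace in the logs, one way to notice it early would be to scan the queue member listing for the 'Unknown' status. The following is only a minimal sketch: it assumes the member line format exactly as quoted in the description (Asterisk 1.4; other versions may print it differently), and the names `parse_member` and `unknown_members` are hypothetical helpers, not part of Asterisk. How the listing is obtained from the running system is left out.

```python
import re

# Matches one member line as quoted in the description, e.g.
#   SIP/1442 (dynamic) (paused) (Unknown) has taken 3 calls (last was 1523 secs ago)
# Assumption: the last parenthesised token before "has taken" is the device status.
MEMBER_RE = re.compile(
    r"^\s*(?P<iface>\S+)(?P<flags>(?:\s+\([^)]*\))+)\s+has taken\s+"
    r"(?:no calls yet|(?P<calls>\d+) calls \(last was (?P<last>\d+) secs ago\))\s*$"
)


def parse_member(line):
    """Return a dict describing one member line, or None if it does not match."""
    m = MEMBER_RE.match(line)
    if m is None:
        return None
    flags = re.findall(r"\(([^)]*)\)", m.group("flags"))
    return {
        "interface": m.group("iface"),
        "status": flags[-1],            # e.g. "Unknown", "In use", "Not in use"
        "paused": "paused" in flags,
        "calls": int(m.group("calls") or 0),
    }


def unknown_members(output):
    """List the interfaces whose status is 'Unknown' in a block of member lines."""
    members = (parse_member(line) for line in output.splitlines())
    return [m["interface"] for m in members if m and m["status"] == "Unknown"]


# Sample lines copied from the description above.
SAMPLE = """\
SIP/1405 (dynamic) (Unknown) has taken no calls yet
SIP/1420 (dynamic) (paused) (Not in use) has taken no calls yet
SIP/1442 (dynamic) (paused) (Unknown) has taken 3 calls (last was 1523 secs ago)
SIP/1440 (dynamic) (In use) has taken 4 calls (last was 1262 secs ago)
"""

print(unknown_members(SAMPLE))  # ['SIP/1405', 'SIP/1442']
```

A check like this only detects the symptom; it does not explain why the SIP devices lose their state, which the ticket leaves unresolved.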