Summary:ASTERISK-20690: Call gets stuck in Queue and all subsequent calls to the queue do not get delivered to available agents.
Reporter:Jeff Hoppe (jhoppebugs)Labels:
Date Opened:2012-11-16 12:42:25.000-0600Date Closed:2018-01-02 08:30:35.000-0600
Status:Closed/CompleteComponents:Applications/app_queue PBX/pbx_realtime
Versions:10.8.0 10.10.0 Frequency of
Environment:Centos 5.7 2.6.18-274.17.1.el5Attachments:( 0) Agent_Login_and_Logoff.pdf
( 1) Answers_to_Questions.pdf
( 2) callstuckinqueue_20121203.pcap
( 3) debug.22.gz
( 4) debug.23.gz
( 5) fulllog.txt
( 6) jira_asterisk_20690_combined_v10.patch
( 7) jira_asterisk_20690_mrg_up_v10.patch
( 8) jira_asterisk_20690_one_login_v10.patch
( 9) queues.conf
(10) sip.conf
(11) Stuck_in_queue_debug.txt
(12) Summary.txt
Description:A call will get stuck in the queue for some reason, the call does not actually exist anymore but Asterisk still thinks it does.  This call never gets sent to an agent and subsequent calls to the queue are never sent to available agents.
Comments:By: Jeff Hoppe (jhoppebugs) 2012-11-16 12:43:18.659-0600

Here is a debug log for the full log snippet.

By: Jeff Hoppe (jhoppebugs) 2012-11-16 12:44:00.944-0600

Here is the full log for the call hitting the system and getting into the queue.

By: Jeff Hoppe (jhoppebugs) 2012-11-16 12:46:22.599-0600

1 minute after the call hit the queue the RTP activity ceased.  Then another minute later the call got disconnected, but yet Asterisk still thought it had it.

11/14/2010 12:22:08|" VERBOSE"|29311|" pbx_realtime.c"|"-- Executing |80086055041@ivrcontext1-pn:6| Queue("SIP/hq-ast-006-user-0000583d", "RBENGLISH,tT,,,1500,,,,overflow15")"|""
11/14/2010 12:22:08|" VERBOSE"|29311|" res_musiconhold.c"|"-- Started music on hold, class '10001270-rbenglish', on SIP/hq-ast-006-user-0000583d"|""
11/14/2010 12:24:12|" NOTICE"|20329|" chan_sip.c"|"Disconnecting call 'SIP/hq-ast-006-user-0000583d' for lack of RTP activity in 61 seconds"|""

By: Rusty Newton (rnewton) 2012-11-26 11:39:47.238-0600


For the "stuck in queue debug" , can you get the same debug but with VERBOSE at level 5 enabled as well?

Other items that will help anyone looking into this:

-point out where the stuck call begins just to make it easy on those looking through it.
-a SIP & RTP packet trace on both systems for the first call that gets stuck would be very helpful.
-attach sanitized queues.conf configuration and sip.conf configuration.

By: Jeff Hoppe (jhoppebugs) 2012-11-26 13:06:39.489-0600

I will get this information you asked.  It may take 2 - 4 weeks to produce this issue.   I plan on upgrading to 10.10.0 in 2 days, so if we get this issue again it will be with that version.

By: Jeff Hoppe (jhoppebugs) 2012-12-06 13:13:07.682-0600

This is the starting debug file.

By: Jeff Hoppe (jhoppebugs) 2012-12-06 13:13:34.094-0600

This is the ending debug file.

By: Jeff Hoppe (jhoppebugs) 2012-12-06 13:28:14.974-0600

This is a summary of what to look for in the debug files.

By: Jeff Hoppe (jhoppebugs) 2012-12-06 13:34:07.339-0600

I attached the debug file at level 5, the first file is names debug.23 and the second file is debug.22.   Queue.conf and Sip.Conf have been applied.  I am using version 10.10.0  
CallStuckInQueue_20121203.pcap is the SIP on the call that got stuck in queue.

The summary.txt file is verbiage that should help the developer look through debug file.  Let me know if I can contribute more on this.

By: Jeff Hoppe (jhoppebugs) 2012-12-06 13:42:50.582-0600

Is this all you need from me?

By: Matt Jordan (mjordan) 2012-12-06 13:58:00.448-0600

I'm curious - do you have this problem if you don't use pbx_realtime?  Is it possible for you to test a configuration that is using standard Asterisk dialplan, as opposed to pbx_realtime?

By: Jeff Hoppe (jhoppebugs) 2012-12-06 14:04:08.610-0600

We have 16000 dialplan entries in our database. I don't want to manage that size in the config file.  We also change them frequently.  

By: Matt Jordan (mjordan) 2012-12-17 08:50:19.084-0600

I think we're looking for a needle in a stack of needles right now. We need some additional information to try and narrow down what is occurring so that we can reproduce the problem, or at least determine what path of execution in Asterisk could potentially be causing the problem.

# In the logs provided, which channels are stuck?
# What dialplan were they executing when they were left in memory? Can you provide the necessary portions of your dialplan and queues.conf from your databases?
# When a channel is stuck, is it always the same portions of dialplan? Is it different? Are the agents or callers doing particular actions?

Please keep in mind that {{pbx_realtime}} is extended support, and as such, if this ends up being a problem with {{pbx_realtime}} - that is, we cannot reproduce it using normal Asterisk dialplan - response times may reflect the location of the problem.

By: Jeff Hoppe (jhoppebugs) 2013-01-04 10:30:31.973-0600

I believe I have answered your questions in the Answers to Questions.pdf attached.
The Agent Login and Logoff is an addendum, the Answeres to Questions.pdf

Let me know if you need more information.

By: Matt Jordan (mjordan) 2013-01-15 17:14:17.123-0600

Thanks for the detailed analysis. It's still a bit murky as to who the culprit is, but based on your logs, there are a few observations:
# There are some database transactions that appear to be exacerbating whatever issue you are seeing. When the agent logs off, an ODBC transaction occurs to update what I believe is the Queue log. This takes quite a long time, and it does in fact happen immediately before some cleanup and device state changes that affect the status of the agent channel. It's possible this is what is creating the skewed situation where the Agentlogoff takes a very long time, thereby preventing the full cleanup of the current Agent channel and the Agent pvt before the Agent attempts another login.
# Whether or not {{pbx_realtime}} is exacerbating this situation I'm not sure - however, it's possible it could be. Are the queue_log and the dialplan stored in the same database?
# It's not (unfortunately) terribly surprising that {{chan_agent}} would run into this, given it's threading issues. However, the calls out to external systems may be making this issue more likely to happen.

As it is, it's definitely a bug in at least {{chan_agent}}, albeit one that would be very difficult to reproduce.

By: Jeff Hoppe (jhoppebugs) 2013-01-17 10:46:50.244-0600

Based on your analysis, I noticed that we are using the same res_odbc.conf database connection for all our real time entities.  This may be what is holding up the queue_log insert.  I will give it its own connection in res_odbc.conf.  
To answer your question, dialplan and queue_log are in the same database, along with static realtime agents, voicemails and musiconhold and finally along with realtime queues and queue_members.

The work around I plan to implement on this is as follows: The way I see it is that I have to make sure that the AgentLogin() application does not get called before I get an AGENTLOGOFF event for the previous connection for a particular agent.  My application that listens to the manager events (AMI) will insert/delete and entry in a table based on the AGENTLOGIN and AGENTLOGOFF events respectively. Upon an agent logging in, the Asterisk dialplan will make sure that no record exists in this table for the given agent, before I call the AgentLogin().  This should prevent the AGENTLOGIN coming before the  AGENTLOGOFF event.

The only thing I am assuming for this work around is that the AGENTLOGOFF manager event gets sent after all agent channel clean up has occurred.

By: Richard Mudgett (rmudgett) 2013-01-17 12:32:23.756-0600

[^jira_asterisk_20690_mrg_up_v10.patch] - Brings chan_agent.c up-to-date with v1.8 patches.
Multiple revisions 378427,378456,378486

 r378427 | rmudgett | 2013-01-03 11:41:46 -0600 (Thu, 03 Jan 2013) | 4 lines

 chan_agent: Fix agent_indicate() locking.

 Avoid deadlock potential with local channels and simplify the locking.
 r378456 | rmudgett | 2013-01-03 12:39:41 -0600 (Thu, 03 Jan 2013) | 14 lines

 chan_agent: Misc code cleanup.

 * Fix off-nominal path resource cleanup in agent_request().

 * Create agent_pvt_destroy() to eliminate inlined versions in many places.

 * Pull invariant code out of loop in add_agent().

 * Remove redundant module user references in login_exec().

 * Remove unused struct agent_pvt logincallerid[] member.

 * Remove some redundant code in agent_request().
 r378486 | rmudgett | 2013-01-03 13:40:55 -0600 (Thu, 03 Jan 2013) | 9 lines

 chan_agent: Fix wrapup time wait response.

 * Made agent_cont_sleep() and agent_ack_sleep() stop waiting if the wrapup
 time expires.  agent_cont_sleep() had tried but returned the wrong value
 to stop waiting.

 * Made agent_ack_sleep() take a struct agent_pvt pointer instead of a void
 pointer for better type safety.

Merged revisions 378427,378456,378486 from http://svn.asterisk.org/svn/asterisk/branches/1.8

[^jira_asterisk_20690_one_login_v10.patch] - This is the v10 version of the patch up on reviewboard: https://reviewboard.asterisk.org/r/2260/
It must be applied after the mrg_up patch.

[^jira_asterisk_20690_combined_v10.patch] - This combines the two patches so you only need to apply this one.  The other two are showing where the changes came from.

These patches should prevent an agent loging in to a chan_agent if another agent is already logged in.

By: Jeff Hoppe (jhoppebugs) 2013-01-17 12:55:37.588-0600

Are these patches going to be applied in a future realease of 10?  

By: Richard Mudgett (rmudgett) 2013-01-17 13:10:31.626-0600

No.  Asterisk 10 is no longer in regular maintenance.  It is in security fix only mode.

By: Richard Mudgett (rmudgett) 2013-01-25 16:52:01.868-0600

Did these patches resolve your problem?

By: Jeff Hoppe (jhoppebugs) 2013-01-31 09:38:37.826-0600

I will apply this patch to one of our systems to verify that it solves our issue within the next week.

By: Joshua C. Colp (jcolp) 2017-12-19 05:20:11.019-0600

Did you ever apply the patches? Is this still a problem in supported versions of Asterisk?

By: Asterisk Team (asteriskteam) 2018-01-02 08:30:35.572-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidlines [1].
[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines