ASTERISK-06452: Queues freeze if AgentCallbackLogin is used

Summary: ASTERISK-06452: Queues freeze if AgentCallbackLogin is used

Reporter: Håkan Källberg (hk) Labels:

Date Opened: 2006-03-02 03:34:32.000-0600 Date Closed: 2006-09-13 12:06:56

Priority: Major Regression? No

Status: Closed/Complete Components: Channels/chan_agent

Versions: Frequency of Occurrence:

Related Issues:

Environment: Attachments: ( 0) agents.conf
( 1) ast_bug.log
( 2) extensions.conf
( 3) gdb_out.2.txt
( 4) jstorm_wh_0001_agents.conf
( 5) jstorm_wh_0001_bt_08_37
( 6) jstorm_wh_0001_bt_08_39
( 7) jstorm_wh_0001_extensions.conf
( 8) jstorm_wh_0001_messages_start_07_58_17___end_08_23_30
( 9) jstorm_wh_0001_queues.conf
(10) jstorm_wh_0002_trunk_bt1
(11) jstorm_wh_0002_trunk_bt2
(12) jstorm_wh_0002_trunk_messages
(13) manager_deadlock_log_extract.gz
(14) manager_deadlock_log.gz
(15) manager_eventq_backport-1.2.10.patch
(16) queues.conf
(17) rflopes3_bug_console.tar.gz
(18) superdjc_crash_messages.2.txt
(19) superdjc_crash_messages2.txt
(20) superdjc_gdb_output.2.txt
(21) superdjc_gdb_output.txt
(22) x.gdb

Description: I have a reproducible problem with queues. I have reproduced it
in two different setups which makes it possible to exclude some
causes.

After a while of using the queues the queue system freezes. If
you call a queue you get no answer, just ring tone. You can
still place normal calls through asterisk. The CLI is also
frozen. You can write "show queues" or whatever, but you do not
get an answer. If you restart asterisk it hangs a while before
stopping. No messages are produced any more, verbosity=3.

This happens when agents are logged in with AgentCallbackLogin.
It only happens when agents have taken calls. It can happen
after the first call, or it can take a day or so of use. It can
happen after a period of inactivity, i.e. overnight or
after lunch.

****** ADDITIONAL INFORMATION ******

Zaptel is in use on both systems: on one just ztdummy, on the
other wctdm for some faxes. Zaptel 1.2.3/4. One system is a
barebones system with no extra hardware, just a few SIP and
IAX2 clients. I can use this system for testing. One system used
the leastrecent policy, the other ringall. I am not using weight. One
system has a 64-bit P4-HT, the other a 32-bit P4-HT. One system
used MOH, the other system just rings. One system has a 4-port
Beronet card with mISDN/chan_misdn. I am not using cdr_mysql.

The other chan_agent/queue lockup bugs seem to behave
differently, I see nothing common. I'll try to hang out on
#asterisk-bugs if anyone has a question.

Comments: By: Håkan Källberg (hk) 2006-03-02 05:52:47.000-0600

I am using SuSE Linux 10.0 ( Kernel 2.6.13-15.8-smp )
By: Håkan Källberg (hk) 2006-03-04 04:43:45.000-0600

The problem remains in * 1.2.5. After the first accepted call, the queues hang.
By: Håkan Källberg (hk) 2006-03-09 06:49:50.000-0600

I found out that the case where the lockup always occurs, after the first successfully taken call, is related to using the Local channel in the Dial rule. This problem was described elsewhere. The other case, involving a direct call of the Queue, where it takes many calls to reach the lockup, looks like:

exten => s,1,Dial(Local/s@open_hours)
exten => s,n,Queue(callcenter|t|||8)

The open_hours context just checks for business hours and plays a message when outside. *Could* it be that the use of the Local channel is a problem here too?
I will try to reproduce the problem completely without the use of the Local channel.
By: Håkan Källberg (hk) 2006-03-11 04:33:24.000-0600

The case where I used Dial(Local/888@context) to get to the Queue call, which
always caused hung queues, is resolved. If I put a /n after the context, it works.

-- Later experience taught me that with the /n option it just takes a few more
-- calls before the lockup :-(

The case which behaves exactly the same, but takes a lot of calls to reproduce, remains. I will try to reproduce it in a test environment, outside our call center.

Btw, when I try to take a backtrace with gdb connected to the running asterisk process, *gdb* core dumps:-(

% gdb --version
GNU gdb 6.3

By: Serge Vecher (serge-v) 2006-05-01 15:46:40

Is this still an issue? If so, please provide a backtrace made from a core snapshot while the deadlock is in effect. Thank you.
By: David J Craigon (superdjc) 2006-05-02 10:21:51

This used to be related to bug number 6147, before that bug was cut down in its prime :'(. See most of the notes for that bug.
By: Serge Vecher (serge-v) 2006-05-02 10:29:52

The notes in 6147 indicate at least 3 different problems with different symptoms. Developers cannot understand the problem unless proper debugging information is attached, and therefore can't fix it. If you experience the same apparent deadlock as hk, please attempt to debug and post the result here. Thanks.
By: David J Craigon (superdjc) 2006-05-02 10:32:39

Just to clarify what is known about this bug.

1) The queue system crashes when an agent calls AgentCallbackLogin, or shortly afterwards.
2) When the queue system has crashed, it locks any PRI lines it has in use. This causes asterisk to think that PRI lines are in use when they are in fact free. All incoming calls are blocked with messages like this in the log:
Mar 23 10:16:47 WARNING[17119]: chan_zap.c:8514 pri_dchannel: Ring requested on channel 0/4 already in use on span 1. Hanging up owner.

The only escape from this is to restart Asterisk.

3) When the queue system has crashed, you can still log into the asterisk console, and run commands. Once you have typed show queues or show queue, you can't do anything else with the console- you get no output. Starting a new console lets you type commands again.

4) I've got it to do this with no calls on the system apart from the login call.
By: David J Craigon (superdjc) 2006-05-02 10:40:46

Sorry, you're right: there does seem to be a lot of rubbish in 6147. Some of the notes are still useful.

Anyhow, I've uploaded a dump of a gdb session on unoptimised asterisk. It's not the latest asterisk, but I promise the issue is the same on the new one. I'll try and get a dump from that at some point.

We have "solved" the issue locally by banning agents from logging in or out :>>>>)

By: Serge Vecher (serge-v) 2006-05-02 10:48:36

superdjc: thanks for debug. For future reference, please upload files in uncompressed format for easier viewing.

Also, is there anything "interesting" displayed on the console at the time of this?
By: David J Craigon (superdjc) 2006-05-02 11:02:02

I've attached the messages from a crash (not the same one sadly). This crash happens lots for me. I've never spotted anything common between them all on the console (except for chan_zap.c:8514 pri_dchannel: Ring requested on channel 0/4 already in use on span 1. Hanging up owner after it's all gone wrong.)
By: Håkan Källberg (hk) 2006-05-02 12:20:47

In my case, I have, as stated, no PRI channels ( only ztdummy ). I'll try to get a backtrace tonight...
By: Serge Vecher (serge-v) 2006-05-02 12:32:59

> superdjc: I've attached the messages from a crash (not the same one sadly).
You must have forgotten to...
By: Håkan Källberg (hk) 2006-05-02 15:04:25

Now I could catch a solid lock up! Just a few calls... Backtrace attached!
By: Serge Vecher (serge-v) 2006-05-09 10:13:46

hk: what version of asterisk is this backtrace from? Please always include a version you are reporting on. The backtrace must be from a very recent 1.2 code built with 'make dont-optimize'
By: Håkan Källberg (hk) 2006-05-09 16:04:09

Asterisk 1.2.5
# uname -a
Linux 2.6.13-15.8-smp x86_64 x86_64
By: Serge Vecher (serge-v) 2006-05-09 16:09:21

ok, 1.2.5 is too old. There are several changes that have gone into 1.2 branch (currently at r26090) that will affect this issue. Please update to the most recent code in 1.2 branch and if the problem still persists, please recompile with 'make dont-optimize' and attach a new backtrace.

Thank you.
By: Håkan Källberg (hk) 2006-05-09 18:10:57

Well, this occurs in Asterisk 1.2.6 too! I will try to get a backtrace from
1.2.7.1 as well!

By: outcast (outcast) 2006-05-09 22:18:26

I have been able to duplicate this in 1.2.7.1, but I don't have a backtrace.
By: BJ Weschke (bweschke) 2006-05-09 22:23:04

hk / outcast : any of you using MixMonitor in conjunction with these hangups?
By: outcast (outcast) 2006-05-10 00:04:05

no
By: BJ Weschke (bweschke) 2006-05-10 00:40:09

ok. let's adjust the Makefile to have -DDEBUG_THREADS -DDETECT_DEADLOCKS enabled in DEBUG_THREADS = line of the main Makefile.

do that, and then do a "make clean" and "make install".

you should then get a bunch of console output when a deadlock is detected with who is trying to take the lock and who has it so that the person trying to claim it cannot get at it.

please make sure you have a log with VERBOSE and DEBUG turned up in addition to ERROR and hopefully we'll get some good data to start to get to the bottom of this.

Thanks.
By: Håkan Källberg (hk) 2006-05-10 02:38:12

No MixMonitor used here either...
By: Håkan Källberg (hk) 2006-05-10 03:41:17

asterisk -vvvgc | tee ast_bug.log

Dialplan excerpt:

;exten => 888,1,Dial(Local/720@users/n)
exten => 888,1,Dial(Local/720@users)

; Functions

exten => 720,1,SetCallerID(${CALLERID(name)} <Deutschland>)
exten => 720,n,Queue(extrabit,rt,,,40)
exten => 720,n,Hangup()

If I use the Local call without the /n option, it hangs on the first call.
If I use a Local call with /n, or call 720 directly, it normally takes
quite a few calls to reproduce. I'll do it the quick way now!
By: outcast (outcast) 2006-05-10 13:28:23

Do you have a timeout in the context for the callback agents?
By: Håkan Källberg (hk) 2006-05-10 14:26:43

No, the only timeout involved is the 40 seconds in the Queue command.
By: outcast (outcast) 2006-05-10 22:10:47

Are your incoming calls SIP?
By: Håkan Källberg (hk) 2006-05-11 01:17:40

In my test setup, where I can do backtraces and so on, I have incoming IAX2 calls and IAX2 agents.

I could try with calls over mISDN too. At the production site where I
first saw the problem, we have SIP agents and mISDN incoming calls. The Local
channel is not used at all anymore in that setup, but it takes a week or two of
active call center use to reproduce the error there.

By: outcast (outcast) 2006-05-11 05:46:23

Well, we have switched our incoming calls from SIP to IAX, and that seems to have fixed the issue as well. So could there be something funny with how SIP calls are handled by the queue application?

By: Håkan Källberg (hk) 2006-05-11 06:12:40

As stated in my note above, the problem is definitely present for us when IAX2 is used.
By: outcast (outcast) 2006-05-11 13:28:21

Sorry, I misread your post.
By: Serge Vecher (serge-v) 2006-05-22 13:47:50

hk, have you been able to produce any debugging information as per bweschke's note 0045864?
By: outcast (outcast) 2006-05-23 16:34:57

Switching to IAX did not fix the issue; Asterisk crashed after 5 hours.

By: Håkan Källberg (hk) 2006-05-24 02:01:40

Vechers: Yes, attached in ast_bug.log!

In the log: 05-10-06 03:42 hk File Added: ast_bug.log
By: BJ Weschke (bweschke) 2006-05-25 18:54:44

There was a patch put into /trunk earlier this evening which I believe will begin to address some of the deadlock issues with app_queue. It is by no means a panacea for all of the deadlock issues that are coming from these types of configurations, but I think it should probably take a good chunk out of what's going on. This patch will also be backported and make its way into 1.2 as a "bug fix", but I have no immediate ETA on that at this point. Hopefully in the next week or so.
By: Serge Vecher (serge-v) 2006-05-26 12:29:25

hk and outcast: as this has been backported to the 1.2 branch code (in rev. 30546), we need you to test immediately and report back whether the issue is resolved. Thank you.

By: David J Craigon (superdjc) 2006-05-31 06:55:43

Still getting queue lockups with asterisk after patch applied.

I attach a gdb backtrace, and logfiles of it crashing.
By: BJ Weschke (bweschke) 2006-05-31 06:59:26

superdjc - which version are you running? Sorry for the inconvenience, but a bug was found and fixed yesterday morning that did indeed cause additional deadlocks. It was fixed at r30770. If you're running a rev after that, please do attach gdb and logs so we can get a better look at where the deadlock is occurring. Thanks.
By: David J Craigon (superdjc) 2006-05-31 07:10:02

r30546, will upgrade now
By: David J Craigon (superdjc) 2006-06-02 04:31:16

No, still no use.

Attaching gdb backtrace, logfile messages.
By: David J Craigon (superdjc) 2006-06-02 04:37:42

superdjc_crash_messages.2.txt and superdjc_gdb_output.2.txt
By: David J Craigon (superdjc) 2006-06-02 05:26:20

Probably worse than it was:

1 login nearly= 1 crash at the moment.
By: BJ Weschke (bweschke) 2006-06-02 10:40:47

superdjc: can you attach your dialplan and other relevant configs please? we've got this code running in production at one of our clients and aren't having this problem you describe, so I'm trying to understand what the difference is.
By: David J Craigon (superdjc) 2006-06-02 10:51:38

Are you a digium-ite? If so, we bought a not-very-helpful support contract with you. Your support desk has all of our config. You're also quite welcome to login and poke around our machine, email me for details.

I've attached the dialplan; however, bear in mind we use MySQL realtime, so the config in isolation is not much use.
By: BJ Weschke (bweschke) 2006-06-02 11:47:36

No sir. I don't work for Digium. That's why I've asked for your config. I'm going to try and reproduce your issue to see if I can recreate what's going on.

Can you tell me what was happening at the time you deadlocked on the info you posted up to 6/2?
By: David J Craigon (superdjc) 2006-06-02 14:52:11

> No sir. I don't work for Digium. That's why I've asked for your config. I'm going to try and reproduce your issue to see if I can recreate what's going on.

Ah sorry, I was guessing that's how you got to "manager" status. My mistake. Anyhow, my offer to let you look around my setup still stands: there have got to be easier ways to get root access to a random box on the net.

As for "what exactly happens", I don't totally know. We can't recreate the problem on an unloaded box. But my symptoms are as stated in comment 0045032 on this bug when it does go over.
By: Håkan Källberg (hk) 2006-06-08 09:02:54

Well, I just upgraded to 1.2.9.1; it contains the patch, doesn't it? I hoped for some relief, but in two active hours I got two lockups, like before. In this setup the calls come in over mISDN and are serviced over SIP. This is a production site, so I cannot easily get backtraces...
By: Håkan Källberg (hk) 2006-06-09 03:52:30

Today I had two more lockups with 1.2.9.1 and therefore I downgraded to 1.2.7.1.
By: Håkan Källberg (hk) 2006-06-12 01:44:15

I just wanted to point out that the title change done by vechers 05-24-06 might be a little bit misleading. The Local channel was for me the quickest way to reproduce the problem. But it definitely behaves the same without use of the Local channel! The components involved from my point of view are still the ones from the original subject "Queue" and "AgentCallbackLogin". The rest of the system is working as normal during a queue lock up. With * 1.2.9.1 it doesn't take half a day of call center use to reproduce it - without chan_local!
By: Jack Storm (jstorm) 2006-06-12 10:55:44

I am having the same issue, files jstorm_wh_0001_* are my attached configs, log and bt's (two bt's 2mins apart). I can reproduce this on the system with an active
call load, as long as I have agents defined in the queue, and they are logged in
with AgentCallBackLogin.

This is Asterisk 1.2.9.1, make valgrind with -DDETECT_DEADLOCKS -DDEBUG_THREADS

All incoming lines are Zap, all stations are Sip

Note: in the messages file I removed a ton of repeating lines and replaced them with [...] {didn't think you guys would be happy with a 180meg 40min log file :)}
By: Jack Storm (jstorm) 2006-06-12 21:44:42

Extra info:
[I would have added this sooner, but I had to wait till after hours to test, and
test on another system, with other chan types IAX to SIP and SIP to SIP]

The deadlock I reported happens when an Agent answers a call from the queue
and then transfers (Attended) the call to another extension. That extension
takes the call, and the deadlock happens right when the Agent who took the
queue call hangs up (the transfer is successful, and continues normally).
My logs are what led me to look into this and to verify it. Blind transfers
seem to work fine; only Attended transfers lead to the deadlock [in my testing].

By: Jack Storm (jstorm) 2006-06-13 16:55:51

The files jstorm_wh_0002_trunk_* are from Asterisk (SVN-trunk-r33913), same
problem, but now, when you finish the attended transfer everyone is hung up, bt1
is from that point and bt2 is me calling back in, to get the deadlock in the
queue (just for good measure)

Note: same configs as jstorm_wh_0001_*
By: Roberto Lopes (rflopes3) 2006-06-24 10:53:57

I'm having a freeze problem when an Agent dials out via Agent/XXXX. The outbound calls work fine, but when the first inbound call arrives the queue doesn't transfer the call to any agent, and from then on the queue stops working.
By: Frank Waller (explidous) 2006-06-30 20:02:11

We had the same behaviour as jstorm; unfortunately I could not go back to test the attended transfer after the first core dumps I posted. Maybe someone should link this issue to my first posting about the crash when transferring out of MeetMe into Queue.
By: dillec (dillec) 2006-07-04 15:55:09

Got the same problem using AgentCallbackLogin with SIP agents. Asterisk freezes on "show queues" and "show agents" until a restart occurs.

Furthermore, I've experienced the same as jstorm (0047752) in one case: Agent 1 transferred a queue-originated call to Agent 2 by attended transfer.

Hope you developers can fix that problem as soon as possible. In my opinion this issue is critical!

I will try to get some debug output from my machine.
By: outcast (outcast) 2006-07-05 17:49:10

vechers,
I installed the latest 1.2 branch from SVN. Still the same issues.
By: Matt King, M.A. Oxon. (kebl0155) 2006-07-06 08:56:21

I had the same problem today. Asterisk 1.2.9.1. Debian Etch. It started out with some of these:

Jul 6 09:53:31 WARNING[26876] channel.c: Avoided initial deadlock for '0x8219e80', 10 retries!

Then I'd get a whole lot of these:

Jul 6 10:00:48 WARNING[26889] chan_zap.c: Ring requested on channel 0/1 already in use on span 2. Hanging up owner.

Asterisk would then stop taking new callers, and would stop responding as described above.

I recompiled asterisk with

make valgrind

and turned on all debug compile options. I was hoping for the deadlock to be detected, and a core dump forced.

One out of two ain't bad. I now got a new set of errors:

Jul 6 11:24:50 ERROR[10108]: ../include/asterisk/lock.h:248 __ast_pthread_mutex_lock: app_queue.c line 1110 (join_queue): '&qlock' was locked here.
Jul 6 11:24:50 ERROR[10108]: ../include/asterisk/lock.h:245 __ast_pthread_mutex_lock: app_queue.c line 1057 (load_realtime_queue): Deadlock? waited 5 sec for mutex '&qlock'?

These were repeated over and over again. Asterisk didn't core dump, but I was able to gdb to the process (bts and debug-level message log available on request).
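
The "Deadlock? waited 5 sec for mutex" messages appear to come from the lock debugging that the -DDEBUG_THREADS -DDETECT_DEADLOCKS build options enable. As a rough illustration of how such a check can work, a wrapper can poll the mutex with pthread_mutex_trylock() and report when the wait exceeds a limit. This is only a sketch and not Asterisk's actual lock.h code: the function name lock_or_report and its millisecond timeout are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not part of Asterisk): try to take `m`, polling
 * roughly once per millisecond. Returns 0 once the lock is held, or
 * ETIMEDOUT if the mutex stayed busy for about `timeout_ms`, which is
 * a hint (not proof) that some other thread is deadlocked or blocked.
 */
int lock_or_report(pthread_mutex_t *m, const char *name, long timeout_ms)
{
    const struct timespec pause = { 0, 1000000L }; /* 1 ms between attempts */
    long waited_ms = 0;                            /* approximate wait so far */

    assert(m != NULL && name != NULL);
    for (;;) {
        int res = pthread_mutex_trylock(m);
        if (res != EBUSY)
            return res; /* 0 on success, otherwise a genuine error code */
        if (waited_ms >= timeout_ms) {
            fprintf(stderr, "Deadlock? waited %ld ms for mutex '%s'\n",
                    waited_ms, name);
            return ETIMEDOUT;
        }
        nanosleep(&pause, NULL);
        waited_ms += 1;
    }
}
```

A check like this only reports which lock is stuck; as noted in the comments, the thread holding it may itself be blocked somewhere else, so the reported location is not necessarily the root cause.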

I had a look at the code. The locked section starting at line 1110 seems pretty innocuous, apart from the Join manager_event call. IIRC, manager_event calls do not return until the event has been completely written to the network.

This seems to be an unacceptable length of time. Or, perhaps a problem in the manager subsystem has prevented the manager_event call from completing at all. Perhaps there were a lot of other manager events waiting to be written at the time, and the network was slow (which would explain why this tends to happen during peak periods)...

The manager_event function was also implicated in the original bug report ASTERISK-5989.

Why is this manager_event call inside locked code?

Hope this helps,

Matt.

By: outcast (outcast) 2006-07-06 17:30:20

I have switched all of the agents to static members (i.e. Member => SIP/1000).
This seems to work great! The only downside is that agents cannot log in to or out of the queues.
To work around this, just create a context that dynamically adds and removes the SIP channel from the queues using AddQueueMember and RemoveQueueMember.
By: Roberto Lopes (rflopes3) 2006-07-07 07:15:16

outcast, can you do the following test and see if still have problems:

Log in an agent with the AgentCallbackLogin application
Make an outbound call using the logged-in agent and not the extension
Hang up the call
Place an inbound call and see if your queue transfers it to the same agent.

With 1.2.9.1 installed, this simple test always gives us a queue freeze.
By: Roberto Lopes (rflopes3) 2006-07-07 13:10:00

I have re-created the bug and attached console output with debug information. Hope this helps.
By: BJ Weschke (bweschke) 2006-07-08 07:09:21

I think kebl0155 is on to something, but that code he's talking about does need to be within a lock for now as the event is looking directly at a pointer that is volatile and the data within could change if not locked. If you want to quickly do the lock, copy the data you're going to use into a temporary structure, and then unlock and go into the manager event with the temporary structure, that should be fine.
This shouldn't be an out and out "deadlock" though that never returns because if it is, I think that the original caller who belonged to this thread that fired off the manager event will never get through.
By: Matt King, M.A. Oxon. (kebl0155) 2006-07-08 11:05:40

It does seem like the current arrangement (repeated throughout app_queue.c and chan_agent.c) is an ANTI-PATTERN.

What's currently happening (in pseudocode) is:

lock(importantThing);

blockingFunction(importantThing->volatileData);

unlock(importantThing);

This is bound to cause problems, particularly as the blockingFunction (manager_event) locks on the entire manager session, so there can only be one thread running through it at a time. If there are lots of pending manager events to send (busy call centre), and an operating system or hardware signal needs to be processed quickly, it's easy to see how problems could arise.

I believe the correct pattern is:

var copyOfVolatileData;

lock(importantThing);

quickCopy(importantThing->volatileData, copyOfVolatileData);

unlock(importantThing);

blockingFunction(copyOfVolatileData);
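
For illustration, the second pattern might look like this in C with POSIX threads. This is a minimal sketch, not code from app_queue.c or chan_agent.c: struct queue_state, struct join_snapshot, send_event() and report_join() are hypothetical names, and send_event() merely stands in for a slow call such as manager_event().

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical shared state guarded by a lock (stand-in for a queue). */
struct queue_state {
    pthread_mutex_t lock;
    char name[64];
    int position;
    int count;
};

/* Private copy of just the fields the event needs. */
struct join_snapshot {
    char name[64];
    int position;
    int count;
};

static int events_sent = 0;

/* Stand-in for a slow, possibly blocking call such as manager_event(). */
static void send_event(const struct join_snapshot *s)
{
    printf("Event: Join\nQueue: %s\nPosition: %d\nCount: %d\n",
           s->name, s->position, s->count);
    events_sent++;
}

void report_join(struct queue_state *q)
{
    struct join_snapshot snap;

    assert(q != NULL);

    /* Hold the lock only long enough to copy the volatile data. */
    pthread_mutex_lock(&q->lock);
    memcpy(snap.name, q->name, sizeof(snap.name));
    snap.position = q->position;
    snap.count = q->count;
    pthread_mutex_unlock(&q->lock);

    /* The slow call runs on the private copy, with no lock held. */
    send_event(&snap);
}
```

The trade-off is that the event describes the state at the moment of the copy, which may already be stale by the time it is written out.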

If this pattern was adopted throughout chan_agent.c and app_queue.c, perhaps Asterisk queues would lose their buggy reputation? :-P

Just kidding, Asterisk queues are fab.

Matt.
By: Alex Richardson (alexrch) 2006-07-13 06:40:47

I don't know how helpful the following information will be, but:

I have a running Asterisk PBX in production, which has worked fine for months - even on full load - however, after minor modifications in configuration I started experiencing exactly the same problems as those described here.

I have changed only two things:

1. I have implemented SIP in the queues, so that agents can now use either ZAP hardphone or SIP softphone.
2. I have added 'm' parameter to the Dial application in one of my extensions to which calls are being occasionally transfered from the queue to another agent's telephone

I don't know yet which of these two modifications is causing the problem (it will probably take me at least a few more days to figure it out, as I am switching back to the original configuration step-by-step to see at which point problems will be solved), but if anyone is interested, I can post parts of the configuration files that have been changed.
By: Matt King, M.A. Oxon. (kebl0155) 2006-07-17 09:24:20

I've had another look at the source code. It appears that the manager_event-inside-locked-code antipattern is widespread.

Nevertheless, there may be an alternative, easy fix for ALL these manager-related bugs. I noticed that the manager_event() function calls ast_carefulwrite(), which calls the blocking functions write() and poll(). I think this is the source of the deadlock.

Instead, you could fix this by having two processes, a Producer which produces the data and writes it to a buffer in memory, and a Consumer which takes it from the buffer and writes it to the network. The Producer would return immediately without waiting for the Consumer to do its write. This would allow manager_event() to be called from inside locked code without causing the deadlocks (IMHO).

Unfortunately I don't have the C skills to translate this model into working code for Asterisk (as I do mostly Java). I'm sure at least one of the other Asterisk developers does though...
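
For readers who want to see the shape of the producer/consumer model, here is a minimal sketch in C with POSIX threads. It is not the Asterisk manager code: struct event_queue, eventq_init(), eventq_push() and eventq_pop() are hypothetical names. The producer side only appends to an in-memory list under a short lock and never touches the network; a separate consumer thread would call eventq_pop() in a loop and do the blocking write.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* One queued event; the text would be the formatted manager message. */
struct event {
    struct event *next;
    char text[256];
};

/* Hypothetical in-memory event queue shared by producer and consumer. */
struct event_queue {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    struct event *head;
    struct event *tail;
};

void eventq_init(struct event_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->head = q->tail = NULL;
}

/* Producer: append and return at once, no I/O. Returns 0, or -1 if out of memory. */
int eventq_push(struct event_queue *q, const char *text)
{
    struct event *e;

    assert(q != NULL && text != NULL);
    e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    strncpy(e->text, text, sizeof(e->text) - 1);
    e->text[sizeof(e->text) - 1] = '\0';
    e->next = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = e;
    else
        q->head = e;
    q->tail = e;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Consumer: block until an event is available; the caller frees it. */
struct event *eventq_pop(struct event_queue *q)
{
    struct event *e;

    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->nonempty, &q->lock);
    e = q->head;
    q->head = e->next;
    if (q->head == NULL)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return e;
}
```

A real implementation would also need a bound on the queue length, so that a stalled consumer cannot exhaust memory.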

By the way, this bug crippled our 12-agent call centre twice this morning (during the Monday morning rush). We'd really appreciate a fix!

Hope this helps,

Matt.

By: Matt King, M.A. Oxon. (kebl0155) 2006-07-17 09:50:40

Make that three times.

:-(

I've got a core and full debug-level log with make valgrind and all debugging options turned on if anybody wants it...

Matt.

By: Michael Toop (mmmmmtoop) 2006-07-19 07:11:43

We have noted very similar symptoms, BUT we are not using AgentCallbackLogin; we use the AddQueueMember command.

We are also getting many "Ring requested on channel xx already in use on span 1. Hanging up owner" messages in the Asterisk logs when it starts locking up. Michael.
By: Matt King, M.A. Oxon. (kebl0155) 2006-07-19 07:56:38

I've had a look at the Trunk source code and it appears the producer/consumer strategy I suggested above IS being adopted as manager events are now queued for network write by a separate process. Hopefully this will sort out ALL these problems.

In the meantime, anybody care to do a backport? When is 1.4 expected to be released? I tried it last night but it's not stable enough for us to use it in production yet...

Matt.
By: Alex Richardson (alexrch) 2006-07-20 07:20:50

A backport would definitely be most appreciated, as I wouldn't like to upgrade my production call centre to 1.4 Alpha/Beta just yet and deal with a number of new problems for another half a year... I'd rather deal with those problems on my testing machine, and have a stable one in production.

I would love to contribute at least something to help solving this bug, but unfortunately have no adequate Asterisk code knowledge...if there is something like a "Hacking Asterisk for Dummies" somewhere and I don't know about it, please let me know!
By: jalsot (jalsot) 2006-07-20 07:44:03

We use 1.2.4 at the moment and do not have lockups - intensively using AgentCallBackLogin.

I fear to upgrade to latest 1.2.10 or 1.2 SVN because of this bug ticket - I would need to upgrade because of mixmonitor bugfix which was introduced in 1.2.10.

Can anybody tell me if I can expect queue/agent nightmare or what could I do to make things working? [unfortunately we have major quality issues so upgrade might be reasonable]
By: Alex Richardson (alexrch) 2006-07-20 08:25:31

jalsot: are you using combination of SIP and/or IAX + AMI?

(I didn't upgrade to 1.2.10 on production machine yet, but upgrade to 1.2.9 - which we use in production currently - solved a few problems that we had before that)
By: jalsot (jalsot) 2006-07-20 08:41:57

alexrch:
we use IAX2 softphones and 2xE110P zap cards with AMI.
So you did not discover any problems on 1.2.9 in the question of queues/agentcallbacklogin?
By: Alex Richardson (alexrch) 2006-07-20 10:00:23

jalsot: we used to use ZAP + AMI only and we didn't have any problems. We hit the problems once we started using SIP.
By: Alex Richardson (alexrch) 2006-07-20 12:22:11

jalsot:

btw: how many calls do you handle per day?
By: jalsot (jalsot) 2006-07-21 03:47:52

alexrch:
So in a Zap-only configuration, it seems AgentCallbackLogin and queues work well?
I don't know how many calls per day we have, but in peak hours we have about 50 concurrent calls, and usually around 15-20 for the rest of the day. Right now we have major quality problems above 25-30 concurrent calls, so an upgrade might be reasonable, but this issue keeps me from going ahead: a complete lockup would be worse. [Maybe the Monitor application is the source of the quality issues, but that is another story.]
By: Alex Richardson (alexrch) 2006-07-21 04:29:00

jalsot:

yes, ZAP + AMI seems to be working great - at least in my configuration - AgentCallbackLogin is being used frequently with no problems at all.

In my case it's the ZAP + SIP + AMI that is causing me headaches... :(
By: Tristan Mahe (tristan_mahe) 2006-07-21 04:36:01

In my case ( zaptel only ) I got the same troubles, thinking of switching to dynamic members to solve the trouble...
By: Matt King, M.A. Oxon. (kebl0155) 2006-07-21 04:45:02

Anybody got some good dial plans for dynamic members? Would be helpful to have them attached to this bug as potential workarounds...

Also, I've had several people email me to express interest in a backport of queued events in manager.c - does seem like this is desperately needed by Asterisk's largest customers.

I'm trying to encourage people to request the backport here too.

Matt.
By: Serge Vecher (serge-v) 2006-07-21 08:57:41

matt: please don't encourage people to request a backport through the bug-tracker.
1. There is no confirmation as of yet that these manager.c enhancements that you speak of have an impact on this bug (btw, what was the specific revision that you looked at -- I've gone as far back as March and didn't see anything obvious).
2. There are several off-bugtracker resources that are specifically tailored to keep track of Asterisk backports -- www.asterisk-backports.org seems to come up more often than not.
3. Only if it is established that the enhancements to manager.c in trunk resolve the issues as described by this bug report will the patch to 1.2 be allowed to be posted on the bug-tracker

Thanks and keep looking for a solution.
By: Matt King, M.A. Oxon. (kebl0155) 2006-07-21 10:50:10

Hello Vechers,

1) I have attached a gzipped log of an example deadlock directly linking manager.c to lock-ups in the agent system. We use AgentCallbackLogin, and didn't experience the Queue Freeze when we used AgentLogin. This is why I started adding to this bug report, rather than risk starting a dupe. I hope I did the right thing.

I'm afraid I'm not familiar enough with the Asterisk source code to be able to definitively say whether this particular manager lock is also the cause of HK's original fault. There are calls to manager_event() from within the locked code implicated by his log trace, however.

I also can't definitively say whether manager locks are causing any of the other specific faults reports that have since become attached to this bug, all of which describe very similar symptoms.

So, should I create a new bug report for this log, or should we keep using this one?

As for the lack of confirmation, it would seem to me that the only way we could confirm whether or not the enhancements make a difference to this (these) particular bug(s) would be by doing a backport, for several reasons:

- Please correct me if I'm wrong, but even if you compile with make valgrind, the deadlock checker may be reporting 'upstream' deadlocks that are actually left unrelinquished because of manager locks/blocks elsewhere in the code. If this is the case, then it would not be possible for anyone to determine exactly which deadlock caused the original bug AFAIK.

- The manager anti-pattern may be causing deadlocks anywhere that manager_event() is called from. Almost any of the locks in chan_agent or app_queue could be left unrelinquished due to this problem.

- The deadlocks seem only to occur when things get really busy (i.e. in production environments for large-and-busy call centers). Trunk cannot be run in these circumstances, so we can't use it to test.

In answer to your question, I'm looking at Trunk Jul 18 and Asterisk 1.2.10.

I also have a question for you: if the 1.2 model does not cause deadlock problems, why has it been changed so radically for 1.4?

2) Thank you for suggesting asterisk-backports.org. I have placed a request there as advised. Is there anything more I can do through this or any other site? I would hate to burden bugs.digium.com unnecessarily.

3) How can we determine whether the enhancements to manager.c will affect this bug without a backport? Seems like a chicken-and-egg problem to me...

Unfortunately, none of us who are suffering from this problem have the technical/C/asterisk know-how to make the patch we think we need - so far.

If a patch were to be produced, I certainly wouldn't consider posting it to this bug list or asterisk-backports.org (or indeed anywhere else) without having it thoroughly tested.

If only we had a patch to test!

Respectfully yours,

Matt.

By: Serge Vecher (serge-v) 2006-07-21 11:22:19

> In answer to your question, I'm looking at Trunk Jul 18 and Asterisk 1.2.10.
ok, I'm confused now about this producer-consumer commit to manager.c in trunk. According to revision history http://svn.digium.com/view/asterisk/trunk/manager.c?rev=37936&view=log, there were no commits to manager.c on that date. r37936 was done on 07/19, but that looks different. By the way, with svn it is better to quote a specific revision of a trunk or branch, rather than the date (as we used to do with cvs).

>So, should I create a new bug report for this log, or should we keep using this one?
Let's keep this in one bug for now, unless we come across evidence that there are separate problems going on.

>As for the lack of confirmation, it would seem to me that the only way we
>could confirm whether or not the enhancements make a difference to this
>(these) particular bug(s) would be by doing a backport, for several reasons:
Well, not necessarily. The preferred method would be to test the trunk and confirm that the problem is resolved by the consumer/producer enhancements to manager.c. Then, perhaps, these enhancements could be backported under the auspices of "feature-that-fixes-bugs-in-release-branch".
By: Matt King, M.A. Oxon. (kebl0155) 2006-07-22 05:57:28

Hello Vechers,

Thank you for the swift response. Jul 18 was the day I checked out trunk. I've just had a look at the latest version to confirm (today, global revision 38087, manager.c revision 38042).

In this version, as in the other trunk versions I've seen, the producer is the call to manager_event(), which calls append_event(), which adds the event to an event queue without calling any blocking functions. Hooray!

The consumer (for events) is process_events, which runs through the queue and sends the events using ast_carefulwrite (which does call blocking functions). I think manager responses also have their own consumer called astman_append(), but that's another story.

manager_event() itself still has a lock on the manager session, but this is only to ensure a well-defined event queue order, and to prevent the consumer thread from being woken twice if it's asleep. There's no longer any risk of block due to write() and poll() (which would seem to be the only blocking functions inside 1.2 manager_event())...
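
For readers unfamiliar with the pattern, the queue-based split described above can be sketched roughly as follows. This is a minimal illustration in C with pthreads, not the actual trunk manager.c code: the names `struct event_queue`, `eventq_append()` and `eventq_take()` are hypothetical stand-ins for the roles of append_event() and process_events, and the fixed 128-byte text buffer is an assumption made to keep the example short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; this is NOT the Asterisk manager.c code. */
struct event {
    struct event *next;
    char text[128];
};

struct event_queue {
    pthread_mutex_t lock;     /* protects head/tail only */
    pthread_cond_t nonempty;  /* signalled when an event is appended */
    struct event *head;
    struct event *tail;
};

static void eventq_init(struct event_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->head = q->tail = NULL;
}

/* Producer side: copy the message, link it in, wake the consumer.
 * No socket I/O happens while the lock is held.
 * Returns 0 on success, -1 if allocation fails. */
static int eventq_append(struct event_queue *q, const char *text)
{
    struct event *ev;

    assert(q != NULL && text != NULL);
    ev = calloc(1, sizeof(*ev));
    if (!ev)
        return -1;
    strncpy(ev->text, text, sizeof(ev->text) - 1); /* calloc keeps it NUL-terminated */

    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = ev;
    else
        q->head = ev;
    q->tail = ev;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Consumer side: wait until an event exists, unlink it under the lock,
 * then hand it to the caller. The caller does the slow write after the
 * lock has been released, and frees the event afterwards. */
static struct event *eventq_take(struct event_queue *q)
{
    struct event *ev;

    pthread_mutex_lock(&q->lock);
    while (!q->head)
        pthread_cond_wait(&q->nonempty, &q->lock);
    ev = q->head;
    q->head = ev->next;
    if (!q->head)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);

    ev->next = NULL;
    return ev;
}
```

A consumer thread would loop on `eventq_take()` and perform the potentially blocking socket write only after the queue lock has been released, so a slow manager client can stall the consumer but not the code that raised the event.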

I'm as sure as I can be at this stage that this is the code we need to fix this bug.

I agree that it would be best practice to test trunk to see if this does indeed fix the problem. I cannot use trunk in our production environment because it is not sufficiently stable (and I did already try). I cannot reproduce the bug in our dev environment because it only seems to happen when things get really busy.

So, how do you suggest we go about this?
By: Alex Richardson (alexrch) 2006-07-22 18:24:17

jalsot & others:

it seems that even a ZAP + AMI configuration is causing occasional problems. Just had my first crash today, with exactly the same symptoms as written in this bug's description: "If you call a queue you get no answer, just ring tone. You can still place normal calls through asterisk. The CLI is also frozen. You can write "show queues" or whatever, but you do not get an answer. If you restart asterisk it hangs a while before stopping. No messages are produced any more, verbosity=3.". Unfortunately that happened on a production machine, so I've got no debug info... :(
By: Matt King, M.A. Oxon. (kebl0155) 2006-07-23 19:47:14

alexrch: Those are exactly my symptoms also.

We need this patch.

Matt.
By: Alex Richardson (alexrch) 2006-07-25 05:44:28

vechers: where, in your opinion, is the problematic code (manager.c, app_queue.c, something else?) that is causing all these problems? I would love to help debug it, if only I knew where to look for bugs. I guess you have an idea of what could be wrong?
By: Serge Vecher (serge-v) 2006-07-25 08:15:51

alexrch: I'm not a developer, so I can't tell for sure. Jstorm has looked into this issue in depth; maybe he can elaborate a bit more...

From the discussions on #asterisk-dev channel, the problem seems to be in the callback mechanism of chan_agent. Supposedly, this functionality will be deprecated in chan_agent for 1.4 to be replaced by a solution implemented in dialplan (yes, with example documented ;) See respective email by K.P. Fleming to asterisk-dev mailing list.

I hope this helps.
By: Alex Richardson (alexrch) 2006-08-03 14:33:06

vechers: okay, I have switched to the dynamic agents (on a test machine only), but I am facing a bunch of other problems now... like, the agent's status is always unknown (so I can't use patch 5577 anymore), AgentComplete events are fired right after the agent answers the call and not when the call is actually complete, etc.

So, are you and Mr. Fleming really serious about switching from AgentCallbackLogin to something else? Because if that's the case, then I'm afraid that there will be quite a few new queue-related bugs posted here in the days following... :(

And since THIS bug (6626) is really the only problem I have with * currently, I would rather see if I could somehow help you guys solve it than face zillions of new ones... I even made this little utility (that's part of my AMI proxy) that automatically restarts asterisk when queues stop working and sends me an email notification, so agents sometimes don't even notice that something was wrong... So can someone tell me what must be done for this bug to get fixed rather than simply left as it is?

By: nicolasg (nicolasg) 2006-08-03 16:24:43

Another issue that will arise if callbacklogin is deprecated has to do with queue_log. If you use devices instead of agents you won't be able to tell just from queue_log the agent involved in each event (several agents might use the same phone/device at different times, a common practice in callcenters).
By: Matt King, M.A. Oxon. (kebl0155) 2006-08-03 17:40:58

Many of our call centres rely on agent channels to provide agent log in and out information, and hot desking. Are we really going to be able to do this from the dial plan without agent channels?

An example dial plan would really help us all...
By: Jack Storm (jstorm) 2006-08-04 23:51:20

Ok, a good chunk of the notes here really belong in the users mailing list and/or the dev mailing list, not here.

Yes, there are issues with chan_agent and app_queue. The bulk of these relate to chan_agent holding a lock from app_queue, and thus breaking attended transfers (and this is what the original report was about... DEADLOCKS in release code).

A lot has changed in /trunk, and given us new bugs. If 1.4 is to truly deprecate AgentCallbackLogin, then a lot needs to be discussed on the mailing list (sans code), so that we can work out the best way to resolve this issue.

Sorry to be blunt but comments here are starting to be more discussion of configuration, and not cause and detailed effect. We need to hash this out where more people are watching.
By: Alex Richardson (alexrch) 2006-08-07 05:29:29

jstorm: Exactly! I agree that we should focus on fixing this bug in 1.2.x, rather than continue discussing configuration changes needed to work around it (btw: the discussion about this started when the decision to deprecate AgentCallbackLogin was made instead of fixing this bug in 1.4).

I think it would not be such a bad idea if we could meet on IRC to discuss where exactly the debugging should begin. I would like to help you guys solve this problem; however, so far I have not yet seen anyone on IRC (the dev channel, that is) who has adequate info on this issue. Therefore I would like to meet with you and discuss it. Please propose a time when I can reach you. Thanks!
By: Christopher McBee (cmcbee) 2006-08-15 08:54:14

Just wanted to see if anyone has found a work-around or solution in 1.2. We're currently running 1.2.8 (1.2.9.1 introduced a large number of problems, so we are holding off on 1.2.10).
By: Serge Vecher (serge-v) 2006-08-17 13:19:36

not to take away attention from fixing the bug 1.2.x, but the "dialplan dynamic queue member" sample has been posted in the latest trunk r.40254 (http://lists.digium.com/pipermail/svn-commits/2006-August/015936.html)

alexrch, jstorm: did you have a chance to get together on IRC to discuss this?
By: Nic Bellamy (nic_bellamy) 2006-08-17 16:22:28

I've just uploaded manager_eventq_backport-1.2.10.patch, a backport of the trunk producer-consumer changes to the manager eventq.

I'm not sure if it fixes this bug, but it's worth a go, and I'd be interested to hear other people's results.
By: Alex Richardson (alexrch) 2006-08-18 04:12:44

vechers: no, not yet, but I will now try nic_bellamy's patch and let you know of the results. Thank you [edit: murf :) ] for the sample!

By: Serge Vecher (serge-v) 2006-08-18 08:18:57

actually, thank murf... I'm just the messenger
By: dimitripietro (dimitripietro) 2006-08-18 18:25:59

I think that AgentCallbackLogin shouldn't be deprecated because when generating reports using dynamic agents, we get the name of the device. The problem is that agents are rotating, and it becomes a complete mess to remember each day which agent was using which phone.
By: Joel Vandal (jvandal) 2006-08-18 19:59:44

I agree that AgentCallBackLogin must not be deprecated, but all functions from chan_agent can be recreated with some dialplan logic, and that's what I've done, using some AstDB keys for storing agent info.

About reporting: yes, you get the device name, but it's also possible to match AgentID and device name. I have added support for this in our reporting system, so you can write a script that parses the queue_log file and greps for AGENTLOGIN entries (which can be mimicked in the dialplan).
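
As a rough illustration of that matching step, here is a small C sketch that pulls the agent and device out of one queue_log line. It assumes the pipe-delimited layout `timestamp|uniqueid|queue|agent|event|data`, with the channel/device name as the first data field of an AGENTLOGIN entry; check that against your own queue_log before relying on it. The function names `split_fields()` and `parse_agentlogin()` are made up for this example and are not part of Asterisk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Split a line on '|' in place into at most max fields; returns the count.
 * Anything after the last requested field is cut off. */
static int split_fields(char *line, char *fields[], int max)
{
    int n = 0;
    char *p = line;

    while (n < max) {
        fields[n++] = p;
        p = strchr(p, '|');
        if (!p)
            break;
        *p++ = '\0';
    }
    return n;
}

/* Assumed layout: timestamp|uniqueid|queue|agent|event|data
 * Returns 1 and fills agent/device if the line is an AGENTLOGIN entry,
 * otherwise returns 0 and leaves the output buffers untouched. */
static int parse_agentlogin(const char *line, char *agent, size_t alen,
                            char *device, size_t dlen)
{
    char buf[512];
    char *f[6];

    assert(line != NULL && agent != NULL && device != NULL);
    snprintf(buf, sizeof(buf), "%s", line);
    buf[strcspn(buf, "\r\n")] = '\0';   /* strip trailing newline, if any */

    if (split_fields(buf, f, 6) < 6)
        return 0;
    if (strcmp(f[4], "AGENTLOGIN") != 0)
        return 0;

    snprintf(agent, alen, "%s", f[3]);
    snprintf(device, dlen, "%s", f[5]);
    return 1;
}
```

A reporting script would call `parse_agentlogin()` on every line and keep the most recent device per agent, so later events logged against a device can be attributed to whoever was logged in at the time.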

OK, it required a lot of dialplan rewriting, but it's not impossible. I have rewritten it and will check if we can 'publish' our dialplan script and a little conversion script that reads agents.conf and rewrites it into AstDB.
By: Serge Vecher (serge-v) 2006-08-28 12:57:33

Has anybody tried nic_bellamy's patch?

dimitripietro, jvandal: please take the discussion of the pros/cons of AgentCallBackLogin deprecation to the asterisk-dev mailing list, thanks.
By: Alex Richardson (alexrch) 2006-08-29 14:51:44

serge-v: yes, I did. It seems to make manager work a bit better, but the problem with queue system hanging up persists.
By: BJ Weschke (bweschke) 2006-09-03 13:43:15

pls try the chan_agent_noapplock_on_cb.diff file attached to 7458 with latest 1.2 branch to see if this addresses some of the deadlock issues with chan_agent used with AgentCallBackLogin
By: Matt King, M.A. Oxon. (kebl0155) 2006-09-13 05:20:52

We've been using the patch from ASTERISK-7261 AND the manager event backport patch (above) for about a week.

We've had NO deadlocks in that time.

Also the manager event patch has made Asterisk much more responsive, even under heavy load.

Thanks so much for these patches - just what we needed!

Matt.
By: Serge Vecher (serge-v) 2006-09-13 12:06:39

Alright, I think there are enough positive reports to close this issue down (finally). Major props to bweschke for fixing this! Please open a new bug report if somehow this issue is not fixed.

Fixed by chan_agent_noapplock_on_cb.diff patch from ASTERISK-7261, which was committed to 1.2 branch in r42133 and appears in 1.2.12.1 release.