[Home]

Summary:ASTERISK-09873: [patch] Random replacement of channel name with other text in queue log entries
Reporter:jfitzgibbon (jfitzgibbon)Labels:
Date Opened:2007-07-13 08:44:04Date Closed:2007-09-10 13:14:56
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Applications/app_queue
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 10199.lockpatch_coredump_1.txt
( 1) 10199.lockpatch_coredump_2.txt
( 2) 10199.patch
( 3) 20070813__bug10199.patch
( 4) 20070905__bug10199.diff.txt
Description:On 1.4.5, 1.4.6 and 1.4.7.1, I have observed random replacement of the channel name with other text in my queue log.  Typically the other text is part of a manager event, suggesting that a pointer is getting corrupted somewhere.  It happens on a very small percentage of calls, but there are consistent elements to the corruption from my observations:

- it only happens on one of my eleven queues.  Of the eleven, four others are configured almost identically to the queue on which the corruption is observed (the only differences are queue name and wrapuptime length). The problem queue is listed second in queues.conf, and does not sort lexically to the top or bottom of a list of queue names.

- the corruption first appears on the CONNECT event

- the channel name remains corrupted for the duration of the call (i.e. you never see a good ENTERQUEUE, bad CONNECT, then a good COMPLETECALLER)

- the replacement text is not consistent.  It can differ between the CONNECT and TRANSFER events (see example 1)

- if the call is transferred to an extension which enqueues the caller to another queue, the corruption is cleared, and does not typically re-appear (see example 1)

I cannot reproduce the bug on demand in my lab environment.  On a call center fielding about 5000 calls per day, I see this on 5-10 of those calls.  The main impact of this is that queue analysis programs never see the CONNECT or COMPLETEXXXXXX events, so they think that any corrupted call is waiting forever.  Having such a large wait time for several calls on a queue knocks statistics out of whack.


****** ADDITIONAL INFORMATION ******

(sorry for the crude masking; management insisted)

Example 1:

1184274299|1184274276.4611|queuename1|NONE|ENTERQUEUE||XXXXXXXXXX
1184274342|1184274276.4611|queuename1|
Context: macro-qlogoff
Extension: s
Priority: 6
Application: Goto
AppData: s-REMOVED|1
Uniqueid: 1184274341.4625

|CONNECT|43|1184274332.4623
1184274420|1184274276.4611|queuename1|Channel2: Zap/8-1
Uniqueid1: 1184274375.4632
Uniqueid2: 1184274375.4633
CallerID1: xxxXXXXXXX
CallerID2: 918777729158

|TRANSFER|7202|from-internal-queuename1in|43|78
1184274420|1184274276.4611|queuename2|NONE|ENTERQUEUE||XXXXXXXXXX
1184274447|1184274276.4611|queuename2|SIP/1420|CONNECT|27|1184274440.4648
1184274484|1184274276.4611|queuename2|SIP/1420|COMPLETECALLER|27|37|2

Example 2:

1184274342|1184274315.4617|queuename1|NONE|ENTERQUEUE||XXXXXXXXXX
1184274437|1184274315.4617|queuename1|bxtel-01.comwave">
<dialog id="301">
<state>confirmed</state>
</dialog>
</dialog-info>
|CONNECT|95|1184274420.4643
1184274777|1184274315.4617|queuename1|iority: 5
Application: Playback
AppData: agent-loggedoff&cw/nccc/qname_queuename3
Uniqueid: 1184274743.4697

|COMPLETEAGENT|95|340|2

Example 3:

1184275012|1184274986.4737|queuename1|NONE|ENTERQUEUE||XXXXXXXXXX
1184275100|1184274986.4737|queuename1| 'SIP/1212-9d68dd90'
|CONNECT|88|1184275088.4769
1184275122|1184274986.4737|queuename1|
Waiting: 1
New: 2
Old: 0

|TRANSFER|7202|from-internal-queuename1in|88|22
1184275122|1184274986.4737|queuename2|NONE|ENTERQUEUE||XXXXXXXXXX
1184275136|1184274986.4737|queuename2|SIP/1419|CONNECT|14|1184275122.4778
1184275275|1184274986.4737|queuename2|SIP/1419|COMPLETECALLER|14|139|1

queues.conf configuration for the problem queue:

[queuename1]
musiconhold=default
strategy=rrmemory
servicelevel=60
timeout=20
retry=5
maxlen=0
announce-frequency=120
announce-holdtime=once
periodic-announce-frequency=0
joinempty=yes
leavewhenempty=no
reportholdtime=yes
monitor-format=wav
context=ivr-nccc-leavequeue
ringinuse=no

Dialplan for initial enqueue from IVR:

[ivr-nccc-queuename1]
exten   => s,1,Playback(pls-stay-on-line)
exten   => s,n,Macro(enqueue,queuename1,queuename1,queuename1In)

[macro-enqueue]
; $ARG1 = callerid
; $ARG2 = qname
; $ARG4 = spygroup
; $ARG5 = 1 to boost prio
exten => s,1,Set(CALLERID(name)=Q-${ARG1})
exten => s,n,Set(MONITOR_FILENAME=/var/spool/asterisk/recordings/${STRFTIME(,,%Y%m%d)}/${ARG2}/${ARG2}_${UNIQUEID})
exten => s,n,Set(_SPYGROUP=${ARG3})
exten => s,n,GotoIf(${ARG4}?highprio:normprio)
exten => s,n(highprio),Set(QUEUE_PRIO=10)
exten => s,n(normprio),Queue(${ARG2},tW)
exten => s,n,Goto(${IVR_CONTEXT}|s|top)

Dialplan for enqueue from one queue to another (used in the examples above where you see an agent transfer a corrupted call to another queue):

[inc-enqueue]
exten => 7201,1,Macro(enqueue,queuename1,queuename1,queuename1In)
exten => 7202,1,Macro(enqueue,NewServices-CS,queuename2,NCCC)
exten => 7203,1,Macro(enqueue,Billing-CS,queuename2,NCCC)
exten => 7204,1,Macro(enqueue,ProductName2-TS,queuename2,NCCC)
exten => 7205,1,Macro(enqueue,ProductName1-TS,queuename3,TS)
exten => 7206,1,Macro(enqueue,ProductName2-TS,cs_ld,NCCC)
Comments:By: Joshua C. Colp (jcolp) 2007-07-13 09:21:52

It actually seems to be the member name that has gone funky... and it's not always manager events, there's some SIP stuff in there too.

By: Joshua C. Colp (jcolp) 2007-07-13 09:22:11

Are you using dynamic queue members?

By: jfitzgibbon (jfitzgibbon) 2007-07-13 09:42:08

Yes, I am using AddQueueMember.  Always plain SIP devices, not Local channels.

Here are the sections of my dialplan related to login / logoff.  I am not using the new membername parameter added in 1.4.

[inc-agent-logon-logoff]
exten => 7081,1,Answer
exten => 7081,n,Set(QLOGINRESULTS=)
exten => 7081,n,Macro(qlogin,queuename1,qname_queuename1)
exten => 7081,n,Playback(${QLOGINRESULTS})
exten => 7081,n,Hangup

exten => 7091,1,Answer
exten => 7091,n,Set(QLOGINRESULTS=)
exten => 7091,n,Macro(qlogoff,queuename1,qname_queuename1)
exten => 7091,n,Playback(${QLOGINRESULTS})
exten => 7091,n,Hangup

[macro-qlogin]
; $ARG1 = qname
; $ARG2 = qname sound file
exten => s,1,GotoIf(${DB_EXISTS(dynlogin/logintime-${CUT(CHANNEL,,1)})}?alreadyloggedin)
exten => s,n,QueueLog(NONE,${UNIQUEID},${CUT(CHANNEL,,1)},AGENTLOGIN,-)
exten => s,n,Set(DB(dynlogin/logintime-${CUT(CHANNEL,,1)})=${EPOCH})
exten => s,n(alreadyloggedin),AddQueueMember(${ARG1},${CUT(CHANNEL,,1)})
exten => s,n,Goto(s-${AQMSTATUS},1)

exten => s-ADDED,1,Macro(addtovar,QLOGINRESULTS,agent-loginok&mysounds/${ARG2},&)

exten => s-MEMBERALREADY,1,Macro(addtovar,QLOGINRESULTS,mysounds/agent-alreadyqmember&mysounds/${ARG2},&)

exten => s-NOSUCHQUEUE,1,Congestion

[macro-addtovar]
; $ARG1 - var name
; $ARG2 - thing to append
; $ARG3 - joiner
exten => s,1,GotoIf(${ISNULL(${${ARG1}})}?first)
exten => s,n,Set(${${ARG1}}=${${ARG1}}${ARG3}${ARG2})
exten => s,n(first),Set(${ARG1}=${ARG2})

[macro-qlogoff]
; $ARG1 = qname
; $ARG2 = qname sound file
exten => s,1,GotoIf(${DB_EXISTS(dynlogin/logintime-${CUT(CHANNEL,,1)})}?loggedin:notloggedin)
exten => s,n(loggedin),Set(LOGINTIME=$[${EPOCH} - ${DB(dynlogin/logintime-${CUT(CHANNEL,,1)})}])
exten => s,n,QueueLog(NONE,${UNIQUEID},${CUT(CHANNEL,,1)},AGENTLOGOFF,-,${LOGINTIME})
exten => s,n,NoOp(${DB_DELETE(dynlogin/logintime-${CUT(CHANNEL,,1)})})
exten => s,n(notloggedin),RemoveQueueMember(${ARG1},${CUT(CHANNEL,,1)})
exten => s,n,Goto(s-${RQMSTATUS},1)

exten => s-REMOVED,1,Macro(addtovar,QLOGINRESULTS,agent-loggedoff&mysounds/${ARG2},&)

exten => s-NOTINQUEUE,1,Macro(addtovar,QLOGINRESULTS,mysounds/agent-notqmember&mysounds/${ARG2},&)

exten => s-NOSUCHQUEUE,1,Congestion

By: Stuart Henderson (stuarth) 2007-07-14 08:15:51

Ah good, it's not just me then. We get this too, using ODBC realtime queues (with queue members added to SQL through a web interface), but you managed to collect more information about it than I've had chance to do so far.

Yes, it's always the member name that's corrupt.

By: jfitzgibbon (jfitzgibbon) 2007-07-18 16:09:19

I tried using the membername parameter (set to the same value as the channel) in the hopes that it might make a difference.

exten => s,n,AddQueueMember(${ARG1},${CUT(CHANNEL,,1)},,,${CUT(CHANNEL,,1)})

But it didn't seem to have any impact.  I've seen as many corrupted entries in the past two days as I did in the days when I was using this form of AddQueueMember:

exten => s,n,AddQueueMember(${ARG1},${CUT(CHANNEL,,1)})

By: Stuart Henderson (stuarth) 2007-07-18 17:03:40

Seems it happens a lot more often to you than me, then... Since the first version you mentioned was 1.4.5 I went looking for changes between 1.4.4 and there; I did wonder about a free() to fix a memory leak (r65394):

Fix a memory leak that I just noticed in the device state
handling in app_queue. On most device state changes, it would
leak roughly 8 to 64 bytes (the length of the name of the
device). ........

I didn't trace it far enough to tell if there was a use-after-free problem, but the fact that it's the device name, and was introduced between 1.4.4 and 1.4.5 did make me wonder. You might want to try reverting it, then you can at least blow a hole in my theory (-:

--- apps/app_queue.c.orig    2007-07-18 23:12:24.000000000 +0100
+++ apps/app_queue.c 2007-07-18 00:44:23.000000000 +0100
@@ -584,13 +584,13 @@ static void *changethread(void *data)
                       }
               }
               ast_mutex_unlock(&q->lock);
       }
       AST_LIST_UNLOCK(&queues);

-       free(sc);
+       /* XXX ??? free(sc); */

       return NULL;
}

static int statechange_queue(const char *dev, int state, void *ign)
{

By: Mark Michelson (mmichelson) 2007-07-20 11:17:00

I'm investigating this and I've turned up a number of leads which turned out to be red herrings. There is one thing that jumped out as being a possible cause though.

That cause would be reloading queues. Do either of you regularly reload queues during operation? If so, do you happen to notice that the errors occur close to the same time as the reload?

By: jfitzgibbon (jfitzgibbon) 2007-07-20 12:20:28

I rarely reload queues throughout the day, as I use AddQueueMember pretty much exclusively.

Yesterday for example I had no config reloads at all (other than the bogus one inserted when you do 'logger reload', but I had corruption 5 times.  There is no automated reload in this system, and I was not in the office, so nobody else was touching the console all day.

1184832127|NONE|NONE|NONE|CONFIGRELOAD|
1184844997|NONE|NONE|NONE|QUEUESTART|
1184874777|1184874735.5466|sales|dummy@queuemetrics-1246,1'
|CONNECT|14|1184874763.5469
1184876249|1184876169.5907|sales|1f9291b874a9113efee1765fabd6e@10.1.1.16
CSeq: 103 BYE
User-Agent: XXXXXXX
Max-Forwards: 70
Content-Length: 0

|TRANSFER|7202|from-internal-salesout-manual|14|43
1184878998|1184878476.6525|sales|in MixMonitor Recording Zap/85-1
|COMPLETEAGENT|16|479|1
1184883294|1184883243.8467|sales|
|CONNECT|15|1184883279.8490
1184886423|1184886390.10302|sales|dummy@queuemetrics-81ad,1'
|CONNECT|11|1184886412.10321

The logger reload was at 04:02am

The corruption happened at:

3:52pm
4:17pm
5:03pm
6:14pm
7:07pm

When I do make changes to extensions and sip conf, I tend to reload using 'sip reload' or 'extensions reload' or 'ael reload' - I stay away from a global 'reload' whenever possible.

By: Stuart Henderson (stuarth) 2007-07-20 13:20:48

We rarely reload queues, usually only the dialplan. Haven't seen it since removing the free in changethread(), but that's not conclusive as we could go for some weeks without seeing it..

By: Mark Michelson (mmichelson) 2007-07-20 15:09:28

Well, that blows that theory out of the water. I'll keep looking to see what it could be. Thanks, by the way, for your timely replies. They're much appreciated.

By: Mark Michelson (mmichelson) 2007-07-23 11:08:10

I just thought I'd give an update.

After looking at the code for some time, I believe that what might be happening is an unfortunately timed queue member logout. Basically what I mean is that a caller calls into the queue and the queue determines which members are eligible to receive calls. At this point, a member from the list logs out. The queue still attempts to make the call to the member anyway (and it is possible under certain circumstances that this operation can succeed). There would be two noticeable issues if this were the case. One would be the corrupted queue log messages (as well as some similarly messed up manager messages). The other would be that a member would receive a call after he had logged out.

If either of you have ever had complaints from your call center agents of receiving a call after logging out, then my theory may hold water. Of course, if your agents are members of multiple queues and they have no indication of which queue they are receiving a call from, then they may not even realize they're getting a call from a queue from which they have logged out.

The times that jfitzgibbon reported as times when the corruption occurred leads me to believe this is a possibility given that those times seem like possible times that agents would be logging out.

So my request for feedback:
Have you ever had a call center agent mention/complain about receiving a call from a queue after he has logged out? jfitzgibbons, do you know if those times you reported that the corruption occurred happen to coincide with member logouts?

By: jfitzgibbon (jfitzgibbon) 2007-07-23 11:23:51

I have not heard reports of agents getting calls after logging out.

As for the times agent log on and off, our current procedure has them logging off for bathroom breaks, etc. so there are many more logon/logoff events in our queue log than you might expect to see if we were using pause instead.

For the 5 corruption events on Thursday July 20th, I checked for member logon/logoff near the time of the corruption:

#1

1184874777|1184874735.5466|sales|dummy@queuemetrics-1246,1'
|CONNECT|14|1184874763.5469

- preceeed by an AGENTLOGOFF 10 seconds before
- followed by an AGENTLOGIN 14 seconds later

#2

1184876249|1184876169.5907|sales|1f9291b874a9113efee1765fabd6e@10.1.1.16
CSeq: 103 BYE
User-Agent: XXXXXXXXX
Max-Forwards: 70
Content-Length: 0

|TRANSFER|7202|from-internal-salesout-manual|14|43

- preceeded by an AGENTLOGIN 33 seconds before
- followed by an AGENTLOGIN, but it was 2 minutes and 8 seconds later

#3

1184878998|1184878476.6525|sales|in MixMonitor Recording Zap/85-1
|COMPLETEAGENT|16|479|1

- an AGENTLOGIN fired at the exact same time as this corruption
- all other AGENTLOGIN/AGENTLOGOFF events were separated in time in either direction by at least a minute

#4

1184883294|1184883243.8467|sales|
|CONNECT|15|1184883279.8490

- preceeded by an AGENTLOGIN by 2 seconds
- preceeded by an AGENTLOGOFF by 7 seconds

ASTERISK-1

1184886423|1184886390.10302|sales|dummy@queuemetrics-81ad,1'
|CONNECT|11|1184886412.10321

- preceeded by a AGENTLOGOFF by 7 seconds
- followed by an AGENTLOGIN by 5 seconds


I assume you were looking for a bit more "same-second" correlation than my log shows.

If needed, I can try to do this analysis (well, script this analysis - it's a PITA to do manually) for other days and see if something more promising comes to light.

By: Mark Michelson (mmichelson) 2007-07-23 13:01:38

Even assuming that there could be latency between when events actually occur and when they're written, those gaps seem too large to indicate what I'm thinking of.

Once again, thank you very much for the quick and detailed response. I'll get back to work trying to figure out what's going on here.

By: Mark Michelson (mmichelson) 2007-07-24 14:08:46

I've uploaded a patch which adds some locking and hopefully will keep this problem from happening. Give it a try and let me know if it works out for you.

By: jfitzgibbon (jfitzgibbon) 2007-07-25 07:22:11

I'm running the patch against 1.4.8 today.  I'll report back if we see any corruption.

I must admit to being a bit jumpy already though -  within a few minutes of the call center opening I saw this on the console:

[Jul 25 08:31:40] ERROR[7566]: /home/james/asterisk-src/asterisk-1.4.8/include/asterisk/lock.h336 __ast_pthread_mutex_unlock: app_queue.c line 2720 (try_calling): mutex '&member->member_lock' freed more times than we've locked!

That context is this bit at the end of try_calling():

out:
       hangupcalls(outgoing, NULL);
       if(member)
               ast_mutex_unlock(&member->member_lock);
       return res;

(the patch applied to 1.4.8, but with 4-7 lines of fuzz, so my line numbers will be slightly off.  The patch does not go cleanly against 1.4.9, probably due to some block-related cleanup done to app_queue)

Is that error something I should back out the patch over?

By: jfitzgibbon (jfitzgibbon) 2007-07-25 07:57:16

I've removed the patch.

I saw two core dumps in the span of 20 seconds (the second almost immediately after safe_asterisk restarted).  Both SIGSEGV, both in __ast_pthread_mutex_unlock

I've attached the full backtraces.

By: Mark Michelson (mmichelson) 2007-07-25 11:30:26

Regarding your first note in reply to my patch:

That is a problem, and is an indication that my patch is flawed and you should not use it. I believe I've located the problem with regard to this. I'm going to withhold posting another patch until I also solve the problem in your second note.

Regarding your second note:

I'm still looking into this and will get back to you when I (or someone else as it may be) can figure out why that crash is happening.

By the way, all future patches I post here will be against 1.4.9, since there were several large changes to app_queue which will make patches against older versions not apply cleanly.

By: Stuart Henderson (stuarth) 2007-07-25 12:55:22

ok, it's definitely nothing to do with r65394. Here's some of our last ones, some strings masked with #. Piped through cat -v, see 1184928343 (probably not related to this but ^A is not quite right :-)

One hunch I had was that the DB could have been out of available pooled connections at the times this happened, I have no way to confirm though (I can't seem to raise it, limit= in res_odbc.conf doesn't seem to do a lot...)


<sthen@pbx1-hex:26>$ grep -vE '(SIP|NONE|Local|\|\|RINGNO|\|\|COMPLETE|\|m\|COMPLETE|g\|COMPLETE|\|H\|COMPLETE)' queue_log.43 | cat -v
1184830363|1184830360.5033|w########|73e30^M
State: Up^M
CallerID: 0##########^M
CallerIDName: <unknown>^M
Uniqueid: 1184830361.5036^M
^M
|CONNECT|3|1184830360.5034
1184830368|1184830283.5016|inbound|, W:0, C:0, A:0, SL:0.0% within 0s
  No Members
  No Callers

|RINGNOANSWER|16000
1184860831|1184860782.8645|inbound|H|RINGNOANSWER|6000
1184860833|1184860822.8687|inbound|ote-322c,2^M
State: Busy^M
CallerID: 0##########^M
CallerIDName: The #############|RINGNOANSWER|10000
1184928343|1184928260.11048|w########|g^A|COMPLETECALLER|9|74|1
1185203246|1185203064.8871|payments|data,duration,billsec,disposition,amaflags,accountcode,uniqueid,userfield)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?M-H|COMPLETEAGENT|14|168|1
1185284021|1185283869.1897|inbound|setcli-63ac,2^M
State: Ring^M
CallerIDNum: <unknown>^M
CallerIDName: <unknown>^M
Uniqueid: 1185284005.1938^M
^M
|COMPLETECALLER|8|144|1
1185358613|1185358552.5121|inbound|data,duration,billsec,disposition,amaflags,accountcode,uniqueid,userfield)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?M-H|COMPLETEAGENT|1|60|1
|COMPLETECALLER|1|50|1
1185380341|1185380289.8118|inbound|166.5.133-b7752878'
|COMPLETECALLER|0|52|1

By: jfitzgibbon (jfitzgibbon) 2007-08-10 09:04:04

Having upgraded a few times since I last posted on this ticket, we're still seeing it, but some days we don't see it at all (which I can't really explain).  Here's the distribution for the last 28 days:

[root@pbxtel-01 james]# zgrep -c '^[|]' /var/log/asterisk/queue_log.*.gz | sort
/var/log/asterisk/queue_log.1.gz:6
/var/log/asterisk/queue_log.2.gz:0
/var/log/asterisk/queue_log.3.gz:20
/var/log/asterisk/queue_log.4.gz:7
/var/log/asterisk/queue_log.5.gz:0
/var/log/asterisk/queue_log.6.gz:0
/var/log/asterisk/queue_log.7.gz:0
/var/log/asterisk/queue_log.8.gz:4
/var/log/asterisk/queue_log.9.gz:7
/var/log/asterisk/queue_log.10.gz:6
/var/log/asterisk/queue_log.11.gz:5
/var/log/asterisk/queue_log.12.gz:0
/var/log/asterisk/queue_log.13.gz:1
/var/log/asterisk/queue_log.14.gz:3
/var/log/asterisk/queue_log.15.gz:2
/var/log/asterisk/queue_log.16.gz:0
/var/log/asterisk/queue_log.17.gz:2
/var/log/asterisk/queue_log.18.gz:2
/var/log/asterisk/queue_log.19.gz:0
/var/log/asterisk/queue_log.20.gz:3
/var/log/asterisk/queue_log.21.gz:6
/var/log/asterisk/queue_log.22.gz:5
/var/log/asterisk/queue_log.23.gz:9
/var/log/asterisk/queue_log.24.gz:6
/var/log/asterisk/queue_log.25.gz:46
/var/log/asterisk/queue_log.26.gz:3
/var/log/asterisk/queue_log.27.gz:15
/var/log/asterisk/queue_log.28.gz:3
[root@pbxtel-01 james]#

And here's the detail for yesterday:

[root@pbxtel-01 james]# zgrep '^[^0-9]' /var/log/asterisk/queue_log.1.gz
|CONNECT|49|1186701799.8650
|TRANSFER|7202|from-internal-salesin|49|17
|CONNECT|47|1186702301.8751
|TRANSFER|7202|from-internal-salesout-manual|47|29
Context: macro-addtovar
Extension: s
Priority: 1
Application: GotoIf
AppData: 1?first
Uniqueid: 1186703272.8947

|TRANSFER|7202|from-internal-salesout-manual|14|32
Context: queuemetrics
Extension: pause
Priority: 3
Application: AGI
AppData: cw-agentpause.agi|--mode=pause|--agent=1402|--pausereason=32
Uniqueid: 1186709573.10249

|COMPLETEAGENT|18|349|1
[root@pbxtel-01 james]#

We're now running 1.4.10 (with app_queue from the 1.4 SVN branch to pick up the fix for bug 10320)

Any idea what direction to take this in?

By: Mark Michelson (mmichelson) 2007-08-13 09:48:40

Sorry for the lack of activity on this issue. The problem is that it's easy enough to see where corrupt memory is being accessed, but it's hard to see exactly why it's being corrupted. My previous fix (adding locking for queue member operations) seems to be the right way to go, since it would insure that a member's information could not change unless the lock were held. The only problem is that like you discovered, the patch itself has issues. And the problem with the patch is that aside from one easily fixed problem, it's not obvious what's wrong with the patch either. This will be addressed in due time.

By: Tilghman Lesher (tilghman) 2007-08-13 11:06:55

Let's try this patch, with a touch more locking and see if that fixes the remaining issues.

By: Konrad Rozycki (krdian) 2007-08-15 08:57:38

Is this patch fixing problem ? I have the same problem with 1.4.9 and 1.4.0.1 and Local channels.



By: Tilghman Lesher (tilghman) 2007-08-15 12:16:36

krdian:  We need you to test the patch.  At the very least, it's going to perform better than what you likely have now, although whether it's going to completely fix the problem is something only you (and other people willing to test this) can tell us.

By: Stuart Henderson (stuarth) 2007-08-16 03:23:49

I just had this report after patching 1.4.10.1 and reloading last night:

'did nasty nasty stuff this morning as calls started to come in.
basically the CLI became unresponsive to "show queue XXX" requests
and calls were just ringing rather than getting routed into the queue.'

By: jfitzgibbon (jfitzgibbon) 2007-08-30 08:13:12

Given stuarth's report, I cannot install this on my production system.

The problem only manifests under production load, and I cannot risk having to dump all my callers.

stuarth, can you provide any further details so that the bug team might be able to revise the patch?

By: Stuart Henderson (stuarth) 2007-08-30 09:37:45

not really, we've got the same problem with the corruption only showing up in production; so testing is obviously difficult for us too.

By: Amilcar S Silvestre (amilcar) 2007-09-04 14:50:45

The same issue occours with me in all my 1.4 installations.

Using Agentcallbacklogin, the agent name in COMPLETECALLER and COMPLETEAGENT lines is changed to another valid agent name, or funky text with pieces of manager events, sip stuff and so on.

I'm able to reproduce this bug in mid-load callcenters, but can't test this latest patch given stuarth's report, because they are very sensitive production environments.

This bug, for me, is a blocker one for 1.2 to 1.4 migration. Many softwares (specially callcenters) relies on queue_log to produce statistics and so on.

By: Tilghman Lesher (tilghman) 2007-09-05 22:57:33

Okay, so after much discussion, we're going to bite the bullet and implement reference counting in app_queue on members.  This should clear up a great deal of the possible existing race conditions.  New patch uploaded.

This has been tested in a limited way, using a single queue, two members, and two callers, so we'll need you all to test this further.

By: Stuart Henderson (stuarth) 2007-09-06 18:54:06

We're running with it (note to others wanting to test: the diff is against a sandbox merged from branches/1.4, it applied cleanly to recent 1.4 svn checkout). It's passed a few calls ok, will report back tomorrow when there's been some real traffic.

By: Stuart Henderson (stuarth) 2007-09-07 08:37:22

Passed c.5k calls with this diff and no trouble yet. amilcar, jfitzgibbon, anyone else interested; I think this diff is ok to test on production systems, please give it a try asap. We typically only see a corrupted queue_log line once every couple of days, so it would be nice to have a report from someone seeing it more frequently.



By: jfitzgibbon (jfitzgibbon) 2007-09-07 08:41:37

I've love to test it, but all patches have been banned by management and we've rolled back to 1.4.7.1 due to the problems reported in bug ASTERISK-10201.

The only way I'm likely to get this code into production is if it becomes part of 1.4 ABE, which management is chomping at the bit to purchase.

By: Digium Subversion (svnbot) 2007-09-10 12:43:29

Repository: asterisk
Revision: 82155

------------------------------------------------------------------------
r82155 | tilghman | 2007-09-10 12:43:29 -0500 (Mon, 10 Sep 2007) | 2 lines

Convert struct member to use refcounts (closes issue ASTERISK-9873)

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-09-10 13:14:56

Repository: asterisk
Revision: 82178

------------------------------------------------------------------------
r82178 | tilghman | 2007-09-10 13:14:55 -0500 (Mon, 10 Sep 2007) | 10 lines

Merged revisions 82155 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r82155 | tilghman | 2007-09-10 13:02:02 -0500 (Mon, 10 Sep 2007) | 2 lines

Convert struct member to use refcounts (closes issue ASTERISK-9873)

........

------------------------------------------------------------------------