Summary:ASTERISK-12900: Abort in free in queue show <name> command
Reporter:David Woolley (davidw)Labels:
Date Opened:2008-10-15 08:56:36Date Closed:2008-11-14 10:55:51.000-0600
Versions:Frequency of
Environment:Attachments:( 0) 13703_caseinsensitive.patch
( 1) 13703.patch
( 2) backtrace13703
( 3) no-malloc-debug-but-crashed-valgrind.txt
( 4) try2-malloc_debug.txt
( 5) try2-valgrind.txt
( 6) valgrind.txt
Description:We were getting bogus queue not found messages from the QUEUE_MEMBER (and QUEUE_MEMBER_COUNT function), so after running "queue show", which gave the truncated queue name, we tried queue show <name> with the full name.


It appears to be doing something completely inappropriate for queue show!  Last non-libc frame:

ASTERISK-2  0x006372ab in destroy_queue (obj=0x917f150) at app_queue.c:1410
1410           free(q->sound_periodicannounce[i]);

Typical QUEUE_MEMBER failure.  Note I've had to obfuscate names, but I've kept them distinct and kept them the same length.

[Oct 15 13:32:11] WARNING[30939]: app_queue.c:4880 queue_function_qac: queue "QQQQQQQQQQQQQQQ" was not found

"queue show" extract for the queue in question (note name truncated 3 characters, as I suspect that may be significant):

QQQQQQQQQQQQ has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W
:0, C:0, A:0, SL:0.0% within 0s
     Anonymous (Agent/1001) (Not in use) has taken no calls yet
     WhoKnows (Agent/3003) (Unavailable) has taken no calls yet
  No Callers

Provisisonal back trace.  Full set to follow.

Core was generated by `/usr/sbin/asterisk -f -vvvg -c'.
Program terminated with signal 6, Aborted.
#0  0x00bfc402 in __kernel_vsyscall ()
(gdb) bt
#0  0x00bfc402 in __kernel_vsyscall ()
#1  0x00138d20 in raise () from /lib/libc.so.6
#2  0x0013a631 in abort () from /lib/libc.so.6
#3  0x00170e6b in __libc_message () from /lib/libc.so.6
#4  0x00178b16 in _int_free () from /lib/libc.so.6
ASTERISK-1  0x0017c070 in free () from /lib/libc.so.6
ASTERISK-2  0x006372ab in destroy_queue (obj=0x917f150) at app_queue.c:1410
ASTERISK-3  0x080769ca in ao2_ref (user_data=0x917f150, delta=-1) at astobj2.c:187
ASTERISK-4  0x006332d8 in queue_unref (q=0x917f150) at app_queue.c:575
ASTERISK-5  0x00634a85 in interface_exists_global (interface=0x9144658 "Agent/1001",
   lock_queue_container=1) at app_queue.c:1038
ASTERISK-6 0x00634acc in remove_from_interfaces (interface=0x9144658 "Agent/1001",
   lock_queue_container=1) at app_queue.c:1048
ASTERISK-7 0x006371f9 in free_members (q=0x917f150, all=1) at app_queue.c:1393
ASTERISK-8 0x00637265 in destroy_queue (obj=0x917f150) at app_queue.c:1406
ASTERISK-9 0x080769ca in ao2_ref (user_data=0x917f150, delta=-1) at astobj2.c:187
ASTERISK-10 0x006332d8 in queue_unref (q=0x917f150) at app_queue.c:575
ASTERISK-11 0x00646efb in __queues_show (s=0x0, fd=24, argc=3, argv=0xb7743d88)
   at app_queue.c:5529
ASTERISK-12 0x00647194 in queue_show (e=0x651600, cmd=-4, a=0xb7743d64)
   at app_queue.c:5582
ASTERISK-13 0x080977b4 in ast_cli_command (fd=24,
   s=0xb7743eac "queue show qqqqqqqqqqqqqqq") at cli.c:1887
Comments:By: Tilghman Lesher (tilghman) 2008-10-15 10:09:20

Please read doc/valgrind.txt.

By: David Woolley (davidw) 2008-10-15 10:56:26

It didn't crash this time, but did produce these errors:

*CLI> queue show qqqqqqqqqqqqqqq
QQQQQQQQQQQQ has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W
:0, C:0, A:0, SL:0.0% within 0s
     Anonymous (Agent/1001) (Unavailable) has taken no calls yet
     WhoKnows (Agent/3003) (Unavailable) has taken no calls yet
  No Callers

*CLI> [Oct 15 15:55:31] ERROR[6679]: astobj2.c:183 ao2_ref: refcount -1 on object 0x4c83b08
[Oct 15 15:55:31] ERROR[6679]: astobj2.c:183 ao2_ref: refcount -2 on object 0x4c83b08
stop now

malloc_debug is empty; attaching valgrind.txt.

By: David Woolley (davidw) 2008-10-15 11:11:36

Got down to a reference count of -11, then the queue vanished without trace!

queue show qqqqqqqqqqqqqqq[Oct 15 16:12:40] ERROR[7083]: astobj2.c:183 ao2_ref:
refcount -11 on object 0x4c8c300

No such queue: qqqqqqqqqqqqqqq.

Some other queues are still there in "queue show"

By: Mark Michelson (mmichelson) 2008-10-15 11:21:46

It appears there are two separate problems being reported here. Allow me to address each.

1. The QUEUE_MEMBER function is failing to find a queue that it should be able to find with no problem. Let me first say that this is not related to the truncation of the queue name that you see when issuing a 'queue show' command. The truncation is just a formatting choice when printing the queue's name. I'm not sure, immediately, why there would be trouble finding the queue name, but to be honest, I'm more concerned about the other problem you reported.

2. There is a crash which has happened when you issued a 'queue show' command. Based on the backtrace, it appears that the reference count for a queue reaches 0, and then inside the destructor function for the queue, another function call causes the reference count to go back up to 1 temporarily and then back to 0, causing a crash to occur. This seems very familiar, as if I fixed something exactly like this at one point. I'll have to take a look at the commit history for app_queue and see if I find anything.

By: David Woolley (davidw) 2008-10-15 11:24:09

Oops.  Got interrupted between building and installing, whilst waiting for someone to install valgrind, which is why there is no malloc debug output!

By: David Woolley (davidw) 2008-10-15 11:33:27

Now uploaded files from the correct build.

In the meantime, the no-optimise only build did eventually crash under valgrind, so I'll upload that one as well.

By: David Woolley (davidw) 2008-10-15 11:47:58

WRT to the two different issues.  I was going to report the queue function not finding the queue one separately, when I'd investigated a bit more, but mentioned it here as it was in that context that I generated the crash; I was trying to see if the name had been physically truncated.

It may, of course, be that the memory corruption led to the failure to find.  However, I think there has to be more to it, otherwise I presume it would have been found before.

As to the reference counts, note that they are going distinctly negative.

By: Mark Michelson (mmichelson) 2008-10-15 11:53:39

I think I may have found the problem, and it appears to have to do with a piece of bad logic in the function which handles the 'queue show' CLI command. Essentially, what I've found is that if you call queue show <some_queue>, then all realtime queues will end up with their reference counts one lower than they should be. Just to be sure I'm on the right track, though, can you confirm that you are using realtime queues?

By: David Woolley (davidw) 2008-10-15 12:12:21

We are NOT using any realtime capabilities.

By: David Woolley (davidw) 2008-10-15 12:20:07

One other data point, although for the test that produced the malloc_debug, I used nothing but a CLI command, our CTI application does the equivalant of queue show <name>, using the AMI interface.

By: David Woolley (davidw) 2008-10-15 12:34:40

I received a correction.  The CTI only uses QueuePause with named queues, not QueueStatus.

By: Mark Michelson (mmichelson) 2008-10-15 13:50:57

Okay, if you're not using realtime capabilities, I have misdiagnosed the problem. I see a few places in the code where refcounting is not used properly, but in all the cases I've looked at, they are not conditions which could lead to the reference count being too low.

Do you have a reliable set of steps you can use to make this crash occur?

By: Mark Michelson (mmichelson) 2008-10-15 15:15:34

I did an audit of app_queue in the 1.6.0 branch and found several reference counting errors. I have uploaded a patch, 13703.patch, which corrects these errors. I'm fairly certain that the root of the issue here is in bad refcounts, but I'm not sure of where exactly the count goes awry. 13703.patch corrects some general locking/refcounting issues I found just by reading the code. Please test to see if this solves the crash you reported. Thanks!

By: David Woolley (davidw) 2008-10-15 15:22:14

At home now, but will try tomorrow.

Getting a crash may be a bit hit and miss, but whenever I started it with no traffic, and immediate did queue show <queue name>, I got error messages about the negative reference count.

By: David Woolley (davidw) 2008-10-16 05:04:40

Still getting negative reference counts:

     Anonymous (Agent/1001) (Unavailable) has taken no calls yet
     WhoKnows (Agent/3003) (Unavailable) has taken no calls yet
  No Callers

*CLI> [Oct 16 10:10:52] ERROR[8968]: astobj2.c:183 ao2_ref: refcount -1 on object 0x4dc2aa8
[Oct 16 10:10:52] ERROR[8968]: astobj2.c:183 ao2_ref: refcount -2 on object 0x4dc2aa8

By: David Woolley (davidw) 2008-10-16 05:59:22

Hypothesis:  both this and the queue function failure might be explained if the hash based lookup was failing.  It is assuming that, if the iterator finds it, the hash would also have found it.

By: David Woolley (davidw) 2008-10-16 06:26:51

The hash function is giving 42, when looking up the entry when doing the queue function lookup, which might be good for hitchhiking the galaxy, but not good when the queue entry is in hash bucket 10 (as found by brute forcing queues, with gdb).  42 is also what one gets for queue show <name>.

Hypothesis: the hash function is computed with the original case of the name when loading the table, but, for the function, the name is forced to lower case and for my queue show <name> I forced it to lower case.  (The original is mixed case, although I obfuscated to upper case.)

By: David Woolley (davidw) 2008-10-16 06:41:34

Doing queue show <name> with the exact capitalisation produces the correct hash bucket and no error messages.  Hypothesis confirmed.

By: David Woolley (davidw) 2008-10-16 07:01:31

(Parenthetically, as it isn't necessarily all the same problem, lowercasing the queue name in the dialplan and queues.conf doesn't solve the problem with the queue function, although it does mean that you can now use queue show <name> with all lower case, without getting errors.

The remaining problem with the queue function seems to be the result of having apostrophes from a variable substitution in the queue name parameter, which is a change from 1.4.  I need to check if that is an expected difference, before raising that issue.

On further research, the queue function problem looks like it is a combination of this problem and ASTERISK-1316249, either being sufficient.)

By: Mark Michelson (mmichelson) 2008-10-16 10:06:24

Excellent work, davidw. Thanks for doing all the brute force work with gdb to find the problem. So it looks like an audit is necessary to make sure that all user-input queue names are translated to the case used when originally hashing the queue name. I'll write another patch and upload it as soon as possible.

By the way, the 13703.patch is still valid, since it is correcting refcount errors, so the next patch I upload will be a combination of 13703.patch and one which helps with regards to the case of queue names.

Thanks again for all your hard work on this one!

By: Mark Michelson (mmichelson) 2008-10-16 10:10:25

Also, your explanation explains perfectly why I wasn't able to reproduce this myself. I always tab-complete the queue name on the CLI, so as a result, the queue name was exactly the same case as was defined in queues.conf.

By: Mark Michelson (mmichelson) 2008-10-16 12:24:44

I have uploaded 13703_caseinsensitive.patch for testing. In my local tests, this fixed things, but I'd like to confirm that this also works in your setup.

There is one drawback to this patch which may make it insufficient for inclusion in 1.6.0. The way it is implemented now is to change the hash function used in app_queue to convert the name to all lower-case prior to calling ast_str_hash on the name. This makes it so that case-sensitivity is not an issue when using names from the dialplan, AMI, or CLI. This, however, breaks the previous behavior of queues, which was that their names were case-sensitive when parsing from queues.conf. This patch I have submitted may be good for trunk and even 1.6.2, but as far as 1.6.0 and 1.6.1 go, they may need a different patch which doesn't alter previously-defined behavior. I'll work on such a patch.

Edit: I know I mentioned this in a previous note, but just to confirm, 13703_caseinsensitive.patch also includes the refcount fixes from 13703.patch.

By: Mark Michelson (mmichelson) 2008-10-16 16:04:42

I have come up with another solution for 1.6.0 and 1.6.1 which will fix the problems here and which will not cause a behavior change. However, it may slightly degrade performance in setups that have lots and lots of queues.

Essentially, since almost every instance of ao2_find() in app_queue may take user-generated input, using the OBJ_POINTER flag can result in bad hash lookups if the user-generated string is not exactly the same case as what was specified in queues.conf. So, the fix I have is to remove the OBJ_POINTER flag from all ao2_find calls in app_queue. The result of this is that the hashing callback will not be used on the user-generated input used to find a queue. This will result in an O(n) lookup of queues instead of O(1), but this is likely not going to be a problem for most since in versions of Asterisk prior to 1.6.0, finding a queue was an O(n) lookup, too.

By: David Woolley (davidw) 2008-10-17 06:55:39

Looks OK.

I don't imagine that an O(n) search over queues would hurt us much, either.

By: Digium Subversion (svnbot) 2008-11-14 10:53:36.000-0600

Repository: asterisk
Revision: 156883

U   trunk/UPGRADE.txt
U   trunk/apps/app_queue.c
U   trunk/include/asterisk/strings.h

r156883 | mmichelson | 2008-11-14 10:53:36 -0600 (Fri, 14 Nov 2008) | 24 lines

Fix some refcounting in app_queue.c and change the
hashing used by app_queue.c to be case-insensitive.
This is accomplished by adding a new case-insensitive
hashing function.

This was necessary to prevent bad refcount errors
(and potential crashes) which would occur due to the
fact that queues were initially read from the config
file in a case-sensitive manner. Then, when a user
issued a CLI command or manager action, we allowed
for case-insensitive input and used that input to
directly try to find the queue in the hash table. The result
was either that we could not find a queue that was input or
worse, we would end up hashing to a completely bogus value
based on the input.

This commit resolves the problem presented in
issue ASTERISK-12900. However, that issue was reported against
1.6.0. Since this fix introduces a behavior change, I am
electing to not place this same fix in to the 1.6.0 or 1.6.1
branches, and instead will opt for a change which does not
change behavior.



By: Digium Subversion (svnbot) 2008-11-14 10:55:50.000-0600

Repository: asterisk
Revision: 156889

U   branches/1.6.0/apps/app_queue.c
U   branches/1.6.0/include/asterisk/strings.h

r156889 | mmichelson | 2008-11-14 10:55:50 -0600 (Fri, 14 Nov 2008) | 7 lines

This is the 1.6.0 version of revision 156883 of trunk.
This is different in that it preserves the case-sensitiveness
of processing queues from configuration.

closes issue ASTERISK-12900