Summary:ASTERISK-17876: Infinite loop after "queue show" with realtime PostgreSQL queues
Reporter:Cristian Dimache (cristiandimache)
Labels:
Date Opened:2011-05-17 23:21:39
Date Closed:2011-09-29 09:08:47
Versions:1.8.4
Frequency of
Environment:
Attachments:( 0) bt-queueshow.txt
( 1) debug-20110518_-_trim.txt
( 2) MemberTable.sql
( 3) QueueTable.sql
Description:In our system we have realtime queues in PostgreSQL - about 100 of them.
Upon issuing "queue show" the load of the server increases and no inbound calls are served. Checking the debug log, we noticed an infinite loop requesting the same queue over and over again from the PostgreSQL database.


Sending a SIGABRT to Asterisk produced a backtrace - I don't have the DEBUG_THREADS option enabled, but I can add it if it helps.
Comments:By: Cristian Dimache (cristiandimache) 2011-05-17 23:25:01

The file debug-20110518_-_trim.txt contains the debug log with the looped query to the database for queue info.
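
For anyone triaging a similar log, here is a minimal Python sketch of one way to confirm that a single query dominates a debug log. The log line layout is not shown in this report, so the bracketed-timestamp prefix handled below is an assumption, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumption: a log line may start with a bracketed timestamp such as
# "[2011-05-18 10:00:00] "; the real Asterisk debug log layout may differ.
TIMESTAMP = re.compile(r"^\[[^\]]*\]\s*")


def most_repeated_lines(lines, top=3):
    """Return the `top` most frequent non-empty lines, ignoring timestamps."""
    counts = Counter(
        TIMESTAMP.sub("", line.strip())  # drop the prefix so repeats compare equal
        for line in lines
        if line.strip()
    )
    return counts.most_common(top)
```

Passing an open file object (for example the attached debug-20110518_-_trim.txt) as `lines` would list the most frequent statements. A query whose count dwarfs all the others is a strong hint of a loop like the one described here.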

By: Jonathan Rose (jrose) 2011-09-26 15:17:34.068-0500

With this many queues, it'll be pretty hard to reproduce this problem easily.  Could we get a db dump for the queue and queue member tables?  Feel free to sanitize any private data, though I'm sure I probably don't need to mention that odd database entries can cause a myriad of problems that might lead to something like this.

Regardless of odd stuff like that though, we still shouldn't be getting caught in a loop like this.

By: Cristian Dimache (cristiandimache) 2011-09-27 01:55:31.798-0500

Some of the queues are dynamic - we add and remove members via AMI.
There was no adding or removing members from a queue at the time of the infinite loop - just normal operations.
The dumps are attached.

By: Cristian Dimache (cristiandimache) 2011-09-27 01:56:26.341-0500

Queue Member Table

By: Cristian Dimache (cristiandimache) 2011-09-27 01:56:45.752-0500

Queue Table

By: Jonathan Rose (jrose) 2011-09-27 12:10:17.490-0500

Well, I copied the database and ran "queue show", but it just went through with the expected result after some delay.

According to the Mantis version of this report, you have the reproducibility set to "always".  I assume that means that every time you do "queue show" you run into this loop... I might need you to set up some remote access...

Before that though, is this bug unique to any particular versions of Asterisk?  Have you tried it with 1.6.2 and 1.4?  If it's a regression, that might give me something to go on.

I guess I should also ask what version of PostgreSQL you are using just to see if that might have anything to do with it.  I doubt that's the case, but it can't hurt to know.

By: Cristian Dimache (cristiandimache) 2011-09-28 02:35:59.791-0500

Yep, it's always reproducible, but only after about two or three days of uptime - in the first couple of hours it always works for me as intended.
As I said earlier, some queues go through about 200 member add / remove cycles in these two days of uptime, so maybe this matters for my particular setup. The adding and removing of members is done via AMI.
The machine is in production, so for remote access I will need to set up a similar test machine - right now I'm running SVN 332561 and I can test outside business hours (so in about ten hours) whether the problem still exists in this version.
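
For anyone trying to reproduce this on a test machine, here is a rough Python sketch of what one member add / remove cycle looks like at the AMI message level. It only builds the message text. Login, the socket connection and response handling are left out, and the reporter's actual AMI tooling is not shown in this report. The helper names and the interface `SIP/1001` are hypothetical; `QueueAdd` and `QueueRemove` are the standard AMI actions.

```python
def ami_action(action, **fields):
    """Format one AMI action as 'Key: Value' lines ended by a blank line."""
    lines = [f"Action: {action}"]
    lines += [f"{key}: {value}" for key, value in fields.items()]
    return "\r\n".join(lines) + "\r\n\r\n"


def add_remove_cycle(queue, interface):
    """Return the two messages of one add-then-remove cycle for a member."""
    return [
        ami_action("QueueAdd", Queue=queue, Interface=interface),
        ami_action("QueueRemove", Queue=queue, Interface=interface),
    ]


# Hypothetical member; q1206 is a queue name taken from this report.
messages = add_remove_cycle("q1206", "SIP/1001")
```

Sending a few hundred such cycles to a test instance over an authenticated AMI session might approximate the churn described above, though that alone is not confirmed to trigger the loop.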

By: Jonathan Rose (jrose) 2011-09-28 12:54:18.404-0500

Hmmm.  That's going to be a tough nut to crack with the issue taking so long to occur and under such a bevy of odd conditions.  So is that to say that after you force Asterisk to a close that you can then restart Asterisk, do another queue show, and then it will be working normally again?  That alone would suggest that Asterisk is more to blame than the database, though I would imagine the process should be the same regardless of all of that.

By: Cristian Dimache (cristiandimache) 2011-09-28 13:45:10.832-0500

That's exactly what I mean: a restart of Asterisk would allow me to do a "queue show", but after a while it would enter the infinite loop. A force restart would allow the viewing of the queues...
Anyway, in SVN 332561 this appears to be solved: same DB, same server, only the Asterisk version has changed:

voip-1*CLI> core show uptime
System uptime: 2 weeks, 6 hours, 44 minutes, 42 seconds
Last reload: 2 weeks, 6 hours, 44 minutes, 42 seconds
voip-1*CLI> queue show
q1206 has 0 calls (max unlimited) in 'fewestcalls' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0% within 0s
[... the complete output of the queues is displayed, yepee! ...]

I guess the problem was in res_config_pgsql.c. Looking over a diff between 1.8.4 and r332561 the only relevant change I see is a call to PQclear as documented in ASTERISK-17812 - could this be the culprit for this bug?

By: Cristian Dimache (cristiandimache) 2011-09-28 13:49:25.354-0500

I guess this can be closed - I cannot reproduce it in r332561

By: Jonathan Rose (jrose) 2011-09-29 09:06:56.593-0500

Alright, thanks.  If the issue resurfaces and you need to make a new issue, please reference this one when you do.