
Summary: ASTERISK-11802: AddQueueMember and RemoveQueueMember randomly lock up asterisk.
Reporter: Guillaume Giraudon (ggiraudon)
Labels:
Date Opened: 2008-04-08 14:40:22
Date Closed: 2011-06-07 14:03:20
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Applications/app_queue
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) 12388-agentdeadlock.diff
(1) locks.txt
(2) threads.txt
Description: When using the AddQueueMember and RemoveQueueMember functions, asterisk will randomly lock up. Calls might still go through but "core show channels" and some other CLI functions will not return any output.

I have rebuilt asterisk with DEBUG_CHANNEL_LOCKS and DEBUG_THREADS and have a lock and thread dump of when this happens.

This issue has been noticed since 1.4.13 and identified in 1.4.17. It is, however, still happening in asterisk 1.4.19.

Comments:

By: Guillaume Giraudon (ggiraudon) 2008-04-08 14:42:36

I apologize for the lack of description. My browser seems to have dropped it.
Here is my original problem description:

When using the AddQueueMember and RemoveQueueMember functions, asterisk will randomly lock up. Calls might still go through but "core show channels" and some other CLI functions will not return any output.

I have rebuilt asterisk with DEBUG_CHANNEL_LOCKS and DEBUG_THREADS and have a lock and thread dump of when this happens.

This issue has been noticed since 1.4.13 and identified in 1.4.17. It is, however, still happening in asterisk 1.4.19.

By: Guillaume Giraudon (ggiraudon) 2008-04-08 14:44:12

I am attaching the logs I have taken from the locks and threads to this case. I hope this is the proper way to do things :-)

By: Jason Parker (jparker) 2008-04-08 15:19:05

I just attached a patch for one spot I saw that was in the path of the trace you provided.  While this isn't by any means a "magic bullet", it should help at least a little bit.  Please do try it and report back.

By: Guillaume Giraudon (ggiraudon) 2008-04-08 15:27:53

Thank you kindly, qwell. I will apply the patch and rebuild. Please allow me 24 hours or so to see whether the problem recurs, as it remains hard to reproduce.

By: Guillaume Giraudon (ggiraudon) 2008-04-09 12:57:09

The problem has just occurred again, this time on the use of the AgentCallBackLogin() application.
Unfortunately, I don't have DEBUG_CHANNEL_LOCKS on that box and was therefore not able to pull a trace on it.
Strangely enough, I have 2 boxes running identical queue configurations: one is running 1.4.19 (the one that locks up) and another one is running 1.4.16.2.
The one running 1.4.16.2 does not seem to suffer from this issue.
I am looking through the differences in chan_agent.c between 1.4.16.2 and 1.4.19 to see if there might be some obvious reason (I'm no expert though).

By: Mark Michelson (mmichelson) 2008-04-09 13:47:11

ggiraudon:

DEBUG_CHANNEL_LOCKS is not necessary to get a "core show locks" output, you just need DEBUG_THREADS enabled.

The "core show locks" output you have provided so far doesn't clearly show which locks are in contention, but it would appear to hinge on the lock on the agent list and a channel lock held in chan_local. There have been a few changes, both to chan_local and to the code which renders the "core show locks" output, between 1.4.17 and 1.4.19. I realize that the problem is still occurring in 1.4.19, but if you could get a "core show locks" output from that version, it could be more helpful in diagnosing the problem. Also, when the deadlock occurs, a backtrace taken with "thread apply all bt full" inside gdb would be helpful. I'm interested in seeing the code being followed in some of the threads when this happens.

Thanks for your help.
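
For readers unfamiliar with this kind of contention, here is a minimal Python sketch of two threads that need the same pair of locks in opposite order. It is not Asterisk code: the names `agent_list_lock`, `channel_lock`, `acquire_both`, `worker` and `run_demo` are hypothetical stand-ins chosen for illustration. If each thread simply blocked on its second lock, the two could wait on each other forever; the sketch instead tries the second lock without blocking and backs off on failure, which is one common way to avoid that outcome.

```python
import threading
import time

# Hypothetical stand-ins for the two locks discussed above (not Asterisk's real objects).
agent_list_lock = threading.Lock()
channel_lock = threading.Lock()


def acquire_both(first, second, retry_delay=0.001):
    """Hold `first`, then try `second` without blocking; release and retry on failure."""
    while True:
        first.acquire()
        if second.acquire(blocking=False):
            return
        # Another thread holds `second`: give up `first` so that thread can proceed.
        first.release()
        time.sleep(retry_delay)


def worker(first, second, results, name):
    acquire_both(first, second)
    try:
        results.append(name)
    finally:
        second.release()
        first.release()


def run_demo():
    """Run two workers that take the locks in opposite order."""
    results = []
    threads = [
        threading.Thread(target=worker, daemon=True,
                         args=(agent_list_lock, channel_lock, results, "agent-first")),
        threading.Thread(target=worker, daemon=True,
                         args=(channel_lock, agent_list_lock, results, "channel-first")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    finished = all(not t.is_alive() for t in threads)
    return sorted(results), finished


if __name__ == "__main__":
    print(run_demo())
```

Replacing the non-blocking attempt with a plain `second.acquire()` turns this into the classic lock-order inversion, where both threads can end up stuck holding one lock each.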

By: Guillaume Giraudon (ggiraudon) 2008-04-09 13:58:35

I will rebuild with DEBUG_THREADS and return as soon as the issue reproduces itself.

By: Guillaume Giraudon (ggiraudon) 2008-04-09 14:10:55

Unfortunately, my platforms currently experiencing these problems use MySQL and realtime quite extensively... and it looks like when I try to build asterisk 1.4.19 with DEBUG_THREADS turned on, the MySQL realtime engine no longer loads.
I've tried rebuilding asterisk-addons-1.4.6 after rebuilding asterisk but it didn't seem to solve the issue and I had to go back to building asterisk without DEBUG_THREADS.

Is there a way to have both Realtime and DEBUG_THREADS on, or are they mutually exclusive?

By: Guillaume Giraudon (ggiraudon) 2008-04-09 14:17:24

Addendum idea: is there a way to "kill" a specific "thread" on a running asterisk without necessarily taking the whole system down?
I am not sure my question makes much sense as formulated, but I was trying to think of a way to forcefully tear down the call causing the deadlock or freeze... although I am assuming that this would cause an issue, since all locks that specific thread acquired would still be in place.

By: Mark Michelson (mmichelson) 2008-04-09 15:05:18

DEBUG_THREADS should have no effect on whether res_config_mysql loads properly. When does the failure happen? Could you paste the console output (if there is any) when that happens?

As far as your idea of killing a specific thread, I guess you could do that if your OS treats threads as separate processes, but I'm not sure how you would figure out which thread to kill. I would not recommend doing it.
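
The concern about leftover locks can be shown with a small Python sketch. It is an illustration only, not Asterisk code: Python cannot kill a thread, so the hypothetical `dies_holding_lock` function simply returns without releasing, which stands in for a thread that was torn down while holding a lock. The name `run_demo` is likewise an assumption made for this example.

```python
import threading


def run_demo():
    """Show that a lock taken by a thread that ends without releasing stays held."""
    lock = threading.Lock()

    def dies_holding_lock():
        # Acquire and return without releasing, simulating a thread that was
        # terminated in the middle of its critical section.
        lock.acquire()

    t = threading.Thread(target=dies_holding_lock)
    t.start()
    t.join()

    still_locked = lock.locked()
    # Any other thread that needs the lock now waits until its timeout expires.
    acquired_later = lock.acquire(timeout=0.1)
    return still_locked, acquired_later


if __name__ == "__main__":
    print(run_demo())
```

The lock remains held after the thread is gone, so every later caller blocks on it. Removing the one stuck thread would therefore not by itself clear the lockup.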

By: Mark Michelson (mmichelson) 2008-04-09 15:16:22

I have a feeling that res_config_mysql isn't loading when you change to enable DEBUG_THREADS because the appropriate clean operations have not been done. To ensure that everything is clean, do a `make distclean` in both the asterisk and addons directories, then `make menuselect` in the asterisk directory and select DEBUG_THREADS (and DONT_OPTIMIZE so you can get a backtrace when the lockup happens). Then compile asterisk and addons.

See if that solves the issue of mysql not loading properly.

By: Guillaume Giraudon (ggiraudon) 2008-04-10 01:34:53

You are correct, make clean apparently was not enough, but make distclean did the trick.
Unfortunately, another lockup on 1.4.19, with app_voicemail this time, didn't give us a chance to deploy a DEBUG_THREADS-enabled version, as we had to revert those 2 systems to a previous version of asterisk.
We still have 2 more running 1.4.19 with DEBUG_THREADS turned on, but they are hardly as active as the other ones, so it might take a little while before we can reproduce this issue again.

By: Jason Parker (jparker) 2008-05-01 15:01:11

Any updates here?

By: Tilghman Lesher (tilghman) 2008-06-04 13:25:20

ggiraudon: are you able to provide the requested debugging information?

By: Tilghman Lesher (tilghman) 2008-06-19 17:31:50

Suspended due to lack of response.  If you are able to provide the needed debug information, please contact a bug marshal on irc.freenode.net in #asterisk-bugs to help you in reopening this issue.