
Summary: ASTERISK-12070: reload stops all action on CLI
Reporter: Mark Hamilton (yourname)
Labels:
Date Opened: 2008-05-22 23:02:38
Date Closed: 2011-06-07 14:00:39
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Applications/app_cdr
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) backtrace_4_20_1_yourname_runningast_and_coreshowlocks.log
( 1) backtrace_yourname_latest.log
( 2) backtrace_yourname_runningast.txt
( 3) backtrace.txt
( 4) backtrace1n2_yourname_core_w_docrash.txt
( 5) bt_12709.txt
( 6) core_yourname_28th.log
( 7) core2_yourname_28th.log
( 8) coreshowlocks.log
( 9) coreshowlocks20.1.txt
Description: Mailing list: http://lists.digium.com/pipermail/asterisk-users/2008-May/212281.html

After making a few changes in queues.conf, I tried 'reload app_queue.so' and nothing happened on the CLI. So I tried a plain 'reload', and again nothing happened. Something similar happened to me long ago, so I knew something bad was about to happen: no calls able to come in or go out, agents unable to log in or out, etc.

These commands go through and NOTHING happens on the CLI. Even 'stop now' didn't work. So I Ctrl+C'd out of the CLI and ran 'service asterisk stop'. It seemed to work, but the shell kept printing "Asterisk died with code..."

So I attempted to stop Asterisk again, and also tried 'killall asterisk'. Finally it all worked, and then I started it back up with safe_asterisk. I've been scared to reload ever since.
Comments:

By: Sherwood McGowan (rushowr) 2008-05-23 06:04:40

I will submit a backtrace shortly if I can. I experience this same sort of pseudo-lockup sometimes when I tell Asterisk to 'restart now' or 'stop now'. It only occurs during loaded periods, and typically exiting and reconnecting the console resolves the issue, but not always.

By: Tilghman Lesher (tilghman) 2008-05-23 07:23:38

You're going to need to submit a backtrace from an unoptimized build, as detailed in doc/backtrace.txt.  These optimized backtraces are next to useless for debugging purposes.
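
(For reference, the general procedure from doc/backtrace.txt looks roughly like the sketch below; the menuselect entry location and the core file path are assumptions for illustration, not taken from this report.)

{quote}
# Rebuild without optimization so gdb gets usable symbols
make menuselect            # enable DONT_OPTIMIZE under Compiler Flags
make && make install

# After a crash, pull a full backtrace from the core file
gdb /usr/sbin/asterisk /tmp/core.example    # core file path is hypothetical
(gdb) bt full
(gdb) thread apply all bt
{/quote}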

By: Mark Hamilton (yourname) 2008-05-23 09:04:17

Thanks rushowr!
I just realized that the core dump happened with an unoptimized build. I recompiled so it's not optimized, but now I need to try to reproduce it.

By: Mark Hamilton (yourname) 2008-05-23 10:35:50

The backtrace I submitted yesterday was of a core created by an unoptimized build. Shortly after submitting, I recompiled with DONT_OPTIMIZE and deadlock detection (DETECT_DEADLOCKS). Today, I've seen deadlocks like crazy all over the CLI!

I ran a backtrace against Asterisk while it was running, since it didn't crash and there was no core dump. (I noticed DO_CRASH was not enabled, but I will rebuild with DO_CRASH to get you another backtrace with a core this time.)
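
(A sketch of one common way to grab a backtrace from the live process, assuming gdb is installed; the binary path and the pid lookup via pidof are assumptions for illustration.)

{quote}
# Attach gdb to the running Asterisk, dump all thread backtraces, then detach
# so the process keeps running
gdb /usr/sbin/asterisk $(pidof asterisk)
(gdb) thread apply all bt
(gdb) detach
(gdb) quit
{/quote}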

By: Mark Hamilton (yourname) 2008-05-23 11:01:33

Alright. Multiple segfaults/core dumps occurred as soon as I started safe_asterisk after rebuilding with DO_CRASH. I've attached gdb output for at least 2 core dumps in a single file. Uploading now.

By: Mark Hamilton (yourname) 2008-05-23 14:12:53

Any action on this at all? This is a production server and these deadlocks are killing us! :(

By: Tilghman Lesher (tilghman) 2008-05-23 14:24:03

Yourname: Let's try getting the output of "core show locks".  You'll need to recompile with DEBUG_THREADS.  Also, please turn off DO_CRASH.  Once you have a deadlock situation, run "core show locks" and capture the output of that command before you restart.
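
(A minimal sketch of capturing that output, assuming DEBUG_THREADS was enabled under Compiler Flags in menuselect; the output file path is just an example.)

{quote}
# Run a single CLI command against the running Asterisk and save the output
asterisk -rx "core show locks" > /tmp/coreshowlocks.txt
{/quote}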

By: Mark Hamilton (yourname) 2008-05-23 14:50:33

core show locks uploaded.

By: Mark Hamilton (yourname) 2008-05-23 18:30:41

Alright, I updated to 1.4.20.1, and still no go. I still see the deadlocks. However, so far there hasn't been a core dump.

Uploading a file containing a backtrace of Asterisk while running, along with the output of core show locks. All in one.

By: Mark Hamilton (yourname) 2008-05-23 18:33:28

Now, reload seems to work; however, I see a LOT of deadlocks on the CLI. There is one other error that shows up as well:

WARNING[6729]: app_queue.c:3014 try_calling: The device state of this queue member, Agent/216, is still 'Not in Use' when it probably should not be! Please check UPGRADE.txt for correct configuration settings.

By: Mark Hamilton (yourname) 2008-05-24 09:24:52

Any news guys?

By: Sherwood McGowan (rushowr) 2008-05-24 10:38:09

Give 'em a little time to get someone on it, although it should at least be out of feedback status by now.

I'd suggest compiling back to the non-debugging version as I'd say you've done a nice job of posting debug info.

I've recompiled my current server for debugging, and on Monday I will put that version into use so I can submit debugging information from a system showing the same behavior that does NOT have queues/agents in use. I get the same symptom when doing 'stop now', 'restart when convenient', 'restart now', etc.

Thanks for posting about this on the lists and posting this bug, it reminded me that I needed to contribute my experiences as well.

By: Mark Hamilton (yourname) 2008-05-24 11:36:30

rushowr, you're welcome. The more, the merrier!
I hope this gets resolved though.

By: Mark Hamilton (yourname) 2008-05-24 20:27:22

I'll need to add a couple more queues, and more voicemail boxes will need to be set up, etc. I don't know whether these deadlocks will cause trouble for me or not; should I go ahead?

By: Mark Hamilton (yourname) 2008-05-25 16:23:48

The latest core dump. I did reboots, although I know that doesn't help, and even cleared /src and recompiled from scratch. Still no go.

By: Mark Hamilton (yourname) 2008-05-26 19:24:01

Just a note: this bug started on 1.4.18, and after all the later updates to new versions, deleting and rebuilding, and rebooting, it still exists in 1.4.20.1 (the latest).



By: Mark Hamilton (yourname) 2008-05-27 17:33:04

Guys, any updates on this please?

By: Mark Hamilton (yourname) 2008-05-28 13:01:19

New cores. Asterisk then started, but SIP registrations weren't working, and nothing else was working after it came back up from dumping the cores.

I had to go through all of this to actually get Asterisk to stop:

service asterisk stop (about 5 times)
killall asterisk (3 times)
kill -9 <asterisk pid> (2 times)

Then I started it with safe_asterisk, and SIP registrations were still timing out.
{quote}
[root@queues ~]# service asterisk stop
Shutting down asterisk: /usr/sbin/safe_asterisk: line 117: 12468 Killed                  nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} >&/dev/${TTY} </dev/${TTY}
Asterisk ended with exit status 137
Asterisk exited on signal 9.
Automatically restarting Asterisk.
                                                          [  OK  ]
{/quote}


Then I finally rebooted. SIP registrations still didn't work, and neither did anything else. Five minutes later, it all worked.

By: Mark Hamilton (yourname) 2008-05-28 13:11:13

CLI showing locks like no tomorrow.
Pastebinned for your viewing pleasure:
http://pastebin.ca/1032346

By: Michiel van Baak (mvanbaak) 2008-05-28 13:34:53

This is a backtrace from my system, which core dumps on 'reload'.

By: Mark Michelson (mmichelson) 2008-05-28 13:39:49

mvanbaak: your crash is happening most likely due to some screwy reference counting in res_odbc. Since the reference counting is not in 1.4, this is almost assuredly not related to the issue initially reported by Yourname, even though the problem manifests itself the same way. I'd recommend opening a separate bug report and assigning the issue to Corydon76 since he'd be the most likely candidate for fixing the issue.

By: Michiel van Baak (mvanbaak) 2008-05-29 00:41:02

Please ignore my backtrace.
After an svn up I'm not able to reproduce it anymore.

By: Mark Michelson (mmichelson) 2008-05-30 10:15:15

The initial problem you reported sounds like a deadlock to me. The core show locks output when this happens will probably be most useful. I realize that you've already submitted a core show locks from 1.4.18, but considering that the output of the command has been modified to be more specific and not give as many potentially spurious pieces of information, it would be very helpful if I saw this output from 1.4.20.1 since it sounds like that's what you're using now.

From what's been gathered here so far, it appears that the problem is due to the horrendous locking involved in chan_agent. I'll work on trying to reproduce the issue locally so that I don't have to rely on your locking up your production phone system in order to get information.

By: Mark Hamilton (yourname) 2008-05-30 10:24:00

Core show locks updated.

By: Mark Michelson (mmichelson) 2008-05-30 10:25:58

Yes, this is most definitely due to the screwy locking in chan_agent. I will work on a fix for it.

By: Mark Michelson (mmichelson) 2008-05-30 10:27:54

Actually, was the output that you just provided gathered when the deadlock occurred?

By: Mark Hamilton (yourname) 2008-05-30 10:30:35

Yes, although I have DETECT_DEADLOCKS turned off so it doesn't pollute the CLI, I am positive that there are locks going on even though I don't see them.

It's been like this since the day I reported the bug: there are locks at all times, except when there's no traffic whatsoever. Right now it has some traffic on it.

By: Mark Michelson (mmichelson) 2008-05-30 10:42:33

Okay, I think there may be some miscommunication going on here. Any time you use chan_agent, there's going to be a lot of locking that goes on. If you have DETECT_DEADLOCKS turned on, you noticed that the CLI reported tons of potential deadlocks. This happens simply when an agent is on the phone, and unfortunately, this is actually by design.

When I referred to the deadlock reported, I meant the problem where you attempted to reload app_queue and the CLI seemed completely unresponsive. When this happens again, then that's when I need to see the core show locks output. What you've given me appears to be the state of things when calls are passing normally to agents. If I've not analyzed this correctly, please let me know.

By: Mark Hamilton (yourname) 2008-05-30 10:50:26

Ok, you're right. In that case, the last two cores (gdb) I attached are the ones you should look at. They happened on 1.4.20.1.

Regarding core show locks: since it crashes, I don't have a way to get onto the CLI to run core show locks. Also, I can't tell whether the locks are normal locks or "deadlocks", because on the CLI, when I had DETECT_DEADLOCKS on, 10 out of 10 lock lines that showed up had "Deadlock" written in them. I think I captured that in the pastebin as well.

By: Mark Michelson (mmichelson) 2008-05-30 12:27:45

Once again, I think we're having some communications failures here. In your original bug report, you made no mention of any crashes occurring. The last two core dumps you provided were taken from a system with DO_CRASH enabled in menuselect, which will cause crashes to occur when certain error conditions arise. Those crashes most likely have nothing at all to do with the original deadlock you reported. I'm not sure whose advice you were following by enabling DO_CRASH, but in my view it doesn't help at all in resolving this issue and should never be enabled on a production box. I apologize to you for whoever told you to enable that.

So let's get back to square one here. You reported that when attempting to reload app_queue, the CLI was unresponsive and calls could not be completed. When this happens, this is when I need to see core show locks output, because that is when the failure has occurred and that's when the system is in a state where the output of that command is useful. If the CLI is not responsive to commands, you can attempt to connect to asterisk from another console with asterisk -r. This sometimes will work in getting CLI responsiveness again.

If there have also been crashes occurring (not counting the ones that happened when DO_CRASH was enabled) then backtraces from those crashes will be helpful too. From what I see, every backtrace provided so far has been from a crash point caused by DO_CRASH.

With regards to the deadlocks reported on the CLI, you shouldn't worry about those unless there is a noticeable problem (i.e. calls can't complete, CPU % spikes, etc.). The deadlock detection works in Asterisk by seeing if it's taking an excessive amount of time to acquire a lock. In some cases, it may really just need that long. In the case of chan_agent, however, you'll see deadlock messages all over the CLI when agents are on the phone because there's actually a deadlock built into the channel driver on purpose. This deadlock clears itself when an agent hangs up.

By: Mark Hamilton (yourname) 2008-05-30 13:45:37

Gotcha.
Let me address your second paragraph.
I tried running core show locks way back when this actually happened; nothing worked. I also opened a new CLI, and that didn't help either.

The first 2 backtraces I uploaded were from a build without DO_CRASH enabled, so those should be helpful to you. However, they were on 1.4.18.
Then someone told me to enable DO_CRASH, and that's the 3rd backtrace I submitted. Then Corydon asked for core show locks, and that's the 4th file.

Finally, I tried updating, and after coming to 1.4.20.1 I still noticed the deadlocks. (Now I know the reasons.) I don't remember whether DO_CRASH was still enabled.

...and, as they say, the rest is history.
Summary: the first two files should show you some good info, but they were on 1.4.18.1.

By: Sherwood McGowan (rushowr) 2008-05-30 14:46:46

The failure of the CLI to respond to commands after a certain command has been entered is not specific to reload and/or queues. I get CLI lockups after issuing Reload/Restart Now/Stop Now and even the Originate command.

By: Mark Michelson (mmichelson) 2008-06-03 15:36:13

I've looked through each of the backtraces and core show locks outputs provided and unfortunately, none appear to actually give the necessary information, and I *still* think this is due to the fact that I'm not communicating well enough when to get the information.

I realize that Yourname said that when the lockup occurs, he cannot get core show locks to work. Has anyone else who is monitoring this issue been able to get it? The only core show locks output that's here right now was taken when Asterisk was not deadlocked, so it's not helpful.

By: Tilghman Lesher (tilghman) 2008-06-03 15:45:46

Just a thought, but when this lockup occurs, don't run 'core show locks' from the current console.  Instead, connect a new instance of 'asterisk -rc' and ONLY run 'core show locks' from that new console, and see if you can get output that way.

I suspect what may be happening is that the current console is not responding properly, but a new console may (as long as you don't run another command first).
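
(A minimal sketch of that approach; the hostname in the prompt is taken from the console output quoted earlier and is illustrative only.)

{quote}
# From a separate shell, attach a fresh remote console and run only this one command
asterisk -rc
queues*CLI> core show locks
{/quote}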

By: Mark Hamilton (yourname) 2008-06-04 10:01:24

I guess I'll have to wait for the situation to occur again.
Meanwhile, maybe someone else can submit a backtrace?

By: Tilghman Lesher (tilghman) 2008-06-04 10:08:46

Since this is a deadlock and not a crash, another bt is unlikely to be helpful.

By: Mark Hamilton (yourname) 2008-06-04 10:14:23

I guess it will have to be something that can help fix the deadlock. core show locks?

By: Leif Madsen (lmadsen) 2008-10-07 13:46:21

Is this still an issue in the latest Asterisk? Anything we can do to move this forward? The issue seems to be fairly close to being solved...

By: Leif Madsen (lmadsen) 2008-11-20 17:03:45.000-0600

Closing this issue for now. When someone can provide some information that will help resolve this issue, please find a bug marshall on IRC to determine whether this should be reopened. Thanks!