[Home]

Summary:ASTERISK-10613: SIP channel stops processing calls, but no apparent deadlock
Reporter:callguy (callguy)Labels:
Date Opened:2007-10-24 15:58:09Date Closed:2008-01-07 10:47:58.000-0600
Priority:BlockerRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) bt-11122007.txt
( 1) chan_sip.c.callguy.diff
( 2) chan_sip.c.diff
( 3) chan_sip.c.diff_enst071115
( 4) cli_crash_trunk_92202_licedey_2007_12_12.txt
( 5) core-show-locks_20071211_dklima.txt
( 6) coreshowlocks-10242007.txt
( 7) core-show-locks2.log
( 8) deadlock_gdb_bt-full_thread-apply-all-bt-full.txt
( 9) deadlock-asterisk1.4.15.txt
(10) deadlock-festr-2
(11) gdb_dump.17676.txt
(12) gdb_dump-callguy.10894.txt
(13) gdb_dump-dklima.13728.txt
(14) gdb_dump-dklima.22465.txt
(15) gdb-debug-output.txt
(16) gdb-thread-bt-festr-1
(17) infothread.txt
(18) licedey_deadlock_trunk_92202_log_2007_12_13.zip
(19) lock.12272007-2
(20) lockrep.patch
(21) locks_1_4_16_1.txt
(22) locks.12202007.txt
(23) locks-12262007.rtf
(24) locks-12262007-2.rtf
(25) locks-12272007.rtf
(26) locks-20071228.txt
(27) locks-20071228-2.txt
(28) thread-apply-all-bt.txt
(29) trunk92736_2007_12_14_lock_log.zip
Description:Approximately once per week we are seeing asterisk stop processing SIP calls. The behavior is the same as a deadlock, but core show locks does not show any evidence that there is a deadlock.

The only way to resolve is to restart asterisk.

output of:
core show locks
info thread
thread apply all bt

from the running process is attached.
Comments:By: callguy (callguy) 2007-10-24 16:04:39

One additional piece of information - we are running 1.4.13, but have added the patch from bug 10571 (DEBUG_THREADS not being thread safe) to this, so lock.h includes that code.

By: Dmitry Andrianov (dimas) 2007-10-25 08:42:54

I just had similar issue with one of my servers running SVN-branch-1.4-r85720M

Symptoms were:
1. "analog" phones connected over SIP ATAs were not producing any dialtone.
2. there were no output on the console
3. Even after I executed "sip set debug" no messages started appearing on the console although tcpdump was showing that SIP packets arrive to the host.

'restart gracefuly' restored normal work. (And 'sip set debug' started showing SIP packets on the console)

I havn't collected any evidence (like console output, tcpdump, whatever) because I was in hury to fix everything - it was prodaction machine. Because of this I realize this particular report does not worth much. I just wanted to confim that problem really exists. (Although I'm not sure if I and callguy have exactly the same problem)

By: Russell Bryant (russell) 2007-10-25 09:45:38

Can you get this same information with DONT_OPTIMIZE turned on?  The backtrace for the important thread (the SIP connection monitor thread) is all messed up ...

By: callguy (callguy) 2007-10-25 10:33:46

russell: all of the data here is from a server with DONT_OPTIMIZE and DEBUG_THREADS both enabled. I suspect you are talking about thread 69, which I'd agree looks strange. Is there anything else we can do to get you more information?

we did a lot of testing on the fix for DEBUG_THREADS not being thread safe, I'm almost wondering if something similar is going on here, though from what information i could glean it appears to be in a different part of the code.

By: callguy (callguy) 2007-10-29 11:10:28

We just experienced the same issue again. One item to note in this case, it did not happen while the server was under any particular amount of load, though we still don't know what sequence reproduces it.

For the GDB output if you look at the details for thread ASTERISK-13 it looks identical to the errant thread from our previous output.

By: Denis Galvao (denisgalvao) 2007-10-29 11:23:04

russel: Does the patch included on ASTERISK-10177 solve this problem? Is there any relationship with that issue?

By: callguy (callguy) 2007-10-29 11:29:53

that patch doesn't solve this problem. We are running 1.4.13 with the patch from 10571. I am going to try and look into it a little bit more deeply, I have a feeling that fixing the issue in 10571 may have revealed this one.

By: Denis Galvao (denisgalvao) 2007-10-29 11:39:23

Im facing the same exactly problem. SIP stops to work. I have some SIP queue agents that doesn't get calls that are waiting on a queue. All other resource remais available, but not SIP.

Im gonna try the 1.4 today branches, any chance to get it solved?

By: Russell Bryant (russell) 2007-10-30 08:04:57

Can you try running the latest code in the 1.4 branch?  I think this may have already been fixed.

By: Denis Galvao (denisgalvao) 2007-10-30 08:15:50

Unfortunely I can't test the last code from branches, because Im using an R2 dsp based Asterisk card that the channel does not compile with the new code, I've contacted the developers to get it running on branches asap.

By: callguy (callguy) 2007-10-30 18:33:16

russell: we'll test out of the latest code, but I have a question. I was looking into this a bit more deeply, it looks like the commit related to bug 11044 may be relevant to this one (mutex initialization) - is this what you were thinking of or something else? it seems there were a few different items of consequence with respect to that.

By: Russell Bryant (russell) 2007-10-31 13:25:28

Yeah, that is the change that I had in mind.

By: callguy (callguy) 2007-11-05 11:54:54.000-0600

russell: it looks like the fix in 11044 has most likely resolved this. We were seeing the issue approximately once a day before running with this patch, it's been about a week since without recurrence. I'm going to let it run for a few more days and upgrade some other servers that have also experienced so I can give a bit more of a conclusive answer - but it looks good at this point.

By: Denis Galvao (denisgalvao) 2007-11-05 12:18:43.000-0600

Hi. Im playing with the branches revision 88471. This version is runnig for about 4 hours, so I dont have nothing to say about the SIP problem till now. But I expirienced some new problem:

When doing "soft hangup" TAB TAB, the CLI stop working... I cannot use it anymore, but the calls are being proccessed normally.

By: Russell Bryant (russell) 2007-11-05 23:49:48.000-0600

callguy: Great!  Certainly let me know if it happens again.

denisgalvao: I have marked this as assigned to me and will look into the tab completion issue tomorrow.

By: callguy (callguy) 2007-11-06 09:25:39.000-0600

russell: one additional note. I opened 11165 yesterday, this looked like a different problem but may be related. This was an actual deadlock confirmed by the output of core show locks (which isn't the case in this bug). i have the output of thead apply all bt as well if you'd like to see it.

By: Russell Bryant (russell) 2007-11-06 12:56:06.000-0600

I justed tried it and channel tab completion for the "soft hangup" CLI command is working just fine for me ...

By: callguy (callguy) 2007-11-09 08:34:00.000-0600

russell: we just had this happen again, against 1.4 r87739, so i don't believe that the issue is completely resolved. I captured a thread apply all bt, which looks fine with the exception of this thread:

Thread 66 (Thread 3767216 (LWP 31690)):
#0  0x00bd57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00dd41de in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2  0x00dd0e3b in _L_mutex_lock_35 () from /lib/tls/libpthread.so.0
#3  0x00395d88 in ?? ()
#4  0x00509e68 in ?? () from /usr/lib/asterisk/modules/chan_sip.so
ASTERISK-1  0xb5e51fa0 in ?? ()
ASTERISK-2  0x003960e0 in ?? ()
ASTERISK-3  0x00395d58 in ?? ()
ASTERISK-4  0x004abf6c in __ast_pthread_mutex_lock (filename=0xb5e51fa0 "\002", lineno=5218371,
   func=0x853 <Address 0x853 out of bounds>, mutex_name=0x4fadc7 "__sip_ack", t=0x4fadd1)
   at /usr/src/asterisk-trunk/1.4/asterisk-1.4/include/asterisk/lock.h:356
ASTERISK-5  0x004abf6c in __ast_pthread_mutex_lock (filename=0x4fa043 "chan_sip.c", lineno=2131, func=0x4fadc7 "__sip_ack",
   mutex_name=0x4fadd1 "&p->lock", t=0xb5e51fa0)
   at /usr/src/asterisk-trunk/1.4/asterisk-1.4/include/asterisk/lock.h:356
ASTERISK-6 0x004acdc1 in __sip_ack (p=0xb5e51fa0, seqno=102, resp=0, sipmethod=3) at chan_sip.c:2131
ASTERISK-7 0x004e022d in handle_response (p=0xb5e51fa0, resp=200, rest=0x396308 "OK", req=0x3960e0, ignore=0, seqno=102)
   at chan_sip.c:12398
ASTERISK-8 0x004eb7ee in handle_request (p=0xb5e51fa0, req=0x3960e0, sin=0x3960d0, recount=0x3960bc, nounlock=0x3960c0)
   at chan_sip.c:14976
ASTERISK-9 0x004ec8c8 in sipsock_read (id=0x82085a8, fd=15, events=1, ignore=0x0) at chan_sip.c:15203
ASTERISK-10 0x080b2a13 in ast_io_wait (ioc=0x8206f80, howlong=1) at io.c:279
ASTERISK-11 0x004ed447 in do_monitor (data=0x0) at chan_sip.c:15406
ASTERISK-12 0x0810c270 in dummy_start (data=0x82106e0) at utils.c:843
ASTERISK-13 0x00dcf371 in start_thread () from /lib/tls/libpthread.so.0
ASTERISK-14 0x00cbaffe in clone () from /lib/tls/libc.so.6

The lock.h reference is in the lock debugging code, but it looks like the out of bounds variable is in _sip_ack. I'm not sure if this could be related to some of the fixes you put in around 88805 - though those seemed to all be reported as crashes and not deadlocks.

By: callguy (callguy) 2007-11-12 10:01:13.000-0600

russell: we just had another occurrence of this problem. Running on 1.4 r89088. Asterisk was in a state where it wouldn't allow console connections, but was running - so we couldn't get a core show locks. If you look at the thread apply all bt I've attached, you can see there is memory corruption in threads 6,9 & 11.

The one thing these all have in common is that they seem to be referencing the same part of the lock debugging code (it could be that it is unrelated, but I thought I'd bring it up).

Otherwise, two of them (9 & 11) seem to be in devicestate.c and related to hints, the third (6) appears to be in channel.c.

edit: output of thread apply all bt uploaded to bt-11122007.txt



By: Denis Galvao (denisgalvao) 2007-11-12 12:27:01.000-0600

callguy: I have the same problems that you mentioned here. I think I can describe what I have to let you tell me if we are facing the same problem in the same environment.

Im playing now with r89105, and it worked for about 3 days, 10 min ago I got the lock, the "*cli> show channels" give me the channels but didn't give me the active calls count at the end.

All SIP accounts stop to receive calls.

I have almost 20 queues and 150 SIP clients playing with Xlite.

Most time the system run with 30 active calls...

We use only SIP members on the queues, no logged agents here.

I have a similar customer(1.4.13), without problems. The difference between them couls be the xlite and its SIP subscribe requests... I asked my customer to disable it, but he didn't had enough time to do that.

Could this problems be related to the amount of SUBSCRIBE requests to SIP?

By: callguy (callguy) 2007-11-12 15:41:02.000-0600

denisgalvao: It's tough to say if we are seeing something similar (though the symptoms are). I have seen instances where show channels doesn't show the call count at the end and just returns to the cli prompt. That said, I've seen where it does but SIP processing has also stopped.

At least several of mine have shown to have occurred during the subscribe requests, but that doesn't seem to be exclusively related. Do you have either the output of core show locks or a backtrace from the running process?

Also - are you running with DEBUG_THREADS enabled?

By: Bing Li(enst) (enst) 2007-11-12 20:52:28.000-0600

I think it is the same as bug http://bugs.digium.com/view.php?id=11165
find_call  __sip_ack

By: Denis Galvao (denisgalvao) 2007-11-13 06:24:11.000-0600

callguy: Same here. I do get the right response from "show channels" sometimes but SIP stop working. The strange thing is that it occurs without high load or without a regular time frame.

Is this possible to have a BT running with valgrind, without DEBUG_THREADS?

The "core show locks" will give me something without DEBUG_THREADS?

Im not running with DEBUG_THREADS.

By: Denis Galvao (denisgalvao) 2007-11-14 14:47:56.000-0600

Guys, something new(at least for me) here.

We got a lock a couple minutes ago.

Looking into the log full I got a lot of "Too many open files" messages. Im not sure if it is after or before the lock. Could it be the cause for the lock?

My ulimit -n show me 1024.

lsof on asterisk thread show me a lot of more than 1024.

I just increased this number to  ulimit -n 65355.

How it sounds for you?

callguy: Could you post your ulimit -a values? Im running an Etch Debian.

By: callguy (callguy) 2007-11-14 15:33:38.000-0600

russell: We noticed  the sipsock_read() method it would not release p->lock in the case when an invalid SIP message was received and rejected. The lock was previously acquired by find_call() inside. We think adding a mutex unlock  should resolve this issue (if we're reading it correctly). It would also explain why it's so intermittent - I suspect this would happen if there was a connectivity or other issue that caused a sip message to become malformed without the channel dropping.

I also believe this is the same issue as 11165, which I also posted (they originally looked like two distinct issues, but I don't believe that's the case).

See attached diff, let me know what you think.



By: Bing Li(enst) (enst) 2007-11-14 20:16:57.000-0600

I guess the chan_sip.c.diff is not OK.
In the bug, deadloack happened in handle_response,but in bug 11165 it happened in handle_request.
the diff unlock the p->lock after handle request, So it maybe cause deadlock in handle_request.

my diff chan_sip.c.diff_enst071115

By: callguy (callguy) 2007-11-14 21:48:34.000-0600

Removing the locking from __sip_ack and placing it before where it is called from within __sip_pretend_ack is the same balanced operation. But would we need to also wrap the invocation of _sip_ack in handle_response() and handle_request(),, as these two invocations would now invoke __sip_ack() without having p->lock.

I think they issue is that the p->lock is being locked by find_call(), which has ast_get_lock(p->lock) in two conditional places, but does not unlock. the calling sipsock_read() was unlocking the p->lock in the normal return of the function, however the error return on invalid sip packet was not unlocking the p->lock.  so a subsequent call to sipsock_read() following a previous lock and error exit from invalid sip packed would still cause a deadlock there too, or the next time handle_request() or handle_response() were called, which would in turn call the __sip_ack(), which would deadlock because of the previously unbalanced unlocking of p->lock from the error exit of invalid sip packed.

see my new patch, where i remove the ast_get_lock(p->lock) from find_call() and get the lock from within sipsock_read() where we can clearly see my previous patch is required to balance the lock/ unlocking of p->lock.

By: callguy (callguy) 2007-11-19 23:53:36.000-0600

russell: if you could point us in the right direction it would be a big help.

from our initial read, it looked like the ast_mutex_locks/unlocks were being handled individually within the functions in chan_sip.c (i.e. - __sip_ack and others). this is why we couldn't quite understand why our patch wasn't helping.

Separately, we couldn't see why enst's suggested patch was helping, as moving the locks outside the function would seem to be avoiding the issue, but could lead to other potential problems.

However, after tracing the code, it appears that sipsock_read is actually getting the mutex lock, then calling handle_request, which in turn calls handle_response, and then sipsock_read unlocks when it gets a return. Our question is if p is already locked, why are the functions individually locking it again (and why does that even work)?

We believe that we're seeing this in __sip_ack more frequently than other functions because it is called the most frequently - hence giving the highest probability of being the function that encounters the deadlock.

By: Daniel Kühl Lima (dklima) 2007-11-21 07:38:14.000-0600

Updated to 1.4.14.

- 10. Compiler Flags
-- DEBUG_CHANNEL_LOCKS
-- DEBUG_THREADS
-- DETECT_DEADLOCKS
-- DONT_OPTIMIZE
-- MALLOC_DEBUG

Today I got a deadlock on asterisk, when I did 'core show locks' I got no response, the CLI was locked, to exit only pressing CTRL+C and kill asterisk with -9 signal.

By: Daniel Kühl Lima (dklima) 2007-11-22 06:26:18.000-0600

I added the full log with ERROR part...
Is there any thing I can do?
'core show locks' has no effect for me :( cause the CLI still locked showing no output.

By: callguy (callguy) 2007-11-22 06:41:52.000-0600

dklima: if the process is still alive run gdb /usr/sbin/asterisk asterisk_pid (where you replace asterisk_pid with the process id of the running asterisk instance). then upload the output of "thread apply all bt". At least it will show if you are running into issues in the same area as we are.

By: Daniel Kühl Lima (dklima) 2007-11-22 07:09:28.000-0600

Output of 'thread apply all bt' attached.

By: Daniel Kühl Lima (dklima) 2007-11-22 08:12:53.000-0600

New deadlock right now.
I uploaded the output from gdb, now with 'bt full' and 'thread apply all bt'.
I hope it can help.

By: Russell Bryant (russell) 2007-11-24 13:14:27.000-0600

dklima: I can tell from your gdb output that you are not using an unmodified version of Asterisk, so we can not support you here.

By: Daniel Kühl Lima (dklima) 2007-11-24 14:07:04.000-0600

Sorry by my misunderstood, but what do you mean by 'unmodified' ?

By: pj (pj) 2007-11-24 20:24:35.000-0600

I think, without any other patches applied.

By: callguy (callguy) 2007-11-28 15:05:02.000-0600

russell: just wanted to see if there was any update on this issue, or if there is anything we can do to help on our side.

By: Daniel Kühl Lima (dklima) 2007-12-03 07:08:57.000-0600

OK.
Deadlock again, 'bt full' and 'thread apply all bt full' attached.
Now with a 'unmodified' asterisk version.

By: Tilghman Lesher (tilghman) 2007-12-03 11:12:06.000-0600

dklima:  let me see another "core show locks" with 1.4.15.

By: Daniel Kühl Lima (dklima) 2007-12-03 11:19:54.000-0600

How could I put here the 'core show locks' if I got no access to the CLI when deadlocked?
Did you find anything on my gdb output file that could be solved on 1.4.15 ?

*No access I mean: no response for any command that I do.

By: Tilghman Lesher (tilghman) 2007-12-03 16:06:19.000-0600

There are a great number of possible deadlocks that are solved with 1.4.15.

The other debug that might help us is following the instructions in doc/valgrind.txt.



By: Denis Galvao (denisgalvao) 2007-12-04 15:10:53.000-0600

Im helping dklima with this problem. We update the asterisk to 1.4.15 and the same problem occurred.

SIP stop work, but we still see the calls going to the queues. The *cli stop work and no commands were been processed after that(even "core show locks").

This asterisk has almost 150 SIP extenions, almost 30 queues with 10 members and all of them are configured with member priorities.

This asterisk is working only with SIP, no ZAP and IAX2 at all.

I attached a "gdb thread apply all bt".

One interesting thing that should be mentioned here is that this problem can be reproduced after some traffic going on and executing *cli commands. For example after 10 minutes running asterisk(with calls traffic) we go to the cli and do a "show channels" after that, the *cli stop work and no commands are processed and no more SIP channels can be created. Then we execute the gdb things and killall asterisk to stop.

Anybody facing the same? callguy?



By: David Brillert (aragon) 2007-12-05 09:09:43.000-0600

Yes I have a similar problem.
Asterisk version 1.4.15
But not compiled with option to show core locks
However we get deadlock several times daily and can reproduce the deadlock with core show channels command

I have been watching this bug with great interest.
I don't know what I can do to help...

We are also running a busy call center with 2 PRI circuits
About 20 agents and 20 queues

Our server is running stock 1.4.15 with no patches.

By: callguy (callguy) 2007-12-05 09:45:41.000-0600

While I agree we are all seeing similar symptoms, I believe there may be more than one underlying cause (though I confess - after looking at the relevant areas in chan_sip in some detail there are a lot of things I just can't make heads or tails of without help from someone who knows more about the code  - so it could be something that is being triggered outside of the sip channel and manifesting itself in different ways).

That said, the issue we are seeing appears to pretty consistently point to handle_request, and handle_response, with something wrong with mutex locking (if you see some of my previous posts that was about as far as I could get with it).  

From the gdb output it does appear that denisgalvao/dklima's issues are being triggered by something different, though enst was seeing the same thing as we are.

In general I can reiterate the same thing as others here, which is this only appears to happen when the system are under considerable load, and then only intermittently. We see some external triggers (i.e. - core show channels, sip reload, etc.) that appear to exacerbate the situation, but again - it isn't consistently reproducible, and whenever the situation does occur the console hangs so we can't get any output from core show locks (the one exception to this is posted in bug 11165). Our gdb output consistently looks similar though.

We're more than happy to do anything we can over here to help as well.

By: Denis Galvao (denisgalvao) 2007-12-05 10:50:43.000-0600

One more thing that should be posted here:

On the same customer I have a "gateway asterisk" doing all the transcoding processes that came from the PSTN(PRI and R2).

All calls are then forward to the server with Asterisk 1.4.15 as SIP channels.

On the gateway server I dont have this problem(same traffic), so this let me think that the problem is not related to a high traffic situation, but a mix of situations, like channel transfers, SIP subscribes, queues.

Im using a IBMx3400(2 Xeon) on the 1.4.15 and a IBMx3200(1 Xeon) on the gateway.

By: Ognjen Seslija (sekil) 2007-12-05 11:08:50.000-0600

I have almost the same setup as described in note 0074825, and same problems. I have reverted from 1.4.15 to 1.4.13 to see if it helps. * stops processing SIP calls all of the sudden. I have about 150 extensions, 12 concurent calls max. and a lot of SUBSCRIBEs.

By: Russell Bryant (russell) 2007-12-05 11:11:56.000-0600

For anyone that is experiencing a deadlock, the thing that you can do to help is this:

Build your system with DEBUG_THREADS turned on.  Run "make menuselect", Compiler Flags, hit space to turn on DEBUG_THREADS, 'x' to save and quit.  Recompile, reinstall ...

Then, when this issue occurs, grab the output of the "core show locks" CLI command.

# asterisk -rx "core show locks" > showlocks.txt

By: callguy (callguy) 2007-12-05 11:15:46.000-0600

denisgalvao/sekil: our setup is similar as well, where we have gateways that feed calls to the servers that area experiencing this issue. We experienced this issue less in 1.4.9, but there have been so many other changes (crash fixes, deadlock fixes, etc.) that I think we're still better off on the more recent release.

By: callguy (callguy) 2007-12-05 11:18:06.000-0600

russell: the issue that we're all having is that when this occurs, the console locks up as well and there's no way to get any output of core show locks (or any output of any sort). That's why we've been looking at the gdb output from the running process, at least it give some clue since we can't get the lock debugging info in this case.

I've only had a single instance where I was able to capture core show locks (which  is posted to bug 11165). The backtrace on that running process shows it to be the same issue as what's posted here.

By: David Brillert (aragon) 2007-12-05 13:30:44.000-0600

I'm trying to recompile Asterisk 1.4.15 with DEBUG_THREADS turned on.
I'll post what I can once enabled.

In the meantime maybe this might help out.
It came from /var/log/asterisk/messages
The messages log is full of these errors

Dec 5 13:32:28 chan_sip.c NOTICE Unable to create/find SIP channel for this INVITE
Dec 5 13:32:28 chan_sip.c ERROR We could NOT get the channel lock for SIP/404-0b5e00a0!
Dec 5 13:32:28 chan_sip.c ERROR SIP transaction failed: 6a83288227dfc20c79e897c543858de9@

By: Russell Bryant (russell) 2007-12-05 13:32:40.000-0600

Ok, well, I would say if the problem occurs, please continue to attempt to get the "core show locks" output, as that will be more useful than anything else.  Unfortunately, the existing locks output posted here isn't helping.  I'm thinking that they may have come from a version when I still had some bugs in the lock tracking code.

Also, when it comes to deadlocks, backtraces aren't usually enough.  I will probably need to log in to a machine experiencing the problem so that I can take a look at it.

First, please make sure that Asterisk is compiled without optimizations.

Then, if it locks up, grab a core dump of the process.  You can use the script, contrib/scripts/ast_grab_core to help with that.  Then, you can restart Asterisk and I can log in and use gdb to look at the core dump.

By: Ognjen Seslija (sekil) 2007-12-06 04:02:45.000-0600

showlocks.txt: Deadlock caught in 1.4.12.1 (same problems in 1.4.13, 14, and 15). I didn't have problems until I configured two Linksys SPA962s with SPA932s console to watch same extensions. Hope this helps, I need to restart my machine hourly now.

By: Daniel Kühl Lima (dklima) 2007-12-06 06:08:39.000-0600

Using asterisk 1.4.15, deadlock again.
No output from 'core show locks', accessing asterisk CLI I got no CLI> prompt.

By: Daniel Kühl Lima (dklima) 2007-12-06 08:42:18.000-0600

[2007-12-06 10:49:17] ERROR[18032] /usr/src/1415/asterisk-1.4.15/include/asterisk/lock.h: chan_local.c line 180 (local_queue_frame): mutex '&us->lock' freed more times than we've locked!
[2007-12-06 10:49:17] ERROR[18032] /usr/src/1415/asterisk-1.4.15/include/asterisk/lock.h: chan_local.c line 180 (local_queue_frame): Error releasing mutex: Operation not permitted
[2007-12-06 10:49:17] WARNING[18032] chan_local.c: Local/3219@extensions-3cbf,1 wasn't locked while sending 12/35

By: Daniel Kühl Lima (dklima) 2007-12-06 09:24:18.000-0600

Used 'ast_grab_core'.
Result attached.



By: Russell Bryant (russell) 2007-12-06 14:45:10.000-0600

sekil: The deadlock that the output of "core show locks" shows is not possible with the current version.  So, if you're having a problem with later versions, it's not the same one.

By: callguy (callguy) 2007-12-06 14:59:35.000-0600

russell: We experienced one again where it wasn't possible to get any output from the console, and used the ast_grab_core script. I've attached the gdb_dump it generated here. I will email you separately - if you would like access to the box so you can take a look at the core file itself just reply back.

By: Ognjen Seslija (sekil) 2007-12-07 02:05:46.000-0600

I got back to 1.4.15 with DONT_OPTIMIZE and all of the relevant DEBUGs. I also patched it with the patch from rev. #91192. Here's the part of  'core show locks' from it:

=== Thread ID: 3084536720 (do_devstate_changes  started at [  378] devicestate.c ast_device_state_engine_init())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 4978 ast_channel_trylock &chan->lock 0xb751c3e8 (5)

Next few 'core show locks' that I typed, killed Asterisk completely.

Edit: deadlock again, Asterisk crashes when 'core show locks' tried.



By: Tilghman Lesher (tilghman) 2007-12-07 08:17:03.000-0600

sekill:  if Asterisk crashes when you type: "core show locks", please upload a backtrace as documented in doc/backtrace.txt.

By: Daniel Kühl Lima (dklima) 2007-12-07 09:11:04.000-0600

Yet another deadlock.
Wasn't possible to get output from the console and used the ast_grab_core script.

*Again: 'core show locks' locked up CLI...

By: Russell Bryant (russell) 2007-12-07 09:39:38.000-0600

1) Please make sure you build with DONT_OPTIMIZE and DEBUG_THREADS both enabled.

2) For a deadlock, a backtrace is not likely to be enough for me to fix the problem.  I will need to log in to the system and look through the core dump.

By: Daniel Kühl Lima (dklima) 2007-12-07 09:43:00.000-0600

Yeah, it's build with both DONT_OPTIMIZE and DEBUG_THREADS.
I can provide you access to the machine. How can I send it to you?

By: Digium Subversion (svnbot) 2007-12-07 15:08:18.000-0600

Repository: asterisk
Revision: 91826

U   branches/1.4/include/asterisk/lock.h

------------------------------------------------------------------------
r91826 | russell | 2007-12-07 15:08:17 -0600 (Fri, 07 Dec 2007) | 6 lines

I love fixing lock related errors in the lock debugging code.  That's about as
ironic as it gets in Asterisk programming land.  Anyway, I spotted this bug while
trying to track down why systems are locking up and acting weird in issue ASTERISK-10613.
The mutex attribute object was marked as static in this function when it should
not have been.

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=91826

By: Digium Subversion (svnbot) 2007-12-07 15:08:53.000-0600

Repository: asterisk
Revision: 91827

_U  trunk/
U   trunk/include/asterisk/lock.h

------------------------------------------------------------------------
r91827 | russell | 2007-12-07 15:08:53 -0600 (Fri, 07 Dec 2007) | 14 lines

Merged revisions 91826 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r91826 | russell | 2007-12-07 15:11:08 -0600 (Fri, 07 Dec 2007) | 6 lines

I love fixing lock related errors in the lock debugging code.  That's about as
ironic as it gets in Asterisk programming land.  Anyway, I spotted this bug while
trying to track down why systems are locking up and acting weird in issue ASTERISK-10613.
The mutex attribute object was marked as static in this function when it should
not have been.

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=91827

By: Digium Subversion (svnbot) 2007-12-07 15:14:33.000-0600

Repository: asterisk
Revision: 91828

U   branches/1.4/include/asterisk/lock.h

------------------------------------------------------------------------
r91828 | russell | 2007-12-07 15:14:33 -0600 (Fri, 07 Dec 2007) | 6 lines

Fix another bug in the DEBUG_THREADS code.  The ast_mutex_init() function had
the mutex attribute object marked as static.  This means that multiple threads
initializing locks at the same time could step on each other and end up with
improperly initialized locks.
(found when tracking down locking issues related to issue ASTERISK-10613)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=91828

By: Digium Subversion (svnbot) 2007-12-07 15:15:01.000-0600

Repository: asterisk
Revision: 91829

_U  trunk/
U   trunk/include/asterisk/lock.h

------------------------------------------------------------------------
r91829 | russell | 2007-12-07 15:15:00 -0600 (Fri, 07 Dec 2007) | 14 lines

Merged revisions 91828 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r91828 | russell | 2007-12-07 15:17:24 -0600 (Fri, 07 Dec 2007) | 6 lines

Fix another bug in the DEBUG_THREADS code.  The ast_mutex_init() function had
the mutex attribute object marked as static.  This means that multiple threads
initializing locks at the same time could step on each other and end up with
improperly initialized locks.
(found when tracking down locking issues related to issue ASTERISK-10613)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=91829

By: Digium Subversion (svnbot) 2007-12-07 15:21:43.000-0600

Repository: asterisk
Revision: 91830

U   branches/1.4/main/utils.c

------------------------------------------------------------------------
r91830 | russell | 2007-12-07 15:21:42 -0600 (Fri, 07 Dec 2007) | 5 lines

Make the lock protecting each thread's list of locks it currently holds
recursive.  I think that this will fix the situation where some people have
said that "core show locks" locks up the CLI.
(related to issue ASTERISK-10613)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=91830

By: Digium Subversion (svnbot) 2007-12-07 15:22:13.000-0600

Repository: asterisk
Revision: 91831

_U  trunk/
U   trunk/main/utils.c

------------------------------------------------------------------------
r91831 | russell | 2007-12-07 15:22:12 -0600 (Fri, 07 Dec 2007) | 13 lines

Merged revisions 91830 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r91830 | russell | 2007-12-07 15:24:33 -0600 (Fri, 07 Dec 2007) | 5 lines

Make the lock protecting each thread's list of locks it currently holds
recursive.  I think that this will fix the situation where some people have
said that "core show locks" locks up the CLI.
(related to issue ASTERISK-10613)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=91831

By: Russell Bryant (russell) 2007-12-07 15:24:10.000-0600

As you can see, I just fixed multiple errors related to lock handling with DEBUG_THREADS turned on.

If you are experiencing a problem, please install the latest code from the 1.4 branch of Asterisk with both DONT_OPTIMIZE and DEBUG_THREADS enabled.  Then, if your system locks up again, hopefully "core show locks" will properly return output.

$ svn co http://svn.digium.com/svn/asterisk/branches/1.4 asterisk-1.4

Thank you for your patience.

By: Ognjen Seslija (sekil) 2007-12-07 17:45:33.000-0600

russel: I did the exact thing you explained. I'm getting now following output from 'core show locks' (no * crashes any more):

=== Thread ID: 3084319632 (do_devstate_changes  started at [  378] devicestate.c ast_device_state_engine_init())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 5007 ast_channel_trylock &chan->lock 0xb786b8a8 (16)

=== Thread ID: 3076918160 (do_monitor           started at [15571] chan_sip.c restart_monitor())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 5007 ast_channel_trylock &chan->lock 0xb786b8a8 (3)

This is accumulating in time and leads to a * stop processing SIP calls. Problems began when I installed two phone sidecars configured to "watch" same extensions. I have turned off hints on one of them, but still getting deadlocks almost hourly now. Problem is hard to reproduce on a slow traffic day.
B.t.w, I have 1.2 in production on other places with tens of SNOMs and plenty of SUBSCRIBE/NOTIFYs and everything is working as a charm.

Edit: Ok, I woke up and logged in to my server and 'core show locks' again crashed *. I use svn version #91830.



By: Daniel Kühl Lima (dklima) 2007-12-10 06:42:53.000-0600

New deadlock, using Asterisk SVN-branch-1.4-r91890, 'core show locks' locked up CLI without showing any results...
Russell, the core dump is on /root/core.13728 and yes, the sources is on machine.

I've attached output generated by that script.



By: Alisher (licedey) 2007-12-11 02:16:20.000-0600

I am still experiencing this problem. Once per day. Average number of daily calls is 5500. Mostly outbound calls, 16 agents.

With Asterisk 1.4.15, the crash happened 3~4 times per day.

After installing the 92102 trunk update, the Asterisk stopped crashing, but still, sip channels get lock once a day. So I am forced to restart the asterisk.

I tried to reduce the rtptimeout and rtpholdtimeout options, but it didn't help.



By: Martin Vit (festr) 2007-12-11 03:00:02.000-0600

I'm expririencing SIP deadlock with the latest 1.4 SVN too. Unfortunately, i've no deadlock report yet.

By: callguy (callguy) 2007-12-11 06:51:00.000-0600

festr/licedey: please make sure you have asterisk compiled with DEBUG_THREADS and DONT_OPTIMIZE, and the next time you have a deadlock type "core show locks" at the asterisk command line before restarting - then post the results here.

EDIT: If "core show locks" doesn't yield any output (because your console is unresponsive) then run the ast_grab_core script located in the contrib/scripts directory in the asterisk source. This will generate a core file and a debug text file in /tmp. You can upload the text file here.



By: Daniel Kühl Lima (dklima) 2007-12-11 06:53:37.000-0600

Callguy: I'm wondering how 'core show locks' works to you and for me nothing... Lock all CLI> prompt...

By: Daniel Kühl Lima (dklima) 2007-12-11 07:41:50.000-0600

After a dealock, 'core show locks' worked!!!!!!!!!!!! =)
Attached output!



By: Tilghman Lesher (tilghman) 2007-12-11 09:15:10.000-0600

And yeow... we seem to have memory corruption in the mutex, because it's otherwise not possible for two threads to have locked the same mutex.

I see 2 different occurrences of this.



By: Ognjen Seslija (sekil) 2007-12-11 10:36:06.000-0600

I have reverted my setup to 1.2.25 and same problems occur. Not only SIP but IAX calls don't get processed.

By: Alisher (licedey) 2007-12-12 01:37:37.000-0600

to sekil - reporter
>I have 1.2 in production on other places with tens of SNOMs and plenty of >SUBSCRIBE/NOTIFYs and everything is working as a charm.

What version of 1.2 asterisk do you use in production? I tried with 1.2.25 it also had same problems.



By: Daniel Kühl Lima (dklima) 2007-12-12 05:16:41.000-0600

Sorry, but, IMHO this ticket is for 1.4.1x and branches...



By: Alisher (licedey) 2007-12-12 12:26:21.000-0600

To dklima:
 We use it in production, and every time it crashes agents lose contracts. This problem cause a lot of problem.  Do you know any previous 1.4.x or 1.2.x version that doesn't have this issue?

 Today I tried to install trunk 92424, had a deadlock again, after a couple of hours asterisk crashed one time for no reason. Then I reverted it back to trunk 92202.
 
 Tomorrow I will try to change ISP and voip provider, and see if this help.

By: Daniel Kühl Lima (dklima) 2007-12-12 12:29:33.000-0600

I think and I'm not sure that 1.4.2 and 1.4.9 doesn't have this kind of deadlock...



By: Alisher (licedey) 2007-12-12 12:38:50.000-0600

Before I tried 1.4.9, it does have deadlocks, also it crashes randomly on heavy traffic.



By: Alisher (licedey) 2007-12-12 20:52:42.000-0600

in devicestate.c line:319,354 ==> AST_LIST_LOCK(&state_changes);
if changed with ==>  ast_mutex_lock(&state_changes);

and in line 324,363: AST_LIST_UNLOCK(&state_changes);
if changed with ==> ast_mutex_unlock(&state_changes);

will it fix the bug?

Also in function *do_devstate_changes(void *data),line 350

line 354: AST_LIST_LOCK(&state_changes); is not unlocked at the end before return NULL;
What if remove line 354 from code?

static void *do_devstate_changes(void *data)
{
struct state_change *cur;

AST_LIST_LOCK(&state_changes); //line 354
              .....
              //some code
    AST_LIST_LOCK(&state_changes);
      do_state_change(cur->device);
          free(cur);
      AST_LIST_UNLOCK(&state_changes);
            .....
             // some code
return NULL;
}



By: Eliel Sardanons (eliel) 2007-12-12 21:29:28.000-0600

licedey: Could you upload another 'core show locks'?

Also regarding your questions:
- state_changes is not a mutex! so you cant just 'ast_mutex_lock(state_changes)'. You could read the Macro definition to know what is going on inside AST_LIST_LOCK(). [state_changes->lock]

- do_devstate_changes() never unlocks state_changes (before returning) because it never returns, it is unlocking the structure inside the for and locking it again to let other threads work with the structure. But stays inside the for() FORever.

Thanks

By: Alisher (licedey) 2007-12-13 00:51:03.000-0600

I have uploaded the core show locks and other log info.



By: Dmitry Andrianov (dimas) 2007-12-13 01:06:34.000-0600

There is a small bug in the lock code making Asterisk to produce flood of "Deadlock?" messages every 5 seconds. Note that it is not a bug with locking itself - only with deadlock diagnostics and reporting.

I'm attaching the patch (lockrep.patch) which should prevent asterisk from flooding exactly the same message several times a second. Could you please give it a try?

By: Alisher (licedey) 2007-12-13 02:29:24.000-0600

Today,I changed internet line, from 4M to 70M, so the network is stable now. But few minutes ago,trunk 92202,92736 and 92876 crashed for no reason, nothing in the log. Although 1.2.25 suffers from same deadlock and sip locked channels, it didn't crash same way as Asterisk 1.4.15, or 1.4 Trunk do.



By: Eliel Sardanons (eliel) 2007-12-13 07:12:25.000-0600

dimas: If you have a patch to fix other issue not related to the deadlock please report it in a new issue. I know it has something to do with deadlock reporting but it doesn't fix this issue at all, so please upload it in a new issue to make more easy to track it.

Thanks!

By: Martin Vit (festr) 2007-12-13 08:42:01.000-0600

I'm seeing also deadlocks with the latest 1.4 svn. So I've recompiled with these flags: DEBUG_CHANNEL_LOCKS, DEBUG_THREADS, DETECT_DEADLOCKS and DONT_OPTIMIZE. But now when deadlock occurs, i cannot connect to CLI (locked).

So i've attached gdb to asterisk PID and made apply all thread bt. See attached  gdb-thread-bt-festr-1 file.

edit: i've restarted asterisk and while there is no deadlock, show core locks is showing this:

--------
=== Thread ID: 3083357104 (do_devstate_changes  started at [  378] devicestate.c ast_device_state_engine_init())
=== ---> Tried and failed to get Lock #0 (channel.c): MUTEX 5045 ast_channel_trylock &chan->lock 0x83846b8 (0)
---------

is it wrong?



By: Martin Vit (festr) 2007-12-13 11:51:17.000-0600

i've catched another deadlock. source of this was call to the queue which has ring all strategy with 5 phones (some SIP and some Local channels). I'm also using BLF on several phones. I've attached 'core show locks' and gs snapshot of asterisk process 'thread apply all bt full'. See attached file deadlock-festr-2.

By: Russell Bryant (russell) 2007-12-13 14:34:01.000-0600

=== ---> Waiting for Lock #3 (chan_sip.c): MUTEX 7226 transmit_state_notify &iflock 0xb66c3080 (1)

festr: I see that line in your core show locks output.  What version of Asterisk is this from?  transmit_state_notify() does not appear to lock the iflock.  Do you have any patches applied?

By: Martin Vit (festr) 2007-12-14 05:54:08.000-0600

I'm very sorry. I did not realize, that i've applied patch to chan_sip to support SNOM BLF from this thread: http://bugs.digium.com/view.php?id=5014 -> look at the sip.snom.pu file. There is locking iflock. I though i'm using vannila sip channel :(

I'm removing this patch on my server and i will try to catch next deadlock if any.

By: Dmitry Andrianov (dimas) 2007-12-14 07:45:49.000-0600

festr, if any locking of yhr "new code" of the 5014 patch _may_ be involved in the poblem, it worth notifying authors of 5014 patch too...

By: Martin Vit (festr) 2007-12-14 08:03:16.000-0600

i've posted report to bug 5014.

By: Digium Subversion (svnbot) 2007-12-17 16:25:00.000-0600

Repository: asterisk
Revision: 93377

U   branches/1.4/main/utils.c

------------------------------------------------------------------------
r93377 | file | 2007-12-17 16:24:59 -0600 (Mon, 17 Dec 2007) | 7 lines

Do not try to access information about a lock when printing out a trylock attempt. It is possible for the lock that it references to no longer be valid. This would have caused segfaults or deadlocks.
(issue #BE-263)
(closes issue ASTERISK-10613)
Reported by: callguy
(closes issue ASTERISK-10632)
Reported by: callguy

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=93377

By: Digium Subversion (svnbot) 2007-12-17 16:26:47.000-0600

Repository: asterisk
Revision: 93378

_U  trunk/
U   trunk/main/utils.c

------------------------------------------------------------------------
r93378 | file | 2007-12-17 16:26:47 -0600 (Mon, 17 Dec 2007) | 15 lines

Merged revisions 93377 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r93377 | file | 2007-12-17 18:28:09 -0400 (Mon, 17 Dec 2007) | 7 lines

Do not try to access information about a lock when printing out a trylock attempt. It is possible for the lock that it references to no longer be valid. This would have caused segfaults or deadlocks.
(issue #BE-263)
(closes issue ASTERISK-10613)
Reported by: callguy
(closes issue ASTERISK-10632)
Reported by: callguy

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=93378

By: callguy (callguy) 2007-12-20 18:18:18.000-0600

russell: we just experienced the same issue on 1.4.16 (which included the patch for this). I was able to grab core show locks (posted here). I also ran the ast_grab_core, let me know if you'd like to login and take a look and i'll email you login info.

By: Alisher (licedey) 2007-12-21 01:30:05.000-0600

I have installed the latest release 1.4.16.1, no more crashes. But,still I got locked channels.

By: Volnikov Ivan (ivan) 2007-12-21 02:26:12.000-0600

Hi, callguy.
Is this issue heppends without DEBUG_THREADS option?

By: callguy (callguy) 2007-12-21 05:29:08.000-0600

ivan: yes, this is happening with debug_threads enabled (but we have that turned on so we can get the output of core show locks to help identify the deadlock). I know you helped make debug_threads thread safe a while back on a different bug I had opened - do you think there's something related here?



By: Digium Subversion (svnbot) 2007-12-21 10:34:34.000-0600

Repository: asterisk
Revision: 94466

U   branches/1.4/include/asterisk/pbx.h
U   branches/1.4/main/pbx.c

------------------------------------------------------------------------
r94466 | russell | 2007-12-21 10:34:32 -0600 (Fri, 21 Dec 2007) | 6 lines

Convert the contexts lock to a read/write lock to resolve a deadlock.  This
has a nice side benefit of improving performance.  :)

(closes issue ASTERISK-11090)
(closes issue ASTERISK-10613)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=94466

By: Digium Subversion (svnbot) 2007-12-21 10:39:37.000-0600

Repository: asterisk
Revision: 94467

_U  trunk/

------------------------------------------------------------------------
r94467 | russell | 2007-12-21 10:39:36 -0600 (Fri, 21 Dec 2007) | 13 lines

Blocked revisions 94466 via svnmerge

........
r94466 | russell | 2007-12-21 10:37:47 -0600 (Fri, 21 Dec 2007) | 6 lines

Convert the contexts lock to a read/write lock to resolve a deadlock.  This
has a nice side benefit of improving performance.  :)

(closes issue ASTERISK-11090)
(closes issue ASTERISK-10613)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=94467

By: callguy (callguy) 2007-12-26 16:16:57.000-0600

russell: we've experienced several deadlocks with 94769 today. it seems we can pretty reliably reproduce by doing an extensions reload (I believe that's just because it slows things down enough for some of these conditions to present themselves). Output of 2 core show locks attached (both appear similar). These do look similar to some of the previous stuff we've run into here before.

By: Volnikov Ivan (ivan) 2007-12-27 02:38:05.000-0600

callguy: I think that the problem was solved after last closing. I see, that it not so. For New Year's holidays I will have enough time to look this long-living problem. It looks like simple back-call locking with complex conditions. Tell me please, do you have some specific addons under Asterisk? If the problem repeated on our systems, it would be much easier. Anything concrete I do not promise, but I shall try to help.

By: callguy (callguy) 2007-12-27 05:58:40.000-0600

ivan: the only addon we are using is the realtime module, otherwise our build is completely stock. the conditions to trigger this did seem more random in the past - now it seems we can somewhat reliably get the system to deadlock on reload (though even that isn't totally reproducible).

By: callguy (callguy) 2007-12-27 08:15:00.000-0600

we just had another deadlock, looks similar but slightly different than the previous two. we are getting these every few hours on recent builds, with 1.4.16 it was every couple of weeks on any given machine. It appears from what I can see that manager logon/logoff at the same time as a phone changing state and hints updating (sometimes with a reload thrown in) at least appears to be one trigger.

russell - if you need access just let me know, we have the core files from each of these generated as well.

By: Russell Bryant (russell) 2007-12-27 08:19:24.000-0600

Yes, I would like to log in if you don't mind.

By: callguy (callguy) 2007-12-27 09:02:19.000-0600

russell: all set, just sent you an email with login details.

By: Digium Subversion (svnbot) 2007-12-27 09:13:31.000-0600

Repository: asterisk
Revision: 94831

U   branches/1.4/main/pbx.c

------------------------------------------------------------------------
r94831 | russell | 2007-12-27 09:13:30 -0600 (Thu, 27 Dec 2007) | 5 lines

Now that the contexts lock is a read/write lock, it should not be locked here
in ast_hint_state_changed().  This makes it get locked recursively which now
causes a deadlock.
(closes issue ASTERISK-10613, thanks to callguy for the access to a deadlocked machine)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=94831

By: Digium Subversion (svnbot) 2007-12-27 09:14:04.000-0600

Repository: asterisk
Revision: 94832

_U  trunk/

------------------------------------------------------------------------
r94832 | russell | 2007-12-27 09:14:04 -0600 (Thu, 27 Dec 2007) | 12 lines

Blocked revisions 94831 via svnmerge

........
r94831 | russell | 2007-12-27 09:16:56 -0600 (Thu, 27 Dec 2007) | 5 lines

Now that the contexts lock is a read/write lock, it should not be locked here
in ast_hint_state_changed().  This makes it get locked recursively which now
causes a deadlock.
(closes issue ASTERISK-10613, thanks to callguy for the access to a deadlocked machine)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=94832

By: callguy (callguy) 2007-12-27 22:29:16.000-0600

russell: ran for about 30m and hit another deadlock. core show locks attached. if you need access let me know - it looks similar to the one from this a.m. I ran a make distclean just to be sure there wasn't anything stale in my source and recompiled, the checksum was the same and I verified in the source I have your updated code. This is from r94831.

By: Marty Riedling (mariedling) 2007-12-28 14:51:53.000-0600

I added my core show locks when mine locked today. This seems to happen more since the patch. Mine has has locked 3 times today so far.

By: Marty Riedling (mariedling) 2007-12-28 15:12:37.000-0600

just happened again added locks-20071228-2.txt

By: Russell Bryant (russell) 2007-12-31 12:14:26.000-0600

Yeah, I'd like access again if you don't mind ...

By: Digium Subversion (svnbot) 2007-12-31 17:39:49.000-0600

Repository: asterisk
Revision: 95577

U   branches/1.4/main/pbx.c

------------------------------------------------------------------------
r95577 | mmichelson | 2007-12-31 17:39:49 -0600 (Mon, 31 Dec 2007) | 9 lines

Avoiding a potentially bad locking situation. ast_merge_contexts_and_delete writelocks the conlock, then
calls ast_hint_extension, which attempts to readlock the same lock. Recursion with read-write locks is
dangerous, so the inner lock needs to be removed. I did this by copying the "guts" of ast_hint_extension
into ast_merge_contexts_and_delete (sans the extra lock).

(this change is inspired by the locking problems seen in issue ASTERISK-10613, but I have no idea if this is the
problematic area experienced by the reporters of that issue)


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=95577

By: Digium Subversion (svnbot) 2007-12-31 17:41:12.000-0600

Repository: asterisk
Revision: 95578

_U  trunk/
U   trunk/main/pbx.c

------------------------------------------------------------------------
r95578 | mmichelson | 2007-12-31 17:41:11 -0600 (Mon, 31 Dec 2007) | 17 lines

Merged revisions 95577 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r95577 | mmichelson | 2007-12-31 17:43:13 -0600 (Mon, 31 Dec 2007) | 9 lines

Avoiding a potentially bad locking situation. ast_merge_contexts_and_delete writelocks the conlock, then
calls ast_hint_extension, which attempts to readlock the same lock. Recursion with read-write locks is
dangerous, so the inner lock needs to be removed. I did this by copying the "guts" of ast_hint_extension
into ast_merge_contexts_and_delete (sans the extra lock).

(this change is inspired by the locking problems seen in issue ASTERISK-10613, but I have no idea if this is the
problematic area experienced by the reporters of that issue)


........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=95578

By: Marty Riedling (mariedling) 2008-01-04 07:50:08.000-0600

I updated to SVN 95577 late on 1/1/08 and my problems seem to be resolved. I was having problems just about everyday before that. Will continue to monitor.

By: Russell Bryant (russell) 2008-01-04 13:06:32.000-0600

callguy:  Your locks output is very similar to that from the person that just reported that their problem appears to be solved.  Given that, it would be worth updating to the latest code in svn when you feel that it is a good time to do so.

By: Michiel van Baak (mvanbaak) 2008-01-04 15:58:00.000-0600

It would be nice if all the commenters who had this problem can provide feedback after updating.

By: Joel Vandal (jvandal) 2008-01-04 22:41:27.000-0600

I have update to Asterisk 1.4.17 some 'call-center' (5-10k calls per days) installations that have lots of deadlock (related to this ticket) and crash at least twice a day and for now this look good. I will monitor theses installations all weekends and if look good, will update more installations next week and continue to watch this.

By: Denis Galvao (denisgalvao) 2008-01-05 19:27:35.000-0600

Unfortunetly my customer didn't wait for the fix... We moved for another solution.

As soon as we have another one I will post here.

Thanks anyway for all of the people that helped this ticket.

By: Ronald Chan (loloski) 2008-01-07 08:32:15.000-0600

so far this is good :)

By: David Brillert (aragon) 2008-01-07 09:51:34.000-0600

I have updated a site to 1.4.17 that would crash within 5 minutes on 1.4.16

So far so good
Now running 1 day and 14 hours without reported incident.

By: Russell Bryant (russell) 2008-01-07 10:47:46.000-0600

Ok, then I am closing this out for now, in hopes that all is well.

Multiple issues have already been fixed as a result of this bug report.  So, if you have any further issues, please open a new report instead of continuing to use this one.  Thank you!