Summary: ASTERISK-10177: SIP hairpin invokes Local within app_dial to produce a crash.
Reporter: dtyoo (dtyoo)
Labels:
Date Opened: 2007-08-27 08:43:21
Date Closed: 2007-10-16 17:01:01
Priority: Critical
Regression? No
Status: Closed/Complete
Components: Applications/app_dial
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) 1_4_12_lock.h
( 1) 10571_chan_sip_crash_2.txt
( 2) 10571_for_ivan_additional_debug_9-27-07.txt
( 3) 10571-bt-btfull-sip-crash.txt
( 4) 20070828__bug10571.diff.txt
( 5) 20071009__bug10571.diff.txt
( 6) additional_trace_info.txt
( 7) bt_full_ivan.txt
( 8) bt.txt
( 9) bt-btfull.txt
(10) console-output.txt
(11) core100615102007.log
(12) core100615102007full.log
(13) core100615102007fullpagoff.log
(14) ivan_ast_1_4_12_rel_patch_lock.h.diff
(15) lock.h.diff.10571.txt
(16) lock.h.diff.txt
Description: We are getting crashes in app_voicemail on a fairly regular basis.  I'm still working on steps to re-produce, but I thought I would post the backtraces here in case someone could glean anything from them.  I will update if I can figure out the steps to re-produce.

****** ADDITIONAL INFORMATION ******

CentOS 4.5, Dell 1950, 2 x 3GHz Xeon CPU, 2GB RAM
Comments:

By: dtyoo (dtyoo) 2007-08-27 09:00:00

I do also get the following error in the asterisk log file as the last message before the server goes down:

[Aug 27 09:34:24] ERROR[23957] /usr/src/asterisk-test/1.4.9/asterisk-1.4.9/include/asterisk/lock.h: chan_local.c line 180 (local_queue_frame): mutex '&us->lock' freed more times than we've locked!


The source channel for the call going into voicemail is a LOCAL channel.

By: Mark Michelson (mmichelson) 2007-08-27 11:21:00

I've changed the categorization to channels/chan_local since the problem seems to be actually happening in the chan_local code and not the voicemail code. It is worth noting, though, if voicemail is the most likely culprit in causing the local channel to crash.

By: Tilghman Lesher (tilghman) 2007-08-27 18:11:18

I'd like to see the following out of this backtrace:

(gdb) frame 1
(gdb) p *other

I'd also like to see the portion of your dialplan where you're calling the Local channel before it gets to VoiceMailMain.  Specifically, I'd like to see the flags that you're passing to the Local proxy channel.

By: dtyoo (dtyoo) 2007-08-27 19:59:49

putnopvut-

You are probably right that this is an issue with the local channel.  Since upgrading our servers from 1.2 to 1.4 we have been having other issues related to server hangs (asterisk stops taking calls) where the common thread is that the source channel for the calls is a local channel.  Running "show channels" on a server in this state will go into an infinite loop and won't return control to the console.  I'm still trying to get re-producible scenarios for those issues as well.  Maybe it's all related?

Corydon76-

See upload that has the info you requested.  The unusual thing about the way these calls are getting to voicemail is that they are falling out of a sip dialing loop.  The Dial command is dialing the extension at the currently running server.  E.g.

Dial(SIP/1231231234@this_server,70)

This is followed on the console by a:

Got SIP response 482 "Loop Detected" from this_server

Then a:

Now forwarding SIP/SOURCE_SIP_PEER-b7bc2938 to 'Local/1231231234@pstn-in' (thanks to SIP/this_server-08cec0c8)

Then into a voicemail macro that does little more than call VoiceMail with appropriate arguments.

I'm not passing any explicit flags to the LOCAL channel as it seems that asterisk is creating it for me based on the sip loop.

I understand that this is a less than ideal way to get a call delivered to a different part of the dialplan on the same server.  It didn't seem to be causing issues under 1.2 and doing it this way allowed us to simplify our dialplan considerably.

By: dtyoo (dtyoo) 2007-08-27 20:09:09

Console snippet posted.

By: Tilghman Lesher (tilghman) 2007-08-28 15:18:56

dtyoo:  instead of relying on SIP hairpinning, you can do Dial(Local/extension@context).

By: Tilghman Lesher (tilghman) 2007-08-28 15:24:41

dtyoo:  try that change that I suggested, and append "/n" on the end to disable local channel optimization OR apply this patchset to your installation and try again.  Either way, it does the same thing, preventing local channel path optimization.

By: dtyoo (dtyoo) 2007-08-29 13:50:18

Corydon76-

We changed our dialplan to avoid the hairpin and thus the creation of the local channel altogether.  We were getting at least 1 crash or hang per day previously, but have been stable (knock on wood) since making these changes.  Of course this doesn't help us get to the bottom of this issue.  We are going to test your patch on some other servers and let you know what the results are.

By: Jason Parker (jparker) 2007-09-14 13:54:58

Any luck with testing this?

By: dtyoo (dtyoo) 2007-09-16 12:02:24

qwell-

We haven't had any more instances of this crash, but we also have changed our dialplan to minimize the use of the local channel.  We've been running Corydon76's patch on 1.4.11 on a couple of our servers with no adverse effects or crashes.  I've tried to re-produce this issue in our dev environment with the previous state of our dialplan that makes extensive use of the local channel, but am unable to do so.  The issue was only happening under load in production.  

At this point I would say we are avoiding this crash with the dialplan changes we made.  I can't say conclusively if the patch fixes the issue.

By: dtyoo (dtyoo) 2007-09-19 11:03:27

Corydon76, qwell-

We did get another crash that looks very similar but is not exactly the same to the ones we originally reported here.  This is on one of the upgraded servers that is running 1.4.11 with Corydon76's local channel de-optimization patch.  This crash points to chan_sip, but it definitely seems related.  This is the last line of the messages file:

[Sep 19 09:17:29] ERROR[15843] /usr/src/asterisk-test/1.4.11/asterisk-1.4.11/include/asterisk/lock.h: chan_sip.c line 15175 (sipsock_read): mutex '&p->owner->lock' freed more times than we've locked!

What was happening at the time of the crash is a call being delivered to a large ring group with 32 sip peers being dialed with Dial(SIP/PEER1&SIP/PEER2...).

I've also uploaded a new bt/btfull from this core file.

By: dtyoo (dtyoo) 2007-09-25 12:26:22

Corydon76, qwell-

I have another example / bt of the last reported crash that points to chan_sip.  Should I open this in a totally separate bug instead of this one?  I realize that this may not be directly related to the original issue reported here, and that these updates may be misplaced.  Let me know if I should open a separate bug and I am happy to do so.  Just as before, this again happened during an inbound call to a large ring group with 25-30 sip peers being dialed simultaneously.

This is 1.4.11.

Here are the last messages:

[Sep 25 09:06:09] ERROR[17350]: /usr/src/asterisk-test/1.4.11/asterisk-1.4.11/include/asterisk/lock.h:381 __ast_pthread_mutex_unlock: chan_sip.c line 15175 (sipsock_read): mutex '&p->owner->lock' freed more times than we've locked!

[Sep 25 09:06:09] ERROR[17350]: /usr/src/asterisk-test/1.4.11/asterisk-1.4.11/include/asterisk/lock.h:397 __ast_pthread_mutex_unlock: chan_sip.c line 15175 (sipsock_read): Error releasing mutex: Operation not permitted

I'm uploading the bt as well.

By: Volnikov Ivan (ivan) 2007-09-26 06:30:11

I think I am seeing a similar situation (bt_full_ivan.txt). At first I thought it was caused by my own changes (http://bugs.digium.com/view.php?id=10821). It turns out that is not the case at all; a different vulnerability is at work.

By: Volnikov Ivan (ivan) 2007-09-26 08:16:51

If anyone looks at this:
(gdb) frame 0
#0  0x00757aef in __ast_pthread_mutex_trylock (filename=0x75a513 "chan_local.c", lineno=176, func=0x75a60e "local_queue_frame",
   mutex_name=0x75a629 "&other->lock", t=0xb7bfc5b8) at /usr/src/asterisk-1.4.11-debug/include/asterisk/lock.h:345
345                             t->thread[t->reentrancy] = pthread_self();
(gdb) p *t
$15 = {mutex = {__data = {__lock = 1, __count = 1, __owner = 5666, __kind = 1, __nusers = 1, {__spins = 0, __list = {
         __next = 0x0}}},
   __size = "\001\000\000\000\001\000\000\000\"\026\000\000\001\000\000\000\001\000\000\000\000\000\000", __align = 1},
 track = 1, file = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xb0 <Address 0xb0 out of bounds>}, lineno = {0, 0, 0, 0, 0,
   0, 0, 0, 0, 0}, reentrancy = 7710222, func = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, thread = {0, 0, 0, 0, 0,
   0, 0, 0, 0, 0}}
(gdb)
This makes a lot clear. How can reentrancy = 7710222 happen?



By: Volnikov Ivan (ivan) 2007-09-27 03:38:18

dtyoo: if another crash happens, can you show me:
(gdb) frame 1
(gdb) p *other
and
(gdb) frame 0
(gdb) p *t
?



By: dtyoo (dtyoo) 2007-09-27 08:01:46

Ivan-

I just added the info you requested from one of the core files from a previous instance of the local channel crash.  We were still running at 1.4.9 at the time this happened.

By: Volnikov Ivan (ivan) 2007-09-27 09:01:08

dtyoo: Thanks.
My backtrace shows the same problem: the "reentrancy" counter has an invalid value.
For developers:
I think the possible causes are:
1. A race condition in the mutex wrapper functions (lock.h) caused by non-atomic access to the "reentrancy" counter (most likely)
2. A race condition on an "ast_channel" that is destroyed without a handshake first (most likely to be in the res_features.c module)
3. Linux kernel-level trouble with thread context switching on a multicore CPU (unlikely)

By: Volnikov Ivan (ivan) 2007-10-05 03:16:50

dtyoo:
If you are ready to test my hypothesis No. 1, I can make a patch to <lock.h> for you.
For that, you would need to upload your <lock.h> (you are using version 1.4.9, and there are differences). I cannot guarantee that it solves the problem (for me it has gone away), but I can guarantee that it will work.

By: callguy (callguy) 2007-10-05 08:01:36

Ivan - I saw your note and agree that at least the current traces here point to it being the same issue as the one I reported in 10875. If you could make a patch to lock.h for asterisk 1.4.12 we'd be happy to test.

By: Volnikov Ivan (ivan) 2007-10-05 09:11:32

callguy - I have posted <1_4_12_lock.h> for you. Rename it to <lock.h>, replace the existing file, and rebuild. I hope it helps.

By: callguy (callguy) 2007-10-05 09:19:46

Ivan - thanks. we'll test this patch and let you know the results.

By: Mark Michelson (mmichelson) 2007-10-05 11:10:33

Ivan, could you post that lock.h in unified diff format so that it can easily be seen what was changed?

By: callguy (callguy) 2007-10-05 14:09:56

while i was compiling i ran a diff on ivan's file. uploaded as lock.h.diff. note - diff was calculated against 1.4 trunk, but i verified that lock.h in 1.4.12 is identical, so this should apply to both.

By: callguy (callguy) 2007-10-05 15:46:32

Ivan - your patch causes the following compiler warning to be repeated:

/usr/src/asterisk-test/1.4.12/asterisk-1.4.12.1/include/asterisk/lock.h:339: warning: unused variable `canlog'

By: callguy (callguy) 2007-10-05 23:21:45

ivan - just tried testing your patch, asterisk crashes on any attempt to make a call with memory corruption. If you need bt's please let me know and I'll provide them - but I suspect there's something obvious going on.

By: Volnikov Ivan (ivan) 2007-10-08 01:47:58

putnopvut -
I think it is not necessary to make any patch yet. This is not a patch; it is only an attempt to check a hypothesis.
callguy says that it crashes on a call. I think we will sort that out between us. I tried it myself on 1.4.11 and it was OK. It needs to be tried on 1.4.12.
callguy -
Sorry for the bad attempt. I will try it on my system with 1.4.12.



By: Volnikov Ivan (ivan) 2007-10-08 01:52:44

callguy -
Did you set the options
DEBUG_THREADS
DONT_OPTIMIZE
MALLOC_DEBUG
LOADABLE_MODULES
in menuselect when building?
Please do not set DETECT_DEADLOCKS.
I have tested this <lock.h> on my system.
No crash occurred.
Can you show me a backtrace?
Do you have an IRC or ICQ connection?



By: Volnikov Ivan (ivan) 2007-10-08 02:42:44

Though for me it works in every configuration :(
callguy -
 I need a backtrace...
 Does the crash happen on every call, or only after some successful calls?



By: callguy (callguy) 2007-10-08 05:33:08

Ivan - I realized the issue was actually a different patch that I was testing at the same time. I'm retesting yours now and will let you know the results. My apologies for the trouble.

By: callguy (callguy) 2007-10-08 05:42:35

Ivan - I've confirmed that it is working with the file you provided. I will run it this way for the day and let you know if we experience any crashes.

By: Volnikov Ivan (ivan) 2007-10-08 06:31:39

callguy - OK. I'll be waiting for results.

By: callguy (callguy) 2007-10-08 16:46:21

Ivan - so far so good. We made it a full day without a crash, which we haven't done before on 1.4.12. That said - it was a bit of a quiet day today due to the holiday here, so I'll run this through tomorrow and let you know the results.

By: Volnikov Ivan (ivan) 2007-10-09 01:25:13

This kind of vulnerability began to show up more often in 1.4.12 because the guys from Digium made many changes related to deadlock avoidance.
If my assumption is correct, this module (<lock.h>) will have to be changed.
I had to disable the questionable debugging code there that tracks entry into critical sections, along with one watchdog.
callguy -
I am waiting for the results. If I turn out to be right, I will fix it.



By: Dmitry Andrianov (dimas) 2007-10-09 01:49:50

Ivan, maybe he just cannot understand what you are trying to say :)
Believe me, it is difficult even for me, the guy who speaks the same language as you.

By: Volnikov Ivan (ivan) 2007-10-09 02:26:21

dimas -
For such cases there is always the phrase "I do not understand"; I will gladly rephrase. You may be right. But let's rather discuss technical questions. Multithreading is one of the most complex topics in coding, but developers should be able to understand each other from half a word. I am really sorry that I allowed myself the previous statement. I will remove it now.

By: callguy (callguy) 2007-10-09 15:22:42

Ivan - it looks like you are on the right track here. We were seeing crashes every 2-3 hours consistently w/1.4.12. With your patch we haven't had a single crash in two days. We actually added extra load to this server for testing today and haven't had any issues. Is there anything we can do to make this into a formal patch since it appears your theory was correct?

By: Tilghman Lesher (tilghman) 2007-10-09 15:41:05

This would appear to be the essence of Ivan's changes, correct?  All of the other changes are simply commenting debugging output.

By: callguy (callguy) 2007-10-09 15:44:14

Corydon76: These were Ivan's changes, I just reformatted them into a diff for easier viewing.

By: Digium Subversion (svnbot) 2007-10-09 16:34:35

Repository: asterisk
Revision: 85158

U   branches/1.4/include/asterisk/lock.h
U   branches/1.4/main/channel.c
U   branches/1.4/main/utils.c

------------------------------------------------------------------------
r85158 | tilghman | 2007-10-09 16:34:34 -0500 (Tue, 09 Oct 2007) | 5 lines

This commit fixes the following issues:
- Deadlock in ast_write (issue ASTERISK-10043)
- Deadlock in ast_read (issue ASTERISK-10043)
- Possible mutex initialization error in lock.h (issue ASTERISK-10177)

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-10-09 17:02:01

Repository: asterisk
Revision: 85176

_U  trunk/
U   trunk/include/asterisk/lock.h
U   trunk/main/channel.c
U   trunk/main/utils.c

------------------------------------------------------------------------
r85176 | tilghman | 2007-10-09 17:02:00 -0500 (Tue, 09 Oct 2007) | 13 lines

Merged revisions 85158 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r85158 | tilghman | 2007-10-09 16:55:06 -0500 (Tue, 09 Oct 2007) | 5 lines

This commit fixes the following issues:
- Deadlock in ast_write (issue ASTERISK-10043)
- Deadlock in ast_read (issue ASTERISK-10043)
- Possible mutex initialization error in lock.h (issue ASTERISK-10177)

........

------------------------------------------------------------------------

By: Volnikov Ivan (ivan) 2007-10-10 02:06:06

Corydon76 - I commented out more than just debug code. There is also a WATCHDOG that checks for correct ownership when a mutex is released, in three cases:
1. unlocking a mutex
2. waiting on a condition with the mutex
3. timed waiting on a condition with the mutex
Is it really needed?
Apart from that, the "reentrancy" counter is used in <chan_h323.c> and <channels.c> for debugging purposes (in DEBUG_THREADS mode).
There are two ways to maintain "reentrancy" correctly:
1. Use atomic functions to increment or decrement the counter (faster than way 2, but demands more experience to implement)
2. Use a global critical section (or a local one, which is faster but more wasteful), i.e. a dedicated mutex on Linux, in that debug mode to perform read/write operations on the "reentrancy" counter (slower than way 1, but more suitable for debugging purposes)
I think whoever wrote this debugging code should be the one to do it.
On personal request (ICQ:138890162) I can provide advice or make a patch (but not before November 15th; my current workload is heavy).
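
The two ways listed above can be sketched roughly as follows. This is only an illustration, assuming C11 <stdatomic.h> and POSIX threads are available; atomic_counter, locked_counter, counters, and all helper functions are hypothetical names and are not taken from lock.h. It covers the counter alone, so any per-slot data indexed by the counter would still need the same protection.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Way 1: the counter is an atomic object, so each update is indivisible. */
struct atomic_counter {
    atomic_int reentrancy;
};

/* Both helpers return the depth as it was before the update. */
static int atomic_enter(struct atomic_counter *c)
{
    return atomic_fetch_add(&c->reentrancy, 1);
}

static int atomic_leave(struct atomic_counter *c)
{
    return atomic_fetch_sub(&c->reentrancy, 1);
}

/* Way 2: a plain counter guarded by its own small mutex. */
struct locked_counter {
    pthread_mutex_t guard;
    int reentrancy;
};

static int locked_enter(struct locked_counter *c)
{
    int before;

    pthread_mutex_lock(&c->guard);
    before = c->reentrancy++;
    pthread_mutex_unlock(&c->guard);
    return before;
}

static int locked_leave(struct locked_counter *c)
{
    int before;

    pthread_mutex_lock(&c->guard);
    before = c->reentrancy--;
    pthread_mutex_unlock(&c->guard);
    return before;
}

/* Harness (hypothetical): several threads update both counters at once. */
struct counters {
    struct atomic_counter a;
    struct locked_counter l;
};

static void *worker(void *arg)
{
    struct counters *p = arg;

    for (int i = 0; i < 100000; i++) {
        atomic_enter(&p->a);
        atomic_leave(&p->a);
        locked_enter(&p->l);
        locked_leave(&p->l);
    }
    return NULL;
}

/* Returns 0 if all threads were started and joined, -1 otherwise. */
static int run_workers(struct counters *p, int nthreads)
{
    pthread_t tid[8];
    int started = 0;

    if (nthreads < 1 || nthreads > 8)
        return -1;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, p) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started == nthreads ? 0 : -1;
}
```

Because every enter is paired with a leave, both counters should read zero again once run_workers() returns. With a plain int and no guard, lost updates could leave the counter at some other value.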



By: Volnikov Ivan (ivan) 2007-10-10 03:30:04

callguy -
I have tried, am trying, and will keep trying to explain to the developers what the problem is in the mutex wrapper code. With your help we have found that I am most likely right. But :( for our efforts to turn into a patch, someone has to write it. I need approximately 4-6 working hours to write a correct patch and test it (more time for the testing). If the developers have not contacted me with questions by November 12, I will make the patch on the weekend. I hope that with your help we can verify it.

By: callguy (callguy) 2007-10-10 13:25:03

Corydon76: We reviewed this in more detail today, and agree with Ivan, this is not just an initialization issue. The reason is that the debug code that he had commented out is performing operations against the reentrancy variable (this is only if DEBUG_THREADS is set).

The core issue is that the debugging code isn't thread safe, so there are several situations where the incrementing/decrementing of reentrancy in the debug code could lead to an out of bounds array index.

It looks like the easiest way to deal with this is instead of a flat structure of arrays, we need an array of structures. We're going to create a patch over here that should resolve this and will submit later today.
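
The layout change described above might look like the following sketch. The field names mirror the file, lineno, func, and thread members visible in the earlier gdb dump, but lock_site, lock_track_aos, MAX_REENTRANCY, push_site, and pop_site are hypothetical and are not taken from the uploaded patch. Grouping the fields keeps each acquisition's record together and gives one obvious place for a range check. On its own it does not serialize concurrent updates.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_REENTRANCY 10   /* assumption: matches the ten-slot arrays in the dump */

/* One record per nested acquisition, instead of four parallel arrays. */
struct lock_site {
    const char *file;
    int lineno;
    const char *func;
    pthread_t thread;
};

struct lock_track_aos {
    int reentrancy;                         /* number of records in use */
    struct lock_site site[MAX_REENTRANCY];
};

/*
 * Append a record for the calling thread.
 * Returns the record, or NULL if the counter is outside the array bounds
 * (for example because it was corrupted).
 */
static struct lock_site *push_site(struct lock_track_aos *t, const char *file,
                                   int lineno, const char *func)
{
    int depth;
    struct lock_site *s;

    assert(t != NULL);
    depth = t->reentrancy;      /* read the counter once into a local copy */
    if (depth < 0 || depth >= MAX_REENTRANCY)
        return NULL;

    s = &t->site[depth];
    s->file = file;
    s->lineno = lineno;
    s->func = func;
    s->thread = pthread_self();
    t->reentrancy = depth + 1;
    return s;
}

/* Drop the most recent record. Returns the new depth, or -1 if none is left. */
static int pop_site(struct lock_track_aos *t)
{
    assert(t != NULL);
    if (t->reentrancy <= 0 || t->reentrancy > MAX_REENTRANCY)
        return -1;
    return --t->reentrancy;
}
```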

By: callguy (callguy) 2007-10-10 16:35:36

Corydon76: Please take a look at the patch I just uploaded, it's an attempt at making the thread debugging thread safe. We've compiled and tested this, and it appears to be functioning correctly.

By: Volnikov Ivan (ivan) 2007-10-11 01:47:11

callguy -
 I have looked at your implementation.
 I have two doubts:
 1. Moving the critical counter into a stack variable reduces the probability of a
    "race condition", but does not exclude it completely (especially on
    multi-core systems)
 2. The range check on the "reentrancy" counter in your code only prevents an
    out-of-range array index; it does not correct the "race condition" itself
 I think the following changes need to be made in the <lock.h> module:
 1. Add a critical section object (a mutex on Linux) to the reentrancy attributes
 2. Perform all operations on the reentrancy attributes inside the locked critical section
--> This is the simplest and most reliable way to get rid of the problem (though not the only one).
 I fully understand that solving this problem is very important for you,
 and I will try to find the time to patch it this weekend.
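
A minimal sketch of the two proposed changes, assuming POSIX threads. lock_track, reentr_guard, track_init, track_acquired, and track_releasing are hypothetical names for illustration, not the contents of any attached patch. The bookkeeping carries its own small mutex, and every read or write of the counter and its slots happens while that mutex is held.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_REENTRANCY 10   /* assumption: depth of the debug history */

/* Hypothetical debug record kept alongside a recursive mutex. */
struct lock_track {
    pthread_mutex_t reentr_guard;        /* change 1: guard owned by the record */
    int reentrancy;                      /* number of slots in use */
    int lineno[MAX_REENTRANCY];
    pthread_t thread[MAX_REENTRANCY];
};

static int track_init(struct lock_track *t)
{
    assert(t != NULL);
    t->reentrancy = 0;
    return pthread_mutex_init(&t->reentr_guard, NULL);
}

/*
 * Call after the real mutex has been acquired.
 * Returns the new depth, or -1 if the history is already full.
 */
static int track_acquired(struct lock_track *t, int lineno)
{
    int depth = -1;

    pthread_mutex_lock(&t->reentr_guard);       /* change 2: all access inside */
    if (t->reentrancy >= 0 && t->reentrancy < MAX_REENTRANCY) {
        t->lineno[t->reentrancy] = lineno;
        t->thread[t->reentrancy] = pthread_self();
        depth = ++t->reentrancy;
    }
    pthread_mutex_unlock(&t->reentr_guard);
    return depth;
}

/*
 * Call before releasing the real mutex.
 * Returns the remaining depth, or -1 if the calling thread does not own the
 * most recent acquisition (the "freed more times than we've locked" case).
 */
static int track_releasing(struct lock_track *t)
{
    int depth = -1;

    pthread_mutex_lock(&t->reentr_guard);
    if (t->reentrancy > 0 && t->reentrancy <= MAX_REENTRANCY &&
        pthread_equal(t->thread[t->reentrancy - 1], pthread_self()))
        depth = --t->reentrancy;
    pthread_mutex_unlock(&t->reentr_guard);
    return depth;
}
```

The guard is held only for the few stores that update the record, so it is never held while a thread blocks on the real mutex.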



By: Volnikov Ivan (ivan) 2007-10-11 04:35:47

Corydon76 -
 Sorry. You were right: the issue happens only in DEBUG_THREADS mode.
 I did not notice the global definition at first.
callguy -
 I found some time to write my solution to the problem (ivan_ast_1_4_12_rel_patch_lock.h.diff).
 If you get the opportunity, please test it. On my system it works in DEBUG_THREADS mode. The patch was made against the 1.4.12 release.



By: callguy (callguy) 2007-10-11 09:24:12

Ivan - thanks for cleaning this up, it's much appreciated. Without this fixed we couldn't move forward on trying to sort out some of the other issues we've been having. We will start testing this tonight and let you know if we see any problems.

By: callguy (callguy) 2007-10-12 15:16:23

Corydon76: We've been running the latest patch since last night on two moderately loaded servers (about 50 concurrent calls each) - either of which would crash within 2-3 hours previously if running with DEBUG_THREADS enabled. I also went back and reviewed the earlier traces related to this bug and believe that they were all in fact caused by the same underlying problem.

Unless anyone else has any objections I think this one can be closed out.

Thanks for your help and also many thanks to Ivan for helping to get this one resolved.

By: David Brillert (aragon) 2007-10-12 15:35:37

Will ivan_ast_1_4_12_rel_patch_lock.h.diff apply to 1.4.13
I have been experiencing core dumps on multiple sites related to this bug
I am using version 1.4.13
I'm curious if this patch will supersede r85158 | tilghman | 2007-10-09 16:55:06 -0500 (Tue, 09 Oct 2007) | 5 lines
r85158 did not work for me.

By: Mark Michelson (mmichelson) 2007-10-12 15:52:51

aragon:

This patch will supersede 85158. I just tried applying the patch to 1.4.13, and had 1 hunk fail. I looked at what got rejected when patching, and it shouldn't be difficult to manually patch the failed section (it's a matter of removing one line and adding another). I suggest giving this patch a try because this is a pretty important issue to get settled and if we have more confirmation that Ivan's patch is working, it will get merged sooner.

By: callguy (callguy) 2007-10-12 16:03:46

Aragon:

If you don't want to reconcile the rejected hunk by hand, you can do the following (I just tested this and it results in the desired effect):

-Get the sources for both 1.4.12.1 and 1.4.13.

Apply the following patches to the 1.4.12.1 tree:
-patch: 20071008__bug10406__3.diff.txt from bug 10406
-patch ivan_ast_1_4_12_rel_patch_lock.h.diff from 10571
-Copy the resulting lock.h into include/asterisk/lock.h in the 1.4.13 code

By: David Brillert (aragon) 2007-10-12 17:51:41

putnopvut

Can you upload this patch for 1.4.13
I would like to test this patch.
I have three sites segfaulting twice each daily
Each site with the same backtraces

By: David Brillert (aragon) 2007-10-13 14:00:33

I have Ivan's patch in tests at 2 sites and will update bugtracker.
Patched against 1.4.13

By: David Brillert (aragon) 2007-10-15 10:06:20

A segfault this morning within an hour of applying the patch and restarting Asterisk 1.4.13 with Ivan's patch.
backtrace is core100615102007.log uploaded to bugtracker

By: Mark Michelson (mmichelson) 2007-10-15 10:08:12

aragon, I don't see the backtrace you mentioned.

Edit: It's there now. Sorry.



By: callguy (callguy) 2007-10-15 10:09:09

Aragon: can you upload the output of "bt full" from your core file?

By: David Brillert (aragon) 2007-10-15 10:41:43

full bt uploaded core100615102007full.log

Could someone be so kind as to remove hostname etc from header of my backtrace?



By: callguy (callguy) 2007-10-15 11:01:06

Aragon: Can you do this again, but after starting gdb type "set pagination off" then upload the resulting output. This should get the entire backtrace.

Edit: After typing "set pagination off" then run bt full and upload the output.



By: David Brillert (aragon) 2007-10-15 11:56:22

OK callguy full trace is uploaded

By: callguy (callguy) 2007-10-15 12:58:17

putnopvut: I don't really know what to make of aragon's bt, but it doesn't appear to be related in any obvious way to the core issue in this bug.

By: Mark Michelson (mmichelson) 2007-10-15 13:49:10

aragon, your bt full is missing a lot of information that should be there. Are you compiling with the DONT_OPTIMIZE flag set in menuselect?

If you haven't already done it, could you run asterisk with Ivan's patch again so we can gauge if it is actually that patch which is causing the crash you just had?

By: David Brillert (aragon) 2007-10-15 14:36:51

Ivan's patch is applied to 1.4.13
We are going to recompile asterisk 1.4.13 with Ivan's patch with debug_thread + dont optimize tonight
If Asterisk segfaults again I will upload the backtraces

By: callguy (callguy) 2007-10-15 16:22:42

FYI - we added a few more servers to our testing today, and continued to run without any occurrence of this issue. All servers are under substantial load (between 50-100 concurrent calls during business hours).

By: Volnikov Ivan (ivan) 2007-10-16 01:43:07

aragon -
 It seems to me that your crash is not related to this issue at all.
 I have just looked at your first core171711102007.log. This is a crash in the voicemail function. I think a separate issue is required.



By: Volnikov Ivan (ivan) 2007-10-16 01:54:05

aragon -
I suspect that your case is connected with the processing of an extensions scenario that is specific to your application. In any case, further investigation is needed.



By: David Brillert (aragon) 2007-10-16 08:57:44

I have not yet seen a lockup this morning on two sites.
I was going to wait until tomorrow morning to post results on this bug ID.
Systems are compiled with dont optimize and debug threads enabled so either way I can provide an accurate BT trace if anything segfaults again.

By: David Brillert (aragon) 2007-10-16 09:05:24

My backtraces looked more similar to those found in http://bugs.digium.com/view.php?id=10875

I was referred back to 10571 by callguy and a relationship to 10571 has been added to 10875

I'll be sure to update both bugs in the tracker.

Ivan can you shed any additional info on what you see that might be a problem?

I appreciate everyone's help with this and I'll do my best to help...

By: David Brillert (aragon) 2007-10-16 11:12:34

My site segfaulted already this morning.
I no longer think these problems are related to 10571
I have attached bt.txt if anyone wants to look at the backtrace.
I have opened a new bug report
http://bugs.digium.com/view.php?id=10997

By: Digium Subversion (svnbot) 2007-10-16 16:53:53

Repository: asterisk
Revision: 85994

U   branches/1.4/include/asterisk/lock.h

------------------------------------------------------------------------
r85994 | russell | 2007-10-16 16:53:52 -0500 (Tue, 16 Oct 2007) | 16 lines

Some locking errors exposed the fact that the lock debugging code itself was
not thread safe.  How ironic!  Anyway, these changes ensure that the code that
is accessing the lock debugging data is thread-safe.  

Many thanks to Ivan for finding and fixing the core issue here, and also
thanks to those that tested the patch and provided test results.

(closes issue ASTERISK-10177)
(closes issue ASTERISK-10446)
(closes issue ASTERISK-10436)
(might close some others, as well ...)

Patches: (from issue ASTERISK-10177)
     ivan_ast_1_4_12_rel_patch_lock.h.diff uploaded by Ivan (license 229)
      - a few small changes by me

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-10-16 17:01:01

Repository: asterisk
Revision: 85995

_U  trunk/
U   trunk/include/asterisk/lock.h

------------------------------------------------------------------------
r85995 | russell | 2007-10-16 17:01:01 -0500 (Tue, 16 Oct 2007) | 24 lines

Merged revisions 85994 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r85994 | russell | 2007-10-16 17:14:36 -0500 (Tue, 16 Oct 2007) | 16 lines

Some locking errors exposed the fact that the lock debugging code itself was
not thread safe.  How ironic!  Anyway, these changes ensure that the code that
is accessing the lock debugging data is thread-safe.  

Many thanks to Ivan for finding and fixing the core issue here, and also
thanks to those that tested the patch and provided test results.

(closes issue ASTERISK-10177)
(closes issue ASTERISK-10446)
(closes issue ASTERISK-10436)
(might close some others, as well ...)

Patches: (from issue ASTERISK-10177)
     ivan_ast_1_4_12_rel_patch_lock.h.diff uploaded by Ivan (license 229)
      - a few small changes by me

........

------------------------------------------------------------------------