[Home]

Summary:ASTERISK-05996: "Failed to create update thread"
Reporter:Tiszai (ptiszai)Labels:
Date Opened:2006-01-06 09:12:32.000-0600Date Closed:2006-05-03 09:28:21
Priority:MinorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) asterisk.diff
Description:Hello,

on a Asterisk SVN-branch-1.2-r7406M production box we noticed twice
system death (asterisk processes were running, so no core dump
available) during 3 days.

The call center was used at time of malfunction only for outbound calls
(IAX2 softphones to PRI). Calls were recorded with monitor application.

In the 1st failure I was able to log onto console (asterisk -rv) but the
console stucked many times - this means, did not react for anything
(running from screen).
In the 2nd case I wasn't able to log into console.
In both cases the asterisk process ate 98-99% of CPU and did no any
usable work (we was not able to place calls). 'restart now' command
didn't work. kill itself didn't work - just kill -9 at the end.

Some interesting things I found:
Dec 17 13:58:26 DEBUG[21182] channel.c: Avoiding initial deadlock for
'Zap/32-1'
Dec 17 13:58:26 VERBOSE[7937] logger.c:     -- Zap/32-1 answered
IAX2/3225-35
Dec 17 13:58:26 WARNING[21182] app_queue.c: Failed to create update thread!
Dec 17 13:58:26 DEBUG[21182] channel.c: Avoiding initial deadlock for
'IAX2/3225-35'
Dec 17 13:58:26 WARNING[21182] app_queue.c: Failed to create update thread!
Dec 17 13:58:26 WARNING[21182] app_queue.c: Failed to create update thread!

Last lines in messages log were:
Dec 17 13:58:30 WARNING[21182] app_queue.c: Failed to create update thread!
Dec 17 13:58:30 DEBUG[7984] chan_iax2.c: We're hanging up IAX2/3202-47
now...
Dec 17 13:58:30 DEBUG[7984] chan_iax2.c: Really destroying IAX2/3202-47
now...
Dec 17 13:58:30 VERBOSE[7984] logger.c:     -- Hungup 'IAX2/3202-47'
Dec 17 13:58:30 WARNING[21182] app_queue.c: Failed to create update thread!
Dec 17 13:58:30 DEBUG[21188] chan_zap.c: Echo cancellation already on
Dec 17 13:58:30 DEBUG[21182] channel.c: Avoiding initial deadlock for
'Zap/35-1'
Dec 17 13:58:30 VERBOSE[7947] logger.c:     -- Zap/35-1 answered
IAX2/3224-43
Dec 17 13:58:30 WARNING[21182] app_queue.c: Failed to create update thread!
Dec 17 13:58:30 WARNING[21182] app_queue.c: Failed to create update thread!
Dec 17 13:58:30 DEBUG[21185] chan_iax2.c: chan_iax2: ast_sched_runq ran
772 scheduled tasks all at once
Dec 17 13:58:35 DEBUG[21188] chan_zap.c: Echo cancellation already on
Dec 17 13:58:35 DEBUG[21182] channel.c: Avoiding initial deadlock for
'Zap/46-1'
Dec 17 13:58:35 VERBOSE[7967] logger.c:     -- Zap/46-1 answered
IAX2/3235-44
Dec 17 13:58:35 WARNING[21182] app_queue.c: Failed to create update thread!
Dec 17 13:58:37 DEBUG[7458] channel.c: Didn't get a frame from channel:
Zap/43-1
Dec 17 13:58:37 DEBUG[7458] channel.c: Bridge stops bridging channels
IAX2/3267-31 and Zap/43-1

In logs I noticed many times:
DEBUG[21182] channel.c: Avoiding initial deadlock for 'Zap/36-1'

What can this mean?
I checked app_queue and saw that the given message came from a method
which intended to handle device state changes and it had to issue
another thread to make the work (in app_queue). I have three things in
my mind:
1. there is something wrong with device state handling - not related to
app_queue
2. there is something wrong in system not allowing to start new thread -
not enough resources? I doubt.pl. the system-imposed limit on the total number of threads in a process PTHREAD_THREADS_MAX would be exceeded, because the more "changethread()" thread functions locked, and there are not exit.
3. something wrong with app_queue

System specs:
- Asterisk 1.2 SVN branch with some small modifications, patches.
- Linux-2.6.13.2, Ubuntu 5.04
- SuperMicro Dual Xeon 2.8GHz (HT disabled), 2GB RAM, HW SATA RAID5
- 2xDigium TE110P

What can I do in such case? That's a production box, we have a very
similar box (just the motherboard differs a bit) with exactly the same
asterisk version, kernel, configuration and didn't notice the problem at
all. We are restarting asterisk after midnight every day.

Any idea what do I have to check in configuration? What to try if the
problem occurs again, how to help you, the developers to catch the
source of the issue (maybe a bug?)?
Comments:By: Mark Spencer (markster) 2006-01-07 17:10:42.000-0600

I think it might be useful to treat this as a potential deadlock and generate a backtrace on a non-optimized build when this happens.  Post the resulting backtrace here and we may be able to determine if it's a deadlock rather than truly just running out of resources.

By: Russell Bryant (russell) 2006-01-08 22:03:46.000-0600

You said that this is occurring on a system that has patches applied.  Could you please show what changes you have made?

By: Tiszai (ptiszai) 2006-01-10 07:03:38.000-0600

I add the our patch file.
Very similary bug:
http://bugs.digium.com/file_download.php?file_id=8789&type=bug
"Thread 189 (Thread -1222902864 (LWP 19916)):
#0  0xb7f8c436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1  0xb7f8989f in _L_mutex_lock_73 () from /lib/tls/libpthread.so.0
#2  0xb71b9050 in ?? ()
#3  0xb71b8a50 in ?? ()
#4  0xb71b8758 in ?? ()
ASTERISK-1  0xb7607d6d in hangupcalls (outgoing=0xb7616238, exception=0xb7616738) at app_queue.c:1249
ASTERISK-2  0xb7611208 in ast_mutex_lock (pmutex=0xb7616238) at lock.h:584"

I think the this is the first deadlock, and after bad the system.
Interesting thing is:
outgoing=0xb7616238 pointer = (pmutex=0xb7616238) !!!?
The analog bug closed:6196



By: Tiszai (ptiszai) 2006-01-12 04:15:19.000-0600

I should like to ask you, if modified the part of app_queue.c (removing the ast_mutex_lock(&q->lock) and ast_mutex_unlock(&q->lock)) it is good "brainstorm"?:

static void *changethread(void *data)
{
struct ast_call_queue *q;
struct statechange *sc = data;
struct member *cur;
char *loc;
char *technology;

technology = ast_strdupa(sc->dev);
loc = strchr(technology, '/');
if (loc) {
*loc = '\0';
loc++;
} else {
free(sc);
return NULL;
}
if (option_debug)
ast_log(LOG_DEBUG, "Device '%s/%s' changed to state '%d' (%s)\n", technology, loc, sc->state, devstate2str(sc->state));
ast_mutex_lock(&qlock);
for (q = queues; q; q = q->next) {
// REMOVING ( OR COMMENT) LINE !!!! ast_mutex_lock(&q->lock);
cur = q->members;
while(cur) {
if (!strcasecmp(sc->dev, cur->interface)) {
if (cur->status != sc->state) {
cur->status = sc->state;
if (!q->maskmemberstatus) {
manager_event(EVENT_FLAG_AGENT, "QueueMemberStatus",
"Queue: %s\r\n"
"Location: %s\r\n"
"Membership: %s\r\n"
"Penalty: %d\r\n"
"CallsTaken: %d\r\n"
"LastCall: %d\r\n"
"Status: %d\r\n"
"Paused: %d\r\n",
q->name, cur->interface, cur->dynamic ? "dynamic" : "static",
cur->penalty, cur->calls, (int)cur->lastcall, cur->status, cur->paused);
}
}
}
cur = cur->next;
}
// REMOVING (OR COMMENT) LINE !!!! ast_mutex_unlock(&q->lock);
}
ast_mutex_unlock(&qlock);
free(sc);
return NULL;
}

By: jalsot (jalsot) 2006-01-12 13:57:33.000-0600

As a note:
My collegue (ptiszai) didn't mention that we don't use weights in queues (every queue has weight undefined).
We were monitoring ASTERISK-6002 and ASTERISK-6044, however those can help in case, the weight parameter is used.

I think our problem relates to some lockings in queue application or agent channel, something similar as was solved in mentioned issues.

Could anybody revisit those codes?
(we checked all of our modifications we made with our patch - attached - and it seems, that cannot cause this deadlock).

By: jalsot (jalsot) 2006-01-12 14:09:16.000-0600

I read on some places that removing LD_ASSUME_KERNEL set could help in some problems. What do you think? Could it relate to such problem? (we unset it, but we would like to avoid next such trouble - customer starts to be angry)

By: Olle Johansson (oej) 2006-04-07 01:53:31

Since this is a modified Asterisk, it's not really a bug for the bug tracker. Maybe you should ask your questions on the asterisk-dev mailing list to get some feedback on your changes.

By: Andrey S Pankov (casper) 2006-04-07 10:18:50

DEBUG[21182] channel.c: Avoiding initial deadlock for 'Zap/36-1'

I'm about sure this is form ast_setstate() doing ast_get_channel_by_name_locked() (please see http://bugs.digium.com/view.php?id=6821 for details)

By: Serge Vecher (serge-v) 2006-04-07 10:30:32

Does this problem still occur? It's been 3 months with no report from ptiszai

By: Andrey S Pankov (casper) 2006-04-07 11:12:07

> That's a production box, we have a very
> similar box (just the motherboard differs a bit) with exactly the same
> asterisk version, kernel, configuration and didn't notice the problem at all.

Even if devicestate subsystem and app_queue are still buggy this issue may be safely closed. It's to reporter to re-open it if he has smth that would help to find a bug.

By: Tiszai (ptiszai) 2006-04-12 01:27:03

Checking out the newest 1.2 branch code from SVN and it see may be solve the problem.

By: BJ Weschke (bweschke) 2006-04-12 20:40:34

ptiszai  - please let us know as there is another bug I've just linked to this one that looks very similar.

By: richardhh (richardhh) 2006-04-13 01:57:55

Bweschke, thank you for making the relationship between 6153 and the one i reported; 6950.

I continue adding notes here and put a link in the other bugreport
(sorry, i DID use search but this one here didnt show up).

We also dont use weight in our queues, i saw the other bugreport and checked it.
Our box is pure 1.2.4, unpatched, E1 Pri 31 channel card, 2K calls per day

It DOES look like a ressourceproblem, it happens when many calls took place, many agents log in and off and when the machine is up for some time (3-5 days).
The threads (processlist) are not growing in size (not "unaturally") so it looks like reserved memory not freed by asterisk.

Btw: The rest of the whole box is running fine when this happens:
I can use SSH (putty), i can still ping the machine, i can even do other things on the commandprompt. Its only Asterisk looping the message in its CLI and stop working (responding).

I tried to issue a "injected reset" to asterisk from a 2nd commandline like:
asterisk -r -x restart now. Asterisk is restarting then but the problem remains.

Only a restart of the whole machine is helping.

I didnt try 1.2.6 so far, but it looks like i have to install 1.2.6 tonite.

Will report back asap.



By: richardhh (richardhh) 2006-04-13 14:20:38

Can we PLEASE change the priority of this bug from "MINOR" to "MAJOR" ?

Im spotting this bug affecting more and more people regarding the leading asterisk forums.

More and more peeps are reporting this "memory leak" and its causing SERIOUS problems out there on production servers.

Im willing to help/support tracing down this bug in every needed way, but it needs to get adressed ASAP and with all required seriousity, its a MAJOR problem in companys having asterisk running with a bigger call load.

See here:
http://forums.digium.com/

There are already 3 similar reported memory leak problems.

Let me emphase again: This is a VERY SERIOUS problem !

My machine lost 300!!! MB today, within 10 hours....

Like said, no rant (im not native english, so it might sound like, so sorry in advance) - but this needs a high priority.

Did i meation "RED ALLLERRRTT" already ? :P

BTW, i lost a buttload RAM also in the nightime, with asterisk running but no calls present...odd huh ?

When asterisk isnt running, there is no leak, so it IS asterisk.



By: Serge Vecher (serge-v) 2006-04-13 14:39:29

RichardHH: did you in fact upgrade to 1.2.6 and still observe this problem? Also, please note that v.1.2.7.1 was released last night with some memory leaks resolved (descriptor leak and others). Please, always report which version or revision of Asterisk you are using and having a problem with so that developers have enough information to help you.

If you are still having a problem with v.1.2.7.1, please follow the following instructions originally written by jamesgolovich in one of the older bugs:

"I've found the best way to debug memory leaks in asterisk is by using the built in memory debugger that can be enabled.

To enable it edit the Makefile in the asterisk src dir, and change the following line (101):
MALLOC_DEBUG = #-include $(PWD)/include/asterisk/astmm.h
to:
MALLOC_DEBUG = -include $(PWD)/include/asterisk/astmm.h

Then make sure to 'make clean' and recompile asterisk.

Once thats happened you can use the commands 'show memory summary' and 'show memory allocations <filename>' to help track down the leak.

valgrind has also been very useful in tracking down leaks."

Good luck!



By: richardhh (richardhh) 2006-04-13 17:43:11

Thanks for the hints, vechers.

Yep, figured the 1.2.7 release earlier, made the update just.

Will report back tomorrow morning (GMT+1 here) what it looks like.
We have easter-holidays here in germany, so no load on the pbx for 4 days :(

Richard

By: richardhh (richardhh) 2006-04-13 17:53:58

uuuhhhh.....i THINK i know where it comes from, here is a hot tip:

MUSIC ON HOLD !

Could it be, that the musicfiles are cached in RAM but not freed ?
We have 75 diff. tracks which make a total of 275 MB...

And i just figured that i loose memory each time another track is played by MOH...

By: Serge Vecher (serge-v) 2006-04-19 11:15:35

RichardHH: are you using format_mp3 by any chance for MOH? If yes, I would highly recommend to convert those files into a raw format and not use format_mp3 module (which has known issues and no known maintainer).

By: richardhh (richardhh) 2006-04-20 09:54:06

Hi Vechers,

nope, i dont use mpeg123 (mpeg321) for asterisk, that was the very first thing i changed when i installed asterisk.

I am using RAW format, 16 bit signed, 8KHz mono.

The problem is, i cant afford in ANY way to have our production PBX going down, thats why i have a daily cron running now (shutdown -r now) at 06:01am.

So unfortunatly, i cant say if the bug is still present, lol.

I updated to 1.2.7 the other night, shortly before 1.2.7.1 was released.
1.2.7 is the now running version here.

Well, i guess i cant really help tracking it down anymore due to my cron job :(

Im so sorry, but the PBX being up is the absolute #1 prio here.

If i can provide anymore help without "disturbing" our system, i'd gladly help.

PS.:
I had valgrind running on the box, all was fine, no leaks reported.
BUT.....our asterisk is running as a respawn daemon which will of course not "really end", so valgrind has no chance to report a possible leak since asterisk isnt really ending.

To do that, i need to change the inittab, comment out asterisk there, restart, start * by hand with valgrind and have it REALLY ending to see if something leaks.

But like said before....production system with support lines 24/7.
Only chance is somewhere on the weekend....

yay...weekend work...fun (not) :P

I have built a spareserver (always good to have one..) incase the mainmachine freaks out someday.

Just waiting for the second digium 110P card, then i can swap servers, prepare the problem-machine for debug, swap back and watch it.



By: Serge Vecher (serge-v) 2006-04-20 10:28:10

well, since the original problem as reported by ptiszai was queue-related and richardhh has linked his problems to music-on-hold, let's close this issue down.

richardhh: if your problems resurface on the backup server, please file another bug report.

By: Serge Vecher (serge-v) 2006-05-03 09:28:21

Closing this report as the original poster, ptiszai, has not indicated that this problem persists in the latest stable v.1.2.7.1