ASTERISK-18223: using http manager commands causes global file descriptor instability, crashing Asterisk

[Home]

Summary: ASTERISK-18223: using http manager commands causes global file descriptor instability, crashing Asterisk

Reporter: James Van Vleet (jvanvleet) Labels:

Date Opened: 2011-08-03 10:55:42 Date Closed: 2012-04-10 14:04:58

Priority: Critical Regression?

Status: Closed/Complete Components: Core/HTTP

Versions: Frequency of
Occurrence Constant

Related
Issues:
is related to ASTERISK-18453 manager.c: HTTP Manager, fdopen failed: Bad file descriptor!

Environment: Attachments: ( 0) asterisk-18223.pl
( 1) asterisk-httpmanager-filedescriptor.patch

Description: I have been synthetically testing our asterisk environment and am running into a repeatable issue. After a fresh asterisk restart I can pump SIP calls through the system fine. I am currently running sets of about 170 calls with a max concurrent call count in the 10-20 range. The calls in question are recorded and use fastAGI.

The system will process approx the first 1000 calls without an issue - load average and CPU usage are in an acceptable range. After approx 1000 calls have been processed the same calls will consume higher and higher amounts of CPU/load until the same number of concurrent calls start getting "bad file descriptor" in areas all over asterisk - opening sockets, opening files and reading the timer device. Eventually something will happen to cause asterisk to crash.

Once asterisk is restarted the next 1000(ish) calls work fine and the system will repeatably break in the 1000-2000 call range.

I have tried both timerFD and DAHDI. I have tried asterisk 1.8.0, 1.8.5 and SVN as of a few days ago. I have tried running as root and turned on and off realtime priority. File limit 65k. None of these have the slightest impact on the issue.

If once I approach 1000 processed calls I leave the system it for 15 minutes or so it does not matter - somewhere in the next 1000 calls asterisk will start to consume large CPU doing the same old synthetic calls and eventually fail in some manner (generally dealing with opening files - fdopen, ioctl, etc). Sometimes either timerfd or dahdi get an error working with their devices, sometimes a file can't be read or wrote to, etc.

So, is there a recommendation on the best way to track this down? Unfortunately I can't claim a certain area of code, module or anything. All I have that is useful is that I can repeat it without fail. :-)

I am kind of hoping for suggestions on how to best isolate this - in the mean time I plan to make (at best) educated guess type changes in the system to try to narrow down what is happening.

[Edit: Matt Jordan]
The apparent solution was provided by Rosen:

Hi guys,

I was digging into this problem for a couple of days.
I did manage to get rid of that error message:
[Oct 2 11:24:16] WARNING[90943] manager.c: HTTP Manager, fdopen failed: Bad file descriptor!

The problem was solved for us, by removing
close(s->fd);
from process_output.
manager.c Revision: 331315
on line 5510

"The fdopen() function associates a stream with the existing file descriptor, fildes. The mode of the stream must be compatible with the mode of the file descriptor. When the stream is closed via fclose(3), fildes is closed also."

Comments: By: Paul Belanger (pabelanger) 2011-08-03 12:37:00.976-0500

Thanks for your comments. This does not appear to be a bug report and we are closing it. We appreciate the difficulties you are facing, but it would make more sense to raise your question in the support tracker, http://www.asterisk.org/support.

We cannot provide support on the issue tracker, please track this to IRC or the mailing lists, once you have the proper information (backtrace in the case of Asterisk crashing) you can reopen this issue.
By: James Van Vleet (jvanvleet) 2011-08-03 16:06:43.294-0500

Really? I was asked to by David V: https://issues.asterisk.org/jira/browse/ASTERISK-18082?focusedCommentId=180723#comment-180723

By: James Van Vleet (jvanvleet) 2011-08-04 10:02:56.009-0500

After getting shot down on this (even though this a bug) I did some more testing and am becoming convinced this is an issue with FASTAGI. I turned off call recording and used only fastagi with ims_bench testing and got the same result - after about 1000-2000 consecutive calls asterisk consumes lots of CPU and fails.

Please note the likely relation with the following other tickets:

https://issues.asterisk.org/jira/browse/ASTERISK-18142
https://issues.asterisk.org/jira/browse/ASTERISK-5052

I suspect now only heavy fastAGI will make this happen.
By: David Vossel (dvossel) 2011-08-10 10:52:50.714-0500

So is this a fd leak? What happens if you up the number of fds Asterisk is allowed to open? Does this problem take longer to occur?
By: David Vossel (dvossel) 2011-08-10 10:54:34.655-0500

I'm re-opening this issue to get more feedback. This may be an issue.
By: Leif Madsen (lmadsen) 2011-08-10 16:14:40.350-0500

Assigned to James to provide additional information per David.
By: James Van Vleet (jvanvleet) 2011-08-10 21:45:18.658-0500

I actually have run my tests with and without a ulimit at 65k and it makes no difference. I have run the tests watching both an lsof and the variant that cats something in /proc having to do with the pid of asterisk (I am away from my main PC at the moment) and the number of open files does not "leak" before the system starts to fail. The number of open files (max of 300-400) increases with the number of concurrent connections (100 legs or 50 calls ) but also decreases back to the original number as the number of concurrent calls goes back to zero. However, even after letting the number of concurrent calls go to zero and the number of fds settle back to the original number (something like 50) if I continue with calls it will inevitably fail in the same 1000-2000 calls.

For the record I have also tested since the last update:

- with and without the real-time priority option in asterisk
- with timerfd and with DAHDI as timing sources
- 1000hz and 250hz kernel timing
- turning on and off call monitoring (recording)
... and none of these change the result.

Based on some other open issues I am becoming more convinced that something in fastAGI is the culprit but getting that turned off in my test environment will be a bit challenging. It is one thing that is always present in my tests. I hope to do a non fastAGI test this week but it may be into next week before that is doable as it depends on some client issues I am working that trump this effort.

Always open to suggestions of things to turn on/off or monitor in a test run otherwise I will be reporting again once the fastAGI related tests are done.

Thanks!

By: James Van Vleet (jvanvleet) 2011-08-15 16:39:30.382-0500

I was able to do some troubleshooting today and I think I narrowed things down quite a bit.

I put in a basic dialplan that does:

exten => 1001,1,Answer()
exten => 1001,n,Wait(1)
exten => 1001,n,Agi(agi://127.0.0.1/agi_DoDialplanAudioTest)
;exten => 1001,n,Playback(199)
exten => 1001,n,Hangup()

I am using perl's Asterisk::FastAGI and I have reduced the dialplan to super simple:

sub agi_DoDialplanAudioTest {
$self = shift;
$self->agi->exec( 'Playback', '199' );
return;
}

So in basic my test system here can either activate playing the "199" file through the dialplan or it can do the exact same thing through FastAGI.

When running without FastAGI, I ran over 6000 calls without a single issue. The system appears to be rock solid.

The same tests with FastAGI and nothing but the playing of the same file yield the stability issues in Asterisk every time. The same test fails somewhere between 500-2000 calls with some sort of semi-random failure to read one sort of file descriptor or another - all over the code:
[Aug 15 18:17:46] WARNING[29741]: res_rtp_asterisk.c:1657 ast_rtcp_read: RTCP Read error: Socket operation on non-socket. Hanging up.

[Aug 15 18:31:24] WARNING[30552]: acl.c:714 ast_ouraddrfor: Cannot get socket name

[Aug 15 21:28:48] WARNING[25031]: res_agi.c:1476 launch_netscript: Fnctl(F_SETFL) failed: Bad file descriptor

[Aug 15 18:33:54] WARNING[7918]: format_pcm.c:140 pcm_seek: offset too large 0, truncating to -1
[Aug 15 18:33:54] WARNING[7918]: app_playback.c:471 playback_exec: ast_streamfile failed on SIP/c5.voicegw.bolderthinking.info-0000049a for 199

[Aug 15 21:29:46] ERROR[26002]: utils.c:1164 ast_carefulwrite: write() returned error: Bad file descriptor
[Aug 15 21:29:46] WARNING[26002]: res_agi.c:1506 launch_netscript: Connect to 'agi://127.0.0.1/agi_DoDialplanAudioTest' failed: Bad file descriptor

[Aug 15 18:18:08] ERROR[30136]: res_timing_timerfd.c:171 timerfd_timer_ack: Read error: Bad file descriptor
[Aug 15 18:18:08] ERROR[30136]: res_timing_timerfd.c:171 timerfd_timer_ack: Read error: Bad file descriptor
[Aug 15 18:18:08] ERROR[30136]: res_timing_timerfd.c:171 timerfd_timer_ack: Read error: Bad file descriptor

(or I get the equivalent DAHDI one if I am testing with DAHDI - makes no difference)

I have also got a file descriptor error after approx 2000 calls with just a return in fastagi (not even playing any files, just returning control back to the dialplan)

So at the moment it appears that continued fastAGI use of any sort will sooner or later cause some sort of wide scale file descriptor corruption in my system. Not using FastAGI seems to either not cause the problem or push it out past 6000 calls.

I am going to dig out the exact FastAGI commands sent to see if anything beyond the playback is being done on each call, but this is what I know at the moment.

Thanks!

By: James Van Vleet (jvanvleet) 2011-08-15 16:50:16.405-0500

Found agi debugging. :-) we only send a single command:

AGI Rx << EXEC Playback "199"

... and nothing else.

By: James Van Vleet (jvanvleet) 2011-08-15 17:23:11.875-0500

Last update for the day (I hope). I tested again sending nothing in fastAGI (just returning and confirming sending nothing back in the agi debug) and one of my many file descriptor issues:

[Aug 15 22:19:14] WARNING[15890]: res_rtp_asterisk.c:2011 ast_rtp_read: RTP Read error: Bad file descriptor. Hanging up.

within the first 400 calls.

So it appears to be around just the act of opening a fastagi connection enough times?

By: James Van Vleet (jvanvleet) 2011-08-15 18:19:33.431-0500

Did a quick test of fastAGIs running longer than the call (basically only had to play 199 twice) instead of the agi ending before the call ends. No impact on the issue.

By: James Van Vleet (jvanvleet) 2011-08-18 17:13:27.206-0500

Continued rounds of testing looks like we have narrowed this down to being by a commit somewhere between 1.8.4.4 and 1.8.5.0rc-1.

By: James Van Vleet (jvanvleet) 2011-08-19 13:45:07.815-0500

I got it! After compiling and testing about 50 different svn commit tags, I have found the issue. The problem is not in FastAGI as I originally thought, but is instead a bug in the HTTP Manager.

Specifically, this commit/issue is causing the instability in Asterisk:

ASTERISK-16460
http://svnview.digium.com/svn/asterisk?view=revision&revision=323608

It turns out that my test system had another backrgound task using the http manager every X seconds. Running any http manager commands along with call volume results in your file descriptors getting mangled all over the system. I have tested before and after this commit and the issue starts there.

I am attaching a perl script that will cause the issue to occur. I suspect it will crash the system sooner or later if left alone but it must take a long time. It always crashes my system in the many ways I list above if you pump about 500 calls through as I describe. In general I suspect that the more file handles you are using the bigger chance whatever is in this manager commit will mess them up. My FastAGI call test opens a lot of network handles and files handles.

I did test with an invalid http manager login before posting this to make sure the remote crashing ability is not to easy to pull off. The system does not appear to go unstable unless you have a valid http manager login.

Whew! This issue was a bugger to find.

By: James Van Vleet (jvanvleet) 2011-08-19 13:46:24.242-0500

quick and dirty perl script that was used to recreate the file descriptor instability
By: James Van Vleet (jvanvleet) 2011-08-19 13:56:54.478-0500

Found the specific bad commit and attached code to reproduce the issue. This should be a critical issue for anyone using http manager commands beyond 1.8.4.4
By: Sean Bright (seanbright) 2011-08-19 15:52:56.068-0500

Do you have a backtrace? I'm not seeing it attached to the issue. Also, what platform/arch are you running this on?
By: James Van Vleet (jvanvleet) 2011-08-19 16:20:58.387-0500

I am running linux 2.6.38-11 i686 arch. I can get a backtrace but I am not sure of what you want a backtrace of. As I mentioned above continued running of the command causes FDs to go crazy everywhere - RTP, SIP, file reading, CDR writing, reading the timer device etc. It does not actually cause a system crash to get a backtrace from or if the system does crash it is due to something like not being able to read the timerFD or DAHDI timer device anymore. Most cases what happens is that the phone call that happens to be using whatever FD gets wacked, gets wacked.

So anyhow if you want a backtrace that sort of happens at some point after I get one of these errors on the console I can do that. But I would guess in most cases the call is aborted and there is not going to be much left to see. But that is coming from a guy that does not know backtraces nearly so well so if that is how you want me to get a BT just let me know and I can make it happen. Also let me know what commit rev of asterisk to do the test with. I can make it happen with complete regularity on any commit 323608 and greater - all of the various 1.8.5 and 1.8.6 tags and commits. I can never make it happen in the commit one prior to 323608. Also, it never ever happens not matter how many thousands of calls I send if I am not running the manager commands (such as in the harness). So it is something with the way that code change is dealing with FDs that must be different from the rest of asterisk.

Let me know the version and scenario you want the BT for and I will get on it.

Thanks!

By: Sean Bright (seanbright) 2011-08-20 19:01:02.820-0500

This is fine, I probably didn't read closely enough. No crash. Got it.
By: James Van Vleet (jvanvleet) 2011-08-20 22:15:50.563-0500

No worries. :-) Just let me know if you do need anything. We have worked around needing http manager for the time being but I am happy to test and/or gather more data to get it sorted.

By: Jaco Kroon (jkroon) 2011-09-07 10:45:12.932-0500

Hi,

I believe this has been fixed for 1.8.6.0 - looking at the changelog I know there was a commit to fix this issue (caused about 5+ daily crashes for me). Didn't track it to the specific commit, but rather located it from the double-free bug that caused the memory corruption.

from http://downloads.asterisk.org/pub/telephony/asterisk/ChangeLog-1.8.6.0

2011-07-12 22:53 +0000 [r327950] Kevin P. Fleming <kpfleming@digium.com>

* main/manager.c: Correct double-free situation in manager output
processing. The process_output() function calls ast_str_append()
and xml_translate() on its 'out' parameter, which is a pointer to
an ast_str buffer. If either of these functions need to
reallocate the ast_str so it will have more space, they will free
the existing buffer and allocate a new one, returning the address
of the new one. However, because process_output only receives a
pointer to the ast_str, not a pointer to its caller's variable
holding the pointer, if the original ast_str is freed, the caller
will not know, and will continue to use it (and later attempt to
free it). (reported by jkroon on #asterisk-dev)
By: James Van Vleet (jvanvleet) 2011-09-07 13:28:06.167-0500

No, this must be a different issue. I was reproducing it in a 1.8.6 RC (might have even be RC3 but the fix above appears to have been in RC1 even)
By: Roman Shubovich (rom1kz) 2011-09-19 11:41:25.995-0500

i have this issue too
asterisk 1.8.7.0 rc1
ubuntu natty, kernel 2.6.32.41
perl FastAGI module 0.02
and yes, a use http manager too (for redirect channels and originate calls)

-- Executing [XXXXXXXX@originate:2] AGI("Local/1232999402@conf-client-d7cc;1", "agi://127.0.0.1:10433/dial") in new stack
[Sep 19 20:34:31] WARNING[17400]: res_agi.c:1475 launch_netscript: Fcntl(F_GETFL) failed: Bad file descriptor

in other time i have this:
-- Executing [XXXXXXXX@originate:2] AGI("Local/7477662287@conf-client-5fee;1", "agi://127.0.0.1:10433/dial") in new stack
[Sep 19 20:05:19] ERROR[1336]: utils.c:1164 ast_carefulwrite: write() returned error: Bad file descriptor

By: Roman Shubovich (rom1kz) 2011-09-20 08:23:10.203-0500

asterisk 1.6.20
same result

-- Executing [XXXXXXXX@originate:2] AGI("Local/3770741383@conf-client-b6e0;1", "agi://127.0.0.1:10433/dial") in new stack
[Sep 20 17:20:11] ERROR[3773]: utils.c:1130 ast_carefulwrite: write() returned error: Connection refused
[Sep 20 17:20:11] WARNING[3773]: res_agi.c:850 launch_netscript: Connect to 'agi://127.0.0.1:10433/dial' failed: Connection refused
By: Rosen Iliev (rokoboko) 2011-10-03 15:50:29.387-0500

Hi guys,

I was digging into this problem for a couple of days.
I did manage to get rid of that error message:
[Oct 2 11:24:16] WARNING[90943] manager.c: HTTP Manager, fdopen failed: Bad file descriptor!

The problem was solved for us, by removing
close(s->fd);
from process_output.
manager.c Revision: 331315
on line 5510

"The fdopen() function associates a stream with the existing file descriptor, fildes. The mode of the stream must be compatible with the mode of the file descriptor. When the stream is closed via fclose(3), fildes is closed also."

Regards,

Rosen
By: Erik Wallin (erikw@pushtalk.se) 2011-10-04 16:43:46.556-0500

This seems to be the same issue as ASTERISK-18453

See also my comments there on a few other possible cases where file descriptors might get closed twice.
By: Rosen Iliev (rokoboko) 2011-10-09 21:32:39.645-0500

Hi guys,

Just to confirm, after removing close(s->fd); from line 5510 (manager.c Revision: 331315)
It has been 6 days now, and I have no more Bad file descriptor! messages, nor res_timing_dahdi or chan_sip messages. Before on busy hours my error log use to grow like crazy, filled up with res_timming_dahdi errors and chan_sip errors.

Regards,

Rosen
By: laszlovl (lvl) 2012-02-15 11:23:47.054-0600

I was experiencing several issues regarding bad file descriptors aswell (ASTERISK-19347), after further investigation it seems like I was suffering from the issue described in this bugreport.

To confirm, I setup a testcase to simulate 5-10 simultaneous calls, and created a script that continuously connects to the HTTP manager port and queries for information. This setup generated thousands of error messages regarding a bad file descriptor in res_timerfd after around 30 minutes.

I then created a patch that does exactly what Rosen Iliev suggested in his comment, compiled it in, restarted the testcase and Asterisk has been running without any problems since then: around 5 hours ago.

I took the liberty to upload a patch for this, as this should be merged to the trunk ASAP! All credits however go to the people above me who pinpointed this issue. Thanks!
By: James Van Vleet (jvanvleet) 2012-02-15 11:43:04.058-0600

Glad to hear it helped. I am surprised this did not get committed by someone in October - it must be giving a bucket of tears to anyone using the HTTP manager interface. I was surprised to get an update on this ticket. :-)

I would have been all over it but we were only using the HTTP interface for a single simple function so we just stopped using HTTP. I had thought Rosen's fix had got in but had no reason to track it.
By: Jordan Kirby (jordankirby) 2012-03-28 09:28:50.121-0500

We have seen the same problem in our test lab and can confirm that removing that line does remove the problem occurring.
What is the process for getting this fix validated by the Asterisk developers and committed into the source tree?
By: laszlovl (lvl) 2012-03-28 09:31:51.303-0500

I believe this is fixed in https://reviewboard.asterisk.org/r/1793/ - two weeks ago
By: Jonathan Rose (jrose) 2012-04-10 14:03:55.074-0500

Yes, this was fixed via ASTERISK-18453