Summary: | ASTERISK-19277: [patch]endlessly repeating error: "poll failed: Bad file descriptor" | ||
Reporter: | Barry Chern (bchern) | Labels: | |
Date Opened: | 2012-01-31 03:06:18.000-0600 | Date Closed: | 2015-07-02 07:55:16 |
Priority: | Minor | Regression? | |
Status: | Closed/Complete | Components: | Resources/res_timing_kqueue Resources/res_timing_timerfd |
Versions: | 1.8.9.0 | Frequency of Occurrence | Constant |
Related Issues: | |||
Environment: | OS X 10.5.8 on PPC Mac (G5) | Attachments: | ( 0) attnxfer.sh ( 1) attnxfer-alice.xml ( 2) attnxfer-bob-from-alice-to-charlie.xml ( 3) attnxfer-bob-to-charlie.xml ( 4) attnxfer-charlie.xml ( 5) attnxfer-extensions.conf ( 6) attnxfer-sip.conf ( 7) issueA19277-fix_closing_of_fd_0.patch ( 8) onePointEightIssueLog |
Description: | Attempting to move to 1.8 from a long-time working, not very complex, 1.6.x installation. Started with a download of the last release (1.8.8), then tried latest builds via SVN twice. Latest attempt with Asterisk SVN-branch-1.8-r353175. After apparently executing all of the steps of a dialplan without a problem, and hanging up, there is one error that seems to be incorrect, saying the csv couldn't be written to a readonly database, but the csv data HAS been written when I check on it. there follows a fast stream of these errors (after a slight pause) ad infinitum until I either cancel or the process kills itself by crashing: *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> *CLI> [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor [Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor *[Jan 30 04:32:14] ERROR[13942]: asterisk.c:2218 ast_el_read_char: poll failed: Bad file descriptor etc.etc. etc. (edited for brevity) | ||
Comments: | By: Matt Jordan (mjordan) 2012-01-31 08:24:48.752-0600 We require a complete debug log to help triage the issue. This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information By: Barry Chern (bchern) 2012-02-03 01:17:00.512-0600 This is the log. Note that I cancelled the process because it showed no signs of ending the repeat of the error or crashing, and I had to delete many of the repeats at the end to make the file under 10 MB. By: Barry Chern (bchern) 2012-02-03 01:18:42.365-0600 I have uploaded the log, with comment. In addition, note that the error happened after receiving my test call and recording voicemail. By: Matt Jordan (mjordan) 2012-02-09 14:45:04.740-0600 Although I'm not convinced its causing your problem, the error statement that the database is read-only is coming from the lower level SQLite3 driver - so it, at least, thinks there is a problem. Whether or not the write is actually succeeding in spite of that error is entirely up to the driver. You may want to attempt to perform an INSERT into the CDR table under the same user permissions as what Asterisk executes under to see if it reports a similar problem. By: Barry Chern (bchern) 2012-02-18 05:35:44.476-0600 I doubted the database error caused the fatal "poll failed" error just because there seemed to be such a long pause between them. but, do we have any idea what the "poll" is, what is being polled, and why it is so persistent in re-trying it? By: laszlovl (lvl) 2012-02-18 10:51:10.010-0600 Are you by any chance using the asterisk HTTP manager interface? By: Barry Chern (bchern) 2012-02-19 14:43:27.915-0600 no, just CLI By: Walter Doekes (wdoekes) 2012-02-20 03:34:05.988-0600 > do we have any idea what the "poll" is, what is being polled, and why it is so persistent in re-trying it? ast_el_read_char is reading characters from standard input and ast_consock, i.e. the input on the command line and the asterisk.ctl unix socket. They're both not expected to disappear., so it's probably not handled in the prettiest of ways. As far as I can see, a remote console only breaks if a write to stdout (STDOUT_FILENO==1) fails: {code} if (!ebuf && write(1, "", 1) < 0) break; {code} So that would have it continue indefinitely. The question: why are the descriptors bad/gone? By: Walter Doekes (wdoekes) 2015-06-30 05:27:31.082-0500 Okay, so this apparently still exists (in Asterisk GIT-11-96bbcf4M). I've found a way to reproduce it too: - ulimit -n 128 # to speed it up - asterisk -c - start a bunch of calls so we run out of FDs - press a key in the console - result: a flood of CLI> prompts, and the console is unusable At that point we arrive here (this is not a patch, so you shall not remove it -- and if it we're, I hereby grant you permission to use it): {noformat} --- a/main/asterisk.c +++ b/main/asterisk.c @@ -2530,8 +2530,21 @@ static int ast_el_read_char(EditLine *editline, char *cp) fprintf(stderr, "poll failed: %s\n", strerror(errno)); break; } + + /* revents can contains POLLIN, but also POLLERR, POLLHUP and POLLNVAL. + * If the process is out of file descriptors, we get POLLNVAL. At this + * point, the console is busted. */ + if ((fds[0].revents & POLLNVAL) == POLLNVAL || + (max > 1 && (fds[1].revents & POLLNVAL) == POLLNVAL)) { + fprintf(stderr, "Poll returned POLLNVAL, are you out of file descriptors?\n"); + fflush(stderr); + usleep(500000); + return 0; + } - if (!ast_opt_exec && fds[1].revents) { + if (max > 1 && fds[1].revents) { num_read = read(STDIN_FILENO, cp, 1); if (num_read < 1) { break; {noformat} Unfortunetely, that isn't enough. At this point the console is busted. So, it would seem that STDIN_FILENO is close()d by someone. Except, my GDB cannot find out who. I tried this: {noformat} $ sudo gdb `which asterisk` ... (gdb) run -c ... ...........................................................................................................................................................Asterisk Ready. [New Thread 0x7fffa70f4700 (LWP 10377)] *CLI> {noformat} At this point, the FDs look like this: {noformat} # ls /proc/`pidof asterisk`/fd -lv total 0 lrwx------ 1 root root 64 jun 30 12:18 0 -> /dev/pts/37 lrwx------ 1 root root 64 jun 30 12:18 1 -> /dev/pts/37 lrwx------ 1 root root 64 jun 30 12:18 2 -> /dev/pts/37 lr-x------ 1 root root 64 jun 30 12:18 3 -> /dev/urandom lrwx------ 1 root root 64 jun 30 12:18 4 -> socket:[1511689] lrwx------ 1 root root 64 jun 30 12:18 5 -> /var/lib/asterisk/astdb.sqlite3 ... {noformat} Add some breakpoints: {noformat} *CLI> Program received signal SIGINT, Interrupt. 0x00007ffff78ff12d in poll () at ../sysdeps/unix/syscall-template.S:81 81 ../sysdeps/unix/syscall-template.S: No such file or directory. (gdb) break close Breakpoint 1 at 0x7ffff67a63f0: close. (2 locations) (gdb) cond 1 $rbx == 0 (gdb) break dup2 Breakpoint 2 at 0x7ffff78fdfe0: file ../sysdeps/unix/syscall-template.S, line 81. (gdb) cond 2 $rcx == 0 (gdb) break dup3 Breakpoint 3 at 0x7ffff78fe010: file ../sysdeps/unix/syscall-template.S, line 81. (gdb) cond 3 $rcx == 0 (gdb) info break Num Type Disp Enb Address What 1 breakpoint keep y <MULTIPLE> stop only if $rbx == 0 1.1 y 0x00007ffff67a63f0 ../sysdeps/unix/syscall-template.S:81 1.2 y 0x00007ffff78fdf50 ../sysdeps/unix/syscall-template.S:81 3 breakpoint keep y 0x00007ffff78fdfe0 ../sysdeps/unix/syscall-template.S:81 stop only if $rcx == 0 3 breakpoint keep y 0x00007ffff78fe010 ../sysdeps/unix/syscall-template.S:81 stop only if $rcx == 0 {noformat} Carry on and start flooding the thing: {noformat} (gdb) cont Continuing. ... [New Thread 0x7fffa6ffc700 (LWP 10529)] Poll returned POLLNVAL, are you out of file descriptors? [Thread 0x7fffa6ffc700 (LWP 10529) exited] Breakpoint 1, close () at ../sysdeps/unix/syscall-template.S:81 81 ../sysdeps/unix/syscall-template.S: No such file or directory. (gdb) bt #0 close () at ../sysdeps/unix/syscall-template.S:81 #1 0x000000000042c7f7 in ast_ouraddrfor (them=them@entry=0x17b78d8, us=us@entry=0x17b7ba8) at acl.c:972 #2 0x00007fffb9907c5c in ast_sip_ouraddrfor (them=0x17b78d8, us=us@entry=0x17b7ba8, p=0x17b3178) at chan_sip.c:3801 #3 0x00007fffb9934f8f in sip_alloc (callid=callid@entry=0x179b0b2 "163-10403@127.0.0.1", addr=addr@entry=0x7fffa73dbd20, useglobal_nat=useglobal_nat@entry=1, intended_method=intended_method@entry=5, req=req@entry=0x7fffa73db2b0, logger_callid=logger_callid@entry=0x1513df0) at chan_sip.c:8584 ... (gdb) up #1 0x000000000042c7f7 in ast_ouraddrfor (them=them@entry=0x17b78d8, us=us@entry=0x17b7ba8) at acl.c:972 972 close(s); (gdb) print s $2 = 0 (gdb) list 960 955 if ((s = socket(ast_sockaddr_is_ipv6(them) ? AF_INET6 : AF_INET, 956 SOCK_DGRAM, 0)) < 0) { 957 ast_log(LOG_ERROR, "Cannot create socket\n"); 958 return -1; ... {noformat} And indeed, fd 0 has been replaced. It has been (silently) closed by someone/something, and then a socket() call got hold of it. {noformat} # ls /proc/`pidof asterisk`/fd -lv total 0 lrwx------ 1 root root 64 jun 30 12:18 0 -> socket:[1513038] lrwx------ 1 root root 64 jun 30 12:18 1 -> /dev/pts/37 lrwx------ 1 root root 64 jun 30 12:18 2 -> /dev/pts/37 lr-x------ 1 root root 64 jun 30 12:18 3 -> /dev/urandom lrwx------ 1 root root 64 jun 30 12:18 4 -> socket:[1511689] lrwx------ 1 root root 64 jun 30 12:18 5 -> /var/lib/asterisk/astdb.sqlite3 ... {noformat} Weird no? I would have expected a breakpoint on either close, dup2 or dup3 the first time around. I cannot figure out how this happens. By: Walter Doekes (wdoekes) 2015-06-30 06:53:46.124-0500 For the record, I was testing something else and ran into this while running the attnxfer.sh test. See the attached files starting with {{attnxfer}}. Note that I'm not doing any (explicit) SQLite stuff. All files except sip.conf and extensions.conf are from {{make samples}}: {noformat} $ ls /etc/asterisk/ -1 asterisk.conf attnxfer-alice.xml attnxfer-bob-from-alice-to-charlie.xml attnxfer-bob-to-charlie.xml attnxfer-charlie.xml attnxfer.sh extensions.conf modules.conf sip.conf {noformat} By: Walter Doekes (wdoekes) 2015-07-01 04:35:12.774-0500 Aha! So, I shouldn't have been looking at {{%rbx}} but at {{%rdi}} -- which I noticed when debugging ASTERISK-25220. Now it all became clear. The issue is in {{res_timing_timerfd.c}} (and kqueue). See attached patch: {{issueA19277-fix_closing_of_fd_0.patch}} |