Summary: | ASTERISK-19347: File descriptor errors from res_timing_timerfd | ||
Reporter: | laszlovl (lvl) | Labels: | |
Date Opened: | 2012-02-13 07:17:22.000-0600 | Date Closed: | 2012-02-15 11:11:32.000-0600 |
Priority: | Major | Regression? | |
Status: | Closed/Complete | Components: | Resources/res_timing_timerfd |
Versions: | 1.8.9.0 | Frequency of Occurrence | Occasional |
Related Issues: | |||
Environment: | CentOS 6, 64bit | Attachments: | ( 0) timerfd.txt |
Description: | Ever since we updated our Asterisk installation to the latest version (1.8.9.0, compiled from source, running on CentOS 6 64bit), a couple of times a day Asterisk generates up to tens of thousands of errors like this: "res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor". -After investigating the different occurences, it seems there is one common denominator: the error always manifests when a caller is waiting in a queue, listening to music on hold.- An example is as follows: {quote} [Feb 13 10:02:59] DEBUG[14707] channel.c: Scheduling timer at (50 requested / 50 actual) timer ticks per second [Feb 13 10:03:03] DEBUG[14707] channel.c: Scheduling timer at (0 requested / 0 actual) timer ticks per second [Feb 13 10:03:03] DEBUG[14707] channel.c: Scheduling timer at (0 requested / 0 actual) timer ticks per second [Feb 13 10:03:03] DEBUG[14707] channel.c: Scheduling timer at (0 requested / 0 actual) timer ticks per second [Feb 13 10:03:32] VERBOSE[14707] pbx.c: -- Executing [s@prequeue:5] Queue("SIP/trunk-0000056d", "queue1") in new stack [Feb 13 10:03:32] VERBOSE[14707] res_musiconhold.c: -- Started music on hold, class 'default', on SIP/trunk-0000056d -> [Feb 13 10:10:33] DEBUG[14707] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 0 instead -> [Feb 13 10:10:33] ERROR[14707] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor -> [Feb 13 10:10:34] ERROR[14707] res_timing_timerfd.c: Call to timerfd_gettime() error: Invalid argument [Feb 13 10:10:34] VERBOSE[14707] app_queue.c: -- Local/agent@phones-bf1a;1 is ringing [Feb 13 10:10:36] VERBOSE[14707] app_queue.c: -- Local/agent@phones-bf1a;1 answered SIP/trunk-0000056d [Feb 13 10:10:36] VERBOSE[14707] res_musiconhold.c: -- Stopped music on hold on SIP/trunk-0000056d {quote} The 'res_timing_timerfd.c' error and debug messages repeat themselves tens of thousands of times every second, up to the point where the call is either cancelled or dropped. This is on a production server processing thousands of calls a day. Out of all these calls, the error only manifests itself a handful of times. Unfortunately we are not able to reproduce the problem in any other environment; we've setup test scenarios simulating the load that occurs in production (anywhere between 0 and 50 callers sitting in queues, listening to music on hold), but no errors occur during these tests. We have tested with and without internal_timing enabled, which doesn't seem to make any difference. Luckily, so far the errors have not affected the rest of the asterisk process: it seems to be isolated to one single channel at the time. Though while the error lasts (which can take up to 30 seconds), the CPU usage spikes to 100%, which makes sense considering it looks like the thread gets stuck in an eternal loop, and tens of thousands of lines have to be written to the logfiles every second. Interestingly enough, we have seen a few more errors regarding a bad file descriptor which seem unrelated on first sight: {quote} [Feb 11 10:47:56] WARNING[12510] res_agi.c: Unable to fdopen file descriptor [Feb 12 17:27:46] WARNING[26039] format_gsm.c: Short read (32) (Bad file descriptor)! [Feb 13 11:58:50] ERROR[6569] cdr_csv.c: Unable to re-open master file /var/log/asterisk//cdr-csv//Master.csv : Bad file descriptor {quote} If there is any more (debug) information needed, let me know! | ||
Comments: | By: Matt Jordan (mjordan) 2012-02-13 08:52:18.798-0600 We require a complete debug log to help triage the issue. This document will provide instructions on how to collect debugging logs from an Asterisk machine for the purpose of helping bug marshals troubleshoot an issue: https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information By: laszlovl (lvl) 2012-02-13 09:39:10.188-0600 Attached is the complete verbose & debug log. I stripped out some debug lines of the actual dialplan, as they contained private information about the caller. In the log you can see exactly when the timer errors start and end. In our setup, the queue calls a Local channel, which in turn (after executing some logic) dials a SIP phone. The errors occur on the caller's channel, but seem to start right after the queue dials the Local channel, and seem to end right when the local channel dials the actual SIP phone. By: laszlovl (lvl) 2012-02-13 15:36:33.661-0600 Hmm, it appears that my conclusion regarding the connection with music on hold was premature. I just watched the problem manifest itself again, this time the log output is extremely simple: {quote} [Feb 13 22:03:12] VERBOSE[31044] pbx.c: -- Executing [login@fromlocal:1] Answer("SIP/phone-00001c83", "") in new stack [Feb 13 22:03:12] ERROR[31044] res_timing_timerfd.c: Call to timerfd_gettime() error: Bad file descriptor {quote} The 'bad file descriptor' error then repeats itself around 850.000 times :-/ Debug is also the same as before: {quote} [Feb 13 22:03:12] DEBUG[31044] pbx.c: Launching 'Answer' [Feb 13 22:03:12] DEBUG[31044] chan_sip.c: SIP answering channel: SIP/phone-00001c83 [Feb 13 22:03:12] DEBUG[31044] res_rtp_asterisk.c: Setting the marker bit due to a source update [Feb 13 22:03:12] DEBUG[31044] chan_sip.c: Setting framing from config on incoming call [Feb 13 22:03:12] DEBUG[31044] chan_sip.c: ** Our capability: 0x8 (alaw) Video flag: True Text flag: True [Feb 13 22:03:12] DEBUG[31044] chan_sip.c: ** Our prefcodec: 0x0 (nothing) [Feb 13 22:03:12] DEBUG[31044] chan_sip.c: -- Done with adding codecs to SDP [Feb 13 22:03:12] DEBUG[31044] chan_sip.c: Done building SDP. Settling with this capability: 0x8 (alaw) [Feb 13 22:03:12] DEBUG[31044] chan_sip.c: Trying to put 'SIP/2.0 200' onto UDP socket destined for IP:2617 [Feb 13 22:03:12] DEBUG[31044] res_timing_timerfd.c: Expected to acknowledge 1 ticks but got 0 instead (repeated 850k times) [Feb 13 22:03:25] DEBUG[31044] channel.c: Soft-Hanging up channel 'SIP/phone-00001c83' [Feb 13 22:03:25] DEBUG[31044] channel.c: Hanging up channel 'SIP/phone-00001c83' [Feb 13 22:03:25] DEBUG[31044] chan_sip.c: Hanging up zombie call. Be scared. [Feb 13 22:03:25] DEBUG[31044] chan_sip.c: Updating call counter for incoming call [Feb 13 22:03:25] DEBUG[31044] res_rtp_asterisk.c: Setting RTCP address on RTP instance '0x7f18c9195bc8' {quote} By: laszlovl (lvl) 2012-02-15 11:11:32.167-0600 After further investigation, this appears to be a duplicate of ASTERISK-18223 |