ASTERISK-24768: res_timing_pthread: file descriptor leak

[Home]

Summary: ASTERISK-24768: res_timing_pthread: file descriptor leak

Reporter: Matthias Urlichs (smurfix) Labels:

Date Opened: 2015-02-08 11:04:40.000-0600 Date Closed: 2015-02-15 06:42:02.000-0600

Priority: Major Regression? Yes

Status: Closed/Complete Components: Resources/res_timing_pthread

Versions: 13.2.0 Frequency of
Occurrence Constant

Related
Issues:
is duplicated by ASTERISK-24784 Asterisk v13.2.0 + chan_sip + leaks pipes

is duplicated by ASTERISK-24792 CLONE - app_transfer fails with PJSIP channels

Environment: Current Debian (jessie/testing), i386, up-to-date Attachments: ( 0) timer.patch

Description: Pthread timers are never deallocated because their link into the pthread_timers chain is never undone.
This causes a file descriptor leak (at least two per incoming call).

The locking in this patch probably needs review; the ao2_unlink() call does not. :-P

\[Edit:\] *Inline patch removed*

Comments: By: Matt Jordan (mjordan) 2015-02-08 11:18:32.539-0600

Unfortunately, we cannot accept even small patches inline. Please attach the patch in unified diff format to the issue, after signing a contributor license agreement. Thanks!
By: Matthias Urlichs (smurfix) 2015-02-08 12:07:07.962-0600

Patch added.

"You already have a license that is approved or pending review."
By: Matt Jordan (mjordan) 2015-02-08 21:24:10.686-0600

[~smurfix]: Thanks for attaching the patch.

Looking at it, we don't actually need to explicitly lock the {{pthread_timers}} container in that close function, as ao2_unlink will explicitly lock the container. It's probably okay to still lock the {{timing_thread}} - if nothing else, it won't be explicitly harmful, and unlinking the item from the container will race with the call to {{ao2_container_count}}.

As it is, I'd be happy to commit the patch with that small tweak.
By: Matt Jordan (mjordan) 2015-02-08 21:27:29.310-0600

Actually, on second thought, we have to be careful not to introduce a locking inversion here.

Since {{ao2_link}} and {{ao2_unlink}} both lock the underlying container, and the locking in {{pthread_timer_open}} is established as {{pthread_timers}} followed by the {{timing_thread}} lock, the code should first lock the {{pthreads_container}}, then invoke {{ao2_unlink}} using the {{ao2_unlink_flags}} variant with the {{OBJ_NOLOCK}} flag.
By: Private Name (falves11) 2015-02-14 13:44:26.891-0600

I am eager to test the patch, if this it is approved.
By: Private Name (falves11) 2015-02-15 07:39:50.629-0600

It works indeed. Many thanks.
By: Matthias Urlichs (smurfix) 2015-02-15 08:51:34.589-0600

You're welcome. Finding resource leaks is no fun.
By: Private Name (falves11) 2015-02-21 19:57:14.315-0600

The problem is back. I am using SIP, not PJSIP. With 220 open calls, playing tons of voice prompts, it is a free roulette over the phone, you may test it 1 573 764 2870 (free free free)
lsof | grep asterisk | grep FIFO | wc -l
646110
after like 1/2 hour.

Note: SVN-branch-11-r432098M has the same issue, so this is a regression issue.
I think this may be a new leak, unknown so far. I went back to rev SVN-branch-13-r431807M and with 120 calls it shows
lsof | grep asterisk| grep FIFO | wc -l
190950
The only thing I do is playing prompts and constantly reading variables
Note:
This is turning simpler. In a completely empty, not even a single line, dialplan, with

SIP/demo-0000037d (None) Up Echo()
SIP/demo-0000037e (None) Up Echo()
SIP/demo-0000037f (None) Up Echo()
501 active channels
0 active calls
0 calls processed
lsof | grep asterisk| grep FIFO | wc -l
1028105

This is SVN-branch-13-r432154M
Is this normal?
By the way, there are no codecs loaded that do not come with Asterisk.
All the channels are originated using a call file and the app Echo.
I did some experimentation, and it shows that if If I open 500 calls, and in the receiving end use the app Echo(), same in the sending box, the count is 44 on both ends. Two boxes, one originates 500 calls and drops them into Echo(), the other one receives them and also Echo(). No FIFO problem.
However, if in the receiving end I play music on hold, on the 500 calls, the FIFO count goes to 500K. This proves that the issue is RTP, not timers. Also I tried loading alternatively both timing modules, pthread and timerf, and it did not matter.
Also I tried changing the codec, from ulaw to g729, and the FIFO count, ceteris paribus, went down to 1/4.
When the call count gets to 500, I get this error several times
ERROR[14107]: res_rtp_asterisk.c:3031 ast_rtcp_write_report: RTCP RR transmission error to X.X.X.X:10585, rtcp halted Operation not permitted

Can somebody give instructions as to how to debug this inside gdb?

By: Matthias Urlichs (smurfix) 2015-02-22 09:22:51.399-0600

To debug this with gdb, you basically attach it to Asterisk, set a breakpoint on the pipe() and pipe2() syscalls with "catch syscall pipe", use "where" to get a stack trace. You now see where the pipe gets allocated. Find the code that's responsible for cleaning up that allocation. Also, set a break on __ao2_ref to find any code which increments or decrements the reference count of the object where the pipe's file descriptor numbers are stored. You'll probably find an increment that's not balanced by a decrement. Fix it.

As an example, examine the previous fix. The error was that the allocator also enqueued the object, which increases the refcount; however the function which indicates that the system is done with it didn't dequeue it. Instant leak.

If that doesn't help, get the perl script at http://search.cpan.org/~infinoid/App-SVN-Bisect-1.0/bin/svn-bisect, find an older Asterisk version that doesn't have the bug, and use that svn-bisect script to find the the patch which introduced the problem. Most likely that'll point to a solution.
By: Private Name (falves11) 2015-02-22 10:37:13.190-0600

Could I possibly hire you via Paypak to log into my system and find the problem? It is very simple, there is no dialplan
just type, in asterisk "!testcall.sh" and one call will be placed. If you want the 500, type "500calls.sh".
I just think that maybe you can figure this out in 5 minutes and save the community a year of trouble. I am not a developer.

I have been doing some research, and all Asterisk versions have the same bug, from 1.8 to Trunk. This is not a regression, but a structural bug. I just happened to stumble upon it. I cannot believe that nobody did a simple test like the one I am doing, send 500 calls from one box to the next box, and play music on hold there and echo() here. I tried going back in versions for Asterisk 13 and it is the same. For instance now, in Trunk, the FIFO count for 500 calls connected is 1026048.
By: Matthias Urlichs (smurfix) 2015-02-22 12:08:43.430-0600

Please email the testcall.sh script and any required dialplans (and/or sip.conf entries) to me, I'll take a look.

NB, this looks like it's a completely different problem; the echo channel simply isn't released when the call is closed. You're not really leaking half a million FIFOs here – just two per channel.

matthias@urlichs.de
By: Private Name (falves11) 2015-02-22 12:25:12.736-0600

I just sent you the information. It is not the echo() app. The issue is identical when you play Music on Hold in both ends. It is related to RTP, since when I use echo() in both boxes, there is no issue. Again, I am not a developer, I just make a living out of Asterisk.
By: Private Name (falves11) 2015-02-24 22:38:00.544-0600

Version 432199 still has the handle leak.
By: Private Name (falves11) 2016-08-17 18:01:48.239-0500

I wonder if Asterisk certified/13.1-cert7 has this fix.
Is there a way to map git or svn version numbers to Certified Asterisk versions?
By the way, Asterisk 11 has this problem still big time.
with 150 calls open
lsof | grep asterisk | wc -l
245852
it crashes every few hours, and I bet the issue is this one.
By: Private Name (falves11) 2016-08-17 19:35:17.102-0500

Asterisk 13.11.0-rc1
asterisk -rx "core show channels count"
224 active channels
112 active calls
1099 calls processed
asterisk -rx "core show uptime"
System uptime: 11 minutes
Last reload: 11 minutes

lsof | grep asterisk | grep FIFO | wc -l
183992

I don't think this is fixed.

By: Joshua C. Colp (jcolp) 2016-08-17 19:50:11.964-0500

The specific issue mentioned here was fixed. Yours is different.
By: Matthias Urlichs (smurfix) 2016-08-18 00:26:10.617-0500

Note that

>> lsof | grep asterisk | wc -l

is not a legitimate way to count open file descriptors, since Asterisk is multithreaded and "lsof" emits every thread.
You need "lsof -p PID_OF_ASTERISK"-
By: Asterisk Team (asteriskteam) 2016-08-18 00:26:10.848-0500

This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable.
By: Private Name (falves11) 2016-08-18 00:47:06.206-0500

lsof -p $(pidof asterisk) | grep asterisk | wc -l
534
[root@kitcom asterisk]# lsof | grep asterisk | wc -l
34249
There is a world of difference. If you care to explain again and in more depth why this huge difference does exist.

By: Matthias Urlichs (smurfix) 2016-08-18 02:35:00.689-0500

>> lsof -p $(pidof asterisk) | grep asterisk | wc -l
The "grep" part is superfluous. Also, you want "pidof -s" in case Asterisk is busy forking an external process. In this case, you'd use "cat /run/asterisk/asterisk.pid" instead, because that's where you *know* the process ID is, instead of searching the whole system for it.

>> lsof | grep asterisk | wc -l
Do you know what a thread is? Asterisk has a lot of them. They share file descriptors (not necessarily, which is why "lsof" examines them in the first place, but in Asterisk's case they do). You want to count these once, not 64 times.

In any case, "lsof" does way too much work if all you need to know is the number of open file descriptors. Do
>> ls -f1 /proc/$(cat /run/asterisk/asterisk.pid)/fd | wc -l
for that.

By: Private Name (falves11) 2016-08-18 06:44:20.705-0500

Please close the case.
My count is low. I am humbled.