Summary:ASTERISK-24451: chan_iax2: reference leak in sched_delay_remove
Reporter:Corey Farrell (coreyfarrell)Labels:
Date Opened:2014-10-26 18:43:10Date Closed:2015-02-18 20:00:15.000-0600
Versions:11.16.0 13.0.0-beta3 Frequency of
Environment:Attachments:( 0) refs.txt
Description:chan_iax2 leaks references to {{peercnt}} when Asterisk is shutdown within 60 seconds of calling {{sched_delay_remove}}.  This issue was found with the testsuite using ASTERISK-24379.  Many tests produce the leak.

I don't know chan_iax2 but am reporting this in the hopes that someone can address it so we can get the full testsuite leak free.
Comments:By: Corey Farrell (coreyfarrell) 2014-10-26 18:46:26.103-0500

Output from refcounter.py -n for {{tests/callparking}}.  The full refs log is 57mb, if it is needed we will need to find a way for me to post it (JIRA limit is 10MB).

By: Matt Jordan (mjordan) 2014-10-28 11:48:05.179-0500

So the weird thing is that reference bump is balanced. The variable in question is {{peercnt}}:

3033: if ((peercnt = ao2_find(peercnts, &tmp, OBJ_POINTER))) {
3034: /* refcount is incremented with ao2_find.  keep that ref for the scheduler */
3035: ast_debug(1, "schedule decrement of callno used for %s in %d seconds\n", ast_sockaddr_stringify_addr(addr), MIN_REUSE_TIME);
3036: i = iax2_sched_add(sched, MIN_REUSE_TIME * 1000, peercnt_remove_cb, peercnt);
3037: if (i == -1) {
3038: ao2_ref(peercnt, -1);
3039: }
3040: }

The variable leaking is {{peercnt}}. The function that should drop the reference is {{peercnt_remove_cb}}, which should be called so long as {{iax2_sched_add}} returns anything but {{-1}}, in which case we drop the reference ourselves. {{iax2_sched_add}} calls {{ast_sched_add}} - which returns {{-1}} on failure and the scheduled ID if the scheduler succeeded.

So the question is - why is the callback function {{peercnt_remove_cb}} not getting called? Does a {{gdb backtrace}} of the running Asterisk instance show any weird locking situations?

The other option would be to put some debug into {{chan_iax2}} and try to find out why that callback isn't being run, or if the scheduler isn't firing for some reason.

By: Corey Farrell (coreyfarrell) 2014-11-04 07:16:16.350-0600

#define MIN_REUSE_TIME 60 /* Don't reuse a call number within 60 seconds */

We could be seeing the result of shutdown happening less than 60 seconds after the call to sched_delay_remove.  This still poses an issue for the {{REF_DEBUG}} testsuite, causing 10 tests to fail.  I realize this probably isn't a "real" production leak, but it's existence reduces the effectiveness of brute force leak checking.

Maybe this could be solved by saving the result of {{iax2_sched_add}} to a field in {{struct peercnt}}, iterate peercnts at shutdown and decrement?

By: Y Ateya (yateya) 2015-01-17 12:52:59.192-0600

This might be the cause for ASTERISK-24478. The callback is not called after stressing IAX connection.

By: Corey Farrell (coreyfarrell) 2015-02-15 13:29:15.596-0600

I believe I have a fix for this issue.  I'm currently doing some tests.

[~yateya]: I don't think your issue is related.  This issue resolves itself after a wait of 60 seconds, but that is a problem for automated testing where Asterisk is quickly shutdown.

By: Corey Farrell (coreyfarrell) 2015-02-15 14:05:43.805-0600

Posted to reviewboard: https://reviewboard.asterisk.org/r/4425/