|Summary:||ASTERISK-20462: [patch] Trunk not hungup if SLA Station hangs up before answer|
|Date Opened:||2012-09-23 11:11:36||Date Closed:||2013-01-22 08:55:33.000-0600|
|Environment:||Astlinux||Attachments:||( 0) asterisk-11-bugid20440+20462.patch|
( 1) asterisk-11-bugid20462.patch
( 2) extensions.conf
( 3) Hangup_before_answer.txt
( 4) sip.conf
( 5) sla.conf
|Description:||If an SLA station hangs up before the called party answers, then the channel to the sla station is terminated, but the channel to the called party remains open and continues to ring until timeout. Or if the party answers then they get unobtainable tone and asterisk fails all over trying to connect to a meetme conference that does not exist.|
See attached log file.
|Comments:||By: dkerr (dkerr) 2012-09-23 11:23:00.255-0500|
In this example extension 111 (an SLA station) dials extension 101 (not in an SLA group, but that doesn't seem to matter). Extension 101 rings, but because of ASTERISK-20440 the caller at 111 does not hear any ringback tone. Caller at 111 decides to hangup, but extension 101 continues to ring until timeout or it answers. In this case extension 101 answers but asterisk is unable to find the meetme conference to connect it to, because it is gone.
Asterisk should have terminated the outgoing channel when the SLAstation hungup.
Also attached are the relevant conf files.
By: dkerr (dkerr) 2012-11-09 07:44:27.920-0600
What is the status of this please? No action in 6 weeks.
By: dkerr (dkerr) 2012-11-10 21:23:00.311-0600
Attached patch file fixes this issue inside app_meetme.c. The problem was that after triggering an asynchronous dial for a Trunk, the code loops waiting for that trunk/channel to do something (answer, hangup, whatever). But the code does not watch for status changes at the originating channel/device. So if it hangs up the SLA code never knows.
I added code into this loop to test for the originating channel/device going to NOT_INUSE status and if detected break out of the loop with NULL trunk channel. Existing code acts on that to shut down the trunk, et al.
See also bug ASTERISK-20440 for another critical fix to SLA, and ASTERISK-20675 for a desirable added feature to SLATrunk().
By: Michael L. Young (elguero) 2012-11-10 22:01:38.210-0600
Just a couple quick comments in regards to the patch.
Do not refer to the issue this is fixing in the comment. Stating what that check is doing should suffice.
You are missing a space on the debug line between the string and the parameter.
By: dkerr (dkerr) 2012-11-11 11:02:41.832-0600
Fixed the comment, added the space. Made the patch against the trunk level of app_meetme.c
By: dkerr (dkerr) 2012-11-11 11:05:48.866-0600
While debugging this I notice that the section of code I am changing is inside a tight loop... for ( ; ; ) and it goes round and round thousands of times. Something of a CPU hog.
Would it not be better to sleep for 100ms before checking again? Checking 10 times a second would seem to be rapid enough.
To check this theory out I put...
at the end of the loop and it seems to work just fine. However I did not include this in any of my patches because maybe there is a good reason that the original developer had such a tight loop.
Just a thought.
By: dkerr (dkerr) 2012-11-15 19:28:25.671-0600
After much testing I decided that sleeping for 1/10th second is a smart thing to do rather than have a tight loop. So I have included it in this version of the patch. I placed the sleep before checking for SLA Station hangup so that we don't go back through the loop if the originating channel is no longer active.
By: Matt Jordan (mjordan) 2012-11-19 09:12:59.966-0600
The code change does feel a bit concerning. The SLA code was developed without the use of ao2 ref counted objects, so accessing the SLA station without any threading primitives runs the risk of derefencing a pointer to memory that has been freed. This could occur in that loop if a reload occurred while a trunk was being dialed. Its hard to verify whether or not this is possible - the reload code explicitly checks the count of the SLA stations/trunk ringing lists to make sure that it only performs a reload when things aren't in use - but you may want to double check that the ringing trunks list is not 0 when the trunk is being dialed. I'd expect that the station ringing list could go to 0 if a station was ringing and hung up.
If any of the ringing lists can't go to 0 during that loop, then I think you're okay.
As for the sleep in the polling loop - that always feels like a bit of a hack. This could have been written differently by using callbacks from the Dial API and the Device State core - but that would probably require changes to the Dial API and a restructuring of the SLA code, which is way beyond the scope of anything that would occur in a release branch (much less for this problem).
How much does the CPU peg when its in that loop without the sleep statement?
By: dkerr (dkerr) 2012-11-19 20:06:47.025-0600
Okay, so I'm a bit out of my depth here not being thoroughly familiar with the asterisk architecture. But I'm trying to think through the possibilities here. The dial_trunk() function is a thread kicked off by the SLAStation() function which is used on outbound calls only. SLAStation() waits on a semaphore until dial_trunk() signals it is done, so no risk from the thread that SLAStation executes on. In the case of outbound calls the station is not "ringing" so I don't think the number of stations ringing count comes into play. Also only one trunk can be rung... ast_dial() in this case only calls a single destination.
For reload to occur both the trunk and the station would have to have hung up. If that occurs then there is a possibility that args->station points to a invalid structure. args->station itself would not have been NULLed out though because SLAstation() is blocked on the semaphore and no other thread should know about it, so a test for a NULL args->station won't help.
Now here is where the 100ms second sleep creates a problem. If both the trunk and the station hangup simultaneously, while dial_trunk() is sleeping, then there is a window during which a delayed reload could occur, such that the test for originating station not-in-use could end up pointing to a args->station->name that is now bogus. But I think that this should be handled okay by ast_device_state() which should return a AST_DEVICE_INVALID or AST_DEVICE_UNKNOWN. Maybe we need to test for those return codes too? Problem may be alleviated by placing the 100ms sleep after the test for in use rather than before for then ast_dial_answered() will be called first and should return a hangup, which breaks out of the loop.
I couldn't agree more that this should have been written using callbacks. The inbound SLATrunk() function that generates rings to multiple extension "stations" was written with callbacks. For some reason SLAStation() was not -- and I believe that this is the root cause for ringback and station hangup not being handled in the original design... it was simply overlooked.
Tight polling loops are a hack, I agree. I added the 100ms sleep because I attempted to add debug statements inside the loop -- bad idea, thousands were generated to the console causing asterisk to quickly crash. I'm guessing because a buffer somewhere got overrun. Limiting the loop to ten a second controlled this. I have no way of measuring CPU use, but in my environment it would not be a problem as I'm only running a handful of extensions and rarely more than two simultaneous calls. How it would affect a larger installation, which probably use multi-core CPUs, I have no idea.
SLA does not seem to be a widely used feature of Asterisk (else this problem would have been logged long ago) and my objective is to merely fix what is broken within the bounds of current design. It is a much larger project to redesign this "properly" maybe a task to be undertaken when SLA is ported from meetme to the new conference service.