[Home]

Summary:ASTERISK-01962: *8 call pickup from sip to sip sometimes callee keeps ringing
Reporter:lvazquez (lvazquez)Labels:
Date Opened:2004-07-07 09:26:52Date Closed:2008-01-15 15:02:08.000-0600
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) chan_sip.c-cancel_ringing.patch
( 1) channel.c-sip_pickup.patch
( 2) pickup-fail-full.txt
( 3) pickup-ok-full.txt
Description:In a call from SIP/0190 to SIP/0195 and picking up with *8 from SIP/0192 I was able to pickup the call from 0190 but sometimes SIP/0195 keeps ringing forever (asterisk did not sent a CANCEL). Phones configured with extensions 190, 195 and  191 resp. and all with callgroup=1 pickupgroup=1


****** ADDITIONAL INFORMATION ******

This problem has been there for many time, despite many people out there said they didn't say this (I suppose because this only happens randomly and depends on hardware timing)
I was sure it was a race condicion problem between at least two program threads and related to the (quoted from code) "pure nastiness" in the channel masquerade stuff process.
It was hard to get to the point because of the lack of documentation and the complexity on the code but after many many ast_debug outputs and many testing I found the problem
and finally I was able to fix it in a couple of ways (they both solve the problem but not sure of collateral effects).

Description of the problem:

In chan_sip.c when SIP/0192 dials *8 it gets into handle_request() and gets to ast_pickup_call(c) (line 6950 approx).
-> ast_pickup_call(c) in res_parking.c find the channel to pickup (SIP/0195) and starts the masquerading process calling
channel.c:ast_channel_masquerade(original,clone)
this function connect the two channels with the masq field and finish queueing an AST_FRAME_NULL frame in original and clone (the rest of masquerade process is handled in channel.c:ast_read() later)
At this point we have two branches and the luck of the call pickup depens on which one occurs first.

First posibility is: execution returns to handle_request() where we change the masqueraded channel to DOWN and then we hangup the channel with ast_hangup(responsible for sending the CANCEL)
chan_sip.c:6962
} else {
 ast_mutex_unlock(&p->lock);
## State is set down so CANCEL will be sent
 ast_setstate(c, AST_STATE_DOWN);
 ast_log(LOG_NOTICE, "Calling Hangup in
                      channel:%s\n",c->name);
 ast_hangup(c);
 ast_mutex_lock(&p->lock);
}

But here occurs the race condition because at the same time in a different thread the channel scheduler is resposible for calling channel.c:ast_read() in the masqueraded channel
orig: { name = 0195@asterisk; masq => {0192@asterisk} }
channel.c: ast_read(ast_channel *chan)
This function checks for chan->masq an if not null
calls ast_do_masquerade(chan)
channel.c:ast_do_masquerade(ast_channel *original)
This functions do the masquerade stuff (not going into details) and then calls sip_fixup and sip_hangup on the
clone channel. Then the channel is made zombie
The problem is, if we get here before reaching the hangup
in the chan_sip.c:handle_request() thread; the state in the channel will be UP so the CANCEL wont be sent in this hangup call:
chan_sip.c:sip_hangup(ast)
if (!ast || (ast->_state != AST_STATE_UP))
 needcancel = 1;
an wont neither be sent later when hangup will be called finally on handle_request() (because the canal will be a crapped out zombie) so in this case the CANCEL is never sent.

I will upload two files corresponding to CLI outputs with full debugging info (many DEBUG, WARNINGS an NOTICES introduced by myself to follow the code call flow).
The first one (pickup-ok-full.txt) is a call picked out OK (callee stops ringing) and the second (pickup-fail-full.txt) is the fault case (callee keeps ringing)
Look the commented lines where the execution order change and specialy to the different state when sip_hangup is called.

The fist way I found to fix this was to introduce an
usleep(100) call in channel.c:ast_do_masquerade
just before:
if (clone->pvt->fixup){
  res = clone->pvt->fixup(original, clone);
This ensures (at least in all my testings) the block in
chan_sip.c:handle_request is called first.

Because this seems a dirty one I looked for a different solution and I found the problem is fixed to if we swap
the state of original and clone channels before calling
hangup on the clone channel. This ensures state is not UP
(is RINGING) when sip_hangup is called on clone an so the CANCEL is sent to the previous calleed channel.
I suppose it doesn't introduce other problems because in
the current version the state of the original channel is changed a few lines ahead in the code:
** channel.c:ast_do_masquerade:L2320
original->_state = clone->_state;

Here is the patch for channel.c with the previous changes:

Comments:By: Mark Spencer (markster) 2004-07-07 09:57:18

First of all, let me say *thank you* for placing such a carefully researched and detailed bug report.

As for the correct way to solve it, it might be easiest just to make the cancel test depend exclusively upon the private structure.  If you have time to whip up a patch feel free to, otherwise i'll try to take a break while i'm here in bordeaux and do something with it.

By: lvazquez (lvazquez) 2004-07-07 12:14:52

Hello Mark
The patch I uploaded (swapping the states earlier in ast_do_masquerade) seems to fix the problem. I'am allmost sure this won't affect any other channel because
the state was changed a few lines later anyway.
But I agree with you the best way to fix it would be only to change code in chan_sip.c, an changing the cancel test in sip_hangup would be a good way.
This was my first try but I didn't know how, and comenting out the line
gives as result the sip channels didn't get the BYE when disconnecting an established call (take a look at  http://www.ipcontact.com.uy/callflow/corte_sip_bad/
and
http://www.ipcontact.com.uy/callflow/corte_sip/ )

Could you help me about wich field(s) in sip_pvt could be used for the test?
I just tried using p->ringing:
// if (!ast || (ast->_state != AST_STATE_UP))
if (!ast || p->ringing )
needcancel = 1;
But this didn't work.

Keep in contact
Luis

By: lvazquez (lvazquez) 2004-07-07 12:19:37

Sorry, an error in the url's I just sent.
Those are internal only, the right ones are:
http://ipcontact.teledata.com.uy/callflow/corte_sip/
http://ipcontact.teledata.com.uy/callflow/corte_sip_bad/

By: lvazquez (lvazquez) 2004-07-07 14:14:16

Well, I think I've got it.
After lookin in the sip_pvt structure I found the ringing field is used only for incomming channels (to indicate a  180 ringing has been sent), so It can be used in an outgoing sip channel to register if an invite has been sent so we have a cancel pending in case of hangup.
So I initialice this field to 1 when we send an INVITE in sip_call and reset to 0 once we get the 200 OK from the peer on handle_response.
This way I can use
(p->outgoing && p->ringing)
instead of
(ast->_state != AST_STATE_UP)
in the cancel test inside sip_hangup.
Not sure if the value of sip_pvt.ringing should be changed in another places to be consisten with this use but from the few testings I have made It seems to work ok. At least picked up SIP calls are allways canceled.
Please take a look (wipe out my debug lines) and feel free to use it if you
are convinced it's ok.
Bye

By: Mark Spencer (markster) 2004-07-09 07:11:00

I agree with putting it before the hangup.  That is the correct and easiest solution.  I didn't use your patch since it had so much extra comments in it, but please confirm my solution works as soon as possible.  Thanks again for such a good bug report.

By: lvazquez (lvazquez) 2004-07-09 10:21:43

Ok, I am getting the last cvs right now. Give me a couple of days to syncronize with the few patches we are including in our solution (some changes in sip channel names and with database tables to integrate better with ser) and I will test it and give you a response.
Thanks to you for your attention and confidence

By: Olle Johansson (oej) 2004-07-16 14:36:28

Waiting for feedback. Are we ready to close the bug report or is this still a problem?

/Housekeeping

By: lvazquez (lvazquez) 2004-07-19 09:05:49

Ok, I tested the new cvs and this seems to be working perfectly.
Sorry by the delay but my granfather died last week and was not good time for
debugging.
Thanks for your feedback Mark
Time to close this one.
Bye
Luis

By: Olle Johansson (oej) 2004-07-19 09:14:45

Fixed in cvs.

By: Digium Subversion (svnbot) 2008-01-15 15:02:08.000-0600

Repository: asterisk
Revision: 3412

U   trunk/channel.c

------------------------------------------------------------------------
r3412 | markster | 2008-01-15 15:02:07 -0600 (Tue, 15 Jan 2008) | 2 lines

Swap states early in masquerade process (bug ASTERISK-1962)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=3412