Summary:ASTERISK-04710: [patch] Make Realtime SIP Clusterable
Reporter:Anthony Minessale (anthm)Labels:
Date Opened:2005-07-27 14:50:25Date Closed:2008-01-15 15:52:27.000-0600
Versions:Frequency of
Environment:Attachments:( 0) sip_cluster_rev3.diff
( 1) sip_cluster_rev4.diff
( 2) sip_cluster_rev5_plus_bug_3923_rev1.diff
( 3) sip_cluster_rev5_plus_bug_3923.diff
( 4) sip_cluster_rev6.diff.txt
( 5) sip_cluster_rev7.diff.txt
( 6) utils_rev2.diff
( 7) utils_rev3.diff
( 8) utils.diff
Description:This patch is the finishing touch on the cluster work I was doing recently.

When realtime is in use, the fullcontact will be placed/taken from the realtime abstraction layer rather than astdb which limits the registration to 1 box.

With this patch applied and the activation of thes options in [general]

10 boxes may share 1 realtime db and once you register on any 1 box
inbound and outbound calls to the peer may be placed on the other 9 with no
further registration.

*NOTE* make sure you add the fullcontact column to your realtime sip table if you are using a database.
*NOTE* this already works on IAX and this patch is all that stands in the way of them behaving identically with the aforementioned options.


Disclaimer on file.
Comments:By: Olle Johansson (oej) 2005-07-28 15:46:14

Hey, anthm, the url_encode is taken from another patch and you've changed it from the original version, so that it's not compatible with my original patch.

Please use the one I have in utils.c in report 3923.

By: Lorenz Barth (bartpbx) 2005-07-29 11:28:32

after applying this patch to todays HEAD i recive:

Jul 29 18:10:30 WARNING[16238] chan_sip.c: sip_xmit of 0x81a7218 (len 770) to xx.xxx.xx.xxx returned -1: Invalid argument
Jul 29 18:10:32 WARNING[16224] chan_sip.c: sip_xmit of 0x81a7218 (len 770) to xx.xxx.xx.xxx returned -1: Invalid argument
Jul 29 18:10:34 WARNING[16224] chan_sip.c: sip_xmit of 0x81a7218 (len 770) to xx.xxx.xx.xxx returned -1: Invalid argument

On a call from asterisk to a nated sip client. The call does not go throug. If i revert this patch the same call goes through without any warings.

By: Anthony Minessale (anthm) 2005-07-29 11:35:46

maybe we should devise a patch for just url_encode and decode the decode was already in chan_sip, i made my own encode so we could move them into utils as 1 patch then use them in the other patches

By: Anthony Minessale (anthm) 2005-07-29 13:55:44

utils.diff patch moves the url_* to utils.c
sip_cluster_rev1.diff uses it this new way.

By: Olle Johansson (oej) 2005-07-29 14:05:17

Why create a new patch when I already have it in ASTERISK-3833?

By: Anthony Minessale (anthm) 2005-07-29 15:29:37

I dont know,
I didnt see it in your patch I missed the reference to the bug and the files were .txt and i didnt realize it was a patch i guess.

whatever way as long as it does the same thing they should be able to sort it out at commit

By: Anthony Minessale (anthm) 2005-07-29 16:26:27

updated to deal with nat.

By: nick (nick) 2005-07-30 16:58:40

bartpbx: have you tested anthm's new patch? I'd really like to get this into 1.2...


By: Lorenz Barth (bartpbx) 2005-07-31 01:37:04

Hi nick,

I'm still testing. There seams to be a small issue left. I'am just debugging it.I'll contact anthm with details or try to create my first patch

By: Lorenz Barth (bartpbx) 2005-07-31 03:05:39

I solved an Issue. But I have problems getting this utils patch to work :-) I'll update rev4 soon

By: Lorenz Barth (bartpbx) 2005-07-31 03:51:52


everything works fine now. I've made the following changes:

 * declare both url_ mehtods as nonstatic
 * move methods out of the #ifndef HAVE_STRTOQ
 * parse the fullcontact into a "contact" initreqprep. As I'am not a good C programmer, this needs some review. Especially to work without realtime i think.

With the patch fullcontact could be:
but the sip contact is only sip:30000@ with the updated patch initreqprep parses the fullcontact.

I think this could be improved somehow.

BTW: I'll send a disclamer in an hour or so.

By: Lorenz Barth (bartpbx) 2005-07-31 08:41:43

forgot the changes to utils.h in rev2. Added rev3

By: Anthony Minessale (anthm) 2005-07-31 09:03:31

This should be coorednated with OEJ his patch also has encode/decode routines in it as well so pay mind to that.

His patch is not added yet, so perhaps they should be combined.

By: Anthony Minessale (anthm) 2005-07-31 09:12:37

you need to post diff in diff -u format

By: Anthony Minessale (anthm) 2005-07-31 09:57:47

Try sip_cluster_rev5_plus_bug_3923.diff against clean CVS of utils.? and chan_sip.c

It also contains bug 3923 for simplicity sake with minor mods:

1) I renamed url_encode/decode to have the ast_ prefix.

2) I used the original ast_url_encode that I already had because it seems to be more efficient. Also I was confused by the doreserved option because it was 0 in every usage which appears to do nothing but copy the string unchanged.  Assuming that was not the desired effect this implementation seems to encode it fine.  If there is something wrong with this, let me know.

2) I fixed a missing arg in an ast_log line.

By: Anthony Minessale (anthm) 2005-07-31 11:44:02

Took out some of bartpbx's code to avoid segfault try again.

By: Lorenz Barth (bartpbx) 2005-08-03 15:36:11

The current patch works fine.

By: twisted (twisted) 2005-08-03 17:03:07

This patch works fine.

The only problem(s) I ran into were NAT related, and I don't think this patch will solve that issue.  (sip device behind nat registers to one box, call comes in for device into another box, but second box cannot reach it because the state in the NAT router is only open for the box it registered to).

By: twisted (twisted) 2005-08-03 17:43:35

also, make sure that you have the fullcontact field in the db long enough to handle it.  I was thinking that a varchar(25) would do it, when in fact, it needed to be MUCH longer.  

Issues with NAT are still present, but not on the initial box the client registered to.  

Better Scenario:

Client (A) registers to Server (A)

Client (A) makes a call through one of the servers in the farm, using realtime, works great.

Server (B) recieves a call for Client (A).  

Even though Server B knows where to find Client A, It cannot reach it, as the NAT router has a path stored that only lets Server A talk to Client A.  

Outside of nat, this works perfectly.  

I'm not sure what we could do to get around that, if anything.  Perhaps the best route to take would be to look at how SER handles NAT.

Ser can determine if the client is NATted on it's own.  If we could determine that somehow, and use that instead of a hard nat setting, we could also find where the initial registration took place, and simply route calls to that NATted device back out through server A.  I know that kinda defeats the purpose of load balancing, but it would at least let the call through.

By: Chris A. Icide (cicide) 2005-08-04 17:09:02

Why not assume that in the directive nat is set to yes in sip.conf, that the end device is behind a nat.  In that case populate the HANGUPCAUSE or whatever variable you feel needs to populate with a value that indicates the phone is on a remote server and nat'd.  At this point you can handle that issue inside the dial plan by forking based on the failure variable to a set of logic that uses sip or iax to move the call to the server which the device is nat'd to.  (You'd need access to that information as well)

By: nick (nick) 2005-08-04 17:32:19

Or just use a NATing load balancer (or router/fw for that matter, but you've probably got a LB anyway if you're going to be clustering...)

By: hristo (hristo) 2005-08-05 02:22:13

The box that has registered the NATed endpoint (lets call it endpoint "A") can put against the endpoint registration in the database it's own IP or some other info that will identify it (the box) within the farm. Now if a call for the NATed endpoint "A" hits another box in the farm it will know (via the database entry) to switch/reinvite the call to the box that registered endpoint "A" and the NAT should be handled without a problem.

This approach will most certainly require database modification and is probably not that scalable if you have many NATed endpoints but will provide general mechanism to share the additional information needed for routing among the nodes in the cluster.

off the topic: maybe the same approach can be used for making services that need access to the RTP stream cluster aware. Like for example call waiting where you need to insert a beep in the RTP (if i haven't got it totally wrong) and this can only be done by the box that is servicing the connected call.

By: Kevin P. Fleming (kpfleming) 2005-08-24 22:50:25

Can someone explain why this addition requires storing the fullcontact in a separate field in memory from the existing one, thereby increasing the size of every SIP peer and private structure by 128 bytes?

Also please take into account that there have been some other patches merged that will make this one no longer apply, specifically the one that changed 'rtnoupdate' into 'rtupdate' (if not others).

I'm not thrilled about the contents of bug 3923 being merged into this one with changes... I'd rather get that one merged so the other two that are waiting on it can also be handled while this one gets brought up to date.

By: Anthony Minessale (anthm) 2005-08-25 10:24:33

Do you you think it not applying has anything to do with the fact that I filed it on july 27 and the most recent update was july 31st yet you are finally looking at it on August 25th?  Just par for the course of how bugs get treated and it's another step closer to my not submitting any more patches to this system.  As for The extra field, let me remeber since it's been a month, oh yes it's because the fullcontact stored in the database is added to that field to prevent looking it up in the database again the next 10 times it's needed yet it's not always apropriate to have it in the real fullcontact field and 128 bytes is better than introducing side-effects into the giant mess of chan_sip. I feel like I waste my time even explaining any furtur let's not forget this issue was in another bug and I was told to file it seperatly only to watch it mold here for an entire month and that is, frankly, unacceptable.

By: Anthony Minessale (anthm) 2005-08-29 11:00:08

begrudgingly updated this patch to current CVS head.

By: Lorenz Barth (bartpbx) 2005-09-14 15:05:28

Any news on this? Is there anything we could do to get this into cvs?

By: Kevin P. Fleming (kpfleming) 2005-09-14 20:42:45

Well, we can start by getting the design question I asked nearly a month ago answered, so I can understand why we need to penalize users systems with 128 bytes of increased private structure for this functionality. There is also the issue that the most recent version of the patch still included someone's else code that had already been separately submitted (and disclaimed) and which has subsequently been merged.

Regardless of anyone's opinion, it has never been the case that we've merged patches quickly just to keep them from no longer applying; the complaints about this one not applying because other patches had been merged are not very realistic, given that those patches had been posted to Mantis before this one and had every right to be merged earlier. If doing so causes any other outstanding patches to no longer apply that is something that we all have to live with.

Of course, since that time many more patches have been merged, some of which were also older than this one, but most of which were newer (but they were ready to merge immediately and had no outstanding issues to be resolved).

If anthm wants to describe _WHY_ this patch is doing things the way that it is but doesn't have time to update it and test it then I'll happily take ownership of it and ensure that the functionality gets into CVS HEAD ASAP (with due credit, of course).

By: Anthony Minessale (anthm) 2005-09-15 09:09:38

let me quote my own bugnote from 8-25 ,the very next day after the question was asked with full sarcasm in tact.

"As for The extra field, let me remeber since it's been a month, oh yes it's because the fullcontact stored in the database is added to that field to prevent looking it up in the database again the next 10 times it's needed yet it's not always apropriate to have it in the real fullcontact field and 128 bytes is better than introducing side-effects into the giant mess of chan_sip"

But if I must elaborate furthur....

let me begin by reitterating that chan_sip is a giant mess.  It is a miracle I even got this patch working and I only did because I have been studying the code for nearly 3 years.  Next the fullcontact thing being stored in a dbm file can, at best, be described has a huge hack.  Now when approaching the problem and the examining the possible solutions I am faced with the task of re-engineering chan_sip to not be semi-assembly code and make it all make perfect sense so it is easy to add my patch, (or maybe if I do that I don't need the patch cos the by-product of my work may have solved the problem already). In any case, It became clear that the fullcontact thing must be ported to realtime to make this work.  It's bad enough that the fullcontact is constantly read from the db file but once I move it to realtime now you are asking for it to be read from the database all the time.  with the archaic nature of chan_sip the reason I added this field was so I could avoid looking up the data when chan_sip usually calls for a lookup in the dbm file by adding it to the new col as soon as a do a find_peer which is usually sooner than when it should have the fullcontact populated so when the fullcontact is needed it is obtained from that field in memory rather than from the db every time.  I chose to do this is way that would maintin the same flow and avoid introducing behavioural changes.  I do not have the time to redesign chan_sip but I do need this functionality and I believe that it was the best way to accomplish the goal without opening a giant can of worms.  Perhaps the 128 bytes could be replaced with a strdup/free then it's only 4 bytes but then i'd get heat for using malloc too much *shrug* Maybe every pvt should not need to go and manually duplicate the guts of every user/peer as it is created perhaps you could make a pointer to the peer in the pvt and ref count the peer to keep it from being freed this pales in comarison to 128 bytes yet this behaviour is worsened a few times a month............  I still hold the same point of view, I am getting sick of dealing with this, you trust us real-world developers to bring issues to light then we have to work or ass off to make you happy. Frankly if we even *describe* the issue that should be enough but most of us also hand out the code implementing the solution if it sux, ok, but at least it demonsrates the issue. Most of us have real jobs to attend to besides working on asterisk day and night.  I dont care if this patch was added sooner as much as I care how much it was ignored as an issue.  I was working on the transfer stuff too on sip in bug 3910 which was also completely ignored then you guys decided it was a good day to work on it from a different angle and just erased my patch and went about it your own way and had no collaberation with me at all...

By: Chris A. Icide (cicide) 2005-09-18 00:22:13

I've been using this patch for a couple weeks now and it has worked perfectly.  Environment is 5 servers all reading off of one set of dynamic tables for sip peers and sip users.  No clients are NAT so I haven't tested that functionality, but for non-nat peer/users it's worked perfectly

By: twisted (twisted) 2005-09-18 01:44:51

c'mon guys. memory is cheap.

By: Michael Jerris (mikej) 2005-09-26 22:30:28

Hand patched and updated against chan_sip.c 1.866 just in time for a couple more commits :(.  Can somone please test this patch (against cvshead from about an hour or so ago), it has significant changes as large portions of what were in the previous patch were already in tree from previous comits.  anthm-  Can you please make sure I didn't overly butcher this in the hand-patching process.

By: Lorenz Barth (bartpbx) 2005-10-05 16:11:59

I've just updated this patch against chan_sip.c 1.882 and updated one of our test servers in the cluster. everything seams to be working fine.

How can we continue on this?

By: Donny Kavanagh (donnyk) 2005-10-19 17:06:47

Is this going anywhere? any chance of this makeing it into 1.2.

By: Lorenz Barth (bartpbx) 2005-10-23 02:45:12

I would realy like to see it in 1.2. And i dont know or see a Problem with it. Maybe someon can show me / us the Problem.

By: siacali (siacali) 2005-10-25 15:47:49

This looks great!  What's preventing this from going forward?

By: Brian West (bkw918) 2005-10-26 15:06:56

This patch is needed by many.  These patches are a prime example of what caused the fork of Asterisk to happen.  


By: Serge Vecher (serge-v) 2005-10-26 15:33:40

I don't know about the inner workings of chan_sip, but is it possible to modify the code so that the extra 128 bytes are allocated in the private structure if there is a respective setting in sip.conf (clustering=yes, default to "no," which is current behaviour)?

By: Mark Spencer (markster) 2005-10-28 06:38:58

It has been waiting on someone to either fix or explain why the extra 128 bytes is necessary.  The comments from 9/15/2005, while extensive, provide no actual explanation.  Furthermore there is no explanation of the addition of the "defaddr" field, nor is there any explanation for why the "fullcontact" is handled separately.  In the absense of more input, I'll just try to make the changes myself, get the feature committed (pre 1.2) and see if anyone has any trouble with it.

By: Mark Spencer (markster) 2005-10-28 06:43:08

(and again I maintain it is easier to maintain code quality at patch time than to have to go back and try to fix it after the fact)

By: Mark Spencer (markster) 2005-10-28 09:45:21

Okay, I have committed a highly revised version of this patch with the following changes:

1) Removed inexplicable defaddr field in private structure
2) Merges fullcontact as part of the normal database call, eliminating the double database ineffiency
3) Simplified duplicated code
4) Removed unnecessary URI encode/decode
5) Removed duplicated fullcontact field
6) Changed "rtignoreregexpire" to "ignoreregexpire" since this option should have nothing to do with whether a peer is realtime or not.  Eventually this should be more than a global option.
7) Updated documentation appropriately

Please feel free to reopen this bug if my fix still needs some work (as is quite likely the case).

By: Olle Johansson (oej) 2005-10-30 07:54:38.000-0600

Reopen on request from bartpbx in IRC.

By: Lorenz Barth (bartpbx) 2005-10-30 14:06:30.000-0600


I've just updated two of our test server to see how the new code in HEAD works. And I have a few Problems with it.

I have a sip client registred on one Server. The fullcontact is in the DB. If I now do a "sip show peer <mytestpeer> load" on the second server, the peer is loaded but without the fullcontact (  Addr->IP , Defaddr->IP). Also I recive a few segfaults But I think these are not related to this problem. I'll open a an other Bug fore those.

By: Mark Spencer (markster) 2005-10-31 08:28:26.000-0600

Please find me on IRC so I can diagnose this.  Thanks!

By: Mark Spencer (markster) 2005-11-02 21:45:00.000-0600

If you are still having trouble with this after updating to latest CVS head, let me know in the next day or two, otherwise we'll close this back out.

By: Kevin P. Fleming (kpfleming) 2005-11-07 23:41:17.000-0600

Closed due to lack of response. If there is still an issue, please open a new bug with the requested details so it can be addressed.

By: Digium Subversion (svnbot) 2008-01-15 15:52:27.000-0600

Repository: asterisk
Revision: 6870

U   trunk/channels/chan_sip.c
U   trunk/configs/sip.conf.sample

r6870 | markster | 2008-01-15 15:52:26 -0600 (Tue, 15 Jan 2008) | 2 lines

Merge highly modified fullcontact in realtime sip patch (bug ASTERISK-4710, heavy mods)