[Home]

Summary:ASTERISK-15190: [patch] pedantic sip checking needed to generate valid messages (but broken)
Reporter:Walter Doekes (wdoekes)Labels:
Date Opened:2009-11-21 15:03:17.000-0600Date Closed:2010-11-23 13:32:41.000-0600
Priority:TrivialRegression?No
Status:Closed/CompleteComponents:Channels/chan_sip/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) ast16-ast_uri_encode.patch
( 1) astsvn-16299-ast_str_replace.diff
( 2) astsvn-16299-chansip-encode-from-fully.diff
( 3) astsvn-16299-displayname-no-uri-encode.diff
( 4) astsvn-16299-get_calleridname.diff
( 5) astsvn-16299-parse_uri-calls.diff
( 6) astsvn-16299-testh-compile.diff
( 7) get_calleridname_rewrite.diff
Description:In function 'initreqprep' in channels/chan_sip.c, the following code can be found:

if (sip_cfg.pedanticsipchecking) {
 ast_uri_encode(n, tmp_n, sizeof(tmp_n), 0);
 n = tmp_n;
 ast_uri_encode(l, tmp_l, sizeof(tmp_l), 0);
 l = tmp_l;
}
<...snip...>
snprintf(from, sizeof(from), "\"%s\" <sip:%s@%s>;tag=%s", n, l, d, p->tag);


The function ast_uri_encode encodes chars < 32 and > 127 -- perhaps one should replace that with ((signed char)*ptr < 32) ;-) -- as %HH hex escapes.

A couple of problems (all minor):
- ast_uri_encode forgets to escape % and 0x7F (RFC2396 2.4.2 and 2.4.3)
- ast_uri_encode does not escape <, >, @ and some other characters that 'l' would've liked to be escaped
- 'n' is not supposed to be hex-escaped (RFC4475 3.1.1.5 writes """The display name portion of the To and From header fields is "%Z%45". Note that this is not the same as %ZE.""")
- 'n' does however like the double-quote to be escaped, by a backslash
- ast_uri_decode is called on entire messages, not on already broken up parts


Browsing through chan_sip.c, I see pedanticsipchecking used in these cases:
- allow blanks between the header key and the colon
- allow multiline sip headers
- compare the from-tag/to-tag/branches as well instead of only the call-id
- check that a packet really is for us (handle_incoming)
- encode/decode reserved characters

In my humble opinion, I don't think creating valid output (correctly encoding illegal characters) should be enabled only by a flag that is reported as being 'slow'. And, not as relevant to me in this case, but decoding valid hex-escapes from peers does not sound like too much to ask, either.


What to do?
- I can easily write a patch that fixes my minor issue: always -- not dependent on the pedanticsipchecking -- run a s/"/\\"/g (instead of ast_uri_encode) on the name part in the From.
- I can also easily fix ast_uri_encode to escape %, 0x7f and the others as mentioned in RFC2396 2.4.3.
- Fixing all ast_uri_decode to operate first after the data has been broken up is a bit more tedious, so I can't promise I'll do that.


Regards,
Walter Doekes
OSSO B.V.
Comments:By: Leif Madsen (lmadsen) 2009-11-23 19:41:18.000-0600

Any patches that you can provide are greatly appreciated, even if the patches don't totally clear this issue up. Thanks for the reports!

By: Walter Doekes (wdoekes) 2009-11-30 14:32:46.000-0600

For starters, here's ast16-ast_uri_encode.patch, fixing ast_uri_encode to encode more reserved characters. It shouldn't break anything, as it is never illegal to escape too much.

It could however break communication between two asterisken that are using "special" characters of which one uses pedanticsipchecking and one does not. As now more characters that might not have been escaped previously are now escaped.

By: Digium Subversion (svnbot) 2009-12-07 17:24:59.000-0600

Repository: asterisk
Revision: 233609

U   branches/1.4/main/utils.c

------------------------------------------------------------------------
r233609 | dvossel | 2009-12-07 17:24:59 -0600 (Mon, 07 Dec 2009) | 8 lines

hex escape control and non 7-bit clean characters in uri_encode

In ast_uri_encode, non 7-bit clean characters were being hex escaped
correctly, but control characters were not.

(issue ASTERISK-15190)


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=233609

By: Digium Subversion (svnbot) 2009-12-07 17:26:23.000-0600

Repository: asterisk
Revision: 233610

_U  trunk/

------------------------------------------------------------------------
r233610 | dvossel | 2009-12-07 17:26:22 -0600 (Mon, 07 Dec 2009) | 13 lines

Blocked revisions 233609 via svnmerge

........
 r233609 | dvossel | 2009-12-07 17:24:59 -0600 (Mon, 07 Dec 2009) | 8 lines
 
 hex escape control and non 7-bit clean characters in uri_encode
 
 In ast_uri_encode, non 7-bit clean characters were being hex escaped
 correctly, but control characters were not.
 
 (issue ASTERISK-15190)
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=233610

By: Digium Subversion (svnbot) 2009-12-07 17:28:52.000-0600

Repository: asterisk
Revision: 233611

U   trunk/main/utils.c

------------------------------------------------------------------------
r233611 | dvossel | 2009-12-07 17:28:51 -0600 (Mon, 07 Dec 2009) | 4 lines

fixes incorrect logic in ast_uri_encode

issue ASTERISK-15190

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=233611

By: Digium Subversion (svnbot) 2009-12-07 17:29:24.000-0600

Repository: asterisk
Revision: 233612

_U  branches/1.6.2/
U   branches/1.6.2/main/utils.c

------------------------------------------------------------------------
r233612 | dvossel | 2009-12-07 17:29:24 -0600 (Mon, 07 Dec 2009) | 11 lines

Merged revisions 233611 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r233611 | dvossel | 2009-12-07 17:28:51 -0600 (Mon, 07 Dec 2009) | 4 lines
 
 fixes incorrect logic in ast_uri_encode
 
 issue ASTERISK-15190
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=233612

By: Digium Subversion (svnbot) 2009-12-07 17:29:49.000-0600

Repository: asterisk
Revision: 233613

_U  branches/1.6.1/
U   branches/1.6.1/main/utils.c

------------------------------------------------------------------------
r233613 | dvossel | 2009-12-07 17:29:48 -0600 (Mon, 07 Dec 2009) | 11 lines

Merged revisions 233611 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r233611 | dvossel | 2009-12-07 17:28:51 -0600 (Mon, 07 Dec 2009) | 4 lines
 
 fixes incorrect logic in ast_uri_encode
 
 issue ASTERISK-15190
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=233613

By: Digium Subversion (svnbot) 2009-12-07 17:30:15.000-0600

Repository: asterisk
Revision: 233614

_U  branches/1.6.0/
U   branches/1.6.0/main/utils.c

------------------------------------------------------------------------
r233614 | dvossel | 2009-12-07 17:30:14 -0600 (Mon, 07 Dec 2009) | 11 lines

Merged revisions 233611 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r233611 | dvossel | 2009-12-07 17:28:51 -0600 (Mon, 07 Dec 2009) | 4 lines
 
 fixes incorrect logic in ast_uri_encode
 
 issue ASTERISK-15190
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=233614

By: David Vossel (dvossel) 2009-12-07 17:35:04.000-0600

Those patches just take care of making sure we properly escape < 32 and > 127.  There was some really odd logic going on there for trunk and 1.6.x that I wanted to get resolved real quick.

I'm reviewing your patch and should have some feedback shortly.  I'm having to read over RFC2396 for the first time.

By: David Vossel (dvossel) 2009-12-07 18:06:03.000-0600

In rfc2396 the reserved set is listed as ;/?:@&=+$,  Right now we have '#' and ' ' added to that as well and I don't entirely understand why.

From what I gather, your patch is treating the reserved character set as everything this is not alphanum or mark.  This includes quite a bit more characters that the ones listed in rfc2396 and our current code.

----from rfc2396------
uric          = reserved | unreserved | escaped
reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
               "$" | ","
unreserved    = alphanum | mark
mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
               "(" | ")"
-----------------------

By: nick_lewis (nick_lewis) 2009-12-08 11:09:41.000-0600

dvossel

I think the correct ABNF is defined in RFC3261 Section 25 (not RFC2396)

By: David Vossel (dvossel) 2009-12-08 12:05:44.000-0600

You are correct, we should reference rfc3261 rather than rfc2396.  The reserved, unreserved, and mark sets are the same though.

By: Walter Doekes (wdoekes) 2009-12-08 13:40:27.000-0600

Indeed. Just a quick note from that RFC (running short on time atm):

"""The set of characters actually reserved within any given URI
component is defined by that component.  In general, a character
is reserved if the semantics of the URI changes if the character
is replaced with its escaped US-ASCII encoding [5]."""

Which means that an '@' in a username must be replaced so it does not get treated as an early separator. What follows is that it's easiest to escape almost every character.

By: Walter Doekes (wdoekes) 2009-12-08 13:44:06.000-0600

(Oops, the '@' example was a bit bad as it was already in the list. Take '>' in a domain instead ;-))

By: nick_lewis (nick_lewis) 2009-12-09 05:29:13.000-0600

May I suggest that the basic char building blocks of the ABNF are built in to chan_sip.c as bitmapped defined constants so that comparison can be done quickly and efficiently to determine whether escaping is required. The outgoing char would be compared with the defined constants by shifting 0x01 by value of the char.

(If bitmaps have to be kept to 32 bits then individual bitmaps can be made for ascii ranges 32-63, 64-95, and 96-127. The lower 6 bits of the outgoing char can be used as the shift index and the upper 2 bits of the outgoing char can be used as the table index.)

I think the basic ABNF char building blocks are:
* ALPHA,
* DIGIT,
* HEX,
* LHEX,
* alphanum,
* reserved,
* mark,
* separators,
* token-unreserved, where token = 1*( alphanum / token-unreserved )
* word-unreserved, where word = 1*( alphanum / word-unreserved )
* user-unreserved,
* password-unreserved, where password = *( unreserved / escaped / password-unreserved )
* param-unreserved,
* hnv-unreserved,
* uric-no-slash-unreserved, where uric-no-slash = unreserved / escaped / uric-no-slash-unreserved
* pchar-unreserved, where pchar = unreserved / escaped / pchar-unreserved
* scheme-unreserved, where scheme = ALPHA *( ALPHA / DIGIT / scheme-unreserved )
* reg-name-unreserved, where reg-name = 1*( unreserved / escaped / reg-name-unreserved )

For example the defined constant SCHEME_UNRESERVED = 114349209288704
(or SCHEME_UNRESERVED[1] = 26624 if bitmaps are limited to 32 bits)

The basic ABNF char building blocks can be combined by bitwise ORing the defined constants.

For example user-char where user = 1*( user-char ) can be expressed as:
user-char = ( unreserved / escaped / user-unreserved ) and the pseudo
code to determine whether escaping is needed could be:

while (user_char) {
  if ((1 << user_char) | UNRESERVED | ESCAPED | USER_UNRESERVED) {
     asis_char(user_char, &user_output);
  } else {
     escape_char(user_char, &user_output);
  }
  userchar++
}

By: nick_lewis (nick_lewis) 2009-12-09 05:53:29.000-0600

oops I meant

while (user_char) {
  if ((1 << user_char) & (UNRESERVED | ESCAPED | USER_UNRESERVED)) {
     asis_char(user_char, &user_output);
  } else {
     escape_char(user_char, &user_output);
  }
  user_char++
}

By: David Vossel (dvossel) 2009-12-21 08:45:48.000-0600

wdoekes, Could you clarify what the following is about in your initial report.

- 'n' is not supposed to be hex-escaped (RFC4475 3.1.1.5 writes """The display name portion of the To and From header fields is "%Z%45". Note that this is not the same as %ZE.""")
- 'n' does however like the double-quote to be escaped, by a backslash

By: Olle Johansson (oej) 2009-12-21 13:33:36.000-0600

Isn't the display name UTF8 and should be encoded as UTF8, not ascii.

The domain part is ascii and should never be encoded.

The username part is tricky as it is in ascii characters, but encoding means that we can include stuff not accepted verbatim, like "oej@edvina.net@siphosting.com" where oej@edvina.net is the username part - but only if the @ is encoded.

uri_encoding is just about the URI, it's not a generic header encoding as SIP is utf8. The display name should not be involved here. How to handle UTF8 caller ID names in Asterisk is another issue, but that applies to both chan_iax2 and chan_sip.

By: Walter Doekes (wdoekes) 2009-12-21 14:09:31.000-0600

dvossel:

I mean that the display name portion of the caller id ('n' being the variable pointing to that in the particular piece of code) should not be percent-encoded. Examples of display names:
- ABC => "ABC" (obviously, the double quotes being optional but recommended)
- A%C => "A%C" (do not percent-encode the %)
- A"C => "A\"C" (do backslash-escape the ")
- A<C => "A<C" (here, the double quotes are not optional)


oej:

Your comments about UTF8 and ASCII do not make much sense to me:
- ASCII is the 7bit subset of UTF8. If the domain part is ASCII, it might still need to be encoded if characters that delimit the domain in SIP are allowed in a domain name (the bad example being ">" which is not allowed in a domain name). Further, I am not sure that domains really needs to be limited to ASCII only (see IDN domain names).
- You do not *have* to handle UTF8 if you do not want to(*). You can pass it along from the config files and out to the wire without knowing that you're dealing with UTF8. Only if you're dealing with multiple encodings or if you have to process characters string lengths (as opposed to byte string lengths) do you *need* to explicitly work with UTF8.

(*)You really should guard against overlong encodings. But as you're not processing the UTF8, you won't be affected by them. But a piece of code that trusts your output might choke on it / be exploitable through it.

By: Olle Johansson (oej) 2009-12-21 14:39:19.000-0600

- Have you ever seen the host part of a web url being encoded? No. Never happens. Only allowed characters in the host/domain part are amongst those characters that you never encode.
- We have to handle UTF8 conversion, because we support protocols that doesn't handle UTF8, like ISDN PRI caller ID names.

By: David Vossel (dvossel) 2009-12-21 16:48:06.000-0600

There is a patch for the decode before parsing issue on reviewboard. https://reviewboard.asterisk.org/r/450/

By: nick_lewis (nick_lewis) 2009-12-22 03:07:01.000-0600

RFC3261 Section 25 is wonderfully precise and unambiguous regarding the content of display-name and addr-spec so I think it is simply a case of agreeing how best to implement it.

Regarding caller id limitations for ISDN, I would have thought any conversion to deal with this would be in chan_dahdi rather than chan_sip

An interesting development in the character set for domains is the recent announcement that european top level domain .eu should offer characters from all official EU languages
http://europa.eu/rapid/pressReleasesAction.do?reference=IP/09/1903

By: Olle Johansson (oej) 2009-12-22 03:27:28.000-0600

Nick, IDNA doesn't change the actual character set in the domain name used in protocols. It's for end-user applications. So no, International character set in domains doesn't change anything in the SIP protocol. It changes your requirements on end-user clients though.

We have to make a decision in regards to the character set for Caller ID names internally in Asterisk. Right now, there's no definition and using caller ID names in UTF8 will cause issues. We do have to respect the multiprotocol architecture of asterisk.

By: nick_lewis (nick_lewis) 2009-12-23 03:18:46.000-0600

Thanks for the info on IDNA

I think that the most internationalization friendly character set that can practically be supported should be used internally for caller ID names. The chan_techs should be responsible if necessary for converting them to the character set supported by the protocols for the techs. I guess that the conversion could be simply to replace unrepresentable characters with a "?". Since the caller id name may be used in the dialplan I guess that extensions.conf would need to support the full character set or at least have some way to represent all characters perhaps using % escaping

By: Olle Johansson (oej) 2009-12-23 03:48:13.000-0600

Well, we can do better than that. IAX2 now is standardized to UTF8, so Asterisk should in fact support UTF8 natively, but this affects all CDR drivers, logging channels and will be a huge task to get right.

I proposed earlier a new field in the caller_id structure called cid_name_utf8 (I believe the branch still exists) but got arguments that this conversion could be handled automatically. I don't think so. Converting from Asian character sets to ASCII is not an automatic procedure. ANd putting '?' in personal names is not a good solution, if there are better solutions. People care about their names.

By: nick_lewis (nick_lewis) 2009-12-23 08:33:10.000-0600

Support for UTF8 in CDR drivers, logging channels etc could be added incrementally over time (from release 1.10 onwards perhaps). The utf2ascii boundary can start off including only chan_iax and the pbx core and later move outwards to encompass more parts of asterisk.

I am not sure how the utf2ascii conversion function could operate other than to replace all multibyte characters with "?". If the conversion is not done automatically how do you suppose it should be done? When an IAX call comes in from China and is being delivered to a SIP phone there is not really time to employ a translator to come up with the best caller id to display. I suppose some Asian characters that are purely phonetic (such as japanese katakana and hiragana) could be more intelligently converted to ascii but I am not sure that it is worthwhile working on conversions when time could be better spent added UTF8 support.

By: David Vossel (dvossel) 2009-12-23 12:19:59.000-0600

There is a patch up for the ast_uri_encode behavior change on reviewboard: https://reviewboard.asterisk.org/r/451/

By: Walter Doekes (wdoekes) 2009-12-23 13:51:16.000-0600

dvossel:

My issues.asterisk login does not seem to work on the reviewboard, so I'll post my feedback here:
(1) I explicitly encoded 0x25 and 0x7f. You left the comment about 0x7f but removed the functionality of both. Let me reiteratie: you *must* encode % to %25. And if the 0x7f stays in the comments, it should stay in the code as well.
(2) You check for (*ptr < 32). If you mean to check both <32 and >127, you should explicitly cast it to a signed char. """It is up to the compiler to decide whether "plain" char or default char is signed or unsigned."""
(3) You've added lots of backslashes before the % in the "expected" variable in your test. "\%" is reduced to "%" as it has no special meaning. But for all you know, it might in the future. Don't do that. (I suspect you're using an editor that highlights the %'s as printf and friends conversion specifiers.)

Thanks for your continued work on the issue.

By: pkempgen (pkempgen) 2009-12-23 14:13:43.000-0600

Nick_Lewis wrote:
> I am not sure how the utf2ascii conversion function could operate
> other than to replace all multibyte characters with "?".

The generic approach to this problem is: First transform the UTF-8
string to NFD
( http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization )
and only then strip all non-ASCII characters ([^\x20-\x7E]).
That way you get a good ASCII representation for each character.

Unicode normalization is not exactly trivial but fortunately there
are libraries such as ICU
( http://en.wikipedia.org/wiki/International_Components_for_Unicode ).
That's what PHP 6 uses btw.



By: Walter Doekes (wdoekes) 2009-12-23 14:28:08.000-0600

dvossel:

(4) If we're going to poke around in the ast_uri_encode function, I'd say that upper case hex chars are more common and would suggest replacing the "%%02x" with "%%02X". This might trigger breakage for users of bad code however ;-)


pkempgen:

I'm with Nick_Lewis on the utf2ascii issue:
Normalization could be possible, but I think it will do too little good for too much work. For western european languages where only few characters are >127, a ? here and there might be inconvenient yet readable, but for eastern european or (worse) asian languages, I fear the meaning could get lost completely if all accents are dropped and simply displaying ???? would be just as meaningful.


Nick_Lewis / oej:

My two cents on the UTF8 issue:
If you take away the problem of backward compatibility(*), the easiest way to go is to:
- declare that all strings used in asterisk are UTF8 and,
- ensure that all protocols that do not speak UTF8 are encoded to and decoded from their specific encodings(**) just before they leave and enter asterisk, respectively.

One alternative is using wide characters internally, but that suffers from the same issues and is far more work to implement.

(*) and (**) are mostly a problem in homogenic setups where everyone speaks the same encoding but no one has told the application which encoding that is.

In my humble opinion, to continue this fruitfully, it would be wise to identify where exactly breakage will occur when you switch and how this can be mitigated:
- SIP shouldn't be an issue, you are already speaking UTF8 if you're using the international characters.
- Other protocols (I am too unfamiliar with other protocols) might need an explicit character set setting and users should be forced/warned *early* to set this correctly.


Regards,
Walter

By: Olle Johansson (oej) 2009-12-23 14:53:48.000-0600

I am afraid that it sounds very easy, but it's not, really. It will affect a very large part of ASterisk to change ALL strings to UTF8, and still a large part if we change just the Caller ID name.

Any downconversion of names will fail and upset people. IMagine converting Asian characters to Ascii, we'll end up only with question marks. UTF8 is much bigger than ISO8859-1 and it's not about a few dots and rings above some characters. These people will want an UTF8 name as well as an "anglified" name in Ascii... We also need to check if the phone really supports UTF8. What happens if we send hebrew characters to a Polycom, a SNOM, a Cisco/linksys phone?

Trust me, what you require needs a lot of hours funded.

By: Walter Doekes (wdoekes) 2009-12-23 15:43:28.000-0600

(Pretty much all of my arguments below depend on my assumption that asterisk is encoding-agnostic and does not forcefully strip the eighth bit in the core of the application. If those assumptions do not hold true, you can ignore the rest of this post.)

- You do realize that the beauty of UTF8 is that an application that is encoding-agnostic can do UTF8 without any work at all? As far as I can tell -- please correct me in this if I'm wrong -- asterisk is for the most part encoding-agnostic (exceptions being things like mail content encoding in app_voicemail). So changing of "ALL strings" is not at all necessarily the case. Unless you're referring to the increased size you *might* need, which should be looked at, but is a moot point as you mention the current usage of other multibyte encodings that also use more bytes than characters.

- That people might want an anglified name might be the case, but that's also a moot point. They do not have two names in the current setup either.

- I did already notice that the Linksys on my desktop does not do UTF8 like it should. Yes, if the phone for some bad reason does speak latin1, forcing utf8 in the chan_sip.c output does make it problematic to send extended characters to the phone. But do note that the <=127 range still works fine.

- For the record: it is not me who is requiring asterisk to know about UTF8. I only speak SIP at the moment and as long as asterisk doesn't mangle my bytes I can speak UTF8 without asterisk having to know about it. And yes, even though I claim that you do not need to touch all strings, I do realise that this requires many many hours to get done.

By: David Vossel (dvossel) 2010-01-04 10:08:27.000-0600

How does everyone feel about https://reviewboard.asterisk.org/r/450/ and https://reviewboard.asterisk.org/r/451/ being committed into trunk now?

Is there anything else that must be addressed?

By: nick_lewis (nick_lewis) 2010-01-04 10:53:11.000-0600

+1 for committing these uri changes. The ongoing discussion seems to relate to display-name/callerid rather than uri

By: Walter Doekes (wdoekes) 2010-01-04 11:13:41.000-0600

Oh, I'm sorry. I wasn't aware that you had updated the patches on the reviewboard.

I will take a look.

By: Leif Madsen (lmadsen) 2010-01-04 14:22:27.000-0600

Just marking as ready for review prior to committing. Thanks!

By: Walter Doekes (wdoekes) 2010-01-06 07:44:38.000-0600

dvossel:

A few patches:
(1) astsvn-16299-testh-compile.diff to make it compile without the test framework
(2) astsvn-16299-parse_uri-calls.diff fixing typo's
(3) astsvn-16299-ast_str_replace.diff a utility function (see point 5)
(4) astsvn-16299-chansip-encode-from-fully.diff set do_special_char to 1 always (yes.. Nicks version is better but more time consuming to implement), not only in RPID/PAI. (forget this patch, see point 5's patch instead)
(5) astsvn-16299-displayname-no-uri-encode.diff do what (4) does and don't encode/decode the display name using uri-encoding, but by replacing " with ''.

One could continue to debate about (4). I'm not qualified to decide whether you want this as it could break (broken) communication between asterisken.

As for (5), as can be seen from my comments at the ast_displayname_encode macro, I initially thought I'd do it right (s/"/\"/g), but reading the RFC did not reveal to me whether one would need a s/\/\\/g as well. And to be kind to parsers that match the double-quotes only (not counting backslashes), I figured it was easiest to remove (replace with '') the double-quote altogether.

By: Olle Johansson (oej) 2010-01-06 07:46:46.000-0600

Again, the display name should *NOT* be URI-encoded. Ever. It's not part of a URI.

By: nick_lewis (nick_lewis) 2010-01-06 08:29:34.000-0600

I think if asterisk is going to encode the display-name then it should be done properly. RFC3261 does state in the text and the abnf that any backslash needs escaping as well as any double quote:

     A string of text is parsed as a single word if it is quoted using
     double-quote marks.  In quoted strings, quotation marks (") and
     backslashes (\) need to be escaped.

     quoted-string  =  SWS DQUOTE *(qdtext / quoted-pair ) DQUOTE
     qdtext         =  LWS / %x21 / %x23-5B / %x5D-7E / UTF8-NONASCII
     quoted-pair    =  "\" (%x00-09 / %x0B-0C / %x0E-7F)

By: nick_lewis (nick_lewis) 2010-01-06 09:28:02.000-0600

If you get time then perhaps you could also add a matching decode for the display-name (specifically the quoted-string style of display-name) and use it in get_calleridname.

I do not understand your first comment in get_calleridname: "(1) the display name doesn't require quotes". The function seems to correctly cater for this situation already with " } else { /* No quoted string" in which it takes the token style display-name as is. However I think your second comment is a good catch. The function should use < to find the end of only token style display-names. For quoted-string style display-names it should use " except \" to find the end

By: David Vossel (dvossel) 2010-01-06 15:59:57.000-0600

There shouldn't be a problem with decoding the display-name using the ast_uri_decode function.

The way I see it all we need is a function similar to ast_uri_encode (lets say something like ast_custom_encode) that takes in an array of characters to escape, an input buffer, an output buffer, and len of output buffer.

char escape[2] = { 0x5C, 0x22 };
ast_custom_encode(n, tmp_n, sizeof(tmp_n), escape);

Would this resolve the display-name issue?

Also, I have removed the reviewboard posts for now.  I have applied a few more changes to the patches and consolidated them into this branch.  http://svn.digium.com/svn/asterisk/team/dvossel/sip_uri_encode_decode

By: Walter Doekes (wdoekes) 2010-01-06 16:28:54.000-0600

Nick:
You're right, I read the code wrong. (1) was already taken care of. And I was looking at RFC2543 instead of 3261. Oops.

I hacked up a get_calleridname decode function, but it became a bit longer than intended ;-)
A couple of limitations:
* It doesn't handle line continuations/folding. I'd expect an earlier mechanism to unfold those.
* If the destination buffer is too short, invalid input may be accepted as valid.


dvossel:
> There shouldn't be a problem with decoding the display-name
> using the ast_uri_decode function.
The display name shouldn't be uri-encoded/decoded. Double quotes should be backslash escaped. See also my initial:
> - 'n' does however like the double-quote to be escaped, by a backslash

By: nick_lewis (nick_lewis) 2010-01-07 04:55:20.000-0600

Wow that was a speedy rewrite of the get_calleridname function

It looks very good but I think it would be more readable if the processing of the quoted-string and the tokens where relegated to subsidiary functions e.g.

if (input[0] == '"') {
  quotedstring_decode(input, &output, outputsize)
  return orig_output;
}
if (input[0] == 0x9 || input[0] == 0xd || input[0] == ' ') {
  goto get_calleridname_fail;
}
tokens_decode(input, &output, outputsize)
/* set NUL while trimming trailing whitespace */
...etc...

Also I feel that some of the failure conditions are a bit harsh. How about just skipping over initial tabs and spaces rather than returning a failure? Also how about silently dropping invalid characters from within tokens rather than returning a failure?

By: nick_lewis (nick_lewis) 2010-01-07 05:46:15.000-0600

Looking through the code it would appear that the parsing of the from and to headers is broken in many places. I would be interested to see what happens if the display-name is "Jack & Jill" or "tomato-tagliatelle" or "Acme Sales; Mike"

By: nick_lewis (nick_lewis) 2010-01-07 06:14:33.000-0600

Sorry scrub my previous comment. The get_in_brackets function correctly deals with from and to headers and refer/replaces does not use a display-name

By: David Vossel (dvossel) 2010-01-14 12:39:45.000-0600

wdoekes

Thank you for your continued work on this issue.  I've gotten a chance to review the get_calleridname patch you have provided.  While I think your patch will work I do not like how it is structured.  Like Nick_lewis pointed out, it isn't very readable, so here's what I have done.

There is a patch I've just uploaded called 'new_get_calleridname.diff'.  This patch modifies get_calleridname to correctly isolated a quoted and non-quoted name before ast_copy_string is called to create the output buffer.  Can you write you write a modular decode function that can replace the ast_copy_string call in get_calleridname? I prefer this over a complete rewrite of get_calleridname.  There is a big XXX TODO note added in the patch right were I'm taking about.


By: Walter Doekes (wdoekes) 2010-01-14 14:01:55.000-0600

Your code breaks on '''"...<..." <sip:abc@def>''' because of the forward search in:

if (!(end = strchr(input, '<')) || end == input) {

When there are backslash escaping rules, you shouldn't sneak past them to try and handle it later. Yes, it may be more readable, but it's wrong and probably more cpu intensive too.

By: David Vossel (dvossel) 2010-01-14 17:23:56.000-0600

wdoekes: good point, a rewrite appears necessary.  I'm going to go through and re-structure a few things to attempt to compress the code a little for readability.  I'm also going to design a test for this function.  Would you be able to provide some example cases and your expected output?

By: David Vossel (dvossel) 2010-01-14 17:23:58.000-0600

wdoekes: good point, a rewrite appears necessary.  I'm going to go through and re-structure a few things to attempt to compress the code a little for readability.  I'm also going to design a test for this function.  Would you be able to provide some example cases and your expected output?

By: Walter Doekes (wdoekes) 2010-01-15 03:07:19.000-0600

Sure thing.

One thing to keep in mind when rewriting:
You probably want to keep an output pointer to the original buffer to give back to the caller. The caller can then proceed to parse the stuff between <> from there (and same thing with the data after the <>).

By: nick_lewis (nick_lewis) 2010-01-15 06:06:57.000-0600

Once there is a working get_calleridname function it will also need to be used by the read_to_parts function which currently tries to get the caller id name itself - sometimes unsuccessfully

By: David Vossel (dvossel) 2010-01-15 12:56:37.000-0600

Okay, I just uploaded my modified version of the get_calleridname rewrite patch along with some code to test it.

The decode logic should be the same, but this version is much easier on the error cases. Instead of returning NULL every time an invalid character is approached, this function attempts to skip those characters (not copying them into the output buffer) and continue on. There are still some harsh errors that must be caught causing an immediate return, but when possible it attempts to recover.  This falls under the philosophy of being strict with what you send, but relaxed with what you accept (as long as it is possible to make some sort of sense out of it).



By: David Vossel (dvossel) 2010-01-21 18:39:57.000-0600

A new review combining all the progress made on this issue is now on reviewboard.  https://reviewboard.asterisk.org/r/469/

By: Digium Subversion (svnbot) 2010-01-26 10:30:11.000-0600

Repository: asterisk
Revision: 243200

U   trunk/channels/chan_sip.c
U   trunk/include/asterisk/utils.h
U   trunk/main/test.c
U   trunk/main/utils.c
A   trunk/tests/test_utils.c

------------------------------------------------------------------------
r243200 | dvossel | 2010-01-26 10:30:09 -0600 (Tue, 26 Jan 2010) | 56 lines

RFC compliant uri and display-name encode/decode

1.  URI Encoding
This patch changes ast_uri_encode()'s behavior when doreserved is enabled.
Previously when doreserved was enabled only a small set of reserved
characters were encoded.  This set was comprised primarily of the reserved
characters defined in RFC3261 section 25.1, but contained other characters as
well.  Rather than only escaping the reserved set, doreserved now escapes
all characters not within the unreserved set as defined by RFC 3261 and
RFC 2396.  Also, the 'doreserved' variable has been renamed to 'do_special_char'
in attempts to avoid confusion.

When doreserve is not enabled, the previous logic of only encoding the
characters <= 0X1F and > 0X7f remains, except for the '%' character, which
must always be encoded as it signifies a HEX escaped character during the decode
process.

2. URI Decoding: Break up URI before decode.
In chan_sip.c ast_uri_decode is called on the entire URI instead of it's
individual parts after it is parsed.  This is not good as ast_uri_decode
can introduce special characters back into the URI which can mess up parsing.
This patch resolves this by not decoding a URI until parsing is completely
done.  There are many instances where we check to see if pedantic checking
is enabled before we decode a URI.  In these cases a new macro,
SIP_PEDANTIC_DECODE, is used on the individual parsed segments of the URI
rather than constantly putting if (pedantic) { decode() } checks everywhere
in the code.  In the areas where ast_uri_decode is not dependent upon
pedantic checking this macro is not used, but decoding is still moved to
each individual part of the URI.  The only behavior that should change from
this patch is the time at which decoding occurs.

Since I had to look over every place URI parsing occurs to create this
patch, I found several places where we use duplicate code for parsing.
To consolidate the code, those areas have updated to use the parse_uri()
function where possible.

3. SIP display-name decoding according to RFC3261 section 25.
To properly decode the display-name portion of a FROM header, chan_sip's
get_calleridname() function required a complete re-write.  More information
about this change can be found in the comments at the beginning of this function.

4. Unit Tests.
Unit tests for ast_uri_encode, ast_uri_decode, and get_calleridname() have been
written.  This involved the addition of the test_utils.c file for testing the
utils api.

(closes issue ASTERISK-15190)
Reported by: wdoekes
Patches:
     astsvn-16299-get_calleridname.diff uploaded by wdoekes (license 717)
     get_calleridname_rewrite.diff uploaded by dvossel (license 671)
Tested by: wdoekes, dvossel, Nick_Lewis

Review: https://reviewboard.asterisk.org/r/469/


------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=243200

By: Digium Subversion (svnbot) 2010-01-26 10:31:31.000-0600

Repository: asterisk
Revision: 243201

_U  branches/1.6.2/

------------------------------------------------------------------------
r243201 | dvossel | 2010-01-26 10:31:30 -0600 (Tue, 26 Jan 2010) | 61 lines

Blocked revisions 243200 via svnmerge

........
 r243200 | dvossel | 2010-01-26 10:30:08 -0600 (Tue, 26 Jan 2010) | 56 lines
 
 RFC compliant uri and display-name encode/decode
 
 1.  URI Encoding
 This patch changes ast_uri_encode()'s behavior when doreserved is enabled.
 Previously when doreserved was enabled only a small set of reserved
 characters were encoded.  This set was comprised primarily of the reserved
 characters defined in RFC3261 section 25.1, but contained other characters as
 well.  Rather than only escaping the reserved set, doreserved now escapes
 all characters not within the unreserved set as defined by RFC 3261 and
 RFC 2396.  Also, the 'doreserved' variable has been renamed to 'do_special_char'
 in attempts to avoid confusion.
 
 When doreserve is not enabled, the previous logic of only encoding the
 characters <= 0X1F and > 0X7f remains, except for the '%' character, which
 must always be encoded as it signifies a HEX escaped character during the decode
 process.
 
 2. URI Decoding: Break up URI before decode.
 In chan_sip.c ast_uri_decode is called on the entire URI instead of it's
 individual parts after it is parsed.  This is not good as ast_uri_decode
 can introduce special characters back into the URI which can mess up parsing.
 This patch resolves this by not decoding a URI until parsing is completely
 done.  There are many instances where we check to see if pedantic checking
 is enabled before we decode a URI.  In these cases a new macro,
 SIP_PEDANTIC_DECODE, is used on the individual parsed segments of the URI
 rather than constantly putting if (pedantic) { decode() } checks everywhere
 in the code.  In the areas where ast_uri_decode is not dependent upon
 pedantic checking this macro is not used, but decoding is still moved to
 each individual part of the URI.  The only behavior that should change from
 this patch is the time at which decoding occurs.
 
 Since I had to look over every place URI parsing occurs to create this
 patch, I found several places where we use duplicate code for parsing.
 To consolidate the code, those areas have updated to use the parse_uri()
 function where possible.
 
 3. SIP display-name decoding according to RFC3261 section 25.
 To properly decode the display-name portion of a FROM header, chan_sip's
 get_calleridname() function required a complete re-write.  More information
 about this change can be found in the comments at the beginning of this function.
 
 4. Unit Tests.
 Unit tests for ast_uri_encode, ast_uri_decode, and get_calleridname() have been
 written.  This involved the addition of the test_utils.c file for testing the
 utils api.
 
 (closes issue ASTERISK-15190)
 Reported by: wdoekes
 Patches:
       astsvn-16299-get_calleridname.diff uploaded by wdoekes (license 717)
       get_calleridname_rewrite.diff uploaded by dvossel (license 671)
 Tested by: wdoekes, dvossel, Nick_Lewis
 
 Review: https://reviewboard.asterisk.org/r/469/
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=243201