Summary:ASTERISK-28794: res_pjsip: Crash when escaping during URI printing
Reporter:nappsoft (nappsoft)Labels:patch
Date Opened:2020-03-30 13:00:12Date Closed:2020-05-27 15:07:13
Status:Closed/CompleteComponents:pjproject/pjsip Resources/res_pjsip
Versions:16.9.0 Frequency of
Environment:musl libc based linuxAttachments:( 0) ASTERISK-28794.diff
( 1) refresh_expires.diff

We are observing a frequent crash which seems to be a regression as this didn't happen with asterisk 16.9.0.

In the system this happens most frequently (almost exactly every 15 minutes) we have an outbound registration that is not answering (currently not reachable). As soon as we remove the references to this system (aor, endpoint, registration, ...) it doesn't happen any longer.

However we also observed the same crashes on other systems with asterisk 16.9.0 where we do not have such a system configured. So it seems like this can with other (possibliy temporarily) because of other unreachable contacts as well.

The backtrace is the following:

#0  0x00007f937e830725 in pj_strncpy2_escape () from /usr/lib/libpjlib-util.so.2
No symbol table info available.
#1  0x00007f937e8e2239 in pjsip_url_print () from /usr/lib/libpjsip.so.2
No symbol table info available.
#2  0x00007f937e8e19e4 in pjsip_name_addr_print () from /usr/lib/libpjsip.so.2
No symbol table info available.
#3  0x00007f937e8da580 in pjsip_contact_hdr_print () from /usr/lib/libpjsip.so.2
No symbol table info available.
#4  0x00007f937e8db6f6 in pjsip_msg_print () from /usr/lib/libpjsip.so.2
No symbol table info available.
#5  0x00007f937e8e80bb in pjsip_tx_data_encode () from /usr/lib/libpjsip.so.2
No symbol table info available.
#6  0x00007f937e8e2b26 in endpt_on_tx_msg () from /usr/lib/libpjsip.so.2
No symbol table info available.
#7  0x00007f937e8e8ab5 in pjsip_transport_send () from /usr/lib/libpjsip.so.2
No symbol table info available.
#8  0x00007f937e8f5bd0 in tsx_send_msg () from /usr/lib/libpjsip.so.2
No symbol table info available.
#9  0x00007f937e8f560e in tsx_timer_callback () from /usr/lib/libpjsip.so.2
No symbol table info available.
#10 0x00007f937e7c9b04 in pj_timer_heap_poll () from /usr/lib/libpj.so.2
No symbol table info available.
#11 0x00007f937e8e33eb in pjsip_endpt_handle_events2 () from /usr/lib/libpjsip.so.2
No symbol table info available.
#12 0x00007f937e24bfb8 in monitor_thread_exec (endpt=<optimized out>) at res_pjsip.c:4721
       delay = {sec = 0, msec = 10}
#13 0x00007f937e7b8946 in thread_main () from /usr/lib/libpj.so.2
No symbol table info available.
#14 0x00007f938029cb82 in sem_open () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#15 0x00007f937d1f9000 in ?? ()
No symbol table info available.
#16 0x00007f938029ec48 in __strftime_fmt_1 () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#17 0x0000000000000000 in ?? ()
No symbol table info available.
Comments:By: Asterisk Team (asteriskteam) 2020-03-30 13:00:13.724-0500

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

By: nappsoft (nappsoft) 2020-03-31 04:52:42.864-0500

Actually this doesn't really seem to be a regression: it's true that we didn't observe these problems with asterisk 16.7.0, but the second thing that has changed was the malloc implementation of musl libc. So while above described circumstances (how to trigger the crash) stay true we get the old (stable) behavior back when using the old musl malloc implementation.

So it seems like this was some kinde of race condition that's depending on "how forgiving" the memory management is...

By: Joshua C. Colp (jcolp) 2020-03-31 09:47:03.071-0500

Do you have any way to reproduce it or have seen it with glibc? As a project it's the C library we support, with all others being best effort by the community. Is the backtrace always in the same spot?

By: nappsoft (nappsoft) 2020-03-31 11:09:49.156-0500

Yes, asterisk always crashes at the same address, so the "end" of the backtrace is always the same: in pj_strncpy2_escape () from /usr/lib/libpjlib-util.so.2

However as we do not have that much disk space on these systems we usually have coredumps deactivated. But thanks to dmesg I can confirm that asterisk always crashes in the same function.

The way I was able to make the crashes more frequent was to have a sip-registration to a host that does not answer. (So maybe in the end it has to do with retransmitted SIP messages and memory that is freed too early or so?). On the otherhand it does not seem to depend on load: we have a system up and running with the same version and the same malloc implementation for several weeks with lots 15-30 parallel calls and about 100 connected phones over TLS where we did not observer any single crash while the sytem in question, where the crash was happening frequently only has 5 connected devices and is crashing while not one single call is going on.

I understand that musl libc is not the main libc you're targetting. However I don't think that the race condition is depending on a musl einvironment. (If there was a fundamental malloc bug in musl the application would not always and ONLY crash at the same place I guess...)

By: Joshua C. Colp (jcolp) 2020-03-31 11:30:07.640-0500

I understand that you believe it will impact other implementations, it just makes it that much more complicated and time intensive to try to isolate the problem. PJSIP also uses a memory pool approach, so whether that could impact the new musl malloc implementation I do not know either.

By: nappsoft (nappsoft) 2020-04-14 08:29:13.775-0500

FYI: also tested this with PJSIP 2.10 last weekend, with PJ_TIMER_HEAP_USE_COPY enabled. The code is still crashing at the same place

Btw. if I may ask: did you alreaedy test pjproject 2.10? Somehow asterisk 16.9 with PJSIP 2.10 will consume about 40% more memory according to top compared to a 16.9/2.9 combination. This is even the case with PJ_TIMER_HEAP_USE_COPY disabled and I somehow do not see any reason why...

By: Joshua C. Colp (jcolp) 2020-04-14 08:52:01.794-0500

We haven't tested 2.10 against Asterisk, it is in queue but has not yet been worked.

By: nappsoft (nappsoft) 2020-04-18 07:06:57.288-0500

PJSIP 2.10 has about the same memory footprint like 2.9 as long as one disables PJ_TIMER_USE_COPY. I didn't notice that PJ_TIMER_HEAP_USE_COPY had been renamed to PJ_TIMER_USE_COPY later on. With PJ_TIMER_USE_COPY enabled the basic memory footprint on my testsystem raises from ~70MByte to ~88MByte right after boot.

By: nappsoft (nappsoft) 2020-05-25 10:51:14.715-0500

Thinking about this issue again I noticed that the outbound registration is explicitely set to expire after 15 minutes. So I guess that it's no coincidence that the crashes usually happen after 15 minutes.

So I cam to the conclusion that the attached patch might be worth a try, I'm currently testing this one on the system I was confronted with this issue:

- could it be the case that two different events are scheduled that can occur more or less at the same time? (the expiration of the registration, when the registration might be removed from memory and the scheduled registration retry?). This is why I thought it might be a good idea to call pjsip_regc_update_expires again when rescheduling the registration.

By: nappsoft (nappsoft) 2020-05-26 06:05:28.373-0500

seems like I'd misinterpreted pjsip_regc_update_expires, so the patch does not help in anyway.

However: can someone maybe comment my idea that it could have to do with a rescheduled registration attempt colliding with the expiration time?

By: Joshua C. Colp (jcolp) 2020-05-26 06:45:49.897-0500

A full backtrace could show the two threads happening at the same time. Otherwise the threading should prevent it, but there may be an off-nominal or it doesn't cover the case you state.

By: nappsoft (nappsoft) 2020-05-27 00:13:40.898-0500

An update concerning this case: it doesn't seem to have to do with the register request, but with the options request to an unresponsive target. I've now built a system where I can easily reproduce the crash, so more traces will follow the next days.

By: nappsoft (nappsoft) 2020-05-27 04:46:34.820-0500

Thanks to some SIP Traces I actually know a bit better what's going on:

it seems like memory sometimes gets corrupted when resending options packets as I have several cases where the following happens (traces filtered by call-id, so I can distinguish resends from new OPTIONS requests):

- the first few OPTIONS packets leave the system correctly (Contact Header: Contact: <sip:290@xxx.xxx.xxx.xxx:5060>
- then suddenly the user-part of the contact header gets corrupted and the header looks like (example, content can vary): Contact: <sip:%20%20%20@xxx.xxx.xxx.xxx:5060>
- everything else is ok, only the user-part of the contact header gets corrupted

Usually Asterisk will send out 5 OPTIONS packets (so 4 retries). When this happens everything is ok. Howerver, sometimes it sends more (7 to 8 in the cases I've seen) where the 5th or 6th will become corrupt

However one other strange thing I've discovered: sometimes the OPTIONS retries happen all 4 seconds, but sometimes they seem to start with t1 (first retry after 0.5s, second after 1.5, third after 3.5 and then it starts to send options all 4 seconds). But: whether this happens or not seems not to have an influence on the contact header corruption: in either case this sometimes happens and sometimes doesn't. Anyway this is somehow weird...

By: nappsoft (nappsoft) 2020-05-27 06:16:25.346-0500

what helps against the header corruption:

- either not defining contact_user in the type=endpoint configuration
- or commenting the code that sets contact_uri->user to endpoint->contact_user in res_pjsip.c -> create_out_of_dialog_request (so commenting the line pj_strdup2(pool, &contact_uri->user, endpoint->contact_user) )

By: Joshua C. Colp (jcolp) 2020-05-27 06:26:25.163-0500

Does this patch resolve the issue?

By: nappsoft (nappsoft) 2020-05-27 06:45:02.837-0500

Yes, this patch helps. No more corrupted headers and I assume that the crashing will be over thanks to that. Will let the test-system run for some more hours though.

By: nappsoft (nappsoft) 2020-05-27 09:29:30.656-0500

Looks good, no more crashes.

By: Friendly Automation (friendly-automation) 2020-05-27 15:07:15.600-0500

Change 14444 merged by Friendly Automation:
res_pjsip: Use correct pool for storing the contact_user value.


By: Friendly Automation (friendly-automation) 2020-05-27 15:12:54.820-0500

Change 14468 merged by Friendly Automation:
res_pjsip: Use correct pool for storing the contact_user value.


By: Friendly Automation (friendly-automation) 2020-05-27 15:13:47.056-0500

Change 14467 merged by Friendly Automation:
res_pjsip: Use correct pool for storing the contact_user value.


By: Friendly Automation (friendly-automation) 2020-05-27 15:22:56.427-0500

Change 14469 merged by Joshua Colp:
res_pjsip: Use correct pool for storing the contact_user value.