Summary: | ASTERISK-28161: Removal of Previous Patch Causes PJSIP Timer Issues | ||||||||
Reporter: | Ross Beer (rossbeer) | Labels: | pjsip | ||||||
Date Opened: | 2018-11-12 04:25:57.000-0600 | Date Closed: | 2020-06-15 17:13:01 | ||||||
Priority: | Major | Regression? | Yes | ||||||
Status: | Closed/Complete | Components: | Resources/res_pjsip | ||||||
Versions: | GIT | Frequency of Occurrence | Frequent | ||||||
Related Issues: |
| ||||||||
Environment: | CentOS 7.5 | Attachments: | ( 0) core-asterisk-114868-1543230597-thread1.txt ( 1) core-asterisk-145932-1554311130-thread1.txt ( 2) core-asterisk-174659-1559054674-thread1.txt ( 3) core-asterisk-175043-1553248984-thread1.txt ( 4) core-asterisk-180205-host-1542295191-thread1.txt ( 5) core-asterisk-197853-host-1542017635-thread1.txt ( 6) core-asterisk-3163-thread1.txt ( 7) core-asterisk-33635-1554479440-thread1.txt ( 8) core-asterisk-47221-1554826091-thread1.txt ( 9) core-asterisk-82410-1554735609-thread1.txt (10) core-asterisk-9435-1554290719-thread1.txt (11) core-asterisk-99802-1554218364-thread1.txt | ||||||
Description: | The removal of 'third-party/pjproject/patches/0010-timer-Clean-up-usage-of-timer-heap.patch' in commit https://gerrit.asterisk.org/#/c/asterisk/+/10487/ has started a segfault related to PJSIP Timers.
Please see the attached backtrace. | ||||||||
Comments: | By: Asterisk Team (asteriskteam) 2018-11-12 04:26:00.022-0600 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Kevin Harwell (kharwell) 2018-11-13 15:33:49.743-0600 [~rossbeer] Just curious, but have you added the patch back in and experienced no more segfaults? And/Or removed it again and they started up again? By: Ross Beer (rossbeer) 2018-11-14 05:07:05.836-0600 I have been plagued with timer issues for quite some time, removing the patch has moved the issue to a different place. Previous segfaults before this change were observed here: {noformat} Thread 1 (Thread 0x7f9682b7d700 (LWP 150053)): #0 0x00007f9997c2a85c in copy_node (ht=0x284f800, slot=5050, moved_node=0x7f96b55275b0) at ../src/pj/timer.c:137 #1 0x00007f9997c2ac95 in reheap_up (ht=0x284f800, moved_node=0x7f96b5d1e4e0, slot=5050, parent=2524) at ../src/pj/timer.c:208 #2 0x00007f9997c2aea3 in remove_node (ht=0x284f800, slot=5050) at ../src/pj/timer.c:254 parent = 2524 moved_node = 0x7f96b5d1e4e0 removed_node = 0x7f96ee3d1b28 #3 0x00007f9997c2b18d in cancel (ht=0x284f800, entry=0x7f96ee3d1b28, flags=7) at ../src/pj/timer.c:353 timer_node_slot = 5050 #4 0x00007f9997c2b6a2 in cancel_timer (ht=0x284f800, entry=0x7f96ee3d1b28, flags=6, id_val=0) at ../src/pj/timer.c:594 count = 0 {noformat} The reported issue for this relates to a SIP retransmission, probably related to a reliable connection such as TCP/TLS. Is this correct? By: Kevin Harwell (kharwell) 2018-11-14 17:41:52.349-0600 I'm not entirely sure. Do you have, or can you get, any more info that might shed some more light on this issue, or might help us to replicate the problem? For instance a full debug log with sip trace and/or a description of what's happening on the system at the time? Also can you attach an entire backtrace? If lucky it might show another thread acting up. Thanks! By: Ross Beer (rossbeer) 2018-11-15 09:45:42.916-0600 I've emailed the full backtrace to you directly for privacy. However in another segfault, the backtrace looks the same as this issue but this time the CLI showed the following: {noformat} [2018-11-15 15:18:48] ERROR[26108]: res_pjsip_session.c:1881 ast_sip_session_defer_termination: FRACK!, Failed assertion !session->defer_terminate (0) Got 31 backtrace records #0: [0x607c9c] /usr/sbin/asterisk(__ast_assert_failed+0x84) [0x607c9c] #1: [0x7fa2eef09dab] /usr/lib64/asterisk/modules/res_pjsip_session.so(+0x4dab) [0x7fa2eef09dab] #2: [0x7fa2eef0f0a2] /usr/lib64/asterisk/modules/res_pjsip_session.so(ast_sip_session_defer_termination+0x5c) [0x7fa2eef0f0a2] #3: [0x7fa1b3f31aa6] /usr/lib64/asterisk/modules/res_pjsip_refer.so(+0x4aa6) [0x7fa1b3f31aa6] #4: [0x7fa1b3f33203] /usr/lib64/asterisk/modules/res_pjsip_refer.so(+0x6203) [0x7fa1b3f33203] #5: [0x7fa1b3f334a9] /usr/lib64/asterisk/modules/res_pjsip_refer.so(+0x64a9) [0x7fa1b3f334a9] #6: [0x7fa2eef10d1e] /usr/lib64/asterisk/modules/res_pjsip_session.so(+0xbd1e) [0x7fa2eef10d1e] #7: [0x7fa2eef10f7f] /usr/lib64/asterisk/modules/res_pjsip_session.so(+0xbf7f) [0x7fa2eef10f7f] #8: [0x7fa2eef119ff] /usr/lib64/asterisk/modules/res_pjsip_session.so(+0xc9ff) [0x7fa2eef119ff] #9: [0x7fa4e25f090d] /usr/lib64/libasteriskpj.so.2(+0x6690d) [0x7fa4e25f090d] #10: [0x7fa4e263bdf1] /usr/lib64/libasteriskpj.so.2(pjsip_dlg_on_tsx_state+0xac) [0x7fa4e263bdf1] #11: [0x7fa4e263c65f] /usr/lib64/libasteriskpj.so.2(+0xb265f) [0x7fa4e263c65f] #12: [0x7fa4e2634907] /usr/lib64/libasteriskpj.so.2(+0xaa907) [0x7fa4e2634907] #13: [0x7fa4e26368c8] /usr/lib64/libasteriskpj.so.2(+0xac8c8) [0x7fa4e26368c8] #14: [0x7fa4e2635860] /usr/lib64/libasteriskpj.so.2(pjsip_tsx_recv_msg+0xc1) [0x7fa4e2635860] #15: [0x7fa4e263b544] /usr/lib64/libasteriskpj.so.2(pjsip_dlg_on_rx_request+0x3fd) [0x7fa4e263b544] #16: [0x7fa4e263d145] /usr/lib64/libasteriskpj.so.2(+0xb3145) [0x7fa4e263d145] #17: [0x7fa4e2619ea4] /usr/lib64/libasteriskpj.so.2(pjsip_endpt_process_rx_data+0x1ac) [0x7fa4e2619ea4] #18: [0x7fa2ec044947] /usr/lib64/asterisk/modules/res_pjsip.so(+0x2d947) [0x7fa2ec044947] #19: [0x5ef1d5] /usr/sbin/asterisk(ast_taskprocessor_execute+0x10d) [0x5ef1d5] #20: [0x5f8c5a] /usr/sbin/asterisk() [0x5f8c5a] #21: [0x5ef1d5] /usr/sbin/asterisk(ast_taskprocessor_execute+0x10d) [0x5ef1d5] #22: [0x5f6ea9] /usr/sbin/asterisk() [0x5f6ea9] #23: [0x5f8548] /usr/sbin/asterisk() [0x5f8548] #24: [0x5f8301] /usr/sbin/asterisk() [0x5f8301] #25: [0x604b40] /usr/sbin/asterisk() [0x604b40] {noformat} By: Jonathan Sullivan (flightnut5304) 2019-01-30 17:33:26.479-0600 I seem to be having the same issue here, can someone more experienced please confirm? Asterisk 16.0.0 built by mockbuild @ jenkins7 on a x86_64 running Linux on 2018-10-10 17:15:57 UTC Brief.txt - https://pastebin.freepbx.org/view/72b0d661 Full.txt - https://pastebin.freepbx.org/view/c131fc6a Locks.txt - https://pastebin.freepbx.org/view/117ac566 Thread1.txt - https://pastebin.freepbx.org/view/10b0451d By: Kevin Harwell (kharwell) 2019-01-31 09:55:15.560-0600 The backtrace appears to show a crash on the same call. I'd suspect it is a very similar, if not the same issue. By: Jonathan Sullivan (flightnut5304) 2019-01-31 12:44:36.339-0600 Just upgraded to Asterisk 16.1.1 on FreePBX and the issues with the segfaults are still present, as promised here are my updated backtraces from the two separate crashes I had this morning. Can somebody tell me if these are BOTH related to the same underlying issue? Thread- https://pastebin.freepbx.org/view/002604d0 Brief- https://pastebin.freepbx.org/view/70937d82 Full- https://pastebin.freepbx.org/view/b543b6aa Locks- https://pastebin.freepbx.org/view/54155437 ________________________________________ Thread- https://pastebin.freepbx.org/view/bb93954e Brief- https://pastebin.freepbx.org/view/1feac3cd Full- https://pastebin.freepbx.org/view/96a32c4c Locks- https://pastebin.freepbx.org/view/1323c484 By: Jonathan Sullivan (flightnut5304) 2019-01-31 13:53:49.977-0600 Updated traces above By: Ross Beer (rossbeer) 2019-01-31 14:35:15.300-0600 There was recently a patch made for PJSIP and then also added to asterisk which may resolve this issue: https://gerrit.asterisk.org/#/c/asterisk/+/10903/ I think this will be in the next release By: Jonathan Sullivan (flightnut5304) 2019-01-31 14:39:35.160-0600 Any eta for the release? By: Sean Bright (seanbright) 2019-03-25 15:04:01.416-0500 16.2.0 and 13.25.0 were released on 2019-02-15 - are you still having this problem with these new releases? By: Ross Beer (rossbeer) 2019-03-26 04:33:46.985-0500 I am still getting crashes in the PJSIP timers, please see the latest backtrace. By: Sean Bright (seanbright) 2019-03-26 13:02:21.993-0500 [~rossbeer], could you try manually applying this patch and see if it helps: https://trac.pjsip.org/repos/changeset/5934 *Edit:* Actually, doing that may be a pain if you are using bundled pjproject. Let me whip up a quick gerrit review you can try: 13: https://gerrit.asterisk.org/c/asterisk/+/11182 If you need it for a different release version of Asterisk, let me know. By: Ross Beer (rossbeer) 2019-03-26 14:24:05.906-0500 Great, thank you! I'll test and feedback if there are any issues By: Sean Bright (seanbright) 2019-04-01 09:41:13.397-0500 [~rossbeer], have you had a chance to apply this patch and test to see if it resolves the crash you are experiencing? By: Ross Beer (rossbeer) 2019-04-01 09:46:48.189-0500 I believe the patch will resolve the issue, there have been no further segfaults since applying the patch. However, the issue happens at random and therefore may happen at any time, from multiple times a day or after a few weeks. By: Friendly Automation (friendly-automation) 2019-04-02 10:17:10.135-0500 Change 11190 merged by Friendly Automation: pjproject: Add timer patch from pjproject r5934 [https://gerrit.asterisk.org/c/asterisk/+/11190|https://gerrit.asterisk.org/c/asterisk/+/11190] By: Ross Beer (rossbeer) 2019-04-02 10:28:46.775-0500 I have just had a reoccurrence of this issue, please see the attached backtrace. By: Asterisk Team (asteriskteam) 2019-04-02 10:28:47.401-0500 This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable. By: Friendly Automation (friendly-automation) 2019-04-02 10:30:02.582-0500 Change 11182 merged by George Joseph: pjproject: Add timer patch from pjproject r5934 [https://gerrit.asterisk.org/c/asterisk/+/11182|https://gerrit.asterisk.org/c/asterisk/+/11182] By: Friendly Automation (friendly-automation) 2019-04-02 10:30:31.377-0500 Change 11189 merged by George Joseph: pjproject: Add timer patch from pjproject r5934 [https://gerrit.asterisk.org/c/asterisk/+/11189|https://gerrit.asterisk.org/c/asterisk/+/11189] By: Sean Bright (seanbright) 2019-04-02 10:33:17.287-0500 [~rossbeer], what specific version of Asterisk 13 are you running? By: Ross Beer (rossbeer) 2019-04-02 10:34:30.010-0500 The latest GIT showing: GIT-13-13.15.0-rc1-2191-g664f706M Plus the patch from https://gerrit.asterisk.org/#/c/asterisk/+/11182/ By: Ross Beer (rossbeer) 2019-04-03 08:26:53.279-0500 This issue appears to be happening more since the latest patch has been applied By: Ross Beer (rossbeer) 2019-04-03 08:42:11.590-0500 One thing that differs on our code base is that there is a patch that changes the following so that more timers can exist: -typedef int pj_timer_id_t; +typedef long int pj_timer_id_t; The number shown in the backtrace appears to show this working as the ID is -140068368185622. By: Sean Bright (seanbright) 2019-04-03 08:49:29.488-0500 [~rossbeer], we can't support code that you have modified. By: Ross Beer (rossbeer) 2019-04-03 08:50:35.819-0500 One of the Digium devs gave me the patch, I will remove this and continue testing By: Ross Beer (rossbeer) 2019-04-04 03:32:27.786-0500 A further segfault has occurred, please see lates backtrace. By: Ross Beer (rossbeer) 2019-04-08 10:21:46.958-0500 Further timer crashes are happening within PJSIP By: Ross Beer (rossbeer) 2019-04-23 04:20:27.573-0500 PJSIP patch https://trac.pjsip.org/repos/changeset/5971 deals with timer issues, could this help resolve this issue? I note that removing the previous patch https://gerrit.asterisk.org/c/asterisk/+/11182 has lessened the frequency of the crashes. By: Ross Beer (rossbeer) 2019-04-29 09:41:39.019-0500 I'm pretty sure this is related to the recent PJSIP patch above. If anyone could provide a patch to asterisk it would be appreciated. I will happily test. By: Friendly Automation (friendly-automation) 2019-05-06 05:45:28.616-0500 Change 11344 merged by Friendly Automation: pjproject-bundled: Add upstream timer fixes [https://gerrit.asterisk.org/c/asterisk/+/11344|https://gerrit.asterisk.org/c/asterisk/+/11344] By: Friendly Automation (friendly-automation) 2019-05-22 09:17:15.544-0500 Change 11396 merged by George Joseph: pjproject-bundled: Add upstream timer fixes [https://gerrit.asterisk.org/c/asterisk/+/11396|https://gerrit.asterisk.org/c/asterisk/+/11396] By: Friendly Automation (friendly-automation) 2019-05-22 11:39:52.614-0500 Change 11392 merged by Friendly Automation: pjproject-bundled: Add upstream timer fixes [https://gerrit.asterisk.org/c/asterisk/+/11392|https://gerrit.asterisk.org/c/asterisk/+/11392] By: Friendly Automation (friendly-automation) 2019-05-22 12:09:54.275-0500 Change 11393 merged by Friendly Automation: pjproject-bundled: Add upstream timer fixes [https://gerrit.asterisk.org/c/asterisk/+/11393|https://gerrit.asterisk.org/c/asterisk/+/11393] By: Friendly Automation (friendly-automation) 2019-05-22 12:10:45.873-0500 Change 11391 merged by Friendly Automation: pjproject-bundled: Add upstream timer fixes [https://gerrit.asterisk.org/c/asterisk/+/11391|https://gerrit.asterisk.org/c/asterisk/+/11391] By: Ross Beer (rossbeer) 2019-05-29 05:39:31.226-0500 The upstream patches do not resolve the issue. By: Asterisk Team (asteriskteam) 2019-05-29 05:39:31.764-0500 This issue has been reopened as a result of your commenting on it as the reporter. It will be triaged once again as applicable. By: Ross Beer (rossbeer) 2019-05-30 04:48:29.094-0500 Adding a further backtrace from another segfault By: Kevin Harwell (kharwell) 2020-06-11 13:35:43.079-0500 [~rossbeer] Not sure if you are still having problems. If so you might want to try pjproject 2.10 if you have not already (I think they worked on some timer stuff). If you want a head start you can get our patch for bundled here: https://gerrit.asterisk.org/c/asterisk/+/14413 Otherwise it'll go out in the next release in a few weeks. By: Ross Beer (rossbeer) 2020-06-11 14:35:09.584-0500 I've been running with two pjproject timer patches for about a year which resolves the issue. Thank you for including these in the code base, I will test these now. By: Kevin Harwell (kharwell) 2020-06-15 17:12:50.544-0500 Ah okay from a previous comment I was under the impression you were still possibly having an issue. Glad to hear you are not. Since that is the case I'm going to close this issue. |