Summary: | ASTERISK-26344: Asterisk 13.11.0 + PJSIP crash | ||||||||||
Reporter: | Ian Gilmour (tuxian) | Labels: | |||||||||
Date Opened: | 2016-09-07 09:19:16 | Date Closed: | 2016-10-31 11:38:20 | ||||||||
Priority: | Major | Regression? | |||||||||
Status: | Closed/Complete | Components: | pjproject/pjsip | ||||||||
Versions: | 13.11.0 | Frequency of Occurrence | |||||||||
Related Issues: |
| ||||||||||
Environment: | Centos 6.8 (64-bit) + Asterisk-13.11.0 (with bundled 2.5.5 pjsip) + libsrtp-1.5.4 | Attachments: | ( 0) 0001-r5400-pjsip_tx_data_dec_ref.patch ( 1) 0006_jira_asterisk_26344_v13_fix_cache.patch ( 2) 0006_jira_asterisk_26344_v13_fix.patch ( 3) 2016-10-28.tgz ( 4) cli-and-gdb.tgz ( 5) cli-and-gdb.tgz ( 6) cli-and-gdb-2016-10-24.tgz ( 7) cli-and-gdb-2016-10-24-crash2.tgz ( 8) cli-and-gdb-2016-10-25-crash1.tgz ( 9) cli-and-gdb-2016-10-26-crash1-after-fix.tgz (10) cli-and-gdb-2016-10-27.tgz (11) cli-and-gdb-3-crashes.tgz (12) cli-and-gdb-bt-on-destroying_tx_data.tgz (13) cli-and-gdb-inc-dec-ref-logging.tgz (14) jira_asterisk_26344_v13_debuging.patch | ||||||||
Description: | Hi,
I have a development Asterisk 13.11.0 test setup (uses the bundled pjsip-2.5.5). Environment is Centos 6.8 (64-bit) + Asterisk-13.11.0 + libsrtp-1.5.4. On startup Asterisk registers 5 Asterisk users with a remote OpenSIPS server, over TLS, using PJSIP. As part of the test all 5 Asterisk PJSIP users are reregistered with OpenSIPS Server every couple of mins. All outgoing/incoming pjsip call media is encrypted using SRTP and via an external RTPPROXY running alongside the external OpenSIPS Server. Asterisk is also configured to use chan_sip on 127.0.0.1:5060 to allow calls from a locally run SIPp process. All SIPp calls are TCP+RTP. I use SIPp to run multiple concurrent loopback calls (calls vary in duration) through Asterisk to the OpenSIPS server and back to an echo() service running on the same Asterisk). i.e. {noformat} SIPp <-TCP/RTP-> chan_sip <-> chan_pjsip <-TLS/SRTP-> OpenSIPS server (+ rtpproxy) <-TLS/SRTP-> chan_pjsip (echo service). {noformat} Initially I see all chan_pjsip registrations and reregistrations for all 5 PJSIP users go out through a single TCP port. I then start a SIPp test running multiple concurrent calls. At some point into the test the Asterisk PJSIP TCP port gets closed and reopened - when it does so I see Asterisk crash shortly afterwards. Possibly significantly\(?) the time of the crash was around the time one of the PJSIP users should have reregistered after the TCP outgoing port change (The log shows all 5 PJSIP users reregistering after the PJSIP TCP port change, but only 4 of the 5 reregistering twice before the crash). {noformat} Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffa2814700 (LWP 7166)] __pthread_mutex_lock (mutex=0x44492d6c6c6143) at pthread_mutex_lock.c:50 50 unsigned int type = PTHREAD_MUTEX_TYPE (mutex); (gdb) bt #0 __pthread_mutex_lock (mutex=0x44492d6c6c6143) at pthread_mutex_lock.c:50 #1 0x00007ffff78e9d9b in pj_mutex_lock (mutex=0x44492d6c6c6143) at ../src/pj/os_core_unix.c:1265 #2 0x00007ffff78e9e39 in pj_atomic_dec_and_get (atomic_var=0x7fffd8074630) at ../src/pj/os_core_unix.c:962 #3 0x00007ffff787d7e0 in pjsip_tx_data_dec_ref (tdata=0x7fff8c3bfab8) at ../src/pjsip/sip_transport.c:495 #4 0x00007ffff788a087 in tsx_shutdown (tsx=0x7fff94060a98) at ../src/pjsip/sip_transaction.c:1062 #5 0x00007ffff788b4bc in tsx_set_state (tsx=0x7fff94060a98, state=PJSIP_TSX_STATE_DESTROYED, event_src_type=PJSIP_EVENT_TIMER, event_src=0x7fff94060c50, flag=0) at ../src/pjsip/sip_transaction.c:1271 #6 0x00007ffff788b88e in tsx_on_state_terminated (tsx=<value optimized out>, event=<value optimized out>) at ../src/pjsip/sip_transaction.c:3337 #7 0x00007ffff788bcd5 in tsx_timer_callback (theap=<value optimized out>, entry=0x7fff94060c50) at ../src/pjsip/sip_transaction.c:1171 #8 0x00007ffff78fc449 in pj_timer_heap_poll (ht=0x1137950, next_delay=0x7fffa2813d30) at ../src/pj/timer.c:643 #9 0x00007ffff7875b19 in pjsip_endpt_handle_events2 (endpt=0x1137668, max_timeout=0x7fffa2813d70, p_count=0x0) at ../src/pjsip/sip_endpoint.c:712 #10 0x00007ffff1320b00 in monitor_thread_exec (endpt=<value optimized out>) at res_pjsip.c:3889 #11 0x00007ffff78ea5d6 in thread_main (param=0x114dee8) at ../src/pj/os_core_unix.c:541 #12 0x00007ffff5a8faa1 in start_thread (arg=0x7fffa2814700) at pthread_create.c:301 #13 0x00007ffff509baad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) bt full #0 __pthread_mutex_lock (mutex=0x44492d6c6c6143) at pthread_mutex_lock.c:50 type = <value optimized out> id = <value optimized out> #1 
0x00007ffff78e9d9b in pj_mutex_lock (mutex=0x44492d6c6c6143) at ../src/pj/os_core_unix.c:1265 status = <value optimized out> #2 0x00007ffff78e9e39 in pj_atomic_dec_and_get (atomic_var=0x7fffd8074630) at ../src/pj/os_core_unix.c:962 new_value = <value optimized out> #3 0x00007ffff787d7e0 in pjsip_tx_data_dec_ref (tdata=0x7fff8c3bfab8) at ../src/pjsip/sip_transport.c:495 No locals. #4 0x00007ffff788a087 in tsx_shutdown (tsx=0x7fff94060a98) at ../src/pjsip/sip_transaction.c:1062 No locals. #5 0x00007ffff788b4bc in tsx_set_state (tsx=0x7fff94060a98, state=PJSIP_TSX_STATE_DESTROYED, event_src_type=PJSIP_EVENT_TIMER, event_src=0x7fff94060c50, flag=0) at ../src/pjsip/sip_transaction.c:1271 prev_state = PJSIP_TSX_STATE_TERMINATED #6 0x00007ffff788b88e in tsx_on_state_terminated (tsx=<value optimized out>, event=<value optimized out>) at ../src/pjsip/sip_transaction.c:3337 No locals. #7 0x00007ffff788bcd5 in tsx_timer_callback (theap=<value optimized out>, entry=0x7fff94060c50) at ../src/pjsip/sip_transaction.c:1171 event = {prev = 0x7fff8c5f4908, next = 0x1bfe, type = PJSIP_EVENT_TIMER, body = {timer = {entry = 0x7fff94060c50}, tsx_state = {src = {rdata = 0x7fff94060c50, tdata = 0x7fff94060c50, timer = 0x7fff94060c50, status = -1811542960, data = 0x7fff94060c50}, tsx = 0x7fffa2813c90, prev_state = -1568588592, type = 32767}, tx_msg = {tdata = 0x7fff94060c50}, tx_error = {tdata = 0x7fff94060c50, tsx = 0x7fffa2813c90}, rx_msg = {rdata = 0x7fff94060c50}, user = {user1 = 0x7fff94060c50, user2 = 0x7fffa2813c90, user3 = 0x7fffa2813cd0, user4 = 0x0}}} tsx = 0x7fff94060a98 #8 0x00007ffff78fc449 in pj_timer_heap_poll (ht=0x1137950, next_delay=0x7fffa2813d30) at ../src/pj/timer.c:643 node = 0x7fff94060c50 grp_lock = 0x7fffd8000ab8 now = {sec = 613363, msec = 925} count = 2 #9 0x00007ffff7875b19 in pjsip_endpt_handle_events2 (endpt=0x1137668, max_timeout=0x7fffa2813d70, p_count=0x0) at ../src/pjsip/sip_endpoint.c:712 timeout = {sec = 0, msec = 0} count = 0 net_event_count = 0 c = <value optimized out> #10 0x00007ffff1320b00 in monitor_thread_exec (endpt=<value optimized out>) at res_pjsip.c:3889 delay = {sec = 0, msec = 10} #11 0x00007ffff78ea5d6 in thread_main (param=0x114dee8) at ../src/pj/os_core_unix.c:541 rec = 0x114dee8 result = <value optimized out> #12 0x00007ffff5a8faa1 in start_thread (arg=0x7fffa2814700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fffa2814700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140735919769344, -4896504223120570676, 140737488337344, 140735919770048, 0, 3, 4896356555646224076, 4896525551845689036}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #13 0x00007ffff509baad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. {noformat} | ||||||||||
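The faulting frames above (pjsip_tx_data_dec_ref() ending up in pj_mutex_lock() via pj_atomic_dec_and_get(), with a mutex pointer of 0x44492d6c6c6143, which is the ASCII text "Call-ID" read as a little-endian value) are consistent with the tdata's reference counter, and the mutex pjlib pairs with it, living in memory that has already been freed and reused for SIP header text. Below is a minimal sketch of that failure mode, assuming a mutex-backed reference count like the pjlib fallback visible in frames #1-#3 of the backtrace; every name in the sketch is hypothetical and none of it is pjproject source.
{noformat}
/*
 * Sketch only: models a reference count that is "atomic" by virtue of a
 * per-object mutex (as in the pjlib build above, where
 * pj_atomic_dec_and_get() calls pj_mutex_lock()).  Dropping the last
 * reference frees the object, so any later dec_ref through a stale
 * pointer locks a mutex that lives in freed, possibly reused memory.
 *
 * Build with: gcc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct tx_data {
    pthread_mutex_t lock;   /* stands in for the pj_atomic_t's mutex */
    int ref_cnt;
};

static struct tx_data *tx_data_create(void)
{
    struct tx_data *td = calloc(1, sizeof(*td));

    pthread_mutex_init(&td->lock, NULL);
    td->ref_cnt = 1;
    return td;
}

/* Returns 1 when the object was destroyed (cf. PJSIP_EBUFDESTROYED). */
static int tx_data_dec_ref(struct tx_data *td)
{
    int remaining;

    pthread_mutex_lock(&td->lock);   /* faults if td was already freed */
    remaining = --td->ref_cnt;
    pthread_mutex_unlock(&td->lock);

    if (remaining == 0) {
        pthread_mutex_destroy(&td->lock);
        free(td);                    /* memory may be reused immediately */
        return 1;
    }
    return 0;
}

int main(void)
{
    struct tx_data *td = tx_data_create();

    tx_data_dec_ref(td);   /* rightful owner drops its reference: freed */
    tx_data_dec_ref(td);   /* stale copy (e.g. tsx->last_tx): use-after-free */
    puts("not normally reached; the second call typically faults");
    return 0;
}
{noformat}
Under valgrind the second call is reported as an invalid access to freed memory, which is the same class of failure the backtrace above shows.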
Comments: | By: Asterisk Team (asteriskteam) 2016-09-07 09:19:17.415-0500 Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report. Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. By: Rusty Newton (rnewton) 2016-09-07 20:46:53.734-0500 Were you able to reproduce this? Or did it only happen the one time? If you can reproduce it, can you recompile with DONT_OPTIMIZE and BETTER_BACKTRACES and provide a new trace? Along, with that a full log showing verbose and debug( both level 5 or above) up to the time of the crash? By: Ian Gilmour (tuxian) 2016-09-08 04:00:40.558-0500 I repeated the test (with DONT_OPTIMIZE and BETTER_BACKTRACES) this morning - got another (different) crash after ~50 mins. During the test it closed and reopened PJSIP TCP port a few times and appeared to recover ok each time. n.b. * Full log file is > 1GB in size - so I've only attaching the tail end of it (let me know if you need more). * I inadvertently left PJ_LOG_MAX_LEVEL set to 5 during the run (I can rerun the test with it set to the default 3 if need be). {noformat} Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fff420c1700 (LWP 21901)] 0x00007ffff78e36d2 in find_entry (pool=0x0, ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=11, val=0x0, hval=0x7fff420c071c, entry_buf=0x0, lower=0) at ../src/pj/hash.c:177 177 for (p_entry = &ht->table[hash & ht->rows], entry=*p_entry; Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.5-7.el6_0.x86_64 cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 db4-4.7.25-20.el6_7.x86_64 elfutils-libelf-0.164-2.el6.x86_64 gsm-1.0.13-4.el6.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-57.el6.x86_64 libacl-2.2.49-6.el6.x86_64 libattr-2.4.44-7.el6.x86_64 libcap-2.16-5.5.el6.x86_64 libcom_err-1.41.12-22.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64 libtool-ltdl-2.2.6-15.5.el6.x86_64 libuuid-2.17.2-12.24.el6.x86_64 libxml2-2.7.6-21.el6_8.1.x86_64 libxslt-1.1.26-2.el6_3.1.x86_64 lm_sensors-libs-3.1.1-17.el6.x86_64 lua-5.1.4-4.1.el6.x86_64 mysql-libs-5.1.73-7.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64 net-snmp-libs-5.5-57.el6.x86_64 nspr-4.11.0-1.el6.x86_64 nss-3.21.0-8.el6.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.21.0-2.el6.x86_64 openldap-2.4.40-12.el6.x86_64 perl-libs-5.10.1-141.el6_7.1.x86_64 popt-1.13-7.el6.x86_64 rpm-libs-4.8.0-55.el6.x86_64 speex-1.2-0.12.rc1.1.el6.x86_64 sqlite-3.6.20-1.el6_7.2.x86_64 tcp_wrappers-libs-7.6-58.el6.x86_64 unixODBC-2.2.14-14.el6.x86_64 xz-libs-4.999.9-0.5.beta.20091007git.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) bt #0 0x00007ffff78e36d2 in find_entry (pool=0x0, ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=11, val=0x0, hval=0x7fff420c071c, entry_buf=0x0, lower=0) at ../src/pj/hash.c:177 #1 0x00007ffff78e3890 in pj_hash_get (ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=4294967295, hval=0x7fff420c071c) at ../src/pj/hash.c:230 #2 0x00007ffff12c168f in 
ast_sip_dict_get (ht=0x7fff88277558, key=0x7fffa2bb7b05 "on_response") at res_pjsip.c:3954 #3 0x00007fffa2bb5bb2 in session_inv_on_tsx_state_changed (inv=0x7fff94049678, tsx=0x7fff8976af48, e=0x7fff420c08d0) at res_pjsip_session.c:2623 #4 0x00007ffff7818594 in mod_inv_on_tsx_state (tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip-ua/sip_inv.c:699 #5 0x00007ffff7864460 in pjsip_dlg_on_tsx_state (dlg=0x7fff887891d8, tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip/sip_dialog.c:2055 #6 0x00007ffff7864cd0 in mod_ua_on_tsx_state (tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip/sip_ua_layer.c:178 #7 0x00007ffff785cdf8 in tsx_set_state (tsx=0x7fff8976af48, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG, event_src=0x7fffd8257bc8, flag=0) at ../src/pjsip/sip_transaction.c:1233 #8 0x00007ffff785fd39 in tsx_on_state_proceeding_uac (tsx=0x7fff8976af48, event=0x7fff420c0a20) at ../src/pjsip/sip_transaction.c:2957 #9 0x00007ffff785f357 in tsx_on_state_calling (tsx=0x7fff8976af48, event=0x7fff420c0a20) at ../src/pjsip/sip_transaction.c:2540 #10 0x00007ffff785dcef in pjsip_tsx_recv_msg (tsx=0x7fff8976af48, rdata=0x7fffd8257bc8) at ../src/pjsip/sip_transaction.c:1787 #11 0x00007ffff785c28a in mod_tsx_layer_on_rx_response (rdata=0x7fffd8257bc8) at ../src/pjsip/sip_transaction.c:875 #12 0x00007ffff7842b8d in pjsip_endpt_process_rx_data (endpt=0x123d1a8, rdata=0x7fffd8257bc8, p=0x7ffff14f7e40, p_handled=0x7fff420c0b64) at ../src/pjsip/sip_endpoint.c:894 #13 0x00007ffff12dce59 in distribute (data=0x7fffd8257bc8) at res_pjsip/pjsip_distributor.c:765 #14 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x7fff94094648) at taskprocessor.c:938 #15 0x00000000005f6aca in execute_tasks (data=0x7fff94094648) at threadpool.c:1322 #16 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x1236e88) at taskprocessor.c:938 #17 0x00000000005f4dd5 in threadpool_execute (pool=0x123ab28) at threadpool.c:351 #18 0x00000000005f63e8 in worker_active (worker=0x7fff9006c868) at threadpool.c:1105 #19 0x00000000005f6194 in worker_start (arg=0x7fff9006c868) at threadpool.c:1024 #20 0x00000000006023ee in dummy_start (data=0x7fff900696d0) at utils.c:1235 #21 0x00007ffff5a4baa1 in start_thread (arg=0x7fff420c1700) at pthread_create.c:301 #22 0x00007ffff5057aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) bt full #0 0x00007ffff78e36d2 in find_entry (pool=0x0, ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=11, val=0x0, hval=0x7fff420c071c, entry_buf=0x0, lower=0) at ../src/pj/hash.c:177 hash = 1918918923 p_entry = 0x3085c entry = 0x57d11483 #1 0x00007ffff78e3890 in pj_hash_get (ht=0x7fff88277558, key=0x7fffa2bb7b05, keylen=4294967295, hval=0x7fff420c071c) at ../src/pj/hash.c:230 entry = 0x7fff94049678 #2 0x00007ffff12c168f in ast_sip_dict_get (ht=0x7fff88277558, key=0x7fffa2bb7b05 "on_response") at res_pjsip.c:3954 hval = 1918918923 #3 0x00007fffa2bb5bb2 in session_inv_on_tsx_state_changed (inv=0x7fff94049678, tsx=0x7fff8976af48, e=0x7fff420c08d0) at res_pjsip_session.c:2623 cb = 0x4 id = 15 session = 0x7fff9404cab8 tdata = 0x7ffff78dd302 __PRETTY_FUNCTION__ = "session_inv_on_tsx_state_changed" #4 0x00007ffff7818594 in mod_inv_on_tsx_state (tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip-ua/sip_inv.c:699 dlg = 0x7fff887891d8 inv = 0x7fff94049678 #5 0x00007ffff7864460 in pjsip_dlg_on_tsx_state (dlg=0x7fff887891d8, tsx=0x7fff8976af48, e=0x7fff420c08d0) at ../src/pjsip/sip_dialog.c:2055 i = 2 #6 0x00007ffff7864cd0 in mod_ua_on_tsx_state (tsx=0x7fff8976af48, e=0x7fff420c08d0) at 
../src/pjsip/sip_ua_layer.c:178 dlg = 0x7fff887891d8 #7 0x00007ffff785cdf8 in tsx_set_state (tsx=0x7fff8976af48, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG, event_src=0x7fffd8257bc8, flag=0) at ../src/pjsip/sip_transaction.c:1233 e = {prev = 0x7fff420c0910, next = 0x7ffff78f33cb, type = PJSIP_EVENT_TSX_STATE, body = {timer = {entry = 0x7fffd8257bc8}, tsx_state = {src = {rdata = 0x7fffd8257bc8, tdata = 0x7fffd8257bc8, timer = 0x7fffd8257bc8, status = -668632120, data = 0x7fffd8257bc8}, tsx = 0x7fff8976af48, prev_state = 1, type = PJSIP_EVENT_RX_MSG}, tx_msg = {tdata = 0x7fffd8257bc8}, tx_error = {tdata = 0x7fffd8257bc8, tsx = 0x7fff8976af48}, rx_msg = {rdata = 0x7fffd8257bc8}, user = {user1 = 0x7fffd8257bc8, user2 = 0x7fff8976af48, user3 = 0x300000001, user4 = 0x28976aff0}}} prev_state = PJSIP_TSX_STATE_CALLING #8 0x00007ffff785fd39 in tsx_on_state_proceeding_uac (tsx=0x7fff8976af48, event=0x7fff420c0a20) at ../src/pjsip/sip_transaction.c:2957 timeout = {sec = 0, msec = 0} #9 0x00007ffff785f357 in tsx_on_state_calling (tsx=0x7fff8976af48, event=0x7fff420c0a20) at ../src/pjsip/sip_transaction.c:2540 msg = 0x7fffd8259690 code = 200 #10 0x00007ffff785dcef in pjsip_tsx_recv_msg (tsx=0x7fff8976af48, rdata=0x7fffd8257bc8) at ../src/pjsip/sip_transaction.c:1787 event = {prev = 0x0, next = 0x7fff8816e528, type = PJSIP_EVENT_RX_MSG, body = {timer = {entry = 0x7fffd8257bc8}, tsx_state = {src = {rdata = 0x7fffd8257bc8, tdata = 0x7fffd8257bc8, timer = 0x7fffd8257bc8, status = -668632120, data = 0x7fffd8257bc8}, tsx = 0x7fff420c0a70, prev_state = -141699390, type = 32767}, tx_msg = {tdata = 0x7fffd8257bc8}, tx_error = {tdata = 0x7fffd8257bc8, tsx = 0x7fff420c0a70}, rx_msg = {rdata = 0x7fffd8257bc8}, user = {user1 = 0x7fffd8257bc8, user2 = 0x7fff420c0a70, user3 = 0x7ffff78dd6c2, user4 = 0x7fff420c0a70}}} #11 0x00007ffff785c28a in mod_tsx_layer_on_rx_response (rdata=0x7fffd8257bc8) at ../src/pjsip/sip_transaction.c:875 key = {ptr = 0x7fff89121f80 "c$BYE$z9hG4bKPj4ce37655-c6fe-467e-9834-0cd83e7839c6", slen = 51} hval = 452279487 tsx = 0x7fff8976af48 #12 0x00007ffff7842b8d in pjsip_endpt_process_rx_data (endpt=0x123d1a8, rdata=0x7fffd8257bc8, p=0x7ffff14f7e40, p_handled=0x7fff420c0b64) at ../src/pjsip/sip_endpoint.c:894 msg = 0x7fffd8259690 def_prm = {start_prio = 1108085632, start_mod = 0x6f3e0b, idx_after_start = 7291691, silent = 0} mod = 0x7ffff7b54220 handled = 0 i = 1 status = 0 #13 0x00007ffff12dce59 in distribute (data=0x7fffd8257bc8) at res_pjsip/pjsip_distributor.c:765 param = {start_prio = 0, start_mod = 0x7ffff14f7ca0, idx_after_start = 1, silent = 0} handled = 0 rdata = 0x7fffd8257bc8 is_request = 0 is_ack = 0 endpoint = 0x7fff00000000 #14 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x7fff94094648) at taskprocessor.c:938 local = {local_data = 0x7fff420c19c0, data = 0x600b26} t = 0x7fffd817e970 size = 9989040 __PRETTY_FUNCTION__ = "ast_taskprocessor_execute" #15 0x00000000005f6aca in execute_tasks (data=0x7fff94094648) at threadpool.c:1322 tps = 0x7fff94094648 #16 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x1236e88) at taskprocessor.c:938 local = {local_data = 0x57d11483, data = 0x0} t = 0x7fffd80a2b30 size = 140734301476608 __PRETTY_FUNCTION__ = "ast_taskprocessor_execute" #17 0x00000000005f4dd5 in threadpool_execute (pool=0x123ab28) at threadpool.c:351 __PRETTY_FUNCTION__ = "threadpool_execute" #18 0x00000000005f63e8 in worker_active (worker=0x7fff9006c868) at threadpool.c:1105 alive = 1 #19 0x00000000005f6194 in worker_start (arg=0x7fff9006c868) at 
threadpool.c:1024 worker = 0x7fff9006c868 saved_state = 32767 __PRETTY_FUNCTION__ = "worker_start" #20 0x00000000006023ee in dummy_start (data=0x7fff900696d0) at utils.c:1235 __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140734301476608, -8273484635704556838, 140737221511808, 140734301477312, 140735609810640, 3, -8273484635729722662, 8273157970701812442}, __mask_was_saved = 0}}, __pad = {0x7fff420c0e30, 0x0, 0x0, 0x20}} __cancel_routine = 0x451815 <ast_unregister_thread> __cancel_arg = 0x7fff420c1700 not_first_call = 0 ret = 0x7ffff52f9858 a = {start_routine = 0x5f610d <worker_start>, data = 0x7fff9006c868, name = 0x7fff900ce6e0 "worker_start started at [ 1079] threadpool.c worker_thread_start()"} #21 0x00007ffff5a4baa1 in start_thread (arg=0x7fff420c1700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fff420c1700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140734301476608, 8273158244416717530, 140737221511808, 140734301477312, 140735609810640, 3, -8273484635702459686, -8273179729874550054}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #22 0x00007ffff5057aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. {noformat} By: Ian Gilmour (tuxian) 2016-09-08 04:11:32.476-0500 cli prior to crash + gdb output. By: Ian Gilmour (tuxian) 2016-09-09 06:42:04.940-0500 This morning (after a rebuild), using the same config, I got another crash - this time it occurred immediately on asterisk startup (i.e. no calls running)!! Could it be caused by some sort of race condition on multiple PJSIP registrations to the same SIP server? I can't think of anything else that's odd about this setup. {noformat} Program received signal SIGSEGV, Segmentation fault. 
[Switching to Thread 0x7fffa18ed700 (LWP 16459)] 0x00007ffff741b1b4 in pj_atomic_inc_and_get (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:935 935 pj_mutex_lock( atomic_var->mutex ); (gdb) bt #0 0x00007ffff741b1b4 in pj_atomic_inc_and_get (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:935 #1 0x00007ffff741b20c in pj_atomic_inc (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:949 #2 0x00007ffff7387836 in pjsip_tx_data_add_ref (tdata=0x7fff8800d958) at ../src/pjsip/sip_transport.c:464 #3 0x00007fff6410e929 in sip_outbound_registration_response_cb (param=0x7fffa18ec5a0) at res_pjsip_outbound_registration.c:939 #4 0x00007ffff7360202 in call_callback (regc=0x7fff88008338, status=0, st_code=200, reason=0x7fffcc006ae0, rdata=0x7fffcc005008, expiration=120, contact_cnt=1, contact=0x7fffa18ec680, is_unreg=0) at ../src/pjsip-ua/sip_reg.c:775 #5 0x00007ffff73614c1 in regc_tsx_callback (token=0x7fff88008338, event=0x7fffa18ec8d0) at ../src/pjsip-ua/sip_reg.c:1345 #6 0x00007ffff739e59f in mod_util_on_tsx_state (tsx=0x7fff88015b98, event=0x7fffa18ec8d0) at ../src/pjsip/sip_util_statefull.c:81 #7 0x00007ffff739ae90 in tsx_set_state (tsx=0x7fff88015b98, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG, event_src=0x7fffcc005008, flag=0) at ../src/pjsip/sip_transaction.c:1233 #8 0x00007ffff739ddd1 in tsx_on_state_proceeding_uac (tsx=0x7fff88015b98, event=0x7fffa18eca20) at ../src/pjsip/sip_transaction.c:2957 #9 0x00007ffff739d3ef in tsx_on_state_calling (tsx=0x7fff88015b98, event=0x7fffa18eca20) at ../src/pjsip/sip_transaction.c:2540 #10 0x00007ffff739bd87 in pjsip_tsx_recv_msg (tsx=0x7fff88015b98, rdata=0x7fffcc005008) at ../src/pjsip/sip_transaction.c:1787 #11 0x00007ffff739a322 in mod_tsx_layer_on_rx_response (rdata=0x7fffcc005008) at ../src/pjsip/sip_transaction.c:875 #12 0x00007ffff7380c25 in pjsip_endpt_process_rx_data (endpt=0x123f798, rdata=0x7fffcc005008, p=0x7ffff1035e40, p_handled=0x7fffa18ecb64) at ../src/pjsip/sip_endpoint.c:894 #13 0x00007ffff0e1ae59 in distribute (data=0x7fffcc005008) at res_pjsip/pjsip_distributor.c:765 #14 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x11a7cb8) at taskprocessor.c:938 #15 0x00000000005f6aca in execute_tasks (data=0x11a7cb8) at threadpool.c:1322 #16 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x123ce38) at taskprocessor.c:938 #17 0x00000000005f4dd5 in threadpool_execute (pool=0x123d088) at threadpool.c:351 #18 0x00000000005f63e8 in worker_active (worker=0x7fff90000ee8) at threadpool.c:1105 #19 0x00000000005f6194 in worker_start (arg=0x7fff90000ee8) at threadpool.c:1024 #20 0x00000000006023ee in dummy_start (data=0x7fff90000e70) at utils.c:1235 #21 0x00007ffff5589aa1 in start_thread (arg=0x7fffa18ed700) at pthread_create.c:301 #22 0x00007ffff4b95aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) bt full #0 0x00007ffff741b1b4 in pj_atomic_inc_and_get (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:935 new_value = 140737354108928 #1 0x00007ffff741b20c in pj_atomic_inc (atomic_var=0x3536663238336466) at ../src/pj/os_core_unix.c:949 No locals. #2 0x00007ffff7387836 in pjsip_tx_data_add_ref (tdata=0x7fff8800d958) at ../src/pjsip/sip_transport.c:464 No locals. 
#3 0x00007fff6410e929 in sip_outbound_registration_response_cb (param=0x7fffa18ec5a0) at res_pjsip_outbound_registration.c:939 retry_after = 0x0 tsx = 0x7fff88015b98 client_state = 0x11a7b78 response = 0x7fff88018b88 callback_invoked = 0x7fff88010060 __PRETTY_FUNCTION__ = "sip_outbound_registration_response_cb" #4 0x00007ffff7360202 in call_callback (regc=0x7fff88008338, status=0, st_code=200, reason=0x7fffcc006ae0, rdata=0x7fffcc005008, expiration=120, contact_cnt=1, contact=0x7fffa18ec680, is_unreg=0) at ../src/pjsip-ua/sip_reg.c:775 cbparam = {regc = 0x7fff88008338, token = 0x11a7b78, status = 0, code = 200, reason = {ptr = 0x7fffcc006b40 "OK.5", slen = 2}, rdata = 0x7fffcc005008, expiration = 120, contact_cnt = 1, contact = {0x7fff8800db98, 0xa18ec680, 0x0, 0x7fffa18ec620, 0x7ffff741b75a, 0x7fff8800dc40, 0x7fff88007fa8, 0x7835015fb8, 0x88008338, 0x7fffa18ec640}, is_unreg = 0} #5 0x00007ffff73614c1 in regc_tsx_callback (token=0x7fff88008338, event=0x7fffa18ec8d0) at ../src/pjsip-ua/sip_reg.c:1345 is_unreg = 0 rdata = 0x7fffcc005008 expiration = 120 contact_cnt = 1 contact = {0x7fff8800db98, 0x45e035, 0xffffffffa18ec6b0, 0x7fff880119e8, 0x7fff8800e388, 0x11a7bf0, 0x7fffa18ec830, 0x7fffa18ec820, 0x7fffa18ec710, 0x7ffff741cc2c} status = 48 regc = 0x7fff88008338 tsx = 0x7fff88015b98 handled = 0 update_contact = 0 #6 0x00007ffff739e59f in mod_util_on_tsx_state (tsx=0x7fff88015b98, event=0x7fffa18ec8d0) at ../src/pjsip/sip_util_statefull.c:81 tsx_data = 0x7fff88015fb8 #7 0x00007ffff739ae90 in tsx_set_state (tsx=0x7fff88015b98, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_RX_MSG, event_src=0x7fffcc005008, flag=0) at ../src/pjsip/sip_transaction.c:1233 e = {prev = 0x7fffa18ec910, next = 0x7ffff7431463, type = PJSIP_EVENT_TSX_STATE, body = {timer = {entry = 0x7fffcc005008}, tsx_state = {src = {rdata = 0x7fffcc005008, tdata = 0x7fffcc005008, timer = 0x7fffcc005008, status = -872394744, data = 0x7fffcc005008}, tsx = 0x7fff88015b98, prev_state = 1, type = PJSIP_EVENT_RX_MSG}, tx_msg = {tdata = 0x7fffcc005008}, tx_error = {tdata = 0x7fffcc005008, tsx = 0x7fff88015b98}, rx_msg = {rdata = 0x7fffcc005008}, user = {user1 = 0x7fffcc005008, user2 = 0x7fff88015b98, user3 = 0x300000001, user4 = 0x288015c40}}} prev_state = PJSIP_TSX_STATE_CALLING #8 0x00007ffff739ddd1 in tsx_on_state_proceeding_uac (tsx=0x7fff88015b98, event=0x7fffa18eca20) at ../src/pjsip/sip_transaction.c:2957 timeout = {sec = 0, msec = 0} #9 0x00007ffff739d3ef in tsx_on_state_calling (tsx=0x7fff88015b98, event=0x7fffa18eca20) at ../src/pjsip/sip_transaction.c:2540 msg = 0x7fffcc006ad0 code = 200 #10 0x00007ffff739bd87 in pjsip_tsx_recv_msg (tsx=0x7fff88015b98, rdata=0x7fffcc005008) at ../src/pjsip/sip_transaction.c:1787 event = {prev = 0x0, next = 0x7fff88016438, type = PJSIP_EVENT_RX_MSG, body = {timer = {entry = 0x7fffcc005008}, tsx_state = {src = {rdata = 0x7fffcc005008, tdata = 0x7fffcc005008, timer = 0x7fffcc005008, status = -872394744, data = 0x7fffcc005008}, tsx = 0x7fffa18eca70, prev_state = -146688166, type = 32767}, tx_msg = {tdata = 0x7fffcc005008}, tx_error = {tdata = 0x7fffcc005008, tsx = 0x7fffa18eca70}, rx_msg = {rdata = 0x7fffcc005008}, user = {user1 = 0x7fffcc005008, user2 = 0x7fffa18eca70, user3 = 0x7ffff741b75a, user4 = 0x7fffa18eca70}}} #11 0x00007ffff739a322 in mod_tsx_layer_on_rx_response (rdata=0x7fffcc005008) at ../src/pjsip/sip_transaction.c:875 key = {ptr = 0x7fff8800de50 "c$REGISTER$z9hG4bKPjc25c7b82-ee5b-4ac0-9679-dfe63f1cfdef", slen = 56} hval = 4161045600 tsx = 0x7fff88015b98 #12 
0x00007ffff7380c25 in pjsip_endpt_process_rx_data (endpt=0x123f798, rdata=0x7fffcc005008, p=0x7ffff1035e40, p_handled=0x7fffa18ecb64) at ../src/pjsip/sip_endpoint.c:894 msg = 0x7fffcc006ad0 def_prm = {start_prio = 2710489984, start_mod = 0x6f3e0b, idx_after_start = 2710490000, silent = 32767} mod = 0x7ffff76922a0 handled = 0 i = 1 status = 0 #13 0x00007ffff0e1ae59 in distribute (data=0x7fffcc005008) at res_pjsip/pjsip_distributor.c:765 param = {start_prio = 0, start_mod = 0x7ffff1035ca0, idx_after_start = 1, silent = 0} handled = 0 rdata = 0x7fffcc005008 is_request = 0 is_ack = 0 endpoint = 0x0 #14 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x11a7cb8) at taskprocessor.c:938 local = {local_data = 0x7fffa18ed9c0, data = 0x600b26} t = 0x7fffcc02a000 size = 9989040 __PRETTY_FUNCTION__ = "ast_taskprocessor_execute" #15 0x00000000005f6aca in execute_tasks (data=0x11a7cb8) at threadpool.c:1322 tps = 0x11a7cb8 #16 0x00000000005ed9cb in ast_taskprocessor_execute (tps=0x123ce38) at taskprocessor.c:938 local = {local_data = 0x57d28c4d, data = 0x0} t = 0x7fffcc029d10 size = 140735903880960 __PRETTY_FUNCTION__ = "ast_taskprocessor_execute" #17 0x00000000005f4dd5 in threadpool_execute (pool=0x123d088) at threadpool.c:351 __PRETTY_FUNCTION__ = "threadpool_execute" #18 0x00000000005f63e8 in worker_active (worker=0x7fff90000ee8) at threadpool.c:1105 alive = 0 #19 0x00000000005f6194 in worker_start (arg=0x7fff90000ee8) at threadpool.c:1024 worker = 0x7fff90000ee8 saved_state = 32767 __PRETTY_FUNCTION__ = "worker_start" #20 0x00000000006023ee in dummy_start (data=0x7fff90000e70) at utils.c:1235 __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140735903880960, -6756505628295498910, 140737220745856, 140735903881664, 0, 3, -6756505628337441950, 6756334233892352866}, __mask_was_saved = 0}}, __pad = {0x7fffa18ece30, 0x0, 0x0, 0x0}} __cancel_routine = 0x451815 <ast_unregister_thread> __cancel_arg = 0x7fffa18ed700 not_first_call = 0 ret = 0x0 a = {start_routine = 0x5f610d <worker_start>, data = 0x7fff90000ee8, name = 0x7fff90001000 "worker_start started at [ 1079] threadpool.c worker_thread_start()"} #21 0x00007ffff5589aa1 in start_thread (arg=0x7fffa18ed700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fffa18ed700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140735903880960, 6756335057497289570, 140737220745856, 140735903881664, 0, 3, -6756505628289207454, -6756322751223106718}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #22 0x00007ffff4b95aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. {noformat} By: Ian Gilmour (tuxian) 2016-09-09 06:54:22.492-0500 cli + gdb output from system startup for new crash. By: Ian Gilmour (tuxian) 2016-09-09 08:06:39.416-0500 Observation - changing res_pjsip_outbound_registration.c:1267 from: {noformat} schedule_registration(state->client_state, (ast_random() % 10) + 1); {noformat} to: {noformat} schedule_registration(state->client_state, 3 + 1); {noformat} seems to gives me a system that crashes on PJSIP user registration during startup (the only time I see it work is when I see a SIP 503 reported for all the initial registration attempts). By: Richard Mudgett (rmudgett) 2016-10-11 15:49:23.647-0500 I have been unable to find a cause for the crash. 
A patch was recently committed to the v13 branch to help gather more information. The patch extends MALLOC_DEBUG to cover PJPROJECT memory pools when you use the --with-pjproject-bundled Asterisk configure script flag. ./configure --with-pjproject-bundled --enable-dev-mode https://gerrit.asterisk.org/#/c/4030/ By: Ian Gilmour (tuxian) 2016-10-13 04:15:19.720-0500 Hi Richard, I built the current Asterisk 13 trunk code (commit 2971c1b4ebf8c4fb5ee2053c5812ea3896d6d17d) to generate the cli-and-gdb-3-crashes.txt. The system was configured with: {noformat} ./configure --prefix=$PREFIX --with-crypto=/usr --with-ssl=/usr \ --with-srtp=$PREFIX --with-libcurl=$PREFIX --with-jansson=$PREFIX \ --with-pjproject-bundled --enable-dev-mode {noformat} and {noformat} menuselect/menuselect \ --enable DONT_OPTIMIZE --enable BETTER_BACKTRACES --enable MALLOC_DEBUG menuselect.makeopts {noformat} I changed 1 line of code - the schedule_registration() call as per above (replacing (ast_random() % 10) with 3) in order to force the issue to occur more quickly, as I am pretty sure it's related to overlapping registrations all going out on the same pjsip transport. With this 1-line change crashes occur almost immediately on startup. Without the change crashes occur more randomly. The cli-and-gdb-3-crashes.txt file contains 3 crashes. I've dumped out some additional data structures with gdb when the crashes occur. At least some of this data looks to be corrupt, e.g. in a print *tsx->last_tx there is an is_pending value of 36. A quick glance through the code suggests this field should only ever be 0 or 1. So it looks like it's potentially accessing data structures that have already been freed and the memory reallocated - but that's just a guess. The var/log/asterisk/mmlog file only contains the "New session" text for each run. By: Ian Gilmour (tuxian) 2016-10-13 09:31:34.956-0500 For cli-and-gdb-inc-dec-ref-logging.txt I modified third-party/pjproject/patches/0001-r5400-pjsip_tx_data_dec_ref.patch to log calls to pjsip_tx_data_add_ref() and pjsip_tx_data_dec_ref(). At the time of the crash it's attempting to use a tdata of 0x7fff80000eb8, but this pjsip_tx_data was destroyed earlier ~[2016-10-13 13:49:34.858]. By: Ian Gilmour (tuxian) 2016-10-13 10:50:43.554-0500 cli-and-gdb-bt-on-destroying_tx_data.tgz shows a slightly different crash with the same root cause. In frame 1 tsx->last_tx is pointing to data that has already been freed. On this gdb trace I set a breakpoint when the pjsip_tx_data is about to be destroyed and got gdb to do a "bt full; cont" on hitting the breakpoint, so you can identify where the pjsip_tx_data destruction is originating. By: Ross Beer (rossbeer) 2016-10-13 10:56:09.813-0500 Not sure if this is of any help, however, could this be fixed by using the SVN version of PJSIP? There has just been a patch put up to help resolve TLS issues: https://trac.pjsip.org/repos/changeset/5459 By: Ian Gilmour (tuxian) 2016-10-13 11:27:43.660-0500 In the case of the last crash, on frame 1 there is an assumption in the code that if tsx->last_tx is non-NULL then it points to valid pjsip_tx_data. But this clearly isn't the case. I suspect the crashes occur because tsx->last_tx is not being set to NULL everywhere it should be (i.e. whenever pjsip_tx_data_dec_ref() returns PJSIP_EBUFDESTROYED).
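As an illustration of that pattern, here is a sketch of a hypothetical helper (not code from the Asterisk or pjproject tree) that pairs the dec_ref with clearing the caller's copy of the pointer, so a stale copy such as tsx->last_tx or response->old_request can never be dereferenced after the buffer is gone:
{noformat}
#include <pjsip.h>

/*
 * Sketch only: safe_tdata_dec_ref() does not exist in the tree; it just
 * illustrates the defensive pattern described above.
 */
static void safe_tdata_dec_ref(pjsip_tx_data **tdata)
{
    if (*tdata) {
        /* pjsip_tx_data_dec_ref() may return PJSIP_EBUFDESTROYED when this
         * was the last reference.  Either way, this copy of the pointer no
         * longer owns a reference, so it must not be used again. */
        pjsip_tx_data_dec_ref(*tdata);
        *tdata = NULL;
    }
}
{noformat}
A caller would then write e.g. safe_tdata_dec_ref(&response->old_request); rather than a bare pjsip_tx_data_dec_ref(response->old_request);.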
There are places in the asterisk code where it sets: {noformat} response->old_request = tsx->last_tx; [sip_outbound_registration_response_cb()] {noformat} and others where: {noformat} pjsip_tx_data_dec_ref(response->old_request); [registration_response_destroy()] {noformat} with no check for the PJSIP_EBUFDESTROYED return value and no setting to NULL of response->old_request or tsx->last_tx as a result. I suspect the error lies in these bits of code rather than the PJSIP TLS code. By: Richard Mudgett (rmudgett) 2016-10-17 18:09:37.503-0500 Both ASTERISK-26387 and ASTERISK-26344 are likely the same issue. I have been studying the logs from both issues. I have noticed that on the ASTERISK-26387 logs there are OPTIONS ping responses for endpoint qualification that are being processed by a different serializer than the request. This can cause reentrancy problems (e.g. crashes). The outgoing OPTIONS requests go out on a pjsip/default serializer while the response is processed by a pjsip/distributor serializer because the distributor cannot find the original serializer that sent the request. I also noticed that when this happened the updated contact status was reporting for an endpoint that needed DNS resolution (sip:sbc.anveno.com was one). On the ASTERISK-26344 logs a similar thing is happening but I see it for outbound registration requests needing DNS resolution. The REGISTER response is being processed by a pjsip/distributor serializer while the request went out on a pjsip/outreg serializer. By: Ian Gilmour (tuxian) 2016-10-21 08:17:42.226-0500 I spent a bit more time looking over the code but I still cannot see where the pjsip_tx_data ref_cnt isn't been incremented correctly. But, to try and confirm that this is the problem, as a test I modified pjsip_tx_data_create() to call pjsip_tx_data_add_ref() just prior to returning PJ_SUCCESS. i.e. {noformat} #if defined(PJ_DEBUG) && PJ_DEBUG!=0 pj_atomic_inc( tdata->mgr->tdata_counter ); #endif *p_tdata = tdata; pjsip_tx_data_add_ref(tdata); return PJ_SUCCESS; } {noformat} and I now have a system that no longer crashes. But (as expected) it does now leak memory. :-) By: Richard Mudgett (rmudgett) 2016-10-21 22:05:08.315-0500 [~tuxian] I still haven't found why the tdata ref count is off. There is some subtle race condition in play here between the serializer and the monitor thread. I am attaching a patch to gather more debug information concerning the tdata ref counts, transaction, and the transaction's group locking. If you can apply the patch and gather the same kind of information you gathered for the [cli-and-gdb.tgz] we might be able to see what is happening. Debug level 5. [^jira_asterisk_26344_v13_debuging.patch] - This patch replaces the patch you attached. # On a clean workspace apply the patch: {{patch -p1 -i jira_asterisk_26344_v13_debuging.patch}} # Modify the {{schedule_registration(state->client_state, 3 + 1);}} line to cause the crash so we can see which tdata to look for in the debug log. # Configure asterisk to use the bundled pjproject. FYI: With MALLOC_DEBUG enabled, when you see something reference the 0xdeaddeaddeaddead address you know someone is attempting to use already freed memory. By: Ian Gilmour (tuxian) 2016-10-24 08:55:52.438-0500 Hi Richard, cli-and-gdb-2016-10-14.tgz log was generated after: # applying your patch. 
# using: {noformat}schedule_registration(state->client_state, 3 + 1);{noformat} # configured using: {noformat} --with-pjproject-bundled --enable-dev-mode {noformat} # menuselect used: {noformat} menuselect/menuselect \ --enable DONT_OPTIMIZE --enable BETTER_BACKTRACES --enable MALLOC_DEBUG menuselect.makeopts {noformat} At the time of the crash frame 0 has entry=0xdeaddeaddeaddead. By: Ian Gilmour (tuxian) 2016-10-24 09:34:02.792-0500 cli-and-gdb-2016-10-24-crash2.txt is from the same code build, but shows a slightly different crash. This time hdr=0xdeaddeaddeaddead. By: Ian Gilmour (tuxian) 2016-10-25 04:37:42.304-0500 cli-and-gdb-2016-10-25-crash1.txt - another crash similar to the last one, but with a bit more debug as to when hdr gets set to 0xdeaddeaddeaddead. By: Richard Mudgett (rmudgett) 2016-10-25 17:44:23.679-0500 [^0006_jira_asterisk_26344_v13_fix.patch] - This is an initial attempt to fix the crashes resulting from tdata being unreffed too many times. Copy this file as is into the third-party/pjproject/patches directory. Let me know how well it works. By: Ian Gilmour (tuxian) 2016-10-26 05:19:40.679-0500 Still crashing, I'm afraid - see [^cli-and-gdb-2016-10-26-crash1-after-fix.tgz] for details. By: Richard Mudgett (rmudgett) 2016-10-26 13:08:21.767-0500 [^0006_jira_asterisk_26344_v13_fix_cache.patch] - This patch should fix the DNS cache crash you are now seeing. Place this file also in the third-party/pjproject/patches directory. You will need both fix patches because the DNS cache crash is actually a different bug. Let me know how well these fixes are working. By: Ian Gilmour (tuxian) 2016-10-27 04:40:57.730-0500 With the latest patch in place I got no immediate crash, so I left it running overnight. Still no crashes by morning. So good progress. :-) But I suspect the following indicate we now have a memory leak.
{noformat} grep -e ":40:0[012]" -B 3 screenlog.2 | grep -e ":40:0" -e 31339 | sed 's/[[:space:]]*$//' 31339 iang 20 0 2778m 63m 13m S 0.3 0.8 0:47.22 asterisk top - 21:40:01 up 14 days, 13:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 63m 13m S 0.7 0.8 1:03.18 asterisk top - 22:40:02 up 14 days, 14:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 64m 13m S 1.7 0.8 1:19.21 asterisk top - 23:40:01 up 14 days, 15:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 65m 13m S 0.3 0.8 1:36.02 asterisk top - 00:40:02 up 14 days, 16:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 67m 13m S 0.0 0.9 1:52.29 asterisk top - 01:40:00 up 14 days, 17:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 67m 13m S 0.7 0.9 2:08.39 asterisk top - 02:40:01 up 14 days, 18:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 68m 13m S 0.7 0.9 2:24.50 asterisk top - 03:40:02 up 14 days, 19:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 68m 13m S 0.3 0.9 2:41.09 asterisk top - 04:40:01 up 14 days, 20:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 69m 13m S 0.3 0.9 2:59.74 asterisk top - 05:40:02 up 14 days, 21:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 70m 13m S 0.3 0.9 3:16.87 asterisk top - 06:40:00 up 14 days, 22:39, 5 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 72m 13m S 0.7 0.9 3:33.20 asterisk top - 07:40:01 up 14 days, 23:39, 6 users, load average: 0.00, 0.00, 0.00 31339 iang 20 0 2778m 73m 13m S 0.3 0.9 3:49.29 asterisk top - 08:40:02 up 15 days, 39 min, 6 users, load average: 0.00, 0.00, 0.00 {noformat} and, {noformat} grep -nH -B 5 -e "bytes in all allocations and deferred free allocations" screenlog.0 screenlog.0-56326- screenlog.0-56327- 12590115 bytes in all allocations screenlog.0-56328- 1409194 bytes in deferred free large allocations screenlog.0-56329- 35735 bytes in deferred free small allocations screenlog.0-56330- 1444929 bytes in deferred free allocations screenlog.0:56331: 14035044 bytes in all allocations and deferred free allocations -- screenlog.0-56594- screenlog.0-56595- 12591138 bytes in all allocations screenlog.0-56596- 1405537 bytes in deferred free large allocations screenlog.0-56597- 35642 bytes in deferred free small allocations screenlog.0-56598- 1441179 bytes in deferred free allocations screenlog.0:56599: 14032317 bytes in all allocations and deferred free allocations -- screenlog.0-59106- screenlog.0-59107- 12638906 bytes in all allocations screenlog.0-59108- 1416980 bytes in deferred free large allocations screenlog.0-59109- 36774 bytes in deferred free small allocations screenlog.0-59110- 1453754 bytes in deferred free allocations screenlog.0:59111: 14092660 bytes in all allocations and deferred free allocations -- screenlog.0-66080- screenlog.0-66081- 12690106 bytes in all allocations screenlog.0-66082- 1441081 bytes in deferred free large allocations screenlog.0-66083- 36137 bytes in deferred free small allocations screenlog.0-66084- 1477218 bytes in deferred free allocations screenlog.0:66085: 14167324 bytes in all allocations and deferred free allocations -- screenlog.0-67593- screenlog.0-67594- 12700346 bytes in all allocations screenlog.0-67595- 1439375 bytes in deferred free large allocations screenlog.0-67596- 36300 bytes in deferred free small allocations screenlog.0-67597- 1475675 bytes in deferred free allocations screenlog.0:67598: 14176021 bytes in all allocations and deferred free allocations 
-- screenlog.0-285757- screenlog.0-285758- 15318479 bytes in all allocations screenlog.0-285759- 1413605 bytes in deferred free large allocations screenlog.0-285760- 36032 bytes in deferred free small allocations screenlog.0-285761- 1449637 bytes in deferred free allocations screenlog.0:285762: 16768116 bytes in all allocations and deferred free allocations -- screenlog.0-325786- screenlog.0-325787- 15715874 bytes in all allocations screenlog.0-325788- 1381982 bytes in deferred free large allocations screenlog.0-325789- 36186 bytes in deferred free small allocations screenlog.0-325790- 1418168 bytes in deferred free allocations screenlog.0:325791: 17134042 bytes in all allocations and deferred free allocations {noformat} [^cli-and-gdb-2016-10-27.tgz] is last night log. By: Richard Mudgett (rmudgett) 2016-10-27 14:07:40.715-0500 We have more than a potential memory leak. MALLOC_DEBUG is reporting memory corruption in the [^cli-and-gdb-2016-10-27.tgz] log. Unfortunately, MALLOC_DEBUG cannot be of any more help in hunting down the corruption. You'll need to switch to using valgrind after disabling MALLOC_DEBUG to get more information on who is corrupting the memory. https://wiki.asterisk.org/wiki/display/AST/Valgrind By: Ian Gilmour (tuxian) 2016-10-28 10:22:47.001-0500 [^2016-10-28.tgz] has the results of 5 valgrind runs with associated console/gdb outputs. Some of these I ^C'ed as they didn't want to crash on their own. Others I changed the Valgrind debug switches used. Hopefully there is something in here that proves useful. By: Friendly Automation (friendly-automation) 2016-10-28 17:15:16.306-0500 Change 4228 had a related patch set uploaded by Richard Mudgett: bundled pjproject: Crashes while resolving DNS names. [https://gerrit.asterisk.org/4228|https://gerrit.asterisk.org/4228] By: Friendly Automation (friendly-automation) 2016-10-28 17:15:37.648-0500 Change 4229 had a related patch set uploaded by Richard Mudgett: bundled pjproject: Crashes while resolving DNS names. [https://gerrit.asterisk.org/4229|https://gerrit.asterisk.org/4229] By: Friendly Automation (friendly-automation) 2016-10-28 17:15:54.880-0500 Change 4230 had a related patch set uploaded by Richard Mudgett: bundled pjproject: Crashes while resolving DNS names. [https://gerrit.asterisk.org/4230|https://gerrit.asterisk.org/4230] By: Richard Mudgett (rmudgett) 2016-10-28 17:22:30.435-0500 [~tuxian] I'm going to declare this issue resolved when the above reviews are merged because the original crashing issue and the cache crash is fixed with the patch. Teluu has already merged those changes into their code. The memory corruption and possible memory leak issue will be continued in the ASTERISK-26516 issue. By: Friendly Automation (friendly-automation) 2016-10-31 11:38:21.138-0500 Change 4228 merged by Joshua Colp: bundled pjproject: Crashes while resolving DNS names. [https://gerrit.asterisk.org/4228|https://gerrit.asterisk.org/4228] By: Friendly Automation (friendly-automation) 2016-10-31 12:45:05.731-0500 Change 4229 merged by zuul: bundled pjproject: Crashes while resolving DNS names. [https://gerrit.asterisk.org/4229|https://gerrit.asterisk.org/4229] By: Friendly Automation (friendly-automation) 2016-10-31 12:45:08.793-0500 Change 4230 merged by zuul: bundled pjproject: Crashes while resolving DNS names. [https://gerrit.asterisk.org/4230|https://gerrit.asterisk.org/4230] |