Summary: | ASTERISK-24651: [patch] Fix race condition in DTLS | ||||||
Reporter: | Badalian Vyacheslav (slavon) | Labels: | |||||
Date Opened: | 2014-12-30 06:57:33.000-0600 | Date Closed: | 2015-07-07 14:56:57 | ||||
Priority: | Critical | Regression? | |||||
Status: | Closed/Complete | Components: | Resources/res_rtp_asterisk | ||||
Versions: | 11.15.0 11.16.0 | Frequency of Occurrence | |||||
Related Issues: |
| ||||||
Environment: | Attachments: | ( 0) backtrace.txt ( 1) core-show-locks.txt ( 2) fix_dtls_race_conditions.diff ( 3) fix_dtls_race_conditions2.diff | |||||
Description: | You code have race condition. Its broke {{SSL *}} struct and {{BIO *}}.
Before apply patch to test need 1. Apply ASTERISK-24650 2. Get last openssl openssl-OpenSSL_1_0_1-stable from trunk https://github.com/openssl/openssl/tree/OpenSSL_1_0_1-stable (we fix in it DTLS issues) and configure (for centos 6) - {{./config --prefix=/usr --openssldir=/etc/pki/tls shared no-ssl2 zlib enable-camellia enable-seed enable-tlsext enable-rfc3779 enable-cms enable-md2 no-mdc2 no-rc5 no-ec2m no-gost --with-krb5-flavor=MIT --with-krb5-dir=/usr}}. Then make, make tests, make install. ! Patch tested in production server. ! Added Mutex tested with debug_threads. No deadlocks. Asterisk without fixed crash after 5-100 calls in havy concourent calls. After patch applyed 100 000 calls do normal. Race condition in 2 situations 1. One thread change or broke {{SSL}} struct then SSL code is executed. Added mutex to protect it. Also change {{BIO}} buffers to NULL after free {{SSL}} struct. Its help to detect SSL that allready free and don't write to {{BIO}}. {{SSL_free}} is free all assigned BIO buffers. No need to BIO_free (its may cause double free issue). 2. big touble with {{dtls_srtp_check_pending}} code...i divide retransmition scheclude to RTP and RCTP to fix race conditions... Ready to discuss or review patch. Do not have to be very strict with me, I am sharing practices on their own initiative. | ||||||
Comments: | By: Badalian Vyacheslav (slavon) 2014-12-30 07:30:54.929-0600 And happy new year to Astersk Dev team :) By: Badalian Vyacheslav (slavon) 2014-12-31 05:13:47.580-0600 If you have issues that get openssl assert: {quote} d1_both.c(278): OpenSSL internal error, assertion failed: s->init_num == (int)s->d1->w_msg_hdr.msg_len + DTLS1_HM_HEADER_LENGTH {quote} link it to this issue. It's becouse {quote} Race condition in 2 situations 1. One thread change or broke SSL struct then SSL code is executed... {quote} This is not openssl issue. Its asterisk race condition that fixed by this patch. By: Rusty Newton (rnewton) 2015-01-02 12:03:58.948-0600 Had you already filed bug reports for the crashes that this patch fixed for you? Was ASTERISK-24333 the report or one of them? I'm not a developer so I can't review your patch, but the next part of the process is to get your code up on Reviewboard. You can follow the [code review process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process]. Be sure to link your Reviewboard URL here on the issue. By: Matt Jordan (mjordan) 2015-01-19 20:02:29.976-0600 There's a whole lot of locking you've thrown in here, and I'm not sure I understand the need for any of it. Which threads are having a race condition? Generally, an RTP instance is owned by a channel, and the channel is responsible for reading/writing to the RTP instance. While a scheduler callback can occur - and will occur on a separate thread - I think you've been extremely over aggressive with the locking here. It would help if we knew *exactly* what the race condition was, on what piece of data, and between what threads. Having a backtrace showing the crash due to the race condition would be extremely helpful. By: Badalian Vyacheslav (slavon) 2015-01-21 01:48:37.821-0600 I can explain in detail, if you give me the Russian-speaking developer. I am very hard to explain in detail in English. But I will try through Google translator. 1. You do not worry about the safety of SSL structure. As a result, one thread Asterisk it changes when the OpenSSL (in another thread) is code execution. I added a lot of mutex just to try to disable simultaneous access to SSL structures. 2. Your scheduler launches the second stream. He calls the callback function that for RTP and RTCP is tested and each test causes the timer. The result - a cascading effect. In my code is always only one timer for RTP and RTCP will be launched. Code free from the cascading effect. To do this, slightly modified structure DTLS. Along the way, there was still a few things that I thought it necessary to correct, to improve stability. Destroy example can be invoked by another thread, and we are not protected against this. Had to insert in rtp_engine many checks as otherwise accesses the already remote structures. Or such malfunction with BIO also present when the structure SSL scrape, you do not need to do for FREE BIO. All related to BIO docs deleted. My patch unfortunately does not solve all problems. But most of the addresses, while the other cases to prevent a crash. If an asterisk before the patch was falling every 15 minutes, but now we have a 2-3 fall in the month. Unfortunately work with OpenSSL you stretched in many places and there is a piece of falling I rules for the same reason - SSL structure has already died, and the other thread is using it. For a global solution is required to rewrite a large piece of code to work with SSL structure to prevent simultaneous access to multiple streams. It needs to be thread safe to do so. I admit that patch of possible solutions "dirty", but my time is quite expensive. Had to cope with the customer quickly. For the same reason I can not give a patch to your rules. This time, and it's very expensive. I beg enter into my position. And not to be impersonal. I am the technical director of real estate development firm. In my submission of more than 50 developers C / C ++ and Java. Specialization - the banking sector and large call centers. I need to be engaged in the affairs of the company, so I return to you our solutions to your problems in the form in which I have an opportunity :( Hopefully Google translator was able to translate my thoughts so that they are understood) By: Badalian Vyacheslav (slavon) 2015-01-21 04:39:57.551-0600 ASTERISK-24566 also becouse this race condition By: Thomas Guebels (tguescaux) 2015-02-13 03:45:44.540-0600 Hi, I'm having the same kind of crashes as in ASTERISK-24131 when calling from asterisk to jssip/chrome and even the latest fixes in libssl did not change anything. However, I tested the attached patch and it fixes the crashes for me, so there's definitely something interesting in all the locking that it adds. I'm still trying to find out what is happening but apparently preventing dtls_srtp_check_pending (which can be launched from the channel thread, the scheduler thread or the pj worker thread) to run concurrently with each other, or concurrently with dtls_perform_handshake seems to do the trick. Also note that I can reproduce these crashes with only one call running at a time. By: Badalian Vyacheslav (slavon) 2015-02-17 08:55:05.911-0600 Thomas! Thanks for testing! Unfortunately, the patch is a crutch for the current implementation of SSL. Unfortunately we have seen in rare cases fall asterisk, because chan_sip not only refers to the SSL structure. I have no idea how quickly rewritten to work with SSL structure to ensure it only work with a single thread (Access to SSL must be threadsafe). It is necessary to rewrite all the work globally with SSL, what I do not have time and energy. By: Alex Zarubin (az_tth) 2015-02-18 22:01:40.129-0600 Attached please find core-show-locks and backtrace file. Asterisk (11.15.0) is in deadlock. Scenario: asterisk takes webrtc/wss calls (dials out plain sip and bridges. Every call lasts 5 sec, there are no more than 2-3 calls at a time. Original 11.15.0 release would crash or deadlock after 200-300 calls. With patches 24650 and 24651 installed there are no crashes (so far) and there were about 3000 calls before the deadlock. By: Badalian Vyacheslav (slavon) 2015-02-18 22:59:29.563-0600 Alex. Try compile without DEBUG_THREADS and deadlock will out i think. It's may look strange, but asterisk have deadlocks in normal SIP UDP mode with this flag. Maybe bug in DEBUG_THREADS code. By: Alex Zarubin (az_tth) 2015-02-18 23:21:33.047-0600 Vyacheslav, thank you for the patches. We'll remove DEBUG_THREADS flag for production build. We need it for the load test release though as it adds "core show locks" and helps with backtraces - hopefully asterisk developers will be able to use this info. By: Stefan Engström (StefanEng86) 2015-02-28 14:29:33.198-0600 Any of you people working with this issue kind enough to comment ASTERISK-24832? Are my crash logs similar to the ones that prompted you to make that patch, Badalian? By: Badalian Vyacheslav (slavon) 2015-03-06 07:04:32.935-0600 11.16.0 - patch work By: Badalian Vyacheslav (slavon) 2015-03-06 10:25:09.738-0600 Update patch. fix {{SSL}} free after use in dtls_srtp_setup. By: Joshua C. Colp (jcolp) 2015-07-06 06:00:25.030-0500 A change is now up for review at the following addresses for a fix to this problem. While our code review process is pretty fast these days if anyone would like to test the change and provide feedback on this issue it would be welcome: 11: https://gerrit.asterisk.org/#/c/786/ 13: https://gerrit.asterisk.org/#/c/787/ master: https://gerrit.asterisk.org/#/c/788/ The patch can be downloaded by clicking the "Download" dropdown and selecting the method you wish. By: Badalian Vyacheslav (slavon) 2015-07-06 11:54:22.992-0500 Fine! I tried to finish the current version to work, but in fact it was a stick. You better know the system. Your patches correctly. Thanks Joshua Kolp. |