Summary: | ASTERISK-19231: Abort signal 6 raises when using 'sip show peers' with realtime peers | ||||||
Reporter: | Thomas Arimont (tomaso) | Labels: | |||||
Date Opened: | 2012-01-23 03:02:44.000-0600 | Date Closed: | 2012-02-22 08:54:19.000-0600 | ||||
Priority: | Major | Regression? | |||||
Status: | Closed/Complete | Components: | Channels/chan_sip/General | ||||
Versions: | 1.8.8.2 | Frequency of Occurrence | Occasional | ||||
Related Issues: |
| ||||||
Environment: | RHEL 6.1 | Attachments: | ( 0) bt_und_btfull_core2.txt ( 1) sip_show_peers_2012_02_16.diff ( 2) sip_show_peers.diff | ||||
Description: | On a heavy load situation with a huge number of sip realtime peers the _sip_show_peers() function raises an abort signal 6 (libc) when freeing (ast_free()) the local array buffer 'peerarray'. To force the fault an additional script is running which executes "asterisk -rx 'sip show peers' every second when a huge number of phones try to register for the first time. I seems that the _sip_show_peers() function does not take account for the case that the number of peers can change between the calculcation of the number and memory allocation at the start of the function and the iteration through the the peer list later on. If meanwhile a peer is added to the container the allocated 'peerarray' buffer can be overrun and lead to the that fault when freeing the buffer at the end of the funtion. Please look at the attached bt full log (sorry, we only have a code optimized version). | ||||||
Comments: | By: David Woolley (davidw) 2012-01-23 06:14:24.962-0600 An abort in malloc indicates that memory was corrupt even before the malloc was called. Whilst it is possible that your mechanism applies, I don't think you can draw that conclusion just from the presence of show peers in the backtrace. By: Thomas Arimont (tomaso) 2012-01-23 06:51:55.450-0600 David, yes, in principle you're right. But if the memory was corrupted already before the malloc, it would be very implausible from the statistic view that this fault happens 10 or 20 times repeatedly in the same function at the same command and nowhere else. So please take this as an additional information which describes this bug. By: Matt Jordan (mjordan) 2012-02-14 08:27:30.327-0600 I haven't been able to reproduce the crash you described, but your analysis looked correct. I've gone ahead and made the sizing of the array thread-safe with respect to the initialization of the iterator, which should prevent the buffer overrun. Can you test with your scenario and see if this resolves the issue? Thanks Matt By: Thomas Arimont (tomaso) 2012-02-14 09:13:33.404-0600 Matt, yes, I will test the patch. I'm just about to build the rhel rpm's. Will have the results tomorrow. Thanks so far! Thomas By: Matt Jordan (mjordan) 2012-02-14 10:42:39.286-0600 Thomas - upon some review, this patch isn't going to work. We're still putting together one that does - but please hold off for a bit. By: Thomas Arimont (tomaso) 2012-02-14 11:19:16.717-0600 O.k., that's why I still see the repetitive written coredumps in our test which I started a few hours ago with your first patch ... ;-) By: Matt Jordan (mjordan) 2012-02-16 13:55:26.163-0600 Okay - lets try this again. This patch should lock the peers container and create a snapshot of the container using an ao2_callback. The number of items that the iterator will traverse over is the same as the size of the array - so that should prevent the crash you're seeing. Please test this one out and let us know if it fixes the problem. Thanks! By: Thomas Arimont (tomaso) 2012-02-17 02:27:40.080-0600 Matt, thanks a lot. I don't see an opportunity to arrange a test with your new patch before tuesday next week ('Karneval' is going on), but I'm looking forward to do that. Thomas By: Thomas Arimont (tomaso) 2012-02-21 09:18:03.281-0600 Matt, the patch seems to fix the problem. We will let our automatic testscripts run overnight to see if it's still o.k. tomorrow. Thanks so far! Thomas By: Matt Jordan (mjordan) 2012-02-21 18:50:47.783-0600 No problem. If everything goes well, I'll get this into 1.8 tomorrow. The actual commit will be only slightly different - there are some off nominal conditions in the patch that aren't being handled fully (if some of the allocations fail). By: Thomas Arimont (tomaso) 2012-02-22 06:40:17.468-0600 Matt, so as almost expected everything went well tonight (you should not see any ambiguity in this sentence, I was only watching TV last night ;-)). Good job! Thomas |