ASTERISK-19231: Abort signal 6 raises when using 'sip show peers' with realtime peers

[Home]

Summary: ASTERISK-19231: Abort signal 6 raises when using 'sip show peers' with realtime peers

Reporter: Thomas Arimont (tomaso) Labels:

Date Opened: 2012-01-23 03:02:44.000-0600 Date Closed: 2012-02-22 08:54:19.000-0600

Priority: Major Regression?

Status: Closed/Complete Components: Channels/chan_sip/General

Versions: 1.8.8.2 Frequency of
Occurrence Occasional

Related
Issues:
is related to ASTERISK-18423 Crash in AMI initiated sip show peers

is related to ASTERISK-19361 Asterisk exited on signal 6: Related to sip show peers?

Environment: RHEL 6.1 Attachments: ( 0) bt_und_btfull_core2.txt
( 1) sip_show_peers_2012_02_16.diff
( 2) sip_show_peers.diff

Description: On a heavy load situation with a huge number of sip realtime peers the _sip_show_peers() function raises an abort signal 6 (libc) when freeing (ast_free()) the local array buffer 'peerarray'. To force the fault an additional script is running which executes "asterisk -rx 'sip show peers' every second when a huge number of phones try to register for the first time.
I seems that the _sip_show_peers() function does not take account for the case that the number of peers can change between the calculcation of the number and memory allocation at the start of the function and the iteration through the the peer list later on. If meanwhile a peer is added to the container the allocated 'peerarray' buffer can be overrun and lead to the that fault when freeing the buffer at the end of the funtion.

Please look at the attached bt full log (sorry, we only have a code optimized version).

Comments: By: David Woolley (davidw) 2012-01-23 06:14:24.962-0600

An abort in malloc indicates that memory was corrupt even before the malloc was called. Whilst it is possible that your mechanism applies, I don't think you can draw that conclusion just from the presence of show peers in the backtrace.
By: Thomas Arimont (tomaso) 2012-01-23 06:51:55.450-0600

David,
yes, in principle you're right. But if the memory was corrupted already before the malloc, it would be very implausible from the statistic view that this fault happens 10 or 20 times repeatedly in the same function at the same command and nowhere else.
So please take this as an additional information which describes this bug.
By: Matt Jordan (mjordan) 2012-02-14 08:27:30.327-0600

I haven't been able to reproduce the crash you described, but your analysis looked correct. I've gone ahead and made the sizing of the array thread-safe with respect to the initialization of the iterator, which should prevent the buffer overrun. Can you test with your scenario and see if this resolves the issue?

Thanks

Matt
By: Thomas Arimont (tomaso) 2012-02-14 09:13:33.404-0600

Matt,
yes, I will test the patch. I'm just about to build the rhel rpm's. Will have the results tomorrow. Thanks so far!
Thomas
By: Matt Jordan (mjordan) 2012-02-14 10:42:39.286-0600

Thomas - upon some review, this patch isn't going to work. We're still putting together one that does - but please hold off for a bit.
By: Thomas Arimont (tomaso) 2012-02-14 11:19:16.717-0600

O.k., that's why I still see the repetitive written coredumps in our test which I started a few hours ago with your first patch ... ;-)
By: Matt Jordan (mjordan) 2012-02-16 13:55:26.163-0600

Okay - lets try this again. This patch should lock the peers container and create a snapshot of the container using an ao2_callback. The number of items that the iterator will traverse over is the same as the size of the array - so that should prevent the crash you're seeing.

Please test this one out and let us know if it fixes the problem. Thanks!
By: Thomas Arimont (tomaso) 2012-02-17 02:27:40.080-0600

Matt,
thanks a lot.
I don't see an opportunity to arrange a test with your new patch before tuesday next week ('Karneval' is going on), but I'm looking forward to do that.
Thomas
By: Thomas Arimont (tomaso) 2012-02-21 09:18:03.281-0600

Matt,
the patch seems to fix the problem. We will let our automatic testscripts run overnight to see if it's still o.k. tomorrow.
Thanks so far!
Thomas
By: Matt Jordan (mjordan) 2012-02-21 18:50:47.783-0600

No problem. If everything goes well, I'll get this into 1.8 tomorrow.

The actual commit will be only slightly different - there are some off nominal conditions in the patch that aren't being handled fully (if some of the allocations fail).
By: Thomas Arimont (tomaso) 2012-02-22 06:40:17.468-0600

Matt,
so as almost expected everything went well tonight (you should not see any ambiguity in this sentence, I was only watching TV last night ;-)).
Good job!
Thomas