[Home]

Summary:ASTERISK-16044: [patch] The heap data structure can't cope with a removal and reinsert
Reporter:cappucinoking (cappucinoking)Labels:
Date Opened:2010-05-03 12:42:51Date Closed:2010-05-06 09:15:58
Priority:BlockerRegression?No
Status:Closed/CompleteComponents:Tests/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) heap-fix.diff
( 1) heap-fix.rev2.diff
( 2) test_heap.diff
( 3) test_heap.patch
Description:Whilst looking into why qualify events are being scheduled out of sequence I started looking into why items on the maxheap weren't in the correct order.
The qualify code is a different use case then the current tests allow for.
When a response to a qualify request is made, the event is rescheduled via a remove and re-add to the maxheap structure.
I have written a test to demonstrate the failure of the heap to keep the items in order when removing and re-inserting items.
Comments:By: cappucinoking (cappucinoking) 2010-05-03 12:54:49

The patch creates a heap with 1000 items.
Randomly removes and re-inserts 500 items.
Checks that items pop-off in order.

Currently this test is failing.

By: Russell Bryant (russell) 2010-05-03 16:15:28

Give this patch a try.

By: cappucinoking (cappucinoking) 2010-05-03 16:56:54

This is now passing the test case.

By: cappucinoking (cappucinoking) 2010-05-03 17:30:16

test_heap.diff is a slightly revised version of the test - and fixes a minor typo in test two as well.

By: Russell Bryant (russell) 2010-05-04 13:52:36

Let me know how the testing in production goes today.

By: Russell Bryant (russell) 2010-05-04 14:00:26

I attached a rev2 of the fix.  It's the same code, I just changed the formatting of the bubble_up() loop to be a bit more readable.

By: cappucinoking (cappucinoking) 2010-05-04 16:07:07

It seems to have fixed the issue I was having with the qualify timing.
I haven't seen any adverse effects.

By: Digium Subversion (svnbot) 2010-05-06 08:58:12

Repository: asterisk
Revision: 261496

U   trunk/main/heap.c

------------------------------------------------------------------------
r261496 | russell | 2010-05-06 08:58:07 -0500 (Thu, 06 May 2010) | 40 lines

Fix handling of removing nodes from the middle of a heap.

This bug surfaced in 1.6.2 and does not affect code in any other released
version of Asterisk.  It manifested itself as SIP qualify not happening when
it should, causing peers to go unreachable.  This was debugged down to scheduler
entries sometimes not getting executed when they were supposed to, which was in
turn caused by an error in the heap code.

The problem only sometimes occurs, and it is due to the logic for removing an entry
in the heap from an arbitrary location (not just popping off the top).  The scheduler
performs this operation frequently when entries are removed before they run (when
ast_sched_del() is used).

In a normal pop off of the top of the heap, a node is taken off the bottom,
placed at the top, and then bubbled down until the max heap property is restored
(see max_heapify()).  This same logic was used for removing an arbitrary node
from the middle of the heap.  Unfortunately, that logic is full of fail.  This
patch fixes that by fully restoring the max heap property when a node is thrown
into the middle of the heap.  Instead of just pushing it down as appropriate, it
first pushes it up as high as it will go, and _then_ pushes it down.

Lastly, fix a minor problem in ast_heap_verify(), which is only used for
debugging.  If a parent and child node have the same value, that is not an
error.  The only error is if a parent's value is less than its children.

A huge thanks goes out to cappucinoking for debugging this down to the scheduler,
and then producing an ast_heap test case that demonstrated the breakage.  That
made it very easy for me to focus on the heap logic and produce a fix.  Open source
projects are awesome.

(closes issue ASTERISK-15721)
Reported by: ib2
Tested by: cappucinoking, crjw

(closes issue ASTERISK-16044)
Reported by: cappucinoking
Patches:
     heap-fix.rev2.diff uploaded by russell (license 2)
Tested by: cappucinoking, russell

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=261496

By: Digium Subversion (svnbot) 2010-05-06 09:02:25

Repository: asterisk
Revision: 261497

_U  branches/1.6.1/
U   branches/1.6.1/main/heap.c

------------------------------------------------------------------------
r261497 | russell | 2010-05-06 09:02:22 -0500 (Thu, 06 May 2010) | 47 lines

Merged revisions 261496 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r261496 | russell | 2010-05-06 08:58:07 -0500 (Thu, 06 May 2010) | 40 lines
 
 Fix handling of removing nodes from the middle of a heap.
 
 This bug surfaced in 1.6.2 and does not affect code in any other released
 version of Asterisk.  It manifested itself as SIP qualify not happening when
 it should, causing peers to go unreachable.  This was debugged down to scheduler
 entries sometimes not getting executed when they were supposed to, which was in
 turn caused by an error in the heap code.
 
 The problem only sometimes occurs, and it is due to the logic for removing an entry
 in the heap from an arbitrary location (not just popping off the top).  The scheduler
 performs this operation frequently when entries are removed before they run (when
 ast_sched_del() is used).
 
 In a normal pop off of the top of the heap, a node is taken off the bottom,
 placed at the top, and then bubbled down until the max heap property is restored
 (see max_heapify()).  This same logic was used for removing an arbitrary node
 from the middle of the heap.  Unfortunately, that logic is full of fail.  This
 patch fixes that by fully restoring the max heap property when a node is thrown
 into the middle of the heap.  Instead of just pushing it down as appropriate, it
 first pushes it up as high as it will go, and _then_ pushes it down.
 
 Lastly, fix a minor problem in ast_heap_verify(), which is only used for
 debugging.  If a parent and child node have the same value, that is not an
 error.  The only error is if a parent's value is less than its children.
 
 A huge thanks goes out to cappucinoking for debugging this down to the scheduler,
 and then producing an ast_heap test case that demonstrated the breakage.  That
 made it very easy for me to focus on the heap logic and produce a fix.  Open source
 projects are awesome.
 
 (closes issue ASTERISK-15721)
 Reported by: ib2
 Tested by: cappucinoking, crjw
 
 (closes issue ASTERISK-16044)
 Reported by: cappucinoking
 Patches:
       heap-fix.rev2.diff uploaded by russell (license 2)
 Tested by: cappucinoking, russell
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=261497

By: Digium Subversion (svnbot) 2010-05-06 09:04:38

Repository: asterisk
Revision: 261498

_U  branches/1.6.2/
U   branches/1.6.2/main/heap.c

------------------------------------------------------------------------
r261498 | russell | 2010-05-06 09:04:36 -0500 (Thu, 06 May 2010) | 47 lines

Merged revisions 261496 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

........
 r261496 | russell | 2010-05-06 08:58:07 -0500 (Thu, 06 May 2010) | 40 lines
 
 Fix handling of removing nodes from the middle of a heap.
 
 This bug surfaced in 1.6.2 and does not affect code in any other released
 version of Asterisk.  It manifested itself as SIP qualify not happening when
 it should, causing peers to go unreachable.  This was debugged down to scheduler
 entries sometimes not getting executed when they were supposed to, which was in
 turn caused by an error in the heap code.
 
 The problem only sometimes occurs, and it is due to the logic for removing an entry
 in the heap from an arbitrary location (not just popping off the top).  The scheduler
 performs this operation frequently when entries are removed before they run (when
 ast_sched_del() is used).
 
 In a normal pop off of the top of the heap, a node is taken off the bottom,
 placed at the top, and then bubbled down until the max heap property is restored
 (see max_heapify()).  This same logic was used for removing an arbitrary node
 from the middle of the heap.  Unfortunately, that logic is full of fail.  This
 patch fixes that by fully restoring the max heap property when a node is thrown
 into the middle of the heap.  Instead of just pushing it down as appropriate, it
 first pushes it up as high as it will go, and _then_ pushes it down.
 
 Lastly, fix a minor problem in ast_heap_verify(), which is only used for
 debugging.  If a parent and child node have the same value, that is not an
 error.  The only error is if a parent's value is less than its children.
 
 A huge thanks goes out to cappucinoking for debugging this down to the scheduler,
 and then producing an ast_heap test case that demonstrated the breakage.  That
 made it very easy for me to focus on the heap logic and produce a fix.  Open source
 projects are awesome.
 
 (closes issue ASTERISK-15721)
 Reported by: ib2
 Tested by: cappucinoking, crjw
 
 (closes issue ASTERISK-16044)
 Reported by: cappucinoking
 Patches:
       heap-fix.rev2.diff uploaded by russell (license 2)
 Tested by: cappucinoking, russell
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=261498

By: Digium Subversion (svnbot) 2010-05-06 09:07:52

Repository: asterisk
Revision: 261499

U   branches/1.6.2/tests/test_heap.c

------------------------------------------------------------------------
r261499 | russell | 2010-05-06 09:07:52 -0500 (Thu, 06 May 2010) | 8 lines

Add test case that ensures the heap handles arbitrary removals properly.

(issue ASTERISK-16044)
Reported by: cappucinoking
Patches:
     test_heap.diff uploaded by cappucinoking (license 1036)
Tested by: cappucinoking, russell

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=261499

By: Digium Subversion (svnbot) 2010-05-06 09:15:58

Repository: asterisk
Revision: 261500

U   trunk/tests/test_heap.c

------------------------------------------------------------------------
r261500 | russell | 2010-05-06 09:15:57 -0500 (Thu, 06 May 2010) | 10 lines

Add test case for removing random elements from a heap.

I modified the original patch for trunk to use the unit test API.

(issue ASTERISK-16044)
Reported by: cappucinoking
Patches:
     test_heap.diff uploaded by cappucinoking (license 1036)
Tested by: cappucinoking, russell

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=261500