[Home]

Summary:ASTERISK-29947: testsuite: Premature shutdown causes tests to fail
Reporter:N A (InterLinked)Labels:
Date Opened:2022-03-03 13:46:27.000-0600Date Closed:2022-03-03 15:10:21.000-0600
Priority:MinorRegression?
Status:Closed/CompleteComponents:Tests/testsuite
Versions:18.9.0 Frequency of
Occurrence
Frequent
Related
Issues:
Environment:Debian 11Attachments:
Description:A problem I've been noticing for a while is that sometimes the test suite will prematurely kill a test if a UserEvent for success has been received, even if the success count hasn't yet hit the required count.

For instance, I have a test that aborts after 1 success user event, when 2 are required, and then it fails because Asterisk got shut down even though the test was still running, and obviously 1 is not 2 so the test fails.

If I simply comment out the success user event, the test runs to completion successfully, but obviously this doesn't make it work.

I have ensured that there is always a channel up at all parts of the test so it isn't the case that Asterisk is shutting down because the test suite has detected there are no channels. Instead, it seems to be reacting in a majorly buggy way.

The only workaround is simply to include at most 1 success user event, which kind of defeats the point of trying to add rich tests to the test suite in the first place.

Granted, this doesn't happen with every test, but it seems to with a large number.

In the past, I've rewritten tests to only include 1 success, to improve reliability, but it seems in some cases it will just consistently fail like this.

```
root@pbxdev:/usr/src/testsuite# ./runtests.py --test=tests/intertest/anoncallreject --timeout=60
Running tests for Asterisk (run 1 of 1)...

Tests to run: 1 * 1 time(s) = 1  Maximum test inactivity time: 60 sec.
--> Running test 'tests/intertest/anoncallreject' ...
Making sure Asterisk isn't running ...
Making sure SIPp isn't running...
Running tests/intertest/anoncallreject ...
[Mar 03 14:39:02] WARNING[4143745]: asterisk.ami:208 __check_result: Event occurred 1 times, which is out of the allowable range
[Mar 03 14:39:02] WARNING[4143745]: asterisk.ami:209 __check_result: Event description: {'count': 2, 'conditions': {'match': {'UserEvent': 'LineTest', 'Event': 'UserEvent'}}, 'requirements': {'match': {'Result': 'Pass'}}}
Test tests/intertest/anoncallreject failed



<?xml version="1.0" encoding="utf-8"?>
<testsuites>
 <testsuite errors="0" failures="1" name="AsteriskTestSuite" tests="1" time="18.62" timestamp="2022-03-03T14:38:43 EST">
   <testcase classname="intertest" name="anoncallreject" time="18.62">
     <failure>Running tests/intertest/anoncallreject ...
[Mar 03 14:39:02] WARNING[4143745]: asterisk.ami:208 __check_result: Event occurred 1 times, which is out of the allowable range
[Mar 03 14:39:02] WARNING[4143745]: asterisk.ami:209 __check_result: Event description: {'count': 2, 'conditions': {'match': {'UserEvent': 'LineTest', 'Event': 'UserEvent'}}, 'requirements': {'match': {'Result': 'Pass'}}}
</failure>
   </testcase>
 </testsuite>
</testsuites>
```
Comments:By: Asterisk Team (asteriskteam) 2022-03-03 13:46:27.989-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Joshua C. Colp (jcolp) 2022-03-03 13:49:54.829-0600

Have you investigated through the actual logs to understand what is actually happening? Why it's shutting down?

By: N A (InterLinked) 2022-03-03 13:53:08.206-0600

Yes, I have, the test will still be running and Originate() will fail due to an "active shutdown", and as such channel allocations fail.

Basically, the test is walking dead at that point.

e.g.

[Mar  3 14:39:00] VERBOSE[4143848][C-00000003] pbx.c: Executing [s2@test-anoncallreject:6] Originate("PJSIP/autotest2-00000001", "Local/22s3test-anoncallreject@orig-again,app,Wait,5") in new stack
[Mar  3 14:39:00] DEBUG[4143848][C-00000003] app_originate.c: Originating call to 'Local/22s3test-anoncallreject@orig-again' and connecting them to Wait(5)
[Mar  3 14:39:00] WARNING[4143848][C-00000003] channel.c: Channel allocation failed: Refusing due to active shutdown
[Mar  3 14:39:00] WARNING[4143848][C-00000003] core_unreal.c: Unable to allocate owner channel structure

I've found this happening every single time there is a problem like this. Why is there an "active shutdown"? I certainly didn't tell it to. It's the test suite telling it to.
It seems the logic to detect that a test is over is faulty.

I think this one is actually "Major" because if tests can just get prematurely killed when UserEvent is called, then that's a serious problem which pretty much makes it impossible to run tests without rewriting them to use only a single UserEvent.

By: Joshua C. Colp (jcolp) 2022-03-03 13:55:54.944-0600

That's the Asterisk log, not the testsuite log. The testsuite log also contains information about what is going on. You need to examine that log as well.

This may be Major in the end, but at this point you are jumping to conclusions and extrapolating things without full investigation or analysis/understanding of what is going on.

By: Joshua C. Colp (jcolp) 2022-03-03 13:57:01.783-0600

And otherwise you would need to either provide the test with both Asterisk logging and testsuite logging, or demonstrate this from an existing test. The way tests are written can, and will, have an influence on the testsuite behavior.

By: N A (InterLinked) 2022-03-03 14:07:41.525-0600

Here's logs/messages.txt, if that's what you mean by "testsuite log":

It indeed seems to be thinking that all channels are gone, which is the other possibility that was in the back of my mind: [Mar 03 14:57:57] INFO[4145206]: asterisk.pluggable_modules:328 __hangup_handler: All channels have hungup; stopping test

The test is coded in a way however such that this should never be the case.
Here are the last few seconds before this seems to kick in: https://paste.interlinked.us/jumafd20px.txt

[Mar 03 14:57:40] INFO[4145206]: asterisk.test_case:175 __init__: Executing tests/intertest/anoncallreject
[Mar 03 14:57:40] INFO[4145206]: asterisk.test_case:286 create_asterisk: Creating Asterisk instance 1
[Mar 03 14:57:40] INFO[4145206]: asterisk.test_case:465 _start_asterisk: Starting Asterisk instance 1
[Mar 03 14:57:41] INFO[4145206]: asterisk.test_case:350 create_ami_factory: Creating AMIFactory 1 to 127.0.0.1
[Mar 03 14:57:41] INFO[4145206]: twisted:154 publishToNewObserver: Starting factory <starpy.manager.AMIFactory instance at 0x7f748ef21550>
[Mar 03 14:57:41] INFO[4145206]: AMI:184 connectionMade: Connection Made
[Mar 03 14:57:41] INFO[4145206]: AMI:203 onComplete: Login Complete: {u'message': u'Authentication accepted', u'response': u'Success', u'actionid': u'pbxdev-140138591094896-1'}
[Mar 03 14:57:41] INFO[4145206]: asterisk.test_case:603 _ami_connect: AMI Connect instance 1
[Mar 03 14:57:41] INFO[4145206]: asterisk.pluggable_modules:93 ami_connect: AMI 0 connected
[Mar 03 14:57:41] INFO[4145206]: asterisk.pluggable_modules:116 originate_call: Originating call
[Mar 03 14:57:54] INFO[4145206]: asterisk.test_case:508 __stop_instances: Stopping Asterisk instance 1
[Mar 03 14:57:57] INFO[4145206]: asterisk.pluggable_modules:328 __hangup_handler: All channels have hungup; stopping test
[Mar 03 14:57:58] INFO[4145206]: AMI:1137 clientConnectionLost: connection lost, reconnecting...
[Mar 03 14:57:58] INFO[4145206]: twisted:154 publishToNewObserver: <twisted.internet.tcp.Connector instance at 0x7f748ef21640 disconnected IPv4Address(type='TCP', host='127.0.0.1', port=5038)> will retr>
[Mar 03 14:57:58] INFO[4145206]: AMI:1140 clientConnectionLost: None
[Mar 03 14:57:58] INFO[4145206]: twisted:154 publishToNewObserver: Stopping factory <starpy.manager.AMIFactory instance at 0x7f748ef21550>
[Mar 03 14:57:58] WARNING[4145206]: asterisk.ami:208 __check_result: Event occurred 1 times, which is out of the allowable range
[Mar 03 14:57:58] WARNING[4145206]: asterisk.ami:209 __check_result: Event description: {'count': 2, 'conditions': {'match': {'UserEvent': 'LineTest', 'Event': 'UserEvent'}}, 'requirements': {'match': {>
[Mar 03 14:57:58] INFO[4145206]: asterisk.test_case:529 __stop_reactor: Stopping Reactor
[Mar 03 14:57:58] INFO[4145206]: twisted:154 publishToNewObserver: Main loop terminated.
[Mar 03 14:57:58] INFO[4145206]: test_runner:313 main: Test run for tests/intertest/anoncallreject completed with result False

Test case:

{noformat}
[test-anoncallreject] ; Expect: 2. Test that Anonymous Call Rejection works.
exten => s1,1,Answer()
same => n,NoOp(${FILTER(${rcfilter},${SHELL(${rcexec} 'RC:LINE,TYP=CHG,ORD=0,TN=${clid1},AYK=ADD!' | ${rcpost})})}) ; add AYK to this line if it's not already on it.
same => n,SendDTMF(*77) ; enable Anonymous Call Rejection, if not already enabled
same => n,Wait(1)
same => n,Originate(Local/02s$[${EXTEN:-1:1}+1]${CONTEXT}@orig-again,app,Wait,5) ; in 0 seconds, go off-hook on peer #2, and send to s2 in this context.
same => n,Hangup()
exten => s2,1,Answer()
same => n,Wait(2)
same => n,Set(GLOBAL(receivecall)=1) ; yes, this call should come through.
same => n,SendDTMF(${peer1dial}) ; call A.
same => n,Wait(3)
same => n,Originate(Local/22s$[${EXTEN:-1:1}+1]${CONTEXT}@orig-again,app,Wait,5) ; in 2 seconds, go off-hook on peer #2, and send to s3 in this context.
same => n,Hangup() ; cancel the call
exten => s3,1,Answer()
same => n,Wait(2)
same => n,Set(GLOBAL(receivecall)=0) ; no, this call should NOT come through.
same => n,SendDTMF(*67${peer1dial}) ; call A, but "block" our Caller ID.
same => n,Wait(3)
same => n,Originate(Local/21s$[${EXTEN:-1:1}+1]${CONTEXT}@orig-again,app,Wait,5) ; in 2 seconds, go off-hook on peer #1, and send to s4 in this context.
same => n,Hangup() ; cancel the call
exten => s4,1,Answer()
same => n,Wait(2)
same => n,SendDTMF(*87) ; disable Anonymous Call Rejection
same => n,Wait(1)
same => n,Originate(Local/02s$[${EXTEN:-1:1}+1]${CONTEXT}@orig-again,app,Wait,5) ; in 0 seconds, go off-hook on peer #2...
same => n,Hangup()
exten => s5,1,Answer()
same => n,Wait(2)
same => n,Set(GLOBAL(receivecall)=1) ; yes, this call should come through.
same => n,SendDTMF(*67${peer1dial}) ; call A.
same => n,Wait(3)
same => n,Hangup()
exten => ${peer1},1,ExecIf($["${receivecall}"!="1"]?UserEvent(LineTest,Result: Fail))
same => n,Assert($["${receivecall}"="1"])
same => n,UserEvent(LineTest,Result: Pass) ; okay, non-anonymous call came through
same => n,Wait(4) ; should get disconnected during this time...
same => n,UserEvent(LineTest,Result: Fail) ; should not get here
same => n,Assert(0)

[orig-again] ; Delayed origination
exten => _X.,1,Answer()
same => n,Wait(0.25) ; wait at least 0.25 seconds, so that if 0 seconds, same line can hang up.
same => n,Wait(${EXTEN:0:1})
same => n,Originate(PJSIP/${plarcode}@${peer${EXTEN:1:1}},exten,${EXTEN:4},${EXTEN:2:2},1)
same => n,Hangup()
{noformat}

By: Joshua C. Colp (jcolp) 2022-03-03 14:17:22.894-0600

The file that contains all information for the testsuite is the full.txt file. This will also include log messages for all the channels it is tracking. If it's not present in the logs directory of testsuite, it would be in the /tmp/asterisk-testsuite sub directory for the test execution. Depending on how Originate works underneath and if it's fully asynchronous, then I could see the channel not being fully created before the current channel is hung up, but that's a theory. If so then the test would need to be written to not use the hangup monitor stuff and the test itself would have to trigger termination using its own mechanism. I don't have any examples of doing that handy, because I've rarely needed to do it.

By: N A (InterLinked) 2022-03-03 14:24:44.741-0600

Here's full.txt: https://paste.interlinked.us/izf6xfiflo.txt

This actually confirms my earlier theory: it can be seen clearly that a core stop gracefully is sent to the Asterisk *IMMEDIATELY* after a Success UserEvent occurs. There isn't any channel hangup in the middle. As soon as that event is processed, BAM, down she goes:

Then *LATER*, because the channel originate fails due to "active shutdown", THEN there are no channels left and it cleans up, but that is actually the effect, not the cause.

So it does seem to be improperly aborting on "Pass".

{noformat}
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Event: UserEvent'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Privilege: user,all'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Channel: PJSIP/autotest1-00000002'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'ChannelState: 4'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'ChannelStateDesc: Ring'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'CallerIDNum: 2156'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'CallerIDName: ALBERT, NAVEEN'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'ConnectedLineNum: <unknown>'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'ConnectedLineName: <unknown>'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Language: en'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'AccountCode: '
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Context: test-anoncallreject'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Exten: autotest1'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Priority: 3'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Uniqueid: 1646337474.36'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Linkedid: 1646337474.36'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'UserEvent: LineTest'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: 'Result: Pass'
[Mar 03 14:57:54] DEBUG[4145206]: AMI:171 lineReceived: Line In: ''
[Mar 03 14:57:54] DEBUG[4145206]: AMI:233 dispatchIncoming: Dispatch Incoming
[Mar 03 14:57:54] DEBUG[4145206]: AMI:263 dispatchIncoming: Incoming Message: {u'connectedlinenum': u'<unknown>', u'linkedid': u'1646337474.36', u'uniqueid': u'1646337474.36', u'language': u'en', u'accountcode': u'', u'channelstate': u'4', u'exten': u'autotest1', u'calleridnum': u'2156', u'priority': u'3', u'userevent': u'LineTest', u'connectedlinename': u'<unknown>', u'result': u'Pass', u'context': u'test-anoncallreject', u'calleridname': u'ALBERT, NAVEEN', u'privilege': u'user,all', u'event': u'UserEvent', u'channel': u'PJSIP/autotest1-00000002', u'channelstatedesc': u'Ring'}
[Mar 03 14:57:54] DEBUG[4145206]: asterisk.ami:165 __event_callback: Condition UserEvent: LineTest matches UserEvent: LineTest in event
[Mar 03 14:57:54] DEBUG[4145206]: asterisk.ami:165 __event_callback: Condition Event: UserEvent matches Event: UserEvent in event
[Mar 03 14:57:54] DEBUG[4145206]: asterisk.ami:254 event_callback: Requirement Result: Pass matches Result: Pass in event
[Mar 03 14:57:54] INFO[4145206]: asterisk.test_case:508 __stop_instances: Stopping Asterisk instance 1
[Mar 03 14:57:54] DEBUG[4145206]: asterisk.asterisk:574 __send_stop_gracefully: sending stop gracefully
[Mar 03 14:57:54] DEBUG[4145206]: asterisk.asterisk:938 cli_exec: Executing ['/usr/sbin/asterisk', '-C', '/tmp/asterisk-testsuite/238b995b0768752849801561e0174342/run_23/ast1/etc/asterisk/asterisk.conf', '-rx', 'core stop gracefully'] ...
{noformat}

By: Joshua C. Colp (jcolp) 2022-03-03 14:27:31.816-0600

Please attach stuff. If this is an actual issue somewhere, then having the log attached means it's not dependent on an outside server.

You also haven't provided the test itself, just the dialplan. The test itself defines how the testsuite will behave.

By: N A (InterLinked) 2022-03-03 14:29:59.166-0600

Here is the test template, dynamically filled in with the proper values at run time - $SUCCESS_COUNT is 2:

{noformat}
testinfo:
   summary: '$TEST_NAME'
   description: |
       '$TEST_DESCRIPTION'

test-modules:
   test-object:
       config-section: test-object-config
       typename: 'test_case.TestCaseModule'
   modules:
       -
           config-section: caller-originator
           typename: 'pluggable_modules.Originator'
       -
           config-section: hangup-monitor
           typename: 'pluggable_modules.HangupMonitor'
       -
           config-section: ami-config
           typename: 'pluggable_modules.EventActionModule'

test-object-config:
   connect-ami: True

caller-originator:
   channel: 'Local/$TEST_NAME@test'
   app: 'Wait'
   exten: '5'
   trigger: 'ami_connect'

hangup-monitor:
   ids: '0'

ami-config:
   -
       ami-events:
           conditions:
               match:
                   Event: 'UserEvent'
                   UserEvent: 'LineTest'
           requirements:
               match:
                   Result: 'Pass'
           count: $SUCCESS_COUNT
       stop_test:

properties:
   tags:
       - dial
       - apps
   dependencies:
       - python: 'twisted'
       - python: 'starpy'
       - asterisk: 'app_dial'
       - asterisk: 'app_userevent'
       - asterisk: 'app_originate'
       - asterisk: 'app_if'
       - asterisk: 'pbx_config'
{noformat}

By: Joshua C. Colp (jcolp) 2022-03-03 14:43:02.035-0600

I believe you want this instead:

{code}
ami-config:
   -
       ami-events:
           conditions:
               match:
                   Event: 'UserEvent'
                   UserEvent: 'LineTest'
           requirements:
               match:
                   Result: 'Pass'
           count: $SUCCESS_COUNT
           action: 'stop'
{code}

By: N A (InterLinked) 2022-03-03 14:49:49.671-0600

Yeah that works... looks like only difference is

stop_test: -> action: 'stop'

A lot of the stuff in the tree has "stop_test" so what's the difference with the older method?

By: Joshua C. Colp (jcolp) 2022-03-03 14:55:04.102-0600

The action: 'stop' is triggered when the count is reached for the specific thing. The stop_test is done when the event is done once. Normally you would send a single event to indicate the entire test is done and to stop the test. The action mechanism was added so count could be taken into account, while the stop_test method was kept as it was widespread (plus internally it's an action which is used in non-AMI cases as well).