ASTERISK-06932: Manager is not returning events properly

Summary: ASTERISK-06932: Manager is not returning events properly

Reporter: Alex Richardson (alexrch) Labels:

Date Opened: 2006-05-09 05:29:59 Date Closed: 2006-09-06 12:59:36

Priority: Minor Regression? No

Status: Closed/Complete Components: Core/ManagerInterface

Versions: Frequency of Occurrence:

Related Issues:

Environment: Attachments: ( 0) TestSocket.rar

Description: Sometimes (roughly once per 4000 lines, depending on the speed of the network) the manager returns corrupted events. For example, one QueueMember event gets overwritten by (or interleaved with) another, like this:

Event: QueueMember
Queue: 09
LocatiEvent: QueueMember
Queue: 09
Location: Agent/09003
Membership: static
Penalty: 1
CallsTaken: 0
LastCall: 0
Status: 5
Paused: 0
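
The corruption shown above can be detected mechanically on the client side. The Python sketch below is one possible check; it is not part of Asterisk or of the attached test application. It flags duplicate keys and keys that look like a truncated header fused with 'Event' inside a single event block. The names check_event, HEADER_RE and SAMPLE are hypothetical, and the heuristics assume that each key appears at most once per event and that no legitimate key other than 'Event' ends in 'Event' (extend the allow-list if your version differs).

```python
import re

# Manager headers look like "Key: Value". This pattern is an assumption
# about the wire format, not taken from the Asterisk sources.
HEADER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9_-]*): ?(.*)$")

# Keys that may legitimately end in "Event" (assumed; extend as needed).
ALLOWED_EVENT_KEYS = {"Event"}

# The corrupted block quoted in this report.
SAMPLE = [
    "Event: QueueMember",
    "Queue: 09",
    "LocatiEvent: QueueMember",
    "Queue: 09",
    "Location: Agent/09003",
    "Membership: static",
    "Penalty: 1",
    "CallsTaken: 0",
    "LastCall: 0",
    "Status: 5",
    "Paused: 0",
]


def check_event(lines):
    """Return a list of problems found in one event block (empty if none)."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        match = HEADER_RE.match(line)
        if match is None:
            problems.append(f"line {number}: not a 'Key: Value' header")
            continue
        key = match.group(1)
        if key in seen:
            # The same key twice suggests two events were merged.
            problems.append(f"line {number}: duplicate key {key!r}")
        seen.add(key)
        if key.endswith("Event") and key not in ALLOWED_EVENT_KEYS:
            # e.g. "LocatiEvent": "Location" cut short, then a new event.
            problems.append(f"line {number}: suspicious key {key!r}")
    return problems


for problem in check_event(SAMPLE):
    print(problem)
```

A client could run such a check on every block it receives and either discard or log the blocks that fail.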

****** ADDITIONAL INFORMATION ******

This behaviour can easily be reproduced in branch 25988, or releases 1.2.x (including 1.2.7.1), by repeatedly sending QueueStatus commands.

However, I have not been able to reproduce this in trunk 25930. My guess is that something is buggy in the locking mechanism of the current Asterisk releases, but that is only a guess. I am therefore wondering whether it is safe to use the trunk version in a production system.

Comments: By: Serge Vecher (serge-v) 2006-05-09 08:37:34

alexrch: didn't you post about this in 7013? The fix to 1.2 didn't fix the issue?
By: Alex Richardson (alexrch) 2006-05-09 10:24:30

vechers - yes, I posted this in 7013 as well. As 7013 deals with the QueueStatusComplete event only, I thought I would open a new issue, because I am having random trouble with all events. The fix for 7013 seems to work fine in trunk, but the problem persists in branch 1.2.

I assume that the locking mechanism in trunk (which works okay) is different from the locking mechanism in the branch (where the bug persists), which is probably why I am having these problems in the branch only. Could this be true?

If so, is it safe to use trunk in production, given that I have no problems with the trunk version?
By: Alex Richardson (alexrch) 2006-05-11 06:13:37

It may as well be that ast_cti has some problems. I have tried to implement astman_append in the branch version of Asterisk, but I can't figure out how to do it properly - Asterisk crashes when I send the Action: QueueStatus command. Any suggestions?
By: Alex Richardson (alexrch) 2006-05-11 09:22:49

Another observation: it seems that executing the Reload command in the CLI makes things even worse.
By: Alex Richardson (alexrch) 2006-05-17 10:35:16

I created a simple test application (see TestSocket.rar) which may help you guys to reproduce the problem. The test application connects to the * Manager (using the default 'manager' & 'insecure' login) and executes the 'Action: QueueStatus' command every couple of seconds. The result returned from the Manager is then saved in a log file.

The test application is written in VB.NET - source code is included. To run the test application and check Asterisk's problematic behaviour, you should do this:

1. Define 10 queues with 60 agents in each
2. In the source code of the test application, set variable 'IPAdrs' to the IP address of the Manager, and 'LogFileName' to the filename where you want results returned from the Manager to be saved (example: 'c:\test\test.log').
3. Run the test application and press 'Start test' button.
4. Leave the application running until the log file reaches 100 MB or more.
5. Press 'Stop test' button.
6. Press 'Check result' button, which will display a message box for each problematic line in the log file that was generated during the test (this can take a while to process the whole 100 MB)

Note that this test application was created only to help you reproduce the problem. The same test can be done using any other similar application - for example with PuTTY, continuously sending the 'Action: QueueStatus' command.
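
For readers without a VB.NET toolchain, the procedure above can be approximated with a short Python script. This is a rough sketch under stated assumptions, not a port of TestSocket.rar: the function names build_action and run_test are hypothetical, port 5038 is the usual manager port but may differ in your manager.conf, and the username and secret must be supplied by the caller. Instead of waiting for a terminating event, it simply reads until the server has been quiet for one interval.

```python
import socket

AMI_PORT = 5038  # usual manager port; check manager.conf (assumption)


def build_action(action, **headers):
    """Encode one manager action as CRLF-terminated 'Key: Value' lines."""
    lines = [f"Action: {action}"]
    lines += [f"{key}: {value}" for key, value in headers.items()]
    # A blank line terminates the action.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def run_test(host, username, secret, log_path, rounds=10, interval=2.0):
    """Log in, send QueueStatus `rounds` times, append raw replies to a log."""
    with socket.create_connection((host, AMI_PORT), timeout=10) as sock, \
            open(log_path, "ab") as log:
        sock.sendall(build_action("Login", Username=username, Secret=secret))
        sock.settimeout(interval)
        for _ in range(rounds):
            sock.sendall(build_action("QueueStatus"))
            try:
                while True:
                    chunk = sock.recv(65536)
                    if not chunk:
                        return  # server closed the connection
                    log.write(chunk)
            except socket.timeout:
                # No data for `interval` seconds: start the next round.
                pass
```

Calling run_test with the manager's address, valid credentials and a log path, with a large rounds value, should produce a log comparable to the one the test application writes; the log can then be scanned for malformed lines.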

I have managed to reproduce it on a number of different computers, so please let me know if you need any additional *.conf files as well.
By: Serge Vecher (serge-v) 2006-06-01 15:12:38

alexrch: can you please try 1.2.8 and perhaps it is magically fixed there ... Thanks.
By: Alex Richardson (alexrch) 2006-06-02 09:54:27

vechers: unfortunately the problem still persists. :(
By: Matt Riddell (zx81) 2006-06-12 16:19:49

I have also seen the same problem at customer sites, and have temporarily coded workarounds in my client software to ignore corrupted results and to reconnect on disconnect.
By: Serge Vecher (serge-v) 2006-08-21 14:54:10

is this still an issue in 1.2.10?
By: Alex Richardson (alexrch) 2006-08-24 08:31:33

vechers: I will check right away and let you know.
By: Serge Vecher (serge-v) 2006-08-24 08:46:42

might as well check 1.2.11. then
By: Alex Richardson (alexrch) 2006-08-24 08:49:02

okay. I'm running the tests on 1.2.10 and 1.2.11 as we 'speak' :)
By: Alex Richardson (alexrch) 2006-08-24 09:08:56

Two out of 6247121 events were corrupted in 1.2.11, with similar statistics for 1.2.10. It seems to behave a bit better than 1.2.8 though.
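
To put that figure in perspective, the rate can be worked out directly. The small Python sketch below only restates the arithmetic on the numbers quoted above; the function names are hypothetical. Note that this count is per event, whereas the original estimate in the description was per line, so the two are not directly comparable.

```python
def corruption_rate(corrupted, total):
    """Fraction of events that arrived corrupted."""
    return corrupted / total


def events_per_corruption(corrupted, total):
    """Average number of events received per corrupted event."""
    return total / corrupted


rate = corruption_rate(2, 6247121)
print(f"rate: {rate:.1e}")  # rate: 3.2e-07
print(f"one corrupted event per {events_per_corruption(2, 6247121):.1f} events")
```

That is roughly one corrupted event in every three million.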
By: Serge Vecher (serge-v) 2006-08-24 09:25:00

that's a pretty good result, if you ask me ...
By: Serge Vecher (serge-v) 2006-09-06 12:57:13

alright: let's close this bug at this point. Given the minuscule ratio of failed events and the difficulty of reproducing it, I don't see a reason for this to be open. If anybody makes a "breakthrough" in terms of understanding the more specific mechanics of this bug, please feel free to reopen the issue (or ask a bug marshal to do so) with such information presented.

Thanks