Summary:ASTERISK-12129: Manager lock (?)
Reporter:Antonis Psaras (apsaras)Labels:
Date Opened:2008-06-02 17:39:39Date Closed:2008-06-19 12:35:30
Versions:Frequency of
Description:Our app recently experienced a strange problem while communicating with an Asterisk (via a manager session). Apparently there was a long pause (106”) in the manager session. After we issued a redirect action, a response event was sent back. After exhaustive debugging (using detailed log information produced by our app and a memory dump) we figured out that the first line of the response (Response: Success) came in instantly, but the event text was completed after 106sec (when reception is complete we set a timestamp in the receiving structure). This is the actual response event text:

Response: Success
ActionID: 18af6ce4-e873-4923-9602-c149ef9f08b8
Message: Redirect successful

The redirect command was executed asynchronously (hence the actionid). Network is not an issue here (we received a lot of FastAGI sessions from the same asterisk server in the meantime). The manager connection was not dropped. In fact, after the response event was completed, all subsequent events were received immediately, and the session resumed flawlessly afterwards.

Our conclusion is that the delay occurred while reading the lines of the response event. We suspect that there was a lock in astman_append (manager.c) causing the delay. Since the problem is not reproducible, we do not have a way to provide more info (to our knowledge, the error has occurred only once among many installations of the same environment). Also, since the manager connection is not responding while the problem occurs, it is not possible to automatically execute a core show locks command.
Comments:By: Tilghman Lesher (tilghman) 2008-06-16 11:00:17

If this occurs again, you could create a separate AMI connection and run 'core show locks' over that particular connection.

By: Tilghman Lesher (tilghman) 2008-06-19 12:35:30

Without more specific debugging information, we are unable to proceed.  The fact that you're also unable to reproduce this problem makes me wonder if perhaps a network component failed for 106 seconds, then when services were restored, the TCP connection retransmitted what it was unable to send before.