
Summary: ASTERISK-19722: Manager disconnects under high originate load volume
Reporter: JoshE (n8ideas)
Labels:
Date Opened: 2012-04-13 14:17:11
Date Closed: 2012-04-17 09:37:13
Priority: Major
Regression?:
Status: Closed/Complete
Components:
Versions: 1.8.11.0
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
Description: I am seeing issues trying to run high-concurrency asynchronous Originate actions from a remotely connected manager.  I have a single manager logged on and am originating asynchronously.  Within 2-3 minutes of attempting several hundred originates per minute, my manager is disconnected.  The following error appears in the logs 100% of the time:

[Apr 13 13:09:17] DEBUG[26389] utils.c: Timed out trying to write
[Apr 13 13:09:17] VERBOSE[26389] manager.c:   == Manager 'mymanager' logged off from 10.178.x.x

This issue affects Asterisk 1.8.8.2, 1.8.11.0, and 10.3.0 in exactly the same way.
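
For reference, the load is generated by a single AMI connection firing asynchronous Originate actions in a loop, along the lines of the sketch below (a minimal Python illustration only; the host, credentials, channel and extension shown are placeholders, not my real values):

#!/usr/bin/env python
# Illustrative AMI load-generator sketch; placeholder host/credentials/dial strings.
import socket

def send_action(sock, action, **headers):
    # AMI actions are CRLF-separated "Key: Value" headers terminated by a blank line.
    lines = ["Action: %s" % action] + ["%s: %s" % (k, v) for k, v in headers.items()]
    sock.sendall(("\r\n".join(lines) + "\r\n\r\n").encode())

sock = socket.create_connection(("10.178.0.1", 5038))      # hypothetical manager address
send_action(sock, "Login", Username="mymanager", Secret="secret")

# Several hundred asynchronous originates; Async=true makes Asterisk spawn a
# fast_originate thread per call and report the outcome later via OriginateResponse.
for i in range(300):
    send_action(sock, "Originate",
                Channel="SIP/trunk/1555000%04d" % i,
                Context="outbound", Exten="s", Priority="1",
                Async="true", ActionID="orig-%d" % i)

# The client has to keep draining responses and events; if it stops reading,
# Asterisk's writes to this socket eventually block and time out.
while True:
    data = sock.recv(4096)
    if not data:
        break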

In some cases, depending on timing, the behavior resembles a deadlock: running "manager show connected" from the CLI hangs on:

Server-1*CLI> manager show connected
 Username         IP Address       Start       Elapsed     FileDes   HttpCnt   Read   Write

For a period of 10-20 seconds.  In this state, I've been able to show this lock:

=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: 0x7f3202f87700 (fast_originate       started at [ 4134] manager.c action_originate())
=== ---> Lock #0 (channel.c): MUTEX 3789 __ast_read chan 0x7f31f80c6ac0 (1)
/usr/sbin/asterisk(ast_bt_get_addresses+0x1d) [0x4f6736]
/usr/sbin/asterisk(__ast_pthread_mutex_lock+0xef) [0x4ef479]
/usr/sbin/asterisk(__ao2_lock+0x57) [0x447771]
/usr/sbin/asterisk() [0x475d9c]
/usr/sbin/asterisk(ast_read+0x1d) [0x478081]
/usr/sbin/asterisk(__ast_request_and_dial+0x53b) [0x47c0b4]
/usr/sbin/asterisk(ast_pbx_outgoing_exten+0xdf) [0x52c4a6]
/usr/sbin/asterisk() [0x4ff96b]
/usr/sbin/asterisk() [0x578d57]
/lib64/libpthread.so.0(+0x7d90) [0x7f329110dd90]
/lib64/libc.so.6(clone+0x6d) [0x7f329189af5d]
=== -------------------------------------------------------------------
===
=======================================================================

Not sure exactly what's going on with this, but has anyone run into this before?
Comments:

By: David Woolley (davidw) 2012-04-17 05:53:28.747-0500

Your AMI application isn't acknowledging the incoming packets fast enough.  You need to update writetimeout.  This is a configuration issue, not a bug.
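
For reference, writetimeout is a per-user option in manager.conf, expressed in milliseconds; raising it looks roughly like this (illustrative values only):

; manager.conf (illustrative values)
[general]
enabled = yes
port = 5038

[mymanager]
secret = secret
read = all
write = all
; how long Asterisk will wait for a blocked write on this AMI connection
; before giving up and dropping the session (the default is quite short)
writetimeout = 5000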

By: Matt Jordan (mjordan) 2012-04-17 09:40:26.856-0500

I agree that this does not appear to be a bug, and have closed it as such.

There are a few things to note based on what you've provided:
1. The lock statement you've provided implies you're running with DEBUG_THREADS.  This significantly impacts performance.  If there isn't a reason to be running with DEBUG_THREADS, then don't do it.
2. The core show locks output, on a system running with 200 originated calls in the space of a few seconds, should show something from time to time - you've just kicked off a ton of channels.  Channels get locked sometimes.  This is normal (particularly when that lock is around an ast_channel_read call).  The fact that 'core show locks' outputs something is not indicative of anything other than that there are things locked in the system.
3. Eventually, if you spawn enough calls fast enough, you will hit performance limitations of your system.  That is not a bug.

By: JoshE (n8ideas) 2012-04-17 10:49:45.195-0500

Understood.

The CLI's lack of responsiveness / lockups seems to be an issue under load.  I've updated writetimeout to extraordinarily long values with no effect on the issue.  This isn't a massive socket load in the scheme of socket communications, and I'm still seeing manager disconnects quickly.  The rest of the resources on the boxes are totally fine - RAM, CPU, disk I/O, etc.

The debug isn't much help in figuring out which end the problem is on either.  It's hard to tell whether the network stack on my end wasn't accepting the write (or the queue was full), or whether the problem was in getting the data across to the manager on the other side.
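
For what it's worth, my mental model of that "Timed out trying to write" message is a poll-for-writability pattern: if the reader stops draining the socket, the kernel send buffer fills, the socket never becomes writable within the timeout, and the writer gives up.  A toy sketch of that pattern (not Asterisk's actual code) in Python:

# Toy illustration of a bounded, poll-guarded write; not Asterisk source.
# If the peer stops reading, the send buffer fills, POLLOUT never fires within
# the timeout, and the writer bails out -- typically because the far end has
# stopped reading, not because the writer is broken.
import select, socket

def careful_write(sock, data, timeout_ms):
    sock.setblocking(False)
    poller = select.poll()
    poller.register(sock, select.POLLOUT)
    while data:
        if not poller.poll(timeout_ms):      # not writable within the timeout
            return False                     # analogous to the manager disconnect
        sent = sock.send(data)
        data = data[sent:]
    return True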