Summary: ASTERISK-00855: Asterisk threads hang
Reporter: markus (markus)
Labels:
Date Opened: 2004-01-15 19:08:21.000-0600
Date Closed: 2008-01-15 14:43:07.000-0600
Versions:
Frequency of
Environment:
Attachments:
( 0) ast_gdb_out.txt
( 1) ast_valgrind.pid24093
( 2) ast_valgrind.pid24107
( 3) mgrhack.patch
Description:
This problem was originally reported on the mailing list about a week ago.

We have a lot more information regarding this issue now and we know how to reproduce it:

1) from your local box telnet into the asterisk manager (port 5038) and log in
2) disconnect your local workstation's network
3) make about 20 phone calls (no matter if internal or voice-mail) and asterisk will hang (no dial-tone, no nothing)
4) if you re-plug your network and wait (a minute or two) asterisk will wake up again

Theory: ast-man tries to send events over the network to the local workstation. Since the workstation is unplugged from the net, ast-man can't send its data. The data buffers at the server, and once the buffer is full the ast-man thread blocks waiting for the buffer to empty. It does this while still holding a mutex, which causes the other *-threads to block. The buffer never empties, so Asterisk hangs.

Unfortunately this is only a theory.


Attached is a gdb thread trace of the described situation.
Comments:
By: Brian West (bkw918) 2004-01-15 20:09:02.000-0600

The problem listed by Derek isn't the problem you are describing AT ALL.  Go read that email from Derek again.  If you can run Asterisk under valgrind that would be most helpful... also, if you are not running the latest libpri, zaptel or Asterisk, you might want to try to update.

By: markus (markus) 2004-01-15 21:18:12.000-0600

I know it sounds completely different, but it's still the same issue. The difference comes from our findings.

Doing our research and testing we discovered at one point -- much to our surprise -- that after Asterisk "crashed" after about 80 calls (as described in Derek's mail) it would become live again if we closed all our client-side applications (which were connected to the *-Server via the manager port).

Once that had happened, we started investigating the issue from a different perspective, the Asterisk manager, and we found the aforementioned (and much easier) way to reproduce the problem.

I'll try running it under valgrind tomorrow.

PS: All the other details Derek mentions in his mail don't seem to be related to this issue. We didn't know that back then, though.

By: markus (markus) 2004-01-19 11:35:59.000-0600

Did those valgrind logs help?

By: Matt Florell (mflorell) 2004-01-19 12:43:24.000-0600

We have experienced this problem as well. If the network connection is resumed, Asterisk usually unfreezes. Unfortunately it doesn't always work. When we have no Manager connections active at all, Asterisk can run for days without freezing.

Is there a way to send data (commands) to the Manager interface that wouldn't generate any output back?

Or better yet, is there a way to make Manager more forgiving of lost connections?

By: Matt Florell (mflorell) 2004-01-21 07:07:44.000-0600

Is anything being done with this bug? Is there any way to fix it? Or do I need to stop using the manager interface somehow?

The only way I can see to stop using the manager interface is to build a perl wrapper that runs on the local asterisk machine (that would have more fault tolerance than manager has) and basically connect to the manager locally and pass commands and information to/from clients that would connect to it.

This seems very redundant, but I'm dependent on having manager connections on the client side, so that may be what I have to do.

By: markus (markus) 2004-01-21 14:00:25.000-0600

Yeah, we too would be interested in a fix here. We also need the manager to communicate with Asterisk.

If the fix somehow doesn't happen, maybe we can coordinate our efforts writing some local wrapper to the manager as you suggested. I'd rather use C for this, though, not Perl as it means less overhead and memory usage on the Asterisk server.

By: Matt Florell (mflorell) 2004-01-21 20:46:24.000-0600

C network programming isn't exactly my forte, but I could try to help, at least with a protocol or application design. I've done a lot of Asterisk manager interface programming with Perl and have found a lot of interesting ways around some of its limitations.

The simplest path may be just to emulate the manager interface and have our wrapper application just send a "Action: Logoff\n\n" string to the Asterisk server if no communication is received from the remote client in a few seconds or the connection is dropped. That would probably solve all of our problems.
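
The idle-timeout idea above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: the helper names wait_for_client() and logoff_if_idle() are hypothetical, both descriptors are assumed to be connected stream sockets, and the Logoff string is copied as quoted in the comment above.

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Logoff request, copied as quoted in the comment above. */
static const char LOGOFF_MSG[] = "Action: Logoff\n\n";

/*
 * Hypothetical helper: wait up to timeout_ms for the remote client.
 * Returns 1 if the client sent data, 0 if it stayed silent,
 * -1 on error or if the connection was dropped.
 */
static int wait_for_client(int client_fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = client_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;                       /* timeout: client is silent */
    if (pfd.revents & POLLIN) {
        char c;
        /* Peek without consuming; 0 bytes means the peer closed. */
        ssize_t n = recv(client_fd, &c, 1, MSG_PEEK);
        return n > 0 ? 1 : -1;
    }
    return -1;                          /* POLLERR, POLLHUP or POLLNVAL */
}

/*
 * Hypothetical relay step: if the client is silent or gone, log off
 * from the manager on its behalf.
 * Returns 1 if a Logoff was sent, 0 if the client is still talking,
 * -1 if sending the Logoff failed.
 */
static int logoff_if_idle(int client_fd, int manager_fd, int timeout_ms)
{
    if (wait_for_client(client_fd, timeout_ms) == 1)
        return 0;
    size_t len = strlen(LOGOFF_MSG);
    return write(manager_fd, LOGOFF_MSG, len) == (ssize_t)len ? 1 : -1;
}
```

A wrapper would call logoff_if_idle() in its relay loop with a timeout of a few seconds, so that the manager session on the Asterisk side is closed before its output buffer can fill up.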

By: bfranks (bfranks) 2004-02-02 15:12:58.000-0600

mflorell and markus, I experienced the exact same symptoms you described today.  I would be interested in collaborating with you on a fix to remedy the problem.

By: Matt Florell (mflorell) 2004-02-02 21:13:07.000-0600

OK, I've messed around with making a telnet wrapper for the manager interface, and it got WAY too complicated and I kept hitting roadblocks. It didn't scale well, it still had issues with buffer overflow, and it was a processor hog, so I decided to go in a completely different direction. What I ended up with was a database-based manager action queue system, a localhost-only manager API child process executor, and a localhost-only manager output listener. Yes, I know it sounds complicated, but I whipped it up in about 3 hours, it's been running for 2 days, and it's only lost 2 actions out of 4000, which is a pretty good percentage when you compare it to a deadlock. I also needed to change my applications so that they wouldn't expect an immediate response through a direct connection. In the end I had a system that was quite fast, very scalable and able to handle dozens of actions within a second, all with NO DEADLOCKS :)

Here's a more detailed runthrough of how my new process works:

There is a simple database table where the action information with unique IDs is stored and where each action's information is updated by the listener. Here's the table I use:

CREATE TABLE manager_queue (
uniqueid DOUBLE(18,7),
entry_date DATETIME,
response  ENUM('Y','N'),
server_ip VARCHAR(15) NOT NULL,
channel VARCHAR(20),
action VARCHAR(20),
callerid VARCHAR(20),
cmd_line_b VARCHAR(50),
cmd_line_c VARCHAR(50),
cmd_line_d VARCHAR(50),
cmd_line_e VARCHAR(50),
cmd_line_f VARCHAR(50),
cmd_line_g VARCHAR(50),
cmd_line_h VARCHAR(50),
cmd_line_i VARCHAR(50),
cmd_line_j VARCHAR(50),
cmd_line_k VARCHAR(50),
index (callerid),
index (uniqueid)
);

1. First, the GUI client application inserts a record into the table as a NEW action and includes a unique callerID for REDIRECT and ORIGINATE commands (the callerID field is how the listener will update the record in the DB):
INSERT INTO manager_queue values('','','2004-01-30 17:22:53','NEW','N','','','Originate','DL40130172253cc160','Channel: local/8600011@demo','Context: default','Exten: 917274515135','Priority: 1','Callerid: DL40130172253cc160','','','','','');

2. Second, there is a constantly running application on the Asterisk box that selects queued actions from the database to be processed, launches a new child script to send each action to the manager interface, and then marks the action as SENT.

3. Third, each child process logs into the manager interface and sends the action immediately, then stays open for 10 seconds to not cause any problems, clears its buffer, and then logs out and exits.

4. Fourth, the listener app is constantly connected on the Asterisk box to the manager interface and parses all output from it. Every time a "Newstate Ringing" event is seen, an update statement is sent to the DB, keyed by the callerid of the call, that fills in the channel the call is on and the uniqueID of the call and sets the record to "UPDATED".

5. Fifth, the listener also listens for "Hangup" events and sends an update to the DB with a "DEAD" status keyed by the call's uniqueID.
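
The life cycle of a queued action in the five steps above can be read as a small state machine. The following C sketch only models the status values named in those steps (NEW, SENT, UPDATED, DEAD). The enum and function names are hypothetical, and the rule that a Hangup only applies to a record that has already been UPDATED is an assumption, based on the uniqueID being filled in at that step.

```c
#include <stddef.h>

/* Status values named in the steps above. */
enum action_status { ST_NEW, ST_SENT, ST_UPDATED, ST_DEAD };

/* Hypothetical event names for what the sender and the listener observe. */
enum queue_event {
    EV_DISPATCHED,  /* sender handed the action to a child process */
    EV_RINGING,     /* listener saw "Newstate Ringing" for the callerid */
    EV_HANGUP       /* listener saw "Hangup" for the uniqueID */
};

/*
 * Return the status a record moves to when an event is seen.
 * Events that do not apply to the current status leave it unchanged.
 */
static enum action_status next_status(enum action_status cur,
                                      enum queue_event ev)
{
    switch (ev) {
    case EV_DISPATCHED:
        return cur == ST_NEW ? ST_SENT : cur;
    case EV_RINGING:
        return cur == ST_SENT ? ST_UPDATED : cur;
    case EV_HANGUP:
        /* Assumption: the uniqueID is only known once UPDATED. */
        return cur == ST_UPDATED ? ST_DEAD : cur;
    }
    return cur;
}

/* Text form of a status, as it would be stored in the table. */
static const char *status_name(enum action_status s)
{
    static const char *const names[] = { "NEW", "SENT", "UPDATED", "DEAD" };
    return names[s];
}
```

For example, an Originate record would start as NEW, become SENT once dispatched, UPDATED when the call rings, and DEAD on hangup.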

The above process executes extremely fast and has surprisingly little effect on the load of the Asterisk box. Because each action is sent through its own child process, there is no risk of the system freezing because of a single bad action thread (like Asterisk :) ). I have been running this for 2 solid days and it has handled thousands of action executions a day. Its only 2 failures have been the listener not updating 2 originate commands 6 hours apart, even though those 2 actions did get properly executed. I haven't figured out why, but I've added a ridiculous amount of logging now, so I should be able to fix that small issue soon.

All of this is programmed in Perl with MySQL as the backend database. I'd be willing to send my code to whoever would like to look at it, but it's not the prettiest code in the world; I did whip it up very quickly.

By: mjohnston (mjohnston) 2004-02-04 10:05:16.000-0600

I added some debugging ast_verbose calls to manager.c and tracked down this problem.  In manager_event() there are two writes - one with ast_cli() and the other with write().  I instrumented the code as follows:

   ast_verbose(VERBOSE_PREFIX_2 "Started write on %d\n", s->fd);
   ast_cli(s->fd, "Event: %s\r\n", event);
   ast_verbose(VERBOSE_PREFIX_2 "Finished write 1 on %d\n", s->fd);
   va_start(ap, fmt);
   vsnprintf(tmp, sizeof(tmp), fmt, ap);
   write(s->fd, tmp, strlen(tmp));
   ast_verbose(VERBOSE_PREFIX_2 "Finished write 2 on %d\n", s->fd);
   ast_cli(s->fd, "\r\n");

I then started Asterisk, connected to the manager interface, logged in, and unplugged the Asterisk server's network cable.  I simply picked up and hung up a phone on Zap/2 repeatedly, and after some 40-50 times, Asterisk froze and I couldn't get a dialtone.  The last message was:

Started write on 9

The write() in ast_cli() must be blocking, since the TCP buffer is full, and hanging the system.  I'm not sure what the best fix for it would be - perhaps the manager FDs should be set to non-blocking mode, and if the write results in EAGAIN, the connection should be dropped?

By: bfranks (bfranks) 2004-02-06 15:01:45.000-0600

Should we work for a better solution than the Manager port or try to fix the manager portion of *?

By: Matt Florell (mflorell) 2004-02-06 15:17:21.000-0600

I'm not an expert on Socket programming in C, but wouldn't it be possible to at the very least make the transmit buffer larger for the manager?

I don't have any problems with how the manager works, I just think it should be more fault tolerant of dropped/frozen/stalled connections.

Is there any easy way to drop the connections that have near-full buffers?

I really don't know if the connection control is built into Asterisk code or if it is a generic C library, does anybody know?

By: mjohnston (mjohnston) 2004-02-06 16:16:03.000-0600

I had a look at the code, and the problem is, from what I can tell, that it's not easy to drop connections when their buffers fill up.  There's a linked list of managers, with a separate thread running to process input.  However, output just goes directly to the manager's session.  If you just tear down the session when the buffer fills up, the input thread will be stranded, so it has to be alerted somehow.

I've modified my copy of manager.c to set the socket to non-blocking mode and, when the buffer fills up, to set a write_error flag on the session, but that's as far as I got - I don't know threads in C.  The only extra thing needed is a way to send a signal to the input session, telling it that there's been a write error and it should destroy itself.  Anyone?

I'm attaching a quick patch against manager.c that should keep Asterisk from locking hard.  It's far from ideal - if the buffer fills up, output will be discarded and the buffer maintained as long as the OS allows, and if the connection is reestablished (not reconnected, but reestablished), you'll get the buffered data, but you won't know that any data was lost.  Beats a hang, IMO.

Also, a proper fix would account for the other ways that a manager connection can be written, for completeness' sake - this is the only one that I can imagine ever hanging, though.

By: markus (markus) 2004-02-06 16:59:06.000-0600

I haven't had a chance to really look into the code yet, but I hope to be able to have a glimpse at the manager code. That input session you're talking about: which function implements that thread? Maybe I can come up with some way to notify that thread if a read-error occurs.

BTW: right now I'm playing around with Asterisk 1.0 on a test-system. Seems to have a few bugs fixed that we found (and fixed quite horribly ourselves in the previous releases ;-).

By: Brian West (bkw918) 2004-02-07 00:08:32.000-0600

Fixed in CVS

By: Digium Subversion (svnbot) 2008-01-15 14:43:06.000-0600

Repository: asterisk
Revision: 2138

U   trunk/Makefile
U   trunk/channels/chan_sip.c
U   trunk/configs/mgcp.conf.sample
U   trunk/manager.c

r2138 | markster | 2008-01-15 14:43:05 -0600 (Tue, 15 Jan 2008) | 5 lines

Insert blank after REFER (bug ASTERISK-991)
Correct path to VM sample (bug ASTERISK-988)
Make manager interface non-blocking (bug ASTERISK-855)
Don't bork on empty from in SIP (bug ASTERISK-881)



By: Digium Subversion (svnbot) 2008-01-15 14:43:07.000-0600

Repository: asterisk
Revision: 2139

U   branches/v1-0_stable/Makefile
U   branches/v1-0_stable/channels/chan_sip.c
U   branches/v1-0_stable/configs/mgcp.conf.sample
U   branches/v1-0_stable/manager.c

r2139 | markster | 2008-01-15 14:43:06 -0600 (Tue, 15 Jan 2008) | 5 lines

Insert blank after REFER (bug ASTERISK-991)
Correct path to VM sample (bug ASTERISK-988)
Make manager interface non-blocking (bug ASTERISK-855)
Don't bork on empty from in SIP (bug ASTERISK-881)