Summary:ASTERISK-07868: [patch] Asterisk deadlocks when channels starting an AGI script get hung up
Reporter:Marcus Hunger (fnordian)Labels:
Date Opened:2006-10-04 05:08:06Date Closed:2006-12-10 18:52:36.000-0600
Versions:Frequency of
Environment:Attachments:( 0) 20061130__bug8086__2.diff.txt
( 1) 20061210__block_signals_during_fork.diff.txt
( 2) asterisk-
( 3) emilec-gbd-output.txt
( 4) jlgdeveloper-gdb-two-processes-output_.txt
( 5) patch12.diff
Description:There seems to be a deadlock in asterisk when asterisk's forks to exec an agi and the channel get's hung up after the fork and before the exec. I noticed several dead-locked asterisk processes per day.
My guess is, that when the channel hangs up, asterisk sends a sighup to the child which is still an asterisk-instance and so has an sighup-handler that tries to reload asterisk's config (reloading the configuration is marked with /* XXX This could deadlock XXX */).
GDB seems to agree with that:

#0 ?0x4002a436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 ?0x4002789f in _L_mutex_lock_73 () from /lib/tls/libpthread.so.0
#2 ?0x00000000 in ?? ()
#3 ?0x00000000 in ?? ()
#4 ?0x00000000 in ?? ()
ASTERISK-1 ?0x00000000 in ?? ()
ASTERISK-2 ?0x00000000 in ?? ()
ASTERISK-3 ?0x41277ec8 in ?? ()
ASTERISK-4 ?0x41277ebc in ?? ()
ASTERISK-5 ?0x08057a6b in ast_verbose (fmt=0x80f35b4 " ?== Parsing '%s': ") at lock.h:601
ASTERISK-6 0x08057a6b in ast_verbose (fmt=0x41277ec8 "\224?'A4\177'A\004\177'A?H\020\b??'AD\177'A\024\177'A?3\017\b") at lock.h:601
ASTERISK-7 0x0805e60b in config_text_file_load (database=0x4127a3e4 "", table=0x4127a2e4 "", filename=0x81048d6 "manager.conf",
? ? cfg=0x814a540) at config.c:583
ASTERISK-8 0x0805d721 in ast_config_internal_load (filename=0x811a29c "?B\017\b??\005\b", cfg=0x4127a2e4) at config.c:959
ASTERISK-9 0x0805dfb4 in ast_config_load (filename=0xfffffffc <Address 0xfffffffc out of bounds>) at config.c:976
ASTERISK-10 0x080a9c44 in init_manager () at manager.c:1676
ASTERISK-11 0x0805b982 in ast_module_reload (name=0x0) at loader.c:223
ASTERISK-12 0x080b75a2 in hup_handler (num=-4) at asterisk.c:751
ASTERISK-13 <signal handler called>
ASTERISK-14 0x40028dcb in pthread_sigmask () from /lib/tls/libpthread.so.0
ASTERISK-15 0x4043271c in launch_script (script=0x4043b558 "|?", argv=0x4127b330, fds=0x4127ab08, efd=0x0, opid=0x4127ab18) at res_agi.c:298
ASTERISK-16 0x404323db in agi_exec_full (chan=0x82d5c00, data=0x4127b330, enhanced=0, dead=0) at res_agi.c:2044
ASTERISK-17 0x40431d50 in agi_exec (chan=0x82d5c00, data=0x0) at res_agi.c:2063
ASTERISK-18 0x0808e515 in pbx_extension_helper (c=0x82d5c00, con=0x0, context=0x82d5d50 "pstn-out", exten=0x82d5e44 "_________",
? ? priority=16, label=0x0, callerid=0x0, action=0) at pbx.c:553
ASTERISK-19 0x0808f0ba in __ast_pbx_run (c=0x82d5c00) at pbx.c:2230
ASTERISK-20 0x0808fdaf in pbx_thread (data=0x0) at pbx.c:2517
ASTERISK-21 0x40025b63 in start_thread () from /lib/tls/libpthread.so.0
ASTERISK-22 0x401ad18a in clone () from /lib/tls/libc.so.6

Some function's parameters seem to be a little bit strange, I guess there is a problem with the core-file generated using kill -11, but you should get an idea of what is happening.

Attached is a little patch that tries to fix the issue by modifying hup_handler to only reload the config, if getpid() equals ast_mainpid. I am not sure if this is correct. E.g. getpid() behaves different on different kernel versions (mine is 2.6.x, so getpid() returns the same value in all threads of the process) but it solved the problem for me: days without deadlocked asterisk processes.
Comments:By: Anthony LaMantia (alamantia) 2006-10-04 16:40:27

maybe it would be better just to have run_agi() send a SIGKILL rather then a SIGHUP the the child processes running the agi script?

By: Anthony LaMantia (alamantia) 2006-10-04 17:20:23

actually, by setting the environment variable AGISIGHUP to "no" or by using deadagi over agi you should be able to avoid this problem without having to patch asterisk.

By: Marcus Hunger (fnordian) 2006-10-05 03:35:28

AGISIGHUP is only in trunk, isn't it? anyway, i think this racecondition is a bug that should be fixed and not only worked around. My agi-scripts rely on a SIGHUP when the channel gets hung up, so disabling this functionality is not an option.
Changing the kind of signal (e.g. SIGTERM, or anything else that could be intercepted and which is ignored by asterisk) would solve the racecondition but break compatibility.

By: Anthony LaMantia (alamantia) 2006-10-05 10:05:33

intresting, if that is the case, it may be better just to replace the the signal handler for sighup in the forked processes with one that just returns. that would prevent any problems related to getpid() what are your opinions on this?

By: Marcus Hunger (fnordian) 2006-10-05 10:57:56

trying to prevent the deadlock after fork()ing would reduce the possibility of the race condition, but still does not fix it. it would be a quick fix, though. rather than specifying a just returning signalhandler, i would ignore the signal in this case.
i wonder if someone realy familiar with asterisk's core has an idea about this.

By: jmls (jmls) 2006-11-04 02:19:34.000-0600

anyone familiar with the core have an idea on this ?

By: Emile Coetzee (emilec) 2006-11-17 00:14:03.000-0600

I am seeing this same problem at a number of customer sites. It seems to be easily reproducible when using a queue with a ring all strategy and about 6 static agents. After a while a number of asterisk processes start building up. They all have 0 CPU time and can only be ended with a kill -9. We have now been able to consistently reproduce this problem in a test environment using two asterisk servers each with a PRI linked with a cross over cable. We then use a simple script to push calls into the queue. I have attached my gdb output.

Asterisk: 1.2.13
OS: Suse Linux Enterprise 9 SP3
FreePBX 2.1.3

By: Emile Coetzee (emilec) 2006-11-28 01:41:54.000-0600

We have used the reporters' patch for almost 2 weeks now and we no longer get multiple asterisk processes. Any chance this could be commited to a future release?

By: Tilghman Lesher (tilghman) 2006-11-30 01:31:11.000-0600

I think the right thing to do here is to:
1) Block all signals prior to the fork().
2) Reset signal handlers in the child after the fork().
3) Unblock signals.

This ensures that there is no race condition:  signals are blocked until they can be handled properly in the child.  In this case, that means the child simply dies when it receives one of those signals, instead of messing with the Asterisk core.  Patch uploaded for this purpose.

By: Marcus Hunger (fnordian) 2006-11-30 06:24:56.000-0600

Corydon76: there's still a race, which could lead to the case, that the child get's sighuped before signal-handlers are restored. in this case, the child does not terminate but starts executing the agi, does it?

By: Tilghman Lesher (tilghman) 2006-11-30 08:18:22.000-0600

That's the beauty of this.  The signals are suspended; that is, they wait to fire until the mask is restored.

By: Marcus Hunger (fnordian) 2006-11-30 08:34:40.000-0600

I were not aware of that. Your patch looks good to me, though I did not try it, yet.

By: Jonathan Galpin (jlgdeveloper) 2006-11-30 11:16:25.000-0600

I have added a gdb dump of two hung asterisk processes that are typical of this problem.

I might have ten of them on a busy day, comming in groups of two.

I will be trying the patch in the next day or so.

By: Jonathan Galpin (jlgdeveloper) 2006-12-05 10:06:15.000-0600

I applied this patch last night:


And have had two processes spawned so far this morning....

Is this the correct patch?

If so, this has not worked for me.

By: Serge Vecher (serge-v) 2006-12-05 10:19:41.000-0600

you've applied the older (wrong) patch. Please try 20061130__bug8086__2.diff.txt

By: Francesco Romano (francesco_r) 2006-12-05 10:37:16.000-0600

I have applied this patch in one of the asterisk before i had many problems and now i have 5 days uptime, seems to works well. No more child processes. The machine is the same of the bug ASTERISK-7983.

By: Jonathan Galpin (jlgdeveloper) 2006-12-05 11:06:12.000-0600

I'll apply the correct patch this evening, and will report back after a few days, or sooner depending on the results.

Appreciate the input.

By: Jonathan Galpin (jlgdeveloper) 2006-12-08 09:55:58.000-0600

Ok, I have had this patch (20061130__bug8086__2.diff.txt) for about three days now, and have not had any spawned processes.

I believe it is working as I would have had a few each day otherwize.

I recommend committing it.

Thanks for creating it.

Jonathan Galpin

By: Tilghman Lesher (tilghman) 2006-12-10 18:52:36.000-0600

Committed in 48374.  Merged in 48375, 48376.