Summary: ASTERISK-18318: Suspect deadlock in res_ais
Reporter: anonymouz666 (anonymouz666)
Labels:
Date Opened: 2011-08-22 12:54:52
Date Closed: 2011-09-08 19:55:49
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Resources/General
Versions: 1.8.5.0
Frequency of Occurrence: Constant
Related Issues:
Environment:
Attachments:
( 0) backtraces-threads.txt
( 1) backtrace-threads-without-timing-pthreads.txt
( 2) core-locks-without-timing-pthreads.txt
( 3) core-show-locks.txt
Description: I am testing distributed device state between two machines (A and B) running Asterisk 1.8.6.0-rc1.
Both corosync (1.4.1) and openais (1.1.4) were compiled from source on 64-bit CentOS 5.5.
The test consists of injecting calls from machine C into A, where they hit the queue application, and the device state is then sent to machine B.
After some calls, the device states are no longer in sync between the machines, and 'ais clm show members' gets stuck on machine A (all CLI commands starting with 'ais' stop working). The attached info was gathered when that happened. On machine B, any 'ais <command>' works fine from the CLI.

If I just restart Asterisk on machine A ('core restart now'), 'ais <whatevercommand>' still does not work. It is necessary to stop Asterisk and then corosync, and then start corosync and Asterisk again. After that, things work until the cycle happens again.

This happens in a testing environment (to which I can provide access), and I can reproduce it every time after a few minutes of calls.
Comments:
By: anonymouz666 (anonymouz666) 2011-08-22 12:57:07.844-0500

Here is the debug info from when machine A stops working.

By: Leif Madsen (lmadsen) 2011-08-29 09:39:44.025-0500

Don't use res_timing_pthread -- that definitely causes issues that appear to be deadlocks, when really it's just the timing module taking a very long time to process stuff. It is not an efficient timing module, and should be avoided in most cases.

By: Leif Madsen (lmadsen) 2011-08-29 09:40:08.387-0500

Basically, I'd suggest you use res_timing_dahdi.
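
[Editor's sketch, not part of the original comment] One way to follow this suggestion is to keep the pthread timing module from loading in modules.conf. The fragment below is a minimal example that assumes autoload is enabled and that res_timing_dahdi.so has been built against an installed DAHDI; neither detail is stated in the ticket.

```
; /etc/asterisk/modules.conf (assumed default location)
[modules]
autoload=yes

; Keep the pthread-based timing module from loading, so that another
; timing source (for example res_timing_dahdi.so, if it is built and
; DAHDI is present) is used instead.
noload => res_timing_pthread.so
```

After restarting Asterisk, the CLI command 'module show like timing' lists which timing modules were actually loaded.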

By: anonymouz666 (anonymouz666) 2011-08-29 10:50:53.691-0500

The problem also occurs even with noload => res_timing_pthreads.so

By: anonymouz666 (anonymouz666) 2011-08-29 10:51:17.743-0500

The problem also occurs even with noload => res_timing_pthreads.so

By: Russell Bryant (russell) 2011-09-08 19:55:29.500-0500

I took a look at this and it turned out to be a problem in corosync.  I have submitted a patch to corosync to resolve it.

https://lists.linux-foundation.org/pipermail/openais/2011-September/016726.html