Summary: ASTERISK-16521: Lock during module reload
Reporter: shays (shays)
Labels:
Date Opened: 2010-08-08 03:55:30
Date Closed: 2012-01-20 18:15:13.000-0600
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Resources/res_config_pgsql
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) backtrace.txt
( 1) core_show_locks.txt
( 2) core-show-locks.txt
Description: When a phone registers during a module reload, the PBX locks up in res_config_pgsql.c at line 355.
Comments:
By: Paul Belanger (pabelanger) 2010-08-09 07:48:39

We'll need a backtrace to triage the issue.

---
Debugging deadlocks:

Please select DEBUG_THREADS and DONT_OPTIMIZE in the Compiler Flags section of menuselect. Recompile and install Asterisk (i.e. make install)

This will then give you the console command:

core show locks

When the symptoms of the deadlock present themselves again, please provide output of the deadlock via:

# asterisk -rx "core show locks" | tee /tmp/core-show-locks.txt

# gdb -se "asterisk" <pid of asterisk> | tee /tmp/backtrace.txt

gdb> bt
gdb> bt full
gdb> thread apply all bt

Then attach the core-show-locks.txt and backtrace.txt files to this issue. Thanks!

By: shays (shays) 2010-10-24 10:43:59

Still on 1.8.0

By: Matthew Nicholson (mnicholson) 2011-12-19 14:18:37.404-0600

The traces you uploaded don't show any problems. I will try to reproduce this.

By: Mark Michelson (mmichelson) 2012-01-13 11:06:56.842-0600

Looking at the backtraces and the core show locks, it appears the problem is not due to a typical deadlock situation such as lock inversion. Rather, the problem is that a thread has grabbed a lock and is now stuck waiting forever for a condition that is never going to happen. Specifically, the code is waiting in a poll() system call in the libpq code.
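To illustrate the kind of hang described above, here is a minimal sketch in C. It is not the libpq or res_config_pgsql.c code; `conn_lock`, `wait_for_reply_locked()` and `demo_silent_peer()` are hypothetical names, and a pipe stands in for the database socket. A thread takes a lock and then blocks in poll() on a descriptor that never becomes readable. With a timeout of -1 the call would never return and the lock would never be released; the demo passes a finite timeout so that it terminates.

```c
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical lock standing in for the one guarding the connection. */
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Wait for fd to become readable while holding conn_lock.
 * timeout_ms == -1 means "wait forever": if the peer never answers,
 * the lock is never released and every other thread that needs it hangs.
 * Returns the poll() result: 0 on timeout, >0 if readable, -1 on error.
 */
int wait_for_reply_locked(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    pthread_mutex_lock(&conn_lock);
    rc = poll(&pfd, 1, timeout_ms);
    pthread_mutex_unlock(&conn_lock);
    return rc;
}

/*
 * Demo: a pipe whose write end stays open but is never written to,
 * so the read end never becomes readable. Returns the poll() result.
 */
int demo_silent_peer(int timeout_ms)
{
    int fds[2];
    int rc;

    if (pipe(fds) != 0) {
        return -1;
    }
    rc = wait_for_reply_locked(fds[0], timeout_ms);
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

Calling `demo_silent_peer(50)` holds the lock for the full 50 ms before giving up; with -1 in place of 50 it would block indefinitely, which is the symptom that shows up as a held lock in "core show locks".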

Having looked at the code, the only explanation I could find that makes sense is that multiple Asterisk threads were stepping on each other and modifying Postgres data without proper synchronization. It would have to be a conflict between update_pgsql() (used for modifying data in the database) and realtime_pgsql() (used for getting data from the database), with update_pgsql() modifying data that realtime_pgsql() thought it had exclusive access to.

The only thing I could find was that update_pgsql() calls a function called find_table(), where it attempts to find a cached table. If a cached table cannot be found, then it queries the database for the table instead. The problem is, this database query is done with no lock held. In all other database accesses, the lock is held. This could lead to problems like the one you're seeing.
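The pattern described above, and the general shape of a fix, can be sketched as follows. This is an assumption-laden illustration, not the actual res_config_pgsql.c code or the June 2011 commit: `conn_lock`, `find_table_sketch()`, `query_table_locked()` and the fixed-size cache are all hypothetical stand-ins. The point is only that the cache-miss query runs under the same lock as every other access to the shared connection.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; not the real res_config_pgsql.c definitions. */
#define MAX_TABLES 8

struct table_entry {
    char name[64];
};

static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;
static struct table_entry cache[MAX_TABLES];
static int cache_count;
static int queries_run; /* counts simulated database round trips */

/* Simulated query over the shared connection. Caller must hold conn_lock. */
static int query_table_locked(const char *name, struct table_entry *out)
{
    queries_run++;
    snprintf(out->name, sizeof(out->name), "%s", name);
    return 0;
}

/*
 * Look up a table, querying the database on a cache miss.
 * The lock is held across BOTH the cache search and the fallback query,
 * so no other thread can use the connection in between.
 * Returns NULL if the table is not cached and the cache is full.
 */
const struct table_entry *find_table_sketch(const char *name)
{
    const struct table_entry *found = NULL;
    int i;

    assert(name != NULL);

    pthread_mutex_lock(&conn_lock);
    for (i = 0; i < cache_count; i++) {
        if (strcmp(cache[i].name, name) == 0) {
            found = &cache[i];
            break;
        }
    }
    if (!found && cache_count < MAX_TABLES) {
        /* Cache miss: the query runs with conn_lock still held. */
        if (query_table_locked(name, &cache[cache_count]) == 0) {
            found = &cache[cache_count++];
        }
    }
    pthread_mutex_unlock(&conn_lock);
    return found;
}
```

In this sketch the returned pointer stays valid after the unlock only because entries are never removed from the cache; a real implementation that can drop cached tables would need reference counting or a per-table lock.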

Now, the good news here is that this appears to have been fixed by a commit made in June of 2011. This means that any version of Asterisk 1.8 from 1.8.6.0 onward should have this problem fixed. If you are able to update and confirm this, that would be fantastic. Given how old this issue is, I can understand if you're not keeping up with it anymore. I'll give you some time, but if I don't hear back soon, I will assume the fix I mentioned did the trick and that this is no longer a problem.

By: Mark Michelson (mmichelson) 2012-01-20 18:15:13.107-0600

Assuming this was fixed by Jonathan's locking additions to res_config_pgsql.c in June of 2011. Closing.