[Home]

Summary:ASTERISK-09333: "Mute" IAX2 Extensions
Reporter:Fabricio Machado (fmachado)Labels:
Date Opened:2007-04-27 09:50:22Date Closed:2007-07-20 17:46:05
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) Backtrace9614_Nicchap.zip
( 1) debug-260407.txt.zip
( 2) gdb_dump.14786.gz
( 3) gdb_dump.19797.gz
( 4) gdb_dump.26426.gz
( 5) last-debug.gz
( 6) messages-260407.txt.zip
( 7) while_ERROR.zip
( 8) while_OK.zip
Description: In last three weeks, we've got some problems on IAX2 extensions and it's now very frequent. Ramdomly our iax extensions stop responding and it's not possible make and receive calls.
The iaxthreadcount and iaxmaxthreadcount parameters are set to '256'.

It does't generate a crash, so we set 'debug verbose' to '9' and did activate 'iax debug'. These information are in 'debug-260407.txt' file.
The asterisk's messages file is attached too (messages-260407.txt)

****** ADDITIONAL INFORMATION ******

-------------
Our scenario:
-------------
extensions prococol: IAX2
softphone: IDEFISK
softphone codec: ALAW
asterisk version: SVN 61658 (both servers)
* Both are compiled with DONT_OPTIMIZE and DEBUG_THREADS options.
-------------

---------------------
Hardware Description:
---------------------

Server 'A':
-Dell PowerEdge 1800
-Dual Intel(R) Xeon(TM) CPU 3.00GHz Coreduo
-Digium TE110P card
-Digium TDM2400 (with 6 FXS modules)
-Digium TDM400P (with 4 FXO modules)

Server 'B':
-Dell PowerEdge 430SC
-Intel P4 2.4Ghz (hyperthreading enabled)
-Digium TE110P card
-Digium TDM400P (with 4 FXS modules)
---------------------
Comments:By: Fabricio Machado (fmachado) 2007-05-17 10:12:43

I'm adding some new log files:

while_OK.zip: Output of commands executed on CLI while the system is OK.
while_ERROR.zip: Output of commands executed on CLI while the error happens.

As can you see, while the system is in normal operation, all threads are in 'idle' status. When the problem occur, all threads are showed as 'active'.
Maybe the problem is "find_idle_thread()" function.

I'll appreciate a feedback, because it's occuring every day.

thanks.

By: Christian Benke (christianbee) 2007-05-21 06:01:59

i had the same issue yesterday on asterisk 1.4.4, tried to reproduce it with a high iax-load(100 calls - usually i peak with 30) but i couldn't, seems to happen out of the blue...

By: Nicolas Chapleau (nicchap) 2007-05-23 06:18:22

I am having the same issue on one of our server. I generate volume test scenarios using FastAGI and AMI connections from server A (dual xeon, Sangoma a102D card and drivers), to server B (same config) using IAX2 channels only. The Asterisk process simply stopped, no longer in the process list, so I can't really check the thread state. Will run more tests. Note that I was generating approx. 60 IAX calls simultaneaously, so that means I also had 60 fastAGI connections going. I also use astmanproxy for my AMI connections. Last lines of my messages log:

[May 22 21:48:57] NOTICE[30114] chan_iax2.c: Out of idle IAX2 threads for I/O, pausing!

Asterisk 1.4.4
Zaptel 1.4.2.1
Libpri 1.4.0
Sangoma wanpipe drivers 2.3.4.8 zaptel patched



By: Thiago Garcia (thiagarcia) 2007-05-29 15:25:23

I am having the same issue last month.

By: timrobbins (timrobbins) 2007-06-01 05:17:42

Are the people who are seeing this issue using IAX realtime users/peers?

By: Christian Benke (christianbee) 2007-06-01 06:13:03

i have a postgres-connection configured in res_pgsql.conf and active, but i'm not using any tables in my dialplans/voicemail/... currently.

my setup:

asterisk 1.4.4
zaptel 1.4.2.1(patched with wanpipe)
libpri 1.4.0
WANPIPE Release: 2.3.4-8

By: Nicolas Chapleau (nicchap) 2007-06-01 07:34:44

and NO I'm not using realtime for this.

By: Thiago Garcia (thiagarcia) 2007-06-01 08:19:52

No I'm not using realtime.

By: Russell Bryant (russell) 2007-06-04 18:40:27

This is probably fixed in 1.4 and trunk by revisions 67020 and 67022. If you have any further problems, feel free to reopen this bug or file a new one.

Thanks!

By: Fabricio Machado (fmachado) 2007-06-11 11:21:01

Hi, we're now using SVN 67158 and the issue takes more time (2-3 days) to occur.
To avoid this problem, we increase to 1024 the values of iaxthreadcount and
iaxmaxthreadcount on iax.conf file. To do this, we changed channels/chan_iax2.c too.

By: Russell Bryant (russell) 2007-06-12 14:54:44

Alright, to figure out what is happening here I will need a backtrace of the deadlock.  First, build Asterisk without optimizations:

1) Run "make menuselect"
2) Go down to Compiler Flags
3) Select "DONT_OPTIMIZE" and "DEBUG_THREADS"

Then, when this happens, run the ast_grab_core script in the contrib/scripts directory.  It should automatically get a backtrace from the deadlocked process for you.  Then, upload the backtrace here.

Also, once you have this core dump of the deadlock, don't delete it, as I will likely need more information from it.

By: Fabricio Machado (fmachado) 2007-06-12 15:15:00

We'll rebuild with these options and with default thread limit (256).
Once it happens again, we'll send asked information.

By: Nicolas Chapleau (nicchap) 2007-06-12 15:32:59

Same here

By: Thiago Garcia (thiagarcia) 2007-06-12 15:36:15

Same here

By: Fabricio Machado (fmachado) 2007-06-13 15:16:30

Finally it happens again !
I uploaded backtrace 'gdb_dump.26426.gz'.
Thanks.

By: Nicolas Chapleau (nicchap) 2007-06-14 07:53:13

Running tests with DEBUG_OPTIONS enabled. However I am getting:

WARNING[7543]: chan_iax2.c:3586 iax2_trunk_queue: Maximum trunk data space exceeded to 192.168.0.88:4569

Run ast_grab_core and it generates a 100+ meg file. not sure I want to upload that. Did you still require that file? Meanwhile I'm still waiting for the:

Out of idle IAX2 threads for I/O, pausing to occur.

Update: I got rid of the Max Trunk Space error by setting my my IAX2 connection not as trunks. I am presently hammering the box my approx. 75 simultaneous IAX2 calls from another * box. Waiting to see what happens...

Update 2: 25 000 calls in 2 hours so far, and no errors.

Update 3: OK so it happened. Now, do I really upload a 303 mb backtrace file?



By: Nicolas Chapleau (nicchap) 2007-06-15 14:03:13

SO should I submit a 300+ MB backtrace?

By: Russell Bryant (russell) 2007-06-15 14:37:06

fmachado:  The backtrace you uploaded is what I am looking for, but it still appears that Asterisk was compiled with optimizations.  Please run "make menuselect", go to Compiler Flags, ensure that DONT_OPTIMIZE is selected, and hit 'x' to save and exit.  Then, do make clean, make, make install.  Also, I see in your backtrace that you are using 1.4.4.  There have been a lot of fixed to chan_iax2 since then that haven't yet made it into a release.  Please try with the latest code in the 1.4 branch in svn.

nicchap: The backtrace itself should definitely not be that large.  You are probably referring to the core dump, which is not what we need here.  To get the backtrace, run # gdb /usr/sbin/asterisk /path/to/core.12345.  Then, inside gdb, run (gdb) thread apply all bt, (gbd) thread apply all bt full.  Also, please make sure you're running the code from the 1.4 branch in svn.

By: Fabricio Machado (fmachado) 2007-06-15 14:53:07

Russel, we're using '1.4-branches SVN 67158' because of bug '9937'.
Asterisk is already compiled with DONT_OPTIMIZE selected.

By: Nicolas Chapleau (nicchap) 2007-06-15 15:01:32

Backtrace with version 1.4.4 branch uploaded.

By: Russell Bryant (russell) 2007-06-15 21:48:27

fmachado: Bug 9937 should be fixed now.  Please try the up to date code.

By: Russell Bryant (russell) 2007-06-15 21:50:57

nicchap: I don't see any locked up threads in your backtrace.  Furthermore, it looks like it was from an optimized build.  Please try the latest code from the 1.4 branch, or 1.4.5, as many fixes for problems in chan_iax2 have been made since 1.4.4.

By: Russell Bryant (russell) 2007-06-18 11:40:51

I'm closing this out because I think this is probably fixed in 1.4.5.  Please reopen if you find that you still have a problem with the latest code.

By: Fabricio Machado (fmachado) 2007-06-25 10:55:16

We're now using svn 71128 and the issue still occurs.
I did upload file gdb_dump.19797.gz with latest gdb debug.

By: Russell Bryant (russell) 2007-06-25 14:08:26

This backtrace is not valid.

Would you be willing to let me log in to your machine to debug the problem?  If you could capture a core dump of an instance of Asterisk built with DONT_OPTIMIZE and DEBUG_THREADS, then I can use gdb to analyze that.  You can use the contrib/scripts/ast_grab_core script to get a core dump.

By: Fabricio Machado (fmachado) 2007-06-25 14:22:47

Due our security policies we can't permit this access.
We always built asterisk with DONT_OPTIMIZE and DEBUG_THREADS:
                                                   *************************************
    Asterisk Module Selection
*************************************

Press 'h' for help.

[ ] 1.  DEBUG_CHANNEL_LOCKS
[ ] 2.  DEBUG_SCHEDULER
[*] 3.  DEBUG_THREADS
[ ] 4.  DEBUG_THREADLOCALS
[ ] 5.  DETECT_DEADLOCKS
[ ] 6.  DO_CRASH
[*] 7.  DONT_OPTIMIZE
[ ] 8.  DUMP_SCHEDULER
[ ] 9.  LOW_MEMORY
[ ] 10. MALLOC_DEBUG
[ ] 11. MTX_PROFILE
[ ] 12. RADIO_RELAX
[ ] 13. TRACE_FRAMES
[ ] 14. STATIC_BUILD
[*] 15. LOADABLE_MODULES

... and we used 'ast_grab_core' script to generate backtrace.

By: Russell Bryant (russell) 2007-06-27 16:29:21

Alright, I guess we'll have to try this the hard way ...

Once you have a core dump of a deadlock, look at it using gdb

# gdb /usr/sbin/asterisk /path/to/core.12345

Look for a thread that is stuck in a call to ast_mutex_lock().  You should see something like this:

ASTERISK-8 0xb68e0f16 in __ast_pthread_mutex_lock (filename=0xb6914413 "chan_iax2.c",
   lineno=1245, func=0xb69148dd "find_callno",
   mutex_name=0xb6914883 "&iaxsl[x]", t=0xb69436d4)

Switch to that thread (for this example, thread 25), and switch to the specific stack frame ...

(gdb) thread 25
(gdb) frame 12

Then, print the contents of the lock info structure:

(gdb) print t


Also, it would make things a lot easier to dig through if you reduce the number of IAX2 threads to no more than a dozen or so.  These hundreds of IAX2 threads is making it more difficult ...

By: Fabricio Machado (fmachado) 2007-06-28 15:49:07

Russel we decrease the number of threads to "10" (default value).
I did follow your instructions and upload file "gdb_dump.14786.gz".
Thanks!

By: Russell Bryant (russell) 2007-06-28 16:25:15

I hope you still have this core dump on your machine ...

Can you also give "p t->mutex" ?

By: Fabricio Machado (fmachado) 2007-06-29 07:11:27

Hi Russel, follow "p t->mutex":
---------------
(gdb) p t->mutex
$2 = {__m_reserved = 2, __m_count = 1, __m_owner = 0x39d7, __m_kind = 1,
 __m_lock = {__status = 1, __spinlock = 0}}
---------------

Thanks!

By: Russell Bryant (russell) 2007-06-29 09:40:37

Ok, I'm sorry this is a pain, but now I need:

(gdb) print *t

That's what I meant to ask for in my first set of instructions.  :(

By: Fabricio Machado (fmachado) 2007-06-29 10:39:57

No problem.

(gdb) print *t
$2 = {mutex = {__m_reserved = 2, __m_count = 1, __m_owner = 0x39d7, __m_kind = 1, __m_lock = {__status = 1, __spinlock = 0}}, file = {0xb6a49413 "chan_iax2.c",
   0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, lineno = {8106, 0, 0, 0, 0, 0, 0, 0, 0, 0}, reentrancy = 1, func = {0xb6a4d5c8 "__iax2_poke_noanswer", 0x0,
   0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, thread = {3062279088, 0, 0, 0, 0, 0, 0, 0, 0, 0}}

By: Fabricio Machado (fmachado) 2007-07-02 11:46:46

Russel, I'm trying to upload a full core dump file (1,7 Mb), but it crashes.
There's another way to do that ?
Thanks.

By: Russell Bryant (russell) 2007-07-02 11:58:53

There is no benefit to uploading the core dump itself.  It can't be used on a machine it was not created on.

By: Russell Bryant (russell) 2007-07-03 14:04:14

Is there anyone else who is having a problem with deadlocks in chan_iax2 using the latest version and could provide access to the machine?  I really need access to a core dump where I can use gdb.

Also, fmachado, in your last gdb_commands.txt upload, the line numbers that the backtrace do not match up to the 1.4.5 source, which the backtrace says you are using.  Do you have any modifications to the source?

By: Nicolas Chapleau (nicchap) 2007-07-03 16:45:35

Last time I checked this problem, it was against 1.4.5 and everything ran fine. I'll try to run it again using 1.4.6 and see what happens. It will take a day or 2.

UPDATE: I ran an overnight test with one box dialing another using IAX2, just like before. I ran over 30000 calls with 26 simultaneous calls on the receiving end with no issues. I ran this on 1.4.5.



By: Thiago Garcia (thiagarcia) 2007-07-06 09:28:09

last-debug.gz is a output of gdb "info thread","thread apply all bt","thread apply all bt full"

By: Russell Bryant (russell) 2007-07-06 09:36:54

thiagarcia: Would you be willing to let me log in to the machine to look at the core dump?

By: Thiago Garcia (thiagarcia) 2007-07-10 15:45:10

What's your e-mail address ? So I can send the access information.

By: Russell Bryant (russell) 2007-07-10 17:56:10

russell@digium.com.  Also, if you would like to make it secure, I have a gpg key uploaded to the key server, pgp.mit.edu.  You can encrypt email to me using that.

By: Thiago Garcia (thiagarcia) 2007-07-11 09:00:54

I sent the informations to your email, please see the email with subject 0009614: "Mute" IAX2 Extensions[CORRECT], this email is encrypted.

By: Russell Bryant (russell) 2007-07-11 13:21:15

got it, thanks.  I will report progress here.

By: Digium Subversion (svnbot) 2007-07-11 17:36:47

Repository: asterisk
Revision: 74766

------------------------------------------------------------------------
r74766 | russell | 2007-07-11 17:36:40 -0500 (Wed, 11 Jul 2007) | 5 lines

The function make_trunk() can fail and return -1 instead of a valid new call
number.  Fix the uses of this function to handle this instead of treating it
as the new call number.  This would cause a deadlock and memory corruption.
(possible cause of issue ASTERISK-9333 and others, patch by me)

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-07-11 17:40:20

Repository: asterisk
Revision: 74767

------------------------------------------------------------------------
r74767 | russell | 2007-07-11 17:40:19 -0500 (Wed, 11 Jul 2007) | 13 lines

Merged revisions 74766 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.2

........
r74766 | russell | 2007-07-11 17:53:26 -0500 (Wed, 11 Jul 2007) | 5 lines

The function make_trunk() can fail and return -1 instead of a valid new call
number.  Fix the uses of this function to handle this instead of treating it
as the new call number.  This would cause a deadlock and memory corruption.
(possible cause of issue ASTERISK-9333 and others, patch by me)

........

------------------------------------------------------------------------

By: Digium Subversion (svnbot) 2007-07-11 17:48:46

Repository: asterisk
Revision: 74769

------------------------------------------------------------------------
r74769 | russell | 2007-07-11 17:48:46 -0500 (Wed, 11 Jul 2007) | 21 lines

Merged revisions 74767 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

................
r74767 | russell | 2007-07-11 17:57:07 -0500 (Wed, 11 Jul 2007) | 13 lines

Merged revisions 74766 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.2

........
r74766 | russell | 2007-07-11 17:53:26 -0500 (Wed, 11 Jul 2007) | 5 lines

The function make_trunk() can fail and return -1 instead of a valid new call
number.  Fix the uses of this function to handle this instead of treating it
as the new call number.  This would cause a deadlock and memory corruption.
(possible cause of issue ASTERISK-9333 and others, patch by me)

........

................

------------------------------------------------------------------------

By: Russell Bryant (russell) 2007-07-11 17:50:41

I have fixed a problem which could have been the cause of this issue.  Please update to the latest code in the 1.4 branch and see if the issue still happens.  If so, I'll need access to the new core dumps so I can keep digging.

Thank you very much to thiagarcia for access to the system so I could do some debugging.

By: Thiago Garcia (thiagarcia) 2007-07-17 14:58:45

Since last update the issue did not occur. It's all stable, six days in my `show uptime`.
Thank you very much.

By: Russell Bryant (russell) 2007-07-20 17:46:00

That is *great* news!  Thank you for the update.  I'm going to close this out as fixed.