[Home]

Summary:ASTERISK-14994: [patch] utils.c:938 ast_carefulwrite: Timed out trying to write causes corruption to astdb
Reporter:David Brillert (aragon)Labels:
Date Opened:2009-10-15 10:02:40Date Closed:2011-06-07 14:01:04
Priority:MajorRegression?No
Status:Closed/CompleteComponents:Resources/res_agi
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) 16081agi_debug_16102009.txt
( 1) 16081debug_agidebug_enabled.txt
( 2) agi_writetimeout.rev1.diff
( 3) carefulwrite_location.rev1.diff
( 4) database_show_valgrind.txt
( 5) gdb_asterisk_pid_database_show.txt
( 6) malloc_debug.txt
( 7) threadapplyallbtpidafter..txt
( 8) threadapplyallbtpidbefore.txt
( 9) threadapplyallbtpidduring.txt
(10) valgrind.txt
Description:Agent addmember to local extensions causes the agent to be logged as non-existent channels when ast_carefulwrite times out during login.

[Oct 15 10:40:07] NOTICE[29962]: utils.c:938 ast_carefulwrite: Timed out trying to write
[Oct 15 10:40:07] NOTICE[29962]: utils.c:938 ast_carefulwrite: Timed out trying to write

Here  is an example of the corruption where the same local extension is logged twice.  The "outgoing" agent cannot logoff unless I manually remove from astdb.  The local agent can logoff but since the outgoing cannot calls entering queue fail to reach agents and the local channel cannot log back in.

Local/3345@queuefifteen-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3345@queuefifteen-agent/n (dynamic) (Not in use) has taken 2 calls (last was                                                                 1915 secs ago)


****** ADDITIONAL INFORMATION ******

I don't have any "outgoing" agents they are ALL local.

asterisk*CLI> show queues
customer-queueone has 0 calls (max 3) in 'rrmemory' strategy (0s holdtime), W:0, C:0,                                                                 A:0, SL:0.0% within 60s
  Members: >
     Local/@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
  No Callers
asterisk*CLI>
customer-queuetwo has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W                                                                :0, C:0, A:0, SL:0.0% within 60s
  Members: >
     Local/3356@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3360@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3364@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3329@customer-agent/n (dynamic) (Not in use) has taken no calls yet
  No Callers
asterisk*CLI>
customer-queuethree has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W                                                                :0, C:0, A:0, SL:0.0% within 60s
  Members: >
     Local/3356@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3360@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3364@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3329@customer-agent/n (dynamic) (Not in use) has taken no calls yet
  No Callers

customer-queuefour has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W                                                                :0, C:0, A:0, SL:0.0% within 60s
  Members:
     Local/3356@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3360@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3364@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3329@customer-agent/n (dynamic) (Not in use) has taken no calls yet
  No Callers

default      has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime), W:                                                                0, C:0, A:0, SL:0.0% within 0s
  No Members
  No Callers
asterisk*CLI>
customer-queuefive has 0 calls (max 6) in 'rrmemory' strategy (4s holdtime), W:0, C:5,                                                                 A:2, SL:100.0% within 40s
  Members: >
     Local/5015@customer-agent/n (dynamic) (paused) (In use) has taken 4 calls (last                                                                 was 1777 secs ago)
     Local/5007@customer-agent/n (dynamic) (Not in use) has taken 4 calls (last was                                                                 695 secs ago)
     Local/5010@customer-agent/n (dynamic) (Not in use) has taken 4 calls (last was                                                                 2917 secs ago)
  No Callers
asterisk*CLI>
customer-queuesix has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W                                                                :0, C:0, A:0, SL:0.0% within 60s
  Members: >
     Local/3318@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3331@customer-agent/n (dynamic) (Not in use) has taken no calls yet
     Local/3355@customer-agent/n (dynamic) (Not in use) has taken no calls yet
     Local/3345@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3345@customer-agent/n (dynamic) (Not in use) has taken no calls yet
  No Callers
asterisk*CLI>
customer-queueseven has 0 calls (max 5) in 'rrmemory' strategy (0s holdtime), W:0, C:0,                                                                 A:0, SL:0.0% within 60s
  No Members
  No Callers
asterisk*CLI>
customer-queueeight has 0 calls (max 5) in 'rrmemory' strategy (0s holdtime), W:0, C:0,                                                                 A:0, SL:0.0% within 60s
  Members: >
     Local/3318@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3331@customer-agent/n (dynamic) (Not in use) has taken no calls yet
     Local/3355@customer-agent/n (dynamic) (Not in use) has taken no calls yet
     Local/3345@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3345@customer-agent/n (dynamic) (Not in use) has taken no calls yet
  No Callers
asterisk*CLI>
customer-queuenine     has 0 calls (max unlimited) in 'rrmemory' strategy (7s holdtime), W                                                                :0, C:3, A:0, SL:100.0% within 60s
  Members:
     Local/3322@customer-agent/n (dynamic) (Not in use) has taken no calls yet
     Local/3337@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3303@customer-agent/n (dynamic) (Not in use) has taken no calls yet
     Local/3307@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3320@customer-agent/n (dynamic) (Not in use) has taken 1 calls (last was                                                                 1042 secs ago)
     Local/3311@customer-agent/n (dynamic) (Not in use) has taken 1 calls (last was                                                                 1256 secs ago)
     Local/3332@customer-agent/n (dynamic) (Not in use) has taken 5 calls (last was                                                                 2815 secs ago)
  No Callers

customer-queueten has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W                                                                :0, C:0, A:0, SL:0.0% within 60s
  No Members
  No Callers
asterisk*CLI>
customer-queueeleven has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W                                                                :0, C:0, A:0, SL:0.0% within 60s
  No Members
  No Callers

customer-queuetwelve has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W                                                                :0, C:0, A:0, SL:0.0% within 60s
  Members: >
     Local/3331@customer-agent/n (dynamic) (Not in use) has taken no calls yet
  No Callers

customer-queuethirteen has 0 calls (max unlimited) in 'rrmemory' strategy (23s holdtime),                                                                 W:0, C:1, A:0, SL:0.0% within 60s
  Members: >
     Local/3356@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3360@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3364@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
  No Callers
asterisk*CLI>
customer-queuefourteen has 0 calls (max unlimited) in 'rrmemory' strategy (0s holdtime), W                                                                :0, C:0, A:0, SL:0.0% within 60s
  Members:
     Local/3356@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3360@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3364@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3331@customer-agent/n (dynamic) (Not in use) has taken no calls yet
     Local/@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
  No Callers

customer-queuefifteen      has 0 calls (max 5) in 'rrmemory' strategy (23s holdtime), W:0, C:2                                                                , A:0, SL:50.0% within 40s
  Members: >
     Local/3356@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3360@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3364@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3318@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3331@customer-agent/n (dynamic) (Not in use) has taken 1 calls (last was                                                                 5521 secs ago)
     Local/3355@customer-agent/n (dynamic) (Not in use) has taken no calls yet
     Local/3345@customer-outgoing/n (dynamic) (Invalid) has taken no calls yet
     Local/3345@customer-agent/n (dynamic) (Not in use) has taken 2 calls (last was

Load is not high
Centos 5.2
Two dual core Pentium Xeon 2.8 GHz CPU
2 GB RAM

Problem does not occur in Asterisk 1.4.24.1

I marked severity as block since I have to restart Asterisk several times per day and delete the astdb file before each restart.  Callers fail to reach agents.  I have to downgrade to Asterisk 1.4.24 on each affected site.  This has happened to two installation since upgrading to 1.4.26.x
Comments:By: David Brillert (aragon) 2009-10-15 13:29:13

I think this is related to an earlier ticket of mine that is not fixed.
ASTERISK-13909

We did fix under 1.4.24.1 by reverting r165796 but this appeared fixed in some SVN revision.  The problem is back in 1.4.26.1 when we do not revert r165796

I'd sure like to see this finally fixed before 1.4.27 is officially released.

By: David Brillert (aragon) 2009-10-15 18:32:18

I am currently testing Asterisk SVN 1.4 r223955
My developers analyzed ast_carefulwrite function on main/utils.c from pre-165796. 165796 and current code and latest code have a different version of the function so cannot "easily" revert the patch.

And the notices are still present in 1.4.27rc1
[Oct 15 19:21:54] NOTICE[32659]: utils.c:938 ast_carefulwrite: Timed out trying to write
[Oct 15 19:21:54] NOTICE[32659]: utils.c:938 ast_carefulwrite: Timed out trying to write

I deleted the astdb prior to the upgrade and have all agents logging back into the queues.  I'll have to monitor the queues with show queues to see how long it takes until the corruption returns.



By: David Brillert (aragon) 2009-10-15 18:37:59

Current symptoms in SVN 1.4 r223955 are exactly the same as my old ticket ASTERISK-13909 but it is no longer possible to revert r165796

  -- Goto (zap-incoming,7434,21)
   -- Executing [7434@zap-incoming:21] Set("Zap/4-1", "CHANNEL(language)=en") in new stack
   -- Executing [7434@zap-incoming:22] Gosub("Zap/4-1", "scm-aa-main|s|1") in new stack
   -- Executing [s@scm-aa-main:1] Answer("Zap/4-1", "") in new stack
   -- Executing [s@scm-aa-main:2] AGI("Zap/4-1", "/var/www/scopserv/telephony/scripts/agi/ivr.php|998") in new stack
   -- Launched AGI Script /var/www/scopserv/telephony/scripts/agi/ivr.php
[Oct 15 19:21:54] NOTICE[32659]: utils.c:938 ast_carefulwrite: Timed out trying to write
[Oct 15 19:21:54] NOTICE[32659]: utils.c:938 ast_carefulwrite: Timed out trying to write

By: David Brillert (aragon) 2009-10-16 08:22:50

The timeout definitely occurs to the AGI scripts processing.

It happens to most calls that reach an auto attendant (IVR) menu or when agents pause in queue.

I'll post some agi debug output in a little while.

By: David Brillert (aragon) 2009-10-16 09:05:14

I didn't set the category correctly.
This bug effects IVR and queues and appears to fall under AGI.
However r165796 falls under Core/ManagerInterface/utils.c and reverting that change in my 1.4.24 installations fixed my problems in pre 1.4.26 installations (using custom 1.4.24.1 rpms with r165796 reverted)

Debug info uploaded as 16081agi_debug_16102009.txt

I reproduced the carefulwrite notices by calling into the PRI span and the call was accepted on PRI Zap 49.

The interesting bit in the log prior to the carewrite notice is this (I don't have any agents logged to zap channels; all of the agents are local SIP channels):

   -- Accepting call from '5555551234' to '4350' on channel 0/1, span 3
[Oct 16 09:27:28] DEBUG[21737]: chan_dahdi.c:1784 dahdi_enable_ec: Enabled echo cancellation on channel 49
[Oct 16 09:27:28] DEBUG[27764]: manager.c:2233 process_message: Manager received command 'Ping'
[Oct 16 09:27:28] DEBUG[11247]: pbx.c:1843 pbx_extension_helper: Launching 'Answer'
   -- Executing [4350@zap-incoming:1] Answer("Zap/49-1", "") in new stack
[Oct 16 09:27:28] DEBUG[27764]: manager.c:2233 process_message: Manager received command 'Command'
[Oct 16 09:27:28] DEBUG[11247]: devicestate.c:302 ast_device_state_changed_literal: Notification of state change to be queued on device/channel Zap/49
[Oct 16 09:27:28] DEBUG[21730]: channel.c:1133 channel_find_locked: Avoiding initial deadlock for channel '0x9750fa8'
[Oct 16 09:27:28] DEBUG[21730]: channel.c:1133 channel_find_locked: Avoiding initial deadlock for channel '0x9750fa8'
[Oct 16 09:27:28] DEBUG[11247]: chan_dahdi.c:5474 dahdi_indicate: Requested indication -1 on channel Zap/49-1
[Oct 16 09:27:28] DEBUG[21730]: channel.c:1133 channel_find_locked: Avoiding initial deadlock for channel '0x9750fa8'
[Oct 16 09:27:28] DEBUG[11247]: pbx.c:1843 pbx_extension_helper: Launching 'Set'
   -- Executing [4350@zap-incoming:2] Set("Zap/49-1", "CHANNEL(musicclass)=default") in new stack
[Oct 16 09:27:28] DEBUG[21730]: devicestate.c:287 do_state_change: Changing state for Zap/49 - state 2 (In use)
[Oct 16 09:27:28] DEBUG[21779]: app_queue.c:682 handle_statechange: Device 'Zap/49' changed to state '2' (In use) but we don't care because they're not a member of any queue.



By: David Brillert (aragon) 2009-10-18 14:02:41

This problem is easy to reproduce on an inbound call via PRI to my IVR menu.
So I took three backtraces with gdb asterisk --pid='astersk' and I'm uploading to this bug report.

The back traces required three different test calls with gdb snapshots taken during each phase of a call.

Here is the call description:
-Call main number which rings direct to an IVR menu.
-PRI channel answers and 99.99999% of time there will be ast_carefulwrite error before I hear any audio.  Audio is delayed on channel until the ast_carefulwrite errors disappear.

One backtrace before seeing ast_carefulwrite appears
= threadapplyallbtpidbefore.txt

One backtrace during/between ast_carefulwrite messages
= threadapplyallbtpidduring.txt

One backtrace after ast_carefulwrite appears
= threadapplyallbtpidafter..txt

The good stuff appears in threadapplyallbtpidduring.txt



By: David Brillert (aragon) 2009-10-18 14:54:35

Since there were out of bounds errors in threadapplyallbtpidduring.txt I ran Asterisk in Valgrind for one reproduced call.
Attaching malloc_debug.txt and valgrind.txt

By: David Brillert (aragon) 2009-10-27 15:24:52

This bug is getting cold.
Anything else needed from me?
I think I have provided good debug info and identified root cause.
Getting this fixed is outside of my skillset.
I can test patches.

By: David Brillert (aragon) 2009-11-02 11:44:09.000-0600

I'm not sure what is happening to this report?
It was marked as a 1.4.27 blocker but now it isn't.
Appears that the related ticket ASTERISK-13909 which is not fixed made it into the roadmap for 1.4.27 instead...

By: Russell Bryant (russell) 2009-11-05 17:28:17.000-0600

You noted that the message appears to be related to AGI.  That is certainly plausible, since that is one of the few places where ast_carefulwrite() is used.  However, to be sure, carefulwrite_location.rev1.diff is a patch that will update the message to tell us what write specifically is resulting in this message.

I do not see any usage of this function that is directly related to the astdb, so to me it looks like this message and the queue problem are likely not related.

Another thing you can experiment with is increasing the write timeout in res_agi, as demonstrated in agi_writetimeout.rev1.diff.  However, any delay while write is blocking is going to have the side effect that you observed where no audio is being processed.  This could be caused by high load on the system, or a problem in the AGI application itself.

By: David Brillert (aragon) 2009-11-05 21:49:44.000-0600

Thanks Russell
I have to test carefulwrite_location.rev1.diff after hours so I will do so on Friday night.

We have tested increasing the timeout with no success in the past.

By: David Brillert (aragon) 2009-11-06 21:36:09.000-0600

Russell, here is the debug info

AGI("Zap/21-1", "/var/www/scopserv/telephony/scripts/agi/ivr.php|2944") in new stack
   -- Launched AGI Script /var/www/scopserv/telephony/scripts/agi/ivr.php
[Nov  6 22:28:24] NOTICE[25524]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 152 of agi_debug_cli in file res_agi.c
[Nov  6 22:28:24] NOTICE[25524]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 152 of agi_debug_cli in file res_agi.c
   -- AGI Script /var/www/scopserv/telephony/scripts/agi/ivr.php completed, returning 0

By: Joel Vandal (jvandal) 2009-11-07 07:57:10.000-0600

Can it be a "timeout" caused by AGI script when it start ? I'm currently converting most of our AGI to use FastAGI in order to have a "persistant" process instead of spawn a new process on each AGI call, maybe the problem is that the AGI take some time to "initialize" ?

By: Leif Madsen (lmadsen) 2009-11-13 08:46:49.000-0600

I'm relating this to 15609 based on the latest notes there.

By: David Brillert (aragon) 2009-11-20 12:28:29.000-0600

More debug info from CLI here:

   -- Executing [1625@default-local-devices:4] Dial("Local/1625@default-agent-83af,2", "SIP/1625||twk") in new stack
   -- Called 1625
   -- SIP/1625-0000e3b4 answered Local/1625@default-agent-83af,2
   -- Local/1625@default-agent-83af,1 answered SIP/siptrunk-0000e3b3
   -- Stopped music on hold on SIP/siptrunk-0000e3b3
[Nov 20 13:19:50] NOTICE[11600]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 546 of astman_append in file manager.c

By: David Brillert (aragon) 2009-11-20 14:39:57.000-0600

I noticed the carefulwrite NOTICES are triggered when we do DBGet/DBPut to update AstDB
So with *ngrep -d any port 5038 |grep Command* I watch and during

 Action: Command..Command: database get "default/1636/Agent" "Login"....
 Response: Follows..Privilege: Command..
 Action: Command..Command: database get "default/1636/Agent" "Paused"....

Causes this notice to appear:
[Nov 20 15:29:39] NOTICE[2210]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 546 of astman_append in file manager.c


Excellent.

I can now reproduce the NOTICES with * CLI command "database show"
Resulting in:

[Nov 20 15:23:39] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:39] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:39] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:39] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:39] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:40] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:40] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:40] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:40] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:40] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:40] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:23:40] NOTICE[32560]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c

By: David Brillert (aragon) 2009-11-20 14:52:49.000-0600

To reproduce on our busy system is now easy.
* CLI do
database show
core show locks

core show locks
fireworx*CLI>
=======================================================================
=== Currently Held Locks ==============================================
=======================================================================
===
=== <file> <line num> <function> <lock name> <lock addr> (times locked)
===
=== Thread ID: -1251673200 (netconsole           started at [ 1036] asterisk.c listener())
=== ---> Lock #0 (db.c): MUTEX 325 database_show &dblock 0x8180300 (1)
=== -------------------------------------------------------------------
===
=== Thread ID: -1252164720 (pbx_thread           started at [ 2646] pbx.c ast_pbx_start())
=== ---> Waiting for Lock #0 (db.c): MUTEX 178 ast_db_get &dblock 0x8180300 (1)
=== --- ---> Locked Here: db.c line 325 (database_show)
=== -------------------------------------------------------------------
===
=======================================================================

[Nov 20 15:42:50] NOTICE[4422]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:42:50] NOTICE[4422]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c
[Nov 20 15:42:50] NOTICE[4422]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 73 of ast_cli in file cli.c

But a lot harder to capture the manager.c NOTICE core show locks output!
Timimg that command would be like typing a needle in a haystack...

By: David Brillert (aragon) 2009-11-20 15:26:26.000-0600

valgrind and gdb pid files uploaded

By: David Brillert (aragon) 2009-11-20 15:39:32.000-0600

BTW you dont have to type core show locks to reproduce but it does show the locking after the database show.

By: David Brillert (aragon) 2009-11-25 13:01:47.000-0600

russell: increasing the ast_carefulwrite timeout locks a bunch of stuff in utils.c and makes things worse.

[Nov 25 13:42:17] ERROR[25487] utils.c: Thread 'pbx_thread           started at [ 2643] pbx.c ast_pbx_start()' still has a lock! - 'q' (0xa5955b0) from 'remove_from_queue' in app_queue.c:3426!
[Nov 25 13:42:17] ERROR[25487] utils.c: Thread 'pbx_thread           started at [ 2643] pbx.c ast_pbx_start()' still has a lock! - 'q' (0xa594a28) from 'remove_from_queue' in app_queue.c:3426!

By: David Brillert (aragon) 2009-11-25 14:49:42.000-0600

I am pretty certain that locking of db.c during some manager commands causes the error since I have timed the lock and it lasts on average 7 seconds.

By: destiny6628 (destiny6628) 2009-12-14 01:53:00.000-0600

Hi i have been also facing this bug for a quite long time now and this normally comes after some period of time when traffic (no. of channels) on the asterisk starts crossing beyond 85 etc.

I have been running server with asterisk version of 1.4.27 and error which keeps flowing on asterisk mentioned below

-- Got SIP response 503 "Service Unavailable" back from 174.46.235.190 [Dec 13 06:05:42] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c
   -- Got SIP response 500 "Service Unavailable" back from 174.46.235.190 [Dec 13 06:05:42] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c [Dec 13 06:05:42] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c [Dec 13 06:05:42] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c
   -- AMD: ANSWERING MACHINE: silenceDuration:2500 initialSilence:2500
   -- Executing [8380@default:2] AGI("SIP/kurtn-099eaab0", "VD_amd|192.168.1.236|192.168.1.235") in new stack [Dec 13 06:05:42] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c [Dec 13 06:05:43] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c [Dec 13 06:05:43] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c [Dec 13 06:05:43] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c [Dec 13 06:05:43] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c [Dec 13 06:05:43] NOTICE[2407]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 545 of astman_append in file manager.c
   -- Launched AGI Script /var/lib/asterisk/agi-bin/VD_amd
   -- Executing [8380@default:1] AMD("SIP/kurtn-0980b378", "") in new stack
   -- AMD: SIP/kurtn-0980b378 7223125012 (null) (Fmt: 256)
   -- AMD: initialSilence [2500] greeting [2500] afterGreetingSilence [800] totalAnalysisTime [5000] minimumWordLength [100] betweenWordsSilence [50] maximumNumberOfWords [3] silenceThreshold [256]
   -- AMD: Changed state to STATE_IN_SILENCE
   -- Executing [8380@default:1] AMD("SIP/kurtn-097a6e68", "") in new stack
   -- AMD: SIP/kurtn-097a6e68 7223125012 (null) (Fmt: 256)
   -- Executing [8380@default:1] AMD("SIP/kurtn-09840de0", "") in new stack
   -- AMD: initialSilence [2500] greeting [2500] afterGreetingSilence [800] totalAnalysisTime [5000] minimumWordLength [100] betweenWordsSilence [50] maximumNumberOfWords [3] silenceThreshold [256]
   -- AMD: SIP/kurtn-09840de0 7223125012 (null) (Fmt: 256)
   -- AMD: initialSilence [2500] greeting [2500] afterGreetingSilence [800] totalAnalysisTime [5000] minimumWordLength [100] betweenWordsSilence [50] maximumNumberOfWords [3] silenceThreshold [256]
   -- AMD: Changed state to STATE_IN_SILENCE
 == Spawn extension (default, 8380, 2) exited non-zero on 'SIP/kurtn-099eaab0'
   -- Executing [h@default:1] DeadAGI("SIP/kurtn-099eaab0", "hangup|192.168.1.236|192.168.1.235|16||n") in new stack
   -- Launched AGI Script /var/lib/asterisk/agi-bin/hangup
   -- AGI Script hangup completed, returning 0
   -- Executing [h@default:2] DeadAGI("SIP/kurtn-099eaab0", "agi://192.168.1.235:2222/AGI?Hangup") in new stack
   -- AGI Script agi://192.168.1.235:2222/AGI?Hangup completed, returning 0
   -- AMD: Changed state to STATE_IN_SILENCE localhost*CLI> exit



Any suggestions/help would be highly appreciated.

By: destiny6628 (destiny6628) 2009-12-14 01:58:31.000-0600

When this error starts floating on asterisk then call connectivity becomes poorer and at that time we have to kill asterisk and start asterisk and then things becomes better.

By: destiny6628 (destiny6628) 2009-12-15 02:25:49.000-0600

In case any logs / debug is required do let me know.
will get all the logs etc.

By: destiny6628 (destiny6628) 2009-12-15 14:10:48.000-0600

I would appreciate any suggestions which minimize this problem.
This is a big time problem which is hitting my clients as their asterisk requires to be killed quite a couple of times during the day loosing productions.

By: David Brillert (aragon) 2009-12-15 15:32:42.000-0600

I beleive the problem is in the background application

[Dec 15 16:27:30] WARNING[17320] file.c: Failed to write frame
[Dec 15 16:27:30] DEBUG[17320] channel.c: Set channel Zap/17-1 to write format ulaw
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Extension ΓΏ, priority 0 returned normally even though call was hung up
[Dec 15 16:27:30] DEBUG[17320] channel.c: Soft-Hanging up channel 'Zap/17-1'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Launching 'Goto'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Expression result is '0'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Expression result is '2'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Launching 'GotoIf'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Launching 'ResetCDR'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '"private" <5555555555>'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '5555555555'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '4350'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'zap-incoming'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'Zap/17-1'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '(null)'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'BackGround'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'custom/0002'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '2009-12-15 16:27:26'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '2009-12-15 16:27:26'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '2009-12-15 16:27:30'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '4'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '4'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'ANSWERED'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'DOCUMENTATION'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '_NONE_'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '1260912446.4587'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '4350'
[Dec 15 16:27:30] DEBUG[17320] cdr_addon_mysql.c: cdr_mysql: inserting a CDR record.
[Dec 15 16:27:30] DEBUG[17320] cdr_addon_mysql.c: cdr_mysql: SQL command as follows: INSERT INTO cdr_pre (calldate,clid,src,dst,dcontext,channel,dstchannel,lastapp,lastdata,duration,billsec,disposition,amaflags,accountcode,uniqueid,userfield) VALUES ('2009-12-15 16:27:26','\"private\" <5555555555>','5555555555','4350','zap-incoming', 'Zap/17-1','','BackGround','custom/0002',4,4,'ANSWERED',3,'_NONE_','1260912446.4587','4350')
[Dec 15 16:27:30] DEBUG[17320] cdr_radius.c: Unable to create RADIUS record. CDR not recorded!
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Launching 'NoCDR'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Launching 'System'

By: David Brillert (aragon) 2009-12-15 15:49:32.000-0600

Specifically when background is executed by agi

By: destiny6628 (destiny6628) 2009-12-16 12:48:29.000-0600

Earlier with previous versions of asterisk, i do not think so we use to have this problem.
This has started coming above version 1.4.21.1 and have been affecting a performance a lot.

By: David Brillert (aragon) 2009-12-16 13:45:00.000-0600

There appears to be a processing delay when function result is background and we try to write to mysql table with cdr_addon_mysql.c:
The delay occurs when the mysql table is large or corrupt or being optimized.

[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'BackGround'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'custom/0002'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '2009-12-15 16:27:26'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '2009-12-15 16:27:26'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '2009-12-15 16:27:30'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '4'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '4'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'ANSWERED'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is 'DOCUMENTATION'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '_NONE_'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '1260912446.4587'
[Dec 15 16:27:30] DEBUG[17320] pbx.c: Function result is '4350'
[Dec 15 16:27:30] DEBUG[17320] cdr_addon_mysql.c: cdr_mysql: inserting a CDR record.
[Dec 15 16:27:30] DEBUG[17320] cdr_addon_mysql.c: cdr_mysql: SQL command as follows: INSERT INTO cdr_pre

By: destiny6628 (destiny6628) 2009-12-17 07:56:26.000-0600

Here we are using mysql as database and all the agi's running at the backend do not get hang or remain in use for a long period of time.

We normally check them through ps -A command and did not find any stuck there is ps -A

So how shall we get rid of this ?

By: destiny6628 (destiny6628) 2009-12-18 10:17:45.000-0600

Russel: Can you please guide as where issue lies and how shall it gets resolved

By: David Brillert (aragon) 2009-12-22 09:47:58.000-0600

CURRENT SUMMARY:
I have tried disabling cdr altogether and this has no effect on the  NOTICE[22022]: utils.c:938 __ast_carefulwrite: Timed out trying to write at line 152 of agi_debug_cli in file res_agi.c
The caller hears no audio from the IVR until the NOTICES are finished.

I still see this on every call I make to IVR on this system. The mysql table is optimized and the CPU load is normal. The NOTICE appears before any background prompt is played back and before any hangup, timeout, or digit is pressed. But exactly after agi/ivr.php script is launched and only if background exists in context.

I'm uploading a full agi debug with logger.conf debug to console enabled

I previously posted threadapplyallbtduring.txt when the condition occurred and one of the agi scripts returned a result success and then a bunch of out of bound and memory corruption errors and then a write to logger.c which I assume is what triggered the notices to be written.  A subsequent valgrind capture printed some references to cdr_addon_mysql.so.  Since the ivr.php script posts some data to mysql cdr it may be reasonable to assume that the memory corruption is in cdr_addon_mysql.so

By: David Brillert (aragon) 2009-12-22 13:27:04.000-0600

Partial part of the code for IVR AGI Script /var/www/scopserv/telephony/scripts/agi/ivr.php

#!/usr/bin/php -q
<?php

ob_implicit_flush(false);
set_time_limit(5);

require('lib/phpagi.php');

// Initialise AGI library
$agi = new AGI();

// Get AGI variable (ex. agi_channel, agi_callerid, agi_dnid, etc);
// ...

// Connect to SQL database
mysql_connect( ... );

// Generate SQL query
$sql = "INSERT INTO ivr SET ...";

// Execute the SQL query
mysql_query($sql);

?>



By: David Brillert (aragon) 2010-01-15 08:32:04.000-0600

Anything else needed from me to get resolving this issue back on track?
It is still service affecting.

By: Leif Madsen (lmadsen) 2010-01-15 13:32:50.000-0600

If there is nothing requiring feedback, then just development resources are required at this point.

By: David Brillert (aragon) 2010-06-18 13:43:44

It has been a long time since I updated this report.
This is still an issue with agi
I believe the cause is the agi timeout that waits a finite amount of time for data and then passes whatever data it receives prior to timeout to astdb.
This is a really big problem especially for agent login data since often times not all agent login data is received in time.
The only workaround is to use fastagi and store data in a mysql database instead of astdb file.
My assumption is that the delay is caused by the startup time required to launch the agi application for each call.  This assumption is supported by the reasoning that the fastagi acts as a service therefore the timeout is not as service affecting.
There must be a more reliable method to use agi + astdb without using the timeout mechanism???

By: David Brillert (aragon) 2010-09-27 14:37:28

Fastagi has been working instead of agi for months to fix the reported problem.
This bug report can be closed out.

By: Leif Madsen (lmadsen) 2010-10-04 12:26:24

Fix is to use FastAGI