ASTERISK-11905: handle_response_peerpoke floods asterisk cli with notices, cli crashes

[Home]

Summary: ASTERISK-11905: handle_response_peerpoke floods asterisk cli with notices, cli crashes

Reporter: Kamil Czajko (kactus) Labels:

Date Opened: 2008-04-23 22:35:43 Date Closed: 2011-06-07 14:00:36

Priority: Critical Regression? No

Status: Closed/Complete Components: Channels/chan_sip/Registration

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 10_seconds_of_sip_debug.txt
( 1) 20080702__bug12508__2.diff.txt
( 2) 20080702__bug12508.diff.txt
( 3) backtrace24062008.txt
( 4) debug_24_04_2008.txt
( 5) held_locks_prior_to_DB_spam.txt
( 6) malloc_debug.svn.20080702.txt
( 7) malloc_debug.svn.20080702number2_-_crash.txt
( 8) malloc_debug.txt
( 9) malloc_debugsvnpatch1.txt
(10) malloc_debugsvnr126681patch2.txt
(11) malloc_debugsvnr127662Mpatch2.txt
(12) sip_behaviour_1.4.txt
(13) valgrind.1.4.21.1.20080702.txt
(14) valgrind.svn.20080702.txt
(15) valgrind.svn.20080702number2_-_Crash.txt
(16) valgrind.txt.pid71048
(17) valgrind_-_1.4.txt
(18) valgrindsvnpatch1.txt
(19) valgrindsvnr126681patch2.txt
(20) valgrindsvnr127662Mpatch2.txt

Description: Since upgrading asterisk to svn later than 1.6 beta 8 cli is flooded with

NOTICE[7435]: chan_sip.c:16045 handle_response_peerpoke: Peer 'PEERNAME' is now Reachable. (71ms / 2000ms)

several times a second. Did not happen with svn releases around the time of beta 6 and 7 aprox around april 17 (sorry didn't keep the old revision numbers) or earlier revisions.

disconnecting phone or rebooting causes:
[Apr 24 11:22:31] NOTICE[7435]: chan_sip.c:19841 sip_poke_noanswer: Peer 'DRA-997896-0043' is now UNREACHABLE!

being spammed in the same fashion.

after running it for a while we end up seeing

[root@voip02 asterisk]# /usr/sbin/safe_asterisk: line 117: 7427 Segmentation fault (core dumped) nice -n $PRIORITY ${ASTSBINDIR}/asterisk -f ${CLIARGS} ${ASTARGS} >&/dev/${TTY} </dev/${TTY}
Asterisk ended with exit status 139
Asterisk exited on signal 11.
Automatically restarting Asterisk.
mpg123: no process killed

I'll need to recompile for "don't optimise" so I haven't attached a core debug, let me know if you need it. I will attached the sip info with sip debug as well.

****** ADDITIONAL INFORMATION ******

Its just a centos box updated from a previous svn that worked fine

get the same behaviour in both

Asterisk SVN-trunk-r114577M,
and
Asterisk SVN-trunk-r114601

Cheers
Kamil

Comments: By: Kamil Czajko (kactus) 2008-04-23 22:43:22

just grabbing beta 8 to see if the issue is present there also
By: Kamil Czajko (kactus) 2008-04-23 22:57:46

issue does not appear to occur in beta 8
By: Russell Bryant (russell) 2008-04-24 10:13:15

Ok, so the issue was introduced in trunk. We created a 1.6.0 branch a while ago, so trunk is getting a lot more changes than 1.6.0 is getting right now.

Can you check to see if going back to a trunk revision before 114190 fixes the issue?
By: Kamil Czajko (kactus) 2008-04-24 21:39:37

Hi Today is a public holiday down here in Australia so I'll check for you on Monday.

Cheers - Kamil
By: Kamil Czajko (kactus) 2008-04-28 21:55:10

Hi

Ran svn 114189 and can confirm that the issue does not occur.

All the best
By: Kamil Czajko (kactus) 2008-04-28 22:06:20

Hi

Further testing seemed to show that revision 114189 caused oneway speech with one of our providers so I have reverted to 1.6beta8

I belive it probably has something to do with packet 2 packet bridging and it would be great to be able to disable this as it does it even when canreinvite is set to no and nat is yes

Cheers
By: Kamil Czajko (kactus) 2008-05-21 00:48:47

Hmm not much activity here.

Still occurs with SVN Asterisk SVN-trunk-r117367M, just FYI
By: Kamil Czajko (kactus) 2008-05-21 01:16:24

Also just FYI

does not occur in the beta9

Hope that helps
By: Tilghman Lesher (tilghman) 2008-06-20 18:35:08

When this crashes, can you get me a backtrace, as specified in doc/backtrace.txt?
By: Kamil Czajko (kactus) 2008-06-23 20:19:22

Hello Corydon76

I've grabbed the latest trunk svn I'll need to let it run as it sometimes takes over night before it dies though sometimes it was within 15 min. Generally the more I tried to do with it the quicker it died.

It outputs the message anywhere from 2 to 11 times a second. I'd probably say its a memory leak as I timed it and the box lost 10 MB of free memory in 6 min but I couldn't see that any process increased by a coresponding amount (I might have missed it though) Load averages seem to be floating between 0.12 and 0.30 with one phone registered, baseline with asterisk off is 0.01.

Of interest as I am writing this it has suddenly stopped. Though registering a second extension starts the process for that extension. Try to call between the two however and I receive an error of "[Jun 24 08:55:20] ERROR[43224]: res_config_odbc.c:140 realtime_odbc: No Database handle available with the name of 'DBNAME' (check res_odbc.conf)" which kicks off 5 times.

Checking ODBC show all shows that 5 out of 5 connections are in use (using ODBC to MSSQL) and rebooting the phone to see what happens throws about 3 dozen No Database Handle errors and then it dies dumping the core.

Disconnected from Asterisk server
Executing last minute cleanups
Asterisk ending (0).

Version is Asterisk SVN-trunk-r124744M

I'll upload the core dump momentarily.
By: Kamil Czajko (kactus) 2008-06-24 00:55:39

Just to Clarify backtrace24062008.txt is the backtrace / core dump I just realised I wasn't clear.
By: Tilghman Lesher (tilghman) 2008-06-24 17:45:11

Okay, this backtrace suggests that you have corrupted memory somewhere, so you're going to need to run this under Valgrind, as specified in doc/valgrind.txt.
By: Kamil Czajko (kactus) 2008-06-25 01:16:41

Hello Corydon

Ok, after finally getting valgrind 2.1.2 working on freebsd it crashes as soon as I try and start asterisk. Now I don't know if this is because of valgrind on freebsd or anything else but I'll upload the logs in a moment.

Please note that the malloc_debug attached is generated without valgrind as it only contained the single word "asterisk" when running valgrind.

I had foolishly (in hind sight since we are trying to troubleshoot an issue) updated to Asterisk SVN-trunk-r124912M before I ran any tests so this might be causing the differing behaviour as

A) connecting to the SIP no longer spams the screen with warnings or messages which probably means one of the programmers fixed the bug in svn (yay)

b) we are still getting the issue everytime a sip connects or disconnects it sets one of the connections to "in use" which then causes the error:
[Jun 25 13:43:36] ERROR[71029]: res_config_odbc.c:140 realtime_odbc: No database handle available with the name of 'DBNAME'(check res_odbc.conf)
once they are all in use. A reregister locks 2 db connections.

Well at least it looks like progress though I don't know if you want me to try and roll back, run a different debugging tool or treat the first issue as resolved in SVN-trunk-r124912M with the issue b) above a continuation of the problem.

all the best - Kactus
By: Tilghman Lesher (tilghman) 2008-06-25 15:32:59

Unfortunately, I don't know how to help you on the FreeBSD platform. Is there a way that you could use Linux with Asterisk?
By: Kamil Czajko (kactus) 2008-06-25 18:49:23

Hello Corydon

I don't have a spare physical box atm, however I'm installing asterisk on centos 5.1 within a xen enterprise environment. This was initially to test asterisks suitability within xen, but if the environment causes any issues I'll push for an addition dev box.

I'll let you know how it all goes in a few hours.

Cheers - Kactus
By: Kamil Czajko (kactus) 2008-06-30 03:17:31

Hello Corydon

Something of interest, I built the centos system on xen to test a few different scenarios and just for the heck of it grabbed asterisk 1.4.21 (release) and low and behold we experience near the identical issue.

I'm going to recompile with do not optimize and run it with Valgrind for you tommorow as its getting a bit late here. Let me know if there is anything else you want me to run with it at the same time.

Of interest now is that the Snom phone I am using ends up now running out of memory and becomes unresponsive after registration. Asterisk uses 98% of CPU and its mem usage ticks up constantly. Kind of suprised this is happening in the release on a physically different box so it may be something to do with odbc, snoms, tds etc that is causing it other wise I'd imagine you would be getting a flood of error users running through.

Cheers -Kactus
By: Kamil Czajko (kactus) 2008-06-30 23:35:34

Hi guys, lets see what you can make of this.

Ran the pbx with version 1.4.21 under valgrind using the standard commands as in your valgrind doc. I then connected a SPA962 and started recieving the same behaviour as before. I then joined the Snom 190 just so that it might die a bit quicker. Examples of the behavior exprerienced is in "sip behaviour 1.4.txt"

It ran for a considerable length of time, much slower than without valgrind, until it eventually stopped spamming the console. I was able to connect via asterisk -r from another session but phones where unable to connect at that point. CPU usage was maxed out and memory was quite high. I turned on sip debug and 10 seconds of the out put are included above in "10 seconds of sip debug.txt"

It did not crash hard, mearly stopped accepting any connections from phones, so I killed the process using -11 to force a core dump. I ended up with a valgrind.core but could not backtrace it due to getting this error:

This GDB was configured as "x86_64-redhat-linux-gnu"..."/etc/asterisk/debugging/valgrind.txt.core.2821": not in executable format: File format not recognized

(gdb) bt
No stack.

I've included the valgrind.txt as "valgrind - 1.4.txt" but did not upload the malloc as it was an empty file.

It took about 4 hours to get to this state (unresponsive) under valgrind.

Please let me know what information of debugging you need.

I tested asterisk 1.6 beta9 under this environment to confirm that xen was not causing any issues.

All the best
By: Tilghman Lesher (tilghman) 2008-07-01 08:18:39

kactus: I think you may have run into the locking issue that caused the release of 1.4.21.1. Could you upgrade to that release and re-run your test on Xen?
By: Kamil Czajko (kactus) 2008-07-01 21:51:34

Hi Corydon I've just put 1.4.21.1 on one of the xen virtuals and still get the chan_sip.c:12669 handle_response_peerpoke: Peer 'DRA-997896-0041' is now Reachable. (49ms / 2000ms) spamming the screen

of interest even after it runs out of physical mem, and I disconnect from the ssh session it hammering the network at 500kbs. The snom 190 displays a low mem message and then reboots and is unable to register after it comes back up (same as with the previous version).

I also have an asterisk SVN-trunk-r126681 running through valgrind server that has a spa962 trying to connect to it that I left over night, it no longer takes sip registrations, CPU usage has dropped off but asterisk (well memcheck) is using 780MB of mem.

Let me know if you want me to kill this to try and generate a core dump. I double checked menuselect and realised I hadn't specified debug malloc on this virtual (the svn one) :( so more than likely I'll need to re run it over night again if you still think that it is a memory issue.

In any case let me know what you want me to do next, if you want me to run different versions etc or if you need specific compiler flags such as debug threads etc. The new xen server is beefy enough so, for all intents and purposes, I'm not limited by the number of boxes I can clone and configure.

Please let me know if it is worthwile running valgrind as without it I can get the boxes to a point of no longer accepting connections much faster.

Thanks - Kamil
By: Tilghman Lesher (tilghman) 2008-07-01 22:02:22

kactus: Killing Asterisk is fine. I do not need a core dump from either of these, so don't worry about that. I'd like to see just the valgrind output. Hopefully it will contain something helpful this time around.
By: Tilghman Lesher (tilghman) 2008-07-01 22:07:16

Also: MALLOC_DEBUG enables some CLI commands to show memory allocations. This may help to track down just what is using all that memory. If a deadlock is responsible for the slowdown, then compiling with DEBUG_THREADS and running 'core show locks' from the command line may show that information. Note that a deadlock might cause the memory usage to spike even if there isn't a true memory leak, if threads are stacking up (because each thread allocates just under 256k apiece for stack).
By: Kamil Czajko (kactus) 2008-07-02 02:51:48

Hi Corydon

The SVN-trunk-r126681 box was the first to stop spamming the screen with output. core show locks shows nothing locked. Rebooting the phone with sip debugging on shows that it is still recieving sip registers from the phone but it throws an error that there is "No database handle available with the name DBNAME"

odbc show all reports that all 5 connections are "in use".

Top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30425 root 15 0 700m 133m 4236 S 99.9 52.2 88:32.90 memcheck

The asterisk 1.4.21.1 looks to be not responding at all but sip set debug shows that it is constantly reliably transmitting nat CSeq 102 options

It however only has one db connection in use but the phone is unable to register (reports Network Failure and timeout in the logs) and sip trace just shows it send and recieve 102 options.

The 1.4 box did however have these locks:

=== Thread ID: 147204416 (do_monitor started at [15834] chan_sip.c restart_monitor())
=== ---> Lock #0 (chan_sip.c): MUTEX 15778 do_monitor &monlock 0x8a1e740 (1)
=== ---> Lock #1 (sched.c): MUTEX 223 ast_sched_add_variable &con->lock 0x6ced22c (1)

let me know if you want me to give you anything else or if stopping asterisk now to get you the valgrind output is sufficient.

Tried to catch you on the #asterisk-bugs but looks like you weren't reading it. I'll be hanging around for another 45 min so if you would like me to test anything for you before tommorow, please let me know.

Cheers - Kactus
By: Kamil Czajko (kactus) 2008-07-02 03:29:25

Hi just a quick update

I noticed that the cpu usage on the 1.4 box seems to have decreased some what, its only sitting at around 50% now. Core show locks shows no locks at present.

As a result the phone is able to register. It attempts to call but looks like it is proceeding very slowly, ssh takes a fair few seconds to update each line and the links come up as conjested.

Of interest is that sip debug does not actually display any of the calls being made. We only receive the "Really destroying SIP dialog" one every 7 seconds or so.

Trying to call through to an iax brings up a couple of frames before it dies on chan_iax2.c:7750 socket_process: Call rejected by <destination ip>: No such context/extension

bearing in mind that it works to dial out on asterisk 1.6 beta9
By: Kamil Czajko (kactus) 2008-07-02 03:36:09

Hello again

I'm uploading the valgrind files (suffixed with version and date) and I'll let it run over night.

I ommitted the malloc_debug file with 1.4 as it was empty

Please let me know if you require anything else

Cheers Kactus
By: Kamil Czajko (kactus) 2008-07-02 03:43:11

Just to clarify malloc and valgrind svn files are for SVN-trunk-r126681
By: Kamil Czajko (kactus) 2008-07-02 04:21:04

Hello again

As I was finishing up for the day I setup the SVN-trunk-r126681 box to let it run over night, and I noticed that it had stopped spamming the screen with a single line
Killed

I copied the valgrind and malloc files above suffixed as number2 crash.

I start asterisk with the standard
valgrind --log-file-exactly=valgrind.txt asterisk -vvvvcg 2>malloc_debug.txt

wait for it to finish loading then turn on the phone.

It seems to take 75 seconds to start, 20 for the phone to boot 15 seconds for it to hang and then dies 30 seconds later. This is only the svn xen virtual and I tried it a few times to confirm behavior was the same. (I hadn't rebooted after killing it at this point) Then after rebooting the virtual tried it again and found it was able to run as before (in that it spammed the screen with handle_response_peerpoke).

This crash was not seen on the 1.4.21.1 box.
By: Tilghman Lesher (tilghman) 2008-07-02 09:43:38

Patch uploaded that should fix at least part of the problem (applies to trunk)
By: Kamil Czajko (kactus) 2008-07-02 21:38:33

Hello Corydon

ran both sets of patches against both the SVN-trunk-r126681 and SVN-trunk-r127622M and in all cases I get:

ERROR[23985]: res_config_odbc.c:140 realtime_odbc: No database handle available with the name of 'DBNAME' (check res_odbc.conf) now spaming the screen instead of the peerpoke.

odbc show all shows that all connections are in use. No locks apparent.

I'll momentarily upload the valgrind and malloc files for the tests. I primarily have been looking at patch 2 as I assume its but a fix for patch 1.

Please let me know when you would like me to try anything else.

By: Kamil Czajko (kactus) 2008-07-02 21:48:57

Just an additional note, I killed the services after they had run for a while using the stop now command

Cheers
By: Kamil Czajko (kactus) 2008-07-02 22:31:10

Hi Corydon

Another thing to point out, when I start asterisk with the phones off, let it finish doing its thing, then connect the phones, the SVN-trunk-r127622M allows me to register the phone, (3 connections in use) and does not spam the screen. No locks.

so it appears that the peer poke issue is resolved and only a db connections locking issue remains.

The SVN-trunk-r126681M box seems to have registered a few times, thouwn a dozen errors (as it is out of db handles) then settled down. (this may just be the phone behaviour being different between the snom and the linksys)

It had several locks and I will upload in a moment.

however subsequent db accesses (such as reregistering the phones or trying to call out) then cause the spam of no database handles to occur, around a dozen per second.

Of additional interest, once the phone attempts to reregister (such as rebooting the phone) and is unable to the locks are no longer present.

All the best
By: Kamil Czajko (kactus) 2008-07-02 22:41:29

Its probably also useful to mention that I am connecting to realtime via ODBC to a microsoft sql db using freetds 0.64 drivers

I'm not sure if there have been any changes to the res_odbc code from beta to svn that might help locate the issue.

Since I would imagine not many people are using freetds with asterisk its probably why you are not being inundated with people having the issue :)

Thanks

By: Kamil Czajko (kactus) 2008-07-21 01:06:33

PROGRESS!!!!

Just letting you know that the issue does not appear to rear its ugly head with Asterisk SVN-trunk-r132312, however I have not run it for very long.

I'm still experiencing issue 12708 but we are getting somewhere at least in the Psip - SIP -> ASVN – IAX2 -> A1.2 -> PRI1 configuration, but I'm sure you guys will get it nutted out.

I'll let the system run for a while and let you know if it develops issues over time. For the time being though fingers crossed.

All the best - Kactus
By: Nicholas Blasgen (nblasgen) 2008-07-22 21:38:53

Also having the same issue. Was working just fine with Asterisk 1.4 SVN and Asterisk 1.6 SVN-rev-111565. So this was the current SVN rev in June. Recompiling in July to Asterisk 1.4.21 in order to see if a bug was fixed, now causes me to have a flood of registration requests (peerpoke).

This is from an upgrade. I run 50 Asterisk boxes and don't have the issue on the other 50, so I doubt it's a problem with any specific Asterisk deployment.

All platforms:
- CentOS 5.x
- Realtime: MySQL
- Asterisk 1.4.21.2

No issues with peers not located in RealTime.

By: Kamil Czajko (kactus) 2008-07-24 21:01:56

Hello again

I can confirm that this does not occur in SVN-branch-1.6.0-r133524

Thanks.
By: Tilghman Lesher (tilghman) 2008-07-25 08:07:08

kactus: so this is no longer a problem in either 1.6 or trunk?
By: Nicholas Blasgen (nblasgen) 2008-07-26 23:37:17

Confirmed as well. Asterisk SVN-branch-1.4-r133980 no issues reported.

Actually, I'll modify that statement. Something is still wrong. It's for sure not the same issue as I had before. I'm no longer being flooded with messages. I think the registration time is still being computed a little wrong (I have Asterisk configured to force registration after 600 seconds, but it seems like it's doing it after 60 seconds). I'm also loosing connectivity often, but because my X-Lite is behind a firewall I can't say it's an issue with Asterisk.

A difference between this server and the other 50+ Asterisk 1.4 systems I administer, is that this system (i just noticed) doesn't do RealTime Caching. But in any case, I'm no longer getting the flood of messages. Double check that the registration time is being computed correctly internally and then we're good to go I believe. Just odd that I'm loosing connectivity, but it's too late for me to see if the issue is limited to this machine or my other servers too.

UPDATE: Switched servers and then went back. Everything seems just fine at the moment. Registration time is looking good right now. I think this issue has been resolved.

By: Kamil Czajko (kactus) 2008-07-27 19:55:35

Hello Corydon

I've left a couple of boxes running over the weekend just to test this as in the past it would crash out after a day or so.

I can confirm that in both SVN-trunk-r132312 and SVN-branch-1.6.0-r133524 that it (I haven't tried any other svn revision yet but I would hope it had not been reintroduced :) )

A) no longer floods the screen with peer poke message
B) no longer locks the freeTDS connections
C) has not crashed in the last 3 days.

So far so good looks like this might be a winner. I haven't put the boxes under load but considering how quickly it used to fail previously I think its pretty safe to say its been fixed.

Brilliant :) now once 12708 is done that pretty much settles every issue I had with this release. (13121 only seems to be in SVN trunk and 13045 may be a faulty IAD but I can't test until that is replaced)

I'm happy to have this set to resolved unless you want me to test it a little longer. If it crops up again it can always be reopened.

Perhaps we should create a new case for Nblasgen's issue if the registration time issue crops up again, but beyond that it all looks good so far.

All the best and looking forward to the next beta/release client - Kactus
By: Tilghman Lesher (tilghman) 2008-08-07 12:08:05

Closed by request of reporter.