ASTERISK-12982: manager core dumps after 45,000 AMI originates

[Home]

Summary: ASTERISK-12982: manager core dumps after 45,000 AMI originates

Reporter: Ben Hubbard (hubbaba) Labels:

Date Opened: 2008-10-29 14:29:25 Date Closed: 2011-06-07 14:00:33

Priority: Critical Regression? No

Status: Closed/Complete Components: Core/ManagerInterface

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 13802.patch

Description: I have a Java GatewayDriver that links ami to agi for outbound calls. Each call lasts 10 seconds (stream 2 (5 second) files via fast-agi). After 45,000 calls (2 hours of processing 90 simultaneous calls on 4 PRI spans), Asterisk proceeds to core dump.

I opened the core in gdb. The segmentation fault is occurring in manager.c on line 2385. s->last_ev is NULL for some reason and the NEW_EVENT calls AST_LIST_NEXT which then fails because it tries to dereference s->last_ev.

I have a 62MB core file if anyone is interested in looking through it.

Thank you for your help! I really appreciate it!

****** ADDITIONAL INFORMATION ******

My driver originates the call through AMI and, once the channel has connected, links it to a fast-agi script that processes the call.

Comments: By: Brandon Kruse (bkruse) 2008-10-30 14:52:36

Can you post the backtrace?

-bk
By: Brandon Kruse (bkruse) 2008-10-30 14:56:26

Also, are you using the 'async' option in the command?

Give me an example of your manager packet that you send ( action: originate, etc)

-bk
By: Ben Hubbard (hubbaba) 2008-10-30 15:09:18

It was an originate command. It was not async.

Here is the backtrace:

#0 0x080d113a in process_events (s=0x839c160) at manager.c:2388
#1 0x080dabf1 in do_message (s=0x839c164) at manager.c:2865
#2 0x080db0a7 in session_do (data=0xb7b1c100) at manager.c:2928
#3 0x0812e1db in dummy_start (data=0xb7b2a990) at utils.c:898
#4 0xb7e104fb in ?? ()
ASTERISK-1 0xb7efee5e in ?? ()
By: Brandon Kruse (bkruse) 2008-10-30 15:12:50

If you could get a core dump with DONT_OPTIMIZE enabled (from menuselect), and paste a 'bt full'

Thanks,

Am looking into it anyways,

-bk
By: Ben Hubbard (hubbaba) 2008-10-30 15:26:00

Wish I could. I just put business edition on a few hours ago. I want to see if the bug occurs in BE. Here is the bt full (I know it doesn't help much without turning off optimization):

(gdb) bt full
#0 0x080d113a in process_events (s=0x839c160) at manager.c:2388
eqe = (struct eventqent *) 0x1
ret = 138002788
#1 0x080dabf1 in do_message (s=0x839c164) at manager.c:2865
m = {hdrcount = 0, headers = {0x0 <repeats 128 times>}}
header_buf = '\0' <repeats 1024 times>
res = <value optimized out>
#2 0x080db0a7 in session_do (data=0xb7b1c100) at manager.c:2928
s = <value optimized out>
flags = <value optimized out>
res = <value optimized out>
__PRETTY_FUNCTION__ = "session_do"
#3 0x0812e1db in dummy_start (data=0xb7b2a990) at utils.c:898
__cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {-1213082520,
0, -1248081008, -1248082984, 596050069, 834898414},
__mask_was_saved = 0}}, __pad = {0xb59bc490, 0x0, 0xb59bc3d4,
0xb7f72ff4}}
__cancel_arg = (void *) 0xb59bcb90
not_first_call = <value optimized out>
ret = <value optimized out>
#4 0xb7e104fb in ?? ()
No symbol table info available.
ASTERISK-1 0xb7efee5e in ?? ()
---Type <return> to continue, or q <return> to quit---
No symbol table info available.
(gdb)
By: Brandon Kruse (bkruse) 2008-10-30 15:29:40

Well BE is based on 1.4, so it will be completely different.

Thanks for the BT, will look into it.

-bk
By: Ben Hubbard (hubbaba) 2008-10-30 15:58:58

Thank you!
By: Ben Hubbard (hubbaba) 2008-10-31 08:57:47

I ran the same test last night on ABE and no core dump. So, it seems to be an issue with 1.6 that BE has resolved.

We'll do a long test over the weekend with BE and push 1,000,000 calls through to see what happens.
By: Brandon Kruse (bkruse) 2008-10-31 15:02:04

Actually, it's more like, is an issue that was introduced in 1.6 that doesn't exist in BE :/

I will continue to look into it in my investigation in manager for a few other problems as well.

-bk
By: Ben Hubbard (hubbaba) 2008-11-03 08:11:29.000-0600

Just an FYI. I ran my tests over the weekend using BE and pushed 1,536,777 calls through with only 17 failures. So, for the time being, we'll be on BE. Once you get to a point where you have a fix and are ready to test, I can setup another box with Asterisk 1.6 and try your fix there.

Thanks for your help! I really appreciate it!
By: Brandon Kruse (bkruse) 2008-11-04 15:40:08.000-0600

Since it is election day, I am uploading a patch :P

There is someone with a very similar (probably exact) bug to your, and I want you to test to see how it goes.

-bk
By: Ben Hubbard (hubbaba) 2008-11-05 10:36:54.000-0600

Thanks! Give me a bit to get a test system setup with 1.6 and I'll let you know if it works.

Also, we (DCC) will be down at Digium next week on Tuesday for meetings. Hope to meet you then.
By: Brandon Kruse (bkruse) 2008-11-05 15:48:24.000-0600

Great, look forward to you testing!

Yes, just ask for bkruse (Brandon Kruse), we will meet up and have a chat over some coffee or such :)

-bk
By: Ben Hubbard (hubbaba) 2008-11-07 08:03:37.000-0600

Ok, I have Asterisk 1.6 setup on my laptop. Yesterday, I ran the whole day with the original 1.6 code and it finally core dumped as I was walking out the door. This morning, I applied your path and it is currently running. It may take a day or two to see that it doesn't crash so I will let you know on Monday.

So far so good though. I've pushed 10,000 already in the last 20 minutes.
By: Brandon Kruse (bkruse) 2008-11-07 15:52:13.000-0600

Great!

I am actually usually not involved in bug tracking as I want to be, but I love manager, especially working on the Asterisk GUI and such.

Let me know how it goes, and I look forward to talking to you next week and hearing about this implementation!! I think I can give some good pointers!

-bk
By: Ben Hubbard (hubbaba) 2008-11-10 09:17:56.000-0600

Hi Brandon,

Well, it crashed. It performed just under a million calls with about 5,000 failures when it went down. This happened on Saturday in the afternoon. I turned off optimization and am running again. Right now, I've completed 1,000,000 calls with about 17,000 failed calls. When, and if, this crashes, I'll give you a full stack trace.

I'll walk you through everything tomorrow.

Ben
By: Ben Hubbard (hubbaba) 2008-11-10 09:18:48.000-0600

Also, it was in the destroy_session function where it was crashing this time. It was hard to tell from the stack trace with optimization on what exactly was happening. I hope to know more once this test completes.

Ben
By: Brandon Kruse (bkruse) 2008-11-10 10:01:05.000-0600

No problem Ben.

Even if we can get the function, I can look over the 1.4 version of it. Having a codebase that works often helps a lot. I would call this memory corruption, but 1.4 works!

Thanks for the info, look forward to talking to you tomorrow. You may want to email me at my mantis username @digium.com to setup some times. (Since I may be out of the office)

-Brandon
By: Ben Hubbard (hubbaba) 2008-11-10 16:54:14.000-0600

Here is the latest stack trace without optimization. This is running the patched version of Asterisk:

(gdb) bt full
#0 0x000000000050345f in ast_atomic_fetchadd_int (p=0x0, v=-1)
at /usr/src/asterisk-1.6.0.1/include/asterisk/lock.h:1185
No locals.
#1 0x000000000049e068 in unref_event (e=0x0) at manager.c:773
No locals.
#2 0x000000000049e0e5 in free_session (s=0x2318c1b0) at manager.c:792
eqe = (struct eventqent *) 0x0
#3 0x000000000049e1f8 in destroy_session (s=0x2318c1b0) at manager.c:800
No locals.
#4 0x00000000004a5b40 in session_do (data=0x7f03b40e4228) at manager.c:2946
ser = (struct ast_tcptls_session_instance *) 0x7f03b40e4228
s = (struct mansession *) 0x2318c1b0
flags = 2050
res = -1
__PRETTY_FUNCTION__ = "session_do"
ASTERISK-1 0x00000000004f9b5c in ast_make_file_from_fd (data=0x7f03b40e4228)
at tcptls.c:469
ser = (struct ast_tcptls_session_instance *) 0x7f03b40e4228
__PRETTY_FUNCTION__ = "ast_make_file_from_fd"
ASTERISK-2 0x0000000000504ca5 in dummy_start (data=0x7f03b40ab400) at utils.c:917
__cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {0,
9016794212168105489, 0, 1082468688, 4096, 1105133360,
9016794212193271313, 9016935682656697873}, __mask_was_saved = 0}},
---Type <return> to continue, or q <return> to quit---
__pad = {0x408521e0, 0x0, 0x7f03bb439878, 0x7f03bb439890}}
__cancel_routine = (void (*)(void *)) 0x430f96 <ast_unregister_thread>
__cancel_arg = (void *) 0x40852950
not_first_call = 0
ret = (void *) 0x40852950
a = {start_routine = 0x4f9a98 <ast_make_file_from_fd>,
data = 0x7f03b40e4228,
name = 0x7f03b41d9590 "ast_make_file_from_fd started at [ 161] tcptls.c ast_tcptls_server_root()"}
ASTERISK-3 0x00007f03bacc73f7 in start_thread () from /lib/libpthread.so.0
No symbol table info available.
ASTERISK-4 0x00007f03bb1b8b3d in clone () from /lib/libc.so.6
No symbol table info available.
ASTERISK-5 0x0000000000000000 in ?? ()
No symbol table info available.
By: Ben Hubbard (hubbaba) 2008-11-13 11:47:19.000-0600

Any luck on figuring out what is happening?
By: Brandon Kruse (bkruse) 2008-11-13 13:41:40.000-0600

Didn't see that you posted it without optimization!

Have you tried with trunk? 1.6.0x and 1.6.1 even, are very different (as far as manager is concerned). 1.6.1 would be a little safer than trunk, and should still have a multitude of changes.

We could make those functions null-safe, but it would just be covering up the problem.

What is the ulimit on your server? And the ulimit -n?

-bk
By: Ben Hubbard (hubbaba) 2008-11-14 08:40:17.000-0600

I'll try 1.6.1 and let you know what I find out. I agree that covering for null is not the right solution.

Also, ulimit is ulimited and ulimit -n is 1024.
By: Brandon Kruse (bkruse) 2008-11-14 16:27:24.000-0600

How much memory is in the box?

If around 4 gigs or more, try:

ulimit -n 999999

That will allow us to see if it's because there are not fd's available or not.

The box will most likely crash, but hopefully in a more useful way.

-bk
By: Ben Hubbard (hubbaba) 2008-11-15 10:58:12.000-0600

I have 4GB ram in my laptop where I am testing now. I'll give it a shot and let you know what I find out.
By: Tilghman Lesher (tilghman) 2009-01-16 14:13:41.000-0600

hubbaba: any further updates?
By: Tilghman Lesher (tilghman) 2009-01-29 15:23:37.000-0600

There has been no response or update from the reporter in 2 months. If this is still occurring, please feel free to reopen this issue. Closing.