ASTERISK-00776: 127th Conncurent SIP<->SIP call causes segfault

[Home]

Summary: ASTERISK-00776: 127th Conncurent SIP<->SIP call causes segfault

Reporter: e (e) Labels:

Date Opened: 2004-01-10 23:42:21.000-0600 Date Closed: 2004-09-25 02:49:39

Priority: Critical Regression? No

Status: Closed/Complete Components: Core/General

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 127th_concurrent_call_segfault.txt
( 1) asterisk_run.pid30412.gz
( 2) bt_full.txt
( 3) bug.tar.gz
( 4) p-console.gz
( 5) release_0_7_1_bt_full.txt

Description: With 126 active calls the 127th call causes a segfault.

****** ADDITIONAL INFORMATION ******

All information, testbed configuration etc is located in the attatched file.

Comments: By: e (e) 2004-01-10 23:48:07.000-0600

This looks like it could be the same issue as in bugid 422. Possibly debian specific ? I will see about getting another server and turning it up on a different distro and see what happens on Monday.
By: jrollyson (jrollyson) 2004-01-11 04:29:40.000-0600

Anybody else able generate enough calls to duplicate this?
By: Tilghman Lesher (tilghman) 2004-01-11 14:39:16.000-0600

Since it causes a segfault and a coredump, can you post a 'bt full' for this?
By: jrollyson (jrollyson) 2004-01-14 03:16:36.000-0600

Were you running under GDB when you made that bt full, or did you make that backtrace from the corefile. If possible we need the bt full off the core file.
By: John Todd (jtodd) 2004-01-14 15:09:23.000-0600

What are you using to create calls? Is this "normal" customer traffic, or are you using a call generator?
By: e (e) 2004-01-14 15:46:22.000-0600

Using a call generator see the 127th_concurrent_call_segfault.txt file for testbed configuration. Also the bt full I produced is from a corefile. i just reran the test against the 0.7.1 release and it now seems that the magic number has dropped to 102 calls, but the bt full produces the same footprint.
By: e (e) 2004-01-14 15:50:21.000-0600

I've added the bt full from the 0.7.1 release. I am puzzled as to why gdb is only producing that single frame in the bt let me know if I can get you anything else.
By: jrollyson (jrollyson) 2004-01-15 18:52:59.000-0600

You may want to find kram on irc.freenode.net in #asterisk and have him take a look at it, he will probably need to take a look at the system and possibly attach gdb to it as you are blowing it up.
By: Brian West (bkw918) 2004-01-17 23:04:30.000-0600

just for giggles look in app_dial for a line like this:

#define MAX 256

set it to 512 and see if you make it to 256 before it dies. I'm not sure that will have anything to do with it.. but it might.
By: Brian West (bkw918) 2004-01-17 23:14:19.000-0600

find me on IRC I wanna get in the box and watch this.. my nick is bkw_ on #asterisk
By: John Todd (jtodd) 2004-01-19 08:46:55.000-0600

e - out of curiosity, what is the call generator package you are using? Is it open-source or commercial? I'd like to try to replicate the problem, but I don't feel like creating the tools myself if they're already available...
By: zoa (zoa) 2004-01-28 04:17:56.000-0600

i know its some hardware device.
jtodd: i added a request for a call generator, post any ideas there :)
By: Mark Spencer (markster) 2004-02-08 23:50:20.000-0600

Can you e-mail me contact information, e, I'd like to check it out. Clearly the backtrace isn't fully useful, but it looks like it's got to be something pretty obvious.
By: zoa (zoa) 2004-02-25 07:37:27.000-0600

i tried to reproduce it and i can confirm the bug.

sometimes it coredumps, sometimes it just gives me errors about too much open sockets.

increasing the ulimit doesnt help.

Also the number of calls you can do depends on the ulimit and the system you are using.

I tried on 3 machines, max i got was 171.
By: e (e) 2004-02-27 11:27:29.000-0600

I sent Kram my ph# when he gets a chance I can give him access. If any of the other bug marshals wanna get in a take a look i'm usually on hanging on IRC durning the day (CST) nick is _E_ . Out of curiousity zoa what did you use for call generation ?
By: Mark Spencer (markster) 2004-03-19 01:13:07.000-0600

Okay I've made my first pass at this. Please note that these changes are in CVS head *only*. Pass #1 only makes a difference if you have a zaptel interface in the machine. If you don't, you'll have to wait for pass #2. Pass #1 saves two file descriptors per channel (so if it's SIP to SIP, that's 4 total), but again can only save it if you have zaptel in it.
By: e (e) 2004-03-20 13:43:22.000-0600

So I am using ztdummy as opposed to a true hardware device, but it looks like I'm hitting ~160 concurrent calls every time now before the segfault event, it dies in the same function with the same useless backtrace. Out of curiousity what is/was the issue there ? Was the systems simply runnning out of file descriptors? Just curious having never seen any kernel complaints. I also have my lab systems back and I can run it up to 207 calls so gimmee a shout if you need to get into it for anything.
By: Mark Spencer (markster) 2004-03-20 14:32:21.000-0600

I'd like to login to your system and analyze your core and failure. Can you find me on irc plz.
By: Mark Spencer (markster) 2004-03-20 14:33:36.000-0600

Also, out of curiosity, are you using reinvites or not?
By: e (e) 2004-03-20 14:35:37.000-0600

For informational purposes the magic number seems to currently be 168. I ran with a max of 167 originated calls for 15 minutes with no errors. FYI
By: e (e) 2004-03-22 21:11:13.000-0600

Ran it under valgrind with the following command line....

valgrind --tool=memcheck -v --leak-check=yes --track-fds=yes --logfile=asterisk_run --num-callers=20 --trace-pthread=some --trace-malloc=yes /usr/sbin/asterisk -v

the logfile is attatched
By: e (e) 2004-03-23 10:49:11.000-0600

Ok scratch that last one further analysis it looks like valgrind barfed and not asterisk. I'll repost when I get something useful.
By: Mark Spencer (markster) 2004-03-29 12:38:34.000-0600

I've made some recent changes that made a big difference in IAX, but i'm not sure if they'll make a difference on SIP. Can you upgrade to latest cvs head and tell me if anything changed?
By: e (e) 2004-03-30 00:19:00.000-0600

upgraded to cvs as of around 11PM tonight and no change, seg-faulted at 166
By: rmarchev (rmarchev) 2004-04-02 18:49:23.000-0600

Reproduced the problem under 110-120 calls on dual Dell Intel(R) Xeon(TM) CPU 3.20GHz, Linux 2.4.22, asterisk v1-0_stable
By: Mark Spencer (markster) 2004-04-02 19:21:26.000-0600

Please use -head, not -stable for testing
By: Mark Spencer (markster) 2004-04-03 21:06:05.000-0600

Can you duplicate it with -head?
By: rmarchev (rmarchev) 2004-04-05 11:47:51

Yes it is the same with -head. I think it is starting to break appart when it hits debug message Apr 2 20:13:38 DEBUG[71794794]: channel.c:2364 ast_channel_bridge: Nobody there, continuing...
Then it is in infinite loop until channel would hang.

Here is bt:
#0 0x406af320 in create_addr (r=0x8586830, peer=0xb23feda1 "media02-dev")
at chan_sip.c:907
#1 0x08586dc4 in ?? ()
#2 0x406ad435 in sip_request (type=0xb23ff42c "SIP", format=4, data=0x8586830)
at chan_sip.c:6076
#3 0x0805aaf0 in ast_request (type=0xb23ff42c "SIP", format=4, data=0x0)
at channel.c:1722
#4 0x4041ecf5 in dial_exec (chan=0x40f0c4b0, data=0xb23ff42c)
at app_dial.c:551
ASTERISK-1 0x08064daa in pbx_exec (c=0x40f0c4b0, app=0x80ef200, data=0xb23ff76c,
newstack=1) at pbx.c:396
ASTERISK-2 0x0806c3c1 in pbx_extension_helper (c=0x40f0c4b0,
context=0x40f0c608 "webley_sip", exten=0xb23ff42c "SIP", priority=1,
callerid=0x40eff628 "\"Connect Test\" <111>", action=135281328)
at pbx.c:1176
ASTERISK-3 0x08066d1c in ast_pbx_run (c=0x8103ab0) at pbx.c:1660
ASTERISK-4 0x0806ca81 in pbx_thread (data=0x0) at pbx.c:1885
ASTERISK-5 0x4002e6de in pthread_start_thread () from /lib/libpthread.so.0
By: Mark Spencer (markster) 2004-04-09 13:44:15

Are you able to make it die under valgrind and get a more meaningful backtrace?
By: jcdutton (jcdutton) 2004-04-09 14:52:15

I will offer to help. If I can ssh into the system and actually watch it fail, I should be able to track it down.

I am jcdutton in irc #asterisk.
By: alric (alric) 2004-04-14 10:23:09

Has any progress been made, or any attempt to contact jcdutton by rmarchev?
By: jcdutton (jcdutton) 2004-04-14 13:15:27

I have since found out that it is pointless for me to try, so no need to contact me.
By: zoa (zoa) 2004-04-14 13:47:08

jcdutton find me on #asterisk please, you seem to be very angry about something.
By: rmarchev (rmarchev) 2004-04-14 16:06:23

jcdutton,

I am sorry, I am working on several issues concurrently. That is the rason I did not contact you in timely fashion. I will work with our sysadmins to open ssh connection for you to our box.
By: jcdutton (jcdutton) 2004-04-14 18:45:02

Don't be sorry.
I made the offer before I knew that I have to sign a disclaimer before any fixes go into cvs. Now that I know about the disclaimer, and know that I am not going to sign it. So, if I found the problem and fixed it, the fix would not go in cvs. I therefore conclude that the whole exercise would be pointless.

edited on: 04-14-04 17:39
By: Mark Spencer (markster) 2004-04-15 15:15:24

In any case I'll be happy to login and work on it if you can make it happen under valgrind.
By: rmarchev (rmarchev) 2004-04-15 15:23:02

I will try to reproduce it under valgrind. It will take some time. I will let you know as soon I will be ready.
By: Mark Spencer (markster) 2004-04-21 21:08:13

I believe this is now fixed in latest CVS. The problem (at least on the box we were testing) is that gethostbyname is not reentrant. That is now fixed in CVS head and we can no longer duplicate the crash on the test machine. Please confirm, guys.
By: rmarchev (rmarchev) 2004-04-22 13:01:11

Unfortunately I cannot make more than 120 calls without a crash. Here are two crash back traces and log file attached

1.

#0 wait_for_answer (in=0x40be78f0, outgoing=0x86a0b68, to=0xb13fef74,
allowredir_in=0xb13fef78, allowredir_out=0xb13fef7c,
allowdisconnect=0xb13fef80) at app_dial.c:186
186 if (option_verbose > 2)
(gdb)
(gdb) bt
#0 wait_for_answer (in=0x40be78f0, outgoing=0x86a0b68, to=0xb13fef74,
allowredir_in=0xb13fef78, allowredir_out=0xb13fef7c,
allowdisconnect=0xb13fef80) at app_dial.c:186
#1 0xb13feadc in ?? ()
#2 0x4041f035 in dial_exec (chan=0x40be78f0, data=0x86bf5b8) at app_dial.c:679
#3 0x08065216 in pbx_exec (c=0x40be78f0, app=0x80f02e8, data=0xb13ff76c,
newstack=1) at pbx.c:413
#4 0x0806c811 in pbx_extension_helper (c=0x40be78f0,
context=0x40be7a48 "webley_sip", exten=0x86a0b68 "¸õk\bÿÿÿÿ", priority=1,
callerid=0x40bc9638 "\"Connect Test\" <111>", action=294912) at pbx.c:1195
ASTERISK-1 0x0806717c in ast_pbx_run (c=0x48000) at pbx.c:1679
ASTERISK-2 0x0806ced1 in pbx_thread (data=0x0) at pbx.c:1904
ASTERISK-3 0x4002e6de in pthread_start_thread () from /lib/libpthread.so.0

2.

#0 0x4041e080 in _init () from /usr/lib/asterisk/modules/app_dial.so
(gdb)
(gdb) bt
#0 0x4041e080 in _init () from /usr/lib/asterisk/modules/app_dial.so
#1 0x4041f035 in dial_exec (chan=0x40e00018, data=0x1) at app_dial.c:679
#2 0x08065216 in pbx_exec (c=0x40e00018, app=0x80f02e8, data=0xb15ff76c,
newstack=1) at pbx.c:413
#3 0x0806c811 in pbx_extension_helper (c=0x40e00018,
context=0x40e00170 "webley_sip", exten=0xb15feadc "\030", priority=1,
callerid=0x40c4a4c8 "\"Connect Test\" <111>", action=0) at pbx.c:1195
#4 0x0806717c in ast_pbx_run (c=0x0) at pbx.c:1679
ASTERISK-1 0x0806ced1 in pbx_thread (data=0xb15fef74) at pbx.c:1904
ASTERISK-2 0x4002e6de in pthread_start_thread () from /lib/libpthread.so.0
By: Mark Spencer (markster) 2004-04-23 09:33:12

rmarchev: we'll need to run the test under valgrind to get a better picture for what is going wrong. Can you find me on IRC? Thanks!
By: rmarchev (rmarchev) 2004-04-23 12:45:09

I think, all crashes that I reported where result of 1024 file descriptors limit on select statement. We encreased the limit to 4096 and now I cannot reproduce the problem. I think this bug should be closed.
By: Olle Johansson (oej) 2004-04-23 13:36:53

Could you please add how you incresed the number of file descriptors, so I can add it to the docs?
By: rmarchev (rmarchev) 2004-04-23 15:54:14

modify /usr/include/sys/select.h
add after #include <bits/types.h>
#undef __FD_SETSIZE
#define __FD_SETSIZE 4096
By: Olle Johansson (oej) 2004-04-23 16:12:18

...and recompile the kernel, or?
By: Mark Spencer (markster) 2004-04-23 17:43:11

How about we just use poll instead of select? It's more efficient anyway.
By: rmarchev (rmarchev) 2004-04-23 18:46:00

We did not recompile the kermel, just zaptel, libpri, and asterisk, but Mark is right it would be much more clean to use poll instead of select. Modification of system header is not clean solution.
By: cloos (cloos) 2004-04-23 18:46:26

poll(2) would be an improvement.

epoll may be even better, where it is supported.
By: Mark Spencer (markster) 2004-04-25 22:59:49

I've got poll implemented for everything that SIP -> Zap touches so let me know if this helps.
By: Mark Spencer (markster) 2004-04-26 08:47:55

rmarachev: can you confirm this really fixes it?
By: rmarchev (rmarchev) 2004-04-26 15:45:47

I will test it today
By: rmarchev (rmarchev) 2004-04-26 19:18:42

There is 20 seconds delay between first INVITE I would send from the SIP phone and any reaction of *. I turned on sip debug and I do not see first INVITE in a log for 20 seconds.
By: Mark Spencer (markster) 2004-04-26 19:32:12

This really sounds like some sort of DNS issue. I can't duplicate that problem here. Can you confirm? maybe with gdb?
By: rmarchev (rmarchev) 2004-04-26 19:41:54

Yes. You are right. I am starting the test now.
By: rmarchev (rmarchev) 2004-04-26 19:59:58

200 concurrent bridged sip<=>sip calls without any problems
By: Mark Spencer (markster) 2004-04-26 20:03:52

Excellent. Thanks!
By: e (e) 2004-04-27 12:14:02

Tested and confirmed fixed here as well.
By: Mark Spencer (markster) 2004-04-27 13:21:10

Finally, really, fixed in CVS head.