Summary: ASTERISK-09969: chan_sip hangs with big number of sip channels
Reporter: Igor Goncharovsky (igorg)
Labels:
Date Opened: 2007-07-27 02:01:22
Date Closed: 2007-08-02 12:17:39
Priority: Major
Regression?: No
Status: Closed/Complete
Components: Channels/chan_sip/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) sip_hang_070731.txt
( 1) sip_hang_070801.txt
( 2) sip_hang.txt
Description: After two days on a production system (4 ISDN lines), SIP phones can't register or make calls. The mISDN channel seems to work (I see incoming calls on the console).

'sip show channels' shows 91 active(!) channels

'show channels' shows only 9 active channels

The 'restart now' command does not restart Asterisk.



****** ADDITIONAL INFORMATION ******

The only modification made to Asterisk is setting the priority flag on the socket for VLAN QoS.
Comments:
By: Joshua C. Colp (jcolp) 2007-07-27 11:35:04

It appears your system may be deadlocked. What is the regular callflow for these calls? Can you also use the ast_grab_core script in contrib/scripts to get a backtrace? Thanks.

By: Eliel Sardanons (eliel) 2007-07-29 00:34:05

I think this could be a duplicate of ASTERISK-8038

By: Eliel Sardanons (eliel) 2007-07-29 00:37:24

Are you using call-limit for the members in the Queue?

By: Igor Goncharovsky (igorg) 2007-07-29 11:22:24

Yes, it seems very similar to 0008260. Here is more info about this system:
1) Yes, it is a call center. There are 8 incoming lines, 3 DIDs and 4 main queues for incoming calls. A call enters a queue and is then usually transferred to another peer.
2) I have now upgraded to 1.2.23. I have enabled DEBUG_THREADS and I'll use ast_grab_core as soon as the bug occurs.
3) SIP peers are configured from realtime (mysql).
4) call-limit for SIP peers isn't set; the column does not exist in the database.



By: Igor Goncharovsky (igorg) 2007-07-31 03:21:37

Ok, I have made a backtrace; it is attached.

By: Steve Murphy (murf) 2007-07-31 22:39:17

I looked at the core, but I see nothing helpful there concerning a deadlock... (anyone else see something I don't?).

But, recently we discovered that gcore related mechanisms don't yield as readable a core file as you would get if the process crashed and dumped core. So, while I've made recent mods to ast_grab_core, I suggest you try this again, and repeat these steps by hand:

1. find the pid (process id) of the asterisk process. You can follow ast_grab_core for ways to do this; you can use ps, or look for the lock file...

2. as root, issue the command "kill -11 <pid>" where <pid> is the process id you dug up. Asterisk will die, and dump a core, usually "core.<pid>"; it's in whatever directory asterisk was started in (commonly /tmp, if you use safe_asterisk).

3. run "gdb asterisk <path to the core.<pid> file>"

4. tell gdb to "thread apply all bt full"

5. collect the results and post them to this bug. If all goes well, there will be much more detail in the backtrace!
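
The manual steps above can be sketched as a small script. This is only a sketch under assumptions: the pid file path /var/run/asterisk.pid, the /tmp start directory, the output file name backtrace.txt and all helper names are illustrative and are not taken from ast_grab_core. The gdb options -batch and -ex are standard gdb options.

```python
import os
import signal
import subprocess
import time
from pathlib import Path

# Assumed locations (not from this thread); adjust for your installation.
PID_FILE = Path("/var/run/asterisk.pid")
CORE_DIR = Path("/tmp")  # directory asterisk was started in


def read_pid(pid_file=PID_FILE):
    """Step 1: read the asterisk process id from its pid file."""
    return int(Path(pid_file).read_text().strip())


def core_path(pid, core_dir=CORE_DIR):
    """Step 2: the core is usually named core.<pid> in the start directory."""
    return Path(core_dir) / f"core.{pid}"


def gdb_command(core, binary="asterisk"):
    """Steps 3 and 4: run gdb non-interactively and print every thread's full backtrace."""
    return ["gdb", "-batch", "-ex", "thread apply all bt full", binary, str(core)]


def collect_backtrace(output="backtrace.txt"):
    """Kill asterisk with signal 11, then save the gdb output (step 5)."""
    pid = read_pid()
    os.kill(pid, signal.SIGSEGV)  # same effect as "kill -11 <pid>"; needs root
    time.sleep(5)  # crude wait for the core file to be written
    result = subprocess.run(gdb_command(core_path(pid)), capture_output=True, text=True)
    Path(output).write_text(result.stdout)
```

Nothing runs on import. Calling collect_backtrace() as root kills the running asterisk process, so it only makes sense on a system that is already hung.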

By: Joshua C. Colp (jcolp) 2007-08-01 09:22:20

Indeed, chan_sip does not appear in there.

By: Eliel Sardanons (eliel) 2007-08-01 09:27:58

file, will all the 1.2 issues be dropped today?

By: Joshua C. Colp (jcolp) 2007-08-01 09:30:21

Today is the day that 1.2 goes into a security fixes only state so we can focus on 1.4.

By: Igor Goncharovsky (igorg) 2007-08-02 03:24:21

As said before, this may be related not only to chan_sip, but also to app_queue. Maybe app_queue blocks SIP in some way?

One day after the restart it hung again. I have uploaded a new backtrace. Also, I can no longer run the buggy 1.2.23, and tonight I'll go back to 1.2.17. If any new debugging is needed, I'll install 1.2.23 again.

I think there is positive progress toward resolving this issue, so there is no need to close it.

By: Steve Murphy (murf) 2007-08-02 12:11:22

This last backtrace was indeed better!

I've spent over an hour looking it over;

threads 3 and 9 are both in queue_exec, to the same number, it appears.

thread 18 appears to be in a lock, do_devstate_changes, and thread 7 is in a lock;
but the rest of the trace for that thread conveys no helpful info.

threads 3, 5, 9, are all executing macro-dial-queue

But none of this helps much in tracking why things are frozen!
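
Reading a long "thread apply all bt" dump by hand is slow, so a rough summary script may help group threads by the functions on their stacks. This is a sketch that assumes the usual gdb layout ("Thread N (...):" headers followed by "#K ... in func (" frame lines). The function name threads_by_function and the sample text are made up for illustration; the sample is not the attached backtrace.

```python
import re
from collections import defaultdict

THREAD_RE = re.compile(r"^Thread (\d+) \(")
# Frame lines look like "#0  0x... in func (args)" or "#1  func (args) at file.c:12".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?([A-Za-z_][\w.@]*) \(")


def threads_by_function(backtrace_text):
    """Map each function name to the sorted thread numbers whose stack contains it."""
    seen = defaultdict(set)
    current = None
    for line in backtrace_text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            seen[frame.group(1)].add(current)
    return {name: sorted(ids) for name, ids in seen.items()}


# Made-up sample in gdb's format, NOT taken from the attached files.
SAMPLE = """\
Thread 2 (Thread 0xb7c0 (LWP 101)):
#0  0x00110402 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x0805a1b2 in do_devstate_changes (data=0x0)

Thread 1 (Thread 0xb7d0 (LWP 100)):
#0  0x00110402 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00a1c3d4 in queue_exec (chan=0x81, data=0x82)
"""

summary = threads_by_function(SAMPLE)
print(summary["__lll_lock_wait"])  # threads waiting on a lock in the sample
print(summary["queue_exec"])       # threads inside queue_exec in the sample
```

Such a summary only shows which threads share a function; it does not by itself say which thread holds the lock the others are waiting for.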

By: Steve Murphy (murf) 2007-08-02 12:17:38

The 1.2 support period is over; we advise moving up to 1.4, and we've got some nifty deadlock debugging tools there. It could also be that your problem will go away in 1.4.

Yes, yes, I know that moving up to 1.4 will produce (maybe) more headaches than what you have now, but we made the decision to end support on 1.2, so we can concentrate on 1.4, and better focus our efforts.

If this problem recurs on 1.4 (and we all hope it doesn't), please feel free to reopen this bug, or file a new one, and some of the new stuff that Russell has designed will hopefully be in 1.4 by then, to help us locate the deadlock (if it is indeed a deadlock).

I know this will not come as good news to you, but really, if you move up to 1.4, you at least will have some support...!