
Summary: ASTERISK-07785: Asterisk crashes on handling many subscriptions at the same time
Reporter: Lars Saathoff (ecvoip)
Labels:
Date Opened: 2006-09-21 08:39:59
Date Closed: 2011-06-07 14:00:46
Priority: Blocker
Regression?: No
Status: Closed/Complete
Components: Channels/chan_sip/Subscriptions
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: (0) deadlock_20060925.zip
(1) show_hints.txt
(2) sip_show_subscriptions.txt
Description: We have several snom 360 phones (~500) with configured destination keys. After rebooting 10 phones at the same time, the subscriptions for the destination keys cause a crash (hang) in the Asterisk chan_sip.so. We can reproduce the failure every time we reboot 10 phones at the same time with around 120 subscriptions.

****** ADDITIONAL INFORMATION ******

Previously established phone calls are not affected by this crash, but we are not able to register a new phone or make calls. We started at version 1.2.0 and we think that the failure has existed since that version. No errors are written to the error log or the debug output. The debug output stops before the failure occurs.
The failure does not appear if we reboot 10 phones without defined destination keys.

snom versions are 5.3.8 -> 6.5

Please tell us how to debug this problem for you.
Comments:

By: Serge Vecher (serge-v) 2006-09-21 08:45:49

A crash is when the Asterisk instance gets killed, which is not the case here. It appears that you are looking at a deadlock problem. Please produce a SIP debug per the following instructions to see what's going on here.

1) Prepare test environment (reduce the amount of unrelated traffic on the server);
2) Make sure your logger.conf has the following line:
  console => notice,warning,error,debug
3) Restart Asterisk.
4) Enable SIP transaction logging with the following CLI commands:
set debug 4
set verbose 4
sip debug
5) Save complete *CONSOLE* log to file and _attach_ said file to the bug.

By: Lars Saathoff (ecvoip) 2006-09-28 03:25:51

Here are the last few minutes before the deadlock. There are some subscriptions from 94276. We think these subscriptions are the reason for the deadlock.

By: Serge Vecher (serge-v) 2006-09-28 09:34:13

srv-voip02*CLI>
Disconnected from Asterisk server

Was this caused by you? Because if not, then Asterisk has crashed.

By: Lars Saathoff (ecvoip) 2006-09-28 10:00:58

Yes, I restarted it from a second console after the deadlock. It's a production system!

By: Serge Vecher (serge-v) 2006-09-28 10:05:11

ecvoip: ok, thanks. Let's proceed to debugging the deadlocked Asterisk. Please do the following:

  1) Go to http://www.voip-info.org/tiki-index.php?page=Asterisk%20debugging
  2) Read the "HowTo Debug a DeadLock in Asterisk" section
  3) Post the relevant output here.

By: jmls (jmls) 2006-11-01 12:17:53.000-0600

ecvoip, did you follow the above instructions? If so, what were the results?

By: Lars Saathoff (ecvoip) 2006-11-07 11:55:30.000-0600

OK. We are having some trouble reproducing this deadlock in a test environment with 10 phones. On the production system we have more than 400 snom phones.

But we are now testing a special version from the snom software engineers with the following modifications:

a.) Subscriptions are sent to Asterisk separately, each with a delay (5 minutes random)

b.) The phone reboots without sending subscriptions with expire=0

And now, with these settings, we don't have any problems...

We think Asterisk has a big problem handling many subscriptions at the same time. But we can't run a debug session on the production system.

"sip show subscriptions" shows more than 1000 subscriptions. With small systems (max. 100 subscriptions) we don't have a problem.

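The staggering described in (a) can be pictured with a minimal Python sketch. It is not snom firmware code: the function name `subscription_schedule`, the uniform distribution, and the 300-second ceiling (read from "5 minutes random") are assumptions made for illustration.

```python
import random


def subscription_schedule(keys, max_delay_s=300.0, rng=None):
    """Return (delay_seconds, key) pairs sorted by send time.

    Each destination key gets an independent uniform random delay in
    [0, max_delay_s], so a rebooting phone does not send all of its
    SUBSCRIBE requests in one burst. The 300 s default is an assumption.
    """
    rng = rng or random.Random()
    schedule = [(rng.uniform(0.0, max_delay_s), key) for key in keys]
    schedule.sort()  # earliest send time first
    return schedule


if __name__ == "__main__":
    # Hypothetical destination keys on one phone.
    for delay, key in subscription_schedule(["100", "101", "102"]):
        print(f"send SUBSCRIBE for {key} after {delay:.1f} s")
```

With ten phones rebooting together, each phone would spread its own requests over the window independently, which is presumably why the burst no longer reaches Asterisk all at once.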


By: Olle Johansson (oej) 2006-11-12 15:06:54.000-0600

ecvoip: Thanks for this report. I think that the subscription system is not really made for a system of this size. Having said that, we need your data in order to fix this. Good thing that SNOM fixed their side. I wonder if they would support a 503 with a Retry-After header. That way, Asterisk could signal a temporary overload and ask clients to come back after a given random time.

We need to check that with SNOM and see if that would work. If that's the case, we could implement a counter and say that "if we have over YYY open transactions, send 503", like the limit we have for number of open calls (maxcalls in asterisk.conf).

What do you think?

I would also like to know how many subscriptions you have per extension in the dial plan. Is it widespread, or do many phones subscribe to the same set of core extensions (like incoming lines)?
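
The counter idea above ("if we have over YYY open transactions, send 503") might look roughly like the following Python sketch. It is not Asterisk code and no such patch exists in this report: the class name `OverloadGate`, its methods, and the 5 to 60 second Retry-After range are all hypothetical.

```python
import random


class OverloadGate:
    """Admit transactions up to a fixed limit, otherwise answer 503.

    Hypothetical helper for illustration; not part of Asterisk or chan_sip.
    """

    def __init__(self, max_open, min_retry_s=5, max_retry_s=60, rng=None):
        self.max_open = max_open        # the "YYY" threshold from the comment
        self.min_retry_s = min_retry_s  # assumed lower bound for Retry-After
        self.max_retry_s = max_retry_s  # assumed upper bound for Retry-After
        self.open_count = 0
        self.rng = rng or random.Random()

    def begin(self):
        """Try to open a transaction; return (status_code, extra_headers)."""
        if self.open_count >= self.max_open:
            # A random Retry-After spreads out the retries of rejected clients.
            delay = self.rng.randint(self.min_retry_s, self.max_retry_s)
            return 503, {"Retry-After": str(delay)}
        self.open_count += 1
        return 200, {}

    def end(self):
        """Mark one admitted transaction as finished."""
        if self.open_count > 0:
            self.open_count -= 1
```

The sketch is single-threaded; a real implementation inside a multi-threaded server would have to protect the counter with a lock or an atomic operation.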

By: snomy (snomy) 2006-11-17 05:25:05.000-0600

>>oej: I think that the subscription system is not really made for this size of system.

Oh oh, I hope nobody is reading this, it makes me really scared... I hope at least the rest is made for this size of system ;-)

>>oej: Good thing that SNOM fixed their side.

This wasn't a fix of the snom phone; it was a hack added to the phone to keep Asterisk from hanging!

>>oej: Wonder if they would support a 503 with a retry-after header. That way, Asterisk could signal a temporary overload and ask clients to come back with a random time given.

This is a good idea! We could add this to the snom phone easily. Could you contact us at "support at snom dot com" so that we can queue it?

BTW: This Retry-After header would also be nice to have in a NOTIFY that Asterisk could send out before terminating or rebooting, to tell the phones to re-subscribe after a timeout, once Asterisk is up again.
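
On the phone side, honouring such a response could be as small as the Python sketch below. It is not snom code: the function name `resubscribe_delay` and the 30-second fallback are assumptions. It relies only on the SIP convention that a Retry-After value begins with a number of seconds, optionally followed by a comment or parameters.

```python
import re

# Leading integer (seconds) of a Retry-After header value.
_RETRY_AFTER = re.compile(r"\s*(\d+)")


def resubscribe_delay(status, headers, default_s=30):
    """Return seconds to wait before re-subscribing, or None if not a 503.

    `default_s` is an assumed fallback for a 503 without a usable
    Retry-After header.
    """
    if status != 503:
        return None
    match = _RETRY_AFTER.match(headers.get("Retry-After", ""))
    return int(match.group(1)) if match else default_s
```

A caller would typically schedule the next SUBSCRIBE after the returned delay instead of retrying immediately.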



By: Lars Saathoff (ecvoip) 2006-11-20 10:03:13.000-0600

>>oej: I would also like to know how many subscriptions you have per extension in the dial plan. Is it widespread or does many phones subscribe to the same set of core extensions (like incoming lines).

@oej
Look at the attached files!

By: Lars Saathoff (ecvoip) 2006-12-04 04:28:50.000-0600

@oej & snomy

Any news?

By: Serge Vecher (serge-v) 2006-12-04 09:36:57.000-0600

ecvoip: did you test 1.2.13 by the way?

By: Lars Saathoff (ecvoip) 2007-01-09 11:25:28.000-0600

Same problem, what should we do?

Without the snom fix, we have several deadlocks a day...

We need help!

By: Serge Vecher (serge-v) 2007-01-09 12:03:30.000-0600

ecvoip: can you please contact snom support at the email address provided? Hopefully the outline given by oej in note 0054459 is sufficient to get the process started on their end.

By: snomy (snomy) 2007-01-26 06:36:49.000-0600

We actually did this already, but we cannot test it with Asterisk without this patch. How can we proceed here?

By: Olle Johansson (oej) 2007-01-26 07:31:19.000-0600

I think what you need is a redesign of the subscription architecture, which is not a two-line hack done to fix a bug. So there's no simple patch.

I need to see what's hanging, so you need to attach gdb to the running Asterisk as instructed before.

By: snomy (snomy) 2007-01-26 07:51:55.000-0600

But then why did you suggest this before?

> Wonder if they would support a 503 with a retry-after header. That way,
> Asterisk could signal a temporary overload and ask clients to come
> back with a random time given.

By: Olle Johansson (oej) 2007-01-26 09:32:04.000-0600

Because that is part of the solution, but I have no patch.

To implement that, you need to change the architecture so that Asterisk can say "I'm too busy, please wait". And to do that, we need to... You see. There's a lot of stuff that needs to be done internally to get this properly solved.

I'm not saying it's impossible, but again, it's not a two-liner, and it needs someone who writes the code, or pays for writing the code, for it to happen.

By: Lars Saathoff (ecvoip) 2007-01-30 10:05:44.000-0600

OK. We will try to get a budget for this. Can you estimate how many hours of development are necessary for this patch, or in other words, what do you think is an appropriate amount of money to fix this problem? This patch should be usable for up to 3000 subscriptions per server (500 phones * 6 keys on average = 3000 subscriptions).
Maybe it is possible to integrate it with a MySQL database, so that the subscriptions are still available after a restart of the server and we don't have to wait for a new registration from the phones.



By: Olle Johansson (oej) 2007-02-02 11:41:40.000-0600

Please contact me off the bug tracker at oej@edvina.net! Thanks.

By: jmls (jmls) 2007-05-28 02:33:31

Was a patch ever produced for this? Can we close?

By: Olle Johansson (oej) 2007-05-29 01:06:17

No, it stays open.

By: Russell Bryant (russell) 2007-06-19 19:04:03

Greetings!  I would like to start looking at why this is deadlocking on your system.  Also, it would probably be better if I do this debugging on Asterisk 1.4.  The first step is to set up a system with debug information available.

-> Run "make menuselect", go down to "Compiler Flags", set DONT_OPTIMIZE and DEBUG_THREADS.  Hit 'x' to save and quit.

-> Rebuild and reinstall Asterisk.

Then, when the system locks up, you will need to get a core dump from the running Asterisk process.  You can do this by running the ast_grab_core script that is in the contrib/scripts directory of the source tree.

After running the script, you should have both a core dump and a backtrace file in the /var/tmp directory.  Please upload the gdb_dump file here.  Also, please don't delete these files as we will need to get more information from them.

Also, it would be easiest if it were possible for me to log in and use gdb to further analyze the core dump.  If this isn't possible, I can try to provide instructions for getting the information out of it that I need.

Let me know what I can do to help make some progress on this.  Thanks

By: Lars Saathoff (ecvoip) 2007-06-29 07:54:47

Russell, one question before we start.

Do you work for oej? I'm all mixed up about the whole thing...



By: Russell Bryant (russell) 2007-06-29 09:46:22

No, I do not work for oej.  I work for Digium.

I just came across this bug and saw that there had not been any progress.  I also see that you may be working with oej to improve the number of subscriptions Asterisk can handle.  My intentions are to simply make what is there not deadlock.  No amount of load on Asterisk should be able to get it to deadlock.

By: Russell Bryant (russell) 2007-07-20 17:49:50

Feel free to reopen this issue if you are interested in working with me to resolve the deadlock problems you were having.  Thanks.

By: Lars Saathoff (ecvoip) 2008-08-28 16:30:04

Ok, as we have no solution for this issue I feel free to reopen this thread. Is anybody there?

By: Lars Saathoff (ecvoip) 2008-08-28 16:38:23

Current setup:
Asterisk 1.2.24, stable with max. 450 clients and 2,000 subscriptions
snom 6.5.15



By: Theo Belder (tbelder) 2008-08-29 04:50:51

You should check the setting subscription_delay on the snom phone:
http://wiki.snom.com/Settings/subscription_delay

It is supported from snom fw version 6.5.1 and from version 7.1.35

By: Lars Saathoff (ecvoip) 2008-08-30 13:53:58

This setting was made especially for us, but it's only a workaround.

By: Leif Madsen (lmadsen) 2008-09-02 17:24:32

As Asterisk 1.2.x is end of life, I believe you will need to reproduce this issue on Asterisk 1.4.x. Please see russell's note http://bugs.digium.com/view.php?id=7997#65399 for details on the information needed to move this forward.

Thanks!

By: Russell Bryant (russell) 2008-09-06 16:42:56

Agreed with blitzrage.  Unless you're on 1.4, we can't help.