|Summary:||ASTERISK-17474: when using mpg123 as a streaming MOH source, issuing 'moh reload' from CLI causes stream to die|
|Reporter:||Luke H (luckman212)||Labels:|
|Date Opened:||2011-02-24 13:48:13.000-0600||Date Closed:||2011-12-27 14:56:40.000-0600|
|Versions:||126.96.36.199 188.8.131.52||Frequency of|
|Environment:||Attachments:||( 0) asterisk-17474-dahdi_timing-infinite-wait-fix_v1.diff|
( 1) asterisk-17474-dahdi_timing-infinite-wait-fix_v2.diff
( 2) asterisk-17474-dahdi_timing-infinite-wait-fix_v3_branch-1.8.diff
( 3) asterisk-17474-dahdi_timing-infinite-wait-fix_v3_branch-10.diff
( 4) asterisk-17474-dahdi_timing-infinite-wait-fix_v3.diff
( 5) core_reload_per_remote_connection_with_config_changes.log
( 6) simple_core_reload_without_config_changes.log
|Description:||Title says it all, I have a single stream set up as a MOH source, with the following code (server ip removed)|
application=/usr/bin/mpg123 -q -s -m -r 8000 -f 2048 -b 0 http://my.streaming.server/stream_mp3
Everything works fine upon the initial load. But after issuing a 'moh reload' the stream goes dead. I checked to make sure MPG123 is still running (ps aux| grep mpg123) and yes it is. I turned on verbose logging via logger.conf and noted the following while issuing 'moh reload':
[2011-02-24 14:45:15] VERBOSE config.c: == Parsing '/etc/asterisk/musiconhold.conf': [2011-02-24 14:45:15] VERBOSE config.c: == Found
[2011-02-24 14:45:15] VERBOSE config.c: == Parsing '/etc/asterisk/musiconhold_custom.conf': [2011-02-24 14:45:15] VERBOSE config.c: == Found
[2011-02-24 14:45:15] VERBOSE config.c: == Parsing '/etc/asterisk/musiconhold_additional.conf': [2011-02-24 14:45:15] VERBOSE config.c: == Found
[2011-02-24 14:45:15] DEBUG res_musiconhold.c: killing 24322!
[2011-02-24 14:45:15] DEBUG res_musiconhold.c: mpg123 pid 24322 and child died after 56752 bytes read
[2011-02-24 14:45:48] VERBOSE config.c: == Parsing '/etc/asterisk/musiconhold.conf': [2011-02-24 14:45:48] VERBOSE config.c: == Found
[2011-02-24 14:45:48] VERBOSE config.c: == Parsing '/etc/asterisk/musiconhold_custom.conf': [2011-02-24 14:45:48] VERBOSE config.c: == Found
[2011-02-24 14:45:48] VERBOSE config.c: == Parsing '/etc/asterisk/musiconhold_additional.conf': [2011-02-24 14:45:48] VERBOSE config.c: == Found
[2011-02-24 14:45:48] DEBUG res_musiconhold.c: killing 25173!
[2011-02-24 14:45:48] DEBUG res_musiconhold.c: mpg123 pid 25173 and child died after 60408 bytes read
I checked these PIDs against the 'ps aux' output and confirmed that they are aligned & correct. The fact that mpg123 says "60408 bytes read" even though calling the moh extension leads to silence leads me to believe that the pipe somehow got messed up but MPG123 was still streaming data (I confirmed this by watching traffic on my router with tcpdump).
|Comments:||By: Luke H (luckman212) 2011-02-24 13:50:56.000-0600|
Also have to add that this does not occur if using only 'res_timing_pthread' -- the bug only occurs if you have res_timing_dahdi loaded. Unfortunately, disabling res_timing_dahdi on my system results in extreme instability and constant deadlocks.
By: Luke H (luckman212) 2011-03-02 07:07:07.000-0600
I'd also like to add that while this might not seem like a serious bug it's a quite frustrating one because anyone using a gui for asterisk (e.g. FreePBX), every time they save the config it causes several modules to be reloaded, MOH is one of them. So even if you are not making moh-related changes (e.g. just adding an extension) you will kill your MOH.
Is there any idea where to even begin hunting for this bug? would the bug likely be in musiconhold.c or is it more of a timing bug? anyone have any idea?
By: Vladimir Mikhelson (vmikhelson) 2011-03-02 21:48:12.000-0600
There are at least three reports on this issue. So far no resolution. My workaround is to preload res_musiconhold.so in modules.
But as you said, this is until the first reload. I am using FreePBX too.
By: Luke H (luckman212) 2011-03-04 07:32:43.000-0600
I noticed today that a new version of DAHDI has been released (2.4.1+2.4.1)
Perhaps there are changes in there that might positively impact this bug? I haven't recompiled yet. Plan to do so this weekend.
By: Vladimir Mikhelson (vmikhelson) 2011-03-06 16:30:56.000-0600
Nope, DAHDI 2.4.1 does not seem to help...
By: Luke H (luckman212) 2011-03-06 18:20:57.000-0600
By: Luke H (luckman212) 2011-05-12 22:51:03
Does anyone have a workaround for this yet? I would like to help debug this but I don't even know where to begin looking. is the bug in res_timing_dahdi or res_musiconhold ? or somewhere else??
By: Luke H (luckman212) 2011-05-16 22:03:53
would it help If I posted a bounty for this? not sure how to go about doing that or if any others would be interested in contributing.
By: Vladimir Mikhelson (vmikhelson) 2011-05-16 23:12:34
It seems like the issue is mis-categorized. It probably belongs to res_timing_dahdi.so or to DAHDI, not quite sure.
Dear administrators, can you please assign the right category to the issue so that the right developers would take a look? Or should the issue be re-posted?
By: Brett Bryant (bbryant) 2011-05-23 15:49:53
I'm currently taking a look at this issue and hope to have a fix available for you guys soon.
By: Vladimir Mikhelson (vmikhelson) 2011-05-23 17:01:32
That would be great. Please let me know if any help is needed.
By: Luke H (luckman212) 2011-05-23 18:56:04
Wow this would be truly fantastic. This issue has been plaguing me for almost a year! I am so curious what the issue is, my PBX is not able to use the res_timer_timerfd so I am limited to pthread & dahdi timers. It definitely does seem like a dahdi issue (disabling dahdi works around it but causes extreme instability on my system).
By: Leif Madsen (lmadsen) 2011-06-04 10:08:04
Issues with res_timing_timerfd may be resolved by ASTERISK-16711 FYI
By: Vladimir Mikhelson (vmikhelson) 2011-06-04 15:19:05
Patch for issue 0018028 does not fix the problem.
By: Luke H (luckman212) 2011-06-04 19:06:10
My system doesn't even have res_timing_timerfd (old kernel) so that patch wouldn't apply
By: Luke H (luckman212) 2011-07-20 03:42:23.617-0500
> I'm currently taking a look at this issue and hope to have a fix available for you guys soon.
bbryant: just wondering if you ever had a chance to look at this mysterious issue. I've upgraded to Ast 1.8.5 and the issue hasn't budged.
By: Luke H (luckman212) 2011-08-04 22:45:48.309-0500
Sorry to keep posting on this issue but this has been plaguing me for over a year now. Is there anyone with any hint, whim, thought, or glimmer of an idea on what we can do to fix this issue?
By: Luke H (luckman212) 2011-09-08 07:50:29.006-0500
One other detail I would like to add, not sure if it's a clue: normal (non-streaming) MOH sources continue to function even after this 'moh reload'. i.e. the "default" music class still plays back the familiar Asterisk canned MOH files (which are WAV format). So the musiconhold module itself is still functioning at least to some extent. Could this be a problem with format_mp3?
Another note: I compiled and tested with the just-released mpg123 v1.13.4 this morning and - no change in this behavior.
By: zzsurf (zzsurf) 2011-09-08 11:21:12.729-0500
Same pb for me
MOH works fine with files, but with a stream (custom sox or mpg123) after the first phone call put on musiconhold, the process of sox is dead.
I have to restart asterisk, so it's impossible to use in production.
By: Luke H (luckman212) 2011-09-16 08:33:47.142-0500
Today updated to 1.8.7-rc1, DAHDI 2.5.1, Sangoma wanpipe 1.0.13 no change, bug persists
By: Luke H (luckman212) 2011-09-22 13:28:57.045-0500
Bug persists in 1.8.7-rc2
By: Luke H (luckman212) 2011-09-24 09:04:52.645-0500
Updated to 1.8.7 final, bug persists. I offered to create a bounty to fix this issue -- is anyone else interested in that??
By: Luke H (luckman212) 2011-10-10 00:19:22.286-0500
Bug persists in 1.8.8-rc1
By: Wes Van TIghem (wvt) 2011-10-20 16:41:26.652-0500
Reproduced and VERY frustrating on:
CentOS 5.7 (2.6.18-274.3.1.el5 x86_64)
DAHDI Version: 184.108.40.206
By: Luke H (luckman212) 2011-10-20 17:16:33.290-0500
Bug persists in 1.8.8-rc2 + DAHDI 220.127.116.11
I am curious if this bug is being deemed a 'will not fix' type of issue? Not sure how to get this fixed, I have offered a bounty and posted to asterisk-dev but it just seems like nobody wants to look at this. Is it really that rare of a use-case? I guess people really like to listen to the same MOH files play over and over again hehe.
By: Luke H (luckman212) 2011-11-07 08:30:24.117-0600
I posted this morning on asterisk-dev, now offering $250 of my own cash and hoping to get some extra additional support from perhaps Vladimir Mikhelson, Wes Van Tighem, "zzurf" and anyone else affected by this bug! I would like to get the bounty up to $500. Please join asterisk-dev if you haven't already and volunteer your support so we can close this one out!
By: Michael L. Young (elguero) 2011-11-07 11:54:22.604-0600
What version of mpg123 are you using?
By: Luke H (luckman212) 2011-11-07 14:56:50.339-0600
I'm currently testing with 1.13.4 (latest). I've always kept current with the mpg123 versions as they've been released. I started with v1.10.1 back in early last year and have steadily updated to all the point releases: 1.11, 1.12, 1.13, 1.13.1, 1.13.2 and so on.
By: Michael L. Young (elguero) 2011-11-07 15:51:11.073-0600
Thanks. I tried to reproduce on a Fedora 15 vm, using asterisk 1.8 branch and dahdi trunk. Used the latest mpg123 (1.13.4) just like you have. I tried not loading the res_timing_* modules and also tried to reproduce by only loading one timing module at a time in order to test each timing interface since I see above you mentioned this only occurred with res_timing_dahdi.so, but I could not reproduce.
I am going to try to reproduce on a Centos 5.7 machine with the latest dahdi release and asterisk 1.8 release in order to see if I can get any clues as to where one should be looking.
What does "module show like res_timing" show?
By: Luke H (luckman212) 2011-11-07 16:09:44.693-0600
Here's the output of 'module show like res_timing' on my test system:
Module Description Use Count
res_timing_dahdi.so DAHDI Timing Interface 1
res_timing_pthread.so pthread Timing Interface 1
2 modules loaded
I'm on CentOS 5.5 myself (kernel is 2.6.18) so not sure if it's a kernel issue. Interesting that you could not repro on Fedora 15, which AFAIK uses 2.6.38 so that may well be the explanation.
By: Michael L. Young (elguero) 2011-11-10 22:31:28.185-0600
Okay, I think I narrowed down what the problem here is.
The attached patch does the following:
- Changes the load priority so that this module is loaded after the timing interfaces are.
This is why it would work until a reload. More than likely res_timing_pthread was being used at first and then since res_timing_dahdi takes priority as a timer, upon a moh reload, the timing changed to another interface and would just sit there, hence the need for the following change.
- Adds the POLLPRI event for ast_poll, otherwise ast_poll just sits there waiting if the timer being used is res_timing_dahdi.so;
- Attempt to cleanup a few items
By: Michael L. Young (elguero) 2011-11-10 22:45:49.263-0600
Reviewboard link: https://reviewboard.asterisk.org/r/1578/
By: Thomas Arimont (tomaso) 2011-11-11 04:32:47.797-0600
sorry, I've tested the patch (applied on 18.104.22.168) but it does not work at all. Now, even after a core restart my mpg123 is not starting anymore:
[Nov 11 11:18:05] DEBUG: res_musiconhold.c:857 _get_mohbyname: Music on Hold class 'default' not found in memory
[Nov 11 11:18:05] DEBUG: res_musiconhold.c:857 _get_mohbyname: Music on Hold class 'default' not found in memory
; This file was automatically generated by ASTGEN 2.0
; (C) DATUS AG, 2007-2011
; Please do not edit. Changes will be lost!
root@Indali-1:/userdata/sounds/onhold# ls -l
-rw------- 1 root root 197804 Nov 11 11:24 17.mp3
Indali-1*CLI> moh show classes
Indali-1*CLI> moh show files
Indali-1*CLI> timing test
Attempting to test a timer with 50 ticks per second.
Using the 'DAHDI' timing module for this test.
It has been 1019 milliseconds, and we got 51 timer ticks
Indali-1*CLI> moh show classes
By: Thomas Arimont (tomaso) 2011-11-11 08:20:51.586-0600
I have to correct my bug description from issue Asterisk-18843.
It seems that a simple 'core reload' without any config changes does not force the observed behaviour.
I have to make config changes (e.g. extensions.conf or queue.conf) first before reloading ('core reload'). Only in these cases the mpg123 stream seems to die.
So I have attached both 'core reload' logs (with and without config changes) to find out the difference.
Infact the first reload (without config changes) is a 'CLI> core reload', the second reload (with config changes) is a asterisk -rx 'module reload' from external, but I think there is no difference between this two ways of reloading.
Sorry, about my confused posting:
The difference between those two cases is that musicconhold.conf is rewritten in the second reload case (although is was not changed, this is the behaviour of our 'configurator') This seems to be the important thing and explains why the res_musicinhold.conf is only parsed in the second case.
So finally to repoduce this issue only the following steps are necessary:
a) make a 'touch' on musiconhold.conf
b) apply cli 'moh reload'
By: Michael L. Young (elguero) 2011-11-11 08:54:54.922-0600
I see in ASTERISK-18843, you mentioned that you were preloading res_musiconhold.so. If you do not preload res_musiconhold.so when using the patch, does it then work?
The only change I can see in the patch is that if no timer interface is present (if you are loading res_musiconhold before the timer interfaces) the function to load the mp3 classes returns without doing anything. That would explain why with the patch the default class is not registered.
Can you give that a try and report back?
By: Thomas Arimont (tomaso) 2011-11-11 08:59:37.195-0600
Sorry, Please reread my last comment, I made some changes here.
I will try your suggested step immediately ...
By: Thomas Arimont (tomaso) 2011-11-11 09:14:07.849-0600
when I remove the preload => res_musiconhold.so from modules.conf your patch seems to work now!
Good job! Thanks!
By: Shaun Ruffell (sruffell) 2011-11-11 09:44:49.105-0600
Just linking to ASTERISK-16919 since that was POLLPRI related as well and I had looked for that issue whether the way DAHDI used POLLPRI had changed recently or not.
By: Luke H (luckman212) 2011-11-11 09:47:22.363-0600
I'm going to test the patch against 1.8.8-rc3 in a few minutes!
By: Luke H (luckman212) 2011-11-11 23:16:19.239-0600
I'm pinching myself. I think this patch works! Haven't done much testing but so far so good! Tested on Ast 1.8.8-rc3, CentOS 5.5 (32bit), DAHDI 22.214.171.124 and no issues yet. Michael Young you are a godsend!
By: Michael L. Young (elguero) 2011-11-14 10:55:56.218-0600
Nice! Glad to hear it is working.
In regards to POLLPRI issue, I also had tried to see if there was any recent changes in regards to using DAHDI as a timer but didn't see anything. The only change I could see that perhaps introduced this regression in res_musiconhold was probably when the module was converted to using the asterisk timing API last year instead of using DAHDI (psuedofd). http://svnview.digium.com/svn/asterisk?view=revision&revision=283771
By: Shaun Ruffell (sruffell) 2011-11-15 12:01:01.410-0600
Ahh..yes, that makes sense. The psuedofd would return POLLIN since it was actually reading data from the kernel, where the timing API uses /dev/dahdi/timer instead of /dev/dahdi/psuedo and that has never returned POLLIN when an event has fired. So, thanks for linking that revision.
By: Luke H (luckman212) 2011-11-17 12:48:00.188-0600
Been running with this patch for 6 days now, MOH has not stopped despite a handful of module reloads. Would love to see this patch committed to the trunk before 1.8.8 goes final!
By: Luke H (luckman212) 2011-11-17 13:53:50.357-0600
Also wanted to add that I have now tested the patch with 1.8.8-rc4 and it works fine.
By: Luke H (luckman212) 2011-11-18 08:39:56.705-0600
I noticed that elguero has submitted the patch to Reviewboard:
By: Mike Spuhler (mspuhler) 2011-11-28 15:30:48.417-0600
Tested it with 1.8.346239 SVN and it appears to work fine. Any idea how long something takes to get approved on the reviewboard? It apperas that sometimes they are marked "ship it" right away, while other times, the requests sit and linger.
By: Michael L. Young (elguero) 2011-12-07 12:55:15.364-0600
v2 - addressing comments on review board
By: Michael L. Young (elguero) 2011-12-07 13:12:42.334-0600
v3 - address another comment on rb... move unlock to immediately after the removal of the member object from the list
By: Luke H (luckman212) 2011-12-09 16:32:02.228-0600
Just wanted to share some good news, I installed Michael's latest "v3_branch-1.8.diff" patch against 1.8.8-rc5 and it compiles ok and still works fine! Can we have this baked into 1.8.8 final?? has it passed RB testing yet?
By: Michael L. Young (elguero) 2011-12-09 22:15:55.134-0600
Hopefully, the patch will be committed soon. A couple developers have reviewed the patch and one as given the okay to commit it. I do not have commit access and therefore I posted patches here for all supported branches to hopefully help get it in soon. We will need a developer with commit access to commit the patches.
More than likely, if it goes in soon, it would be in 1.8.9.
Thanks for testing the patches and reporting back.
By: Luke H (luckman212) 2011-12-15 22:32:53.455-0600
Just wanted to confirm that I've patched & built with 126.96.36.199 final and... still working! Hopefully they'll ship this with 188.8.131.52. Thanks again.