ASTERISK-21775: FPE during MOH playback

[Home]

Summary: ASTERISK-21775: FPE during MOH playback

Reporter: James Sharp (jpsharp) Labels:

Date Opened: 2013-05-09 15:39:50 Date Closed: 2017-12-18 19:09:23.000-0600

Priority: Major Regression?

Status: Closed/Complete Components: Resources/res_musiconhold

Versions: 1.8.21.0 Frequency of
Occurrence Occasional

Related
Issues:
is related to ASTERISK-22252 res_musiconhold cleanup - REF_DEBUG reload warnings and ref leaks

is related to ASTERISK-22083 res_musiconhold segfault in free, in moh_scan_files

Environment: Linux pbx 2.6.18-348.1.1.el5 #1 SMP Tue Jan 22 16:19:19 EST 2013 x86_64 x86_64 x86_64 GNU/Linux Attachments: ( 0) asterisk-21775-lock-class_1.8.diff
( 1) asterisk-21775-ref-mohclass_1.8.diff
( 2) asterisk-bt-11-4.txt
( 3) backtrace.txt
( 4) backtrace-211213.txt
( 5) backtrace-220113.txt
( 6) backtrace-26Aug.txt

Description: Asterisk has dropped several calls and triage found core files in the MOH directory. I will attach the backtrace and diagnostics.

Comments: By: James Sharp (jpsharp) 2013-05-09 15:48:16.393-0500

From the gdb backtrace:

#0 0x00002b57e084e99d in moh_files_readframe (chan=0x10040f38, data=<value optimized out>, len=<value optimized out>,
samples=<value optimized out>) at res_musiconhold.c:309
309 state->pos = ast_random() % state->class->total_files;

Some digging into the strucures showed that class->total_files is 0, thus triggering the FPE when trying to modulo divide by zero.

(gdb) print ((struct mohclass *)((struct ast_channel *) 0x10040f38)->musicclass)->total_files
$8 = 0

By: Michael L. Young (elguero) 2013-05-09 16:53:40.202-0500

Hmm... Looking at line 294, not sure how we would get this far since we check to see if it state->class->total_files is 0 and if it is then we return.

Can you attach an un-optimized backtrace?

https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

Thanks

By: James Sharp (jpsharp) 2013-05-09 19:38:12.403-0500

It is a production system, so I'll see if I can convince the system admins to allow that.

As far as how total_calls got set to 0 between lines 294 and 309, is it possible that other threads are stepping on each other with something like rapid/simultaneous reloads?

By: James Sharp (jpsharp) 2013-05-13 13:07:37.325-0500

I've had another FPE during MOH. Backtrace is as follows:

Program terminated with signal 8, Arithmetic exception.
#0 0x00002b857f948807 in moh_files_readframe (chan=0x2b859c32ad08,
data=<value optimized out>, len=<value optimized out>,
samples=<value optimized out>) at res_musiconhold.c:331
331 state->pos %= state->class->total_files;

(gdb) print state->pos
Attempt to extract a component of a value that is not a structure pointer.

(gdb) print state
$1 = 0

So something odd is going on here.

By: Rusty Newton (rnewton) 2013-05-30 12:01:21.941-0500

For this issue to proceed any further, we'll need:

* Full un-optimized traces as described here https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace

* Please attach them to the issue as .txt files using (More Actions > Attach Files)

* dialplan (extensions.conf) showing the dialplan involved where this issue occurs.

* Asterisk log with VERBOSE and DEBUG set to level 5 logged during reproduction of the issue. https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information
By: James Sharp (jpsharp) 2013-06-11 19:18:21.658-0500

Here's the backtrace.
By: James Sharp (jpsharp) 2013-06-11 19:20:41.353-0500

And that's about the best we can do. Can't do a console log because we can't reproduce the bug on command. These backtraces were generated by unoptimized code.

By: Richard Mudgett (rmudgett) 2013-06-11 20:07:10.777-0500

The backtrace looks the same. It appears the MOH class may have a reference count problem where the reference is decremented while the pointer is still used.
By: Michael L. Young (elguero) 2013-06-14 13:06:02.456-0500

Are you using realtime at all? It looks like you might be.

It looks like we might be missing adding a reference to the channel's music class when discovering that there already is a music class on the channel. We are assigning the channel's music class to mohclass without adding a reference to it and then we un-reference it later on which is probably causing the problem.

Can you give this patch a try and report back? [^asterisk-21775-ref-mohclass_1.8.diff]

Thanks
By: James Sharp (jpsharp) 2013-06-14 13:43:40.031-0500

I'll install the patch. Based on your idea of the fault is there something we can do to tickle the bug and verify whether or not the patch solves the problem? The issue is kinda sporadic, so I don't want to patch the system and say we fixed it then have it crash a few weeks later.

By: Michael L. Young (elguero) 2013-06-14 14:26:51.041-0500

What is the longest you have gone before a crash? I am thinking we could just let it go for a few weeks or however long is good enough to say it is working, with the patch installed and we can leave this issue in "Waiting for feedback" to allow enough time to see if a crash occurs.
By: Matt Jordan (mjordan) 2013-07-07 19:59:02.791-0500

Have you had a chance to test Michael's patch? If so, have you had any other further problems with this issue?
By: Rusty Newton (rnewton) 2013-08-01 15:17:58.795-0500

Michael, we haven't had any feedback from James. We could assume that he has tested and things are fine, but he may not have tested at all.

If you are confident in your patch not breaking anything (or preferably have a way to test that) then do you want to move forward with it?

By: Michael L. Young (elguero) 2013-08-01 19:00:19.872-0500

Rusty, I tried to see if it was reproducible in my environment with the settings based on what I could get out of the backtrace (realtime with format_mp3) and cannot trigger the problem.

I just re-looked at the patch and the code surrounding it. It is essentially a one-liner since, as the comment in the codes states, the other line is never reached based on the current code. But for completeness sake, I included that line in there.

I did some testing with the patch installed and I would say it is safe to commit. I don't foresee it breaking anything and should be safe... but as you know, testing it in the environment that has the problem and seeing that it fixes it sure helps give a bigger boost of confidence.
By: Rusty Newton (rnewton) 2013-08-06 17:04:02.695-0500

Thanks Michael.

Also sounds like you have a "ship it" from Walter in review https://reviewboard.asterisk.org/r/2742/

"(4) Added the patch by Michael Young from ASTERISK-21775, which seems correct to me."

Feel free to commit and close as your will or you can wait till it goes in with Walter's res_musiconhold cleanup.

[Edits - I had misread review 2742 as being submitted by Tilghman, fixed]
By: James Sharp (jpsharp) 2013-08-26 17:00:13.849-0500

Hi all. My client had been testing error free for quite sometime, but it seems to have reared its ugly head again. Another crash:

#0 0x00002b6a5aef63fb in ast_moh_files_next (chan=0x2b6a840b8a88) at res_musiconhold.c:331
331 state->pos %= state->class->total_files;
(gdb) print state
$1 = (struct moh_files_state *) 0x2b6a8c0df650
(gdb) print state->class->total_files
$2 = 0

I'll attach the full backtrace as well.
By: Michael L. Young (elguero) 2013-08-27 17:02:08.001-0500

James,

Thanks for reporting back.

I think I may see the problem now. There appears to be a race condition.

Thread 1 runs ast_moh_files_next and when it checks class->total_files, it is not set to 0 yet. Thread 2 is running moh_scan_files and sets class->total_files = 0 and then, in Thread 1, you hit the line "state->pos %= state->class->total_files;" where class->total_files is now 0.

That is my theory on what is happening.
By: Karsten Schmidt (guggemand) 2013-08-28 07:03:53.335-0500

I just today got the following crash
asterisk[14854] trap divide error ip:7ffd8b692479 sp:7ffd5f0791e0 error:0 in res_musiconhold.so[7ffd8b68d000+b000]

I'm using realtime for my moh.

I have 122 classes configured, with a total of 444 files.
222 .wav files and 222.al files.

Each class has sort set to random in the realtime table

Edit:
{noformat}
14:17 < wdoekes> thanks Gugge, and cachertclasses=yes ?
14:18 < Gugge> wdoekes: yes, forgot to write that. cachertclasses=yes is set
{noformat}
By: James Sharp (jpsharp) 2013-11-04 10:14:32.888-0600

Another crash again. Backtrace attached. Looks to be identical to the racecondition suggestion.

By: Michael L. Young (elguero) 2013-12-11 15:49:47.250-0600

Give this patch a try [^asterisk-21775-lock-class_1.8.diff]. It adds a lock around the class object to prevent other threads from changing it when we are in the middle of using it.
By: Sébastien Couture (sysreq) 2013-12-23 08:30:19.412-0600

We've had two occurrences of this issue over the last couple days. We're on 1.8.22.0, using realtime moh with cachertclasses=yes.

I've attached a backtrace, but it seems to be the exact same issue. I'll give Michael's patch a try tonight.
By: James Sharp (jpsharp) 2013-12-23 16:55:30.643-0600

Well, the new patch did do something different.

I haven't laid out the design of the system before and now it seems relevant.

There's two Asterisk boxes that run Enswitch and share the MOH directory over NFS mounted from the a 3rd machine.

pbx #1 has the newest patch applied.
pbx #2 just has the previous ref_mohclass patch applied.

"Something" happened this morning. The asterisk process on pbx #1 locked up, requiring a reset.
pbx #2, at the same time, segfaulted with the previously seen FPE core dumps.

I'm now leaning towards a possible hiccup in NFS, but still, Asterisk needs to be able to live over that or at least not completely hang.

By: Sébastien Couture (sysreq) 2013-12-30 15:40:23.608-0600

I have yet to experience the issue again, since I've applied the newest patch (6 days ago). That being said, it is the holidays and we've seen a lesser amount of concurrent calls on our servers, which in turn could mean a lower probability of reproducing the bug.

I would give this one a little more time.
By: Sébastien Couture (sysreq) 2014-01-13 11:17:36.385-0600

Since applying the latest patch on December 23rd, I have not experienced the issue again. I have now patched all of my servers, which are running 1.8.25.0.
By: Sébastien Couture (sysreq) 2014-01-22 14:25:30.130-0600

I've had the issue happen on 1.8.25.0, using the patch from Dec 11th. I've attached a backtrace.
By: Michael L. Young (elguero) 2014-01-22 17:09:15.893-0600

That's a bummer... I was hoping that the issue was solved and that James' problem was narrowed down to NFS and coming up with a way to handle NFS hiccups.

Thanks for the feedback.
By: Walter Doekes (wdoekes) 2014-05-28 08:26:52.484-0500

This is the same problem as ASTERISK-22083, look at {{class=0x2b9ae0071268}}

In both threads, that {{class}} is used and moh_scan_files is destructive.

{noformat}
Thread 2 (Thread 26748):
#0 0x0000003baa6c7b97 in chdir () from /lib64/libc.so.6
#1 0x00002b9ac0a73283 in moh_scan_files (class=0x2b9ae0071268) at res_musiconhold.c:1074
#2 0x00002b9ac0a74dae in local_ast_moh_start (chan=0x2b9abc15d558, mclass=0x2b9ad9bc3eb0 "default", interpclass=0x0) at res_musiconhold.c:1516

Thread 1 (Thread 0x2b9adacf5940 (LWP 26746)):
#0 0x00002b9ac0a708e0 in moh_files_alloc (chan=0x2b9ae0071fc8, params=0x2b9ae0071268) at res_musiconhold.c:432
#1 0x000000000046eef1 in ast_activate_generator (chan=0x2b9ae0071fc8, gen=0x2b9ac0c789c0, params=0x2b9ae0071268) at channel.c:3150
#2 0x00002b9ac0a74ebe in local_ast_moh_start (chan=0x2b9ae0071fc8, mclass=0x2b9adaceeeb0 "default", interpclass=0x0) at res_musiconhold.c:1533
{noformat}

As soon as someone schedules a reload, things "might" break. In the other ticket, it happens more frequently, but the cause is the same: there is no locking of the class->stuff and they're free to be mutated by moh_scan_files. The free loop there runs for a longer time than the modulo here, but other than that, no difference.

In https://issues.asterisk.org/jira/secure/attachment/49162/backtrace-220113.txt 10 bucks says that {{state->class}} in {{state->pos %= state->class->total_files;}} is the same class as {{moh_scan_files (class=0x3b483c8)}}.

The only times moh_scan_files() is called (apart from alloc/init time) is when either using cachertclasses=yes or issuing a 'module reload res_musiconhold' or a 'moh reload'.

The fix, I think (other than adding loads of locking):
- (re)scan_files would have to build a new object, and replace that. and the others should decref the old one until it goes to zero and can be freed.
- combine that with fewer calls to moh_scan_files for realtime (why reload if it was just reloaded?)

By: Joshua C. Colp (jcolp) 2017-12-18 19:09:23.847-0600

I'm closing this out and tracking the problem under ASTERISK-22083.