ASTERISK-09989: On Mac OS X PowerPC, Asterisk 1.4.x stops bridging new calls shortly after start

[Home]

Summary: ASTERISK-09989: On Mac OS X PowerPC, Asterisk 1.4.x stops bridging new calls shortly after start

Reporter: John Covert (jcovert) Labels:

Date Opened: 2007-07-31 09:10:36 Date Closed: 2010-06-30 08:10:24

Priority: Major Regression? No

Status: Closed/Complete Components: Core/Portability

Versions: Frequency of
Occurrence

Related
Issues:

Environment: Attachments: ( 0) 115843-config.log
( 1) Asterisk_1.4.test.notes
( 2) bug
( 3) config.log
( 4) configure.patch
( 5) dnspatch.diff
( 6) Makefile.patch

Description: I have unsuccessfully tried to get every 1.4.x version of Asterisk to work on Mac OS X PowerPC, most recently Asterisk 1.4.9 on Mac OS X 10.4.10.

Asterisk starts up fine, and works for a short while. But after about the third or fourth bridged call (local sip soft client, SJphone, or locally connected SIP ATAs Cisco ATA 186), RTP read too short errors start ocurring, or other dead call problems. It is still possible to connect to IVRs within asterisk, and inbound IAX2 calls from a 1.2.22 version of Asterisk to the 1.4.9 version still work, but originating SIP devices are unable to reliably make outbound calls.

****** STEPS TO REPRODUCE ******

Make several outbound IAX2 calls from a locally connected SIP client.

****** ADDITIONAL INFORMATION ******

Will attached detailed log with SIP, IAX2, and RTP debug on

Comments: By: John Covert (jcovert) 2007-07-31 09:12:46

I should note that exactly this configuration works 100% reliably on 1.2.22
By: John Covert (jcovert) 2007-07-31 09:37:43

More information. If a open a second asterisk console at the same time, i.e. another asterisk -vvvr, the following happens in the original console (compare with the attached file with the full debug output):

-- Goto (cnetlookup,14559900,1)
-- Executing [14559900@cnetlookup:1] GotoIf("SIP/localsjphone-0184a600", "1?v1d4") in new stack
-- Goto (cnetlookup,14559900,6)
[Jul 31 10:39:45] WARNING[7939]: asterisk.c:923 netconsole: poll returned < 0: Bad file descriptor
(the above error is repeated 82 times)
-- Remote UNIX connection disconnected
(at this point the second console is bounced off)
-- Executing [14559900@cnetlookup:6] Set("SIP/localsjphone-0184a600", "result=guest@speakup.octothorp.org/14559900") in new stack

and all is well and we actually get a good connection now.

But the very next call fails again. I note on that call that the RTP Read too short message only appears 93 times and then stops, but there is still no audio.
By: John Covert (jcovert) 2007-07-31 09:40:11

Here's another seqence of errors on another attempt:

-- Remote UNIX connection disconnected
-- Call accepted by 207.224.49.209 (format ulaw)
-- Format for call is ulaw
[Jul 31 10:55:56] WARNING[7939]: channel.c:937 ast_queue_frame: Unable to write to alert pipe on IAX2/207.224.49.209:4569-6, frametype/subclass 2/4 (qlen = 1): Broken pipe!
[Jul 31 10:55:56] WARNING[7939]: channel.c:937 ast_queue_frame: Unable to write to alert pipe on IAX2/207.224.49.209:4569-6, frametype/subclass 2/4 (qlen = 2): Broken pipe!
By: Olle Johansson (oej) 2007-12-16 03:02:54.000-0600

Is this still a problem with the latest 1.4? If so, we need to start working on this old bug report.
By: tfitch (tfitch) 2007-12-16 11:19:41.000-0600

As of Asterisk 1.4.14 running on PPC 10.4.11 it is still true for me and is the bug that is keeping me on the 1.2.24 build.

It is a bit of a pain for me to switch back and forth between 1.2 and 1.4, but I'll try to do that to confirm latest 1.4 SVN has the same issue today.
By: tfitch (tfitch) 2007-12-16 12:53:46.000-0600

Compiled and installed 1.4 SVN 93172 on MacMini PPC running 10.4.11. Made a dozen or so calls from SIP device (SPA-841) to and from SIP ITSP with no issues. Also made a number of SIP inbound and outbound calls to e164 numbers. All calls were fine.

I will continue to run this 1.4 SVN version and will report back in a few days if this issue crops up.
By: tfitch (tfitch) 2007-12-16 13:16:07.000-0600

v1.4 SVN 93172, MacMini PPC G4 running 10.4.11

Operator error: Found that previous runs today were using the v1.4 remote console against my old v1.2.24 daemon. Surprised that worked. In any case once I "corrected" that I immediately had one way voice on my first SIP call with the following on the console:

Connected to Asterisk SVN-branch-1.4-r93000M currently running on mini (pid = 17444)
Verbosity is at least 5
mini*CLI>
mini*CLI>
[Dec 16 11:11:56] WARNING[17444]: rtp.c:1142 ast_rtp_read: RTP Read too short
mini*CLI>
[Dec 16 11:11:56] WARNING[17444]: rtp.c:1142 ast_rtp_read: RTP Read too short

Hard to watch console with this setup because bug 0010202 (console disconnecting).
By: Caio Begotti (caio1982) 2008-01-11 12:03:43.000-0600

tfitch, I know it's a pain to switch 1.2 to 1.4 and back just for a few tests, but I'm having some issues with OSX 10.4.11 as well but none of them is just like this you're having... so, would you be able to test it on the latest 1.4 branch code, revision 98249?

PS: I'm not really getting console disconnects, that's why I'm asking for one more test using the latest 1.4 code
By: tfitch (tfitch) 2008-01-11 21:52:48.000-0600

Unfortunately it will a little while before I can get to this as I much prefer to do this "hands on" rather than through the VPN and need to do it at some time when it won't disrupt my users.
By: tfitch (tfitch) 2008-01-26 20:02:43.000-0600

Tried latest 1.4 SVN version again with largely the same result. Have uploaded my notes as "Asterisk_1.4.test.notes" for the full details. Executive summary:

1) Bug 10342 still exists on my system
2) Console disconnects, possibly bug 10202 still exists on my system

Going back to v1.2.24. . .
By: Mark Michelson (mmichelson) 2008-04-16 15:30:07

One big difference between 1.2 and 1.4 with regards to Mac OSX and the build system was that in 1.2, a Darwin system would not use the poll(2) system call but would instead use a version in the Asterisk source which actually used select(2) to determine ready file descriptors.

When the 1.4 build system changed to create a main/ directory, the logic for which version of poll to compile moved to that directory's Makefile. The result of this appeared to be that two different versions of poll() would be referenced in the build, and depending on which file poll() was called from, one or the other version would be used. This is bad.

I have written a patch which will revert 1.4 to properly using the version of poll() in the Asterisk source for Darwin systems. I have a feeling that this could be the cause behind issues ASTERISK-9876 and ASTERISK-9989, so I am posting the patches on both bugs.

Please test Makefile.patch and report if its use has solved the problem reported. Thank you very much!

By: John Covert (jcovert) 2008-04-17 08:27:47

I tried the makefile patches this morning. No luck. Asterisk stopped accepting new SIP connections shortly after it had been running.

BTW, if makefile changes are to be made the following probably unrelated changes are needed: utils/Makefile needs to filter out muted (as it does for BSD), and something prevents res_snmp from building, so I have been deselecting it in menuselect.

/john
By: Mark Michelson (mmichelson) 2008-05-08 17:11:01

I believe that I have reproduced this issue. I set up a Mac Mini in the hopes of reproducing this and issue ASTERISK-9876. This one I can reproduce reliably as long as I start asterisk with -vvvvvgc as the arguments. I place about three SIP-to-SIP calls and then boom, Asterisk gets in an unusable state. Like jcovert reported, I can still place calls to Asterisk, but if I try to place calls to a remote Asterisk server, the local server seems to get "stuck" during the Dial command. When this happens, the Asterisk console becomes completely unresponsive (i.e. when I type, characters do not appear on the screen and none of the commands I enter are executed). One thing that is a bit odd is that verbose messages still appear on the console. If I attach to the running process with asterisk -rvvvvv, then I can type commands, but they are apparently ignored. The absolute worst part is that when Asterisk gets in this state, there's no apparent way to stop the process. Killing it with the -9 flag doesn't even do it. I usually just restart the system to get things back to normal again.

Let me note that I spent the better part of a day attempting to reproduce this. I was able to make several thousand calls to and from Asterisk with no trouble at all as long as I started Asterisk using asterisk& and then connected via a remote console with asterisk -rvvvvvv.

So far I have been able to get a backtrace of the threads when the situation occurs. I will analyze this further and see if it is apparent if there's a deadlock or perhaps some sort of system call that is blocking forever.
By: John Covert (jcovert) 2008-05-08 18:38:58

This may or may not help in future reproduction of the problem:

Before deploying new versions of Asterisk to my production hom system (and incurring the wrath of She Who Does Not Tolerate Buggy Phone Systems), I test it on a 1.33 GHz PowerPC G4 Powerbook with 768 MB of DDR SDRAM.

In /Library/StartupItems/CovertEnvironment/CovertEnvironment/ there is:
#!/bin/sh
/sbin/ipfw -q /Users/jcovert/ipfw.conf
/Users/jcovert/asteriskinit.sh &

asteriskinit.sh is:

#!/bin/bash
# no renice in laptop version
(while true;do /usr/sbin/asterisk -f;done)>/dev/null 2>/dev/null&

Regards/john (writing while this week's Bootcamp in Las Vegas is doing an AGI lab)
By: Mark Michelson (mmichelson) 2008-05-08 19:57:52

All right, I have what hopefully will be great news. I believe I have solved this issue. The situation I described above stopped happening when I fixed up the configure script as in the patch configure.patch that I have attached here.

So, I'll run you through what was happening. After experiencing the apparent lock-up, I took a backtrace of the threads and found that there were some that appeared to be stuck deep, waaaaay deep, in dns-related system calls. What seemed odd to me was that the non-reentrant flavors of these dns calls were being used. Since you had mentioned that this same setup worked flawlessly in 1.2.22, I checked to see what was different between the two. What I noticed was that a check that was being done at compile-time in 1.2 had been moved to the configure script in 1.4. I checked my config.log file and found that the check for the existence of the function was failing even though it clearly did exist.

So, after some Googling and some configure-script voodoo, I finally was able to get the script to properly detect the existence of the res_ninit function. This appeared to solve the problem 100% for me. In addition, I was unable to reproduce issue 10202 anymore after this change.

So, I ask you now to please test the attached configure.patch. It contains a patch both for the configure file as well as the configure.ac file. While it isn't imperative that you patch your configure.ac file, I am including it for completeness. After patching, rerun configure and be sure that the line "Checking for res_ninit...yes" appears. This line will appear towards the bottom of the output, just little bit down from some lines that check for the existence of certain compiler attributes. After that, just rebuild Asterisk and hopefully you'll be bridging calls like a pro!
By: John Covert (jcovert) 2008-05-09 00:58:50

No joy.
We still get "no" for res_init. Even if I force it on by putting
| #define HAVE_RES_NINIT 1
| #define HAVE_RES_NDESTROY 1
in "manually" by making it go in wi
| #define HAVE_RTLD_NOLOAD 1
generates a version of asterisk with the same problem. I can get it to fail almost instantly, and in Version 1.4.19.1 asterisk exits immediately the second time I call from one line to another on a polycom phone.

I'm attaching my complete config.log

/john
By: snuffy (snuffy) 2008-05-09 06:52:36

there appears to be a compile issue there for conftest.c
By: John Covert (jcovert) 2008-05-09 08:35:50

Hi snuffy. conftest.c is dynamically created for each of the various possible configuration items; it's expected to fail when an item isn't present; that's how the determination is made. As each item is tested, the successful compile (and link if appropriate) of conftest.c is used to add lines like
#define HAVE_RES_NINIT 1
to the file confdefs.h. If the compile/link fails, the line is not added.

To see if the failure to detect res_ninit was really the problem, I forced the lines
#define HAVE_RES_NINIT 1
#define HAVE_RES_NDESTROY 1
into the file even though the test for res_ninit failed. (If the test for res_ninit fails, configure doesn't even bother to look for res_ndestroy.)

This made the resulting configuration behave as though res_init and res_ndestroy were present as far as configure were concerned. If this really is related to the problem, even if configure had properly detected them, something else is still failing to use them, i.e. the successful definition of the two symbols by configure isn't sufficient to solve the problem.

/john
By: Mark Michelson (mmichelson) 2008-05-09 08:51:17

Ah, forcing HAVE_RES_NDESTROY to be 1 could be what's causing your problem now. res_ndestroy is NOT defined on Mac OS X systems (or at least not by the libresolv that I was using). If you can, try another test where you force HAVE_RES_NINIT to be defined but HAVE_RES_NDESTROY to not be defined.

I'll have a look at the config.log to see why the conftest is failing and have a patch ready as soon as possible.
By: John Covert (jcovert) 2008-05-09 09:33:52

I started it "-c" in order to see why it's exiting:

poll.c:283: failed assertion `pArray != (struct pollfd *) NULL'

/john
By: Mark Michelson (mmichelson) 2008-05-09 09:46:48

Just as a means to be certain I'm not barking up the wrong tree, I'm going to upload a patch for main/dns.c that will revert its logic to use 1.2's macro definitions. If you don't mind, could you please run with this patch applied and see if the problem still exists? I would hate to go to all this work and then find out this wasn't what was causing the failed bridge attempts in the first place.

By the way, the output from your config.log file is quite bizarre. It's showing parse errors in system header files, which seems wrong to me. I'll look into it a bit more. Google to the rescue, I guess.
By: Mark Michelson (mmichelson) 2008-05-09 09:54:50

I have uploaded dnspatch.diff which reverts 1.4's dns.c to use 1.2's logic regarding which versions of functions to use.

I tested this on my Mac Mini and calls seemed to work just fine. Please test to see if this actually fixes the issue for you, too. Note that this patch makes it so that dns.c is not dependent on the configure script to find the existence of res_ninit. If this fixes the issue, let me know and I'll get to work trying to fix the configure script logic.
By: John Covert (jcovert) 2008-05-09 10:19:18

Argh.
JRCovert:~/asterisk/asterisk-1.4.19.1/main root# patch -b <dns.patch
patching file dns.c
Hunk #4 FAILED at 236.
Hunk ASTERISK-1 succeeded at 274 (offset -1 lines).
Hunk ASTERISK-2 succeeded at 293 (offset -1 lines).
1 out of 6 hunks FAILED -- saving rejects to file dns.c.rej

But I don't think this is it. I had previously backed everything I did out of configure, and then explicitly defined HAVE_RES_NINIT in autoconfig.c. When that didn't work I also tried just defining it directly in dns.c. (Yes, I did go back to the original dns.c before trying to apply the patch.)

The class is waiting for me.

/john
By: Joshua C. Colp (jcolp) 2008-05-12 10:00:13

I vastly changed how res_ninit and res_ndestroy are detected in the configure script the end of last week and confirmed they cause both to be detected (as they should) on my OSX Mac Mini. Please update to the latest branch of 1.4 and rerun configure, if they are not detected with these changes please attach the config.log and the version of OSX you are running.
By: John Covert (jcovert) 2008-05-13 00:01:08

By upgrading to XCode 2.5, I was able to get configure to find res_ninit and res_ndestroy. It also now builds muted, which it could not do before. And Apple has actually removed tar, but I had gtar in /sw/bin

However, Asterisk fails on the second outbound SIP call. Here's the CLI log for two SIP calls:

-- Call accepted by 192.168.0.100 (format ulaw)
-- Format for call is ulaw
-- IAX2/mainpbx-13005 answered SIP/localsjphone-01834000
[May 13 00:57:52] ERROR[1009]: chan_sip.c:15551 sipsock_read: We could NOT get the channel lock for SIP/localsjphone-01834000!
[May 13 00:57:52] ERROR[1009]: chan_sip.c:15552 sipsock_read: SIP transaction failed: 24902280-1DD2-11B2-BEFF-E02E4E3983F4@192.168.0.4
-- Hungup 'IAX2/mainpbx-13005'
== Spawn extension (localsjphone, 84, 1) exited non-zero on 'SIP/localsjphone-01834000'
poll.c:283: failed assertion `pArray != (struct pollfd *) NULL'
By: John Covert (jcovert) 2008-05-13 00:21:04

config.log uploaded.

Running Mac OS X 10.4.11

/john

By: Mark Michelson (mmichelson) 2008-05-14 13:36:26

The error you see with the message "We could NOT get the channel lock ..." is something that has been seen by a few users of the 1.4.20 release candidates and is something whose cause has not yet been tracked down. That would appear to not be a platform-specific issue. The assertion afterwards is definitely a darwin-specific issue; however, it may also be solved if the first issue is solved.
By: Joshua C. Colp (jcolp) 2008-06-10 10:07:34

I'm closing this out as the latest round of changes have fixed the platform specific issues.
By: John Covert (jcovert) 2008-06-10 19:57:14

I still cannot get this working on Darwin.

I've tried both trunk and a completely clean build of 1.4.21-rc2. The second SIP call causes asterisk to exit with a failed assertion.

<--- SIP read from 127.0.0.1:5060 --->
INVITE sip:953@covert.org SIP/2.0
Via: SIP/2.0/UDP 127.0.0.1;rport;branch=z9hG4bKc0a8016f0000004a484f240323950c16000000b9
Content-Length: 213
Contact: <sip:localsjphone@127.0.0.1:5060>
Call-ID: FC4DAF58-1DD1-11B2-B996-B7758EF11C67@192.168.1.111
Content-Type: application/sdp
CSeq: 1 INVITE
From: "John Covert"<sip:localsjphone@covert.org>;tag=196533811041561799
Max-Forwards: 70
To: <sip:953@covert.org>
User-Agent: SJphone/1.60.320/L (SJ Labs)

v=0
o=- 3422134915 3422134915 IN IP4 127.0.0.1
s=SJphone
c=IN IP4 127.0.0.1
t=0 0
a=direction:active
m=audio 49166 RTP/AVP 0 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-11,16

<------------->
--- (11 headers 10 lines) ---
Sending to 127.0.0.1 : 5060 (NAT)
Using INVITE request as basis request - FC4DAF58-1DD1-11B2-B996-B7758EF11C67@192.168.1.111

<--- Reliably Transmitting (no NAT) to 127.0.0.1:5060 --->
SIP/2.0 407 Proxy Authentication Required
Via: SIP/2.0/UDP 127.0.0.1;branch=z9hG4bKc0a8016f0000004a484f240323950c16000000b9;received=127.0.0.1;rport=5060
From: "John Covert"<sip:localsjphone@covert.org>;tag=196533811041561799
To: <sip:953@covert.org>;tag=as560d8d68
Call-ID: FC4DAF58-1DD1-11B2-B996-B7758EF11C67@192.168.1.111
CSeq: 1 INVITE
User-Agent: Asterisk PBX
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY
Supported: replaces
Proxy-Authenticate: Digest algorithm=MD5, realm="asterisk", nonce="0b417c8a"
Content-Length: 0

<------------>
Scheduling destruction of SIP dialog 'FC4DAF58-1DD1-11B2-B996-B7758EF11C67@192.168.1.111' in 32000 ms (Method: INVITE)
Found user 'localsjphone'

<--- SIP read from 127.0.0.1:5060 --->
ACK sip:953@covert.org SIP/2.0
Via: SIP/2.0/UDP 127.0.0.1;rport;branch=z9hG4bKc0a8016f0000004a484f240323950c16000000b9
Content-Length: 0
Call-ID: FC4DAF58-1DD1-11B2-B996-B7758EF11C67@192.168.1.111
CSeq: 1 ACK
From: "John Covert"<sip:localsjphone@covert.org>;tag=196533811041561799
Max-Forwards: 70
To: <sip:953@covert.org>;tag=as560d8d68
User-Agent: SJphone/1.60.320/L (SJ Labs)

<------------->
--- (9 headers 0 lines) ---

<--- SIP read from 127.0.0.1:5060 --->
INVITE sip:953@covert.org SIP/2.0
Via: SIP/2.0/UDP 127.0.0.1;rport;branch=z9hG4bKc0a8016f0000004a484f24030e4c8a9a000000ba
Content-Length: 213
Contact: <sip:localsjphone@127.0.0.1:5060>
Call-ID: FC4DAF58-1DD1-11B2-B996-B7758EF11C67@192.168.1.111
Content-Type: application/sdp
CSeq: 2 INVITE
From: "John Covert"<sip:localsjphone@covert.org>;tag=196533811041561799
Max-Forwards: 70
To: <sip:953@covert.org>
User-Agent: SJphone/1.60.320/L (SJ Labs)
Proxy-Authorization: Digest username="localsjphone",realm="asterisk",nonce="0b417c8a",uri="sip:953@covert.org",response="5c9aede53cb3a324b5b03eb729e2ede1",algorithm="MD5"

v=0
o=- 3422134915 3422134915 IN IP4 127.0.0.1
s=SJphone
c=IN IP4 127.0.0.1
t=0 0
a=direction:active
m=audio 49166 RTP/AVP 0 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-11,16

<------------->
--- (12 headers 10 lines) ---
Sending to 127.0.0.1 : 5060 (NAT)
Using INVITE request as basis request - FC4DAF58-1DD1-11B2-B996-B7758EF11C67@192.168.1.111
Found user 'localsjphone'
Found RTP audio format 0
Found RTP audio format 101
Peer audio RTP is at port 127.0.0.1:49166
Found audio description format PCMU for ID 0
Found audio description format telephone-event for ID 101
Capabilities: us - 0x8000e (gsm|ulaw|alaw|h263), peer - audio=0x4 (ulaw)/video=0x0 (nothing), combined - 0x4 (ulaw)
Non-codec capabilities (dtmf): us - 0x1 (telephone-event), peer - 0x1 (telephone-event), combined - 0x1 (telephone-event)
Peer audio RTP is at port 127.0.0.1:49166
Looking for 953 in dialstation25 (domain covert.org)
list_route: hop: <sip:localsjphone@127.0.0.1:5060>

<--- Transmitting (no NAT) to 127.0.0.1:5060 --->
SIP/2.0 100 Trying
Via: SIP/2.0/UDP 127.0.0.1;branch=z9hG4bKc0a8016f0000004a484f24030e4c8a9a000000ba;received=127.0.0.1;rport=5060
From: "John Covert"<sip:localsjphone@covert.org>;tag=196533811041561799
To: <sip:953@covert.org>
Call-ID: FC4DAF58-1DD1-11B2-B996-B7758EF11C67@192.168.1.111
CSeq: 2 INVITE
User-Agent: Asterisk PBX
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY
Supported: replaces
Contact: <sip:953@127.0.0.1:5062>
Content-Length: 0

<------------>
poll.c:283: failed assertion `pArray != (struct pollfd *) NULL'

By: John Covert (jcovert) 2008-06-10 20:29:20

I decided to see what happens if I comment out the assertion check at line 283 in poll.c. (Seatbelts off!) Asterisk stays up and runs for a while, but eventually the IAX2 channel gets very wedged, and in fact a "kill -9" is required to stop Asterisk.
By: John Covert (jcovert) 2008-06-10 20:46:19

Whoa! Darwin has poll() and doesn't need this emulation. I just rebuilt with the quickest way to eliminate it (#if 0 ... #endif around the whole poll.c module) and we have stayed up for a while.
By: Russell Bryant (russell) 2008-06-10 21:35:19

However, poll emulation should not hurt anything. In fact, on a mac, that's the only reliable choice, since some versions had a broken native poll implementation.
By: John Covert (jcovert) 2008-06-11 06:31:19

Well, by using Darwin's own poll function (Mac OS X 10.4.11) Asterisk has stayed up and running and has processed SIP and IAX2 calls for a long time: "System uptime: 8 hours, 18 minutes, 48 seconds", which I have been unable to achieve before with 1.4.

The real bug seems to be that Asterisk calls poll with pArray==NULL. The emulated routine uses "assert" to check for that case, and Asterisk exits. Since I stay up and running with the Darwin native routine, one of two things must be happening: either (a) the Darwin poll routine handles the case of being called with pArrary==0 in a graceful manner compatible with the situation at the call site or (b) the emulated routine must be partly responsible for Asterisk getting in the condition where it calls poll() with pArray==0.

The real test will be in two weeks when I use this laptop as part of the classroom infrastructure at the next Bootcamp. If that works, then I'll feel confident enough to bring 1.4 up on my "production" Asterisk system at home, where "she who does not tolerate buggy phone systems" must be obeyed.

/john
By: John Covert (jcovert) 2008-06-11 08:30:49

I've now discovered another problem...

With Asterisk started in the background, the connector (asterisk -r) does not display any output.

With due respect for the fact that russell indicates that some versions of Darwin (possibly even the latest 10.4 version) have a poll() function that Asterisk hasn't been able to use directly (either broken or "compatible" [aka "different"]) I have re-introduced the emulated poll() routine with the following change:

Instead of:

assert (pArray != (struct pollfd *) NULL);

I have:

if (pArray == 0) return -1; /* JRC */

No more failure on the second SIP call, and the connector works again. We'll see if this stays up and runs. However, the client I'm working for today does not allow "foreign" equipment on their site, so it will be this evening before I can do any further significant testing.

/john
By: John Covert (jcovert) 2008-06-11 08:57:37

That didn't last very long. After a while (and several SIP and IAX calls, we got wedged again, and it took a kill -9 to get Asterisk to go away).

Now trying

if (pArray == 0) return 0; /* JRC */

/john
By: John Covert (jcovert) 2008-06-11 09:17:02

That wedges after a while as well. What worked the best was the native Darwin poll(), but there the connector didn't work, only the direct CLI from running asterisk on the terminal.
By: Digium Subversion (svnbot) 2008-06-13 16:38:16

Repository: asterisk
Revision: 122713

U branches/1.4/main/autoservice.c

------------------------------------------------------------------------
r122713 | mmichelson | 2008-06-13 16:38:15 -0500 (Fri, 13 Jun 2008) | 9 lines

Short circuit the loop in autoservice_run if there are no channels to poll.
If we continued, then the result would be calling poll() with a NULL
pollfd array. While this is fine with POSIX's poll(2) system call, those
who use Asterisk's internal poll mechanism (Darwin systems) would have
a failed assertion occur when poll is called.

(related to issue ASTERISK-9989)

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=122713
By: Digium Subversion (svnbot) 2008-06-13 16:38:48

Repository: asterisk
Revision: 122714

_U trunk/
U trunk/main/autoservice.c

------------------------------------------------------------------------
r122714 | mmichelson | 2008-06-13 16:38:47 -0500 (Fri, 13 Jun 2008) | 17 lines

Merged revisions 122713 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r122713 | mmichelson | 2008-06-13 16:44:53 -0500 (Fri, 13 Jun 2008) | 9 lines

Short circuit the loop in autoservice_run if there are no channels to poll.
If we continued, then the result would be calling poll() with a NULL
pollfd array. While this is fine with POSIX's poll(2) system call, those
who use Asterisk's internal poll mechanism (Darwin systems) would have
a failed assertion occur when poll is called.

(related to issue ASTERISK-9989)

........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=122714
By: Digium Subversion (svnbot) 2008-06-13 16:41:09

Repository: asterisk
Revision: 122715

_U branches/1.6.0/
U branches/1.6.0/main/autoservice.c

------------------------------------------------------------------------
r122715 | mmichelson | 2008-06-13 16:41:08 -0500 (Fri, 13 Jun 2008) | 25 lines

Merged revisions 122714 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r122714 | mmichelson | 2008-06-13 16:45:21 -0500 (Fri, 13 Jun 2008) | 17 lines

Merged revisions 122713 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r122713 | mmichelson | 2008-06-13 16:44:53 -0500 (Fri, 13 Jun 2008) | 9 lines

Short circuit the loop in autoservice_run if there are no channels to poll.
If we continued, then the result would be calling poll() with a NULL
pollfd array. While this is fine with POSIX's poll(2) system call, those
who use Asterisk's internal poll mechanism (Darwin systems) would have
a failed assertion occur when poll is called.

(related to issue ASTERISK-9989)

........

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=122715
By: Digium Subversion (svnbot) 2008-06-13 17:12:28

Repository: asterisk
Revision: 122724

_U team/seanbright/resolve-shadow-warnings/
U team/seanbright/resolve-shadow-warnings/channels/chan_iax2.c
U team/seanbright/resolve-shadow-warnings/include/asterisk/channel.h
U team/seanbright/resolve-shadow-warnings/include/asterisk/timing.h
U team/seanbright/resolve-shadow-warnings/main/asterisk.c
U team/seanbright/resolve-shadow-warnings/main/autoservice.c
U team/seanbright/resolve-shadow-warnings/main/channel.c
U team/seanbright/resolve-shadow-warnings/main/dial.c
U team/seanbright/resolve-shadow-warnings/main/file.c
U team/seanbright/resolve-shadow-warnings/main/timing.c
U team/seanbright/resolve-shadow-warnings/res/res_config_pgsql.c
A team/seanbright/resolve-shadow-warnings/res/res_timing_dahdi.c

------------------------------------------------------------------------
r122724 | seanbright | 2008-06-13 17:12:27 -0500 (Fri, 13 Jun 2008) | 87 lines

Merged revisions 122523,122526,122557,122616,122664,122714,122716 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r122523 | russell | 2008-06-13 08:45:50 -0400 (Fri, 13 Jun 2008) | 11 lines

Merge changes from timing branch

- Convert chan_iax2 to use the timing API
- Convert usage of timing in the core to use the timing API instead of
using DAHDI directly
- Make a change to the timing API to add the set_rate() function
- change the timing core to use a rwlock
- merge a timing implementation, res_timing_dahdi

Basic testing was successful using res_timing_dahdi

................
r122526 | russell | 2008-06-13 08:53:08 -0400 (Fri, 13 Jun 2008) | 3 lines

Do not allow res_timing_dahdi to be unloaded. We can re-enable this once we
add automatic use count handling for timing modules.

................
r122557 | tilghman | 2008-06-13 10:15:07 -0400 (Fri, 13 Jun 2008) | 7 lines

Convert one more delimiter to use comma.
(closes issue ASTERISK-12187)
Reported by: bcnit
Patches:
20080613__bug12850.diff.txt uploaded by Corydon76 (license 14)
Tested by: bcnit

................
r122616 | jpeeler | 2008-06-13 13:38:28 -0400 (Fri, 13 Jun 2008) | 13 lines

Blocked revisions 122613 via svnmerge

........
r122613 | jpeeler | 2008-06-13 12:36:56 -0500 (Fri, 13 Jun 2008) | 6 lines

(closes issue ASTERISK-12184)
Reported by: Netview
Tested by: jpeeler

Use correct location to search for tonezone.

........

................
r122664 | jpeeler | 2008-06-13 14:58:29 -0400 (Fri, 13 Jun 2008) | 8 lines

Blocked revisions 122663 via svnmerge

........
r122663 | jpeeler | 2008-06-13 13:57:24 -0500 (Fri, 13 Jun 2008) | 1 line

fixed dahdi compatability header from assuming either dahdi or zaptel is installed (may not have either)
........

................
r122714 | mmichelson | 2008-06-13 17:45:21 -0400 (Fri, 13 Jun 2008) | 17 lines

Merged revisions 122713 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r122713 | mmichelson | 2008-06-13 16:44:53 -0500 (Fri, 13 Jun 2008) | 9 lines

Short circuit the loop in autoservice_run if there are no channels to poll.
If we continued, then the result would be calling poll() with a NULL
pollfd array. While this is fine with POSIX's poll(2) system call, those
who use Asterisk's internal poll mechanism (Darwin systems) would have
a failed assertion occur when poll is called.

(related to issue ASTERISK-9989)

........

................
r122716 | tilghman | 2008-06-13 17:50:28 -0400 (Fri, 13 Jun 2008) | 2 lines

Properly detect the size of char/varchar fields

................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=122724
By: Mark Michelson (mmichelson) 2008-06-13 17:18:44

I spotted a condition in autoservice which could cause that assertion to trip in poll.c and have fixed it in revision 122713. I don't know if that will account for all places where the condition may occur, but that one in particular is now taken care of.

If you trigger that assertion in poll.c again and Asterisk dumps core as a result, then if you upload backtraces we can potentially fix other areas too.
By: John Covert (jcovert) 2008-06-13 19:58:18

The assertion is only one part of the problem. Recall from Note 0088584 that I had been able to stay up a long time and run well EXCEPT for the console connector when I made asterisk use Darwin's poll(2). I then tried returning either -1 or 0 from the emulated poll if it was called with pArray==0 (I figured the Posix poll(2) probably does one of those) and that made the console connector work again, but reintroduced the hanging behaviour.

I get the same problem with 122713. No assertion failures, but we just don't stay up and running very long.

There seems to be something else about this emulated poll that isn't right.

Below, the first incoming IAX2 call fails, and then a stop now puts asterisk into a loop of some sort.

Asterisk SVN-branch-1.4-r122713, Copyright (C) 1999 - 2008 Digium, Inc. and others.

*CLI> -- Accepting AUTHENTICATED call from 192.168.0.100:
> requested format = ulaw,
> requested prefs = (ulaw|alaw|g726|gsm|ilbc),
> actual format = ulaw,
> host prefs = (),
> priority = caller
-- Executing [12640001@inbound-cnet-264:1] Answer("IAX2/mainpbx-15272", "300") in new stack
-- Executing [12640001@inbound-cnet-264:2] NoOp("IAX2/mainpbx-15272", "") in new stack
-- Executing [12640001@inbound-cnet-264:3] Playback("IAX2/mainpbx-15272", "/Users/jcovert/asterisk/laptop-pbx") in new stack
-- <IAX2/mainpbx-15272> Playing '/Users/jcovert/asterisk/laptop-pbx' (language 'en')

but it doesn't play; it's hung

*CLI>
*CLI>
*CLI> stop now

but we don't exit; we are looping

(first line in top -u below)

20005 asterisk 43.6% 2:00.38 >> 172 655 5.80M 14.4M 9.93M 104M
By: John Covert (jcovert) 2008-06-13 21:24:27

The time I had it up and running all night using Darwin's poll(), I might have just been lucky (or unlucky, depending on how you look at it). I can now make it hang in almost every case with an incoming IAX2 call.

I decided to throw in some debug output into the emulated poll() after it makes the select() call.

We start up idle, and occasionally some I/O happens; and I'm not sure whether the occasional EINTRs are a problem or not.

[Jun 13 22:19:29] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1824400,
[Jun 13 22:19:36] WARNING[25154]: poll.c:308 poll: select errno=4, pArray=1814c00,
[Jun 13 22:19:36] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:19:49] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1824400,
[Jun 13 22:19:56] WARNING[25154]: poll.c:308 poll: select errno=4, pArray=1814c00,
[Jun 13 22:19:56] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:06] WARNING[25154]: poll.c:308 poll: select errno=4, pArray=1814c00,
[Jun 13 22:20:06] WARNING[25154]: poll.c:308 poll: select errno=4, pArray=1814c00,
[Jun 13 22:20:06] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:06] WARNING[25154]: poll.c:308 poll: select errno=4, pArray=1814c00,
[Jun 13 22:20:06] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:09] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1824400,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:10] WARNING[25154]: poll.c:308 poll: select errno=4, pArray=1814c00,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,

Then I throw a call at it:

-- Accepting AUTHENTICATED call from 192.168.0.100:
> requested format = ulaw,
> requested prefs = (ulaw|alaw|g726|gsm|ilbc),
> actual format = ulaw,
> host prefs = (),
> priority = caller
[Jun 13 22:20:10] WARNING[25154]: poll.c:308 poll: select errno=4, pArray=1814c00,
-- Executing [12640001@inbound-cnet-264:1] Answer("IAX2/mainpbx-8555", "300") in new stack

[Jun 13 22:20:10] WARNING[25154]: poll.c:308 poll: select errno=4, pArray=1814c00,

Reasonably normal for a while, I think...

[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf760,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf760,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf760,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf760,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:10] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf760,

Things look normal here, so we'll skip a bit and continue:

[Jun 13 22:20:11] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf760,
-- Executing [12640001@inbound-cnet-264:2] NoOp("IAX2/mainpbx-8555", "") in new stack
-- Executing [12640001@inbound-cnet-264:3] Playback("IAX2/mainpbx-8555", "/Users/jcovert/asterisk/laptop-pbx") in new stack
-- <IAX2/mainpbx-8555> Playing '/Users/jcovert/asterisk/laptop-pbx' (language 'en')
[Jun 13 22:20:11] WARNING[25154]: poll.c:308 poll: select errno=4, pArray=1814c00,
[Jun 13 22:20:11] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:11] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf5f0,
[Jun 13 22:20:11] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:11] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf5f0,

This goes on for a while...

[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf5f0,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf5f0,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=f04bf5f0,

... and then suddenly the output stops, and we are in a solid loop of:

[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,
[Jun 13 22:20:13] WARNING[25154]: poll.c:310 poll: ready=1, pArray=1814c00,

and this continues until I killall -9 asterisk.
By: Mark Michelson (mmichelson) 2008-06-16 10:53:00

Okay, if you can get a backtrace of the thread that is looping on the poll, we may see something that is occurring incorrectly down the stack a bit further.
By: John Covert (jcovert) 2008-06-17 06:41:24

ok, so I did this:

static int jrccount = 0 ;
static struct pollfd *pjrc = 0;

and after the select() call:

if (ready_descriptors > 0) {
if (pjrc != pArray) {
jrccount = 0;
pjrc = pArray;
} else {
if (++jrccount > 10) kill(getpid(),SIGQUIT);
}
ast_log(LOG_WARNING,"ready=%d, pArray=%x,\n",ready_descriptors,(unsigned int)pArray);
}

and we got

[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=f04bf760,
[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=f04bf760,
[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
-- Executing [12640001@inbound-cnet-264:2] NoOp("IAX2/mainpbx-748", "") in new stack
-- Executing [12640001@inbound-cnet-264:3] Playback("IAX2/mainpbx-748", "/Users/jcovert/asterisk/laptop-pbx") in new stack
[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:23] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:24] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:24] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:24] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
[Jun 16 22:22:24] WARNING[642]: poll.c:319 poll: ready=1, pArray=1814c00,
Quit (core dumped)
jrcovert:~/asterisk/asterisk-1.4.122796 root# ls -l /cores/core.642
-r-------- 1 root admin 201625600 Jun 16 22:22 /cores/core.642

Being a VMS developer, not a Uish developer, I'm not sure what the best thing to do to get you the info you need would be. That's probably way too big a file to upload into bug tracker, and it probably has some SIP and IAX2 passwords and stuff in it that I'd prefer to only trust you and not the world with.

What would you like?

/john
By: Mark Michelson (mmichelson) 2008-06-23 10:50:06

Yes, you definitely don't want to upload the core file here, because it is way too big like you said, and it also is useless if ported to a different machine. What you should do is look in the file doc/backtrace.txt for instructions on how to obtain a backtrace from a core dump using gdb. If you issue the "thread apply all bt" command in gdb, you'll get a backtrace of all threads currently running. It may be a bit tricky to track down which thread is actually looping, but using the debug info you have added to poll.c, you should be able to match the address of pArray to the address given in the backtrace. If you can isolate that one thread, and upload the backtrace from that thread, that would be helpful. As far as passwords and other information goes, I doubt that sort of thing will be included in the backtrace you upload, but just in case, you may want to take a close look and sanitize it prior to uploading.
By: Mark Michelson (mmichelson) 2008-07-25 12:37:00

I'm closing this issue for a few reasons.

1. It's been a month since there has been a comment here.
2. There have several issues fixed in this bug report aside from the reported issue.

If you want to pursue this problem further, please open a new issue with whatever specific problem is occurring. In this case, if you're still having the infinitely looping calls to poll(), then we can try to resolve that in a separate bug report.