Summary:ASTERISK-04513: When Primary D-Channel goes down q931_hangup may receive a wrong (q931_call *) argument
Reporter:Ezio Vernacotola (ezio)Labels:
Date Opened:2005-07-03 03:51:58Date Closed:2011-06-07 14:00:42
Versions:Frequency of
Environment:Attachments:( 0) chan_zap_r1.500-patch.txt
( 1) error-log.txt
( 2) overlap_call_with_pri_debug.txt
( 3) pri-restart-after-timeout.txt
( 4) unallocated-call.txt
Description:There are obscure (for me!) situations, causing my telco restart the E1 pri. Asterisk does not handle well this. q931_hangup my receive a wrong q931_call pointer argument and cause asterisk to crash.

In asterisk/log/messages is reported:
PRI: We're not yet handling hanging up when our state is 956317232, contact support@digium.com,  ourstate Unknown, peerstate Unknown

where 956317232 is random


backtrace full:

#0  0x001d0356 in q931_hangup (pri=0x9dc4960, c=0x9e76810, cause=16) at q931.c:3018
       disconnect = 1
       release_compl = 0
#1  0x001c6d4a in pri_hangup (pri=0x9dc4960, call=0x9e76810, cause=16) at pri.c:530
No locals.
#2  0x0074d76e in pri_dchannel (vpri=0x762b20) at chan_zap.c:7897
       law = 2
       e = (pri_event *) 0x9dc5264
       fds = {{fd = 106, events = 3, revents = 0}, {fd = 0, events = 0, revents = 0}, {fd = 0, events = 0,
   revents = 0}, {fd = 3784424, events = 13188, revents = 235}}
       res = 165102592
       chanpos = 24
       x = 0
       haveidles = 0
       activeidles = 0
       nextidle = 0
       c = (struct ast_channel *) 0x9d74400
       tv = {tv_sec = 0, tv_usec = 999995}
       lowest = {tv_sec = 0, tv_usec = 999995}
       next = (struct timeval *) 0x9d74400
       lastidle = {tv_sec = 1120170271, tv_usec = 661160}
       doidling = 0
       idlen = '\0' <repeats 79 times>
       idle = (struct ast_channel *) 0x9d74400
       p = 0
       t = 1120342078
       i = 7744288
       which = 165102592
       numdchans = 1
       cause = 0
       crv = (struct zt_pvt *) 0x0
       threadid = 281280526
       attr = {__detachstate = 1, __schedpolicy = 0, __schedparam = {__sched_priority = 0}, __inheritsched = 1,
 __scope = 0, __guardsize = 4096, __stackaddr_set = 0, __stackaddr = 0x0, __stacksize = 262144}
       ani2str = "00\000\000\000"
       plancallingnum = "076635949", '\0' <repeats 246 times>
       calledtonstr = "33\000\000\000\000\000\000\000"
       __PRETTY_FUNCTION__ = "pri_dchannel"
#3  0x00116ce1 in pthread_start_thread () from /lib/i686/libpthread.so.0
No symbol table info available.
#4  0x00f5861a in clone () from /lib/i686/libc.so.6
No symbol table info available.

q931_call argument inspecting

(gdb) print c
$3 = (q931_call *) 0x9e76810
(gdb) print c->cause
Cannot access memory at address 0x9e76888
Comments:By: Matthew Fredrickson (mattf) 2005-07-06 15:03:46

What is happening on the PRI when this occurs?

By: Ezio Vernacotola (ezio) 2005-07-07 10:47:29

This error occours generally just after midnight between 00:07:00 and 00:11:00, every day or two. Only rarely asterisk crashes. Actually i have only full error log with no pri debug. Tonight i will switch pri debug on all three span, in hope can catch useful information. In meantime i upload my full error log

By: Matthew Fredrickson (mattf) 2005-07-08 11:58:17

Is this a patched system by any chance?

By: Ezio Vernacotola (ezio) 2005-07-10 02:32:53

The system is not patched. Is running cvs head, updated 1-2 times in a month.

By: Matthew Fredrickson (mattf) 2005-07-12 12:04:59

I wonder if this has anything to do with some of my NFAS code.  Does this occur in stable as well?

By: Matthew Fredrickson (mattf) 2005-07-12 16:55:25

I don't know if there is much more I can do with this unless I can find out some more information about what was going on in the system while this occurred.  The backtrace is pretty much useless to me and I haven't found anywhere that this can occur in.  Try building without optimizations and see if it still occurs.

By: Ezio Vernacotola (ezio) 2005-07-13 02:17:21

I can't revert to cvs stable because the system is in production and I use various  recent features, but I will rebuild libpri without optimization, and will try in the next days to learn the libpri code and insert somewhere additional logging instructions.

By: Ezio Vernacotola (ezio) 2005-07-22 03:42:29

No more errors in the last 10 days, and now I haven't no way to reproduce the problem. Now I am running with chan_zap.c optimization off ($(CC) -c $(CFLAGS) -O0 -o chan_zap.o chan_zap.c), I suspect the origin of errors was solved in the telco side, but I don't known, in case of errors, if asterisk will crash again. Now all I can do is turn optimization on again on chan_zap and see what will happen.

By: Michael Jerris (mikej) 2005-07-22 08:40:50

Suspended bug due to inability to reproduce at this time.  If you are able to get an update trace with valgrind, please re-open this at that time.  Thanks for the report.

By: Ezio Vernacotola (ezio) 2005-09-04 04:42:02

I receive errors again from 2005-08-16 and now have more info on what is happening.
1) at 00:11:07 every night I receive a N200 timeout
2) t200_expire calls q921_dchannel_down, PRI_EVENT_DCHAN_DOWN is fired
3) chan_zap.c tries to hangup all active channels (there are no calls at that time)
4) pri_hangup and pri_destroycall are called with pointer to calls that does not exists!!!
5) random things my happen, mostly times only stange error in the log, some time has crashed.

If I start asterisk with valgrind just about midnight I see the t200 timeout but no call destroy errors. I found that with a recently started asterisk there are no problems.

I think that somewhere, sometime, may happen that a call is destroyed but his reference in the struct zt_pvt is not set to NULL.

By: Matthew Fredrickson (mattf) 2005-09-06 13:52:00

Try updating to latest CVS HEAD and see if this still occurs.  I just found a possible race condition that might have caused your problem.

By: Ezio Vernacotola (ezio) 2005-09-07 15:20:28

Matt, I have applied your patch "diff -u -d -r1.500 -r1.501" to my current version of chan_zap.c (1.462). I don't want update all asterisk to cvs head because I had, on another machine, troubles with latest updates of chan_iax.

Just after appling the patch and restarted asterisk, I got these warnings on hangup, never seen before:

2005-09-07 20:58:45 WARNING[1949]: chan_zap.c:7462 pri_fixup_principle: Call specified, but not found?
2005-09-07 20:58:45 WARNING[1949]: chan_zap.c:8485 pri_dchannel: Hangup on bad channel 0/31 on span 3
   -- B-channel 0/31 restarted on span 3

After that i have removed the first line "p->call = NULL;" that you have added and keep the other two edits. Now more warnings on hangup for now.

Is hard for me understand chan_zap.c code, and may be that what I made have no sense,  but i will try going with these updates and see what happen.

By: Michael Gernoth (mgernoth) 2005-09-07 15:31:38

These errors happen on the CVS-Head version, too. See Bug 0005147

By: Matthew Fredrickson (mattf) 2005-09-07 18:14:25

fixed that little problem in head.  I accidentally screwed up hangups :-)

By: Ezio Vernacotola (ezio) 2005-09-09 13:42:29

I have finally found the origin of problem, and is reproducibile. Your patch does't resolve it. All begin when I receive a call from the pri to an extension that don't exists. The call is hanged up with cause "Unallocated(unassigned) number". If I do 'show channels' at the cli I see no calls, that's right, but if I do 'zap show channel N' (where N is the channel that received the call), I see 'PRI Flags: Call'.
In the next scheduled restart of the span the channel is not restarted.
In this status there can be troubles in case of PRI_EVENT_DCHAN_DOWN.
I looked at the code of chan_zap and q931 but I can't undestand where is the bug.

By: Matthew Fredrickson (mattf) 2005-09-09 14:04:54

Great!!! That will help out a lot.  I'm about to leave the office but I'll see if I can find the bug this weekend.

By: Ezio Vernacotola (ezio) 2005-09-11 13:55:57

More info: this occours only if Calling equipment is non-ISDN and overlapdial=yes in zapata.conf.

By: Mark Spencer (markster) 2005-09-21 08:34:27

Can you provide me a debug of the overlap dial situation?  I'd like to see the negotiation in that situation, since it sounds like we're not getting the RELEASE COMPLETE back.

By: Ezio Vernacotola (ezio) 2005-09-21 09:20:34

In overlap_call_with_pri_debug.txt there an unallocated call from '076630351' on channel 0/21, span 1. After hangup zap show channel 21' shows 'PRI Flags: Call'. There is some noise due to other activity.

In pri-restart-after-timeout.txt there is what happened tonight at 00:11:12 after a pri timeout and subsequent restart.

These two debug file don't are correlated (the unallocated call is successive to the restart) but should show clearly what is happening.

By: Mark Spencer (markster) 2005-09-24 22:31:28

That's exactly what I needed.  Fixed in CVS head!

By: Russell Bryant (russell) 2005-09-25 13:27:54

fixed in 1.0 as well

By: Digium Subversion (svnbot) 2008-01-15 15:48:56.000-0600

Repository: asterisk
Revision: 6639

U   trunk/channels/chan_zap.c

r6639 | markster | 2008-01-15 15:48:56 -0600 (Tue, 15 Jan 2008) | 2 lines

Null out call number on release complete when number is unallocated (bug ASTERISK-4513)



By: Digium Subversion (svnbot) 2008-01-15 15:49:02.000-0600

Repository: asterisk
Revision: 6646

U   branches/v1-0/channels/chan_zap.c

r6646 | russell | 2008-01-15 15:49:01 -0600 (Tue, 15 Jan 2008) | 2 lines

Null out call number on release complete when number is unallocated (bug ASTERISK-4513)