Summary: | ASTERISK-04513: When Primary D-Channel goes down q931_hangup may receive a wrong (q931_call *) argument | ||
Reporter: | Ezio Vernacotola (ezio) | Labels: | |
Date Opened: | 2005-07-03 03:51:58 | Date Closed: | 2011-06-07 14:00:42 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) chan_zap_r1.500-patch.txt ( 1) error-log.txt ( 2) overlap_call_with_pri_debug.txt ( 3) pri-restart-after-timeout.txt ( 4) unallocated-call.txt | |
Description: | There are obscure (for me!) situations, causing my telco restart the E1 pri. Asterisk does not handle well this. q931_hangup my receive a wrong q931_call pointer argument and cause asterisk to crash. In asterisk/log/messages is reported: PRI: We're not yet handling hanging up when our state is 956317232, contact support@digium.com, ourstate Unknown, peerstate Unknown where 956317232 is random ****** ADDITIONAL INFORMATION ****** backtrace full: #0 0x001d0356 in q931_hangup (pri=0x9dc4960, c=0x9e76810, cause=16) at q931.c:3018 disconnect = 1 release_compl = 0 #1 0x001c6d4a in pri_hangup (pri=0x9dc4960, call=0x9e76810, cause=16) at pri.c:530 No locals. #2 0x0074d76e in pri_dchannel (vpri=0x762b20) at chan_zap.c:7897 law = 2 e = (pri_event *) 0x9dc5264 fds = {{fd = 106, events = 3, revents = 0}, {fd = 0, events = 0, revents = 0}, {fd = 0, events = 0, revents = 0}, {fd = 3784424, events = 13188, revents = 235}} res = 165102592 chanpos = 24 x = 0 haveidles = 0 activeidles = 0 nextidle = 0 c = (struct ast_channel *) 0x9d74400 tv = {tv_sec = 0, tv_usec = 999995} lowest = {tv_sec = 0, tv_usec = 999995} next = (struct timeval *) 0x9d74400 lastidle = {tv_sec = 1120170271, tv_usec = 661160} doidling = 0 idlen = '\0' <repeats 79 times> idle = (struct ast_channel *) 0x9d74400 p = 0 t = 1120342078 i = 7744288 which = 165102592 numdchans = 1 cause = 0 crv = (struct zt_pvt *) 0x0 threadid = 281280526 attr = {__detachstate = 1, __schedpolicy = 0, __schedparam = {__sched_priority = 0}, __inheritsched = 1, __scope = 0, __guardsize = 4096, __stackaddr_set = 0, __stackaddr = 0x0, __stacksize = 262144} ani2str = "00\000\000\000" plancallingnum = "076635949", '\0' <repeats 246 times> calledtonstr = "33\000\000\000\000\000\000\000" __PRETTY_FUNCTION__ = "pri_dchannel" #3 0x00116ce1 in pthread_start_thread () from /lib/i686/libpthread.so.0 No symbol table info available. #4 0x00f5861a in clone () from /lib/i686/libc.so.6 No symbol table info available. q931_call argument inspecting (gdb) print c $3 = (q931_call *) 0x9e76810 (gdb) print c->cause Cannot access memory at address 0x9e76888 (gdb) | ||
Comments: | By: Matthew Fredrickson (mattf) 2005-07-06 15:03:46 What is happening on the PRI when this occurs? By: Ezio Vernacotola (ezio) 2005-07-07 10:47:29 This error occours generally just after midnight between 00:07:00 and 00:11:00, every day or two. Only rarely asterisk crashes. Actually i have only full error log with no pri debug. Tonight i will switch pri debug on all three span, in hope can catch useful information. In meantime i upload my full error log By: Matthew Fredrickson (mattf) 2005-07-08 11:58:17 Is this a patched system by any chance? By: Ezio Vernacotola (ezio) 2005-07-10 02:32:53 The system is not patched. Is running cvs head, updated 1-2 times in a month. By: Matthew Fredrickson (mattf) 2005-07-12 12:04:59 I wonder if this has anything to do with some of my NFAS code. Does this occur in stable as well? By: Matthew Fredrickson (mattf) 2005-07-12 16:55:25 I don't know if there is much more I can do with this unless I can find out some more information about what was going on in the system while this occurred. The backtrace is pretty much useless to me and I haven't found anywhere that this can occur in. Try building without optimizations and see if it still occurs. By: Ezio Vernacotola (ezio) 2005-07-13 02:17:21 I can't revert to cvs stable because the system is in production and I use various recent features, but I will rebuild libpri without optimization, and will try in the next days to learn the libpri code and insert somewhere additional logging instructions. By: Ezio Vernacotola (ezio) 2005-07-22 03:42:29 No more errors in the last 10 days, and now I haven't no way to reproduce the problem. Now I am running with chan_zap.c optimization off ($(CC) -c $(CFLAGS) -O0 -o chan_zap.o chan_zap.c), I suspect the origin of errors was solved in the telco side, but I don't known, in case of errors, if asterisk will crash again. Now all I can do is turn optimization on again on chan_zap and see what will happen. By: Michael Jerris (mikej) 2005-07-22 08:40:50 Suspended bug due to inability to reproduce at this time. If you are able to get an update trace with valgrind, please re-open this at that time. Thanks for the report. By: Ezio Vernacotola (ezio) 2005-09-04 04:42:02 I receive errors again from 2005-08-16 and now have more info on what is happening. 1) at 00:11:07 every night I receive a N200 timeout 2) t200_expire calls q921_dchannel_down, PRI_EVENT_DCHAN_DOWN is fired 3) chan_zap.c tries to hangup all active channels (there are no calls at that time) 4) pri_hangup and pri_destroycall are called with pointer to calls that does not exists!!! 5) random things my happen, mostly times only stange error in the log, some time has crashed. If I start asterisk with valgrind just about midnight I see the t200 timeout but no call destroy errors. I found that with a recently started asterisk there are no problems. I think that somewhere, sometime, may happen that a call is destroyed but his reference in the struct zt_pvt is not set to NULL. By: Matthew Fredrickson (mattf) 2005-09-06 13:52:00 Try updating to latest CVS HEAD and see if this still occurs. I just found a possible race condition that might have caused your problem. By: Ezio Vernacotola (ezio) 2005-09-07 15:20:28 Matt, I have applied your patch "diff -u -d -r1.500 -r1.501" to my current version of chan_zap.c (1.462). I don't want update all asterisk to cvs head because I had, on another machine, troubles with latest updates of chan_iax. Just after appling the patch and restarted asterisk, I got these warnings on hangup, never seen before: 2005-09-07 20:58:45 WARNING[1949]: chan_zap.c:7462 pri_fixup_principle: Call specified, but not found? 2005-09-07 20:58:45 WARNING[1949]: chan_zap.c:8485 pri_dchannel: Hangup on bad channel 0/31 on span 3 -- B-channel 0/31 restarted on span 3 After that i have removed the first line "p->call = NULL;" that you have added and keep the other two edits. Now more warnings on hangup for now. Is hard for me understand chan_zap.c code, and may be that what I made have no sense, but i will try going with these updates and see what happen. By: Michael Gernoth (mgernoth) 2005-09-07 15:31:38 These errors happen on the CVS-Head version, too. See Bug 0005147 By: Matthew Fredrickson (mattf) 2005-09-07 18:14:25 fixed that little problem in head. I accidentally screwed up hangups :-) By: Ezio Vernacotola (ezio) 2005-09-09 13:42:29 I have finally found the origin of problem, and is reproducibile. Your patch does't resolve it. All begin when I receive a call from the pri to an extension that don't exists. The call is hanged up with cause "Unallocated(unassigned) number". If I do 'show channels' at the cli I see no calls, that's right, but if I do 'zap show channel N' (where N is the channel that received the call), I see 'PRI Flags: Call'. In the next scheduled restart of the span the channel is not restarted. In this status there can be troubles in case of PRI_EVENT_DCHAN_DOWN. I looked at the code of chan_zap and q931 but I can't undestand where is the bug. By: Matthew Fredrickson (mattf) 2005-09-09 14:04:54 Great!!! That will help out a lot. I'm about to leave the office but I'll see if I can find the bug this weekend. By: Ezio Vernacotola (ezio) 2005-09-11 13:55:57 More info: this occours only if Calling equipment is non-ISDN and overlapdial=yes in zapata.conf. By: Mark Spencer (markster) 2005-09-21 08:34:27 Can you provide me a debug of the overlap dial situation? I'd like to see the negotiation in that situation, since it sounds like we're not getting the RELEASE COMPLETE back. By: Ezio Vernacotola (ezio) 2005-09-21 09:20:34 In overlap_call_with_pri_debug.txt there an unallocated call from '076630351' on channel 0/21, span 1. After hangup zap show channel 21' shows 'PRI Flags: Call'. There is some noise due to other activity. In pri-restart-after-timeout.txt there is what happened tonight at 00:11:12 after a pri timeout and subsequent restart. These two debug file don't are correlated (the unallocated call is successive to the restart) but should show clearly what is happening. By: Mark Spencer (markster) 2005-09-24 22:31:28 That's exactly what I needed. Fixed in CVS head! By: Russell Bryant (russell) 2005-09-25 13:27:54 fixed in 1.0 as well By: Digium Subversion (svnbot) 2008-01-15 15:48:56.000-0600 Repository: asterisk Revision: 6639 U trunk/channels/chan_zap.c ------------------------------------------------------------------------ r6639 | markster | 2008-01-15 15:48:56 -0600 (Tue, 15 Jan 2008) | 2 lines Null out call number on release complete when number is unallocated (bug ASTERISK-4513) ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=6639 By: Digium Subversion (svnbot) 2008-01-15 15:49:02.000-0600 Repository: asterisk Revision: 6646 U branches/v1-0/channels/chan_zap.c ------------------------------------------------------------------------ r6646 | russell | 2008-01-15 15:49:01 -0600 (Tue, 15 Jan 2008) | 2 lines Null out call number on release complete when number is unallocated (bug ASTERISK-4513) ------------------------------------------------------------------------ http://svn.digium.com/view/asterisk?view=rev&revision=6646 |