Summary: | ASTERISK-03246: [patch] Race condition in zaptel.c causes a kernel panic & fix | ||
Reporter: | klictel (klictel) | Labels: | |
Date Opened: | 2005-01-10 15:09:13.000-0600 | Date Closed: | 2008-06-07 10:47:21 |
Priority: | Critical | Regression? | No |
Status: | Closed/Complete | Components: | Core/General |
Versions: | Frequency of Occurrence | ||
Related Issues: | |||
Environment: | Attachments: | ( 0) diff.txt ( 1) diff.txt ( 2) mydiff.txt ( 3) panic.txt | |
Description: | Environment: Dual xeon servers (Dell 2850 or Vision), Asterisk 1.0.3 or latest CVS, Redhat enterprise ES3 update 3 or latest stock kernel 2.4.28, or fedora 2.6 kernel, TE410P Kernel Panic trace: Attached file panic.txt When Zaptel/Asterisk receives a high number of simultaneous calls, the pointer chan->curtone can become NULL while the function __zt_getbuf_chunk() tries to use it (line 4848). It might seem impossible since this condition is tested right above that line, but it does happen that between the time the test 'if (ms->curtone &&...' is performed and the time curtone is used, it gets set to NULL by the function zt_chan_ioctl() (line 4099). Examinig the code, you notice that no lock on the chan is performed during that operation. The calls to __zt_getbuf_chunk() are all within lock/unlock. By simply adding the protection in zt_chan_ioctl(), we eliminated the problem. Here is the fix: 4098a4099 > spin_lock_irqsave(&chan->lock, flags); 4099a4101 > spin_unlock_irqrestore(&chan->lock, flags); ****** ADDITIONAL INFORMATION ****** For the last three monthes we have been suffering kernel panics occuring at one of our clients site. Even though our own servers handle a higher volume of calls than his, we never experienced such a problem ourselfs. The particularity of their environment is that their call center does not receive the calls in a uniform manner through out the day, but in very high bursts. During a very short period of time (5 minutes) their call center is practically bombarded with calls, and the kernel panic always occured during such peaks, around once a day. We were able to reproduce the scenario the following way: We prepared two asterisk servers with TE410P, one for generating the calls and the second for receiving them (Asterisk A generates and Asterisk B receives). We connected the spans with four crossover cables (span1 to span1, span2 to span2, ... ). On *A we generated the calls by repeatedly dumping a large number of call files (100) in outgoing. I know it is not the best way for generating outbounds, but it did the job for our tests. What is important is dumping the files at the same time, and not sequentially with a delay in between. We need to have *B trying to accept the 96 calls simultaneously. We configured the channels as FXOKS on *A and FXSKS on *B. We were able to reproduce the kernel panic at will, with the first batch of 96 calls sent at once. We connected a NULL modem to capture the full trace of the panic (attached). After the change to zaptel.c, we let it run for 24 hours without a problem. Please note that the kernel panic only occures on the servers receiving the calls, not the one generating the calls. | ||
Comments: | By: Brian West (bkw918) 2005-01-10 15:12:26.000-0600 attach a diff -u please. By: klictel (klictel) 2005-01-10 15:30:32.000-0600 I had the diff at the end of the Description field. Here it is: 4098a4099 > spin_lock_irqsave(&chan->lock, flags); 4099a4101 > spin_unlock_irqrestore(&chan->lock, flags); By: Mark Spencer (markster) 2005-01-10 16:03:46.000-0600 We need "diff -u" not just diff.... By: klictel (klictel) 2005-01-10 16:32:37.000-0600 here is the diff -u... sorry for that... --- zaptel.c.ori 2005-01-10 17:28:49.635373128 -0500 +++ zaptel.c 2005-01-10 17:30:19.848658624 -0500 @@ -4096,7 +4096,9 @@ /* if no span, just do nothing */ if (!chan->span) return(0); /* if dialing, stop it */ + spin_lock_irqsave(&chan->lock, flags); chan->curtone = NULL; + spin_unlock_irqrestore(&chan->lock, flags); chan->dialing = 0; chan->txdialbuf[0] = '\0'; chan->tonep = 0; By: twisted (twisted) 2005-01-10 18:58:42.000-0600 please attach a diff -u in the same fashon you attached the normal diff. By: zoa (zoa) 2005-01-11 03:22:56.000-0600 patch confirmed by francois lambert on the mailinglist. By: Andrew Kohlsmith (akohlsmith) 2005-01-11 10:10:43.000-0600 I'd love to know more information about how you debugged this particular crash -- how you nailed it down to that line from the panic and so on -- Perhaps on the list moreso than here, but I think your experience being detailed would be a great help. By: klictel (klictel) 2005-01-11 11:09:42.000-0600 Hi akohlsmith, It was important to get the stack of the kernel panic, and since the console limits to 24 lines, i connected a NULL modem to serial port of the server and the other side connected to a serial port of a windows PC, and used Hyperterminal on the PC to connect to the serial port and capture whatever appears. Now everything that goes on the console is sent through the null modem to the PC. The stack even shows you which function used the null pointer in zaptel. From there on, it is C code to debug. Claude By: Mark Spencer (markster) 2005-01-11 11:18:37.000-0600 Fixed in CVS head. I also wanted to keep the lock for a few more parameters just to be on the safe side. By: Russell Bryant (russell) 2005-01-12 22:48:39.000-0600 fixed in 1.0 By: Digium Subversion (svnbot) 2008-06-07 10:47:21 Repository: dahdi Revision: 545 U branches/v1-0/zaptel.c ------------------------------------------------------------------------ r545 | russell | 2008-06-07 10:47:20 -0500 (Sat, 07 Jun 2008) | 2 lines lock some parameters (bug ASTERISK-3246) ------------------------------------------------------------------------ http://svn.digium.com/view/dahdi?view=rev&revision=545 |