[Home]

Summary:ASTERISK-03246: [patch] Race condition in zaptel.c causes a kernel panic & fix
Reporter:klictel (klictel)Labels:
Date Opened:2005-01-10 15:09:13.000-0600Date Closed:2008-06-07 10:47:21
Priority:CriticalRegression?No
Status:Closed/CompleteComponents:Core/General
Versions:Frequency of
Occurrence
Related
Issues:
Environment:Attachments:( 0) diff.txt
( 1) diff.txt
( 2) mydiff.txt
( 3) panic.txt
Description:Environment: Dual xeon servers (Dell 2850 or Vision),
            Asterisk 1.0.3 or latest CVS, Redhat enterprise ES3 update 3 or latest stock kernel 2.4.28, or fedora 2.6 kernel,
            TE410P

Kernel Panic trace: Attached file panic.txt

When Zaptel/Asterisk receives a high number of simultaneous calls, the pointer chan->curtone can become NULL while the function __zt_getbuf_chunk() tries to use it (line 4848). It might seem impossible since this condition is tested right above that line, but it does happen that between the time the test 'if (ms->curtone &&...' is performed and the time curtone is used, it gets set to NULL by the function zt_chan_ioctl() (line 4099). Examinig the code, you notice that no lock on the chan is performed during that operation. The calls to __zt_getbuf_chunk() are all within lock/unlock. By simply adding the protection in zt_chan_ioctl(), we eliminated the problem. Here is the fix:
4098a4099
>               spin_lock_irqsave(&chan->lock, flags);
4099a4101
>               spin_unlock_irqrestore(&chan->lock, flags);



****** ADDITIONAL INFORMATION ******

For the last three monthes we have been suffering kernel panics occuring at one of our clients site. Even though our own servers handle a higher volume of calls than his, we never experienced such a problem ourselfs. The particularity of their environment is that their call center does not receive the calls in a uniform manner through out the day, but in very high bursts. During a very short period of time (5 minutes) their call center is practically bombarded with calls, and the kernel panic always occured during such peaks, around once a day.

We were able to reproduce the scenario the following way:
We prepared two asterisk servers with TE410P, one for generating the calls and the second for receiving them (Asterisk A generates and Asterisk B receives). We connected the spans with four crossover cables (span1 to span1, span2 to span2, ... ). On *A we generated the calls by repeatedly dumping a large number of call files (100) in outgoing. I know it is not the best way for generating outbounds, but it did the job for our tests. What is important is dumping the files at the same time, and not sequentially with a delay in between. We need to have *B trying to accept the 96 calls simultaneously. We configured the channels as FXOKS on *A and FXSKS on *B.

We were able to reproduce the kernel panic at will, with the first batch of 96 calls sent at once. We connected a NULL modem to capture the full trace of the panic (attached). After the change to zaptel.c, we let it run for 24 hours without a problem. Please note that the kernel panic only occures on the servers receiving the calls, not the one generating the calls.
Comments:By: Brian West (bkw918) 2005-01-10 15:12:26.000-0600

attach a diff -u please.

By: klictel (klictel) 2005-01-10 15:30:32.000-0600

I had the diff at the end of the Description field. Here it is:

4098a4099
> spin_lock_irqsave(&chan->lock, flags);
4099a4101
> spin_unlock_irqrestore(&chan->lock, flags);

By: Mark Spencer (markster) 2005-01-10 16:03:46.000-0600

We need "diff -u" not just diff....

By: klictel (klictel) 2005-01-10 16:32:37.000-0600

here is the diff -u... sorry for that...

--- zaptel.c.ori        2005-01-10 17:28:49.635373128 -0500
+++ zaptel.c    2005-01-10 17:30:19.848658624 -0500
@@ -4096,7 +4096,9 @@
               /* if no span, just do nothing */
               if (!chan->span) return(0);
               /* if dialing, stop it */
+               spin_lock_irqsave(&chan->lock, flags);
               chan->curtone = NULL;
+               spin_unlock_irqrestore(&chan->lock, flags);
               chan->dialing = 0;
               chan->txdialbuf[0] = '\0';
               chan->tonep = 0;

By: twisted (twisted) 2005-01-10 18:58:42.000-0600

please attach a diff -u in the same fashon you attached the normal diff.

By: zoa (zoa) 2005-01-11 03:22:56.000-0600

patch confirmed by francois lambert on the mailinglist.

By: Andrew Kohlsmith (akohlsmith) 2005-01-11 10:10:43.000-0600

I'd love to know more information about how you debugged this particular crash -- how you nailed it down to that line from the panic and so on -- Perhaps on the list moreso than here, but I think your experience being detailed would be a great help.

By: klictel (klictel) 2005-01-11 11:09:42.000-0600

Hi akohlsmith,

It was important to get the stack of the kernel panic, and since the console limits to 24 lines, i connected a NULL modem to serial port of the server and the other side connected to a serial port of a windows PC, and used Hyperterminal on the PC to connect to the serial port and capture whatever appears. Now everything that goes on the console is sent through the null modem to the PC.
The stack even shows you which function used the null pointer in zaptel. From there on, it is C code to debug.

Claude

By: Mark Spencer (markster) 2005-01-11 11:18:37.000-0600

Fixed in CVS head.  I also wanted to keep the lock for a few more parameters just to be on the safe side.

By: Russell Bryant (russell) 2005-01-12 22:48:39.000-0600

fixed in 1.0

By: Digium Subversion (svnbot) 2008-06-07 10:47:21

Repository: dahdi
Revision: 545

U   branches/v1-0/zaptel.c

------------------------------------------------------------------------
r545 | russell | 2008-06-07 10:47:20 -0500 (Sat, 07 Jun 2008) | 2 lines

lock some parameters (bug ASTERISK-3246)

------------------------------------------------------------------------

http://svn.digium.com/view/dahdi?view=rev&revision=545