Summary:ASTERISK-20212: Deadlock / TCP SIP Stack
Reporter:Phil Ciccone (pciccone)Labels:
Date Opened:2012-08-10 06:34:40Date Closed:2012-10-12 11:03:27
Versions:10.7.0 Frequency of
Environment:CentOS 5.7 / Asterisk 10.7.0Attachments:( 0) ASTERISK-20212_v1.patch
( 1) ASTERISK-20212_v2.patch
( 2) backtrace-threads.txt
( 3) backtrace-threads-2.txt
( 4) core-show-locks.txt
( 5) core-show-locks-2.txt
( 6) locked_thread_and_bt_full.txt
Description:We have been experiencing Asterisk deadlocks for the past week, from at the time, what seemed like nowhere. We upgraded to the latest 10.7.0 stable and enabled the debug flag and disabled optimizations in hopes to catch the deadlock. It seems from looking at the locks we have some sort of problem with our TCP connected SIP devices. Things are now starting to make sense as connecting TCP SIP devices is something newer that we started to do. The deadlocks we experience happen anywhere from every-other-day to several times a day, all involving a restart. Would you be so kind as to look at the attached backtrace and locks output?
Comments:By: Phil Ciccone (pciccone) 2012-08-10 06:35:11.611-0500


By: Phil Ciccone (pciccone) 2012-08-10 06:35:25.986-0500

show locks

By: Mark Michelson (mmichelson) 2012-08-17 15:43:27.223-0500

Thanks for the core show locks and backtrace. It looks like a call to fgets() in chan_sip.c is blocking forever while holding a lock. The solution here is to figure out why that function call would block since we should only be calling it when there is data to be read. Reproducing this is not going to be easy since things appear to happen at random, so I'm going to have to inspect the code for potential problems and see if I can find anything.

By: Phil Ciccone (pciccone) 2012-08-17 16:38:45.129-0500

If there is anything you would like extra from me on the next occurrence, it would be no problem at all. Just let me know. It seems to come in waves, sometimes we get a crash 3 or 4 times a day, then other times we can go days without a single problem. Recently, we have been fine for a few days.

By: Mark Michelson (mmichelson) 2012-08-17 16:45:29.327-0500

One thing that seems a bit odd to me is that there are apparently two threads polling the same same tcptls_session. One is a client thread and the other is the server thread. As far as I'm aware, such a situation is not supposed to happen. I suspect that what has occurred here is that somehow, both threads were polling on the same file descriptors. When this occurs, it likely can lead to some weird things happening like what you are seeing. I will try to find what could cause this situation to occur and look into trying to prevent it from occurring.

By: Mark Michelson (mmichelson) 2012-08-17 17:00:19.378-0500

If I don't find anything myself, then I may supply a patch where I've added in extra debugging statements. If I do that, then I'd need for you to get me a debug log when this happens so I can see if there's anything fishy. I'd need to devise such a patch first though. Thanks for offering to help out!

By: Phil Ciccone (pciccone) 2012-08-17 17:37:10.626-0500

Man, we have gone days without a problem. We just starting chatting and it just happened again! What are the odds! :)

By: Mark Michelson (mmichelson) 2012-08-20 11:27:56.639-0500

Welp, let me give some analysis of what's going on here, based on the core show locks and backtrace that you've given me.

There are three threads vying for the same lock.

Thread 48 is the "monitor thread" of chan_sip. It handles incoming UDP SIP messages as well as running scheduled SIP tasks. In this case, it is stuck trying to grab a lock because it is trying to indicate to a TCP client thread that it needs to send an OPTIONS request out to a specific peer.

Thread 3 is the TCP server thread. It handles new incoming TCP connections. In this case, it is handling an incoming REGISTER request, presumably from the same peer that Thread 48 corresponds with. On an incoming REGISTER, the thread that receives the REGISTER also attempts to indicate that an OPTIONS request should be sent. Much like Thread 48, this thread is stuck trying to grab the lock that it needs to grab before it can indicate that the TCP client thread needs to send an OPTIONS request.

Thread 2 is the culprit and is unfortunately also a mystery. Thread 2 is the TCP client thread that threads 3 and 48 are trying to signal to send an OPTIONS request. This thread handles incoming and outgoing SIP messages on this OPTIONS connection. The thread operates by calling ast_poll() (a wrapper for poll()) to find out if there is data to be read or written on the TCP socket. ast_poll() has indicated that there is data to read, so a loop is entered to read the data line-by-line from the request. The fgets() call to read the message data is, for some reason, blocking and not returning. This should not be happening given that ast_poll() claimed there was data to be read.

So if a TCP client thread gets stuck in a situation like this, it eventually will also block incoming UDP and TCP traffic from all sources. The question to answer is why the TCP client thread is blocking to begin with.

One thing that jumps to mind and that seems somewhat fishy to me is that our call to ast_poll() switches on both POLLIN and POLLPRI. This means that the poll will be interrupted both by inbound data to read as well as "urgent" TCP out-of-band data. I'm curious if the poll is being interrupted by an out-of-band TCP message, which in turn does not result in fgets() being able to read any data. I'm going to supply a patch for you that tests this as a possibility. I'm not guaranteeing anything right now; this is mostly a guess.

Now, if my patch does not work for you and you experience the deadlock again, then I'll need you to do a couple of things for me.

First, I'll want the same data you got for me before (core show locks output and a backtrace of all threads). In the backtrace, try to find the thread that is blocking in the fgets() call. In the backtrace you uploaded before, it was Thread 2. In gdb, navigate to that thread by first typing "thread <threadnum>". From there, navigate to the stack frame corresponding to _sip_tcp_helper_thread. In the backtrace you provided here, that is frame 9. You can navigate to that frame by typing "frame 9". Now that you're at that frame, I want you to type "print req.data" . This will let me know if the SIP message came through partially or not.

Second, I'll want you to upload a full debug log from a run where the problem occurred. This may enlighten me if the remote endpoint for some reason closed the TCP connection or something of that nature.

By: Phil Ciccone (pciccone) 2012-08-20 14:43:03.246-0500

Not a problem. Once the patch is supplied I will apply it, recompile it into product and wait. If/when it fails again I will provide the details requested.

By: Mark Michelson (mmichelson) 2012-08-20 15:10:29.763-0500

Cool. Trying to write a patch is proving problematic at the moment because of some possible other complications that got discussed in #asterisk-dev when I brought this issue up. I'm trying to write something up that won't crash and burn on you when you try using it. :)

By: Phil Ciccone (pciccone) 2012-08-21 11:44:44.203-0500

Another crash - threads backtrace

By: Phil Ciccone (pciccone) 2012-08-21 11:45:06.356-0500

Another crash - core show locks

By: Phil Ciccone (pciccone) 2012-08-21 11:45:54.983-0500

Locked thread gdb request data & bt full

By: Phil Ciccone (pciccone) 2012-08-21 11:46:44.863-0500

Mark, we had just gone down again. I took the opportunity to attach gdb to the running process and get that request data for you. I hope it helps, and I hope I did it right.

By: Mark Michelson (mmichelson) 2012-08-21 13:20:00.345-0500

I can confirm based on the backtrace and core show locks that it's the exact same situation you previously experienced.

I have uploaded two patches to this issue. Hopefully, you will only need one of them.

ASTERISK-20212_v1.patch contains a fix where for TCP connections (not TLS though) we eschew the use of FILE handles in favor of direct actions on file descriptors. This was an idea that was suggested to me by people who have seen odd behavior when trying to mix FILE handles and file descriptors together.

ASTERISK-20212_v2.patch contains the same change from ASTERISK-20212_v1.patch plus it changes the polling to only check for normal incoming data and not for priority data.

Please try ASTERISK-20212_v1.patch first and see if it appears to fix the problem. If it does, then that's great, we won't need the change from v2 at all. If you do experience the issue with v1, then revert and try v2 instead. Hopefully one of these will solve the issue.

If it turns out to solve the issue, then the patch I have provided will likely not go into the code as-is since it is designed solely to assist with SIP TCP connections. A better patch will need to be made which eliminates the use of FILE handles from all places in Asterisk where TCP/TLS connections are used.

By: Phil Ciccone (pciccone) 2012-08-26 15:04:19.147-0500

As of now, v1 is installed. We will see if it works. More to come....

By: Mark Michelson (mmichelson) 2012-09-18 09:29:43.926-0500

Haven't heard from you in a couple of weeks. How's the patch working for you?

By: Phil Ciccone (pciccone) 2012-09-18 10:21:00.053-0500

Mark, we have had the first patch in since my last comment with ZERO failures. I think at this point it is safe to say that fix will work (and is working).