
Summary: ASTERISK-02880: endless loop due to ast_search_dns() taking too long
Reporter: gkempke (gkempke)
Labels:
Date Opened: 2004-11-24 10:08:37.000-0600
Date Closed: 2011-06-07 14:00:19
Priority: Blocker
Regression?: No
Status: Closed/Complete
Components: Core/General
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments: ( 0) dns_diff
( 1) dns_diff.txt
Description: I have several "register" lines in my sip.conf. Each contains a host to be looked up during transmit_register(). On my system (Linux SuSE 9.1, PII 233MHz) ast_search_dns() takes about 10 seconds. Because the retransmit timeout for a register is 20 seconds, by the time all three registers have been sent, the first timeout has already expired. This leads to an endless loop in ast_sched_runq(), because each re-sent register calls ast_search_dns() again, and by the time it returns the next timeout is due.

****** ADDITIONAL INFORMATION ******

Suggestion (don't know if correct but works here):
--- dns.c       2004-06-22 22:11:15.000000000 +0200
+++ ../../asterisk/asterisk/dns.c       2004-11-24 17:50:51.863916752 +0100
@@ -169,15 +169,19 @@
#endif
       char answer[MAX_SIZE];
       int res, ret = -1;
+       static int is_init = 0;

#ifdef HAS_RES_NINIT
-       res_ninit(&dnsstate);
+       if (!is_init)
+               res_ninit(&dnsstate);
       res = res_nsearch(&dnsstate, dname, class, type, answer, sizeof(answer));
#else
       ast_mutex_lock(&res_lock);
-       res_init();
+       if (!is_init)
+               res_init();
       res = res_search(dname, class, type, answer, sizeof(answer));
#endif
+       is_init = 1;
       if (res > 0) {
               if ((res = dns_parse_answer(context, class, type, answer, res, callback)) < 0) {
                       ast_log(LOG_WARNING, "Parse error\n");
@@ -190,12 +194,7 @@
               else
                       ret = 1;
       }
-#ifdef HAS_RES_NINIT
-       res_nclose(&dnsstate);
-#else
-#ifndef __APPLE__
-       res_close();
-#endif
+#ifndef HAS_RES_NINIT
       ast_mutex_unlock(&res_lock);
#endif
       return ret;

Comments:

By: Brian West (bkw918) 2004-11-24 10:15:54.000-0600

Actually you're barking up the WRONG tree here...  dns.c isn't used in this case.  I think if you disable SRV lookups you might solve the problem.  But we use ast_gethostbyname and unless you find us a non-blocking dns resolver lib this can't really be fixed.

bkw

edited on: 11-24-04 17:05

By: Brian West (bkw918) 2004-11-24 10:20:55.000-0600

Yep, turn off SRV lookups if you have them on; that's the ONLY place where ast_search_dns would EVER be called on a register. If it's blocking during the ast_gethostbyname then maybe you need a faster box or a better DNS server. Granted, this whole blocking on gethostbyname has been known for a long time; we just don't have a free asynchronous resolver lib that we can use, so we have to live with it or write one from scratch.

By: gkempke (gkempke) 2004-11-24 10:25:59.000-0600

The problem is not the time it takes to look up the host...
The problem is the time it takes to initialize the resolver, and that is done every time
ast_search_dns is called. Why not leave the resolver initialized for subsequent calls (as I have done now)?

By: Brian West (bkw918) 2004-11-24 10:27:30.000-0600

But the only time this code would exec is if you have srvlookup on.  Otherwise the code in question would be ast_gethostbyname.  And attach your diff please.

bkw

By: gkempke (gkempke) 2004-11-24 10:53:48.000-0600

srvlookup was enabled by default (make samples).
I've attached a diff. As I said, I don't know if it causes any ugly side effects, but it solves my problem.

By: Mark Spencer (markster) 2004-11-24 11:44:46.000-0600

The whole point of this code in here is to have the SRV lookups be reentrant and fast.  Your code change would defeat that by using a single resolver state, not protected by a mutex, for all lookups.
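
For readers following the reentrancy objection: one way to initialize once without sharing state between threads is to keep the state per thread. The sketch below is only an illustration under assumptions, not the Asterisk code. `struct resolver_state`, `expensive_init()`, `get_thread_state()` and `lookup()` are hypothetical stand-ins (the real calls would be `res_ninit()` and `res_nsearch()`), and it assumes a C11 compiler with `_Thread_local` support.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the resolver state that res_ninit() fills in. */
struct resolver_state {
    int ready;
};

/* Counts how often the expensive setup ran. Demonstration only: this
 * counter is shared and is not protected against concurrent updates. */
static int init_calls = 0;

/* Hypothetical stand-in for the expensive initialization step. */
static void expensive_init(struct resolver_state *st)
{
    init_calls++;
    st->ready = 1;
}

/*
 * Each thread gets its own zero-initialized state. It is set up on the
 * first call in that thread and reused afterwards, so the lookup path
 * needs no lock and never touches another thread's state.
 */
static struct resolver_state *get_thread_state(void)
{
    static _Thread_local struct resolver_state state;

    if (!state.ready)
        expensive_init(&state);
    return &state;
}

/* Hypothetical lookup: returns 1 on success, -1 on a NULL name. */
static int lookup(const char *name)
{
    struct resolver_state *st;

    if (name == NULL)
        return -1;
    st = get_thread_state();
    assert(st->ready);  /* the state must be usable at this point */
    return 1;           /* a real version would query the resolver here */
}
```

Calling `lookup()` three times from one thread runs `expensive_init()` once. The trade-off is that a state kept for the life of a thread is never closed in this sketch, and it may not notice later changes to the resolver configuration unless it is reinitialized.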

By: Mark Spencer (markster) 2004-11-24 12:18:31.000-0600

Did turning off the SRV lookup make the problem go away?

By: gkempke (gkempke) 2004-11-25 02:26:29.000-0600

Disabling srvlookups does indeed fix the problem.
Nonetheless the bug should be fixed, I think. If reentrance is an issue here then maybe a sanity check in ast_sched_runq() would be a better solution (like breaking out of the loop after the loop has run for more than 1 second, for example)?
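
The sanity check suggested above could look roughly like the following. This is a simplified model, not the real ast_sched_runq(): `struct task`, `earliest()`, `run_due_tasks()` and the `fake_now` clock are all made up for illustration, and the callback is simulated by advancing the fake clock by the time a slow lookup would block.

```c
#include <assert.h>
#include <stddef.h>

#define NTASKS 3

/* Fake clock in seconds, so the example is deterministic. */
static long fake_now = 0;

struct task {
    long due;       /* absolute time at which the task should run */
    long cost;      /* seconds the callback blocks, e.g. a slow lookup */
    long interval;  /* delay before the task becomes due again */
};

/* Return the index of the task with the earliest due time. */
static size_t earliest(const struct task *tasks, size_t n)
{
    size_t best = 0;
    size_t i;

    assert(n > 0);
    for (i = 1; i < n; i++)
        if (tasks[i].due < tasks[best].due)
            best = i;
    return best;
}

/*
 * Run every task that is already due, but stop once more than `budget`
 * seconds have passed inside this call. Returns the number of callbacks
 * that were run.
 */
static int run_due_tasks(struct task *tasks, size_t n, long budget)
{
    long start = fake_now;
    int ran = 0;

    for (;;) {
        size_t i = earliest(tasks, n);

        if (tasks[i].due > fake_now)
            break;                      /* nothing left that is due */
        fake_now += tasks[i].cost;      /* simulate the blocking callback */
        tasks[i].due = fake_now + tasks[i].interval;
        ran++;
        if (fake_now - start > budget)
            break;                      /* sanity check: budget used up */
    }
    return ran;
}
```

With three tasks due at 30, 40 and 50 seconds, a cost of 10 and an interval of 20, there is always another task due when a callback returns, so without the budget check the loop in this model never exits. With a budget of 1 second it returns after the first callback. The guard only hands control back to the caller; the timeouts themselves still expire faster than they can be served.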

Gunnar

By: Mark Spencer (markster) 2004-11-25 13:17:03.000-0600

As initially suspected, this is a configuration issue, not a bug.  There's no reasonable way to work around that kind of problem.