ASTERISK-30381: res_resolver_unbound: Using unbound, queries do not try all available nameservers, and contacts will flap


Summary: ASTERISK-30381: res_resolver_unbound: Using unbound, queries do not try all available nameservers, and contacts will flap

Reporter: Mark Murawski (kobaz)
Labels:
Date Opened: 2022-12-28 20:57:54.000-0600
Date Closed: 2023-01-13 12:00:01.000-0600
Priority: Minor
Regression?
Status: Closed/Complete
Components: Resources/res_resolver_unbound
Versions: 18.15.1 19.7.1 20.0.1
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:

Description: With what is probably a fairly standard DNS server list (a local DNS server plus some backups), the unbound DNS resolver produces non-deterministic lookup failures.

Given resolv.conf:
{code}
options attempts:3 timeout:1
nameserver 192.168.5.2
nameserver 4.2.2.2
nameserver 8.8.8.8
{code}

Given resolver_unbound.conf:
{code}
[general]
hosts = /etc/hosts
resolv = /etc/resolv.conf
{code}

Given pjsip_wizard.conf:
{code}
[wombat]
type = wizard
remote_hosts = foo.vpn.lan
aor/qualify_frequency = 60
aor/qualify_timeout = 2000
{code}

You wind up with contacts flapping between reachable and unreachable because of DNS, not because SIP OPTIONS went unanswered. (The foo.vpn.lan host was responding to SIP OPTIONS this entire time; only the DNS lookups failed intermittently.)
{code}
Contact wombat/sip:foo.vpn.lan is now Reachable. RTT: 37.946 msec
Contact wombat/sip:foo.vpn.lan is now Unreachable. RTT: 0.000 msec
Contact wombat/sip:foo.vpn.lan is now Reachable. RTT: 37.946 msec
Contact wombat/sip:foo.vpn.lan is now Unreachable. RTT: 0.000 msec
Contact wombat/sip:foo.vpn.lan is now Reachable. RTT: 37.946 msec
Contact wombat/sip:foo.vpn.lan is now Unreachable. RTT: 0.000 msec
{code}

The reason for this is twofold:
Unbound does not query more than one DNS server to get the result for a given request.
Unbound does not respect the order of DNS servers in /etc/resolv.conf.

Unbound debug logging shows the DNS server order:
{code}
[pid 10346] write(2, "[1672280502] libunbound[8890:0] info: DelegationPoint<.>: 0 names (0 missing), 3 addrs (0 result, 3 avail) parentNS\n", 116) = 116
[pid 10346] getpid() = 8890
[pid 10346] write(2, "[1672280502] libunbound[8890:0] debug: ip4 8.8.8.8 port 53 (len 16)\n", 71) = 71
[pid 10346] getpid() = 8890
[pid 10346] write(2, "[1672280502] libunbound[8890:0] debug: ip4 4.2.2.2 port 53 (len 16)\n", 71) = 71
[pid 10346] getpid() = 8890
[pid 10346] write(2, "[1672280502] libunbound[8890:0] debug: ip4 192.168.5.2 port 53 (len 16)\n", 75) = 75
[pid 10346] getpid() = 8890
[pid 10346] write(2, "[1672280502] libunbound[8890:0] debug: attempt to get extra 3 targets\n", 70) = 70
{code}

Take this example:
{code}
Timestamp 12:00:00: DNS lookup of foo.vpn.lan using 8.8.8.8 .. fails because vpn.lan exists only on 192.168.5.2 .. locally cached DNS for the endpoint contact is deleted, host marked unreachable
Timestamp 12:01:00: DNS lookup of foo.vpn.lan using 4.2.2.2 .. fails because vpn.lan exists only on 192.168.5.2 .. locally cached DNS for the endpoint contact is deleted, host marked unreachable
Timestamp 12:02:00: DNS lookup of foo.vpn.lan using 192.168.5.2 .. success! endpoint DNS is stored, host is marked reachable
Timestamp 12:03:00: DNS lookup of foo.vpn.lan using 4.2.2.2 .. fails because vpn.lan exists only on 192.168.5.2 .. locally cached DNS for the endpoint contact is deleted, host marked unreachable
Timestamp 12:04:00: DNS lookup of foo.vpn.lan using 8.8.8.8 .. fails because vpn.lan exists only on 192.168.5.2 .. locally cached DNS for the endpoint contact is deleted, host marked unreachable
{code}
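
The timeline above can be reproduced with a toy model. This is only a minimal Python sketch, not Unbound's actual server selection logic: the `AUTHORITATIVE` table and the `lookup` and `qualify_states` helpers are hypothetical names made up for illustration, and the per-cycle server choice is simply copied from the timeline.

```python
# Hypothetical model: which servers can answer for which name.
# Only the local server (192.168.5.2) knows the vpn.lan zone.
AUTHORITATIVE = {"foo.vpn.lan": {"192.168.5.2"}}

def lookup(name, server):
    """Return True if `server` can resolve `name` in this toy model."""
    return server in AUTHORITATIVE.get(name, set())

def qualify_states(name, servers_per_cycle):
    """One qualify cycle per entry: a single server is asked, with no fallback."""
    return [
        "Reachable" if lookup(name, server) else "Unreachable"
        for server in servers_per_cycle
    ]

# Server asked at each 60-second qualify cycle, as in the timeline above.
cycle = ["8.8.8.8", "4.2.2.2", "192.168.5.2", "4.2.2.2", "8.8.8.8"]
states = qualify_states("foo.vpn.lan", cycle)
for server, state in zip(cycle, states):
    print(f"{server:12} -> {state}")
```

Because only one server is asked per cycle, the contact state depends entirely on which server happens to be picked, even though the host itself never goes away.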

If you change resolver_unbound.conf to the following:
{code}
[general]
hosts = /etc/hosts
nameserver = 192.168.5.2
{code}

This does not fix the issue. Unbound does not treat this as the full nameserver list and still uses the 3 nameservers specified in /etc/resolv.conf.

The ideal behavior here would be:
1) Don't treat a contact as unreachable if the DNS suddenly fails, but SIP OPTIONS is still working to the last-known IP
2) Try all DNS servers until we get a successful lookup, or all servers have failed lookups
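
Item 2 could be sketched roughly as follows. This is an illustration in Python under stated assumptions, not how res_resolver_unbound or libunbound work: `resolve_trying_all`, `LookupFailed` and `fake_query` are hypothetical names, and the address 10.8.0.12 is made up for the example.

```python
class LookupFailed(Exception):
    """Raised by a query function when one server gives no usable answer."""

def resolve_trying_all(name, servers, query):
    """Ask each server in order and return the first successful answer.

    `query(name, server)` is a caller-supplied function (hypothetical here)
    that returns a list of addresses or raises LookupFailed.
    """
    errors = {}
    for server in servers:
        try:
            return query(name, server)
        except LookupFailed as exc:
            errors[server] = str(exc)
    # Every server was tried and none produced an answer.
    raise LookupFailed(f"{name}: all servers failed: {errors}")

def fake_query(name, server):
    """Toy stand-in for a real DNS client: only 192.168.5.2 knows the name."""
    if server == "192.168.5.2" and name == "foo.vpn.lan":
        return ["10.8.0.12"]  # made-up address
    raise LookupFailed("NXDOMAIN")

servers = ["8.8.8.8", "4.2.2.2", "192.168.5.2"]
print(resolve_trying_all("foo.vpn.lan", servers, fake_query))
```

With this behavior the lookup succeeds regardless of which server is asked first, at the cost of extra queries whenever the earlier servers have no record.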

The only workaround for this is to noload res_resolver_unbound.so.

Comments:

By: Asterisk Team (asteriskteam) 2022-12-28 20:57:55.154-0600

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution. Please note that log messages and other files should not be sent to the Sangoma Asterisk Team unless explicitly asked for. All files should be placed on this issue in a sanitized fashion as needed.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

Please note that by submitting data, code, or documentation to Sangoma through JIRA, you accept the Terms of Use present at [https://www.asterisk.org/terms-of-use/|https://www.asterisk.org/terms-of-use/].

By: Joshua C. Colp (jcolp) 2022-12-29 03:55:30.651-0600

{code}
[general]
hosts = /etc/hosts
nameserver = 192.168.5.2
{code}

This doesn't do what you think it does. The default setting[1] for "resolv" is "system", which means /etc/resolv.conf, so you would need to set it to an empty value for it not to be used. Does this work?

The nameservers in resolv.conf are not a list of primary and backup. Modern DNS clients will round-robin/balance across the given list, failing over if a nameserver is unreachable. Explicitly configuring the list using "nameserver" will do primary/backup.

I also fundamentally disagree with your DNS configuration. A list of name servers should have all addresses resolvable, not some, otherwise you have to do everything you're asking for - have the resolver client try to solve your problem, and if the client doesn't behave exactly as you need then this happens. What you're actually doing is split DNS, which should be done by a local caching server that is configured to send queries to the appropriate place based on the domain.

I don't know what your "1" means. As for "2" that's completely dependent on the unbound DNS client library. While the ability to set further configuration isn't implemented in res_resolver_unbound we can at least look to see if that would be a viable option[2][3]. Looking through the options I don't see any which would cause it to behave as you describe in "2".

[1] https://github.com/asterisk/asterisk/blob/18/configs/samples/resolver_unbound.conf.sample#L10
[2] https://unbound.docs.nlnetlabs.nl/en/latest/manpages/libunbound.html
[3] https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html

By: Mark Murawski (kobaz) 2022-12-29 08:22:30.706-0600

Thanks for your insight, it's always not what I expected!

My '1':
1) Don't treat a contact as unreachable if the DNS suddenly fails, but SIP OPTIONS is still working to the last-known IP

Means this:
The contact reachability is flapping between reachable and unreachable based on DNS failures, but the contact itself never went down or was otherwise unavailable. It looks like PJSIP treats the contact as unreachable whenever a DNS lookup that previously worked suddenly fails.

If the DNS lookup fails, but we have a last-known-address for this contact, then it shouldn't flap the contact based on the DNS failure. We have a good DNS resolution of the contact's address from the last successful lookup. It should keep using that address to send SIP OPTIONS, and only if SIP OPTIONS fails to come back with an OK, only at that point should PJSIP mark the contact as unreachable. Or maybe add an option to behave as such. I'm having a hard time thinking of a use case for treating (most likely) temporary failures in DNS resolution as a hard-down for the contact, even when the contact is alive and well.. considering if the contact had a hard-coded IP, then all would be well.
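
The behavior described above could look roughly like this. It is a minimal Python sketch of the idea only, not PJSIP or Asterisk code: `ContactTracker` and the `resolve` and `send_options` callables are hypothetical, the address 10.8.0.12 is made up, and TTL handling is deliberately left out.

```python
class ContactTracker:
    """Keeps the last successfully resolved address for one contact."""

    def __init__(self, resolve, send_options):
        # Both callables are hypothetical stand-ins supplied by the caller:
        #   resolve(host) -> address string, or None when the lookup fails
        #   send_options(address) -> True when a 200 OK comes back
        self.resolve = resolve
        self.send_options = send_options
        self.last_good = None

    def qualify(self, host):
        """Run one qualify cycle and return (state, warning)."""
        warning = None
        address = self.resolve(host)
        if address is None:
            if self.last_good is None:
                return "Unreachable", "DNS lookup failed and no cached address"
            # Fall back to the cached address, but raise an alarm.
            address = self.last_good
            warning = f"DNS lookup failed, reusing cached address {address}"
        if self.send_options(address):
            self.last_good = address
            return "Reachable", warning
        return "Unreachable", warning

# Toy run: the first lookup succeeds, the next two fail,
# while the host keeps answering OPTIONS throughout.
answers = iter(["10.8.0.12", None, None])
tracker = ContactTracker(lambda host: next(answers),
                         lambda address: address == "10.8.0.12")
results = [tracker.qualify("foo.vpn.lan") for _ in range(3)]
for state, warning in results:
    print(state, "|", warning)
```

In this sketch the contact only becomes unreachable when OPTIONS itself fails, and each DNS failure still surfaces as a warning.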

Rationale: My go-to theory of failure handling is that the system should try all reasonable available options to continue operating, including keeping last-known-good data. As long as that last-known-good data still works, keep on chugging until you can get new data, and throw alarms in the meantime, letting the user determine how to handle this.

By: Joshua C. Colp (jcolp) 2022-12-29 08:44:27.744-0600

If you want to explore it further then go ahead but this is NOT an easy thing and is full of traps. For example what if the underlying DNS record changes and now you're using stale information? Do you obey the TTL? Then you're just buying yourself time until failure, unless you also do a DNS Lookup and use the old information - but should that be configurable and for how long? That information also isn't used when dialling. That's a fresh DNS lookup along with any SRV/NAPTR so failover and load balancing occurs - so now does the OPTIONS cached information also get used for other things such as sending an INVITE? Do you cache all the results? What about the load balancing I mentioned? For how long, again?

It's a lot of knobs and configuration.

I also think "failure" is overloaded. The DNS server didn't fail, but the lookup process resulted in no records. If a DNS server does fail then it will go to an alternate.
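
To illustrate the distinction, here is a small Python sketch of failover that only reacts to an unresponsive server. It is a simplified model, not the logic of libunbound or any real resolver: `Outcome`, `resolve_with_failover` and `make_ask` are hypothetical names, and the address 10.8.0.12 is made up.

```python
from enum import Enum

class Outcome(Enum):
    ANSWER = "answer"            # server replied with records
    NO_RECORDS = "no records"    # server replied, but has nothing for the name
    NO_RESPONSE = "no response"  # server timed out or was unreachable

def resolve_with_failover(name, servers, ask):
    """Fail over only when a server does not respond at all.

    `ask(name, server)` is a hypothetical caller-supplied function that
    returns an (Outcome, records) tuple.
    """
    for server in servers:
        outcome, records = ask(name, server)
        if outcome is Outcome.NO_RESPONSE:
            continue  # a real server failure: move on to the alternate
        # Any reply, including "no records", ends the lookup.
        return server, outcome, records
    return None, Outcome.NO_RESPONSE, []

def make_ask(down=()):
    """Build a toy `ask` in which only 192.168.5.2 knows foo.vpn.lan."""
    def ask(name, server):
        if server in down:
            return Outcome.NO_RESPONSE, []
        if server == "192.168.5.2" and name == "foo.vpn.lan":
            return Outcome.ANSWER, ["10.8.0.12"]  # made-up address
        return Outcome.NO_RECORDS, []
    return ask

servers = ["8.8.8.8", "4.2.2.2", "192.168.5.2"]
healthy = resolve_with_failover("foo.vpn.lan", servers, make_ask())
outage = resolve_with_failover(
    "foo.vpn.lan", servers, make_ask(down={"8.8.8.8", "4.2.2.2"})
)
print(healthy)
print(outage)
```

In the first call the public server replies with no records, so the lookup ends there. In the second call the public servers are unresponsive, so the lookup moves on and reaches the local server.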

This isn't something the Asterisk team at Sangoma will look into.

By: Asterisk Team (asteriskteam) 2023-01-13 12:00:00.962-0600

Suspended due to lack of activity. This issue will be automatically re-opened if the reporter posts a comment. If you are not the reporter and would like this re-opened please create a new issue instead. If the new issue is related to this one a link will be created during the triage process. Further information on issue tracker usage can be found in the Asterisk Issue Guidelines [1].

[1] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines