Troubleshooting DNS-specific problems calls for techniques that produce definitive errors pointing to a problem with DNS records, DNS servers, or both.
Determining whether the problem is DNS-related is the first step. After that, testing can reveal an abnormality or error that provides clues to what is wrong.
The process can be broken down into three steps:
- Determining that a problem is DNS-related
- Determining the exact nature of the DNS problem
- Determining a fix and implementing it, if possible
Is It a DNS-Related Problem at All?
A user's initial diagnosis of a DNS problem on a single host computer can easily be wrong. Browser errors such as “host not found” and related messages can lead a user to think that name resolution is the problem when the actual problem is with the connection. It is critical to rule out connection problems before delving more deeply into DNS-related testing.
Listening carefully to how the user initially classifies the problem is paramount. Ask defining questions to learn why they concluded that they have a DNS-related problem. Determine whether the problem is seen on multiple computers or, conversely, whether some computers on the LAN/WAN are not experiencing it.
Begin testing at the command line, if possible. Use tools such as nslookup, traceroute (tracert on Windows) and ping to verify the computer's network availability and its ability to reach the Internet gateway. Verify the TCP/IP settings, especially the name servers in the active configuration. On Windows machines, the command ipconfig /all displays the configuration. Older machines (Windows 95/98/Me) use winipcfg instead.
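On Windows, the "DNS Servers" lines of ipconfig /all show which name servers are active, in the order they will be tried. The Python sketch below pulls those addresses out of saved output text, such as output a customer pastes into an e-mail. It assumes English-language output in which the first address follows the "DNS Servers . . . :" label and any further addresses appear alone on the indented lines below it. Other Windows versions and locales may differ, and the function name is hypothetical.

```python
import ipaddress


def dns_servers_from_ipconfig(text):
    """Return DNS server addresses found in `ipconfig /all` output, in order.

    Assumption: English output, where the first server follows the
    "DNS Servers ... :" label and extra servers sit alone on the
    following indented lines.
    """
    servers = []
    collecting = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("DNS Servers"):
            # The first address comes after the first colon on the label line.
            _, _, value = line.partition(":")
            collecting = True
            line = value.strip()
        elif not collecting:
            continue
        # A continuation line holds nothing but an address;
        # anything else ends the DNS Servers block.
        try:
            ipaddress.ip_address(line)
        except ValueError:
            collecting = False
            continue
        servers.append(line)
    return servers
```

The first address returned is the server an interactive nslookup session will contact by default.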
As the Windows 2000 architecture proliferates, and as the use of firewalls and NAT becomes the norm rather than the exception, more internal DNS servers are serving clients inside the network directly. Watch for private IP addresses in the configuration, and find out what the local name servers do and whether they are active. Remember that an interactive nslookup session using the default name server will first try to contact the first name server in the computer's TCP/IP configuration. If that initial attempt to reach a legitimate name server times out, it could indicate a constant or intermittent problem reaching that particular name server. That result could point to a connection problem inside the network or a malfunctioning internal name server.
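To tell whether one particular name server is reachable at all, it can help to send a single query straight to it with a short timeout, much as nslookup does with its default server. The standard-library Python sketch below does this. It assumes IPv4 and plain UDP on port 53, and build_query and probe are hypothetical helper names.

```python
import random
import socket
import struct


def build_query(name, qtype=1, txid=None):
    """Build a minimal DNS query message (RFC 1035) for `name`.

    qtype 1 asks for an A record. The RD (recursion desired) flag is set,
    as nslookup does by default.
    """
    if txid is None:
        txid = random.randint(0, 0xFFFF)
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE, then QCLASS 1 (IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server, name, timeout=2.0):
    """Send one UDP query to `server` port 53 and return the reply's RCODE.

    Returns None if nothing arrives within `timeout` seconds, the same
    symptom as an nslookup "timed out" message.
    """
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return None
    if len(reply) < 12 or reply[:2] != query[:2]:
        raise ValueError("reply does not match the query")
    # RCODE is the low four bits of the fourth header byte.
    return reply[3] & 0x0F
```

For example, probe("10.1.1.10", "www.more.net") queries a hypothetical internal server. A result of None points toward a reachability problem. A numeric RCODE means the server answered: 0 is NOERROR, 2 is SERVFAIL and 3 is NXDOMAIN.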
Firewalls or other devices that block ping add further complications. Confirm that ping actually passes through the network before treating a lack of ping replies as evidence of a problem.
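When ping may be filtered, one possible second opinion is a TCP connection attempt to a service the host is known to offer, such as a Web server on port 80. This is a swapped-in technique rather than part of the procedure above. The hypothetical Python helper below reports whether a TCP handshake completes. Pass it an IP address rather than a name, so that the result tests connectivity without depending on DNS.

```python
import socket


def tcp_reachable(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    A completed handshake shows the path is open even when ICMP (ping)
    is blocked. Timeouts, refusals and resolution errors all raise
    OSError subclasses and count as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A host that fails ping but passes this check is probably behind a filter that drops ICMP, not unreachable.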
A “best bet” is to find out exactly what level of connectivity the user currently has on a machine that is having the problem, and to identify the name server it uses. Then duplicate those actions on the troubleshooter's computer and watch for differences.
Initial questions that may be helpful:
- When did the problem start?
- Was anything being done to the network before this difficulty began (including a consultant’s visit)?
- Has a firewall been installed lately or has the configuration of a firewall been altered?
- How many computers are experiencing the problem?
- Are the problems always manifesting in the same way?
- Can you successfully pull up a page in a browser by using an IP address directly (for example, http://18.104.22.168/ should bring up the Missouri Department of Higher Education site)?
- Has this problem ever happened before?
- Have you requested any DNS changes at MOREnet lately?
It is important to remember:
- MOREnet servers are not infallible. Their DNS services could fail or otherwise be partially unreliable. They could also have received bad data from root server(s). There may be a partial or total outage at a hub site.
- Watch for the customer who does not take troubleshooting directions well. Watch for customers who contradict themselves. Get them to send results (traceroutes, nslookup results) by e-mail if necessary.
- In the case of DNS record changes made by MOREnet, study the results carefully, because a single typo can cause a malfunction. The serial number in the zone's SOA record must be incremented and the name server reloaded or restarted before the change will propagate properly. Check with the DNS administrator in Core Network about the progress of any DNS change requests.
- In testing, do not assume that a third-party or customer name server will answer queries sent to it directly in an nslookup session. It may be configured to answer only for a specific range or ranges of IP addresses.
- There could be security reasons why one network or computer on a network may not be able to access another. Make sure that e-mail is not getting a “relay denied” message. Check for internal block lists at MOREnet.
- Domain names that have only MX records will not resolve in ping or traceroute, because those tools look up A records. Only the related A records (for example, the mail host named in the MX record) will resolve.
- Being able to ping or trace to a name and IP address at the command line, while being unable to bring up the website by name in a browser, can indicate the use of a Web proxy. Many customers may not be aware that they are using one. Check the browser configuration to confirm or rule out a proxy.
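The serial-number rule in the list above is easy to get wrong by hand, since secondary servers only pull a new copy of the zone when they see a larger serial. Many administrators use a date-based YYYYMMDDnn serial. That convention is an assumption here, not something the original procedure requires. The hypothetical Python helper below computes a next serial that always increases.

```python
import datetime


def next_serial(current, today=None):
    """Return the next SOA serial under the common YYYYMMDDnn convention.

    If today's date-based serial (ending in 00) is larger than the
    current serial, use it; otherwise add one, so the serial always
    increases even after several edits on the same day.
    """
    if today is None:
        today = datetime.date.today()
    base = int(today.strftime("%Y%m%d")) * 100
    return max(base, current + 1)
```

The same-day branch matters: editing a zone twice on one day without bumping the two-digit counter leaves the serial unchanged, and the second change never propagates.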
At the end of this phase, you should be reasonably sure either that the problem is connection- or security-related (and have switched troubleshooting tactics accordingly) or that these areas are working and DNS is the actual cause.
Determining the Exact Nature of the DNS Problem
Determining the exact nature of the problem is detective work. Try to locate an anomaly or uncover a distinct error. Keeping good records during the process helps avoid repeating queries.
The test boils down to two elements:
- The results of a query on the domain or host name, and
- Exactly where that information is coming from.
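For the second element, the header of a DNS reply carries useful clues. The AA bit says whether the responding server is authoritative for the name; nslookup reports its absence as "Non-authoritative answer". The RCODE distinguishes a server failure from a nonexistent name. A minimal Python decoder for the RFC 1035 header is sketched below. It assumes you already have the raw reply bytes, for example from a UDP query, and describe_reply is a hypothetical name.

```python
import struct

# Response codes defined in RFC 1035 (others are shown as numbers).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN",
          4: "NOTIMP", 5: "REFUSED"}


def describe_reply(reply):
    """Summarize the 12-byte header of a DNS reply (RFC 1035, section 4.1.1)."""
    _, flags, _, ancount, _, _ = struct.unpack("!HHHHHH", reply[:12])
    rcode = flags & 0x000F
    return {
        # AA: the answer came from a server authoritative for the zone.
        "authoritative": bool(flags & 0x0400),
        # TC: the reply was cut short; the query should be retried over TCP.
        "truncated": bool(flags & 0x0200),
        # RA: the server offers recursive resolution.
        "recursion_available": bool(flags & 0x0080),
        "rcode": RCODES.get(rcode, str(rcode)),
        "answers": ancount,
    }
```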
Look for differences, for example between MOREnet and non-MOREnet results, among different third-party servers, and between results from inside and outside a firewall.
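When the same query has been run against several servers, grouping the servers by the answer each returned makes disagreements stand out. The hypothetical Python helper below does this grouping. The server labels and NS names in the usage note are illustrative input only, not real results.

```python
def find_disagreements(results):
    """Group servers by the answer they gave for the same query.

    `results` maps a server label to the set of records it returned
    (an empty set for no answer). The return value maps each distinct
    answer to the servers that gave it; more than one key means the
    servers disagree.
    """
    groups = {}
    for server, records in results.items():
        groups.setdefault(frozenset(records), []).append(server)
    return groups
```

For example, if two resolvers return ns1.ikky.com and ns2.ikky.com for a domain while a third returns ns1.hellokitty.com and ns2.hellokitty.com, the result has two keys. That points to two sets of name servers carrying records for the same domain.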
Use a tool like Squish’s DNS Checker (http://www.squish.net/dnscheck/) to get an outside, comparative evaluation of a specific domain problem.
Note that DNS uses both UDP and TCP on port 53. If all DNS resolution has stopped after a new firewall was installed or an existing firewall was altered, have the customer double-check that this traffic is allowed. UDP is almost always used for regular query/response traffic. TCP is used mainly for zone transfers, for talking to some debugging tools and for regular queries whose UDP responses were truncated (that is, when all the records won't fit in a UDP response).
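Over TCP, each DNS message is preceded by a two-byte length field (RFC 1035, section 4.2.2). A resolver switches to TCP when a UDP reply arrives with the TC (truncated) bit set. The small Python helpers below sketch both pieces; the function names are hypothetical. They show why a firewall that passes UDP port 53 but blocks TCP port 53 can break only those lookups with large answers.

```python
import struct


def needs_tcp_retry(udp_reply):
    """True if the TC (truncated) bit is set in a UDP reply's header."""
    (flags,) = struct.unpack("!H", udp_reply[2:4])
    return bool(flags & 0x0200)


def frame_for_tcp(message):
    """Prefix a DNS message with its two-byte length, as DNS over TCP requires."""
    return struct.pack("!H", len(message)) + message


def unframe_tcp(stream):
    """Extract the first length-prefixed DNS message from TCP stream bytes."""
    (length,) = struct.unpack("!H", stream[:2])
    return stream[2:2 + length]
```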
Example: The customer says that the entire network cannot get to www.blogga.com. They must reach this website for student updates.
Facts uncovered: They are seeing no abnormalities with any other domain or IP address. All their other services are working normally.
Reasonable assessment: This is not a connection problem.
They can neither reach www.blogga.com nor send mail to blogga.com.
Queries time out.
This started two days ago. The customer has made no changes in the network.
They can send mail to blogga.com and reach its website from their Mediacom broadband connection.
Testing uncovered: I also cannot connect to www.blogga.com. I am using University servers.
They are using MOREnet DNS for resolution. The customer's default server, current.coreserv.more.net, cannot resolve www.blogga.com. The name servers for blogga.com point to ns1.ikky.com and ns2.ikky.com. When ns1.ikky.com is queried directly, it replies with a server failure.
Reasonable assessment: Something is amiss with blogga.com’s name resolution.
The customer cannot tell us which name servers their Mediacom connection uses, but an NS query for mediacom.com reveals ns.mediacom.com. In an interactive nslookup session with ns.mediacom.com as the server, I find that it answers my queries and returns correct information for names such as www.more.net.
Through that server, I request the NS records for blogga.com. This time, the name servers returned are ns1.hellokitty.com and ns2.hellokitty.com.
I switch to a session with ns1.hellokitty.com. This time, it resolves www.blogga.com as 22.214.171.124.
When I browse to that IP address directly, the www.blogga.com site comes up.
Reasonable assessment: Two providers are carrying records for blogga.com.
Fixing the Problem
MOREnet can fix only what is within its ability to fix. Be clear about what needs to be added or changed.
A rare but possible problem that MOREnet can fix may masquerade as someone else's problem. The customer may supply a clue: a number of third-party services resolve the address successfully, while only MOREnet's servers show no resolution. The cause could be stranded, incorrect records for a domain for which MOREnet served as a secondary. Suppose the domain owner moved the service, other name servers now resolve the domain, and MOREnet was never informed of the change. In that case, MOREnet's name servers may still be expecting zone transfers from a server that no longer holds the records. A Squish evaluation may reveal this stranded relationship.
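One way to spot a stranded secondary is to ask each listed name server for the zone's SOA record and compare serial numbers. A secondary still waiting on a master that no longer holds the zone may keep reporting an old serial. Serial numbers wrap around at 2^32, so the sketch below compares them with the sequence-space arithmetic of RFC 1982 rather than a plain "less than". It assumes the serials have already been collected, for example with nslookup -type=SOA against each server. The function names and server names are hypothetical.

```python
def serial_lt(a, b):
    """RFC 1982 comparison: True if 32-bit serial `a` is older than `b`."""
    return a != b and ((b - a) % 2**32) < 2**31


def stale_servers(serials):
    """Return servers whose SOA serial is older than the newest one seen.

    `serials` maps a server name to the SOA serial it reports for the zone.
    """
    newest = None
    for s in serials.values():
        if newest is None or serial_lt(newest, s):
            newest = s
    return sorted(name for name, s in serials.items() if serial_lt(s, newest))
```

A single server lagging far behind the others, while third-party resolvers agree with the newest serial, fits the stranded-secondary pattern described above.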
Confirm your findings with MOREnet’s DNS expert in the Network Core group.
Adjustments to firewalls and proxies can include opening ports for queries and zone transfers. Host file entries may need to be added or modified when dealing with access issues from inside a NAT network.
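Before adding a hosts file entry, it helps to check what the file already says about a name. On typical systems, an entry there takes precedence over DNS for that machine. The file locations mentioned here are not from the original text: %SystemRoot%\system32\drivers\etc\hosts on Windows NT-family systems, /etc/hosts on Unix. The small Python parser below, with a hypothetical name, reads the common layout of an address followed by one or more names, with "#" starting a comment.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to the first address listed for it.

    Blank lines and comments (anything after '#') are ignored; names are
    compared case-insensitively, and the first entry for a name wins.
    """
    entries = {}
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        address, *names = fields
        for name in names:
            entries.setdefault(name.lower(), address)
    return entries
```

For instance, a line such as "10.0.0.20 www.example.org" (hypothetical addresses) would point an inside machine at the private address of a server that is published through NAT.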