Hey y’all. Wanted to document some of the stranger bits I’ve encountered while running Kubernetes with one of my clients. We’ve finally got some decent sized clusters running in their environment and they’re being heavily utilized by developers, as they push new or rewritten services into the cluster. Win! That said, we got some complaints about the network performance these guys were seeing. It sounded like intra-cluster communication was working well, but trying to connect to other systems outside of the cluster or things on the public internet were really slow. Like anywhere between 4-10 seconds to resolve the names. Uh oh. Here’s some of what we did to help work around that, as well as how we figured it out.
So we had traditionally just been deploying the official KubeDNS deployment that is part of the Kubernetes repo. Or, rather, we were using the one that Kargo deploys, which is just a copy of the former. We’ll still be using that as our basis. It’s also important to note that the pod that’s deployed is 3 containers: kubedns, dnsmasq, and a sidecar for health checking. The names of these seem to have changed very recently, but just know that the important ones are kubedns and dnsmasq.
The flow is basically this:
- A request for resolution inside the cluster is directed to the kubedns service
- The dnsmasq container is the first that receives the request
- If the request is
in-addr.arpa, or similar, it is forwarded to the kubedns container for resolution.
- If it’s something else, dnsmasq container queries the upstream DNS that’s present in its /etc/resolv.conf file.
So, while all of the above seemed to be working, it was just slow. The first thing I tried to do was see if queries were making it to the dnsmasq container in a timely fashion. I dumped the logs with
kubectl logs -f --tail 100 -c dnsmasq -n kube-system kubedns-xxxyy. I noticed quickly that there weren’t any logs of interest here:
I needed to enable log-queries:
- You can do this by editing the RC with
kubectl edit rc -n kube-system kubedns.
- Update the flags under the
dnsmasqcontainer to look like the following:
- Bounce the replicas with:
Once the new pod is online you can then dump the logs again. You should see lots of requests flowing through, even on a small cluster.
WTF Is That?
So now that I had some logs online, I started querying from inside of a pod. The first thing I ran was something like
time nslookup kubedns.kube-system.svc.cluster.local to just simply look up something internal to the cluster. As soon as I did that, I saw a TON of queries and, while it eventually resolved, it was searching every. single. possible. name.
Once I did this, I tried an exteral name to see similar results and a super slow lookup time:
What’s happening? It’s the ndots. KubeDNS is hard coded with an ndots value of 5. This means that any request for resolution that contains fewer than 5 dots will cycle through all of the search domains as well in an attempt to resolve. You can see both of these by dumping the /etc/resolv.conf file from the dnsmasq container:
It turns out that this is kind of a known issue within KubeDNS if your google-fu is strong enough to find it. Here’s a couple of good links for some context:
Duct Tape It!
Okay, so from what I was reading, it looked like there wasn’t a good consensus on how to fix this, even though an ndots of 3 would have mostly resolved this issue for us. Or at least sped things up enough that we would have been okay with it. And yet, here we are. So we’ve got to speed this up somehow.
I started reading a bit more about dnsmasq and how we could avoid searching for all of those domain names when we know they don’t exist. Enter the
address flag. This is a dnsmasq flag that you can use to return a defined IP to any request that matches the listed domains. But, if you don’t provide the IP it simply returns an NXDOMAIN very quickly and thus doesn’t bother forwarding requests up to kubedns or your upstream nameserver. This wound up being the biggest part of our fix. The only real pain in the butt is that you have to list all the domains you want to catch. We gave it a good shot, but I’m sure there’s more that could be listed. A minor extra is the
--no-negcache flag. Because we’re sending so many NXDOMAIN responses around, we don’t want to cache them because it’ll eat our whole cache.
The other big part to consider is the
server flag. This one allows us to specify for a given domain which DNS server should be queried. This seems to actually have been added into the master branch of Kubernetes now as well.
So here’s how to fix it:
- Edit the dnsmasq args to look like the following:
- You may find that you want to add more domains as they are relevant to you. We’ve got some internal domains in the address block that aren’t listed here.
- Notice the last server flag. It should point to your upstream DNS server. You can also supply several of these flags if necessary.
- Also note that you may not need to worry about the
compute.internaldomains unless you’re in AWS.
- Bounce the replicas again:
That’s it! Hope this helps someone. It really sped up the request time for us. All requests respond in fractions of a second now it seems. I fought with this for a while, but at least had a chance to learn a bit more about how DNS works both inside and outside of Kubernetes.