Downtime Due to DNS
A quick story of an (ongoing) incident that was triggered by messing with DNS on another domain I own.
Yesterday I was configuring a new application to send out a summary email based on an RSS feed, a small XML document that summarizes the posts on a blog.
I updated my SendGrid account, which I use to send emails, to add the blog’s new domain name, but no matter how long I waited the DNS change didn’t seem to be reflected in SendGrid. This domain was unusual because, unlike my others, it was managed by Namecheap and still used Namecheap DNS. Today, I tried pointing Namecheap to Cloudflare DNS (which manages all my other DNS) but found that it wasn’t updating either. Sick of having to deal with Namecheap, I decided to transfer the domain over to Porkbun, which manages about 25% of my domain names.
During, or perhaps even prior to, the transfer, Namecheap suffered an incident that caused its own DNS management pages to load slowly or fail, and caused the transfer to fail initially too. Throughout, thanks to the power of DNS caching, the site remained accessible, so I thought nothing of it until later today.
This evening, I checked whether the domain was resolving via Cloudflare yet, and found it was now returning no records at all from any nameserver. I’m not entirely sure where in the process this went wrong.
DNSSEC was never enabled, the nameservers are all spelled correctly, and the records are set with Namecheap, Porkbun, and Cloudflare! Maybe Namecheap DNS stopped serving records while it was still the nameserver, or there’s something funky going on with the TLD’s nameservers?
Whatever happened the site is currently down, but it should be back up as soon as the DNS caches expire or whatever is going on is resolved.
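The role caching plays here can be sketched with a toy resolver cache. This is a simplified model, not how any particular resolver works: it assumes one fixed TTL for answers and a separate TTL for cached "no records" (negative) answers, whereas real resolvers take these from the records themselves and from the zone's SOA record. The class `CachingResolver`, the TTL values, and the address 203.0.113.10 are hypothetical. In this model, a fresh cached answer keeps a site reachable while the authoritative server is misbehaving, and once an empty answer has been cached, a fix only becomes visible after that entry expires.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CacheEntry:
    records: List[str]  # an empty list is a cached "no records" (negative) answer
    expires_at: float


class CachingResolver:
    """Toy caching resolver (hypothetical); `upstream` stands in for the authoritative nameserver."""

    def __init__(self, upstream: Callable[[str], List[str]], ttl: float, negative_ttl: float) -> None:
        self.upstream = upstream
        self.ttl = ttl                    # assumed TTL for non-empty answers
        self.negative_ttl = negative_ttl  # assumed TTL for empty answers
        self.cache: Dict[str, CacheEntry] = {}

    def resolve(self, name: str, now: float) -> List[str]:
        entry = self.cache.get(name)
        if entry is not None and now < entry.expires_at:
            # Still fresh: answer from the cache without asking upstream at all.
            return entry.records
        records = self.upstream(name)
        ttl = self.ttl if records else self.negative_ttl
        self.cache[name] = CacheEntry(records, now + ttl)
        return records


# Simulated authoritative data that can be broken and then fixed.
zone: Dict[str, List[str]] = {"example.com": ["203.0.113.10"]}
resolver = CachingResolver(lambda name: list(zone.get(name, [])), ttl=300, negative_ttl=60)

print(resolver.resolve("example.com", now=0))    # fetched from upstream
zone.clear()                                     # upstream stops serving records
print(resolver.resolve("example.com", now=100))  # still served from the cache
print(resolver.resolve("example.com", now=400))  # cache expired: the empty answer is cached
zone["example.com"] = ["203.0.113.10"]           # DNS is fixed upstream
print(resolver.resolve("example.com", now=420))  # still empty until the negative entry expires
print(resolver.resolve("example.com", now=500))  # the fixed answer is finally visible
```

Passing `now` explicitly keeps the sketch deterministic; a real cache would read a clock instead.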
Update
This was a simple, silly little bug: I had set the domain up in Cloudflare under the wrong name. I got one letter wrong, but of course nothing broke until the DNS changeover happened. I was able to diagnose this by tracing the delegation path with dig:
dig @1.1.1.1 +trace +noall example.com any
which showed that DNS resolution was going all the way to Cloudflare’s nameserver beth.ns.cloudflare.com. I was then able to double-check that the DNS was configured correctly, at which point I spotted the typo in the account.
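Why the trace pointed at the problem can be sketched as a toy referral walk over hard-coded data. This is not how dig is implemented. It is a simplified model that assumes one nameserver per step. The hostnames `a.root-servers.net` and `a.gtld-servers.net`, the `.com` TLD, the misspelled zone `exmaple.com`, and the address are illustrative assumptions, not the real setup; only `beth.ns.cloudflare.com` comes from the trace above. The point is that every referral can be correct all the way down to Cloudflare's nameserver while that final server holds no zone for the requested name.

```python
from typing import Dict, List, Optional, Tuple

ROOT = "a.root-servers.net"

# Illustrative referrals: for each server, which nameserver it delegates a zone to.
REFERRALS: Dict[str, Dict[str, str]] = {
    ROOT: {"com.": "a.gtld-servers.net"},
    "a.gtld-servers.net": {"example.com.": "beth.ns.cloudflare.com"},
}

# Zones each server answers for. The account was set up under a
# hypothetical misspelled name, so the real zone is missing.
AUTHORITATIVE: Dict[str, Dict[str, List[str]]] = {
    "beth.ns.cloudflare.com": {"exmaple.com.": ["203.0.113.10"]},
}


def in_zone(name: str, zone: str) -> bool:
    """True if `name` is the zone itself or a name underneath it."""
    return name == zone or name.endswith("." + zone)


def trace(name: str, max_hops: int = 10) -> Tuple[List[str], Optional[List[str]]]:
    """Follow referrals from the root, recording every server that was asked."""
    server = ROOT
    path = [server]
    for _ in range(max_hops):
        # Does this server hold the zone itself?
        for zone, records in AUTHORITATIVE.get(server, {}).items():
            if in_zone(name, zone):
                return path, records
        # Otherwise follow the most specific (longest) matching referral.
        matches = [z for z in REFERRALS.get(server, {}) if in_zone(name, z)]
        if not matches:
            return path, None  # dead end: this server neither answers nor refers
        server = REFERRALS[server][max(matches, key=len)]
        path.append(server)
    return path, None  # gave up after too many hops


path, records = trace("example.com.")
print(" -> ".join(path), "=>", records)
# prints: a.root-servers.net -> a.gtld-servers.net -> beth.ns.cloudflare.com => None
```

In this model, adding the correctly spelled `example.com.` zone to the Cloudflare entry in `AUTHORITATIVE` is enough for the same walk to return the records, which mirrors fixing the typo in the account.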