This afternoon there was an issue which affected all .io domains across the internet. As there has been no official word from ICB (the registry which runs .io domains among others), I wanted to write a quick post to explain a) what we identified happening and b) how this affected us and what we've done to prevent future issues.
A bit of background
In order to convert a domain into an IP address, we use a system called DNS (Domain Name System). In its most basic form DNS allows names (for example codebase.atech.io) to be converted into an IP address. For example, each time you enter a domain into your browser, your computer will use your ISP's DNS servers to look up the domain you have entered and return that information to you. In order to allow users to control their DNS, the system allows for individual domains to be delegated to other nameservers which are in the control of the domain owner. To determine which nameserver to use, DNS follows a simple path to find the authoritative nameserver(s) for the domain. For example, imagine we are looking up www.atech.io:
1. The root nameservers (of which there are currently 13) are queried to find the nameservers responsible for the .io top level domain. The root nameservers will return a number of different nameservers which are able to tell us where to look next. In the case of .io domains there are 7 possible servers to query next.
2. We will ask one of the nameservers provided where to look for atech.io. It will return an array of nameservers to ask next. For these, you will often only see two or three possible nameservers.
3. One of the atech.io nameservers will now be queried for www.atech.io. If it exists, the appropriate record will be determined and returned to you.
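The lookup path above can be sketched as a toy model in Python. This is only an illustration: the delegation data is hard-coded from the example in this post (trimmed to two of the .io servers), the `resolve` function and its `failing_zones` parameter are hypothetical names, and nothing here talks to real DNS servers.

```python
# Toy model of the lookup path. The zone data is hard-coded from the example
# in this post (trimmed to two .io servers); nothing here queries real DNS.
DELEGATIONS = {
    ".": {"io.": ["a.nic.io.", "b.nic.io."]},
    "io.": {"atech.io.": ["dns1.atech.io.", "dns2.atech.io."]},
}
RECORDS = {"atech.io.": {"www.atech.io.": "atech-web.vips.atech.io."}}


def resolve(name, failing_zones=()):
    """Follow root -> TLD -> domain and return (delegation path, answer)."""
    fqdn = name.rstrip(".") + "."
    labels = fqdn.rstrip(".").split(".")
    tld = labels[-1] + "."
    domain = ".".join(labels[-2:]) + "."

    path = []
    # Steps 1 and 2: ask the root for the TLD, then the TLD for the domain.
    for zone, child in ((".", tld), (tld, domain)):
        if zone in failing_zones:
            raise LookupError(f"{zone} nameservers gave no referral for {child}")
        path.append((child, DELEGATIONS[zone][child]))

    # Step 3: ask the domain's own nameservers for the record itself.
    answer = RECORDS[domain].get(fqdn)
    return path, answer


if __name__ == "__main__":
    print(resolve("www.atech.io"))
    try:
        resolve("www.atech.io", failing_zones=("io.",))
    except LookupError as exc:
        print("lookup failed:", exc)
```

Passing `failing_zones=("io.",)` models the middle step breaking: the root still hands out the .io nameservers, but no referral for atech.io ever comes back, so the final query is never made.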
Using the dig tool you can interrogate this process. An example of its output is shown below:
; <<>> DiG 9.8.3-P1 <<>> +trace www.atech.io
;; global options: +cmd
. 2165 IN NS a.root-servers.net.
. 2165 IN NS b.root-servers.net.
. 2165 IN NS c.root-servers.net.
. 2165 IN NS d.root-servers.net.
. 2165 IN NS e.root-servers.net.
. 2165 IN NS f.root-servers.net.
. 2165 IN NS g.root-servers.net.
. 2165 IN NS h.root-servers.net.
. 2165 IN NS i.root-servers.net.
. 2165 IN NS j.root-servers.net.
. 2165 IN NS k.root-servers.net.
. 2165 IN NS l.root-servers.net.
. 2165 IN NS m.root-servers.net.
;; Received 228 bytes from 126.96.36.199#53(188.8.131.52) in 1453 ms
io. 172800 IN NS ns1.communitydns.net.
io. 172800 IN NS a.nic.io.
io. 172800 IN NS b.nic.ac.
io. 172800 IN NS ns3.icb.co.uk.
io. 172800 IN NS b.nic.io.
io. 172800 IN NS b.ns13.net.
io. 172800 IN NS a.ns13.net.
;; Received 354 bytes from 184.108.40.206#53(220.127.116.11) in 757 ms
atech.io. 86400 IN NS dns1.atech.io.
atech.io. 86400 IN NS dns2.atech.io.
;; Received 100 bytes from 18.104.22.168#53(22.214.171.124) in 286 ms
www.atech.io. 3600 IN CNAME atech-web.vips.atech.io.
atech-web.vips.atech.io. 3600 IN A 126.96.36.199
vips.atech.io. 3600 IN NS dns1.atech.io.
vips.atech.io. 3600 IN NS dns2.atech.io.
;; Received 201 bytes from 188.8.131.52#53(184.108.40.206) in 14 ms
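Each record line in the trace above has the form name, TTL, class, type, value, while lines starting with a semicolon are comments. As a rough sketch, assuming whitespace-separated fields as in this output, the delegation at each level can be pulled out with a few lines of Python. The function names `parse_records` and `delegation_chain` are made up for this illustration.

```python
def parse_records(dig_output):
    """Return (name, ttl, class, type, value) tuples from dig-style text."""
    records = []
    for line in dig_output.splitlines():
        line = line.strip()
        # Skip blank lines and dig's own comment lines.
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        # A record needs at least five fields and a numeric TTL.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        name, ttl, rclass, rtype = fields[:4]
        records.append((name, int(ttl), rclass, rtype, " ".join(fields[4:])))
    return records


def delegation_chain(records):
    """Group NS records by the zone they delegate."""
    chain = {}
    for name, _ttl, _rclass, rtype, value in records:
        if rtype == "NS":
            chain.setdefault(name, []).append(value)
    return chain


# A few lines copied from the trace above.
SAMPLE = """\
;; global options: +cmd
io. 172800 IN NS a.nic.io.
io. 172800 IN NS b.nic.io.
atech.io. 86400 IN NS dns1.atech.io.
atech.io. 86400 IN NS dns2.atech.io.
www.atech.io. 3600 IN CNAME atech-web.vips.atech.io.
"""

if __name__ == "__main__":
    for zone, servers in delegation_chain(parse_records(SAMPLE)).items():
        print(zone, "->", ", ".join(servers))
```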
What happened to .io domains today?
Today, an issue occurred in step 2 of the above process. Some of the authoritative nameservers for .io were unable to tell us where to look next, so lookups that reached them failed.
We don't have any specifics about why this happened and things looked to be restored after about an hour of intermittent issues. We could speculate on what caused this, but we'd rather wait for an official response.
Why did this affect us?
Although we don't use many .io domains for our services, we do use them heavily in our backend infrastructure. All our domains (regardless of their top level domain) were configured to use dns1.atech.io and dns2.atech.io as their authoritative nameservers, which meant that while .io lookups were failing these other domains could not be resolved either.
What did we do to mitigate issues like this?
Unfortunately, it does seem as though the .io nameservers aren't as reliable as those provided for other top level domains like .com or .net. Another fact which disturbs us is that during this outage there was zero communication from the registry responsible for managing the failed nameservers.
Therefore we have taken the decision to set all our domains to use nameservers on the .com top level domain. This means that any future issues with smaller top level domain registries will not affect all our services in such a significant fashion. We will be moving all our domains over to use a.atechdns.com and b.atechdns.com over the next few days.
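To illustrate why the nameserver hostnames matter, here is a small sketch in Python that lists the top level domains a domain depends on for resolution: its own, plus the one each nameserver hostname lives under. The domain example.com and the function names are hypothetical, and the model is deliberately simplified (it ignores caching and glue records).

```python
def tld_of(hostname):
    """Return the top level domain of a hostname, e.g. 'io' for dns1.atech.io."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()


def tld_dependencies(domain, nameservers):
    """TLDs that must be resolvable before `domain` can be looked up."""
    deps = {tld_of(domain)}
    deps.update(tld_of(ns) for ns in nameservers)
    return deps


if __name__ == "__main__":
    # example.com is a stand-in for one of our non-.io domains.
    before = tld_dependencies("example.com", ["dns1.atech.io", "dns2.atech.io"])
    after = tld_dependencies("example.com", ["a.atechdns.com", "b.atechdns.com"])
    print("before:", sorted(before))
    print("after: ", sorted(after))
```

With the old nameservers, a .com domain depends on both .com and .io, so an .io outage takes it down too. With the nameservers moved to atechdns.com, the same domain depends on .com alone.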