GCP – Increasing robustness of serving public DNS names using multi-provider setups
DNS is a foundational component of Internet-delivered applications. DNS outages cause service outages and can also hamper the monitoring and remediation of problems. Multi-tenant cloud setups exacerbate this, because services from many providers are affected at once when they share the same underlying DNS infrastructure. A single-provider deployment is the most common choice and is easy to set up and maintain, but it’s also a single point of failure. Hosting public DNS with two different providers improves overall reliability.
One way to set up multi-provider DNS is to use the zone transfer capability of the DNS protocol (AXFR, RFC 5936). However, most cloud providers don’t support it. A second way is to make sure every DNS update reaches both providers, typically by adjusting the CI/CD pipelines to create and update DNS records in both providers at the same time. Managing and maintaining multiple providers tends to be operationally heavy and prone to human error, which is one of the many reasons multi-provider setups are not as common as they should be.
What if there’s a simpler way to achieve this? You continue to update a single provider, and an over-the-top solution automatically picks up the changes and reflects them onto another provider. We are pleased to announce the launch of Terraform scripts that make it easy to use Cloud DNS as a second authoritative DNS server for public DNS hosting. Our solution relies on automation built using Terraform and OctoDNS to monitor your current DNS zones and reflect those changes in a managed DNS zone hosted on Cloud DNS. The scripts ensure that new managed zones are created in Cloud DNS, and when individual DNS records are updated, they automatically transfer the updates to the corresponding DNS zone in Cloud DNS. You don’t need to make any changes to your existing CI/CD pipeline. Finally, we’ve launched this solution in open source to enable you to customize it to your needs and environment. The rest of this post describes two approaches to consider depending on your needs.
Dual-provider in an active-active setup
The first option we’d like to discuss is the use of two providers in an active-active setup. Since both providers are primary, you need to configure the appropriate DNS name servers of both providers as the responsible name servers for your domain. This may involve working with the registrar to publish updated “NS records” in the parent domain registry, as well as publishing matching NS records at the domain’s “zone apex”. This enables resolvers to fall back to the remaining available provider’s nameservers in case one of the providers is down. If the outage is not resolved quickly, NS records can be updated to delist the affected provider’s nameservers until the problem is resolved. Resolvers typically keep track of the health status of the nameservers they use, and partial nameserver downtime is typically not service-impacting.
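The consistency requirement above (the same NS set at the parent and at the zone apex, with both providers represented) can be checked mechanically. The following is a minimal sketch, not part of the published scripts: the function names and the input shape are hypothetical, and it assumes you have already collected the NS names from your registrar, your zone apex, and each provider.

```python
def normalize(name):
    """Lower-case a host name and drop the trailing dot so names compare equal."""
    return name.strip().lower().rstrip(".")


def check_delegation(parent_ns, apex_ns, providers):
    """Return a list of problems; an empty list means the NS sets are consistent.

    parent_ns: NS names published in the parent domain registry
    apex_ns:   NS names published at the zone apex
    providers: dict mapping a provider label to its assigned name servers
               (hypothetical input shape, chosen for illustration)
    """
    parent = {normalize(n) for n in parent_ns}
    apex = {normalize(n) for n in apex_ns}
    problems = []

    # The parent delegation and the apex NS records should list the same servers.
    if parent != apex:
        problems.append(
            "parent and apex NS sets differ: " + ", ".join(sorted(parent ^ apex))
        )

    # Active-active only works if every provider appears in the delegation.
    for label, servers in providers.items():
        expected = {normalize(n) for n in servers}
        if not expected & parent:
            problems.append(f"{label} has no name servers in the parent delegation")

    return problems
```

Running a check like this after every NS change is one way to catch a delegation that silently lists only one provider, which would turn an intended active-active setup back into a single point of failure.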
Pros
No single point of failure – the probability of both providers having simultaneous outages is very small
Failover is transparent and automatic
Cons
Need synchronization between the two providers, which is more complex than a single-provider deployment
Potentially higher cost since the same service is being provided from two different providers
Dual-provider in an active-passive setup
A second way to set this up is with an active-passive configuration. You could designate one provider as primary and another as secondary. You list only the primary vendor’s nameservers in the “NS records” given to your domain registrar and all DNS requests by default will be sent to the primary.
Behind the scenes, all DNS Zone updates are sent to both the primary and the secondary providers. Choosing two cloud providers makes this easy because of API support — you can call both the APIs back to back to ensure the primary and the secondary zones are in sync. You can also utilize open-source tools like OctoDNS to keep the providers in sync.
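Keeping two zones in sync comes down to computing the difference between them and applying it to the secondary. The sketch below shows only that comparison step, under stated assumptions: each zone has already been exported into a dictionary keyed by (name, type) with a (ttl, values) tuple as the value, and SOA records are skipped because each provider is assumed to manage its own. The function name and data shape are hypothetical and are not the API of OctoDNS or of any provider.

```python
def plan_sync(primary, secondary, ignore_types=("SOA",)):
    """Compute the changes that make `secondary` match `primary`.

    Both arguments map (name, record_type) -> (ttl, tuple_of_values).
    Returns (creates, updates, deletes):
      creates / updates are dicts in the same shape as the inputs,
      deletes is a list of (name, record_type) keys.
    """
    def wanted(key):
        return key[1] not in ignore_types

    # Records present only in the primary must be created.
    creates = {
        k: v for k, v in primary.items() if wanted(k) and k not in secondary
    }
    # Records present in both but with a different TTL or value set must be updated.
    updates = {
        k: v
        for k, v in primary.items()
        if wanted(k) and k in secondary and secondary[k] != v
    }
    # Records present only in the secondary are stale and must be deleted.
    deletes = [k for k in secondary if wanted(k) and k not in primary]
    return creates, updates, deletes
```

An empty plan (no creates, updates, or deletes) is a cheap signal that the two providers agree, which makes the same function usable as a drift check between scheduled syncs.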
The active-passive setup requires a trigger to be set up that detects failures and switches to the secondary DNS service. The trigger could become the single point of failure so the design and deployment of the trigger becomes critical. Note that some parent (TLD) domains have long NS TTLs, and recovery for domains registered under a public suffix can be slow in active-passive mode.
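One common way to keep a trigger from reacting to a single lost probe is to require several consecutive failures before switching. The class below is a minimal sketch of that idea only; the class name, the threshold of three, and the decision to never fail back automatically are all assumptions made for illustration. How the health probe is performed and how the NS records are actually updated are left out.

```python
class FailoverTrigger:
    """Switch from the primary to the secondary after N consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold          # assumed value; tune to your probe interval
        self.consecutive_failures = 0
        self.active = "primary"

    def observe(self, primary_healthy):
        """Record one probe result and return the provider that should be active."""
        if primary_healthy:
            # Any successful probe resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.active = "secondary"
        # No automatic failback: once switched, the trigger stays on the secondary
        # until an operator resets it, to avoid flapping NS records.
        return self.active
```

Because long NS TTLs make every switch slow to take effect, this sketch deliberately leaves failback as a manual decision. The trigger itself should also run somewhere that does not depend on the primary DNS provider.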
Pros
More reliable than a single-provider system, since it’s unlikely that both providers will have simultaneous outages
Potentially lower cost since the secondary is only storing the zone information and does not serve query traffic as long as it is in secondary mode
Cons
Need synchronization between the primary and secondary, which is more complex than a single-provider deployment
Downtime until the secondary provider is switched to be the primary, and for some time after as resolver caches catch up
Need a triggering mechanism to switch between primary and secondary services, which could itself become a single point of failure
Use Cloud DNS as a second authoritative provider
We believe that the active-active configuration is the best mechanism to minimize downtime due to an outage. To help customers use Cloud DNS as a second authoritative provider, we’ve launched Terraform scripts that do the bulk of the automation work.
Our solution helps you do a one-time data transfer or periodic transfers of DNS records over to Cloud DNS, depending on your needs. There are two caveats: provider-specific DNS records (like custom apex CNAME records) will not work on Cloud DNS, and we don’t support DNSSEC for multi-provider zones. Currently, we support migrations to Cloud DNS from Amazon Route 53 and Azure DNS. This solution can be easily extended to other API-enabled providers.
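Before a first transfer, it can help to list the records that are likely to be provider-specific. The sketch below flags two cases: record types outside a conservative allow-list, and CNAME records at the zone apex, which standard DNS does not permit alongside the SOA and NS records there. The allow-list is an assumption made for illustration and is not the list of types Cloud DNS supports; the function name and input shape are hypothetical.

```python
# Assumed conservative allow-list of widely supported record types.
STANDARD_TYPES = {"A", "AAAA", "CAA", "CNAME", "MX", "NS", "PTR", "SOA", "SRV", "TXT"}


def find_nonportable(zone, records):
    """Return (name, type, reason) for records that may not transfer cleanly.

    zone:    the zone name, for example "example.com."
    records: iterable of (name, record_type) pairs exported from the source provider
    """
    apex = zone.lower().rstrip(".")
    flagged = []
    for name, rtype in records:
        at_apex = name.lower().rstrip(".") == apex
        if rtype.upper() not in STANDARD_TYPES:
            flagged.append((name, rtype, "non-standard record type"))
        elif rtype.upper() == "CNAME" and at_apex:
            flagged.append((name, rtype, "CNAME at the zone apex"))
    return flagged
```

Anything this returns needs a manual decision, such as replacing it with an equivalent standard record, before relying on the second provider to answer for that name.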
You can find the open-sourced version of these scripts at Google Cloud’s GitHub repository.