In this post I describe how we migrated the Service CIDR blocks across our fleet of production Kubernetes clusters without downtime.
Note: This is a copy of my blog post published in Zendesk Engineering on Medium.
At Zendesk, we started our Kubernetes journey in September 2015, shortly after version 1.0 became publicly available. At the time, not much information was available and we didn’t have a lot of experience, so some configuration choices that made sense four years ago have had to be reconsidered as the years went by.
Our clusters have seen multiple potentially disruptive migrations in production. For example, in 2018 we changed the CNI implementation from flannel to the AWS VPC CNI Plugin, in 2019 we migrated our etcd installation from the v2 to the v3 data format, and most recently we performed a Service CIDR block migration across our live production clusters.
When we originally provisioned our clusters, we assumed that allocating ClusterIPs in different clusters from the same CIDR block would be fine: Service ClusterIPs are only reachable from inside their own cluster, so reusing the range would conserve valuable private IPv4 address space. So, each of our clusters was configured with the same RFC 1918 private CIDR block for service ClusterIPs.
Since then, hyper-growth in both the scale of our infrastructure and the adoption of our Kubernetes platform at Zendesk has led us to consider spreading related microservices across multiple clusters, connected by service mesh technology. With the same ClusterIP assigned to different services in different clusters, cross-cluster service discovery and routing become ambiguous, and that needed to be fixed.
Testing quickly showed that changing the --service-cluster-ip-range flag on the API Server would not be enough. Existing services would keep their old ClusterIPs, which would then fall outside the configured range, and kube-proxy would no longer populate iptables rules for them. Beyond this, we found that ClusterIPs are immutable: to change one, the service has to be deleted and recreated to pick up its new IP, which creates a short but unacceptable service interruption in production.
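To make the "outside of the configured range" condition concrete, here is a minimal Python sketch that flags ClusterIPs not covered by a new Service CIDR. The function name and the example ranges are hypothetical and are not taken from our migration utility; in practice the IPs would come from listing the services in the cluster.

```python
import ipaddress


def outside_range(cluster_ips, service_cidr):
    """Return the ClusterIPs that do not belong to service_cidr.

    Headless services report the literal string "None" as their
    ClusterIP, so those entries are skipped.
    """
    network = ipaddress.ip_network(service_cidr)
    return [
        ip
        for ip in cluster_ips
        if ip != "None" and ipaddress.ip_address(ip) not in network
    ]


# Hypothetical values: old range 10.0.0.0/16, new range 10.128.0.0/16.
existing = ["10.0.0.1", "10.0.12.7", "None", "10.128.0.10"]
print(outside_range(existing, "10.128.0.0/16"))  # ['10.0.0.1', '10.0.12.7']
```

Every address this returns belongs to a service that would have to be recreated to receive an IP from the new range.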
To overcome these issues, we performed a few rounds of research and tried different approaches for migration.
We had a brainstorming session and identified several options that might work out:
Finally, we outlined the following migration procedure:
With that in place, we started to test the migration in our staging infrastructure and found a few other issues:
During testing in our staging clusters, the migration would sometimes fail with an error indicating that there were no free ClusterIPs available. After some digging in the Kubernetes codebase, we found that there is a RangeAllocation object that stores a bitmap of the used IP addresses in the block. If the Service CIDR changes, that data becomes invalid and causes issues when a new ClusterIP is allocated. So, we updated the migration procedure to stop the kube-controller-manager and remove the RangeAllocation object from etcd, after confirming that doing so wouldn’t cause any other issues.
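The toy model below illustrates why a stale allocation bitmap can make a range look exhausted. It is a deliberately simplified sketch and not the Kubernetes implementation: the class, its fields, and the example ranges are all hypothetical, and the real failure mode inside Kubernetes may differ in detail.

```python
import ipaddress


class ToyRangeAllocator:
    """Toy allocator: bit N of the bitmap marks address offset N+1 as used."""

    def __init__(self, cidr, bitmap=0):
        self.network = ipaddress.ip_network(cidr)
        # Skip the network and broadcast addresses.
        self.size = self.network.num_addresses - 2
        self.bitmap = bitmap

    def allocate(self):
        """Claim and return the lowest free address in the range."""
        for offset in range(self.size):
            if not (self.bitmap >> offset) & 1:
                self.bitmap |= 1 << offset
                return self.network[offset + 1]
        raise RuntimeError("no free ClusterIPs in range")


# Allocate 14 addresses from the old (hypothetical) range.
old = ToyRangeAllocator("10.0.0.0/24")
for _ in range(14):
    old.allocate()

# Reusing the old bitmap for a different range: every slot looks taken,
# even though no service holds an address from the new range yet.
stale = ToyRangeAllocator("10.128.0.0/28", bitmap=old.bitmap)
try:
    stale.allocate()
except RuntimeError as err:
    print(err)  # no free ClusterIPs in range

# Discarding the stale state lets allocation start from a clean bitmap.
fresh = ToyRangeAllocator("10.128.0.0/28")
print(fresh.allocate())  # 10.128.0.1
```

In this model, throwing away the stored bitmap plays the same role as removing the RangeAllocation object in our procedure: the allocator starts over with state that matches the new range.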
When we changed the Service CIDR range, we needed to reissue the certificates used by the Kubernetes API Servers. The new certs had to include the ClusterIP address that kubernetes.default.svc.cluster.local would receive from the new Service CIDR range. However, to generate those certs we needed to know ahead of time which IP from the new CIDR block would be assigned to the Kubernetes service. So, our migration utility ensured that the Kube API Server would get the first IP address from the configured Service CIDR range, and we updated the code that requests certificates to follow the same logic.
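The "first IP address of the range" rule is easy to compute ahead of time. The sketch below shows one way to derive it and to assemble the subject alternative names a new API Server certificate would need. The function names and the example CIDR are hypothetical, and the DNS names assume the default cluster.local cluster domain; this is not the code from our certificate tooling.

```python
import ipaddress


def first_service_ip(service_cidr):
    """Return the first usable address of the CIDR (network address + 1)."""
    network = ipaddress.ip_network(service_cidr)
    return str(network[1])


def api_server_sans(service_cidr, cluster_domain="cluster.local"):
    """Build SAN entries for the in-cluster 'kubernetes' service."""
    return {
        "dns": [
            "kubernetes",
            "kubernetes.default",
            "kubernetes.default.svc",
            f"kubernetes.default.svc.{cluster_domain}",
        ],
        "ip": [first_service_ip(service_cidr)],
    }


# Hypothetical new Service CIDR.
print(first_service_ip("10.128.0.0/16"))       # 10.128.0.1
print(api_server_sans("10.128.0.0/16")["ip"])  # ['10.128.0.1']
```

Because both the migration utility and the certificate request derive the address from the same rule, the cert can be issued before the service is recreated in the new range.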
After hosting pre-mortem meetings and performing the migration back and forth several times in staging, we were confident that we could proceed to production. We deemed this change to be high risk by nature, so we needed to take extra precautions. We worked with our Incident Management team and requested a maintenance window to perform the migration. To further reduce the potential impact, we integrated the migration utility with an internal system that looks up the criticality tier of each service, so that we could migrate the less critical services first before moving on to increasingly visible ones. From a technical perspective, the migration steps were the following:
With all the preparation done, we sent out a notification to our customers, adjusted our timeframes a bit to meet their needs, and blocked out a few weekends to perform the rollout in production. Thanks to detailed planning and preparation, we were able to complete our migration across all production clusters without even the smallest blip in QoS for Zendesk customers. This paved the way for a cross-cluster service mesh, extending Kubernetes Service IP lookups and routing across cluster boundaries at Zendesk.