This article explains in depth how GKE performs node surge upgrades and what you should consider if you want to influence the upgrade process. I'll also try to clarify some common misunderstandings and point out caveats you should avoid.
Surge upgrade
Surge upgrade is an enhancement of the rolling upgrade, and it is the default upgrade strategy. GKE also supports blue/green node upgrades, but this article focuses on surge upgrade.
Surge upgrade matters beyond version upgrades. If it is enabled, GKE also uses it whenever it needs to recreate nodes, for example when the following types of changes occur:
- version changes (upgrades)
- image type changes
- IP rotation
- credential rotation
Why surge upgrade?
You might think that surge upgrade exists to make upgrades faster. That is not entirely true: the overall upgrade time is determined by how many nodes can be upgraded in parallel, and surge is not necessary for that. Whether you create the new VM before or after draining the old node, the overall parallelism is the same.
The actual purpose of surge upgrade is to make upgrades less disruptive: the new node is already available, so pods that are evicted from the old node can be rescheduled as soon as possible.
Two types of surge upgrade
GKE provides two surge upgrade settings, maxSurge and maxUnavailable, that let you choose the balance between speed and disruption for your workloads while nodes are being recreated.
Together, these two settings determine the surge upgrade behaviour.
Each setting triggers a different type of node upgrade, and the two work independently of each other.
When they are used together, they determine the overall parallelism, that is, how many nodes can be upgraded at the same time.
Creating new nodes
Before diving into maxSurge and maxUnavailable, you must understand that upgrading a node always means replacing it: GKE creates a new VM and shuts down the old VM.
The main difference between maxSurge and maxUnavailable is when the new VM is created:
- When a node upgrade is triggered by maxSurge, GKE creates a new VM before draining the old node.
- When a node upgrade is triggered by maxUnavailable, GKE creates a new VM after draining the old node.
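The difference in ordering can be written down as two step lists. This is only a minimal sketch to make the comparison concrete, not GKE's implementation: the `upgrade_steps` helper and the step names are made up for illustration, and only the relative order of creating, draining, and deleting comes from the description in this article.

```python
def upgrade_steps(strategy: str) -> list[str]:
    """Return the order of steps for upgrading one node (illustrative only)."""
    if strategy == "surge":
        # maxSurge: the replacement VM exists before the old node is drained,
        # so evicted pods have somewhere to go right away.
        return ["create new VM", "drain old node", "delete old VM"]
    if strategy == "recreate":
        # maxUnavailable: the old node is drained first, so the pool is
        # temporarily one node short until the replacement VM is created.
        return ["drain old node", "delete old VM", "create new VM"]
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("surge", "recreate"):
    print(f"{strategy}: " + " -> ".join(upgrade_steps(strategy)))
```

In both cases the same three things happen; only the position of "create new VM" moves, which is why surge affects disruption rather than parallelism.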
maxSurge (upgrade with Surge)
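When maxSurge is set to N, up to N node upgrades at a time are triggered as surge upgrades. Such nodes are referred to as surge-upgraded nodes.
The timeline of a node upgrade with a surge node: GKE creates the new VM first, then drains the old node, and finally deletes the old VM.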
maxUnavailable (upgrade with recreation)
When maxUnavailable is set to a non-zero value N, GKE triggers node recreation for up to N nodes at a time. Such nodes are referred to as recreated nodes.
GKE creates the new VM only after draining the old node.
The timeline of a node upgrade using recreation: GKE drains the old node, deletes the old VM, and only then creates the new VM.
maxSurge + maxUnavailable
The upgrade behaviour is determined by the combination of these two settings. When both are set, some nodes are upgraded with surge and the rest are upgraded with recreation.
Surge takes precedence over recreation. For example, if numNodes < maxSurge, all nodes go through surge upgrade.
max level of parallelism = min(20, maxSurge+maxUnavailable, numNodes)
During the whole upgrade process, the number of nodes in a node pool stays between (numNodes - maxUnavailable) and (numNodes + maxSurge).
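To make these rules easier to play with, here is a small sketch in Python. The function names are my own and nothing here calls GKE. The parallelism formula and the node count bounds are taken directly from the text above; the `split_strategies` helper is my reading of "surge takes precedence", so treat it as an assumption.

```python
GKE_PARALLELISM_CAP = 20  # the constant 20 from the formula above


def max_parallelism(num_nodes: int, max_surge: int, max_unavailable: int) -> int:
    """min(20, maxSurge + maxUnavailable, numNodes)."""
    return min(GKE_PARALLELISM_CAP, max_surge + max_unavailable, num_nodes)


def node_count_bounds(num_nodes: int, max_surge: int, max_unavailable: int) -> tuple[int, int]:
    """Lowest and highest node count the pool can reach during the upgrade."""
    return num_nodes - max_unavailable, num_nodes + max_surge


def split_strategies(num_nodes: int, max_surge: int, max_unavailable: int) -> tuple[int, int]:
    """(surge, recreate) nodes upgraded at once, assuming surge slots fill first."""
    parallel = max_parallelism(num_nodes, max_surge, max_unavailable)
    surge = min(max_surge, parallel)
    return surge, parallel - surge


print(max_parallelism(5, 2, 1))    # 3
print(node_count_bounds(5, 2, 1))  # (4, 7)
print(split_strategies(5, 2, 1))   # (2, 1)
print(split_strategies(1, 2, 1))   # (1, 0): the only node goes through surge
```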
A real life example
In this example, we have a node pool with 5 nodes. We use the following settings for surge upgrade:
- numNodes = 5
- maxSurge = 2
- maxUnavailable = 1
At the start of the upgrade
The rolling upgrade starts:
- GKE started surge upgrading node1 and node2. GKE created 2 new nodes, node6 and node7
- GKE started upgrading node3 using recreate. GKE started draining node3
- Temporary number of nodes = 7
On success, GKE keeps rolling to the next nodes:
- GKE successfully upgraded node1. GKE deleted node1
- node2 was still under upgrade
- GKE successfully drained node3. GKE deleted node3 and created node8.
- GKE started surge upgrading node4. GKE created node9.
- GKE started upgrading node5 using recreate. GKE started draining node5
- Temporary number of nodes = 7 (node 2, 4, 5, 6, 7, 8, 9)
Until all nodes are upgraded and the upgrade has succeeded:
- GKE successfully upgraded node2. GKE deleted node2
- GKE successfully upgraded node4. GKE deleted node4
- GKE successfully drained node5. GKE deleted node5 and created node10.
- Upgrade succeeded! Number of nodes = 5 (node6, 7, 8, 9, 10)
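As a sanity check, the example can be replayed as a list of create and delete events on a set of node names. This is a bookkeeping sketch only: the events are transcribed from the three stages above, the exact interleaving within a stage is my assumption, and the deletion of node3 is inferred from the node list given for the second stage.

```python
num_nodes, max_surge, max_unavailable = 5, 2, 1
pool = {f"node{i}" for i in range(1, num_nodes + 1)}
low, high = num_nodes - max_unavailable, num_nodes + max_surge

# (action, node) pairs transcribed from the three stages of the example.
events = [
    ("create", "node6"), ("create", "node7"),   # surge nodes for node1 and node2
    ("delete", "node1"),                         # node1 upgraded
    ("delete", "node3"), ("create", "node8"),    # node3 drained, then recreated
    ("create", "node9"),                         # surge node for node4
    ("delete", "node2"), ("delete", "node4"),    # node2 and node4 upgraded
    ("delete", "node5"), ("create", "node10"),   # node5 drained, then recreated
]

sizes = []
for action, node in events:
    if action == "create":
        pool.add(node)
    else:
        pool.remove(node)
    sizes.append(len(pool))
    # The pool size must stay within the bounds at every step.
    assert low <= len(pool) <= high, (action, node, len(pool))

print(min(sizes), max(sizes))  # 4 7
print(len(pool))               # 5
```

The pool peaks at 7 nodes (numNodes + maxSurge) and dips to 4 (numNodes - maxUnavailable) just before node10 is created, which matches the bounds given earlier.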
What if something went wrong?
If any node fails to upgrade, GKE stops rolling immediately.
GKE waits until the node upgrades that are already in progress return, and then returns the errors. The remaining nodes will not be touched.
- In this case, node1 and node2 were upgraded successfully and GKE deleted them.
- node3 failed to upgrade
- Even though GKE has the capacity to surge two additional nodes to upgrade node4 and node5, GKE will pause rolling the upgrade to node4 and node5.
- GKE will restart the rolling process all over again. This time it skips node1 and node2: it retries the upgrade of node3 and creates one additional node to upgrade node4.
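The stop-and-retry behaviour can be sketched as a loop. This is a deliberately simplified model, not how GKE is implemented: it works in fixed batches instead of a true rolling window, and `rolling_upgrade`, `try_upgrade`, and the flaky node3 are all hypothetical names for illustration.

```python
from typing import Callable


def rolling_upgrade(nodes: list[str], upgraded: set[str],
                    try_upgrade: Callable[[str], bool], batch_size: int) -> list[str]:
    """Run one upgrade attempt and return the nodes that failed in it."""
    pending = [n for n in nodes if n not in upgraded]  # skip finished nodes
    for start in range(0, len(pending), batch_size):
        failed = []
        # Upgrades already in flight are allowed to finish.
        for node in pending[start:start + batch_size]:
            if try_upgrade(node):
                upgraded.add(node)
            else:
                failed.append(node)
        if failed:
            return failed  # stop rolling: later nodes are not touched
    return []


attempts = {"node3": 0}


def flaky(node: str) -> bool:
    """Hypothetical upgrade call: node3 fails on its first try only."""
    if node == "node3":
        attempts["node3"] += 1
        return attempts["node3"] > 1
    return True


nodes = ["node1", "node2", "node3", "node4", "node5"]
upgraded: set[str] = set()

first = rolling_upgrade(nodes, upgraded, flaky, batch_size=3)
after_first = set(upgraded)
print(first, sorted(after_first))   # ['node3'] ['node1', 'node2']

second = rolling_upgrade(nodes, upgraded, flaky, batch_size=3)
print(second, len(upgraded))        # [] 5
```

The second attempt starts from the full node list again but only works on node3, node4, and node5, because node1 and node2 are already recorded as upgraded.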
What else to consider?
Upgrade order
- GKE upgrades cluster node pool by node pool.
- Within a node pool, GKE upgrades one zone at a time.
- Surge upgrade settings apply only up to the number of nodes in the zone.
- The maximum number of nodes that can be upgraded in parallel is no higher than maxSurge + maxUnavailable.
- The maximum number of nodes that can be upgraded in parallel is also no higher than the number of nodes in the zone. This means that if you have only one node in a zone, the maximum parallelism for node upgrades is reduced to 1, regardless of the value of maxSurge.
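Because GKE works through one zone at a time, the limits above apply per zone. The sketch below shows what that means for a node pool spread over several zones. The zone names and node counts are made up, `zone_parallelism` is my own helper, and applying the cap of 20 from the earlier formula per zone is an assumption.

```python
def zone_parallelism(nodes_in_zone: int, max_surge: int,
                     max_unavailable: int, cap: int = 20) -> int:
    """Effective parallelism while GKE upgrades a single zone."""
    return min(cap, max_surge + max_unavailable, nodes_in_zone)


# Hypothetical regional node pool: 3 nodes, one per zone.
regional_pool = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
per_zone = {
    zone: zone_parallelism(count, max_surge=3, max_unavailable=0)
    for zone, count in regional_pool.items()
}
print(per_zone)  # {'zone-a': 1, 'zone-b': 1, 'zone-c': 1}

# The same 3 nodes in a single zone can use the full maxSurge.
print(zone_parallelism(3, max_surge=3, max_unavailable=0))  # 3
```

With maxSurge=3, the single-zone pool can upgrade all three nodes at once, while the regional pool upgrades them one after another.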
PDB (pod disruption budget)
A PDB limits the number of pods that can be evicted at a given time, so a PDB can prevent GKE from draining a node. GKE respects PDBs for up to 60 minutes.
Be mindful of your PDBs. Wrong PDB settings can significantly reduce the effective maximum parallelism for node upgrades.
For example, suppose you have a single-replica deployment called “foo” and a PDB for “foo” with minAvailable=1. If the only pod is running on node1, then when draining node1 GKE has to wait one hour before it can surge more nodes.
It can get even worse: if the pod is temporarily rescheduled onto another old node, the PDB timeout can be hit more than once, making the whole upgrade process even longer.
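The arithmetic behind the “foo” example is simple enough to sketch. With an integer minAvailable, the number of evictions a PDB allows is the number of healthy pods minus minAvailable. The helper names below are mine, and the worst-case wait is a rough estimate based only on the 60-minute limit mentioned above, not a measurement.

```python
PDB_TIMEOUT_MINUTES = 60  # how long GKE respects a PDB, as stated above


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many pods may be evicted right now under an integer minAvailable."""
    return max(0, healthy_pods - min_available)


def worst_case_drain_wait(healthy_pods: int, min_available: int,
                          blocked_drains: int = 1) -> int:
    """Rough upper bound, in minutes, on time spent waiting for the PDB."""
    if allowed_disruptions(healthy_pods, min_available) > 0:
        return 0  # the eviction is allowed, so the drain is not blocked
    return PDB_TIMEOUT_MINUTES * blocked_drains


# "foo": one replica with minAvailable=1, so no eviction is ever allowed.
print(allowed_disruptions(1, 1))                        # 0
print(worst_case_drain_wait(1, 1))                      # 60
# The pod landed on another old node, so a second drain is blocked too.
print(worst_case_drain_wait(1, 1, blocked_drains=2))    # 120
# With two replicas the same PDB no longer blocks the drain.
print(worst_case_drain_wait(2, 1))                      # 0
```

Running a second replica, or lowering minAvailable, is what brings the allowed disruptions above zero.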