Completely Complex Cloud Cluster Capacity Crisis: Cool as a Cucumber in Kubernetes


So… capacity. There’s never enough. This is why people like cloud computing: you can expand for some extra cash. There are different ways to expand: scale out (add more of the same) and scale up (make the same things bigger). Normally in cloud you focus on scale-out, but, well, you need big enough pieces to make that reasonable.
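In GKE terms (a rough sketch, using the cluster name k8s and zone that show up later in this post): scale-out is just resizing an existing pool, while scale-up means swapping in a node pool with a bigger machine type, which is what the rest of this post is about.

gcloud container clusters resize k8s --node-pool default-pool --num-nodes 5 --zone northamerica-northeast1-a
# scale-up instead: create a new pool with a bigger machine type and migrate onto it (see below)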

When I set up my Kubernetes cluster on GKE, I used n1-standard-2 machines (2 vCPU, 7.5GB RAM), 3 of them to make the initial cluster. And it was OK, it got the job done. But as soon as we started using more CI pipelines (gitlab-runner), well, it left something to be desired. So a fourth node was pressed into service, and then a fifth. At this stage I looked at it and said, well, I’d rather have 3 bigger machines than 5 little ones: better over-subscription, faster CI, and it’s cheaper. Now, this should be easy, right? Hmm, it was a bit tough, let me share the recipe with you.
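For context, the original cluster was nothing exotic; roughly, a cluster like that comes from something like this (a sketch, not necessarily the exact original command):

gcloud container clusters create k8s \
--zone northamerica-northeast1-a \
--machine-type n1-standard-2 \
--num-nodes 3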

First, I needed to create a new pool of n1-standard-4 machines (4 vCPU, 15GB RAM). I did that like this:

gcloud container node-pools create pool-n1std4 \
--zone northamerica-northeast1-a \
--cluster k8s \
--machine-type n1-standard-4 \
--image-type gci \
--disk-size=100 \
--num-nodes 3
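As an aside, you can see what pools a cluster has at any point with something like this (a sketch, same cluster and zone as above):

gcloud container node-pools list --cluster k8s --zone northamerica-northeast1-a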

OK, the pool create kept complaining that an upgrade was in progress. So I looked, and sure enough the ‘add fifth node’ operation never finished properly, it was hung. Grumble. Rebooted it, still the same. Dug into it, and it’s complaining about DaemonSets (calico) and not enough capacity. Hmm. So I used

gcloud container operations list 
gcloud beta container operations cancel operation-# --region northamerica-northeast1-a
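(If you want to see why an operation is wedged before you cancel it, you can also describe it; a sketch, where operation-# is whatever ID the list command showed:)

gcloud container operations describe operation-# --zone northamerica-northeast1-a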

And now the upgrade is ‘finished’ 🙂

So at this stage I am able to do the above ‘create pool’. Huh, what’s this? Everything resets and goes Pending. Panic sets in. The world is deleted, time to jump from the high window? OK, it’s just getting a new master; I don’t know why everything was reset and down and is now Pending, but, well, the master is there.
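While waiting it out, a couple of quick checks reassure you the control plane is actually answering and show how much is stuck (a sketch; componentstatuses was still a thing on 1.10):

kubectl get componentstatuses
kubectl get pods --all-namespaces | grep -c Pending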

Now let’s drain the 5 ‘small’ ones:

kubectl drain gke-k8s-default-pool-XXX-XXX --delete-local-data --force --ignore-daemonsets
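With five nodes to drain, a small loop saves some copy-paste (a sketch, assuming the old nodes all carry the default-pool prefix in their names):

for node in $(kubectl get nodes -o name | grep default-pool | cut -d/ -f2); do
  kubectl drain "$node" --delete-local-data --force --ignore-daemonsets
done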

I had to use --ignore-daemonsets cuz calico wouldn’t evict without it. OK, now we should be able to delete the old default-pool:

gcloud container node-pools delete default-pool --zone northamerica-northeast1-a --cluster k8s

Now, panic starts to set in again:

$ kubectl get nodes
NAME                          STATUS                     ROLES    AGE   VERSION
gke-k8s-pool-n1std4-XXX-XXX   Ready,SchedulingDisabled   <none>   21m   v1.10.5-gke.0
gke-k8s-pool-n1std4-XXX-XXX   Ready,SchedulingDisabled   <none>   21m   v1.10.5-gke.0
gke-k8s-pool-n1std4-XXX-XXX   Ready,SchedulingDisabled   <none>   21m   v1.10.5-gke.0

Indeed, the entire world is down, and everything is Pending again.

So let’s uncordon:

kubectl uncordon gke-k8s-pool-n1std4-XXXX-XXX
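Same trick as the drain, looped over the new pool (a sketch, keying off the pool-n1std4 prefix):

for node in $(kubectl get nodes -o name | grep pool-n1std4 | cut -d/ -f2); do
  kubectl uncordon "$node"
done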

Great, they are now starting to accept load again, and the Kubernetes master is scheduling its little heart out; containers are being pulled (and getting ImagePullBackOff cuz the registry is not up yet). OK, the registry is up… And we are all back to where we were, but 2x bigger per node. Faster CI, bigger cloud, roughly the same cost.
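If you’d rather watch it settle than refresh by hand, something like this shows only the pods that aren’t happy yet:

watch 'kubectl get pods --all-namespaces | grep -v Running'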