Creating the reliable cloud with unreliable components

2019-11-09

Team Agilicus has been working very hard to build a secure, reliable, economical hybrid cloud for municipalities moving infrastructure applications online. The secure part is done with a set of best practices, cryptography, etc. The reliable and economical parts, however, often conflict. How do we achieve both, creating a reliable cloud with unreliable components?

Traditionally, reliability came from making each individual component as close to infinitely reliable as possible. Servers had redundant power supplies, redundant fans, ECC memory, redundant power grids, etc. This creates great cost. But, ironically, it also creates a limit to reliability: all of those extra components introduce failure modes of their own. Eventually the returns diminish and then turn negative.

So we have to focus on system reliability. And this is the big mental leap in cloud-native: each individual component is expected to fail often enough to observe. How do we then make the overall system reliable enough that failures are not observed?
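To make the idea of system reliability concrete, here is a minimal sketch of the usual back-of-the-envelope model. It assumes a service is up whenever at least one replica is up and that replicas fail independently, which real clusters only approximate. The function name and the 90% figure are illustrative, not taken from our production systems.

```python
def system_availability(component_availability: float, replicas: int) -> float:
    """Availability of a service that is up whenever at least one replica is up.

    Assumption: replica failures are independent of one another.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


if __name__ == "__main__":
    # Illustrative only: a component that is up 90% of the time.
    for n in (1, 2, 3):
        print(f"{n} replica(s): {system_availability(0.9, n):.4f}")
```

Under these assumptions, three replicas of a 90%-available component give a 99.9%-available service. Correlated failures (a shared zone, a shared bug) make the real number worse, which is why the independence assumption matters.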

Once you have this mindset in place, you start looking to embrace the failures. Is there something that is 10% less available and 50% cheaper? Let’s use that! Enter the concept of the preemptible node. We are using Google Cloud, and it has the concept of a preemptible VM. In a nutshell, you tell Google: hey, this node I’m running on, if you need to move that server or do maintenance in the datacentre, go ahead. 30 seconds’ notice is fine, pull the power. Since this gives Google greater flexibility, they make this capacity available for less money. And, since we expect nodes to fail anyway, and have designed a system that makes those failures unobservable, we embrace it.
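The "10% less available and 50% cheaper" trade can be sketched as arithmetic. The example below is a rough model, not our actual sizing method: it assumes independent failures, reads "10% less available" as multiplying availability by 0.9, and uses made-up numbers (a 95%-available standard node at unit cost 1.0, and a 99.9% service target). The helper names are hypothetical.

```python
def replicas_needed(availability: float, target: float) -> int:
    """Smallest replica count whose combined availability meets the target.

    Assumption: the service is up if any replica is up, and failures are independent.
    """
    if not (0.0 < availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availability and target must be strictly between 0 and 1")
    replicas = 1
    while 1.0 - (1.0 - availability) ** replicas < target:
        replicas += 1
    return replicas


def fleet_cost(availability: float, unit_cost: float, target: float) -> tuple[int, float]:
    """Return (replica count, total cost) needed to reach the target availability."""
    n = replicas_needed(availability, target)
    return n, n * unit_cost


if __name__ == "__main__":
    target = 0.999  # illustrative service-level target
    # Illustrative node types: (availability, relative cost per node).
    standard = (0.95, 1.0)
    preemptible = (0.95 * 0.9, 0.5)  # "10% less available, 50% cheaper"
    for name, (avail, cost) in (("standard", standard), ("preemptible", preemptible)):
        n, total = fleet_cost(avail, cost, target)
        print(f"{name}: {n} nodes, relative cost {total:.1f}")
```

With these particular inputs the preemptible fleet needs one more node and still costs less overall. Different inputs can tip the result the other way, so treat this as a way to reason about the trade rather than a guarantee.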

Now, how often does this occur? If it occurred almost never, we would not have confidence that we handled it. Looking at the trailing 30 days of metrics for our clusters in Google Cloud Montreal, and plotting a histogram of preemption events by time of day, we see a cluster around 6pm. My guess is that a lot of people hit ‘git commit && git push’ just before they head home, which cranks up their CI and causes a capacity spike and rebalance.
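A time-of-day histogram like the one described above can be built from a list of event timestamps. This is a minimal sketch: it assumes the preemption events have already been exported as ISO 8601 strings in local time, and the sample timestamps are invented for illustration, not our real metrics.

```python
from collections import Counter
from datetime import datetime


def events_by_hour(timestamps: list[str]) -> list[int]:
    """Count events per hour of day (index 0-23) from ISO 8601 timestamps."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [counts[hour] for hour in range(24)]


def render(histogram: list[int]) -> str:
    """Render the histogram as text, one row per hour that has events."""
    rows = [
        f"{hour:02d}:00 {'#' * count}"
        for hour, count in enumerate(histogram)
        if count
    ]
    return "\n".join(rows)


if __name__ == "__main__":
    # Invented sample data, assumed to be in local time already.
    sample = [
        "2019-11-01T03:12:00",
        "2019-11-01T18:05:00",
        "2019-11-02T18:40:00",
        "2019-11-04T17:55:00",
        "2019-11-05T18:20:00",
    ]
    print(render(events_by_hour(sample)))
```

If the exported timestamps are in UTC, they would need converting to local time first, otherwise the peak appears at the wrong hour.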

We did have a choice. We could have chosen non-preemptible nodes, costing us (and our customers) more money. But, and this is key, we would then have either reduced reliability (by assuming the non-preemptible nodes were infinitely reliable) or needed to do the same work with Kubernetes, service meshes, etc. anyway, leaving the cost saving on the table for no reason. We chose to create a reliable cloud with unreliable components.

The moral of the story: embrace the failure, design for it, and use that to reduce your cost while increasing your reliability.
