Cloud-based IT systems are so complex and tightly integrated that each new launch or upgrade can be a source of anxiety. Yet the key to having confidence in your cloud infrastructure could lie in regular failure, introduced randomly by the 'Chaos Monkey'.
What is a 'Chaos Monkey'?
The Chaos Monkey is a software tool built and run by Netflix, the US movie rental service. To ensure that their systems work reliably, the Chaos Monkey randomly switches off services and server instances within their Amazon Web Services cloud infrastructure. Because this testing regime forces the system builders to design for resilience, reliability increases: they know that their creations need to survive unpredictable failures like those introduced by the Chaos Monkey.
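To make the idea concrete, here is a toy sketch of that testing regime in Python. It is not Netflix's implementation: the `Tier` class, the `chaos_monkey` function, the tier names and the instance counts are all hypothetical, and instances are just strings in memory rather than real servers.

```python
import random


class Tier:
    """A hypothetical tier of interchangeable server instances."""

    def __init__(self, name, desired_count):
        self.name = name
        self.desired_count = desired_count
        self._next_id = 0
        self.instances = set()
        self.heal()

    def heal(self):
        """Launch replacements until the tier is back at its desired size."""
        while len(self.instances) < self.desired_count:
            self.instances.add(f"{self.name}-{self._next_id}")
            self._next_id += 1

    def handle_request(self):
        """Serve from any surviving instance; fail only if none are left."""
        if not self.instances:
            raise RuntimeError(f"{self.name}: no healthy instances")
        return sorted(self.instances)[0]


def chaos_monkey(tiers, rng):
    """Terminate one randomly chosen instance and report which one."""
    tier = rng.choice([t for t in tiers if t.instances])
    victim = rng.choice(sorted(tier.instances))
    tier.instances.remove(victim)
    return victim


def run_drill(rounds, seed=0):
    """Kill one instance per round, check requests still succeed, then heal."""
    rng = random.Random(seed)
    tiers = [Tier("web", 3), Tier("api", 2)]  # assumed sizes, for illustration
    killed = []
    for _ in range(rounds):
        killed.append(chaos_monkey(tiers, rng))
        for tier in tiers:
            tier.handle_request()  # raises if the tier cannot cope
            tier.heal()
    return killed, tiers
```

Calling `run_drill(20)` removes twenty instances one at a time. The drill only completes because every tier holds more than one instance and replaces losses promptly; set a tier's size to 1 and the drill can fail at the first kill in that tier, which is exactly the kind of weakness the exercise is meant to expose.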
Chaos Monkey and the cloud
The cloud is the natural habitat of the Chaos Monkey. Reliable cloud-based applications must tolerate changes in infrastructure while running. Cloud applications are typically assembled in tiers of multiple components, with the tiers able to 'autoscale' - growing or shrinking in response to varying demand.
Furthermore, since the individual underlying components (for example, Amazon EC2 server instances) run on commodity hardware, they can and will fail occasionally. A reliable application needs to cope with such failures, and careful design can help ensure nobody notices.
Netflix do this well. For example, if their personalised recommendation function is unavailable, that section of the page can be replaced with a general list of popular films; from a user's perspective the website still works and they might not notice the drop in functionality.
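This kind of graceful degradation can be sketched in a few lines of Python. The example below is an illustration only, not Netflix's code: the function names, the `RecommendationUnavailable` exception and the placeholder film titles are all assumptions.

```python
# Hypothetical static fallback list, e.g. refreshed periodically and cached.
POPULAR_FILMS = ["Film A", "Film B", "Film C"]


class RecommendationUnavailable(Exception):
    """Raised by the (assumed) recommendation client when the service is down."""


def render_recommendations(user_id, fetch_personalised):
    """Build the recommendations section of a page.

    `fetch_personalised` is any callable that returns a list of film titles
    for the user, or raises RecommendationUnavailable.
    """
    try:
        films = fetch_personalised(user_id)
        return {"heading": "Recommended for you", "films": films}
    except RecommendationUnavailable:
        # Degrade gracefully: the page still renders, with generic content.
        return {"heading": "Popular films", "films": list(POPULAR_FILMS)}
```

The design choice worth noting is that the failure is handled at the level of one page section, so an outage in one component costs some personalisation rather than the whole page.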
Outsmarting the Chaos Monkey
When PA Consulting Group recently built a large-scale cloud application for a client, we gave careful consideration to failure scenarios within our design. Statistically rare failures can become regular issues in high-volume 'web scale' solutions unless you accept that failure is inevitable and design to handle it gracefully.
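A quick calculation shows why rare failures stop being rare at scale. The figures used here are purely illustrative assumptions, not measurements from any real system.

```python
import math


def expected_failures(requests, one_in):
    """Average number of failures when each request fails 1 time in `one_in`."""
    return requests / one_in


def chance_of_at_least_one(requests, one_in):
    """Probability of seeing one or more failures, assuming independent requests.

    Computes 1 - (1 - p)**n in a numerically stable way for very small p.
    """
    return -math.expm1(requests * math.log1p(-1.0 / one_in))
```

With an assumed one-in-a-million failure rate and 50 million requests a day, `expected_failures(50_000_000, 1_000_000)` gives 50.0: a 'one in a million' event becomes something that happens dozens of times every day.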
One of the keys here is keeping your design as simple as possible. The more you minimise, whether it's the 'chattiness' of the interfaces between your components or the complexity of your data set, the less chance you give the Chaos Monkey to catch you out.
To find out more about how PA can help you build confidently in the cloud, please contact us now.