Designing to fail

It has been squeaky-bum time for the Google engineers in the last 24 hours as a bug in a software update resulted in "0.02%" of GMail users staring at empty inboxes (see http://gmailblog.blogspot.com/2011/02/gmail-back-soon-for-everyone.html). As a result I have already seen articles claiming that this shows that cloud isn't to be trusted. (Although, as a side note, people who use percentages are usually trying to spin a message.)

I can't remember if it was Nick Carr or Geoffrey Moore who I saw at an event in 2009 describe this as akin to an aeroplane crash. Airline incidents get disproportionately high coverage because "they kill people in batches". Likewise with cloud computing incidents.

Of course the reality is that people are losing data from on-premise systems every minute of every hour of every day. It is just so disaggregated that we never get to hear about it.

In fact, to go further, I know that we have lost data ourselves, and from the Google Apps system. But our loss came from misunderstanding the consequences of a request to delete a user account, a request that was approved by a business head (and double checked), and our internal processes have since been changed as a result.

This has got me thinking, though, about the risks, mitigation and contingency that this new era requires a business to consider. When I started at Imagination there was much talk about "lifeboats and get out of gaol cards". Both were systems and processes, often complex, that had been built up because systems were expected to fail. The problem with that approach is that if you design something to cope with failure, my reckoning is that you increase the chances of the failure happening. It is a very different thing from designing things not to fail.

As an example, all of our laptops had local admin rights so that users could get themselves out of a hole if something went wrong when on the road. Yet local admin rights seemed to be the root cause of much that went wrong with the laptops in the first place…

Coming back to the GMail issues, then: what to do? Google do seem to design out failure, rather than design for it, but the consequences of a failure are then that much more catastrophic. However, maybe the bigger issue here is the reliance on single modes of communication. If I were to lose my email today for a period, whilst also losing my archive, I would still have my notes and to-do lists in paper form. I also have three telephones, Twitter, Facebook and LinkedIn. I could even go and see people…
