I'm sure we've all heard the old saw "Failing to plan is planning to fail", but that is not the type of planning to fail I'm talking about here. I'm talking about the importance of actually having a failure plan. The reality of working in IT is that everything we do is (or should be) to some degree unpredictable. When the actions and results are fully predictable, processes in IT are automated (or should be!). That means that whatever is left for us to do by hand carries at least some chance of failure (or at least of an unexpected result). We should always have a plan for what to do when things fail.
Sadly, I've been involved in a number of failed projects and processes recently, and as a result I've learned some new lessons about planning to fail. The first and foremost lesson is that you need to define success and define "good enough" in a specific and measurable manner, according to what matters most. (By good enough, I mean something that doesn't meet the intended goals but results in a state closer to them than the original.) There are two issues here: one is determining what your priorities are, and the other is being able to determine whether they have been met. Also, while it might be self-evident to an application administrator that an application is working suboptimally, it might not be to a database administrator or system administrator. The same applies to the health of the database and the operating system, and all three need to be tested and working for the system as a whole to work well. If this isn't formalized, you can end up having to decide whether something is working, and working well enough, while dealing with the stress and uncertainty of an unexpected result.
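To make "specific and measurable" a little more concrete, here is a minimal sketch in Python of how the criteria could be written down before the work starts. Everything in it is hypothetical and for illustration only: the `Check` and `Outcome` names, the three layers, and the idea of grading against a `baseline` (the value before the change) and a `target` (the intended goal). It treats the system as only as healthy as its worst layer, which is one reasonable reading of "all three need to be working", not the only one.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    GOOD_ENOUGH = "good enough"
    FAILURE = "failure"


# Ordered from best to worst so the worst grade can be picked out.
SEVERITY = [Outcome.SUCCESS, Outcome.GOOD_ENOUGH, Outcome.FAILURE]


@dataclass(frozen=True)
class Check:
    """One measurable criterion (hypothetical structure, not a real tool)."""
    layer: str                     # e.g. "application", "database", "os"
    name: str                      # key used to look up the measured value
    baseline: float                # value before the change was made
    target: float                  # value that counts as meeting the goal
    higher_is_better: bool = True

    def grade(self, measured: float) -> Outcome:
        # Flip the sign for "lower is better" metrics so one comparison works.
        sign = 1 if self.higher_is_better else -1
        if sign * measured >= sign * self.target:
            return Outcome.SUCCESS
        # Short of the goal, but closer to it than where we started.
        if sign * measured > sign * self.baseline:
            return Outcome.GOOD_ENOUGH
        return Outcome.FAILURE


def overall(checks, measurements):
    """Grade the whole system as its worst individual check."""
    grades = [c.grade(measurements[c.name]) for c in checks]
    # With no checks defined, nothing has been shown to work.
    return max(grades, key=SEVERITY.index, default=Outcome.FAILURE)
```

With criteria like these agreed on by the application, database, and system administrators ahead of time, the question "is it working well enough?" becomes a lookup rather than a judgment call made under stress.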
I have also learned the importance of having your failure plan be as formal as, or more formal than, your plan for success. There are several reasons for this. You will be implementing it under a higher level of stress, and the stakes are higher: "we had a problem, we followed the plan and solved it" is a much better state to be in than "we had a problem, we tried to solve it, we failed, …", which is likely to lead, at the very least, to several uncomfortable meetings. Also, even if the original plan was carried out by one person, the recovery is likely to involve additional people. Multiple people working from a vague, uncertain or non-existent plan is highly likely to lead to a sub-optimal result. For instance, a recent plan I worked on failed, and not only could we not recover fully, but because of assumptions we'd made about how we'd recover and who knew what, we rolled back 100% of the changes when rolling back only the second half was necessary. This took extra time and meant we had to redo the successful first half later. Had we planned better for failure, we would have had a partial success to go with our partially unrecoverable failure. For that matter, had we planned better, we might have been able to recover. (However, in this particular case, the initial failure and the inability to recover came from the same false reading of the initial system state, which likely would have carried over to any failure plan.)
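The "roll back only the second half" idea can be written into the plan itself. The following is a rough Python sketch, not a description of what we actually ran: `Step`, `checkpoint` and `run_plan` are made-up names, and it assumes that a step which fails leaves nothing of its own behind to undo. Each step carries its own undo action, and a step marked as a checkpoint is one whose result has been verified and is safe to keep.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One change in the plan, paired with the action that reverses it."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]
    checkpoint: bool = False   # True: state after this step is verified, keep it


def run_plan(steps: List[Step]) -> Tuple[List[str], List[str]]:
    """Apply steps in order; on failure, undo back to the last checkpoint only.

    Returns (names of steps kept, names of steps rolled back).
    Assumption: a step that raises has made no changes of its own.
    """
    applied: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception:
            rolled_back = []
            # Undo in reverse order, stopping at the most recent checkpoint.
            while applied and not applied[-1].checkpoint:
                undone = applied.pop()
                undone.undo()
                rolled_back.append(undone.name)
            return [s.name for s in applied], rolled_back
        applied.append(step)
    return [s.name for s in applied], []
```

If the first half of a change ends in a checkpoint and something in the second half fails, only the second half is undone, and everyone involved in the recovery can see from the plan what was kept and what was reversed.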
Another advantage of planning for failure is that if you have a good failure plan, you can estimate how long it will take you to recover. This allows you to build time for recovery into your schedule. Having that time already in the schedule helps avoid the "too many cooks" syndrome that so often seems to accompany failures and missteps of any sort. As soon as things aren't going according to plan, there are always additional stakeholders needing to know the status and wanting to help. But if you have a plan for failure, things are still going according to plan, even if it isn't the part of the plan you were hoping to follow.
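As a small illustration of building recovery time into a schedule, here is a sketch in Python. The function names and the example times are invented for this post, and the arithmetic assumes the worst case: the work runs its full estimated length, fails at the very end, and recovery then takes its full estimate.

```python
from datetime import datetime, timedelta


def fits_window(window_start: datetime, window_end: datetime,
                work_estimate: timedelta, recovery_estimate: timedelta) -> bool:
    """True if the work plus a full recovery still ends inside the window."""
    return window_start + work_estimate + recovery_estimate <= window_end


def latest_recovery_start(window_end: datetime, recovery_estimate: timedelta,
                          margin: timedelta = timedelta(0)) -> datetime:
    """Latest time recovery can begin and still finish before the window closes."""
    return window_end - recovery_estimate - margin


# Hypothetical maintenance window: 22:00 to 02:00.
start = datetime(2024, 1, 13, 22, 0)
end = datetime(2024, 1, 14, 2, 0)

ok = fits_window(start, end, timedelta(hours=2), timedelta(hours=1, minutes=30))
deadline = latest_recovery_start(end, timedelta(hours=1, minutes=30),
                                 margin=timedelta(minutes=15))
```

The second function gives a go/no-go time that can be agreed on and communicated in advance, so that starting the recovery at that point is a scheduled step rather than an emergency.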
Finally, there's always the talismanic protection that when you have a plan to fail, you won't need it. This isn't really true, but there is certainly a comfort in having a plan that makes your actions surer (reducing things that can cause failure like typos) and your stress level lower, especially when the inevitable failure does hit. And while planning for failure takes time, like most things, the more you practice, the faster and better you can do it.
While even the best plans for failure won't and can't cover everything (that's why they call them "unexpected" events), they do make the process of dealing with failures and unexpected results better.
For more information on failure, check out:
The Logic Of Failure: Recognizing And Avoiding Error In Complex Situations
A great source of information on IT failure specifically is Michael Krigsman (@mkrigsman)
For more information on the quote about failing to plan, see this thread on the Quotations forum.