04 December 2009

The Use of Root Cause Analysis in Conducting Major Problem Reviews (Part One)

Many organizations have, from time to time, experienced significant disruption of their IT services – major incidents. This article examines how an organization might turn this to their advantage.

Very often a simple initial failure is made worse by other, unrelated, failures; these might be failures of hardware, software, people or process. The article expands on the material covered during accredited ITIL training courses and describes a systematic way of analyzing chains of events and identifying specific improvements that will address not only the original cause but also the subsequent failures.

Root Cause Analysis

The Service Operation volume of the IT Infrastructure Library recommends that every major problem should be reviewed to learn lessons for the future. However, it gives little or no guidance on how this might be done. Root Cause Analysis is an excellent technique for addressing the issues identified in Service Operation:

• What was done correctly

• What was done wrong

• What could be done better in future

• How to prevent recurrence

• Whether there has been any third-party responsibility and whether follow-up actions are required

The phrase “root cause analysis” is often used in a general sense to describe the activity of identifying the underlying cause of an incident (and this is the sense in which it appears to be used in the Glossary of Service Operation). However, the name Root Cause Analysis (RCA) is also given to a specific technique for investigating a series of actions or occurrences that lead to an undesired outcome.

It is particularly useful where a number of contributory causes might be involved; it helps the analyst to avoid the common mistake of becoming fixated on a single cause (usually the very first event). This makes the technique well suited to reviewing a Major Problem, which might have several contributory causes and whose impact might be made worse by the way it is handled. RCA not only helps us to identify the factors that led to the loss of service, but also to understand how our response to the incident might have contributed to the overall impact.

RCA helps to identify not only what happened and how it happened but also why. Only by understanding why will we be able to devise workable corrective measures. For instance, suppose a network technician disconnects a working router rather than a broken one. A typical investigation might conclude that “human error” was the cause and recommend better training, or that technicians should take more care, but neither of these is likely to prevent future occurrences. RCA assumes that mistakes do not just happen but that they have specific causes, and would ask “why?” In the case of the poor network technician the RCA analyst might ask: “Was the router properly labelled?”, “Was the technician told which router was faulty?”, “Is there a recognized procedure for deciding whether a router is working or not?”, “Did the technician know what that procedure was?”

Root causes have four characteristics:

1. They are specific causes: “human error”, for example, is too general.

2. They are causes that can reasonably be identified: RCA must be cost-beneficial, so the analyst must know when to stop the investigation.

3. They are within the control of the management of the organization. The analyst is looking for causes that can be addressed by the organization. Although adverse weather conditions might very well have triggered the incident, we cannot do anything to affect the weather and so that is not an appropriate root cause. We can, of course, do something about how adverse weather affects us, and perhaps our root causes might lie there.

4. They can be addressed by specific solutions. A vague recommendation such as “ensure that technicians follow defined procedures” probably means that more thought needs to be given to identifying a specific cause.
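
For readers who like to see a checklist written down, the four characteristics above can be sketched as a simple test applied to each candidate cause. This is only an illustration, not part of ITIL or of any RCA tool: the names CandidateCause, failed_characteristics and is_root_cause are hypothetical, and the true/false judgements are still made by the analyst.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateCause:
    """A proposed cause plus the analyst's judgement on each characteristic.

    Hypothetical structure for illustration; the booleans are the
    analyst's own answers, not something the code can work out.
    """
    description: str
    is_specific: bool            # 1. specific, not a label such as "human error"
    is_identifiable: bool        # 2. can reasonably (cost-beneficially) be identified
    is_controllable: bool        # 3. within the control of management
    has_specific_solution: bool  # 4. can be addressed by a specific solution


def failed_characteristics(cause: CandidateCause) -> List[str]:
    """Return the characteristics that the candidate cause does not meet."""
    checks = [
        ("specific", cause.is_specific),
        ("reasonably identifiable", cause.is_identifiable),
        ("within management control", cause.is_controllable),
        ("addressable by a specific solution", cause.has_specific_solution),
    ]
    return [name for name, met in checks if not met]


def is_root_cause(cause: CandidateCause) -> bool:
    """A candidate qualifies only if it meets all four characteristics."""
    return not failed_characteristics(cause)


# The weather example from the text: outside management control.
weather = CandidateCause(
    description="Adverse weather conditions",
    is_specific=True,
    is_identifiable=True,
    is_controllable=False,
    has_specific_solution=False,
)

# A cause the router example might lead to (assumed for illustration).
unlabelled = CandidateCause(
    description="Faulty router was not labelled",
    is_specific=True,
    is_identifiable=True,
    is_controllable=True,
    has_specific_solution=True,
)

for candidate in (weather, unlabelled):
    verdict = "root cause" if is_root_cause(candidate) else "keep asking why"
    print(f"{candidate.description}: {verdict}", failed_characteristics(candidate))
```

A candidate that fails any check is not discarded; it is a prompt to keep asking why until a cause is found that passes all four.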

I shall discuss the four phases of RCA in part two.

IT Infrastructure Library is a Registered Trade Mark of the Office of Government Commerce in the United Kingdom and other countries
