The benefits of describing ITIL incidents in object-deviation format

Those of you familiar with Kepner Tregoe problem solving will have come across the standard format used to describe problems – object-deviation. An object is the “thing” that is experiencing a deviation. A deviation occurs where there is a positive or negative difference between an expected condition and a measured condition.

An example:
An Oracle financial report normally completes within 15 minutes (the expected condition). Today it is taking 45 minutes to complete (the measured condition).
The object is the Oracle financial report (it’s helpful to be specific – let’s call it report FIN001). The deviation is that it is taking 30 minutes longer than usual to complete.

This would be written as:
Oracle financials report FIN001 taking extra 30 minutes to run
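
For those who like to see this as a data structure, here is a minimal sketch in Python of how a service desk tool might hold an object-deviation description. The class name `ObjectDeviation` and its fields are my own assumptions for illustration, not part of Kepner Tregoe or ITIL, and it only covers deviations measured in minutes.

```python
from dataclasses import dataclass


@dataclass
class ObjectDeviation:
    """Hypothetical record: an object plus its expected and measured condition."""
    obj: str                 # the "thing" experiencing the deviation
    expected_minutes: float  # expected condition
    measured_minutes: float  # measured condition

    @property
    def deviation_minutes(self) -> float:
        # Positive means slower than expected, negative means faster.
        return self.measured_minutes - self.expected_minutes

    def summary(self) -> str:
        diff = self.deviation_minutes
        if diff > 0:
            return f"{self.obj} taking extra {diff:g} minutes to run"
        if diff < 0:
            return f"{self.obj} taking {-diff:g} fewer minutes to run"
        return f"{self.obj} running as expected (no deviation)"


incident = ObjectDeviation("Oracle financials report FIN001", 15, 45)
print(incident.summary())
```

The point of keeping the expected and measured conditions as separate fields is that the deviation is always derived from them, rather than typed in by hand.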

It is unlikely that the customer would report the fault in this way. It is much more likely to be reported as something like:

  • Report stuck
  • Financials not working
  • Report is slow
  • Oracle is taking too long
With some quick and easy questions, however (based around identifying the object and the deviation), it could quickly be turned into object-deviation format:
  • Object: “What part of Oracle are you using?”
  • Object: “Which Oracle form is it?”
  • Deviation: “How long does it normally take?” (identifying expected condition)
  • Deviation: “How long is it taking today?”
(side note: it is also feasible for Service Desk agents at first point of contact to begin asking some of the basic problem solving questions such as “Which other objects are not showing this fault/deviation?” and “When was the deviation/fault first observed?”, but that’s a topic for another day)

The benefits of describing the fault using object-deviation are manifold:

  • The description is specific enough to begin clear incident resolution activities. With the object-deviation description, a technical staff member is likely to first check Oracle Financials, check the host on which it’s running for load, check the job scheduler to see if FIN001 is running and check the data FIN001 is using. Because the deviation has only been reported for FIN001, it also serves as a point of difference for the ‘distinctions’ exercise in Kepner Tregoe.
    The technical staff member doesn’t go on a wild goose chase, for example checking the server on which Oracle Financials is running (although this may be a valid activity after other more likely sources are ruled out), checking disk space etc.
  • Mean time to resolution is reduced and the customer is up and running more quickly. Because the technical staff member has a clear description, resolution activities are more focussed, meaning the cause is found (and a workaround applied) more quickly.
  • Lower misclassification rate. Because the object is known, it becomes easier to classify the service and/or configuration item. This seems like a very general or vague benefit, but it goes back to the old principle of garbage in, garbage out. How reliable is your reporting or trend analysis information?
  • Easier trend analysis. Because symptomatically related service calls are recorded with similar descriptions, it becomes easier to spot and verify trends of faults.
  • Easier error matching. Because problems are described in object-deviation format, it is easier to search problems for the deviation reported in the fault to see if there is a workaround.
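
To illustrate the error matching benefit, here is a rough sketch in Python. Everything in it is assumed for the example: the `KNOWN_ERRORS` records, the workaround text and the `find_workaround` function are hypothetical, and a real service desk tool would search its own known error database rather than a hard-coded list.

```python
from typing import Optional

# Hypothetical known-error records: (object, deviation, workaround).
KNOWN_ERRORS = [
    ("report fin001", "taking extra", "Illustrative workaround A"),
    ("thunderbird on mac", "connection refused", "Illustrative workaround B"),
]


def find_workaround(obj: str, deviation: str) -> Optional[str]:
    """Return the workaround of the first known error matching both object and deviation."""
    obj_l, dev_l = obj.lower(), deviation.lower()
    for known_obj, known_dev, workaround in KNOWN_ERRORS:
        # Simple case-insensitive substring match on both parts.
        if known_obj in obj_l and known_dev in dev_l:
            return workaround
    return None


print(find_workaround("Oracle financials report FIN001", "taking extra 30 minutes to run"))
print(find_workaround("Email", "not working"))
```

A vague description such as “Email not working” matches nothing, while a description with a specific object and deviation gives the search something to hold on to.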

The counter argument to describing faults in object-deviation format at first point of contact with the Service Desk is that the description should use the customer’s words and language. That is, by using object-deviation format, the specification of the fault will use terms unfamiliar to the customer.

There is a false assumption here – that the fault cannot be described in both customer language and object-deviation format at the same time. It is quite easy to do both. Here are some examples:

Object deviation example 1

The customer calls the service desk to report that his email is not working. At first glance, this could be logged as:

  • Email not working
  • Email not being received
  • Can’t get email
If the service desk agent asks some simple questions we can be much more specific about the object and the deviation:
  • Object: “Which email program are you using?”
  • Object: “Is that on Windows or a Mac?”
  • Deviation: “What happens when you click ‘Get Mail’?”
  • Deviation: “Can you tell me the error message?”
So, we come up with something like:
Thunderbird on Mac shows error “Connection refused to company.com” when ‘Get Mail’ is clicked
Does this use the customer’s language? No. Not the language the customer used to report the fault. Is it something the customer is likely to understand? Yes. They use Thunderbird. They use a Mac. That’s the error message that appears when they click the ‘Get Mail’ button. Therefore the customer is unlikely to feel confused or that what they have reported hasn’t been acknowledged or understood.

Object deviation example 2

The customer calls the service desk to report that their fan is noisy. Again this could be reported in a number of (non-specific) ways:

  • Noisy fan
  • Computer is noisy
  • Fan is overworked

Again, by asking some simple questions we can really start to be specific about the object and the deviation.

  • Object: “What does the sound on the computer sound like?”
  • Object: “What type of computer is it?”
  • Deviation: “When does the sound start?”
  • Deviation: “Does it normally do this?”
We arrive at something like:
Generic Computer Model 999 has noisy fan for 30 seconds on startup
Does it use the language the customer reported? No. Will the customer associate with the description? Yes. They have a Generic Computer Model 999 in front of them. It has a noisy fan. And the fan is noisy for 30 seconds on startup. This is an accurate description of what they have told the service desk agent.

Kepner Tregoe: Is it useful for ITIL Problem Management?

I’ve recently been joined by a new colleague, who will be taking over Problem Management duties while I’m seconded to something else (improving video conferencing). In chatting with him, he mentioned that Kepner Tregoe is quite a dated model, and that it isn’t all that useful. Ever since doing my KT training in 2007, I’ve tried to find ways to apply it – to the ITIL Problem Management process itself, or to everyday rational decision making. Personally I find it quite a useful tool, and so am always interested to explore further when someone quite passionately takes the opposite stance.

One of the key tenets of Kepner Tregoe is that changes cause problems. Alterations in supplier, personnel, work practices, equipment, raw materials etc. can lead to a defect (or in KT terms, a ‘deviation’). This is particularly applicable in the IT world – as configuration changes, hardware changes, script changes etc. can all lead to a deviation. Although we have quite a mature Change Management process, and in most cases a clear audit trail of all Changes made to a Configuration Item, there are times when the ‘changes cause problems’ mantra, while valid, is of limited value. For instance, let’s take the principle that a problem caused by a change can first occur at any point after that change. What if the change occurred a while ago and in the interim there have been several other changes? It then becomes more difficult to identify how a change has contributed to a problem.

This is a downside of KT problem analysis, so the question becomes how to avoid ending up unable to determine which change caused a problem. ITIL Change Management can contribute partially to this by helping to bundle changes into change windows, and eventually by reducing the rate of change. If the dependencies of a Change and the Configuration Items involved are adequately identified, this can also provide better information to link a change with a problem.
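
As a rough illustration of linking changes to a problem, here is a sketch in Python that lists the changes made to a Configuration Item up to the point the deviation was first observed, newest first. The `Change` record, the change IDs, the CI names and the dates are all invented for the example; a real audit trail would come from the Change Management tool. Note that it only narrows the list of suspects. With several changes in the interim, it still takes specialist knowledge to say which one is the probable cause.

```python
from datetime import datetime
from typing import List, NamedTuple


class Change(NamedTuple):
    """Hypothetical audit-trail entry for one change to a Configuration Item."""
    change_id: str
    ci: str
    implemented: datetime


def candidate_changes(changes: List[Change], ci: str, first_observed: datetime) -> List[Change]:
    """Changes to `ci` implemented at or before the deviation was first observed, newest first."""
    matches = [c for c in changes if c.ci == ci and c.implemented <= first_observed]
    return sorted(matches, key=lambda c: c.implemented, reverse=True)


# Invented example data.
changes = [
    Change("CHG-1", "FIN001", datetime(2024, 1, 5)),
    Change("CHG-2", "FIN001", datetime(2024, 2, 1)),
    Change("CHG-3", "MAILHOST", datetime(2024, 2, 2)),
    Change("CHG-4", "FIN001", datetime(2024, 3, 1)),
]
suspects = candidate_changes(changes, "FIN001", datetime(2024, 2, 10))
print([c.change_id for c in suspects])
```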

Another downside we have seen with KT in recent weeks is that it is still very dependent on specialist (subject matter expert – SME) knowledge to identify the most probable cause. Even after comparing similar objects to identify distinctions, and reviewing changes to see how these could account for the defect or deviation, in many cases there is still a great deal of technical knowledge required to formulate a probable cause. One wonders if this is a true downside – as it would be fair to assume that in a specialist technical field, it will be the specialist who undertakes root cause analysis. After all, would it be appropriate to have a dentist find out why you have a pain in the stomach?

I suspect I will have many more interesting Kepner Tregoe conversations in the coming months…

Kepner Tregoe article published by itSMF

I recently had an article published by the IT Service Management Forum (itSMF) of Australia on what Kepner-Tregoe problem solving and decision making offers to ITIL Problem Management.

So far, one of the most useful concepts from KT is the use of a ‘deviation’ to determine whether you do in fact have a ‘problem’. A deviation is defined as the difference between an expected and actual condition – such as a server load twice that of normal, or a host being unresponsive when it should be up. If there is no ‘deviation’, then you must question whether you really have a ‘problem’. This is useful for distinguishing between development requests / enhancement requests and true ‘problems’.
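
The ‘no deviation, no problem’ test can be written down in a couple of lines. This is only a sketch in Python; the function name `is_problem` and the example conditions are my own assumptions, and in practice the expected condition often has to be agreed with the customer before any comparison is possible.

```python
def is_problem(expected, actual) -> bool:
    """True when the actual condition differs from the expected one, i.e. a deviation exists."""
    return expected != actual


# A host that should be up but is unresponsive: a deviation, so treat it as a problem.
print(is_problem("host up", "host unresponsive"))

# The report behaves exactly as specified: no deviation, so a request to change it
# is more likely a development or enhancement request.
print(is_problem("report has 5 columns", "report has 5 columns"))
```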