How to design and run your IT Operations NOC (Network Operations Centre) is often a perplexing question, especially when you are running so-called ‘mission critical systems’ such as a banking system or an airline check-in system. Even a brief outage of such a system can have a huge impact on the end customer’s business and the corresponding revenue. There are two common ways to look at it:
- Military Regimentation: The systems are mission critical, and hence there should be military regimentation in the NOC. What is a regiment? A regiment is a military unit made up of multiple battalions (specialized in different types of arms and ammunition), typically commanded by a colonel. Most importantly, soldiers are not mere “warriors”. Traditionally, warriors are considered “heroes” who would fight to the death irrespective of the outcome, and who mostly lead the fight themselves.
Well! Soldiers are different.
They fight as part of a “team”, according to a “plan”, and they conform to a “discipline”. So why does the concept of regimentation need to be applied to IT Operations? Because IT Operations for vital systems must be run according to a well-structured “plan” by a well-trained and skilled “team” who follow streamlined processes and thus conform to a “discipline”.
- Emergency Triage: The systems are “highly critical” to the business. Each deployed component is like a vital organ of the system, and the “health” of each component needs to be monitored for undesirable symptoms. Any degradation of health needs to be “assessed” for severity, prioritized, “treated” according to that priority, and “resolved”. Applications are written by humans, hardware is designed by humans, and nobody can guarantee that the software will run without issues in every kind of business scenario. In the real world, software or hardware components may fail, just as we humans experience ill health.
How does this compare with emergency triage in a hospital?
In the middle of the night, if someone feels ill (seriously enough that they can’t wait till morning) and decides to go to the nearest super speciality hospital, they should be taken to the emergency section. There is an area there called “Triage”, where a nurse assesses the condition of the patients in the queue, i.e. the criticality of each patient’s condition vs. the need for immediate treatment vs. their chance of benefiting from such care. Every patient (or his/her bystander) believes he/she is the person most entitled to medical care “at the earliest”, irrespective of the “severity” of the health condition. So one becomes impatient, shows panic and keeps reminding the attendants at the Emergency of one’s importance.
But what about the other side?
Suppose the nurse at triage has to attend to three patients at the same time: one with a fever and a runny nose, another with a broken hand, and yet another on a ventilator. The nurse “assesses” the condition of each of them, classifies them by severity as follows, and takes action.
A) seriously ill/injured with a threat to life
B) seriously ill/injured with no immediate danger to life
C) ill with no danger to life
Category A patients are sent immediately to the resuscitation area, where the immediate care of a physician, nurses trained in trauma and life support, and advanced life-saving equipment are available. Category B patients are moved to the emergency/acute care area, where a physician attends to them and recommends/executes more detailed tests/diagnosis to start the treatment. Category C patients are moved to the prompt care/minor area, where they are seen first by junior doctors who analyse their health further before taking decisions.
Those who are familiar with IT Operations would say that a NOC is similar to an Emergency Triage!
For all customers, their systems are “highly critical”, and who doesn’t need urgent care in case of issues? In the middle of peak business hours, when a user gets an error on their screen, they submit an S1 ticket to the provider. There may be a dozen other tickets with varying severities (S2, S3, etc.). Everyone needs urgent attention and resolution so that the business can run without further interruption.
Here, the 24x7 people at the NOC (Level 1 Support) “assess the impact” of the incident and its “real severity” (in some cases, even a non-impacting incident is reported as S1) and decide on immediate workarounds to bring the system back. If it is an S1 or S2 incident, they also call in Level 2 support to resolve the issue, along with line managers such as the Incident Manager and account managers. When the Level 2 people get involved, L1 support should provide them with all the details of the incident, its status and its impact. One immediate similarity with health management is that here, too, no one should try to find the root cause at this stage. The immediate priority is to “give first aid” to the system/service and get it back up and running. Root cause analysis and problem fixing should follow. Lower priority incidents are handled through slower paths within the process, since their business impact is low.
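To make the L1 triage step concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular ticketing tool works: the `Incident` fields, the two impact signals (`users_blocked`, `degraded`) and the rule that a non-impacting ticket is never treated as higher than S3 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    ticket_id: str
    reported_severity: int  # severity claimed by the submitter (1 = S1)
    users_blocked: bool     # hypothetical signal: is the business actually stopped?
    degraded: bool          # hypothetical signal: service slow or partially failing


def assess_severity(incident):
    """L1 assessment: derive the 'real' severity from impact, not from the label."""
    if incident.users_blocked:
        return 1
    if incident.degraded:
        return 2
    # No business impact: never treat it as higher than S3 (assumed rule).
    return max(incident.reported_severity, 3)


def triage(incidents):
    """Return (ticket_id, real_severity, action) tuples, most severe first."""
    assessed = [(assess_severity(i), i) for i in incidents]
    assessed.sort(key=lambda pair: pair[0])  # stable sort: ties keep arrival order
    plan = []
    for severity, incident in assessed:
        if severity <= 2:
            # First aid first: restore service, involve L2; root cause comes later.
            action = "apply workaround, escalate to L2, notify incident manager"
        else:
            action = "queue for standard handling"
        plan.append((incident.ticket_id, severity, action))
    return plan


queue = [
    Incident("INC-101", reported_severity=1, users_blocked=False, degraded=False),
    Incident("INC-102", reported_severity=3, users_blocked=True, degraded=False),
    Incident("INC-103", reported_severity=2, users_blocked=False, degraded=True),
]
for row in triage(queue):
    print(row)
```

In this sketch the ticket reported as S1 but with no impact drops to the back of the queue, while the ticket reported as S3 that actually blocks users is handled first, much like the nurse who looks at the patient's condition rather than at how loudly the patient complains.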
By all means, in IT Operations “business” is “life” (literally, or relative to health management). The business should stay live, and any disruption to it should be carefully analysed for “business impact” and prioritized according to severity, and action needs to be taken to recover the services and reduce/eliminate the business impact. Incident management is the “first aid and resuscitation” for the business, and problem management is the “follow-up therapy and permanent cure”.
IT is driven by the business; IT doesn’t have an existence of its own. You can be technical for business reasons, and it is good practice to relate IT Operations to an Emergency Triage.