PaulTClark.com


Definitions


Agreed Service Time

The agreed hours when the service is to be available.

Example(s):

  • 24x7x365
  • Business Hours (M-F, 7 am ET to 7 pm ET, except holidays)
  • Supplier / Customer defined


Availability

Ability of a Configuration Item or IT Service to perform its agreed Function when required. Availability is usually calculated as a percentage. This calculation is often based on Agreed Service Time and Downtime.

Maximum downtime for a 24x7 service at common availability levels (365-day year, 30-day month):

Availability % | Downtime per year | Downtime per month | Downtime per week | Downtime per day
90% ("one nine") | 36.5 days | 72 hours | 16.8 hours | 2.4 hours
99% ("two nines") | 3.65 days | 7.20 hours | 1.68 hours | 14.4 minutes
99.9% ("three nines") | 8.76 hours | 43.2 minutes | 10.1 minutes | 1.44 minutes
99.99% ("four nines") | 52.56 minutes | 4.32 minutes | 1.01 minutes | 8.64 seconds
99.999% ("five nines") | 5.26 minutes | 25.9 seconds | 6.05 seconds | 864 milliseconds
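The percentage calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not an official ITIL formula: the function names are hypothetical, and the example assumes a 24x7 service over a 365-day year.

```python
def availability_percent(agreed_service_hours: float, downtime_hours: float) -> float:
    """Availability as a percentage of Agreed Service Time."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed_service_hours must be positive")
    if not 0 <= downtime_hours <= agreed_service_hours:
        raise ValueError("downtime must be between 0 and the agreed service time")
    return (agreed_service_hours - downtime_hours) / agreed_service_hours * 100


def allowed_downtime_hours(availability_pct: float, agreed_service_hours: float) -> float:
    """Downtime budget (in hours) implied by an availability target."""
    return agreed_service_hours * (1 - availability_pct / 100)


# Assumption: 24x7 service measured over a 365-day year.
hours_per_year = 365 * 24
print(round(allowed_downtime_hours(99.9, hours_per_year), 2))  # 8.76
print(round(availability_percent(hours_per_year, 8.76), 1))    # 99.9
```

For a service with shorter agreed hours, pass the total agreed hours for the period instead of `hours_per_year`; the same target then allows proportionally less downtime.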


Break fix

A temporary fix that administrators apply to restore an application to a working state until they can devise a better solution or make a permanent change during the maintenance window.


Change Management

A process to ensure that standardized methods and procedures are used for efficient and prompt handling of all changes, in order to minimize the impact of change-related Incidents upon service quality, and consequently to improve the day-to-day operations of the organization.


Configuration Item (CI)

Any Component that needs to be managed in order to deliver a Service. Information about each CI is recorded in a Configuration Record within the Configuration Management System and is maintained throughout its Lifecycle by Configuration Management. CIs are under the control of Change Management. CIs typically include IT Services, hardware, software, buildings, people and formal documentation such as Process documentation and SLAs.


Customer Experience (CX)

Customer experience is the product of an interaction between an organization and a customer over the duration of their relationship. This interaction is made up of three parts: the customer journey, the touchpoints the customer interacts with, and the environments the customer experiences (including the digital environment). A good customer experience means that the individual's experience during all points of contact matches the individual's expectations.


Downtime

The time when a Configuration Item or service is not available during its Agreed Service Time. The Availability of a service is often calculated from Agreed Service Time and Downtime.
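Because Downtime only counts inside the Agreed Service Time, an outage that spans a night or a weekend contributes less than its elapsed duration. The sketch below illustrates this under stated assumptions: a Monday to Friday, 07:00 to 19:00 service window, holidays and time zones ignored, and a hypothetical function name.

```python
from datetime import datetime, time, timedelta

# Assumed service window: Monday-Friday, 07:00-19:00, holidays ignored.
OPEN, CLOSE = time(7, 0), time(19, 0)


def downtime_in_service_hours(start: datetime, end: datetime) -> timedelta:
    """Portion of the outage [start, end) that overlaps the service window."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # 0-4 are Monday-Friday
            window_start = datetime.combine(day, OPEN)
            window_end = datetime.combine(day, CLOSE)
            overlap_start = max(start, window_start)
            overlap_end = min(end, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total


# Outage from Friday 18:00 to Monday 08:00 (62 hours elapsed).
outage = downtime_in_service_hours(datetime(2024, 1, 5, 18, 0), datetime(2024, 1, 8, 8, 0))
print(outage)  # 2:00:00
```

Only the last hour of Friday and the first hour of Monday fall inside the window, so two hours count as Downtime.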


Escalation Model

The process that describes to whom, when, and how to escalate an issue. It may also specify the conditions under which escalation is required.
[Figure: escalation example (image intentionally blurred)]


Event Management

Event Management, as defined by ITIL, is the process that monitors all events that occur throughout the IT infrastructure. It allows for normal operation and also detects and escalates exception conditions.

An event can be defined as any detectable or discernible occurrence that has significance for the management of the IT Infrastructure or the delivery of IT service and evaluation of the impact a deviation might cause to the services. Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool.

Examples:

  • Security or network intrusions
    • failed login attempts
    • traffic from an unexpected IP
    • DoS attacks
    • database access
    • edits to critical system files
      • syslog
      • /sbin/*
      • /boot
      • /etc/passwd
      • /etc/group
  • Configuration changes
    • OS files
      • /etc/hosts
      • /etc/resolv.conf
    • OS commands
      • ifconfig
      • server restarts
    • Configuration changes to product, service or application including addition of gear
      • Edits to application configuration file
      • application restarts
      • tnsnames.ora
      • httpd.conf
      • Tomcat config file
  • Abnormalities or failures of key configuration items
    • NFS server connectivity
    • Partition usage
    • Memory usage
    • Swap usage
    • SNMP connectivity
    • ICMP connectivity
    • CPU utilization
    • network and port connectivity
    • I/O thresholds
    • Dependency availability
    • All crit, alert, emerg, and panic syslog events
    • Loss of redundancy
  • KPIs – For every KPI failure, there should be an accompanying event from another category such as abnormalities or failures of key configuration items.
    • Reliability
    • Availability
    • Time Between Failures
    • Response Time
    • VIP failures
    • Any other measure of success that is defined for the product, service or application.
  • Process effectiveness
    • Too many / too few processes
    • Process taking too long
    • Process returns errors
    • Capacity
    • Software licensing usage
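As one illustration of detecting exception conditions, the sketch below selects the syslog events named in the list above (crit, alert, emerg, and panic) by severity. The severity keywords and numeric codes follow the standard syslog convention; the function name, the threshold constant, and the sample messages are hypothetical.

```python
# Syslog severity keywords and numeric codes (0 is the most severe).
SEVERITY = {
    "emerg": 0, "panic": 0,  # "panic" is a legacy alias for "emerg"
    "alert": 1,
    "crit": 2,
    "err": 3, "error": 3,
    "warning": 4, "warn": 4,
    "notice": 5,
    "info": 6,
    "debug": 7,
}

# Hypothetical threshold: escalate "crit" and anything more severe.
ESCALATE_AT_OR_BELOW = SEVERITY["crit"]


def is_exception_event(severity_keyword: str) -> bool:
    """Return True when the severity is crit, alert, emerg, or panic."""
    code = SEVERITY.get(severity_keyword.lower())
    if code is None:
        raise ValueError(f"unknown severity: {severity_keyword!r}")
    return code <= ESCALATE_AT_OR_BELOW


# Made-up sample events: (severity keyword, message).
events = [
    ("crit", "disk failure on /dev/sda"),
    ("info", "cron job finished"),
    ("panic", "kernel panic"),
]
to_escalate = [msg for sev, msg in events if is_exception_event(sev)]
print(to_escalate)  # ['disk failure on /dev/sda', 'kernel panic']
```

A real monitoring tool would read the severity from the parsed log record rather than from a hand-written list.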


Impact

Impact is often based on how Service Levels will be affected. Impact and Urgency are used to assign priority.

Types:

  • High: Business Unit, floor, branch, LOB, or multiple VIPs
  • Medium: small group of users or a single VIP
  • Low: a single user
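Combining Impact and Urgency into a priority is often done with a lookup matrix. The sketch below is an assumption for illustration only: the Urgency levels, the priority numbers, and the function name are hypothetical, and each organization defines its own matrix.

```python
# Hypothetical priority matrix: keys are (impact, urgency), 1 is the highest priority.
PRIORITY = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 5,
}


def assign_priority(impact: str, urgency: str) -> int:
    """Look up the priority for an impact/urgency pair (case-insensitive)."""
    key = (impact.lower(), urgency.lower())
    if key not in PRIORITY:
        raise ValueError(f"unknown impact/urgency combination: {key}")
    return PRIORITY[key]


print(assign_priority("High", "Medium"))  # 2
print(assign_priority("low", "low"))      # 5
```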


Incident Management

A process to restore normal service operation within Service Level Agreement limits as quickly as possible and minimize the adverse impact of the Incident on business operations, thus ensuring that the best possible levels of service quality and availability are maintained.


ITIL

ITIL, an acronym for Information Technology Infrastructure Library, is a set of detailed practices for IT service management (ITSM) that focuses on aligning IT services with the needs of the business.


Intermediate Distribution Frame (IDF)

Intermediate Distribution Frame is a wiring rack located between the MDF (main distribution frame) and the intended end user devices (telephones, routers, PCs, etc.). Cables run from the outside world to the MDF and then to the IDFs.


Main Distribution Frame (MDF)

Main Distribution Frame is a wiring rack that connects outside lines with internal lines. It is used to connect public or private lines coming into the building to internal networks. In a telco central office (CO), the MDF is generally in close proximity to the telephone switch.


Mean Time Between Failures (MTBF)

A Metric for measuring and reporting Reliability. MTBF is the average time that a Configuration Item or service can perform its agreed Function without interruption. This is measured from when the CI or service starts working, until it next fails.


Mean Time between System / Service Incidents (MTBSI)

The mean elapsed time between the occurrence of one system or service failure and the next.


Mean Time to Restore Service (MTRS)

The average time taken to restore a Configuration Item or service after a Failure. MTRS is measured from when the CI or service fails until it is fully Restored and delivering its normal functionality.


Mean Time to Repair (MTTR)

The average time taken to repair a Configuration Item or service after a Failure. MTTR is measured from when the CI or service fails until it is repaired. MTTR does not include the time required to Recover or Restore.
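The measurement points in the MTBF, MTBSI, and MTRS definitions can be made concrete with a small calculation. This is a sketch over a made-up incident log, where each entry records when the service failed and when it was fully restored, in hours since the start of the reporting period. MTTR is left out because it would need a separate "repaired" timestamp that this hypothetical log does not record.

```python
from statistics import mean

# Hypothetical incident log: (hour the service failed, hour it was fully restored).
incidents = [(100.0, 102.0), (300.0, 301.0), (650.0, 653.0)]

# MTRS: average time from failure until the service is fully restored.
mtrs = mean(restored - failed for failed, restored in incidents)

# MTBF: average uptime, from one restoration until the next failure.
mtbf = mean(
    next_failed - restored
    for (_, restored), (next_failed, _) in zip(incidents, incidents[1:])
)

# MTBSI: average time from one failure to the next failure.
mtbsi = mean(
    next_failed - failed
    for (failed, _), (next_failed, _) in zip(incidents, incidents[1:])
)

print(mtrs, mtbf, mtbsi)  # 2.0 273.5 275.0
```

Each interval between incidents is one uptime stretch plus the restore time that preceded it, which is why MTBSI comes out slightly larger than MTBF.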