Operational Resilience Series #5: Design and running of a scenario

So far in this series we have identified important business services, designed impact tolerances and mapped the processes and resources that support these services. With that foundation in place, we can start considering scenarios: can we meet these tolerance levels if we experience disruption?

Let’s cover:

  • Types of scenarios and their use
  • Drivers to consider when establishing scenarios
  • Testing the scenario
  • Varying the scenario
  • Integrating resource-based and event-based scenarios

Types of scenarios

We recommend two types of scenarios that have different applications: resource-based scenarios and event-based scenarios.

Resource-based scenarios consider what happens when an individual resource – or some combination of resources – is missing or not functional. We don’t care why it’s disrupted – we simply assume that it is no longer available. This provides quick insights into which resources impact multiple important business services.

Event-based scenarios consider the events leading up to the disruption of our resources. This requires us to map those scenarios to processes or resources. More than that, it allows us to compare the event to our existing capabilities: can we recover within our defined impact tolerance?

Severe but plausible

There is a focus on ‘severe but plausible’ scenarios, particularly for regulated entities in the financial sector, though the concept is relevant to everyone else. Operational resilience outcomes are based on withstanding and recovering from severe shocks to operations. The focus on severe implies that, of all the scenarios you can imagine, you want to focus on those that will really push your organisation’s ability to meet its impact tolerances.

When considering the ‘plausible’ side of the equation, the concept of likelihood takes a backseat. While the scenario shouldn’t be so remote as to be meaningless or impractical to respond to, the focus is on severity.

Resource-based scenarios

If you’ve used technology to consistently map your services, you can gain some quick insights into which resources may be more critical than others. For each resource, you will be able to see how many important business services are linked to it, and what the impact tolerance of each service is. For some of those resources, you may have already identified a Recovery Time Objective (RTO). You can immediately identify resources whose RTO is longer than the impact tolerance of a service they support, and take action to rectify the gap.
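The RTO check described above can be sketched in a few lines. This is a minimal illustration only: the resource names, service names and hour figures are hypothetical, and it assumes your mapping can be exported as simple lookups.

```python
from dataclasses import dataclass

# All names and figures below are hypothetical illustrations, not real data.


@dataclass(frozen=True)
class Service:
    name: str
    impact_tolerance_hours: float  # maximum tolerable disruption


# Resource name -> RTO in hours (None where no RTO has been identified yet).
resource_rto_hours = {
    "core-banking-platform": 4.0,
    "offshore-contact-centre": 24.0,
    "payments-gateway": None,
}

# Resource name -> the important business services linked to it.
resource_services = {
    "core-banking-platform": [Service("Access to funds", 8.0), Service("Loan origination", 48.0)],
    "offshore-contact-centre": [Service("Access to funds", 8.0)],
    "payments-gateway": [Service("Outbound payments", 12.0)],
}


def rto_gaps(rtos, mapping):
    """Return (resource, service, rto, tolerance) wherever the RTO exceeds the tolerance."""
    gaps = []
    for resource, services in mapping.items():
        rto = rtos.get(resource)
        if rto is None:
            continue  # no RTO identified, so there is nothing to compare yet
        for service in services:
            if rto > service.impact_tolerance_hours:
                gaps.append((resource, service.name, rto, service.impact_tolerance_hours))
    return gaps


gaps = rto_gaps(resource_rto_hours, resource_services)
for resource, service, rto, tolerance in gaps:
    print(f"{resource}: RTO {rto}h exceeds the {tolerance}h tolerance for '{service}'")
```

Resources without an RTO are skipped rather than flagged here; you may prefer to list them separately as a prompt to define one.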

While each important business service and its related impact tolerance stands alone, this approach might identify some resources that are single points of failure for multiple services. Before even considering more detailed event scenarios, you may take action to diversify your resources.
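One way to surface candidate single points of failure is to invert the service-to-resource mapping and count how many services rely on each resource. The sketch below assumes a hypothetical mapping and treats any resource shared by two or more services as a candidate; whether it is truly a single point of failure depends on redundancy that this simple count does not capture.

```python
from collections import defaultdict

# Hypothetical mapping: important business service -> resources it depends on.
service_resources = {
    "Access to funds": {"core-banking-platform", "offshore-contact-centre"},
    "Outbound payments": {"core-banking-platform", "payments-gateway"},
    "Loan origination": {"core-banking-platform", "credit-bureau-feed"},
}


def services_by_resource(mapping):
    """Invert the mapping so each resource lists the services that rely on it."""
    inverted = defaultdict(set)
    for service, resources in mapping.items():
        for resource in resources:
            inverted[resource].add(service)
    return inverted


def shared_resources(mapping, min_services=2):
    """Resources linked to at least `min_services` services, most-shared first."""
    inverted = services_by_resource(mapping)
    shared = [(r, sorted(s)) for r, s in inverted.items() if len(s) >= min_services]
    return sorted(shared, key=lambda item: (-len(item[1]), item[0]))


candidates = shared_resources(service_resources)
for resource, services in candidates:
    print(f"{resource} supports {len(services)} services: {', '.join(services)}")
```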

One caveat for regulated entities: The resource-based approach provides quick insights but doesn’t meet requirements to consider severe but plausible scenarios. Its role is to enable a focus on prevention, implementation of controls and increased monitoring of the health of resources.

Event-based scenarios

Event-based scenarios rely on specific events or triggers that would result in one or more of your resources being affected. Where resource-based scenarios rely on simple logic, event-based scenarios require imagination. Understanding the potential chain of events and appreciating second- and third-order effects are essential to developing a plausible scenario.

One way to consider scenarios and retain a consistent format is to follow a standard scenario statement template that includes key components of causes, affected resources and/or processes, disrupted services, and impact on customers. If you are using technology, you can link your scenario to the affected resources or processes to ensure consistent mapping.

Here is an example scenario statement for a financial services provider:

Transatlantic telecommunications cable broken or disrupted, resulting in the offshore contact centre operations that support phone-based transactions being unavailable, resulting in customers being unable to access funds.

We will return to this example when we consider testing.

Adding context not only highlights how the scenario will disrupt the service, but also allows you to test existing business continuity plans or other recovery and response mechanisms against it.
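If you want to keep scenario statements consistent, the template components described above (cause, affected resources or processes, disrupted services, customer impact) can be captured as a small data structure. This is a sketch only: the class name, field names and the exact wording produced by `render()` are assumptions for illustration, populated with the example from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioStatement:
    """Hypothetical container for the four template components."""

    cause: str
    affected_resources: str  # resources and/or processes that are disrupted
    disrupted_services: str
    customer_impact: str

    def render(self) -> str:
        """Join the components into a single sentence (wording is illustrative)."""
        return (
            f"{self.cause}, resulting in {self.affected_resources} being unavailable, "
            f"disrupting {self.disrupted_services}, resulting in {self.customer_impact}."
        )


statement = ScenarioStatement(
    cause="Transatlantic telecommunications cable broken or disrupted",
    affected_resources="offshore contact centre operations",
    disrupted_services="phone-based transactions",
    customer_impact="customers being unable to access funds",
)
print(statement.render())
```

Keeping the components as separate fields, rather than one free-text sentence, makes it easier to link each scenario to the mapped resources and services.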

Drivers to consider when establishing scenarios

Every sector and organisation will be different. Some disruption will be acute, such as a cyber-attack that instantly shuts down systems. Some will be driven by chronic factors – COVID-19 being the obvious example. We’ve listed some drivers, triggers or factors to consider that might inspire useful scenarios:

  • Cyber events that disrupt systems or restrict access to critical data
  • Third party concentration risks
  • Economic drivers such as inflation or interest rate changes
  • Unavailability or significant delay of key products or services across the supply chain
  • Weather events or natural hazards
  • Specific skills shortages
  • Incapacitation of a key person, or groups of people
  • Failure of major utility providers, such as power or telecommunications
  • Failure of infrastructure that your entire sector relies on
  • Insider threats
  • System misconfiguration and cascading systems failures

This is just a small list to get you started – you will need to consider what is appropriate for your organisation. Don’t forget to consider whether these would affect your critical vendors with impacts that would flow on to you, and whether some of the drivers are correlated or could occur at the same time.

Testing the scenario

Once you’ve established the scenario, it is time to test it. If this scenario were to occur, would your internal capabilities allow you to recover potentially affected services within their defined impact tolerances?

Testing can be conducted in several ways and will depend on the maturity of your operational resilience program.

  • Desktop Review – This can be a preliminary step, where individual stakeholders independently review the scenario and challenge assumptions. Their feedback can be incorporated into further tests to increase their realism.
  • Desktop Walkthrough – This involves walking through the scenario with the key stakeholders likely to be affected, who step through how they would respond and validate that crisis response, business continuity plans or other contingency arrangements would be effective given the scenario. This may include activation of any business continuity or contingency arrangements. The discussion may highlight flaws in the current response that can be rectified, and it clarifies roles and responsibilities.
  • Simulated Exercises – This involves responding as though the scenario is actually occurring. This might include real-time activation of business continuity plans and contingency plans. This can help identify unexpected outcomes or interdependencies, or where resources are not sufficient.
  • Full Exercise – A full exercise attempts to simulate the scenario. This might include actively removing or impacting the resources required to deliver the services.

Remember that however you test the scenario, the objective is not to assess whether crisis or business continuity plans are effective (though that might also be assessed); it’s to determine whether impact tolerances can be achieved. You may find that for some severe but plausible scenarios, the Recovery Time Objective of those plans cannot be achieved, but impact tolerance can.

Let’s return for a moment to our previous example:

Transatlantic telecommunications cable broken or disrupted, resulting in the offshore contact centre operations that support phone-based transactions being unavailable, resulting in customers being unable to access funds.

If we run this scenario against our capability via a desktop walkthrough, we might identify that we have a business continuity plan for failure of this contact centre, which is to activate contingency arrangements with a third party in the same region. In this scenario, that third party would also be affected by the disruption. Further discussion during the walkthrough identifies that without a pre-arranged contingency in place with an onshore backup, we are unlikely to meet our defined impact tolerance.
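The finding in this walkthrough, that the contingency shares a dependency with the thing it is meant to replace, can also be checked against mapped data. The sketch below is illustrative only: the dependency names are hypothetical, and modelling the same-region third party as relying on the same cable is an assumption made to mirror the example.

```python
# Hypothetical dependency data for the worked example.
scenario_disrupted = {"transatlantic-cable"}

arrangements = {
    "offshore contact centre (primary)": {"transatlantic-cable", "offshore-site-power"},
    "same-region third party (contingency)": {"transatlantic-cable", "third-party-site-power"},
    "onshore backup (proposed)": {"domestic-telecoms"},
}


def viable_arrangements(options, disrupted):
    """Arrangements that share no dependency with the disrupted set."""
    return sorted(name for name, deps in options.items() if not deps & disrupted)


viable = viable_arrangements(arrangements, scenario_disrupted)
print("Still available in this scenario:", viable)
```

A check like this does not replace the walkthrough discussion, but it can prompt the question of which contingencies fail together with the primary arrangement.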

Varying the scenario

You may have a ‘primary version’ of each scenario that you initially test, but particularly as you mature, you should then consider variations of the scenario. If your initial test meets your impact tolerance, what plausible variations might take you longer to recover? A few points to consider:

  • What if the event occurred in a different location? What if it covered multiple sites or regions that we operate in at the same time?
  • What if additional resources were also affected by the scenario? Would this impact additional services, and could we recover all of them within their impact tolerances?
  • Is it plausible that the initial event could last longer than we’ve just tested?
  • What if the third parties we rely on for contingency arrangements fail at the time we need them, or the resources required to enact them are not available due to supply chain issues?
  • If our competitors are also affected, are we competing for the same resources to recover?
  • How severe would the initial drivers need to be (or escalate to) to breach our impact tolerance? Is that level of severity plausible?

The principle is to keep pushing boundaries and assumptions until the scenario is no longer plausible. All this variation and testing requires resources, so you need to balance your approach. You may conduct a more thorough test of the primary scenario, followed by desktop walkthroughs of variations to identify potential weaknesses.
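To keep track of how each variation compares with the impact tolerance, you could record an estimated time to restore the service for each one and rank them by remaining headroom. The figures below are hypothetical placeholders, the kind of estimate that might come out of a desktop walkthrough, not results from any real test.

```python
IMPACT_TOLERANCE_HOURS = 8.0  # hypothetical impact tolerance for the service

# Hypothetical estimates of hours needed to restore the service in each variation.
estimated_recovery_hours = {
    "primary scenario": 6.0,
    "event spans multiple regions": 10.0,
    "contingency third party also fails": 30.0,
}


def headroom(estimates, tolerance_hours):
    """Margin in hours per variation (negative means a breach), tightest first."""
    margins = [(name, tolerance_hours - hours) for name, hours in estimates.items()]
    return sorted(margins, key=lambda item: item[1])


results = headroom(estimated_recovery_hours, IMPACT_TOLERANCE_HOURS)
for name, margin in results:
    status = "within tolerance" if margin >= 0 else "BREACH"
    print(f"{margin:+6.1f}h  {status:16}  {name}")
```

Variations with the least headroom are natural candidates for a more thorough test.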

Integrating resource-based and event-based scenarios

The outcome of operational resilience is to ensure we can restore our important business services within defined impact tolerance. Running resource-based scenarios quickly highlights which resources, or combinations of resources, are going to affect important business services. The nature of impact tolerances makes it difficult to aggregate the effect of multiple services being impacted, but a rule of thumb is to focus on small groups of resources that affect the highest number of services – single points of failure or small clusters of failure.

Once you’ve sorted those groups into a rough order of importance, match them to your event-based scenarios. If you don’t have an existing scenario that maps to those combinations, it may be an area for further development.
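The rule of thumb above can be sketched as follows: rank small groups of resources by how many services they would affect, then flag the groups that no existing event-based scenario covers. The mappings and scenario names are hypothetical, and the sketch assumes a service is affected if any one of its resources is unavailable (no redundancy).

```python
from itertools import combinations

# Hypothetical mapping: important business service -> resources it depends on.
service_resources = {
    "Access to funds": {"core-banking-platform", "offshore-contact-centre"},
    "Outbound payments": {"core-banking-platform", "payments-gateway"},
    "Loan origination": {"credit-bureau-feed"},
}

# Hypothetical event-based scenarios -> resources each one is mapped to.
event_scenarios = {
    "Transatlantic cable outage": {"offshore-contact-centre"},
    "Ransomware attack": {"core-banking-platform", "payments-gateway"},
}


def rank_groups(mapping, max_size=2):
    """Rank groups of up to `max_size` resources by the number of services affected."""
    resources = sorted(set().union(*mapping.values()))
    ranked = []
    for size in range(1, max_size + 1):
        for group in combinations(resources, size):
            affected = sorted(s for s, deps in mapping.items() if deps & set(group))
            ranked.append((group, affected))
    # Most services first; prefer smaller groups when the count is tied.
    return sorted(ranked, key=lambda item: (-len(item[1]), len(item[0]), item[0]))


def uncovered(groups, scenarios):
    """Groups that are not fully contained in any scenario's mapped resources."""
    return [g for g, _ in groups if not any(set(g) <= res for res in scenarios.values())]


ranked = rank_groups(service_resources)
gaps = uncovered(ranked[:3], event_scenarios)
print("Highest-ranked groups without a matching scenario:", gaps)
```

Groups that appear in `gaps` are the areas where a new event-based scenario may be worth developing.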

About this series

We’ve covered some of the key points of designing and testing your scenario, but the results of those tests need to lead somewhere. Next in this series we will cover the identification of weaknesses and actions in your operational resilience.

Next steps for your organisation

Protecht recently launched the Protecht.ERM Operational Resilience module, which helps you identify and manage potential disruption so you can provide the critical services your customers and community rely on.

Note on regulation and terminology

While this series primarily discusses regulated entities, the guidance can apply to any organisation seeking to improve their operational resilience by looking through an external stakeholder lens, whether they operate in financial services, critical infrastructure, healthcare or indeed any other industry.

We use the term ‘important business services’, which aligns with the UK’s Financial Conduct Authority/Prudential Regulation Authority terminology but can and should be adapted to different regions and sectors. For Australian financial service providers, we recommend replacing ‘important business services’ with ‘critical operations’, and ‘impact tolerance’ with ‘tolerance levels’, to align with APRA draft standard CPS 230 on Operational Risk.

We use the term ‘customer’ in this blog, which can include direct consumers, business to business relationships, patients in health care settings, or recipients of government services. The defining factor is that they are external recipients of the services you provide.

About the author

Michael is passionate about the field of risk management and related disciplines, with a focus on helping organisations succeed using a ‘decisions eyes wide open’ approach. His experience includes managing risk functions, assurance programs, policy management, corporate insurance, and compliance. He is a Certified Practicing Risk Manager whose curiosity drives his approach to challenge the status quo and look for innovative solutions.