Root cause analysis: What can the FAA disruption teach us?

On January 11, 2023, the FAA suffered an outage that grounded flights across the US – the first nationwide grounding since September 11, 2001. The main cause cited at the time of publication (February 2023) is a technical glitch caused by contractors, but perhaps there is more to learn.

In this blog we will cover:

What happened
What we know so far
Key things to consider in root cause analysis

What happened on and after January 11?

The Notice to Air Missions (NOTAM) system operated by the FAA experienced an outage on January 11. This system distributes real-time data to pilots, which can include information relevant to that specific flight, such as deviations from flight paths due to identified hazards, local airport conditions or other factors that can influence the flight. Pilots are required to check NOTAM reports before flights.

At 7:15am EST, the FAA released a statement that it was grounding flights until 9am. At this time they were already working to resolve the problem. In a further update at 8:15am, they stated that they were restoring the NOTAM system following an overnight outage. This was followed by a statement at 8:50am that flights were resuming.

As a result, thousands of flights were either cancelled or delayed that day, and of course many people were impacted by the disruption. There were no safety implications for flights already in the air.

On January 25, the House passed a Bill which would require the creation of a task force to study the NOTAM system.

What do we know so far?

What is interesting about the FAA’s statements so far is what is said and what goes unsaid. By the end of day on January 11, the FAA stated that:

“The FAA is continuing a thorough review to determine the root cause of the Notice to Air Missions (NOTAM) system outage. Our preliminary work has traced the outage to a damaged database file”.

The FAA’s next statement on its website (and final statement at the time of this writing), dated January 19, includes:

“A preliminary FAA review of last week’s outage of the Notice to Air Missions (NOTAM) system determined that contract personnel unintentionally deleted files while working to correct synchronization between the live primary database and a backup database.”

While they don’t use the phrase ‘root cause’ in the second statement, the perhaps not-so-subtle implication is that the contract personnel are responsible for this outage. They do cover their bases somewhat in the same statement, where they also say:

“The FAA made the necessary repairs to the system and has taken steps to make the NOTAM system more resilient. The agency is acting quickly to adopt any other lessons learned in our efforts to ensure the continuing robustness of the nation’s air traffic control system.” [our emphasis added]

While we might not expect the FAA to say everything that is happening behind closed doors, it’s a broad statement. In the absence of more detail, I’m left asking:

How was the system made more resilient?
Have they already learned the lessons that need to be adopted, or are there still lessons that need to be identified?
Given the leap from specifying that contractors deleted files to ‘making the system more resilient’, were there any interim findings on root cause?

Elsewhere in the news, an email from the FAA to lawmakers stated that Spatial Front, the contractor in question, has lost access to FAA buildings and systems while the investigation is completed. Is this the action that has ‘made systems more resilient’?

What should you consider in root cause analysis?

Given the scrutiny the FAA is under from government regarding the incident, I’d bet that the FAA is not paying lip service and is actively undertaking further analysis. Regardless, it’s important to recognize that a contractor deleting a file is not the root cause.

At Protecht, when we help our customers develop taxonomies for their cause libraries, ‘people’ is an almost universally adopted category. However, I always caution against stopping at ‘people’ as the root cause as soon as you get there. What environment were those people in? What other factors may have contributed? What was within our control?

For this incident, why were the contractors able to delete the file in the first place? Why was there no failsafe or quick recovery even if they were able to? The FAA has since removed the contractors’ access (does this introduce downsides?), which might reduce the likelihood of a similar recurrence, but doesn’t address those other questions, which are arguably also within the FAA’s control. The underlying design of the NOTAM system, and related disaster recovery and business continuity procedures, look like opportunities for improvement.

Some of these underlying root causes can also affect more than one business objective. While the NOTAM system is essential for providing pilots with information about their flight, it is also antiquated. It relies on shorthand code developed over decades and suffers from information overload: some NOTAM reports can be dozens of pages. While the NOTAM system was designed to keep flights safe, in one case the data overload was considered a contributing cause to a near-collision.

In summary

Here are some take-aways:

If someone in your organization points to people as the root cause of a problem, check whether there were other contributing factors within your control
Recognize causes that can influence multiple risks or result in multiple types of impacts – you might be able to solve multiple problems at once
More tactically, if you have critical systems that require data to operate, make sure they are adequately protected from accidental or malicious deletion
Ensure that you are operationally resilient to disruption caused by deletion, either through redundancy or effective recovery
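For readers who want to see the last two take-aways in concrete terms, here is a minimal Python sketch of a ‘backup before delete’ guard with a matching restore step. It is purely illustrative and says nothing about how the NOTAM system actually works; the function names guarded_delete and restore and the backup directory layout are my own assumptions.

```python
import shutil
from pathlib import Path


def guarded_delete(target: Path, backup_dir: Path) -> Path:
    """Copy `target` into `backup_dir`, verify the copy, then delete `target`.

    Hypothetical helper for illustration only. Returns the backup path.
    """
    if not target.is_file():
        raise FileNotFoundError(f"{target} is not an existing file")

    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / target.name

    # Never silently overwrite an earlier backup of the same name.
    if backup.exists():
        raise FileExistsError(f"refusing to overwrite existing backup {backup}")

    shutil.copy2(target, backup)

    # Only delete once the backup is confirmed to match the original.
    if backup.read_bytes() != target.read_bytes():
        raise OSError(f"backup of {target} failed verification")

    target.unlink()
    return backup


def restore(backup: Path, target: Path) -> None:
    """Put a backed-up file back in place, without clobbering a live file."""
    if target.exists():
        raise FileExistsError(f"{target} already exists; not overwriting")
    shutil.copy2(backup, target)
```

The point of the sketch is the ordering: the deletion only happens after a verified copy exists, so an accidental delete becomes a quick restore rather than an outage. In practice you would pair this with access controls and tested recovery procedures.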

Next steps for your organization

Protecht recently launched the Protecht ERM Operational Resilience module, which helps you identify and manage potential disruption so you can provide the critical services your customers and community rely on.
