Industrial Alarm Triage + Decision Support at 100k+ Tag Scale

___________________________________________________________________________________

Case Study Snapshot

Role: Lead Designer (UX strategy + IA + interaction + validation + rollout)
Product: Real-time decision interface for industrial operations (DCS/SCADA + historian + alarms)
Users: Console ops, supervisors, engineers, field + on-call mobile
Scale: 100k+ tags; sub-second requirements
Timeline: 18 months; team: 2 PDs + 1 researcher + 10 engineers
Outcome: Reduced time-to-diagnosis and unnecessary escalations; improved shift handoffs

___________________________________________________________________________________

The situation:

In dangerous industrial environments—refineries, chemical plants, pharma production, power generation, and mining—operators don’t have the luxury of “digging around.” They’re managing sub-second signals where a pressure spike or compressor surge can cascade into safety incidents, environmental releases, equipment damage, quality loss, or full shutdowns.

The reality inside most control rooms is brutal: too many screens, too many clicks, alarm floods, buried data, and inconsistent mental models across tools and teams. When alarms are constant and prioritization is unclear, the system trains people to ignore it—exactly when they can’t afford to.

Users and operating context:

This application supported a mixed set of users operating across multiple environments:
Console operators in the control room (DCS/SCADA context) handling live operations
Shift supervisors coordinating response and accountability
Process engineers diagnosing root causes and stabilizing conditions
Safety and maintenance supporting incident prevention and recovery
On-the-go field teams using tablets
On-call staff using mobile for escalation and triage

Expertise ranged from new operators to 20-year veterans, which meant the UI had to work for both: fast pattern recognition for experts, and safe clarity for newer staff.

The system brought together:

DCS/SCADA tags
Plant historians (PI and similar)
Alarms/events
Maintenance systems
IoT sensor inputs

The scale was massive: 100,000+ tags, alarms occurring daily, and sub-second real-time performance requirements. This wasn’t “a dashboard”; it was a decision interface sitting on top of high-velocity operational reality.

The problem we set out to solve:

Operators needed to answer three questions instantly:

(1) What matters right now? (signal vs noise)

(2) What’s causing it? (diagnose without hunting through screens)

(3) What should we do next? (act confidently, communicate cleanly)

The prior experience made those questions harder than they should have been, especially during escalations. Alarm floods and fragmented workflows forced people into “tab-and-pray” behavior: bouncing across screens, correlating trends manually, and relying on tribal knowledge instead of shared operational clarity.

The solution:

I led the design of an application that combined situational awareness, alarm response, troubleshooting, and decision support into one coherent experience—built to distill complexity into immediate clarity.

My leadership contributions:

Set product UX strategy + success metrics
Defined IA + interaction model (Detect/Diagnose/Decide)
Ran/partnered on contextual research + drills
Established design standards for ISA-101 constraints
Drove alignment across eng/safety/process SMEs
Led rollout + adoption enablement (training, patterns, governance)

At a high level, the system was designed around a Detect → Diagnose → Decide loop:

1) Detect: clarity under alarm volume

Hierarchical information architecture so operators saw the plant state first, then area/unit, then asset, then signal
Alarm rationalization + risk scoring to surface what’s critical, not what’s loud
Contextual alarms tied to the relevant process view, trends, and likely drivers
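
One way to picture ranking by "critical, not loud" is the sketch below. It is illustrative only and not the production logic: the severity weights, the seconds_to_consequence estimate, the chattering penalty, and the tag paths are all assumptions made up for this example.

```python
from dataclasses import dataclass

# Hypothetical severity weights. A real site would derive these from its own
# alarm rationalization process, not from this sketch.
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 7, "critical": 10}


@dataclass(frozen=True)
class Alarm:
    tag: str                       # made-up path: "AREA/ASSET/SIGNAL"
    severity: str                  # key into SEVERITY_WEIGHT
    seconds_to_consequence: float  # assumed estimate of time left to respond
    chattering: bool = False       # True if the alarm is repeating rapidly


def risk_score(alarm: Alarm) -> float:
    """Higher score means more urgent. Illustrative formula only."""
    # Less time to respond means more urgency; clamp to avoid dividing by zero.
    urgency = 60.0 / max(alarm.seconds_to_consequence, 1.0)
    score = SEVERITY_WEIGHT[alarm.severity] * (1.0 + urgency)
    # Down-weight chattering alarms so volume alone does not win the top slot.
    return score * (0.25 if alarm.chattering else 1.0)


def triage(alarms: list[Alarm]) -> list[Alarm]:
    """Return alarms ordered by risk score, most urgent first."""
    return sorted(alarms, key=risk_score, reverse=True)


def rollup_by_area(alarms: list[Alarm]) -> dict[str, float]:
    """Plant-level view: the highest risk score seen in each top-level area."""
    worst: dict[str, float] = {}
    for alarm in alarms:
        area = alarm.tag.split("/", 1)[0]
        worst[area] = max(worst.get(area, 0.0), risk_score(alarm))
    return worst
```

The rollup mirrors the plant, then area, then asset, then signal hierarchy: the top-level view only needs the worst score per area, and the operator drills down from there.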

2) Diagnose: reduce cognitive load, speed root cause

Progressive disclosure (high signal at the top, depth available when needed)
Trend overlays and sparklines to show directionality at a glance
Anomaly detection to flag deviations before they became events
Thresholds and confidence indicators so users could trust what the system was telling them (or know when not to)
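
The pairing of anomaly flags with confidence indicators can be sketched as follows. This is a minimal example that assumes a rolling z-score as the detection technique; the case study does not say which method the product used, and the class name, window size, and threshold are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags a reading that deviates sharply from its recent history.

    Assumption: a rolling z-score stands in for whatever detection
    method the real system used.
    """

    def __init__(self, window: int = 30, z_threshold: float = 3.0,
                 min_samples: int = 10):
        self.values = deque(maxlen=window)  # keeps only the latest readings
        self.window = window
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def update(self, value: float) -> dict:
        history = list(self.values)  # readings seen before this one
        self.values.append(value)
        if len(history) < self.min_samples:
            # Too little history to judge: say so instead of guessing.
            return {"anomaly": False, "z": None, "confidence": "low"}
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: any change at all is an unbounded deviation.
            z = 0.0 if value == mu else float("inf")
        else:
            z = abs(value - mu) / sigma
        confidence = "high" if len(history) >= self.window else "medium"
        return {"anomaly": z > self.z_threshold, "z": z,
                "confidence": confidence}
```

Returning an explicit confidence level alongside the flag is the point of the sketch: the interface can then show when a reading is trustworthy and when the system simply has not seen enough data.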

3) Decide: support confident action and clean escalation

Actionable troubleshooting workflows built around real operational scenarios (pressure spikes, compressor surge)
Shift handoff support so the “state of the plant” and “what we did / why” didn’t live in someone’s head
Designed for multiple contexts: control room, field tablet, mobile on-call—without dumbing down the core experience
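
A structured handoff of the kind described above could be modeled roughly like this. The field names and the missing_rationale check are assumptions for illustration, not the shipped schema.

```python
from dataclasses import dataclass, field


@dataclass
class ActionTaken:
    what: str  # what was done during the shift
    why: str   # rationale, so the next shift does not have to guess


@dataclass
class ShiftHandoff:
    outgoing: str     # operator ending the shift
    incoming: str     # operator starting the shift
    plant_state: str  # short summary of current plant state
    open_alarms: list[str] = field(default_factory=list)
    actions: list[ActionTaken] = field(default_factory=list)

    def missing_rationale(self) -> list[str]:
        """Actions recorded without a reason; these could block sign-off."""
        return [a.what for a in self.actions if not a.why.strip()]

    def summary(self) -> str:
        """Plain-text summary suitable for a console, tablet, or phone."""
        lines = [
            f"Handoff {self.outgoing} -> {self.incoming}",
            f"Plant state: {self.plant_state}",
        ]
        lines += [f"Open alarm: {tag}" for tag in self.open_alarms]
        lines += [f"Did: {a.what} | Why: {a.why}" for a in self.actions]
        return "\n".join(lines)
```

Making the rationale a required, checkable field is what moves "what we did / why" out of someone's head and into the record.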

Design constraints and standards:

This wasn’t a consumer app, and it couldn’t “look modern” at the cost of safety. We designed within strict constraints:

ISA-101 / high-performance HMI standards
Color and alarm rules (no creative color abuse)
Accessibility requirements
Legacy systems realities
Latency and reliability
Regulatory constraints

The goal wasn’t novelty. It was an interface that is predictable, readable, fast, and safe.

Research and validation approach:

We didn’t guess. We earned the right to simplify.

We used a combination of:

Shadowing and contextual inquiry
Interviews across roles (operators, supervisors, engineers, safety, maintenance)
Task analysis and workflow mapping
Incident reviews and postmortems
SME workshops to validate operational logic
Log analysis to understand alarm/behavior patterns

Validation wasn’t “click tests.” We ran extensive usability tests and simulations, including scenario-based drills designed to mirror real plant conditions. That’s where the design proved itself: under time pressure, noise, and cognitive load.

Team and execution:

I served as Lead Designer, working with:

Two UX Product Designers
One UX Researcher
Ten developers/engineers

We delivered end to end over 18 months, from discovery through production rollout and adoption.

Results:

The outcome was a system that delivered immediate operational clarity at industrial scale. Post-launch impact included:

Cut time-to-diagnosis by ~40% in scenario drills
Reduced nuisance-alarm handling time by ~30%
Reduced cognitive load during escalations
Decreased unnecessary escalations by ~60%
Before: operators had to jump across 12 screens to correlate alarms + trends. After: 1 consolidated view handled the top scenarios.
Before: shift handoffs were informal. After: a structured handoff captured plant state + actions + rationale.
Strong adoption across roles and contexts
“I don’t have to hunt anymore — I can see the story of the alarm.”

Even without publishing sensitive operational metrics, the behavior change was the point: operators spent less time hunting and more time stabilizing.

Why this matters:

In high-risk environments, UX isn’t polish—it’s risk control. This work wasn’t about making complex systems feel “simple.” It was about building a decision interface that performs under real operational stress, where every second counts and the wrong interpretation has real consequences.