
Industrial Alarm Triage + Decision Support at 100k+ Tag Scale

___________________________________________________________________________________


Case Study Snapshot


  • Role: Lead Designer (UX strategy + IA + interaction + validation + rollout)

  • Product: Real-time decision interface for industrial operations (DCS/SCADA + historian + alarms)

  • Users: Console operators, supervisors, engineers, field teams, and on-call staff on mobile

  • Scale: 100k+ tags; sub-second performance requirements

  • Timeline: 18 months; team: 2 product designers + 1 researcher + 10 engineers

  • Outcome: Reduced time-to-diagnosis and unnecessary escalations; improved shift handoffs


___________________________________________________________________________________


The situation:


In dangerous industrial environments—refineries, chemical plants, pharma production, power generation, and mining—operators don’t have the luxury of “digging around.” They’re managing sub-second signals where a pressure spike or compressor surge can cascade into safety incidents, environmental releases, equipment damage, quality loss, or full shutdowns.


The reality inside most control rooms is brutal: too many screens, too many clicks, alarm floods, buried data, and inconsistent mental models across tools and teams. When alarms are constant and prioritization is unclear, the system trains people to ignore it—exactly when they can’t afford to.


Users and operating context:

This application supported a mixed set of users operating across multiple environments:

  • Console operators in the control room (DCS/SCADA context) handling live operations

  • Shift supervisors coordinating response and accountability

  • Process engineers diagnosing root causes and stabilizing conditions

  • Safety and maintenance staff supporting incident prevention and recovery

  • On-the-go field teams using tablets

  • On-call staff using mobile for escalation and triage

Expertise ranged from new operators to 20-year veterans, which meant the UI had to work for both: fast pattern recognition for experts, and safe clarity for newer staff.


The system brought together:

  • DCS/SCADA tags

  • Plant historians (PI and similar)

  • Alarms/events

  • Maintenance systems

  • IoT sensor inputs


The scale was massive: 100,000+ tags, alarms occurring daily, and sub-second real-time performance requirements. This wasn’t “a dashboard.” It was a decision interface sitting on top of high-velocity operational reality.


The problem we set out to solve:

Operators needed to answer three questions instantly:

(1) What matters right now? (signal vs noise)

(2) What’s causing it? (diagnose without hunting through screens)

(3) What should we do next? (act confidently, communicate cleanly)


The prior experience made those questions harder than they should be—especially during escalations. Alarm floods and fragmented workflows forced people into “tab-and-pray” behavior: bouncing across screens, correlating trends manually, and relying on tribal knowledge instead of shared operational clarity.


The solution:

I led the design of an application that combined situational awareness, alarm response, troubleshooting, and decision support into one coherent experience—built to distill complexity into immediate clarity.


At a high level, the system was designed around a Detect → Diagnose → Decide loop, with each stage detailed below.


My leadership contributions:

  • Set product UX strategy + success metrics

  • Defined IA + interaction model (Detect/Diagnose/Decide)

  • Ran/partnered on contextual research + drills

  • Established design standards within ISA-101 constraints

  • Drove alignment across eng/safety/process SMEs

  • Led rollout + adoption enablement (training, patterns, governance)


1) Detect: clarity under alarm volume

  • Hierarchical information architecture so operators saw the plant state first, then area/unit, then asset, then signal

  • Alarm rationalization + risk scoring to surface what’s critical, not what’s loud

  • Contextual alarms tied to the relevant process view, trends, and likely drivers
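
As a rough illustration of "critical, not loud," the sketch below ranks alarms by consequence and urgency and deliberately ignores how often each one has fired. It is not the scoring model used in the product: the weights, field names, and tag names are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass

# Hypothetical severity weights, chosen only for illustration.
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 6, "critical": 10}


@dataclass
class Alarm:
    tag: str                   # made-up tag name, e.g. "C-101.SurgeMargin"
    severity: str              # consequence severity if nobody responds
    minutes_to_respond: float  # time available before the consequence
    repeats_last_hour: int     # how "loud" the alarm has been


def risk_score(alarm: Alarm) -> float:
    """Score by consequence and urgency; the repeat count is ignored on purpose."""
    urgency = 1.0 / max(alarm.minutes_to_respond, 1.0)
    return SEVERITY_WEIGHT[alarm.severity] * urgency


def triage(alarms: list[Alarm]) -> list[Alarm]:
    """Return alarms ordered most critical first."""
    return sorted(alarms, key=risk_score, reverse=True)


alarms = [
    Alarm("TI-204.High", "low", 60.0, repeats_last_hour=45),
    Alarm("C-101.SurgeMargin", "critical", 2.0, repeats_last_hour=1),
    Alarm("PI-310.HighHigh", "high", 10.0, repeats_last_hour=3),
]
ranked = triage(alarms)
for alarm in ranked:
    print(f"{alarm.tag:20s} score={risk_score(alarm):.2f}")
```

In this example the chattering low-severity temperature alarm (45 repeats) lands at the bottom of the list, while the single compressor surge alarm rises to the top.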


2) Diagnose: reduce cognitive load, speed root cause

  • Progressive disclosure (high signal at the top, depth available when needed)

  • Trend overlays and sparklines to show directionality at a glance

  • Anomaly detection to flag deviations before they became events

  • Thresholds and confidence indicators so users could trust what the system was telling them (or know when not to)
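
To make the pairing of anomaly flags and confidence indicators concrete, here is a minimal sketch assuming a simple z-score check against recent history. The actual detection method is not described in this case study, so the function name, the 3-sigma limit, and the 30-sample cutoff are all assumptions for illustration.

```python
from statistics import mean, stdev


def check_reading(history: list[float], value: float,
                  z_limit: float = 3.0, min_samples: int = 30) -> dict:
    """Flag a reading that deviates from recent history, with a confidence label.

    Hypothetical rule: confidence is "high" only when enough history exists
    for the baseline to be meaningful.
    """
    if len(history) < 2:
        # Not enough data to estimate spread, so say so instead of guessing.
        return {"anomaly": False, "z": 0.0, "confidence": "none"}
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        z = 0.0 if value == mu else float("inf")
    else:
        z = abs(value - mu) / sigma
    confidence = "high" if len(history) >= min_samples else "low"
    return {"anomaly": z > z_limit, "z": z, "confidence": confidence}


# Made-up pressure readings hovering around 100.
steady = [100.0 + 0.5 * (i % 3 - 1) for i in range(30)]

normal = check_reading(steady, 100.2)
spike = check_reading(steady, 104.0)
thin = check_reading(steady[:5], 104.0)
print(normal["anomaly"], spike["anomaly"], thin["confidence"])
```

The third call shows the "know when not to trust it" case: the spike is still flagged, but the result is labeled low confidence because only five samples back the baseline.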


3) Decide: support confident action and clean escalation

  • Actionable troubleshooting workflows built around real operational scenarios (pressure spikes, compressor surge)

  • Shift handoff support so the “state of the plant” and “what we did / why” didn’t live in someone’s head

  • Designed for multiple contexts: control room, field tablet, mobile on-call—without dumbing down the core experience
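
One way to picture the structured handoff is as a record that refuses to be "complete" until the plant state is written down and every action carries its rationale. The sketch below is a hypothetical data shape, not the product's schema; the class names, fields, and sample entries are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ActionTaken:
    what: str  # what we did
    why: str   # the rationale, so it does not live in someone's head


@dataclass
class ShiftHandoff:
    plant_state: str
    open_alarms: list[str] = field(default_factory=list)
    actions: list[ActionTaken] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """List what is still missing before the handoff can be signed off."""
        problems = []
        if not self.plant_state.strip():
            problems.append("plant state is empty")
        for action in self.actions:
            if not action.why.strip():
                problems.append(f"no rationale for: {action.what}")
        return problems


# Made-up example entries.
handoff = ShiftHandoff(
    plant_state="Unit 2 stable after compressor surge at 03:10",
    open_alarms=["PI-310.HighHigh"],
    actions=[
        ActionTaken("Reduced feed rate by 5%", "Restore surge margin"),
        ActionTaken("Bypassed analyzer AT-12", ""),
    ],
)
print(handoff.gaps())
```

Here the second action is reported as a gap because it has no rationale, which is the kind of omission an informal handoff tends to lose.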


Design constraints and standards:

This wasn’t a consumer app, and it couldn’t “look modern” at the cost of safety. We designed within strict constraints:

  1. ISA-101 / high-performance HMI standards

  2. Color and alarm rules (no creative color abuse)

  3. Accessibility requirements

  4. Legacy system realities

  5. Latency and reliability

  6. Regulatory constraints

The goal wasn’t novelty. It was an interface that was predictable, readable, fast, and safe.


Research and validation approach:

We didn’t guess. We earned the right to simplify.

We used a combination of:

  • Shadowing and contextual inquiry

  • Interviews across roles (operators, supervisors, engineers, safety, maintenance)

  • Task analysis and workflow mapping

  • Incident reviews and postmortems

  • SME workshops to validate operational logic

  • Log analysis to understand alarm/behavior patterns


Validation wasn’t “click tests.” We ran extensive usability tests and simulations, including scenario-based drills designed to mirror real plant conditions. That’s where the design proved itself: under time pressure, noise, and cognitive load.


Team and execution:

I served as Lead Designer, working with:

  • Two UX Product Designers

  • One UX Researcher

  • Ten developers/engineers

Together we delivered end-to-end over 18 months, from discovery through production rollout and adoption.


Results:

The outcome was a system that delivered immediate operational clarity at industrial scale. Impact across scenario drills and post-launch use included:

  • Cut time-to-diagnosis by ~40% in scenario drills

  • Reduced nuisance-alarm handling time by ~30%

  • Reduced cognitive load during escalations

  • Decreased unnecessary escalations by ~60%

  • Before: operators had to jump across 12 screens to correlate alarms and trends. After: one consolidated view handled the top scenarios.

  • Before: shift handoffs were informal. After: structured handoff captured plant state + actions + rationale.

  • Strong adoption across roles and contexts

  • “I don’t have to hunt anymore — I can see the story of the alarm.”


Even with detailed operational metrics kept confidential, the behavior change was the point: operators were spending less time hunting and more time stabilizing.


Why this matters:


In high-risk environments, UX isn’t polish—it’s risk control. This work wasn’t about making complex systems feel “simple.” It was about building a decision interface that performs under real operational stress, where every second counts and the wrong interpretation has real consequences.

© 2025 by Jack Shapiro. 
