Project

2022–2024

Realtime ops platform

One pane of glass for incidents — live signals, runbooks, and actions.

Tech lead

Unified dashboard and automation layer cutting incident response time for distributed teams.

Platform, SRE, Realtime

React, Node, GraphQL, Redis, AWS

Funding & structure

Employer / product org

Platform team

Why

Incidents were losing time to context switching; leadership needed fewer handoffs and a shared picture every responder could trust under pressure.

Pain points

Mean time to acknowledge varied wildly because context lived in chat threads, tickets, and bespoke dashboards.
Different teams used different vocabularies for the same incidents, slowing coordination.
Automation existed but was not discoverable — responders repeated manual steps during stress.

Overview

Operations teams were juggling several tools during incidents. This platform centralized live signals, runbooks, and actions so responders could see the same picture and trigger automations from one place — with permissions that matched on-call reality.

Architecture

A GraphQL aggregation layer normalized events, entities, and actions from upstream systems. Real-time updates used WebSockets with Redis pub/sub so dashboards stayed coherent during spikes. Permissions mirrored paging and team ownership so automation stayed safe.
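
To make the normalization step concrete, here is a minimal sketch of mapping two upstream payloads into one shared event shape. The payload shapes, field names, and severity mapping (`PagerPayload`, `MonitorPayload`, `NormalizedEvent`, `fromPager`, `fromMonitor`) are all hypothetical stand-ins; the platform's actual schema and upstream systems are not shown here.

```typescript
// Hypothetical common shape; the real schema is not part of this write-up.
type Severity = "info" | "warning" | "critical";

interface NormalizedEvent {
  id: string;
  source: string;
  service: string;
  severity: Severity;
  summary: string;
  occurredAt: string; // ISO 8601 timestamp
}

// Two illustrative upstream payloads that describe the same kind of thing
// with different vocabularies.
interface PagerPayload {
  incident_key: string;
  service_name: string;
  urgency: "low" | "high";
  title: string;
  created_at: number; // epoch seconds
}

interface MonitorPayload {
  alertId: string;
  target: string;
  level: "P1" | "P2" | "P3";
  message: string;
  timestamp: string; // ISO 8601
}

function fromPager(p: PagerPayload): NormalizedEvent {
  return {
    id: `pager:${p.incident_key}`,
    source: "pager",
    service: p.service_name.toLowerCase(),
    severity: p.urgency === "high" ? "critical" : "warning",
    summary: p.title.trim(),
    occurredAt: new Date(p.created_at * 1000).toISOString(),
  };
}

// Assumed mapping from the monitor's priority levels to the shared severity scale.
const monitorLevels: Record<MonitorPayload["level"], Severity> = {
  P1: "critical",
  P2: "warning",
  P3: "info",
};

function fromMonitor(m: MonitorPayload): NormalizedEvent {
  return {
    id: `monitor:${m.alertId}`,
    source: "monitor",
    service: m.target.toLowerCase(),
    severity: monitorLevels[m.level],
    summary: m.message.trim(),
    occurredAt: new Date(m.timestamp).toISOString(),
  };
}
```

With adapters like these at the edge, GraphQL resolvers and dashboards only need to understand `NormalizedEvent`, which is one way to keep vocabulary consistent across teams.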

Diagrams

[Figure: Realtime fan-out]

Technical deep dive

The front end is React, backed by a Node/GraphQL API layer; Redis handles pub/sub and short-lived state; AWS provides deployment and managed services. The design focused on predictable performance during spikes, because incident traffic is not steady-state traffic.
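
The fan-out pattern behind the WebSockets and Redis pairing can be sketched as follows. This is an illustration under stated assumptions, not the production code: `InMemoryBroker` is an in-memory stand-in for Redis pub/sub so the example runs without dependencies, `ClientSocket` is a minimal interface that a WebSocket connection would satisfy, and `FanOutHub` and the channel name are hypothetical.

```typescript
type Listener = (message: string) => void;

// In-memory stand-in for Redis pub/sub (assumption: a real build would use a Redis client).
class InMemoryBroker {
  private channels = new Map<string, Set<Listener>>();

  // Registers a listener and returns a function that removes it again.
  subscribe(channel: string, listener: Listener): () => void {
    const listeners = this.channels.get(channel) ?? new Set<Listener>();
    listeners.add(listener);
    this.channels.set(channel, listeners);
    return () => {
      listeners.delete(listener);
    };
  }

  // Delivers a message and returns how many broker-level listeners received it.
  publish(channel: string, message: string): number {
    const listeners = this.channels.get(channel);
    if (!listeners) return 0;
    listeners.forEach((listener) => listener(message));
    return listeners.size;
  }
}

// Minimal socket-like interface; a WebSocket connection would satisfy it.
interface ClientSocket {
  send(data: string): void;
}

// Holds one broker subscription per channel and fans messages out to local sockets.
class FanOutHub {
  private broker: InMemoryBroker;
  private rooms = new Map<string, Set<ClientSocket>>();
  private upstream = new Map<string, () => void>();

  constructor(broker: InMemoryBroker) {
    this.broker = broker;
  }

  join(channel: string, socket: ClientSocket): void {
    let room = this.rooms.get(channel);
    if (!room) {
      room = new Set<ClientSocket>();
      this.rooms.set(channel, room);
      // First local client: open a single broker subscription for this channel.
      const members = room;
      const unsubscribe = this.broker.subscribe(channel, (message) => {
        members.forEach((member) => member.send(message));
      });
      this.upstream.set(channel, unsubscribe);
    }
    room.add(socket);
  }

  leave(channel: string, socket: ClientSocket): void {
    const room = this.rooms.get(channel);
    if (!room) return;
    room.delete(socket);
    if (room.size === 0) {
      // Last local client gone: drop the broker subscription.
      this.upstream.get(channel)?.();
      this.upstream.delete(channel);
      this.rooms.delete(channel);
    }
  }
}
```

The design choice illustrated here is that broker subscriptions scale with the number of active channels per server process rather than with the number of connected dashboards, which is one way to keep broker load flat when many responders open the same incident at once.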

What I did

Defined the event and GraphQL schema so UIs and integrations stayed consistent.
Worked with SREs on alert routing, deduplication, and noise reduction.
Shipped real-time updates (WebSockets + Redis) so dashboards stayed in sync under load.
Balanced build vs buy for integrations with existing ticketing and paging tools.
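
As an illustration of the deduplication item above, one common approach is to fingerprint each alert and suppress repeats inside a time window. The sketch below assumes a fingerprint of service plus check name and a sliding window; the `Alert` shape, the `AlertDeduplicator` class, and the window length are hypothetical and not taken from the actual routing rules.

```typescript
interface Alert {
  service: string;
  check: string;
  receivedAt: number; // epoch milliseconds
}

class AlertDeduplicator {
  private windowMs: number;
  private lastSeen = new Map<string, number>();

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Returns true when the alert should be forwarded, false when it is a repeat
  // of the same fingerprint seen less than windowMs ago.
  shouldForward(alert: Alert): boolean {
    const fingerprint = `${alert.service}|${alert.check}`;
    const previous = this.lastSeen.get(fingerprint);
    // Every occurrence refreshes the window, so a flapping alert stays
    // suppressed until it has been quiet for windowMs.
    this.lastSeen.set(fingerprint, alert.receivedAt);
    return previous === undefined || alert.receivedAt - previous >= this.windowMs;
  }
}
```

A sliding window trades some freshness for less noise: a continuously firing alert pages once rather than every minute. A fixed window (refreshing only on forwarded alerts) would re-notify periodically instead, which may suit long-running incidents better.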

Outcomes

Faster alignment during incidents: one place for status, owners, and recent changes.
Automation hooks reduced repetitive manual steps for common failure classes.
Easier onboarding for new responders thanks to consistent navigation and terminology.

Incident response workflows became measurably faster for participating teams, and the qualitative wins showed up in postmortems and retros.
