A few weeks ago I watched an L2 engineer at a European utility’s network operations center (NOC) triage a critical WAN alarm. The choreography was the same every time:

  • Open the network monitoring system, find the device, copy the alarm details.
  • Switch to the CMDB, look up the CI, the support group, the location.
  • Open ServiceNow, search the device hostname, scan a dozen past incidents to find anything similar.
  • Open the change calendar, check if a planned change might explain this.
  • Open the topology tool, verify the impact on the rest of the network.
  • Type a ticket: short description, probable cause, recommended actions, links to the past incidents you just read.
  • Send.

Eleven minutes, give or take, for an alarm that triggers a few times per night. Multiply by 500 alarms a day across a 24×7 NOC (roughly 90 operator-hours a day if every alarm got the full treatment) and you start to understand why operators report fatigue and why the same diagnostics get re-derived from scratch every shift.

This post is the story of how we replaced this choreography with a Claude-powered agent that does the same investigation in under two minutes, grounds every recommendation in past resolved incidents (with verbatim quotes from the close notes), and gets smarter every time it triages an alarm.

Architecture in one diagram

┌──────────────────────────────────────────────────────────────────────┐
│ Trigger: alarm → API Gateway PRIVATE → Lambda → SQS FIFO              │
└─────────────────────────────┬────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│  Bedrock AgentCore Runtime  ────  Claude Opus 4.8 (EU inference)     │
│  Strands SDK agent loop                                              │
│  STM during the run · LTM across runs (AgentCore Memory)             │
└─┬──────────────────────┬──────────────────────────────┬──────────────┘
  │ MCP Gateway          │                              │
  ▼                      ▼                              ▼
Spectrum MCP        ServiceNow Redshift MCP        Sendmail MCP
(live alarms,       (2 years of incident         (renders the ticket
 topology)           history with close_notes)    as HTML email)
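
For the trigger path in the top box, the Lambda side might build its queue message roughly as sketched below. This is an illustration under assumptions, not our handler: the alarm field names (`alarm_id`, `device`, `title`, `occurred_at`) and the choice to group by device are hypothetical. `MessageBody`, `MessageGroupId`, and `MessageDeduplicationId` are the standard SQS `send_message` parameters for FIFO queues. The actual send is left as a comment so the sketch runs without AWS credentials.

```python
import hashlib
import json


def build_fifo_message(alarm: dict) -> dict:
    """Build keyword arguments for an SQS FIFO send_message call.

    Assumed (hypothetical) alarm fields: alarm_id, device, title, occurred_at.
    """
    body = json.dumps(alarm, sort_keys=True)
    # Assumption: group by device, so alarms for one device stay ordered
    # while different devices can be triaged in parallel.
    group_id = alarm["device"]
    # Deterministic deduplication ID: a retried delivery of the same alarm
    # produces the same hash, so SQS can drop the duplicate.
    dedup_id = hashlib.sha256(
        f"{alarm['alarm_id']}|{alarm['occurred_at']}".encode("utf-8")
    ).hexdigest()
    return {
        "MessageBody": body,
        "MessageGroupId": group_id,
        "MessageDeduplicationId": dedup_id,
    }


example_alarm = {
    "alarm_id": "A-0001",  # made-up identifier
    "device": "edge-rtr-12345",
    "title": "BAD LINK DETECTED",
    "occurred_at": "2025-01-01T00:00:00Z",  # made-up timestamp
}
message = build_fifo_message(example_alarm)
# In the Lambda handler this would be passed on with boto3, for example:
#   sqs.send_message(QueueUrl=QUEUE_URL, **message)
print(message["MessageGroupId"])
```

The deterministic deduplication ID is the part worth keeping even if everything else changes: monitoring systems tend to re-send, and a triage run per duplicate is wasted tokens.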

Three things matter, and most NOC agent projects miss at least one:

  1. The agent does not just answer prompts. It executes a reproducible, mandatory workflow. Gather live facts, ground on historical incidents, evaluate topology impact, synthesize, write, deliver. We bake that loop into the system prompt as non-negotiable steps.

  2. The action plan is grounded on real past resolutions. When the agent recommends “try shut/no shut on the interface,” it cites the number of the past incident where this worked and quotes the relevant sentence of that incident’s close notes verbatim. No hallucination, no generic advice.

  3. The agent remembers. Every triage commits a structured event to AgentCore Memory. Every future triage queries that memory. The agent builds an internal corpus of “what we, the agents, have seen and decided” that complements the human-authored ServiceNow corpus.
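
A system prompt can demand the workflow in point 1, but it helps to verify after each run that the agent actually followed it. The sketch below checks a tool-call trace against the tool-backed phases (synthesize and write happen inside the model, so they leave no tool call to check). The tool names and the tool-to-phase mapping are hypothetical placeholders, not the real MCP tool names.

```python
# Hypothetical tool names mapped to the workflow phase they belong to.
TOOL_PHASE = {
    "spectrum_get_alarm": "gather",
    "redshift_query_incidents": "ground",
    "spectrum_get_topology": "topology",
    "sendmail_send_report": "deliver",
}
REQUIRED_ORDER = ["gather", "ground", "topology", "deliver"]


def workflow_problems(tool_calls: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the trace is compliant."""
    first_seen: dict[str, int] = {}
    for index, name in enumerate(tool_calls):
        phase = TOOL_PHASE.get(name)
        if phase is not None and phase not in first_seen:
            first_seen[phase] = index

    # Every mandatory phase must appear at least once.
    problems = [f"missing phase: {p}" for p in REQUIRED_ORDER if p not in first_seen]

    # Phases that did appear must have started in the mandated order.
    present = [p for p in REQUIRED_ORDER if p in first_seen]
    for earlier, later in zip(present, present[1:]):
        if first_seen[earlier] > first_seen[later]:
            problems.append(f"{later} started before {earlier}")
    return problems


trace = [
    "spectrum_get_alarm",
    "redshift_query_incidents",
    "redshift_query_incidents",
    "spectrum_get_topology",
    "sendmail_send_report",
]
print(workflow_problems(trace))
```

Only the first occurrence of each phase is compared, so the agent stays free to loop back (say, a second Redshift query after looking at topology) without tripping the check.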

What the operator actually sees

We replaced the JSON ticket dumped to the terminal (operator feedback: “too verbose, too engineering”) with a tight Markdown summary, ordered by what matters first:

### 🟥 CRITICAL — BAD LINK DETECTED on edge-rtr-12345

`edge-rtr-12345` · DC-Backbone-East · alarm open 1158 min
Confidence: **82 %** · ETA: **60 min**

#### Probable cause
Interface TenGigE0/0/0/26 went down coinciding with change CHG1234567
(Internet Access Carve-out, Step 3). Validate in <5 min: SSH and run
`show interface TenGigE0/0/0/26`, then check CHG1234567 status in the
service management portal.

#### Action plan
1. Consult CHG1234567 in the change portal NOW (grounded on `INC4561323`)
2. SSH to the device and run `show interface TenGigE0/0/0/26`
3. Try `shut/no shut` on the interface (grounded on `INC4561323`)
4. If interface refuses to come up: open Spectrum model correction
   ticket (precedent in `INC4560130`)
5. Escalate to vendor TAC under 24×7 GOLD support if hardware fault

#### What worked on similar past incidents
- `INC4561323` — Bouncing the interface (shut/no shut) cleared the
  BAD LINK in under 2h
- `INC4560130` — Spectrum model misidentification, fixed by Spectrum
  correction request
- `INC4339193` — Cable test on the port found a fault; reseating the
  cable restored the link

---
**Sources checked:** Redshift 6 INC on this device (5 read in detail) ·
8 BAD LINK INC corpus-wide (5 read) · CHG1234567 stamped in the
Spectrum alarm · Site: 79 devices, 0 alarmed elsewhere · Topology: 0
peer impact detected
**Email sent to:** on-call.noc@example.com

The same content goes out as a styled HTML email to the on-call shift. Two minutes after the alarm fires.
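
The citations in the summary above are only worth something if the quoted text really appears in the cited incident. A check along the following lines can run before delivery. It is a minimal sketch: the citation shape (`incident`, `quote`) is hypothetical, the whitespace normalization is one possible choice, and the incident number and close notes in the example are invented for illustration.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def ungrounded_citations(citations: list[dict], close_notes: dict[str, str]) -> list[str]:
    """Return incident numbers whose quote is not found verbatim.

    citations:   [{"incident": "INC...", "quote": "..."}]  (hypothetical shape)
    close_notes: incident number -> close_notes text read from the history
    """
    failures = []
    for citation in citations:
        notes = close_notes.get(citation["incident"])
        quote = _normalize(citation["quote"])
        # Fail on an unknown incident, an empty quote, or a quote that is
        # not a substring of the stored close notes.
        if notes is None or not quote or quote not in _normalize(notes):
            failures.append(citation["incident"])
    return failures


# Invented example data, for illustration only.
notes = {"INC0000001": "Bounced the interface\n(shut/no shut). Link restored."}
draft = [
    {"incident": "INC0000001", "quote": "Bounced the interface (shut/no shut)."},
    {"incident": "INC0000001", "quote": "Replaced the optic."},
]
print(ungrounded_citations(draft, notes))
```

A non-empty result is a reason to hold the ticket back or strip the offending step rather than send it with an unverifiable quote.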

The two memories — and why we use them differently

AgentCore Memory has two layers:

  • Short-term memory (STM) is per-session conversational state. The runtime uses it transparently during the agent loop: each Tool #N result, each thought, each downstream call stays in the working context. Strands SDK manages this for us.

  • Long-term memory (LTM) is durable across sessions, indexed by strategy. We use the SEMANTIC strategy: each triage’s final output is committed as an event, and a Bedrock-managed embedding pipeline extracts structured metadata (device, site, alarm signature, probable cause, recommended actions, agent confidence) into queryable memory records.
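
To make the LTM event concrete, here is one possible shape for the payload committed at the end of a triage, using the fields listed above. The class and field names are hypothetical, and the AgentCore Memory call itself is deliberately left out rather than guessed at; the sketch covers only what goes into the event.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TriageEvent:
    """Hypothetical record of one completed triage."""
    device: str
    site: str
    alarm_signature: str
    probable_cause: str
    recommended_actions: list[str] = field(default_factory=list)
    confidence: float = 0.0  # expressed as a fraction between 0.0 and 1.0

    def to_payload(self) -> str:
        """Serialize to JSON text that can be committed as one memory event."""
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        return json.dumps(asdict(self), sort_keys=True)


event = TriageEvent(
    device="edge-rtr-12345",
    site="DC-Backbone-East",
    alarm_signature="bad-link-detected",
    probable_cause="Interface down coinciding with CHG1234567",
    recommended_actions=["Consult CHG1234567", "Run show interface"],
    confidence=0.82,
)
print(event.to_payload())
```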

The non-obvious design choice is the namespace key. Conventionally you’d partition by device (the device whose alarm you’re triaging is the natural primary key). We chose to partition by alarm signature (the combination of alarm title and cause) instead.

Why: the relevant pattern is “what works for this kind of failure mode,” not “what has happened on this exact box.” A BAD LINK DETECTED is treated similarly across all 32 000 devices in the fleet. By indexing on signature, the agent retrieves prior triages from any device with the same failure shape — which is exactly what an experienced human L2 would do mentally. Device + severity still go into the search query, so device-specific cases score higher when they exist.
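
In code, partitioning by signature comes down to how the namespace string and the search query are built. The sketch below assumes a `/triage/signature/...` namespace layout and a cause value of "Link Down"; both are hypothetical, as are the function names.

```python
import re


def signature_slug(title: str, cause: str) -> str:
    """Normalize alarm title + cause into a stable, namespace-safe key."""
    raw = f"{title} {cause}".lower()
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


def memory_namespace(title: str, cause: str) -> str:
    # The "/triage/signature/" prefix is a hypothetical layout.
    return f"/triage/signature/{signature_slug(title, cause)}"


def search_query(device: str, severity: str, title: str) -> str:
    """Device and severity go into the query text, not the namespace, so
    same-device records can rank higher without hiding the others."""
    return f"{title} on {device} severity {severity}"


# Two different devices with the same failure shape share one namespace.
print(memory_namespace("BAD LINK DETECTED", "Link Down"))
print(search_query("edge-rtr-12345", "CRITICAL", "BAD LINK DETECTED"))
```

Normalizing the key matters more than it looks: if the title arrives with different casing or spacing from one alarm to the next, an unnormalized key would silently split one failure mode across several namespaces.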

The ServiceNow Redshift history (curated by humans for two years) remains the primary source of truth on past resolutions. LTM is the secondary source: the agent’s own corpus, growing one triage at a time. If the two disagree, we tell the agent in the system prompt to trust ServiceNow and flag the divergence — never the other way around.
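
We enforce this precedence through the system prompt, but the same rule is easy to state as code, which is handy for checking agent output in tests. The following is a sketch with hypothetical names, not something that runs inside the agent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    action: str
    source: str                       # "servicenow" or "ltm"
    divergence: Optional[str] = None  # set when LTM disagreed with ServiceNow


def reconcile(servicenow_action: Optional[str],
              ltm_action: Optional[str]) -> Optional[Recommendation]:
    """ServiceNow wins whenever it has an answer; LTM only fills gaps."""
    if servicenow_action is not None:
        note = None
        if ltm_action is not None and ltm_action != servicenow_action:
            # Keep the ServiceNow action, but flag the disagreement.
            note = f"LTM suggested: {ltm_action}"
        return Recommendation(servicenow_action, "servicenow", note)
    if ltm_action is not None:
        return Recommendation(ltm_action, "ltm")
    return None


print(reconcile("shut/no shut on the interface", "reseat the cable"))
```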

What the numbers look like

After three weeks of running in preprod against real Spectrum alarms:

Metric                                          Manual baseline         Agent                    Ratio
Average triage time                             ~11 min                 ~2 min                   ×5.5
Number of past incidents read per triage        1.5                     4.7                      ×3
Action plan grounded on close-notes citations   12% of tickets          100%
Cost per triage                                 ~€3.5 (operator time)   ~€0.05 (Claude tokens)   ÷70

The cost-per-triage number is the eye-catcher in slide decks. The “past incidents read per triage” number is the eye-catcher in the NOC: it’s the difference between a one-shot recommendation and a recommendation that cites the exact past evidence that supports it.
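
For anyone who wants to check the ratios, the arithmetic behind them is short. The inputs are the figures quoted in this post; the daily total assumes, as the introduction does, that every one of the 500 alarms would get the full manual treatment.

```python
ALARMS_PER_DAY = 500

manual_minutes, agent_minutes = 11, 2
manual_reads, agent_reads = 1.5, 4.7
manual_cost_eur, agent_cost_eur = 3.5, 0.05

time_ratio = manual_minutes / agent_minutes      # triage time speed-up
reads_ratio = agent_reads / manual_reads         # more past incidents read
cost_ratio = manual_cost_eur / agent_cost_eur    # cost reduction factor
manual_hours_per_day = ALARMS_PER_DAY * manual_minutes / 60

print(f"time: x{time_ratio:.1f}")
print(f"incidents read: x{reads_ratio:.1f}")
print(f"cost: /{cost_ratio:.0f}")
print(f"manual effort: {manual_hours_per_day:.1f} operator-hours per day")
```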

Demo flow — from terminal to inbox

The whole experience from the operator’s seat is:

  1. They type /wan-triage in their terminal.
  2. The skill invokes the deployed AgentCore runtime by its ARN, using a SigV4-signed request.
  3. Strands runs the agent loop. Around 15 tool calls fire across Spectrum, ServiceNow Redshift, topology, and email.
  4. ~100 seconds later, the Markdown summary appears in the terminal.
  5. In parallel, the styled HTML report lands in the on-call inbox.

No portal to open, no JSON to copy-paste, no SSH to a jump host. Just a slash command.

What I’d do differently if I started over

  • Start with the eval pipeline. We retrofitted golden-set evaluations after the agent worked. Building the eval set first, in parallel with the agent prompt, would have caught two regressions sooner.

  • Stay with a single agent until you actually need orchestration. The temptation to split into “fact gatherer” + “diagnostic” + “writer” agents was real. We tested both, and for a use case where the workflow is linear, the single-agent loop is cheaper, faster, and easier to debug.

  • Wire the LTM commit on day one, even if you don’t use it. Memory accumulates over the lifetime of the deployment. Starting LTM commits a week after launch means losing a week of corpus you’ll never get back.
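
On the first lesson: a golden-set evaluation does not need to be elaborate to be useful. The sketch below scores one agent output against one hand-labeled case. The scoring rules (keyword match on the probable cause, recall on cited incidents, a 0.5 recall threshold) are illustrative assumptions rather than our actual eval criteria, and the incident numbers in the example are invented.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    """One hand-labeled alarm with the answer a good triage should reach."""
    alarm_id: str
    expected_cause_keywords: list[str]
    expected_incidents: set[str]


def score_case(case: GoldenCase, probable_cause: str, cited_incidents: set[str]) -> dict:
    """Score one agent output against one golden case."""
    cause = probable_cause.lower()
    cause_hit = all(k.lower() in cause for k in case.expected_cause_keywords)
    if case.expected_incidents:
        found = case.expected_incidents & cited_incidents
        recall = len(found) / len(case.expected_incidents)
    else:
        recall = 1.0  # nothing to cite, so nothing was missed
    return {"alarm_id": case.alarm_id, "cause_hit": cause_hit, "citation_recall": recall}


def pass_rate(results: list[dict], min_recall: float = 0.5) -> float:
    """Fraction of cases with the right cause and enough expected citations."""
    passed = [r for r in results if r["cause_hit"] and r["citation_recall"] >= min_recall]
    return len(passed) / len(results) if results else 0.0


case = GoldenCase("A-0001", ["interface", "CHG1234567"], {"INC0000001", "INC0000002"})
result = score_case(
    case,
    "Interface went down coinciding with change CHG1234567",
    {"INC0000001", "INC0000009"},
)
print(result, pass_rate([result]))
```

Running this on every prompt change is what turns "the agent seems fine" into a number that can regress visibly.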

What’s next

The next milestone is the feedback loop: when an operator accepts, modifies, or rejects an action plan, we want that signal to flow back into LTM as a labeled event. Over time, the agent learns not just “what was done historically” but “what the current NOC team actually accepts as a useful recommendation today.” That’s where this becomes a continuously improving system rather than a smart one-shot tool.
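
None of this exists yet, so the following is purely a sketch of what a labeled feedback event could look like. Every name in it is hypothetical, including the rule that a modified plan must record what the operator changed.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Verdict(str, Enum):
    ACCEPTED = "accepted"
    MODIFIED = "modified"
    REJECTED = "rejected"


@dataclass
class FeedbackEvent:
    """Hypothetical operator verdict on one action plan."""
    triage_id: str
    alarm_signature: str
    verdict: Verdict
    operator_edit: str = ""  # what the operator changed, if anything

    def to_payload(self) -> str:
        """Serialize to JSON text for committing as a labeled memory event."""
        if self.verdict is Verdict.MODIFIED and not self.operator_edit:
            raise ValueError("a modified plan should record what changed")
        return json.dumps(asdict(self), sort_keys=True)


feedback = FeedbackEvent(
    triage_id="T-0001",  # made-up identifier
    alarm_signature="bad-link-detected",
    verdict=Verdict.MODIFIED,
    operator_edit="Skipped shut/no shut, went straight to cable test",
)
print(feedback.to_payload())
```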

If you’re building a similar agent for any kind of L2 ops role — network, security incident response, application alerts — happy to compare notes.