Octostar

Review & QA in the Age
of AI-Assisted Development

How we gave our automated reviewer a memory
and turned PR comments into permanent infrastructure

Engineering Team | April 2026

1 / 15

Three Tiers of Review Today

🤖 The Generic Bot
Runs on every PR, fully automatic
Generic advice: “add error handling”
Starts from zero every run
Zero memory of what went wrong
TIME · MEMORY
+
🧑‍💻 The New Manual
Asks the LLM the right questions
Remembers the war stories
Still manual — every PR
Has no time
TIME · MEMORY
🏆 The Bot With Memory
Automatic + remembers your war stories + scoped to each module
TIME · MEMORY · CONTEXT
We can't create time for people.
We can create memory for the bot.
2 / 15

Three Memory Layers

🤖
Layer 1 — Bot memory

.cursor/BUGBOT.md

Read at review time by Bugbot
⚠️ Rule enforced on every diff:
"Never use Promise.all on unbounded arrays. Use a concurrency limiter (max 3)."
Hierarchical — bot traverses upward, collecting rules at each directory level
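The rule above points at a concurrency limiter; a minimal sketch of what `asyncPool(3, files, upload)` could look like (an illustrative implementation, not the one in our codebase):

```javascript
// Illustrative concurrency limiter; our actual asyncPool may differ.
async function asyncPool(limit, items, worker) {
  const results = [];
  let next = 0; // shared cursor into items
  async function run() {
    while (next < items.length) {
      const i = next++; // claim an index synchronously, then await
      results[i] = await worker(items[i]);
    }
  }
  // Launch at most `limit` drainers over the shared queue.
  const drainers = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(drainers);
  return results;
}
```

Unlike `Promise.all(files.map(upload))`, at most `limit` uploads are in flight at once, so 200 files never means 200 concurrent buffers.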
💻
Layer 2 — Developer memory

LESSONS.md

Read at write time by IDE assistant
📚 Universal lesson:
"In-memory state is a cache, not a source of truth. Always query the DB for conflict detection."
The "why" behind patterns — things the bot can't enforce but devs should know
📄
Layer 3 — File memory

Inline Comments

Visible in diff when file is changed
📌 Pinned to one file:
// WARNING: Runtime-evaluated JS.
// Cannot use imports. Do not DRY.
// (See PR #760)
Most targeted — zero token waste, bot sees it only when that file is in the PR
3 / 15

Hierarchical Scoping

Deeper modules get more context. Simple changes get only relevant rules.

.cursor/BUGBOT.md Every PR
apps/octostar/.cursor/BUGBOT.md Octostar PRs
apps/octostar/.../linkchart-nt/.cursor/BUGBOT.md LinkChart PRs
apps/search-app/.cursor/BUGBOT.md Search PRs
packages/.cursor/BUGBOT.md Package PRs

Token Budget per Scenario

LinkChart change: ~1,684 tokens
Octostar change: ~1,250 tokens
Package change: ~859 tokens
Search App change: ~728 tokens

Target: <400 words per file, <2,000 tokens worst case. Fewer rules = more attention per rule.

4 / 15

The Automated Feedback Loop

1. PR Merged: developer's code lands on main
2. Action Fires: harvest-lessons.yml extracts human review comments
3. Summary Posted: structured harvest proposal on the merged PR
4. Classify & Commit: developer runs /pr-war-stories harvest
5a. Reviewer Learns: new BUGBOT.md rules apply on the next review
5b. Coder Learns: LESSONS.md is read before the next code is written

✓ Next PR catches more — loop continues
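The Action in step 2 can be sketched as a minimal workflow (a hedged sketch: the trigger and merged-check are standard GitHub Actions patterns, but the step names and script path are assumptions, not the actual file):

```yaml
# Sketch of harvest-lessons.yml; step names and the script path
# are illustrative assumptions.
name: Harvest lessons
on:
  pull_request:
    types: [closed]
jobs:
  harvest:
    # Only merged PRs carry review comments worth harvesting
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write   # needed to post the harvest summary
    steps:
      - uses: actions/checkout@v4
      - name: Extract review comments and post harvest summary
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: ./scripts/harvest-lessons.sh "${{ github.event.pull_request.number }}"
```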
5 / 15

Rule Classification System

Reviewable: the bot can check it → BUGBOT.md
"Don't use Promise.all on unbounded arrays"

Educational: informs developers → LESSONS.md
"Build order requires packages before apps"

Single-File: applies to one function → inline comment
"This adapter uses reference equality intentionally"

Overlapping: duplicates an existing rule → merge
Three stale-state rules merged into one

Stale: pattern has been fixed → remove
API migrated, old workaround no longer needed

⚠️ The #1 failure mode: dumping everything into BUGBOT.md. This classification prevents context bloat and ensures each rule is placed where it's most effective.
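The routing above can be sketched as a single dispatch function (category names mirror the slide; the shape of `lesson` and the return values are assumptions for illustration):

```javascript
// Route a harvested lesson to its memory layer.
function routeLesson(lesson) {
  switch (lesson.category) {
    case "reviewable":  return { action: "append", target: "BUGBOT.md" };
    case "educational": return { action: "append", target: "LESSONS.md" };
    case "single-file": return { action: "comment", target: lesson.file };
    case "overlapping": return { action: "merge", target: "BUGBOT.md" };
    case "stale":       return { action: "remove", target: "BUGBOT.md" };
    default:
      throw new Error(`Unknown category: ${lesson.category}`);
  }
}
```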

6 / 15
Octostar

Real War Stories from Our PRs

💥
PR #781
Promise.all on 200 files = OOM
BUGBOT.md
- await Promise.all(files.map(upload))
+ await asyncPool(3, files, upload)
🎨
PR #775
LLM CSS change broke schema editor
BUGBOT.md
- .custom-antlayout { height: 100% }
+ .schema-editor-layout { height: 100% }
PR #748
=== is correct. Don't "fix" it.
Inline
- if (deepEqual(prev, next))
+ if (prev === next) // intentional ref check
PR #741
useState is 1 frame late
LESSONS.md
- const [val, setVal] = useState(x)
+ const valRef = useRef(x) // sync capture
7 / 15

What We Shipped

22 war stories extracted from 50 merged PRs
5 BUGBOT.md files across the monorepo
6 inline comments placed in source files
1 GitHub Action for automated harvesting
🎯 Already paying for itself

On its first review, Bugbot caught a real bug in the harvest workflow itself — the scope detection used else if instead of if, causing linkchart files to miss parent scope rules. The system's own rules made the system better.
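The fix Bugbot suggested can be sketched as independent `if` checks, so one file matches every nested scope it belongs to (scope names and path prefixes are illustrative):

```javascript
// Independent `if`s, not `else if`: a linkchart file must also
// pick up its parent octostar and root scopes.
function detectScopes(filePath) {
  const scopes = ["root"]; // root rules apply to every PR
  if (filePath.startsWith("apps/octostar/")) scopes.push("octostar");
  if (filePath.includes("linkchart-nt/")) scopes.push("linkchart");
  if (filePath.startsWith("apps/search-app/")) scopes.push("search-app");
  if (filePath.startsWith("packages/")) scopes.push("packages");
  return scopes;
}
```

With `else if`, the first match would have short-circuited and linkchart files would silently lose their parent scope rules.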

/pr-war-stories setup  —  available as a Claude Code skill for any repository
8 / 15

Two Kinds of Knowledge

📐 Architectural Knowledge

What was decided and why
Written at decision time
Forward-looking
Predictable consequences
Captured in ADRs, design docs
Audience: future architects
Example ADR
"We chose an epoch-based concurrency guard in expandQueue to prevent race conditions during graph mutations."
vs
👻 Shadow Knowledge

What went wrong and what we learned
Emerges after the fact
Backward-looking
Unknowable at decision time
Lives in PR comments, people's heads
Audience: future developers
Example war story
"If you bypass the epoch check or call markDone() directly, pending counts go negative and the UI shows stale loading forever."
ADRs document the skeleton. Shadow knowledge is the muscle memory.
It only exists in the heads of people who've been burned. And it walks out the door when they leave.
9 / 15

The Dual Flywheel

💻 Coding gets smarter

IDE assistants read LESSONS.md before writing code. They stop suggesting patterns your team already learned are dangerous.

+
🤖 Reviewing gets smarter

Bugbot reads BUGBOT.md rules scoped to each directory. It catches what no human has time to check on every PR.

=
🚀 Compounding intelligence

Every mistake becomes a permanent rule. Every review comment becomes organizational memory. Knowledge survives turnover.

The crisis we're solving
AI tools have 10x'd code production but review throughput hasn't scaled.
Senior engineers are the bottleneck — drowning in PRs, rubber-stamping what they should scrutinize.
We're producing code faster than we can safely review it.
What must change
1. Write comments for the bot, not a colleague. Comments become rules.
2. A detailed rejection beats a quick approval.
3. Junior mistakes are raw material for new rules, not failures.
10 / 15
OPEN SOURCE

We open-sourced it.

Everything we built is now a reusable Claude Code skill
that works on any repository with PR history.

sscarduzio/pr-war-stories
github.com/sscarduzio/pr-war-stories
Install: claude install-skill sscarduzio/pr-war-stories
Bootstrap: /pr-war-stories setup
Harvest: /pr-war-stories harvest
Audit: /pr-war-stories audit
Mines 50+ merged PRs for war stories
Creates hierarchical BUGBOT.md rules
Installs automated harvest GitHub Action
Token-budgeted, classification-driven

Works with Cursor Bugbot, any GitHub repo, any language.

11 / 15

The Skill Commands

Four commands and one Action, each for a different moment in the lifecycle

/pr-war-stories setup (once per repo)
Mines 50+ merged PRs, creates hierarchical BUGBOT.md files, LESSONS.md, installs the harvest GitHub Action, wires CLAUDE.md. Full bootstrap.
harvest-lessons.yml (automatic, on every PR merge)
GitHub Action fires automatically. Extracts substantive human review comments, posts a structured harvest summary on the merged PR. No human trigger needed.
/pr-war-stories harvest (when harvest comments appear)
Reads harvest summaries, classifies each lesson (reviewable / educational / single-file), places it in the right layer. The human-in-the-loop step.
/pr-war-stories recheck (after big refactors)
Greps every rule for path and function references, verifies they still exist in the codebase. Flags stale scopeRules prefixes. Reports but does not auto-fix.
/pr-war-stories audit (quarterly)
Checks Bugbot hit rate on last 20 PRs. Removes rules that never triggered. Merges duplicates. Graduates rules that can now be linted. Prevents rot.
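What recheck does can be sketched as a path-reference scan (the rule format, regex, and injected `exists` callback are illustrative, not the skill's actual implementation):

```javascript
// Scan a rule file's text for repo-relative paths and report the
// ones that no longer exist in the codebase.
function findStaleReferences(ruleText, repoRoot, exists) {
  // Match repo-relative paths such as apps/octostar/src/chart.ts
  const pathPattern = /\b(?:apps|packages)\/[\w./-]+/g;
  const stale = [];
  for (const ref of ruleText.match(pathPattern) || []) {
    if (!exists(`${repoRoot}/${ref}`)) stale.push(ref);
  }
  return stale; // recheck reports; it never auto-fixes
}
```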
12 / 15

Questions you're probably asking

Questions we anticipate from senior engineers, architects, and tech leads

Won't it rot?
Quarterly audit removes stale rules. Harvest Action keeps fresh ones coming in.
Why not just lint it?
If it can be linted, it should — then remove it from BUGBOT.md. BUGBOT is for contextual knowledge only.
Works with CodeRabbit / Copilot?
Cursor Bugbot reads .cursor/BUGBOT.md natively. Others read LESSONS.md and inline comments.
What about big refactors?
Run /pr-war-stories recheck — flags stale paths, does not auto-fix.
13 / 15
Post Scriptum — Future Research

LLM Vision-Based Automated Testing

Describe tests in plain English. An AI model looks at the screen and executes them.

Automation SDK

Midscene.js

Plugs into Playwright. No selectors, no XPath, no data-testid.

Vision Model

UI-TARS

Open-source model that sees screenshots and executes UI workflows.

Both run on our hardware. No cloud. No licensing. No data leaves the network.
14 / 15
Post Scriptum — Future Research

Promising, But Needs Validation

Demo-grade works. Production-grade is the question.

What looks good

  • Runs on our RTX 6000 PRO — no new infra
  • Handles drag & drop and complex UIs
  • Tests survive UI refactors — no selectors to break

What we need to test

  • Speed — full regression in CI time?
  • Determinism — same result every run?
  • Our UIs — link charts, record viewers
15 / 15