AI Product Development Lifecycle
Took a forty-year Clarion codebase serving 1,500 hotels worldwide and rebuilt it as a cloud-native PMS using a 55-agent AI Product Development Lifecycle. Seven major launches in six months at roughly ten times conventional velocity.
- 7+ launches in 6 months
- ~10× velocity vs conventional
- 55 agents in pipeline
- 4-person core team
Problem
roomMaster is a 40-year-old Property Management System running on a Clarion codebase, deployed across more than 1,500 hotels worldwide. The schema had been denormalized and patched for two decades. There were no architecture documents. The discovery-to-ship loop for a single feature was 12 to 18 months.
The strategic ask wasn’t a refactor. It was a cloud-native successor — roomMaster Nova — that customers could migrate to without losing their workflows, ahead of competitors who were already shipping AI-native PMS products. Conventional team scaling wouldn’t get there: hiring fifty engineers takes nine months and won’t fix the bottleneck, which is the cycle time of every decision in the pipeline.
Approach
I designed and shipped what we now call the AI Product Development Lifecycle (AI PDLC): an orchestrated multi-agent pipeline that takes a market signal and walks it through discovery, PRD generation, code synthesis, QA, release, and post-launch analysis with typed handovers and quality gates between each stage.
The AI PDLC runs two tracks on the same agentic foundation: a greenfield track that builds roomMaster Nova from zero (the launches below), and a brownfield track — the Parity Pipeline — that web-izes the 25-year-old rw5main product without losing parity (further down). Same discipline (typed handovers, hard quality gates, hook-enforced correctness), different job.
- Five orchestrators.
Divide the lifecycle into discovery, design, build, ship, and learn. Each owns its outputs as typed artifacts (intent docs, PRDs, ADRs, PRs, runbooks, dashboards). Handovers are schema-validated; a failed gate kicks back to the previous orchestrator with a structured error.
- Twenty-six skills.
The reusable atomic units the orchestrators compose: competitive analysis, JTBD synthesis, schema diff, migration generator, telemetry plan, etc. Each skill is a prompt + tool contract pinned to a model and a budget.
- Fifty-five agents.
Execute the skills with role-specific system prompts, retrieval contexts, and evaluation harnesses. The agent population is small enough to hold in your head but large enough to specialize.
- A coding PM, not a spec-writer.
I ship production PRs to the AI Support Agent platform daily. The pipeline can't be designed by someone who hands work over the wall; it has to be designed by someone who feels the round-trip cost of every gate.
- Instrument everything.
Every agent run, prompt, eval score, and human-in-the-loop intervention writes to LangFuse. The pipeline isn't a black box — it's a profilable system you can A/B test like any other product.
AI PDLC is not a productivity tool story. It is a new operating architecture for how software gets built.
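The schema-validated handover between orchestrators can be illustrated with a minimal sketch. The artifact types and required fields below are hypothetical placeholders, since the actual schemas are not given here; the point is only the shape of the gate: a passing artifact moves on, a failing one returns a structured error that the previous orchestrator can act on.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical required fields per artifact type (illustrative only).
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "prd": ["title", "problem", "acceptance_criteria"],
    "adr": ["title", "decision", "consequences"],
}


@dataclass
class GateError:
    """Structured error handed back to the previous orchestrator."""
    stage: str
    artifact_type: str
    missing_fields: List[str]


def validate_handover(stage: str, artifact_type: str, payload: dict) -> Optional[GateError]:
    """Return None when the artifact clears the gate, else a GateError."""
    required = REQUIRED_FIELDS[artifact_type]
    # A field counts as missing when it is absent or empty.
    missing = [name for name in required if not payload.get(name)]
    if missing:
        return GateError(stage=stage, artifact_type=artifact_type, missing_fields=missing)
    return None


# Usage: a PRD without acceptance criteria is kicked back with the exact gap named.
error = validate_handover("design", "prd", {"title": "Group bookings", "problem": "No block holds"})
```

Returning the error as data rather than raising keeps the kick-back machine-readable, so the upstream stage can retry against the named fields instead of parsing a message.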
How the pipeline ingests a legacy system
The hardest part of replatforming wasn’t writing modern code — it was extracting 40 years of encoded business logic out of three sources no single person could hold in their head at once. The Definition Stage pulls all three into one repository for the first time, then runs three ingestion streams in parallel:
- Stream 1 — Knowledge Base extraction.
Hundreds of unstructured wiki pages synthesized into coherent domain knowledge, organized by module (reservations, housekeeping, channel management, billing, admin). The business-logic layer: what the system is supposed to do, independent of how Clarion happened to implement it.
- Stream 2 — Database reverse-engineering.
The denormalized MariaDB schema described structurally — tables, columns, implied relationships, accumulated technical debt — then used to derive a clean normalized target schema. The legacy structure was never inherited; it was reasoned about and replaced.
- Stream 3 — Source-code parsing.
The Clarion source was pulled into the shared repository and fed to agents for behavioral extraction — not line-by-line translation (which inherits every legacy bug), but intent: what does this module do, under what conditions, with what inputs and outputs.
Those streams feed a seven-stage flow that hands off to Engineering: Discovery → Definition → Design → PRD → Prompt Engineering → Requirements Extraction → Requirements Incorporation. Requirements Extraction is where precision is earned. It runs four versioned passes: v1, a rough extraction; v2, adding LangGraph and in-context refinement; v3, covering adjacent-system concerns; and v4, producing the final clean artifact set. Routing logic promotes an artifact only once it clears a quality threshold, so engineering receives requirements with explicit boundaries, contracts, edge cases, and surfaced “clean questions” rather than ambiguous tickets that become rework.
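One way to picture the versioned passes and the promotion rule is the sketch below. It is an assumption-laden illustration, not the pipeline's routing code: the threshold value, the coverage-based score, and the names `run_passes` and `coverage_score` are all hypothetical. Only the four pass labels and the four output properties come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical values: the threshold and scoring method are not specified in the text.
QUALITY_THRESHOLD = 0.9
REQUIRED_SECTIONS = {"boundaries", "contracts", "edge_cases", "clean_questions"}

Artifact = Dict[str, str]  # section name -> extracted text
Pass = Tuple[str, Callable[[Artifact], Artifact]]


@dataclass
class ExtractionResult:
    artifact: Artifact
    promoted: bool
    trail: List[Tuple[str, float]]  # (pass version, score after that pass)


def coverage_score(artifact: Artifact) -> float:
    """Toy quality score: fraction of required sections that are non-empty."""
    present = {name for name in REQUIRED_SECTIONS if artifact.get(name)}
    return len(present) / len(REQUIRED_SECTIONS)


def run_passes(artifact: Artifact, passes: List[Pass],
               threshold: float = QUALITY_THRESHOLD) -> ExtractionResult:
    """Apply each pass in order, then promote only if the final score clears the gate."""
    trail: List[Tuple[str, float]] = []
    for version, refine in passes:
        artifact = refine(artifact)
        trail.append((version, coverage_score(artifact)))
    promoted = bool(trail) and trail[-1][1] >= threshold
    return ExtractionResult(artifact, promoted, trail)
```

Keeping the per-pass score trail makes it visible which pass actually moved quality, which is useful when deciding whether a pass earns its cost.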
What shipped
The AI PDLC’s output is roomMaster Nova — the cloud-native successor, rebuilt from zero on Node.js / AWS: multi-region, API-first, browser-based, carrying none of roomMaster Cloud’s 40 years of technical debt. The denormalized schema wasn’t ported; it was replaced with a clean normalized model derived from what the data actually represented. The Clarion business logic wasn’t translated; it was re-specified from intent. Every Nova feature ships with the same instrumentation, release runbook, and SLO scaffolding — because the pipeline produces them, not separate teams writing separate conventions.
Module deep-dive: AI Support Agent
One module is worth opening up — the AI Support Agent, the platform I ship production PRs to daily and the clearest showcase of the engineering bar the pipeline enforces. A multichannel agent for the 1,500 hotel operators on roomMaster Nova who previously waited 15–60 minutes for software-question answers at $8–$15 per ticket in human-agent time. The PRD specified a two-stage architecture: Agent I answers from documentation via RAG; Agent II (Phase 2) has typed read/write tools against the production DB and is gated by an AI-Judge before any response leaves the system.
- 5-stage RAG with explicit quality controls.
Preprocessor strips noise; Qdrant cosine search across 2,503 doc chunks; Cohere Rerank v3.5 at ~200ms (vs ~2s LLM-rerank); generator forced to cite source articles by title — fabrication is structurally blocked, not warned against.
- LangGraph state machine, three nodes.
router (Haiku, temp 0, deterministic) → tool-call (search_docs/verify_customer) → generate (Sonnet for response). Every tool call, intent, and model invoked is captured in AgentState and traced into LangFuse.
- 5-layer safety pipeline, non-negotiable.
(1) Input validation + 8 prompt-injection patterns. (2) PCI input guard (Luhn over Visa/MC/Amex, redact pre-LLM). (3) PII redactor on logs/traces. (4) PCI output guard. (5) Hallucination guard scoring grounding 0–1.0 vs retrieved sources; <0.7 → disclaimer + trace flag.
- Three modalities, three personas.
Chat ≤200 words, bullets. Email: greeting + sign-off. Voice: no markdown, "step one / step two" — Sadie reads markdown literally. Persona constraints are a product safety requirement, not cosmetic.
- Operational loops wired in from day one.
A flywheel loop pushes resolved tickets back as training signal. A Status Page integration pipes current outage state into the agent — answers reflect present reality, not stale docs. Redis-backed call state survives restarts; every call ends with a Zendesk ticket (transcript + AI summary).
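The PCI input guard in layer 2 of the safety pipeline relies on the Luhn checksum to tell real card numbers from other long digit runs. The following is a minimal sketch of that idea, not the production guard: the regex, the placeholder text, and the function names are assumptions, and brand-prefix checks for Visa, Mastercard, and Amex are left out for brevity.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_cards(text: str, placeholder: str = "[REDACTED CARD]") -> str:
    """Replace Luhn-valid digit runs before the text is sent to any model."""
    def replace(match: "re.Match[str]") -> str:
        digits = re.sub(r"\D", "", match.group())
        return placeholder if luhn_valid(digits) else match.group()

    return CARD_CANDIDATE.sub(replace, text)


# Usage with the widely published Visa test number.
safe = redact_cards("My card is 4111 1111 1111 1111, thanks")
```

The checksum keeps false positives down: confirmation numbers and phone numbers of similar length usually fail Luhn and pass through untouched.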
The reranking decision — why Cohere over an LLM. The corpus is 2,503 preprocessed chunks of roomMaster Nova help docs, KB articles, and release notes — section-aware splits carrying source/section/version metadata so retrieval filters on product version and language. Qdrant returns the top-10 candidates; something has to rerank them to a top-5 the generator can trust.
The legacy dm-support-agent used LLM-based reranking — roughly 2 seconds per query, with inconsistent ordering run-to-run. I benchmarked Cohere Rerank v3.5 against it on the actual Nova corpus before committing: ~200 ms on the same candidate set — 10× faster, more deterministic relevance scoring, and an order of magnitude cheaper per query. The quality delta was within tolerance for a support-agent use case; the latency delta was the difference between a chat that feels synchronous and one that feels like an outage. Reranking is the kind of decision that’s invisible in a demo and decisive in production.
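The retrieve-then-rerank shape described above (top-10 by cosine similarity, narrowed to a top-5 by a dedicated reranker) can be sketched without any vendor SDK. This is an illustration under stated assumptions: the in-memory index stands in for Qdrant, and `scorer` is a hypothetical stand-in for whichever reranking model is used. No real Qdrant or Cohere API call is shown.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, index: List[Tuple[str, Vector]], k: int = 10) -> List[str]:
    """First stage: cheap vector search returning the k nearest chunks."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def rerank(query: str, candidates: List[str],
           scorer: Callable[[str, str], float], top_n: int = 5) -> List[str]:
    """Second stage: a more precise relevance scorer reorders the short list."""
    return sorted(candidates, key=lambda c: scorer(query, c), reverse=True)[:top_n]
```

Because the reranker sits behind a plain callable, swapping one scorer for another on the same candidate set is a one-line change, which is what makes a side-by-side latency and quality benchmark practical.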
The hallucination guard, the PCI pipeline, the source-citation requirement, the escalation detection, and the LangFuse tracing are not defensive additions. They are the product.
The same architectural disciplines — typed handovers, structural fabrication blocks, day-one observability, env-tunable thresholds — are what every other launch in the pipeline above inherits, applied to its own domain. Agent-specific outcome metrics roll up into the consolidated Outcomes block below.
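As an illustration of an env-tunable threshold, here is a minimal sketch of the hallucination guard's decision step. The 0.7 cutoff and the disclaimer-plus-flag behavior come from the safety pipeline described earlier; everything else is assumed. In particular, the token-overlap scorer is a toy stand-in for whatever grounding model is actually used, and `GROUNDING_THRESHOLD` is a hypothetical variable name.

```python
import os
import re
from typing import Dict, List

DISCLAIMER = "Note: parts of this answer could not be verified against the documentation."


def grounding_threshold() -> float:
    """Read the cutoff from the environment so it can be tuned without a deploy."""
    return float(os.environ.get("GROUNDING_THRESHOLD", "0.7"))


def grounding_score(answer: str, sources: List[str]) -> float:
    """Toy scorer: share of answer tokens that also appear in the retrieved sources."""
    tokens = re.findall(r"[a-z0-9]+", answer.lower())
    if not tokens:
        return 0.0
    source_tokens = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    return sum(token in source_tokens for token in tokens) / len(tokens)


def guard(answer: str, sources: List[str]) -> Dict[str, object]:
    """Append a disclaimer and set a trace flag when grounding falls below the cutoff."""
    score = grounding_score(answer, sources)
    flagged = score < grounding_threshold()
    text = f"{answer}\n\n{DISCLAIMER}" if flagged else answer
    return {"text": text, "grounding": score, "flagged": flagged}
```

Reading the threshold at call time rather than import time means a change to the environment takes effect on the next request.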
The brownfield track: Parity Pipeline
The AI PDLC builds the new product. But 1,500 hotels still run the old one — rw5main, 25 years of Clarion: ~1.25M lines across 1,207 .clw files, 800+ DB objects, a positional-encoded mega-config table nobody documented. Nova is the long-term answer; it can’t be the only one. So the same agentic discipline was pointed at a different problem — web-izing the legacy product without losing operator muscle memory. I designed an audit-first pipeline where the unit of work isn’t a feature; it’s an audit, and code is a transcription of the audit.
- Skills orchestrate, agents execute, rules govern.
22 skills (auto-loaded by file context), 31 single-purpose agents, 22 non-negotiable rules, and an 11-persona Expert Council that votes on architectural decisions. Rule Zero: correctness over speed — verify every Clarion column from source + DDL, never from memory; use exact label text; no placeholders, no inferred behavior.
- Audit-as-contract, deterministic parity scoring.
A feature is PARITY only at ≥ 90% on a weighted seven-component rubric (Functional Logic 30%, Data 20%, UI 15%, Role 15%, Admin 10%, Validation 5%, Dependencies 5%), with zero P0/P1 gaps and zero fabricated DB objects. A fabricated object zeroes its entire component. The audit is the source of truth — not the agent’s confidence.
- Two hooks block bad code at the keystroke.
A post-write hook fires after every edit and blocks CRITICAL violations with exit code 2; a pre-push gate runs all 38 quality-gate rules (QG-001 → QG-038) plus tsc --noEmit before anything reaches the remote. The enforcement layer catches drift before it reaches a pull request, which is the only way a four-person-equivalent agentic system survives a million-line port.
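The parity rubric above is plain arithmetic, so it can be written down directly. The sketch below uses the stated weights, the 90% cutoff, the zero-P0/P1 rule, and the rule that a fabricated DB object zeroes its component. The `Audit` structure and function names are hypothetical, and component scores are assumed to be on a 0 to 100 scale.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Weights in percent, as listed in the rubric; they sum to 100.
WEIGHTS: Dict[str, int] = {
    "functional_logic": 30,
    "data": 20,
    "ui": 15,
    "role": 15,
    "admin": 10,
    "validation": 5,
    "dependencies": 5,
}
PARITY_THRESHOLD = 90.0


@dataclass
class Audit:
    scores: Dict[str, int]                      # component -> score, 0 to 100
    open_p0_p1_gaps: int = 0
    fabricated_components: Set[str] = field(default_factory=set)


def parity_score(audit: Audit) -> float:
    """Weighted score; any component with a fabricated DB object counts as zero."""
    total = 0
    for component, weight in WEIGHTS.items():
        score = 0 if component in audit.fabricated_components else audit.scores[component]
        total += weight * score
    return total / 100


def is_parity(audit: Audit) -> bool:
    """PARITY needs the score, zero P0/P1 gaps, and zero fabricated objects."""
    return (parity_score(audit) >= PARITY_THRESHOLD
            and audit.open_p0_p1_gaps == 0
            and not audit.fabricated_components)
```

For example, scores of 100 everywhere except Data at 90 and UI at 80 give 30 + 18 + 12 + 15 + 10 + 5 + 5 = 95, which passes. The same audit with a fabricated object in Data drops to 77 and fails on two counts.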
The result: ~93–95% rw5main parity shipping against a live customer base — Night Audit, Quick Room, and Virtual Day Function at 100%, Forecast ~99%, New Reservation ~95%, 326/327 tests passing. The Express chain was formally deprecated 11/11 by the Expert Council, zero dissent. The same pipeline already proved reusable on a second product (roommaster-corp). Legacy migration at this scale turned out to be a product-design problem — what counts as parity, who decides, how it’s verified — not just an engineering one.
Outcome
PDLC · Cycle time
Weeks → days
PRD-to-merged-PR median dropped from weeks to days. Single-feature ideas move from intent doc to production behind a flag inside one calendar day.
PDLC · Headcount efficiency
4 people, ~10× velocity
Four-person core team outpacing the conventional baseline we measured against on the legacy product by roughly an order of magnitude — 7+ launches in 6 months.
PDLC · Quality bar
Zero critical incidents
No critical post-release incidents on the launches above. 5-layer safety pipeline on the AI Support Agent passes 45+ unit tests per change.
Agent · Auto-resolution
30% → 80%
AI Support Agent — Agent I (read-only RAG) up to 30% at launch, 40%+ after persona iteration; Agent II (DB access, AI-Judge gated) up to 80% on covered categories.
Agent · Latency
15–60 min → <5 s
First response on covered categories collapses from human-queue minutes to LLM seconds, 24/7 across chat / voice / email.
Agent · Unit cost
$10 → $0.10
Cost per interaction drops from $8–$15 in human-agent time to $0.05–$0.10 in LLM tokens — roughly 100× cost reduction on covered traffic.
What I’d do differently
- Invest in agent observability earlier. LangFuse went in at month three. The two months before that, we were debugging by re-reading logs. Profile from day one.
- Pin model versions per skill, not globally. A model upgrade is a regression risk. Treat models like any other dependency.
- Build the eval harness before the skill, not after. Backfilling evals into already-shipped skills is twice the work.
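Pinning a model per skill can be made concrete with a small sketch, assuming a skill is represented as a prompt plus tool contract with its own model and budget, as described earlier. The field names, the example model identifiers, and the "latest" alias check are all hypothetical; the idea is simply that each skill carries its own pin and a check can refuse floating aliases.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Skill:
    name: str
    prompt: str
    tools: Tuple[str, ...]
    model: str        # exact, dated model identifier pinned for this skill
    max_tokens: int   # per-run token budget


def unpinned_skills(skills: Iterable[Skill],
                    floating_aliases: Tuple[str, ...] = ("latest",)) -> List[str]:
    """Return names of skills whose model id contains a floating alias."""
    return [s.name for s in skills
            if any(alias in s.model for alias in floating_aliases)]


def within_budget(skill: Skill, tokens_used: int) -> bool:
    """True while a run stays inside the skill's token budget."""
    return tokens_used <= skill.max_tokens


# Usage with made-up model identifiers.
skills = [
    Skill("schema_diff", "Compare two schemas...", ("read_ddl",), "example-model-2025-01-15", 8000),
    Skill("telemetry_plan", "Draft a telemetry plan...", (), "example-model-latest", 4000),
]
offenders = unpinned_skills(skills)
```

With pins held per skill, a model upgrade becomes a change to one skill that can be evaluated against that skill's own harness, the same way a single dependency bump would be.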