Hospitality SaaS · Valsoft · 2026 · AI Product Manager

AI Support Agent — multichannel, 5-layer safety

A multichannel (chat / voice / email) AI support agent for 1,500 hotel operators. Two-stage architecture: a read-only RAG agent for Q&A, and a backend deep-dive agent with typed DB access — both gated by a 5-layer safety pipeline and LLM-as-a-judge.

Auto-resolution: 30% → 80%
First response: <5s
Cost / interaction: $0.05–$0.10
Safety layers: 5
Stack: Claude (Sonnet + Haiku) · LangGraph · Qdrant + Cohere Rerank · FastAPI + NestJS · LangFuse · Redis

Problem

1,500 hotel operators on roomMaster Nova had no AI self-service. Every "how do I configure tax settings?" or "why isn't this folio printing?" landed in a human queue with 15–60 minute response times, business-hours-only coverage, and $8–$15 per ticket in agent time.

A predecessor (dm-support-agent, built for DockMaster) had proven the concept but accumulated structural problems: in-memory vectors lost on restart, in-memory call state lost mid-call, no observability, single-tenant only, 2-second LLM reranking, hardcoded Bedrock dependency. The right move wasn't to extend it — it was to redesign from scratch using what it had taught us.

Approach

I wrote the full PRD and system architecture for a two-stage agent: a multichannel front-line agent for documentation Q&A, and a backend deep-dive agent with typed read/write tools against the production DB. Every architectural decision below is grounded in a specific failure mode from the legacy system or a production requirement I specified.

  1. Hybrid Python + NestJS, not monolithic.

    The AI Core (FastAPI + LlamaIndex + LangGraph + Qdrant) is stateless Python with no auth, no DB, no tenancy. The NestJS Support Agent module — running inside the existing roomMaster Nova backend on AWS App Runner — handles JWT auth, per-hotel DB routing, conversation persistence, escalation, and webhooks. The Python core gets the AI ecosystem; the NestJS module inherits Nova's auth guards, logging, i18n, and deployment pipeline for free.

  2. 5-stage RAG with explicit quality controls.

    Preprocessor strips signatures and noise before embedding. Qdrant cosine search across 2,503 chunks of Nova docs. Cohere Rerank v3.5 on top-10 to produce top-5 — chosen after benchmarking: LLM-based rerank took ~2s with inconsistent quality; Cohere runs in ~200ms with reliable scoring. The answer generator is forced to cite source articles by title — fabrication is structurally blocked, not just warned against. A retrieval-and-rerank sketch follows this list.

  3. LangGraph state machine, explicit and auditable.

    Three-node graph: router (cheap, fast Haiku at temp 0 for deterministic intent classification), tool-call (search_docs or verify_customer), generate (Sonnet for the final response). Every tool call, every intent, every model invoked is captured in the AgentState TypedDict and propagated to LangFuse. A minimal graph sketch follows this list.

  4. 5-layer safety pipeline, non-negotiable.

    (1) Input validation blocks oversized messages and 8 prompt-injection patterns.
    (2) Input PCI guard runs Luhn against all input — Visa/MC/Amex, with or without spaces and dashes — and redacts card numbers before the LLM sees them.
    (3) PII redactor strips emails, phone numbers, and addresses from logs and traces.
    (4) Output PCI guard blocks the entire response if anything matches a card pattern.
    (5) Hallucination guard scores grounding 0–1.0 against retrieved sources; below 0.7, a disclaimer is appended and the trace is flagged. The threshold is env-configurable, so it tunes without a deploy.

    A sketch of layers 2 and 5 follows this list.

  5. Three modalities, three personas.

    Chat: under 200 words, bullets for multi-step, bold for menu paths. Email: greeting, full name, detailed explanation, sign-off. Voice: no markdown, no bullets, no special characters, "step one / step two" instead of numbered lists — Sadie speaks markdown literally if you don't strip it. The persona constraint is a product safety requirement, not cosmetic.

  6. Redis for call state, not in-memory dict.

    The legacy system lost in-flight call state on every routine deploy. Sadie webhooks now route through a NestJS SadieStateService backed by Redis. State is persistent, shared across App Runner instances, and survives restarts. Every call ends with a Zendesk ticket containing the full transcript and AI summary — resolved or escalated. A Redis-backed state sketch follows this list.

  7. LangFuse observability from day one.

    Every LLM call traced with model, tokens, latency, cost. The dm-support-agent had zero observability; retrofitting it into a production system is significantly harder than building it in. The LangFuse dashboard was a Phase 1 success metric, not a Phase 2 add-on.
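
The retrieve-then-rerank stage from item 2, as a minimal Python sketch. The collection name, payload field, and client setup are illustrative assumptions, not the production configuration:

    import cohere
    from qdrant_client import QdrantClient

    qdrant = QdrantClient(url="http://localhost:6333")
    co = cohere.Client(api_key="...")

    def retrieve(query_text: str, query_vector: list[float]) -> list[str]:
        # Stage 1: cosine top-10 from the docs collection (name assumed).
        hits = qdrant.search(
            collection_name="nova_docs",
            query_vector=query_vector,
            limit=10,
        )
        docs = [hit.payload["text"] for hit in hits]
        # Stage 2: Cohere Rerank v3.5 narrows top-10 to top-5 in ~200ms,
        # versus ~2s for the LLM-based rerank it replaced.
        reranked = co.rerank(
            model="rerank-v3.5",
            query=query_text,
            documents=docs,
            top_n=5,
        )
        return [docs[r.index] for r in reranked.results]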
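
The three-node graph from item 3, sketched with LangGraph. State fields and node internals are stubs; the real AgentState also carries tool calls and trace metadata:

    from typing import TypedDict

    from langgraph.graph import END, StateGraph

    class AgentState(TypedDict):
        message: str
        intent: str
        context: list[str]
        answer: str

    def router(state: AgentState) -> AgentState:
        # Haiku at temperature 0 classifies intent; stubbed here.
        state["intent"] = "docs_question"
        return state

    def tool_call(state: AgentState) -> AgentState:
        # search_docs or verify_customer, depending on intent; stubbed.
        state["context"] = ["<top-5 reranked chunks>"]
        return state

    def generate(state: AgentState) -> AgentState:
        # Sonnet writes the final, source-cited response; stubbed.
        state["answer"] = "Per 'Tax Settings', go to ..."
        return state

    graph = StateGraph(AgentState)
    graph.add_node("router", router)
    graph.add_node("tool_call", tool_call)
    graph.add_node("generate", generate)
    graph.set_entry_point("router")
    graph.add_edge("router", "tool_call")
    graph.add_edge("tool_call", "generate")
    graph.add_edge("generate", END)
    app = graph.compile()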
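
Layers 2 and 5 from item 4, sketched under assumed names. The regex, redaction token, and default threshold mirror the description, not the production values:

    import os
    import re

    CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")
    HALLUCINATION_THRESHOLD = float(os.getenv("HALLUCINATION_THRESHOLD", "0.7"))

    def luhn_valid(digits: str) -> bool:
        total, parity = 0, len(digits) % 2
        for i, ch in enumerate(digits):
            d = int(ch)
            if i % 2 == parity:  # double every second digit from the right
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    def redact_cards(text: str) -> str:
        # Layer 2: redact any Luhn-valid 13-19 digit run (Visa/MC/Amex,
        # with or without spaces and dashes) before the LLM sees it.
        def _swap(m: re.Match) -> str:
            digits = re.sub(r"[ -]", "", m.group())
            return "[REDACTED_CARD]" if luhn_valid(digits) else m.group()
        return CARD_RE.sub(_swap, text)

    def apply_hallucination_guard(answer: str, grounding: float) -> str:
        # Layer 5: below the env-configurable threshold, append a
        # disclaimer; trace flagging is omitted in this sketch.
        if grounding < HALLUCINATION_THRESHOLD:
            return answer + "\n\nThis answer may be incomplete; a human agent can confirm."
        return answer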
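
And the Redis-backed call state from item 6. The production service is the NestJS SadieStateService; this Python sketch, with an assumed key schema and TTL, just shows the shape:

    import json

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def save_call_state(call_id: str, state: dict) -> None:
        # Persistent and shared across App Runner instances, unlike the
        # legacy in-memory dict that a routine deploy would wipe.
        r.set(f"call:{call_id}", json.dumps(state), ex=4 * 3600)

    def load_call_state(call_id: str) -> dict | None:
        raw = r.get(f"call:{call_id}")
        return json.loads(raw) if raw else None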

The hallucination guard, the PCI pipeline, the source-citation requirement, the escalation detection, and the LangFuse tracing are not defensive additions. They are the product.

AI Support Agent PRD, internal

What shipped

  • AI Core.

    25 Python source files across 8 modules — preprocessor, embedder, retriever, reranker, generator, router, hallucination guard, observability — fully implemented and tested.

  • Pluggable LLM provider.

    Factory resolves LLM_PROVIDER at startup: OllamaProvider (free local inference) for dev, BedrockProvider (Claude) for prod. Same pattern for embeddings and reranker. Devs clone, set env, run Docker Compose — no AWS credentials, no billing during iteration. A factory sketch follows this list.

  • Cost-aware model routing.

    Sonnet (smart, expensive) for chat responses. Haiku (fast, cheap) for intent classification and safety checks. The most expensive model is invoked exactly once per turn, for the task where quality matters most.

  • 45+ unit tests, <30s suite.

    The tests cover all Luhn-valid card formats, all prompt-injection patterns, the full RAG pipeline, the LLM factory, the query preprocessor, and the reranker. Integration tests run the chat endpoint end-to-end against real Docker services. The highest-cost layer gets the highest test density. A sample test shape follows this list.

  • Three-phase roadmap.

    Phase 1 (weeks 1–7): operator support — software questions across chat, email, and voice. Phase 2 (weeks 8–12): staff assistant — room availability, housekeeping assignment, and bookings via voice and chat with live PMS tool calls. Phase 3 (weeks 13–18): multilingual guest concierge — check-in, room service, and complaints in EN / FR / ES. Phase gates are defined by who uses the agent and what data access the AI needs.
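
A sketch of the provider factory. The interface is assumed; class names follow the description, but the real method signatures are not shown here:

    import os

    class LLMProvider:
        def complete(self, prompt: str) -> str:
            raise NotImplementedError

    class OllamaProvider(LLMProvider):
        # Free local inference for dev: clone, set env, docker compose up.
        def complete(self, prompt: str) -> str:
            ...

    class BedrockProvider(LLMProvider):
        # Claude via AWS Bedrock in prod.
        def complete(self, prompt: str) -> str:
            ...

    def make_llm() -> LLMProvider:
        # Resolved once at startup; embeddings and the reranker follow
        # the same pattern.
        provider = os.getenv("LLM_PROVIDER", "ollama").lower()
        return BedrockProvider() if provider == "bedrock" else OllamaProvider()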
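
And the shape of the PCI tests, assuming the redact_cards helper sketched earlier; the module name is hypothetical, and the card numbers are standard Luhn-valid test numbers:

    import pytest

    from safety_guards import redact_cards  # hypothetical module name

    @pytest.mark.parametrize("raw", [
        "4111 1111 1111 1111",   # Visa, with spaces
        "5500-0000-0000-0004",   # Mastercard, with dashes
        "340000000000009",       # Amex, 15 digits
    ])
    def test_card_is_redacted(raw):
        assert "[REDACTED_CARD]" in redact_cards(raw)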

The infrastructure cost analysis was a PM deliverable too: EC2 Reserved ($81/mo) vs ECS Fargate ($147/mo) vs App Runner + ECS ($186/mo) — $445/yr in savings by choosing the simpler path while preserving a clean migration to managed services when traffic demands it. Each Docker Compose service is independently migratable; no big-bang migration required.

Outcome

Auto-resolution: 30% → 80%

Agent I (read-only RAG) resolves up to 30% at launch, 40%+ after RAG and persona iteration. Agent II (DB access, LLM-as-a-judge gated) resolves up to 80% on covered categories.

Latency: 60 min → 5s

First response time on covered categories collapses from 15–60 minutes (human queue) to under 5 seconds (LLM).

Unit cost: $10 → $0.10

Cost per interaction drops from $8–$15 in human-agent time to $0.05–$0.10 in LLM tokens. ~100× unit-cost reduction on covered traffic.

Coverage: 24/7, 1,500 hotels

Business-hours-only coverage becomes always-on, multichannel (chat / voice / email), in three languages by Phase 3.

What I’d do differently

  • Build the eval set from real tickets, not synthetic ones. Synthetic evals lie politely; real tickets surface the failure modes that matter.
  • Treat tools like models. Pin versions, change-log behavior, run smoke evals on every upgrade. A silent tool change is an invisible regression.
  • Specify the LangFuse cost dashboard before the first hotel goes live. Operators of the product need a cost-and-quality view from day one — it's the artifact that justifies further rollout.
