Gap Analysis: Stack B vs Virbe
What Stack B Already Has (KEEP)
Local AI Stack (Advantage over Virbe)
- STT: Whisper (local, no cloud dependency, no per-request cost)
- TTS: Kokoro (local, fast, no cloud round-trip latency)
- LLM: Ollama + qwen2.5:14b (local, private, customizable)
- RAG: Qdrant vector DB + nomic-embed-text (local, fast)
- Avatar: UE5 MetaHuman with Pixel Streaming (local rendering)
Infrastructure (Advantage)
- Self-hosted: No vendor lock-in, full data sovereignty
- GPU local: RTX 3060 on node1, RTX 5070 Ti on uterrie
- Container-native: Podman, Helm, Fleet GitOps
- CI/CD: Gitea Actions, Harbor registry
What Stack B is MISSING (BUILD)
P0: Critical (blocks basic kiosk demo)
1. Conversation Pipeline Router
Virbe has: Visual flow editor with Core → Main → Agent routing, If/Else branching, Go To Flow nodes.
Stack B has: Single hardcoded pipeline in orchestrator.
Build: State machine in orchestrator with configurable flow routing.
Proposed: YAML-based flow definitions loaded at startup
Future: Visual editor in admin UI
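A minimal sketch of how configurable flow routing could work. The schema (`if`, `goto`, `action`), the condition names, and the action names are all hypothetical; only the flow names echo the Core → Main → Agent routing mentioned above. The definition is written as a Python dict so the sketch needs no third-party parser, but the same shape could be loaded from a YAML file at startup.

```python
# Hypothetical flow definition: each flow is an ordered list of nodes.
# A node either jumps to another flow ("goto") or ends routing ("action").
FLOWS = {
    "core": [
        {"if": "is_greeting", "goto": "greet"},
        {"goto": "main"},  # unconditional fallback
    ],
    "main": [
        {"if": "needs_agent", "goto": "agent"},
        {"action": "rag_answer"},
    ],
    "greet": [{"action": "say_hello"}],
    "agent": [{"action": "handoff"}],
}

# Hypothetical predicates, referenced by name from the flow definition.
CONDITIONS = {
    "is_greeting": lambda text: text.lower().strip() in {"hi", "hello", "bonjour"},
    "needs_agent": lambda text: "human" in text.lower(),
}


def route(text: str, flow: str = "core", max_hops: int = 10) -> tuple[str, list[str]]:
    """Walk the flows until an action node is reached.

    Returns the action name and the list of flows visited on the way.
    """
    path = [flow]
    for _ in range(max_hops):  # guard against Go To loops
        for node in FLOWS[flow]:
            cond = node.get("if")
            if cond is not None and not CONDITIONS[cond](text):
                continue  # branch not taken, try the next node
            if "action" in node:
                return node["action"], path
            flow = node["goto"]
            path.append(flow)
            break
        else:
            raise ValueError(f"flow {flow!r} has no matching node")
    raise RuntimeError("too many Go To hops")
```

Returning the visited path alongside the action keeps flow transitions available for logging, which the analytics item further down would need.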
2. Greeting Flow
Virbe has: Dedicated Greet flow triggered on conversation-start.
Stack B has: Nothing; the conversation starts cold.
Build: conversation-start signal handler in orchestrator → TTS greeting → wait for input.
3. Processing Feedback ("Please Wait")
Virbe has: "Report long processing time" with configurable timeout.
Stack B has: Nothing; the avatar freezes during RAG+LLM processing.
Build: WebSocket progress event from orchestrator → signalling → kiosk UI shows animation.
4. Admin Dashboard
Virbe has: Full dashboard (conversations, analytics, KB management, flow editor).
Stack B has: Basic /health endpoints, no UI.
Build: Extend admin service with:
- Conversation log viewer
- Knowledge base CRUD
- Prompt/system instruction editor
- Real-time conversation monitor
P1: Important (quality of kiosk experience)
5. Handler System (Event Pre-Processing)
Virbe has: 5 handlers (On User Input, Language Change, conversation-start/stop, face-detected).
Stack B has: None.
Build: Event hooks in orchestrator pipeline:
@handler("conversation-start")
async def on_start(ctx):
    greet(ctx)

@handler("face-detected")
async def on_face(ctx):
    focus(ctx)

@handler("user-input")
async def on_input(ctx):
    log(ctx)
    detect_language(ctx)
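The `@handler` decorator is not defined anywhere in this analysis. A minimal sketch of a registry that could back it follows, assuming handlers are async functions taking a context dict; `dispatch` and the handler bodies here are hypothetical stand-ins.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]

# Event name -> handlers, in registration order.
_HANDLERS: dict[str, list[Handler]] = defaultdict(list)


def handler(event: str) -> Callable[[Handler], Handler]:
    """Register an async function as a hook for `event`."""
    def register(fn: Handler) -> Handler:
        _HANDLERS[event].append(fn)
        return fn  # leave the function usable on its own
    return register


async def dispatch(event: str, ctx: dict) -> int:
    """Run every hook for `event` in order; return how many ran."""
    hooks = _HANDLERS.get(event, [])
    for fn in hooks:
        await fn(ctx)
    return len(hooks)


@handler("conversation-start")
async def on_start(ctx: dict) -> None:
    ctx.setdefault("actions", []).append("greet")  # placeholder for the greeting


@handler("user-input")
async def on_input(ctx: dict) -> None:
    ctx.setdefault("actions", []).append("log")  # placeholder for logging
```

Events with no registered hook are a no-op, so the orchestrator can call `dispatch` unconditionally at each pipeline stage.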
6. Face Detection → Conversation Trigger
Virbe has: On face-detected handler starts conversation.
Stack B has: Manual start only.
Build: Camera feed → face detection model → WebSocket signal → orchestrator starts conversation.
7. Focus/Defocus State Machine
Virbe has: Configurable timeouts (defocus after 60s, stop listening after 20s), new user on focus.
Stack B has: No idle management.
Build: State machine in signalling:
IDLE → (face detected) → FOCUSED → (speech) → LISTENING → (silence 20s) → FOCUSED → (idle 60s) → IDLE
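The transitions above can be sketched as a small table-driven state machine. The event names (`face_detected`, `speech`, `silence_timeout`, `idle_timeout`) and the class are hypothetical; the 20 s and 60 s values come from the timeouts listed above. Time is passed in explicitly so the logic can be exercised without a real clock.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    FOCUSED = "focused"
    LISTENING = "listening"


# (current state, event) -> next state
TRANSITIONS = {
    (State.IDLE, "face_detected"): State.FOCUSED,
    (State.FOCUSED, "speech"): State.LISTENING,
    (State.LISTENING, "silence_timeout"): State.FOCUSED,
    (State.FOCUSED, "idle_timeout"): State.IDLE,
}

# state -> (timeout event, seconds spent in that state before it fires)
TIMEOUTS = {
    State.LISTENING: ("silence_timeout", 20.0),
    State.FOCUSED: ("idle_timeout", 60.0),
}


class FocusMachine:
    def __init__(self, now: float = 0.0) -> None:
        self.state = State.IDLE
        self.entered_at = now

    def on_event(self, event: str, now: float) -> State:
        """Apply an event; events with no transition are ignored."""
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is not None:
            self.state, self.entered_at = nxt, now
        return self.state

    def tick(self, now: float) -> State:
        """Fire the timeout event for the current state if it is due."""
        timeout = TIMEOUTS.get(self.state)
        if timeout is not None and now - self.entered_at >= timeout[1]:
            return self.on_event(timeout[0], now)
        return self.state
```

As a simplification, timeouts are measured from the moment a state is entered. A production version would likely restart the silence timer on every speech frame rather than only on state entry.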
8. TTS Output Quality
Virbe has: Azure TTS (high quality, natural French).
Stack B has: Kokoro (decent but less natural).
Build: Keep Kokoro as default, add option for ElevenLabs API as premium TTS (key already in keychain).
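A sketch of how the default/premium choice could be wired so that a failing premium backend never silences the avatar. It deliberately models each backend as a plain `text -> bytes` callable; the stubs below are placeholders and do not represent the Kokoro or ElevenLabs APIs.

```python
from typing import Callable, Optional

Synth = Callable[[str], bytes]  # text in, audio bytes out


def make_tts(
    default: Synth,
    premium: Optional[Synth] = None,
    use_premium: bool = False,
) -> Synth:
    """Return a synthesizer that prefers `premium` when enabled and
    falls back to `default` if the premium call raises."""
    def synth(text: str) -> bytes:
        if use_premium and premium is not None:
            try:
                return premium(text)
            except Exception:
                pass  # network or API failure: fall through to local TTS
        return default(text)
    return synth


# Placeholder backends, for illustration only.
def local_stub(text: str) -> bytes:
    return b"local:" + text.encode("utf-8")


def premium_stub(text: str) -> bytes:
    return b"premium:" + text.encode("utf-8")


def broken_premium(text: str) -> bytes:
    raise ConnectionError("premium TTS unreachable")
```

Keeping the local backend as the unconditional fallback preserves the no-cloud-dependency advantage listed at the top of this analysis.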
P2: Nice to Have (polish)
9. Version Management
Virbe has: Draft/Published/Retired lifecycle.
Stack B has: Git-based (implicit versioning).
Build: Config versioning via Git tags + admin UI version selector. Lower priority since Git handles this.
10. Multi-Profile Support
Virbe has: Multiple profiles with different personas, languages, settings.
Stack B has: Single configuration.
Build: Profile configs in admin DB, selectable at conversation start.
11. Knowledge Base Management UI
Virbe has: Collection/Document CRUD with rich text editor, embedding status.
Stack B has: Manual Qdrant ingestion.
Build: Admin UI → upload docs → chunk → embed → Qdrant. Preview RAG results.
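The chunk step of that pipeline could look like the sketch below. The source does not specify a chunking strategy, so fixed-size word windows with overlap are an assumption, as are the function name and default sizes. Embedding and the Qdrant upsert are left out.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of at most `size` words, where consecutive
    windows share `overlap` words so sentences cut at a boundary still
    appear whole in one chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each returned chunk would then be embedded and stored with a reference to its source document, which is what lets the admin UI report embedding status per document.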
12. Conversation Analytics
Virbe has: Full conversation logs with flow transitions, topic/summary.
Stack B has: None.
Build: Log all conversations to DB, display in admin UI with search/filter.
13. Display UI / Overlays
Virbe has: Cards, images, product display during conversation.
Stack B has: Avatar only.
Build: WebSocket overlay commands → signalling → kiosk HTML overlay layer.
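An overlay command needs a wire format that both the orchestrator and the kiosk page agree on. The sketch below shows one possible JSON message; the field names (`kind`, `payload`, `ttl_s`) and the allowed kinds are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

ALLOWED_KINDS = {"card", "image", "clear"}  # hypothetical overlay types


@dataclass
class OverlayCommand:
    kind: str                              # which overlay to show
    payload: dict = field(default_factory=dict)
    ttl_s: Optional[float] = None          # auto-hide after N seconds, if set


def encode(cmd: OverlayCommand) -> str:
    """Serialize a command for sending over the WebSocket."""
    if cmd.kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown overlay kind: {cmd.kind!r}")
    return json.dumps({"type": "overlay", **asdict(cmd)})


def decode(raw: str) -> OverlayCommand:
    """Parse a message on the receiving side, rejecting other message types."""
    msg = json.loads(raw)
    if msg.get("type") != "overlay":
        raise ValueError("not an overlay message")
    return OverlayCommand(
        kind=msg["kind"],
        payload=msg.get("payload", {}),
        ttl_s=msg.get("ttl_s"),
    )
```

A `type` discriminator lets overlay commands share one WebSocket with other message types, such as progress events, without ambiguity.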
Architecture Principle
Virbe is a monolithic SaaS with everything integrated. Stack B should be modular microservices with clear APIs:
┌─────────────┐  ┌──────────────┐  ┌─────────────┐
│ signalling  │←→│ orchestrator │←→│ speaches    │
│ (WebRTC +   │  │ (pipeline +  │  │ (STT/TTS)   │
│ kiosk UI)   │  │ RAG + LLM)   │  │             │
└─────────────┘  └──────┬───────┘  └─────────────┘
                        │
                 ┌──────┴───────┐
                 │ admin        │
                 │ (dashboard)  │
                 └──────────────┘
Each service owns its domain. Communication via WebSocket (real-time) and REST (config/admin).