Gap Analysis: Stack B vs Virbe

What Stack B Already Has (KEEP)

Local AI Stack (Advantage over Virbe)

  • STT: Whisper (local, no cloud dependency, no per-request cost)
  • TTS: Kokoro (local, fast, no cloud round-trip latency)
  • LLM: Ollama + qwen2.5:14b (local, private, customizable)
  • RAG: Qdrant vector DB + nomic-embed-text (local, fast)
  • Avatar: UE5 MetaHuman with Pixel Streaming (local rendering)

Infrastructure (Advantage)

  • Self-hosted: No vendor lock-in, full data sovereignty
  • Local GPUs: RTX 3060 on node1, RTX 5070 Ti on uterrie
  • Container-native: Podman, Helm, Fleet GitOps
  • CI/CD: Gitea Actions, Harbor registry

What Stack B is MISSING (BUILD)

P0 — Critical (blocks basic kiosk demo)

1. Conversation Pipeline Router

Virbe has: Visual flow editor with Core → Main → Agent routing, If/Else branching, Go To Flow nodes. Stack B has: Single hardcoded pipeline in orchestrator. Build: State machine in orchestrator with configurable flow routing.

Proposed: YAML-based flow definitions loaded at startup
Future: Visual editor in admin UI
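
As a rough illustration of configurable flow routing, the sketch below walks a set of flow definitions shaped like what a YAML file might parse to (for example via PyYAML's `yaml.safe_load`). The flow names echo Virbe's Core → Main → Agent routing; the step schema (`if`, `goto`, `say`), the `FLOWS` table, and the `route` function are all hypothetical and not part of the existing orchestrator.

```python
from typing import Any

# Hypothetical flow definitions, shaped like what a YAML file might parse to.
FLOWS: dict[str, list[dict[str, Any]]] = {
    "core": [
        {"if": "is_greeting", "goto": "greet"},
        {"goto": "main"},
    ],
    "greet": [{"say": "Hello! How can I help?"}],
    "main": [
        {"if": "needs_agent", "goto": "agent"},
        {"say": "answer_from_rag"},
    ],
    "agent": [{"say": "handoff_to_agent"}],
}

MAX_HOPS = 10  # guard against Go To loops in a misconfigured flow


def route(
    flows: dict[str, list[dict[str, Any]]],
    start: str,
    facts: dict[str, bool],
) -> tuple[list[str], str]:
    """Walk flows from `start`; return (visited flow names, final action)."""
    visited: list[str] = []
    current = start
    for _ in range(MAX_HOPS):
        visited.append(current)
        for step in flows[current]:
            cond = step.get("if")
            if cond is not None and not facts.get(cond, False):
                continue  # condition not met, try the next step
            if "goto" in step:
                current = step["goto"]
                break  # jump to the target flow
            return visited, step["say"]
        else:
            raise ValueError(f"flow {current!r} has no matching step")
    raise RuntimeError("too many flow hops; check for Go To cycles")
```

Calling `route(FLOWS, "core", {"needs_agent": True})` visits `core`, `main`, and `agent` in turn. Keeping the definitions as plain data is what would later let a visual editor write the same structure.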

2. Greeting Flow

Virbe has: Dedicated Greet flow triggered on conversation-start. Stack B has: Nothing — conversation starts cold. Build: conversation-start signal handler in orchestrator → TTS greeting → wait for input.

3. Processing Feedback ("Please Wait")

Virbe has: "Report long processing time" with configurable timeout. Stack B has: Nothing — avatar freezes during RAG+LLM processing. Build: WebSocket progress event from orchestrator → signalling → kiosk UI shows animation.
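
One possible shape for the timeout logic, sketched with `asyncio`: the orchestrator awaits the RAG+LLM work and emits a progress event only if the work outlasts a configurable threshold. The function name `run_with_feedback`, the `notify` callback, and the event fields are assumptions for illustration; the real WebSocket send would replace `notify`.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_feedback(
    work: Awaitable[str],
    notify: Callable[[dict], Awaitable[None]],
    timeout_s: float = 2.0,
) -> str:
    """Await `work`; if it outlasts `timeout_s`, emit progress events."""
    task = asyncio.ensure_future(work)
    # asyncio.wait does not cancel the task when the timeout expires.
    done, _pending = await asyncio.wait({task}, timeout=timeout_s)
    if not done:
        # Still processing: tell the kiosk UI to show a waiting animation.
        await notify({"type": "processing", "state": "started"})
        result = await task
        await notify({"type": "processing", "state": "finished"})
        return result
    return task.result()
```

Fast responses produce no events at all, so the kiosk only shows the animation when the wait is actually noticeable.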

4. Admin Dashboard

Virbe has: Full dashboard (conversations, analytics, KB management, flow editor). Stack B has: Basic /health endpoints, no UI. Build: Extend admin service with:

  • Conversation log viewer
  • Knowledge base CRUD
  • Prompt/system instruction editor
  • Real-time conversation monitor

P1 — Important (quality of kiosk experience)

5. Handler System (Event Pre-Processing)

Virbe has: 5 handlers (On User Input, Language Change, conversation-start/stop, face-detected). Stack B has: None. Build: Event hooks in orchestrator pipeline:

@handler("conversation-start")
async def on_start(ctx):
    greet(ctx)  # speak the greeting via TTS

@handler("face-detected")
async def on_face(ctx):
    focus(ctx)  # enter the FOCUSED state

@handler("user-input")
async def on_input(ctx):
    log(ctx)
    detect_language(ctx)
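
The `@handler` decorator itself is not defined anywhere yet. A minimal sketch of what it could be, assuming a module-level registry and a plain `dict` as the context object (both hypothetical choices, as is the `dispatch` function):

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

Handler = Callable[[dict[str, Any]], Awaitable[None]]

# Event name -> handlers, in registration order.
_HANDLERS: dict[str, list[Handler]] = defaultdict(list)


def handler(event: str) -> Callable[[Handler], Handler]:
    """Register the decorated coroutine function for `event`."""
    def register(fn: Handler) -> Handler:
        _HANDLERS[event].append(fn)
        return fn
    return register


async def dispatch(event: str, ctx: dict[str, Any]) -> int:
    """Run every handler for `event` in order; return how many ran."""
    handlers = _HANDLERS.get(event, [])
    for fn in handlers:
        await fn(ctx)
    return len(handlers)


@handler("conversation-start")
async def on_start(ctx: dict[str, Any]) -> None:
    # Stand-in for a real greeting: record the intended action.
    ctx.setdefault("actions", []).append("greet")
```

The orchestrator would call `await dispatch("conversation-start", ctx)` at the matching point in its pipeline; events with no registered handler are simply a no-op.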

6. Face Detection → Conversation Trigger

Virbe has: On face-detected handler starts conversation. Stack B has: Manual start only. Build: Camera feed → face detection model → WebSocket signal → orchestrator starts conversation.

7. Focus/Defocus State Machine

Virbe has: Configurable timeouts (defocus after 60s, stop listening after 20s), new user on focus. Stack B has: No idle management. Build: State machine in signalling:

IDLE → (face detected) → FOCUSED → (speech) → LISTENING → (silence 20s) → FOCUSED → (idle 60s) → IDLE
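
The diagram above maps naturally onto a transition table. The sketch below is one way to encode it; the event names (`face_detected`, `speech`, `silence_timeout`, `idle_timeout`) are assumptions, and the 20 s and 60 s timers that would raise the timeout events are left to the caller.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    FOCUSED = "focused"
    LISTENING = "listening"


# (current state, event) -> next state, mirroring the diagram above.
TRANSITIONS = {
    (State.IDLE, "face_detected"): State.FOCUSED,
    (State.FOCUSED, "speech"): State.LISTENING,
    (State.LISTENING, "silence_timeout"): State.FOCUSED,  # after 20 s of silence
    (State.FOCUSED, "idle_timeout"): State.IDLE,          # after 60 s idle
}


def step(state: State, event: str) -> State:
    """Return the next state; events with no transition are ignored."""
    return TRANSITIONS.get((state, event), state)
```

Ignoring unknown events, rather than raising, keeps a stray signal (for example speech while IDLE) from crashing the signalling service; whether that is the right policy is a design decision still to be made.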

8. TTS Output Quality

Virbe has: Azure TTS (high quality, natural French). Stack B has: Kokoro (decent but less natural). Build: Keep Kokoro as default, add option for ElevenLabs API as premium TTS (key already in keychain).

P2 — Nice to Have (polish)

9. Version Management

Virbe has: Draft/Published/Retired lifecycle. Stack B has: Git-based (implicit versioning). Build: Config versioning via Git tags + admin UI version selector. Lower priority since Git handles this.

10. Multi-Profile Support

Virbe has: Multiple profiles with different personas, languages, settings. Stack B has: Single configuration. Build: Profile configs in admin DB, selectable at conversation start.

11. Knowledge Base Management UI

Virbe has: Collection/Document CRUD with rich text editor, embedding status. Stack B has: Manual Qdrant ingestion. Build: Admin UI → upload docs → chunk → embed → Qdrant. Preview RAG results.
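
The chunk step of that pipeline can be sketched without touching any service. The version below uses fixed-size character windows with overlap, which is only one of several reasonable strategies; the function name and the default sizes are assumptions, and embedding with nomic-embed-text and writing to Qdrant are deliberately left out.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split `text` into windows of `size` chars sharing `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap means a sentence cut at a window boundary still appears whole in at least one chunk, provided it is shorter than the overlap.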

12. Conversation Analytics

Virbe has: Full conversation logs with flow transitions, topic/summary. Stack B has: None. Build: Log all conversations to DB, display in admin UI with search/filter.
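
A minimal sketch of the logging and search side, assuming SQLite as the store (the document does not name a database) and a hypothetical `turns` table with one row per utterance:

```python
import sqlite3
import time


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open the log database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS turns (
               id INTEGER PRIMARY KEY,
               conversation_id TEXT NOT NULL,
               role TEXT NOT NULL,        -- 'user' or 'assistant'
               text TEXT NOT NULL,
               created_at REAL NOT NULL   -- Unix timestamp
           )"""
    )
    return conn


def log_turn(conn: sqlite3.Connection, conversation_id: str,
             role: str, text: str) -> None:
    """Append one utterance to the log."""
    conn.execute(
        "INSERT INTO turns (conversation_id, role, text, created_at) "
        "VALUES (?, ?, ?, ?)",
        (conversation_id, role, text, time.time()),
    )
    conn.commit()


def search(conn: sqlite3.Connection, needle: str) -> list[tuple[str, str, str]]:
    """Return (conversation_id, role, text) rows whose text contains `needle`."""
    rows = conn.execute(
        "SELECT conversation_id, role, text FROM turns "
        "WHERE text LIKE ? ORDER BY id",
        (f"%{needle}%",),
    )
    return rows.fetchall()
```

Flow transitions and topic summaries, which Virbe also records, would need extra columns or a second table.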

13. Display UI / Overlays

Virbe has: Cards, images, product display during conversation. Stack B has: Avatar only. Build: WebSocket overlay commands → signalling → kiosk HTML overlay layer.
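
To make the overlay commands concrete, here is a sketch of one possible message format. The field names (`type`, `action`, `kind`, `payload`) and the `OverlayCommand` class are hypothetical; the only thing taken from the plan above is that commands travel as WebSocket messages to the kiosk's overlay layer.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class OverlayCommand:
    action: str                                   # e.g. "show" or "hide"
    kind: str = "card"                            # e.g. "card", "image", "product"
    payload: dict = field(default_factory=dict)   # content for the overlay

    def to_message(self) -> str:
        """Serialize to the JSON text sent over the WebSocket."""
        return json.dumps({"type": "overlay", **asdict(self)})


def parse_message(raw: str) -> OverlayCommand:
    """Decode a JSON message, rejecting anything that is not an overlay."""
    data = json.loads(raw)
    if data.get("type") != "overlay":
        raise ValueError("not an overlay message")
    return OverlayCommand(
        action=data["action"],
        kind=data.get("kind", "card"),
        payload=data.get("payload", {}),
    )
```

A `type` discriminator lets overlay commands share one socket with other events, such as processing feedback, without ambiguity.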

Architecture Principle

Virbe is a monolithic SaaS with everything integrated. Stack B should be modular microservices with clear APIs:

┌─────────────┐  ┌──────────────┐  ┌─────────────┐
│  signalling │←→│ orchestrator │←→│  speaches   │
│  (WebRTC +  │  │  (pipeline + │  │  (STT/TTS)  │
│   kiosk UI) │  │  RAG + LLM)  │  │             │
└─────────────┘  └──────┬───────┘  └─────────────┘
                        │
                 ┌──────┴───────┐
                 │    admin     │
                 │  (dashboard) │
                 └──────────────┘

Each service owns its domain. Services communicate over WebSocket for real-time events and over REST for configuration and admin operations.