Deployment Model

Current Stack B Deployment

Gitea (engineering-*) → Gitea Actions (uterrie) → Harbor → Fleet GitOps → node1 (RKE2)

Build: Podman/Buildah on uterrie
Registry: Harbor (harbor.cledorze.lan)
Deploy: Fleet watches engineering-infra, deploys Helm charts to node1
Target: node1 with praesenz.io/role=qa label + NoSchedule taint

Container Images

| Image | Registry Path | Source Repo |
|---|---|---|
| u5-orchestrator | harbor.cledorze.lan/library/u5-orchestrator | engineering-ai |
| u5-signalling | harbor.cledorze.lan/library/u5-signalling | engineering-frontend |
| u5-admin | harbor.cledorze.lan/library/u5-admin | engineering-frontend |
| u5-avatar | (external or local build) | engineering-frontend |
| speaches | harbor.cledorze.lan/library/speaches | engineering-ai |
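The registry paths above all follow the same `host/project/name` pattern. As a minimal sketch, a helper like the one below could build a full image reference for a given tag. The function name `image_ref` and the use of a commit SHA as the tag are assumptions for illustration; the source does not state the tagging scheme.

```python
HARBOR_HOST = "harbor.cledorze.lan"  # registry host from the image table
HARBOR_PROJECT = "library"           # Harbor project shared by the listed images


def image_ref(name: str, tag: str) -> str:
    """Return a fully qualified reference of the form host/project/name:tag."""
    if not name or not tag:
        raise ValueError("image name and tag must be non-empty")
    return f"{HARBOR_HOST}/{HARBOR_PROJECT}/{name}:{tag}"


# "abc1234" is a hypothetical tag, not one taken from the real pipeline.
print(image_ref("u5-orchestrator", "abc1234"))
```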

Target Deployment (Per Store)

Kubernetes Resources

# Per kiosk instance
resources:
  orchestrator:
    requests: { cpu: "500m", memory: "512Mi" }
    limits: { cpu: "1000m", memory: "1Gi" }
    replicas: 1
    gpu: false

  signalling:
    requests: { cpu: "250m", memory: "128Mi" }
    limits: { cpu: "500m", memory: "256Mi" }
    replicas: 1  # per kiosk
    gpu: false

  speaches:
    requests: { cpu: "1000m", memory: "2Gi" }
    limits: { cpu: "2000m", memory: "4Gi" }
    replicas: 1
    gpu: true  # nvidia.com/gpu: 1

  avatar:
    requests: { cpu: "2000m", memory: "4Gi" }
    limits: { cpu: "4000m", memory: "8Gi" }
    replicas: 1  # per kiosk
    gpu: true  # nvidia.com/gpu: 1

  admin:
    requests: { cpu: "250m", memory: "256Mi" }
    limits: { cpu: "500m", memory: "512Mi" }
    replicas: 1
    gpu: false

# Shared services (per store, not per kiosk)
  ollama:
    gpu: true
    models: ["qwen2.5:14b", "nomic-embed-text"]

  qdrant:
    storage: 10Gi
    gpu: false
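To size a store, the per-kiosk requests above can be summed and multiplied by the number of kiosks. The sketch below does that arithmetic. It assumes all five services scale with the kiosk count, as the "Per kiosk instance" comment suggests, and it leaves out the shared services (ollama, qdrant) because the spec gives no CPU or memory requests for them. The helper names are illustrative.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_mem(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "2Gi" to MiB."""
    if quantity.endswith("Gi"):
        return int(float(quantity[:-2]) * 1024)
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


# (cpu request, memory request, GPUs) copied from the per-kiosk spec.
PER_KIOSK_REQUESTS = {
    "orchestrator": ("500m", "512Mi", 0),
    "signalling": ("250m", "128Mi", 0),
    "speaches": ("1000m", "2Gi", 1),
    "avatar": ("2000m", "4Gi", 1),
    "admin": ("250m", "256Mi", 0),
}


def kiosk_requests(kiosks: int) -> dict:
    """Total requested CPU (millicores), memory (MiB) and GPUs for N kiosks."""
    cpu = sum(parse_cpu(c) for c, _, _ in PER_KIOSK_REQUESTS.values())
    mem = sum(parse_mem(m) for _, m, _ in PER_KIOSK_REQUESTS.values())
    gpus = sum(g for _, _, g in PER_KIOSK_REQUESTS.values())
    return {
        "cpu_millicores": cpu * kiosks,
        "memory_mib": mem * kiosks,
        "gpus": gpus * kiosks,
    }


print(kiosk_requests(1))
```

With these figures a single kiosk requests 4000 millicores, 7040 MiB of memory, and 2 GPUs, before the shared ollama GPU is counted.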

Helm Chart Structure

stack-b/helm/
  ├── Chart.yaml
  ├── values.yaml
  ├── values-qa.yaml        # node1 overrides
  ├── values-store1.yaml    # store-specific overrides
  ├── fleet.yaml
  └── templates/
      ├── orchestrator.yaml
      ├── signalling.yaml
      ├── speaches.yaml
      ├── avatar.yaml
      ├── admin.yaml
      ├── configmap-flows.yaml    # NEW: conversation flow definitions
      ├── configmap-prompts.yaml  # NEW: system prompts
      ├── configmap-kb.yaml       # NEW: KB config (collections, Qdrant endpoint)
      └── pvc-data.yaml

Configuration Management

Static config (Helm values):

- Service endpoints, ports
- GPU allocation
- Resource limits
- Store-specific settings (branding, language)

Dynamic config (Admin API → DB/ConfigMap):

- Conversation flows (YAML)
- System prompts
- Knowledge base documents
- Profile settings (timeouts, TTS voice)
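The static layer relies on Helm's value layering: a store file such as values-store1.yaml overrides only the keys it sets, and everything else falls through from values.yaml. The sketch below imitates that behaviour with a recursive dictionary merge, in which nested maps are merged and any other value is replaced. The keys in the example (`store.language`, `store.branding`) are hypothetical and only illustrate the store-specific settings mentioned above.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Merge override into base: nested maps merge, other values are replaced."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for values.yaml and values-store1.yaml.
base_values = {
    "store": {"language": "en", "branding": "default"},
    "speaches": {"gpu": True},
}
store1_values = {"store": {"language": "fr"}}

print(merge_values(base_values, store1_values))
```

Only `store.language` changes in the result; `store.branding` and the `speaches` block keep their base values, and neither input dictionary is modified.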

GitOps Flow

Developer pushes code → Gitea Actions builds container
                       → pushes to Harbor
                       → updates image tag in engineering-infra values.yaml

Fleet detects change → deploys to node1
                     → rolling update (zero downtime)
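The "updates image tag" step amounts to a small text edit in the values file that Fleet watches. The sketch below shows one way a CI job could do it without a YAML library. It assumes each service is a top-level key with a nested `tag:` entry; the real layout of values.yaml in engineering-infra is not given in this document, so treat the layout and the function name `set_image_tag` as illustrative.

```python
import re


def set_image_tag(values_text: str, service: str, new_tag: str) -> str:
    """Replace the first `tag:` value found under the given top-level service key."""
    lines = values_text.splitlines()
    in_service = False
    for i, line in enumerate(lines):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank lines and comments do not change the current section
        if re.match(r"^\S", line):
            # An unindented line starts a new top-level section.
            in_service = line.startswith(f"{service}:")
            continue
        if in_service:
            match = re.match(r"^(\s+tag:\s*).*$", line)
            if match:
                lines[i] = f'{match.group(1)}"{new_tag}"'
                return "\n".join(lines) + "\n"
    raise KeyError(f"no image tag found for service {service!r}")


# Assumed values.yaml layout, for illustration only.
sample = (
    "orchestrator:\n"
    "  image:\n"
    "    repository: harbor.cledorze.lan/library/u5-orchestrator\n"
    '    tag: "old"\n'
    "admin:\n"
    "  image:\n"
    '    tag: "keep"\n'
)
print(set_image_tag(sample, "orchestrator", "abc1234"))
```

Committing the changed file is what Fleet detects; the rollout itself is left to Fleet and Helm.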

Zero-Downtime Updates

For kiosk service:

1. New pod starts alongside old pod
2. New pod passes health check
3. WebSocket connections on old pod drain (30s grace)
4. Old pod terminates

For conversation state:

- Active conversations are held in Redis/memory
- On restart: conversation state is lost (acceptable, since kiosk conversations are short)
- Future: persist conversation state in Redis for seamless handover

Monitoring

Health Checks

GET /health → { "status": "ok", "gpu": true, "model_loaded": true }
GET /ready  → { "ready": true, "qdrant": "connected", "ollama": "connected" }
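A readiness handler along these lines would report each dependency and fail the probe while any of them is down. The sketch below builds the `/ready` body shown above from two boolean checks. The "disconnected" string and the 503 status code are assumptions, since the source only shows the healthy response.

```python
def ready_payload(qdrant_ok: bool, ollama_ok: bool) -> tuple[int, dict]:
    """Return (HTTP status, JSON body) for GET /ready."""
    body = {
        "ready": qdrant_ok and ollama_ok,
        "qdrant": "connected" if qdrant_ok else "disconnected",
        "ollama": "connected" if ollama_ok else "disconnected",
    }
    # Kubernetes HTTP probes treat status codes from 200 to 399 as success.
    status = 200 if body["ready"] else 503
    return status, body


print(ready_payload(qdrant_ok=True, ollama_ok=True))
```

Returning a non-2xx status while a dependency is unavailable keeps traffic away from the pod until it can actually serve a conversation.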

Metrics (Prometheus)

kiosk_conversations_total (counter)
kiosk_response_time_seconds (histogram)
kiosk_stt_confidence (histogram)
kiosk_rag_chunks_retrieved (gauge)
kiosk_tts_generation_seconds (histogram)
kiosk_active_sessions (gauge)
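To make the histogram type concrete, the sketch below keeps cumulative buckets for `kiosk_response_time_seconds` and renders them in the Prometheus text exposition format. It uses only the standard library for illustration; a real service would more likely rely on a Prometheus client library. The bucket boundaries (1, 5, 15 seconds) are assumptions, chosen so that the 15 s alert threshold falls on a bucket edge.

```python
from bisect import bisect_left


class Histogram:
    """Minimal histogram that renders Prometheus text exposition lines."""

    def __init__(self, name: str, buckets: list):
        self.name = name
        self.bounds = sorted(buckets)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value,
        # which is the smallest bucket whose upper limit covers the value.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self) -> str:
        lines = [f"# TYPE {self.name} histogram"]
        cumulative = 0
        labels = [str(b) for b in self.bounds] + ["+Inf"]
        for label, count in zip(labels, self.counts):
            cumulative += count  # Prometheus buckets are cumulative
            lines.append(f'{self.name}_bucket{{le="{label}"}} {cumulative}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {cumulative}")
        return "\n".join(lines) + "\n"


response_time = Histogram("kiosk_response_time_seconds", [1.0, 5.0, 15.0])
for seconds in (0.8, 4.2, 20.0):  # made-up observations
    response_time.observe(seconds)
print(response_time.render())
```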

Alerting

Response time > 15s → warning
STT confidence < 0.5 → warning
GPU memory > 90% → critical
Service restart → warning
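The four rules above can be read as simple threshold checks. The sketch below evaluates them against a snapshot of current values and returns the alerts that would fire. It illustrates the rule semantics only and is not the actual alerting configuration; the function and parameter names are hypothetical.

```python
def evaluate_alerts(response_time_s: float, stt_confidence: float,
                    gpu_memory_fraction: float, restarted: bool) -> list:
    """Return (severity, reason) pairs for every rule the snapshot violates."""
    alerts = []
    if response_time_s > 15:
        alerts.append(("warning", "response time > 15s"))
    if stt_confidence < 0.5:
        alerts.append(("warning", "STT confidence < 0.5"))
    if gpu_memory_fraction > 0.90:
        alerts.append(("critical", "GPU memory > 90%"))
    if restarted:
        alerts.append(("warning", "service restart"))
    return alerts


# Made-up snapshot: slow response and high GPU memory use.
print(evaluate_alerts(response_time_s=18.0, stt_confidence=0.8,
                      gpu_memory_fraction=0.95, restarted=False))
```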