Deployment Model

Current Stack B Deployment

Gitea (engineering-*) → Gitea Actions (uterrie) → Harbor → Fleet GitOps → node1 (RKE2)
  • Build: Podman/Buildah on uterrie
  • Registry: Harbor (harbor.cledorze.lan)
  • Deploy: Fleet watches engineering-infra, deploys Helm charts to node1
  • Target: node1 with praesenz.io/role=qa label + NoSchedule taint

Container Images

Image            Registry Path                                 Source Repo
u5-orchestrator  harbor.cledorze.lan/library/u5-orchestrator   engineering-ai
u5-signalling    harbor.cledorze.lan/library/u5-signalling     engineering-frontend
u5-admin         harbor.cledorze.lan/library/u5-admin          engineering-frontend
u5-avatar        (external or local build)                     engineering-frontend
speaches         harbor.cledorze.lan/library/speaches          engineering-ai

Target Deployment (Per Store)

Kubernetes Resources

# Per kiosk instance
resources:
  orchestrator:
    requests: { cpu: "500m", memory: "512Mi" }
    limits: { cpu: "1000m", memory: "1Gi" }
    replicas: 1
    gpu: false

  signalling:
    requests: { cpu: "250m", memory: "128Mi" }
    limits: { cpu: "500m", memory: "256Mi" }
    replicas: 1  # per kiosk
    gpu: false

  speaches:
    requests: { cpu: "1000m", memory: "2Gi" }
    limits: { cpu: "2000m", memory: "4Gi" }
    replicas: 1
    gpu: true  # nvidia.com/gpu: 1

  avatar:
    requests: { cpu: "2000m", memory: "4Gi" }
    limits: { cpu: "4000m", memory: "8Gi" }
    replicas: 1  # per kiosk
    gpu: true  # nvidia.com/gpu: 1

  admin:
    requests: { cpu: "250m", memory: "256Mi" }
    limits: { cpu: "500m", memory: "512Mi" }
    replicas: 1
    gpu: false

# Shared services (per store, not per kiosk)
  ollama:
    gpu: true
    models: ["qwen2.5:14b", "nomic-embed-text"]

  qdrant:
    storage: 10Gi
    gpu: false
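
To sanity-check node sizing, the per-kiosk figures above can be summed. This is a minimal sketch, not part of the chart. It assumes every service in the per-kiosk block runs once per kiosk and that each `gpu: true` service takes one whole GPU (as the `nvidia.com/gpu: 1` comments suggest). It leaves out the shared ollama and qdrant services, because the source gives no CPU or memory figures for them. The helper names are hypothetical.

```python
# Per-kiosk requests and limits, copied from the resource block above.
# Each entry is (cpu, memory); "gpu" is the assumed number of whole GPUs.
PER_KIOSK = {
    "orchestrator": {"requests": ("500m", "512Mi"), "limits": ("1000m", "1Gi"), "gpu": 0},
    "signalling": {"requests": ("250m", "128Mi"), "limits": ("500m", "256Mi"), "gpu": 0},
    "speaches": {"requests": ("1000m", "2Gi"), "limits": ("2000m", "4Gi"), "gpu": 1},
    "avatar": {"requests": ("2000m", "4Gi"), "limits": ("4000m", "8Gi"), "gpu": 1},
    "admin": {"requests": ("250m", "256Mi"), "limits": ("500m", "512Mi"), "gpu": 0},
}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_mib(quantity: str) -> int:
    """Convert a memory quantity in Mi or Gi to MiB (other suffixes are not handled)."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported memory quantity: {quantity}")


def totals(kind: str) -> dict:
    """Sum 'requests' or 'limits' across all per-kiosk services."""
    return {
        "cpu_m": sum(cpu_millicores(s[kind][0]) for s in PER_KIOSK.values()),
        "memory_mib": sum(memory_mib(s[kind][1]) for s in PER_KIOSK.values()),
        "gpus": sum(s["gpu"] for s in PER_KIOSK.values()),
    }


print("requests:", totals("requests"))
print("limits:  ", totals("limits"))
```

Under these assumptions one kiosk requests 4 CPU cores, 7040Mi (about 6.9Gi) of memory and 2 GPUs, with limits of 8 cores and 14080Mi (13.75Gi).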

Helm Chart Structure

stack-b/helm/
  ├── Chart.yaml
  ├── values.yaml
  ├── values-qa.yaml        # node1 overrides
  ├── values-store1.yaml    # store-specific overrides
  ├── fleet.yaml
  └── templates/
      ├── orchestrator.yaml
      ├── signalling.yaml
      ├── speaches.yaml
      ├── avatar.yaml
      ├── admin.yaml
      ├── configmap-flows.yaml    # NEW: conversation flow definitions
      ├── configmap-prompts.yaml  # NEW: system prompts
      ├── configmap-kb.yaml       # NEW: KB config (collections, Qdrant endpoint)
      └── pvc-data.yaml

Configuration Management

Static config (Helm values):
  • Service endpoints, ports
  • GPU allocation
  • Resource limits
  • Store-specific settings (branding, language)

Dynamic config (Admin API → DB/ConfigMap):
  • Conversation flows (YAML)
  • System prompts
  • Knowledge base documents
  • Profile settings (timeouts, TTS voice)
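
How store-specific overrides combine with the defaults can be illustrated with a small sketch. Helm deep-merges maps from later values files over earlier ones and replaces lists and scalars wholesale; the function below mimics that behaviour in plain Python. It does not model Helm's handling of `null` overrides. The keys shown (`store.language`, `store.branding`, `speaches.gpu`) are hypothetical placeholders, not the chart's actual schema.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Return base with override applied: maps merge recursively, everything else is replaced."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical excerpt of values.yaml (defaults).
base = {
    "speaches": {"replicas": 1, "gpu": True},
    "store": {"language": "en", "branding": "default"},
}

# Hypothetical excerpt of values-store1.yaml (store-specific override).
store1 = {"store": {"language": "de"}}

effective = merge_values(base, store1)
print(effective["store"])
```

Only the overridden key changes; `branding` and the `speaches` block are inherited from the defaults, which is why a store file needs to list nothing but its differences.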

GitOps Flow

Developer pushes code → Gitea Actions builds container
                       → pushes to Harbor
                       → updates image tag in engineering-infra values.yaml

Fleet detects change → deploys to node1
                     → rolling update (zero downtime)
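
The "updates image tag" step of the flow above could be sketched as a small text rewrite that the CI job runs before committing to engineering-infra. This is an illustration only: it assumes values.yaml uses the common `image.repository` / `image.tag` layout, which the source does not confirm, and the version strings are made up. The function name is hypothetical.

```python
def set_image_tag(values_text: str, repository: str, new_tag: str) -> str:
    """Rewrite the first 'tag:' line that follows the matching 'repository:' line."""
    lines = values_text.splitlines()
    in_target = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("repository:"):
            current = stripped.split(":", 1)[1].strip().strip('"')
            in_target = current == repository
        elif in_target and stripped.startswith("tag:"):
            indent = line[: len(line) - len(line.lstrip())]
            lines[i] = f'{indent}tag: "{new_tag}"'
            return "\n".join(lines) + "\n"
    raise KeyError(f"no tag found for repository {repository}")


# Assumed layout of values.yaml; tags are placeholders.
VALUES = """\
orchestrator:
  image:
    repository: harbor.cledorze.lan/library/u5-orchestrator
    tag: "1.0.0"
signalling:
  image:
    repository: harbor.cledorze.lan/library/u5-signalling
    tag: "1.0.0"
"""

updated = set_image_tag(VALUES, "harbor.cledorze.lan/library/u5-signalling", "1.0.1")
print(updated)
```

Editing the file as text rather than re-serialising it keeps comments and key order intact, so the resulting Git diff that Fleet picks up is a single changed line.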

Zero-Downtime Updates

For kiosk service:
  1. New pod starts alongside old pod
  2. New pod passes health check
  3. WebSocket connections on old pod drain (30s grace)
  4. Old pod terminates

For conversation state:
  • Active conversations are stored in Redis/memory
  • On restart, conversation state is lost (acceptable, since kiosk conversations are short)
  • Future: persist conversation state in Redis for seamless handover
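
The drain step could look roughly like the following inside a service that holds WebSocket sessions. It is a sketch under assumptions: each open connection is represented by an asyncio task, the function is called from the process's SIGTERM handler, and the 30 second grace fits within the pod's `terminationGracePeriodSeconds`. The names `drain` and `_session` are hypothetical.

```python
import asyncio


async def drain(connections: set, grace_seconds: float = 30.0) -> int:
    """Wait for open connection tasks to finish; cancel any still running after the grace period.

    Returns the number of connections that had to be cancelled.
    """
    if not connections:
        return 0
    _done, pending = await asyncio.wait(connections, timeout=grace_seconds)
    for task in pending:
        task.cancel()
    if pending:
        # Let the cancelled tasks unwind before the process exits.
        await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def _session(seconds: float) -> None:
    """Stand-in for a WebSocket session that lasts the given time."""
    await asyncio.sleep(seconds)


async def demo() -> int:
    # One session ends quickly, the other outlives the (shortened) grace period.
    tasks = {
        asyncio.create_task(_session(0.01)),
        asyncio.create_task(_session(5)),
    }
    return await drain(tasks, grace_seconds=0.1)


cancelled = asyncio.run(demo())
print("connections cancelled after grace period:", cancelled)
```

Short kiosk conversations should normally finish inside the grace window, so only sessions that are still active after it lose their state.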

Monitoring

Health Checks

GET /health → { "status": "ok", "gpu": true, "model_loaded": true }
GET /ready  → { "ready": true, "qdrant": "connected", "ollama": "connected" }
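
A minimal sketch of how the two payloads above might be assembled is shown below. The response fields come from the examples; the `degraded` and `unreachable` strings and the use of HTTP 503 on failure are assumptions, chosen because Kubernetes HTTP probes treat any status outside 200 to 399 as a failure.

```python
import json


def health(gpu_available: bool, model_loaded: bool) -> tuple[int, dict]:
    """Build the /health response: liveness of the process and its model."""
    ok = gpu_available and model_loaded
    body = {
        "status": "ok" if ok else "degraded",  # "degraded" is an assumed value
        "gpu": gpu_available,
        "model_loaded": model_loaded,
    }
    return (200 if ok else 503), body


def ready(dependencies: dict[str, bool]) -> tuple[int, dict]:
    """Build the /ready response from a map of dependency name to reachability."""
    is_ready = all(dependencies.values())
    body: dict = {"ready": is_ready}
    for name, up in dependencies.items():
        body[name] = "connected" if up else "unreachable"  # assumed wording
    return (200 if is_ready else 503), body


status, body = ready({"qdrant": True, "ollama": False})
print(status, json.dumps(body))
```

Keeping dependency checks in `/ready` rather than `/health` means a temporary Qdrant or Ollama outage takes the pod out of rotation without restarting it.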

Metrics (Prometheus)

  • kiosk_conversations_total (counter)
  • kiosk_response_time_seconds (histogram)
  • kiosk_stt_confidence (histogram)
  • kiosk_rag_chunks_retrieved (gauge)
  • kiosk_tts_generation_seconds (histogram)
  • kiosk_active_sessions (gauge)
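
To make the histogram metrics concrete, the sketch below shows how observations are bucketed and rendered in the Prometheus text exposition format, using only the standard library. A real service would more likely use a client library such as `prometheus_client`; the bucket bounds (1, 5 and 15 seconds) and the sample latencies are assumptions for illustration.

```python
import bisect


class Histogram:
    """Tiny stand-in for a Prometheus histogram with cumulative 'le' buckets."""

    def __init__(self, name: str, bounds: list[float]):
        self.name = name
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left picks the first bound >= value, matching 'le' (less or equal).
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self) -> str:
        lines = [f"# TYPE {self.name} histogram"]
        running = 0
        labels = [repr(b) for b in self.bounds] + ["+Inf"]
        for label, count in zip(labels, self.counts):
            running += count  # buckets are cumulative in the exposition format
            lines.append(f'{self.name}_bucket{{le="{label}"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {running}")
        return "\n".join(lines)


response_time = Histogram("kiosk_response_time_seconds", [1.0, 5.0, 15.0])
for seconds in (0.8, 3.0, 16.2):  # made-up sample latencies
    response_time.observe(seconds)

text = response_time.render()
print(text)
```

Placing a bucket boundary at 15 seconds lines up with the response-time alert threshold, so the share of slow responses can be read directly from the bucket counts.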

Alerting

  • Response time > 15s → warning
  • STT confidence < 0.5 → warning
  • GPU memory > 90% → critical
  • Service restart → warning
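
The thresholds above can be expressed as a small rule table. The sketch below evaluates one sample against them; in practice these would more likely live in Prometheus alerting rules. The sample field names (`response_time_seconds`, `stt_confidence`, `gpu_memory_fraction`, `restarts`) are hypothetical and do not correspond one-to-one to the metric names listed earlier.

```python
# (sample field, predicate that fires the alert, severity)
RULES = [
    ("response_time_seconds", lambda v: v > 15, "warning"),
    ("stt_confidence", lambda v: v < 0.5, "warning"),
    ("gpu_memory_fraction", lambda v: v > 0.90, "critical"),
    ("restarts", lambda v: v > 0, "warning"),
]


def evaluate(sample: dict) -> list[tuple[str, str]]:
    """Return (field, severity) for every rule that fires; missing fields are skipped."""
    return [
        (field, severity)
        for field, fires, severity in RULES
        if field in sample and fires(sample[field])
    ]


alerts = evaluate(
    {
        "response_time_seconds": 16.2,
        "stt_confidence": 0.8,
        "gpu_memory_fraction": 0.95,
        "restarts": 0,
    }
)
print(alerts)
```

The comparisons are strict, so a response time of exactly 15 seconds or GPU memory at exactly 90% does not raise an alert.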