Stack 2 / Multi-kiosk production (target)

Status: target shape decided (V3 briefing, 2026-05-29); not built yet.
Role: scale deployment, N kiosks across M retail brands (enseignes).
Decision records: ADR-009 (per-site), ADR-011 (two-phase scale).

Why this stack appears in the docs before existing in code

We document the target now to:

  1. Force the architectural questions that guide 1A and 1B (sensors, observability, OTA updates, GDPR).
  2. Decide Stack 2 prod on scale criteria, not just demo criteria.
  3. Support the pitch deck: investors and pilot retailers want to see where it's going, not just where it is.

Scaling shape (hybrid by default)

The production model is hybrid by default. Three tiers, by store size then fleet size:

  1. Phase 1 — small store: one 16 GB master does everything, GPU-less thin workers (ADR-009).
  2. Phase 2 — larger store: a central RTX 4090 runs the AI pipeline, workers render locally on their own GPU (ADR-014).
  3. Fleet — many stores (> ~30 kiosks): a central mini-HPC carries the heavy shared models (ADR-011).
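The three tiers can be read as two separate decisions: the shape of each store, and whether the fleet is large enough to justify the central mini-HPC. The sketch below is one possible encoding, not a decided rule. It assumes that "small store" means at most 3 concurrent avatars (borrowed from the Phase 1 master's cap) and that the "~30 kiosks" trigger is a hard cut-off. All names (`plan_deployment`, `SiteShape`, `Deployment`) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

FLEET_TRIGGER_KIOSKS = 30   # the "~30 kiosks" trigger, treated here as a hard cut-off
PER_SITE_MAX_AVATARS = 3    # assumption: reuse the Phase 1 master's "3 avatars max" cap


class SiteShape(Enum):
    PER_SITE_MASTER = "phase 1 (ADR-009)"
    IN_STORE_SCALE_OUT = "phase 2 (ADR-014)"


@dataclass(frozen=True)
class Deployment:
    site_shape: SiteShape
    central_mini_hpc: bool  # fleet tier (ADR-011)


def plan_deployment(avatars_in_store: int, fleet_kiosks: int) -> Deployment:
    """Pick the store shape first, then decide whether the fleet tier applies."""
    if avatars_in_store < 1 or fleet_kiosks < 1:
        raise ValueError("counts must be positive")
    if avatars_in_store <= PER_SITE_MAX_AVATARS:
        shape = SiteShape.PER_SITE_MASTER
    else:
        shape = SiteShape.IN_STORE_SCALE_OUT
    return Deployment(shape, fleet_kiosks > FLEET_TRIGGER_KIOSKS)
```

For example, `plan_deployment(6, 8)` selects the in-store scale-out shape without a central mini-HPC, because the store exceeds the assumed avatar cap while the fleet stays under the trigger.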

Phase 1 — per-site master (small store)

Each store is autonomous; nothing centralized in the conversation path (ADR-009).

graph TB
    subgraph Site["Store site (full-local runtime)"]
        Master["MASTER NODE<br/>GPU 16 GB VRAM<br/>LLM + RAG + STT + TTS + NeuroSync + UE5<br/>RKE2 · Fleet · Harbor · Prometheus · Grafana<br/>3 avatars max"]
        W1["WORKER 1..4<br/>no GPU<br/>screen + mic + speaker + presence sensor"]
        Router["Managed 5G router<br/>WAN: maintenance only"]
    end
    Master -->|WebRTC pixel stream| W1
    Master <-->|planned maintenance<br/>models · RAG · telemetry · OTA| Router
    classDef hero fill:#166534,stroke:#22c55e,stroke-width:3px,color:#fff
    classDef edge fill:#a16207,stroke:#fbbf24,stroke-width:2px,color:#fff
    class Master hero
    class W1,Router edge

Phase 2 — in-store scale-out: central RTX 4090 + GPU-rendering workers

Within a single larger store, split AI from rendering (ADR-014): a central RTX 4090 handles STT → LLM/RAG → TTS → NeuroSync lip-sync for every worker; each worker has its own classic GPU and renders its UE5 avatar locally from the audio + animation streamed over the LAN (no central pixel streaming). Avatar count scales with workers instead of being capped by one master.

graph TB
    subgraph Site2["Store site (in-store scale-out)"]
        Brain["CENTRAL NODE · RTX 4090 (24 GB)<br/>STT → LLM / RAG → TTS → NeuroSync lip-sync<br/>shared across all workers"]
        WA["WORKER A<br/>classic GPU<br/>UE5 render + screen"]
        WB["WORKER B<br/>classic GPU<br/>UE5 render + screen"]
        WC["WORKER 1..N<br/>classic GPU<br/>UE5 render + screen"]
    end
    Brain -->|audio + animation<br/>(not video)| WA
    Brain -->|audio + animation| WB
    Brain -->|audio + animation| WC
    classDef hero fill:#166534,stroke:#22c55e,stroke-width:3px,color:#fff
    classDef edge fill:#a16207,stroke:#fbbf24,stroke-width:2px,color:#fff
    class Brain hero
    class WA,WB,WC edge
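To make "audio + animation (not video)" concrete, here is a minimal sketch of a message the central node could send to a worker. The wire format is purely an assumption: this page does not specify what NeuroSync emits or how it is transported. The sketch models animation as a flat list of float weights and audio as raw PCM bytes. `AvatarFrame` and its fields are hypothetical names.

```python
import struct
from dataclasses import dataclass

# Hypothetical header: sequence number (uint32), weight count (uint16), audio length (uint32).
_HEADER = struct.Struct("<IHI")


@dataclass(frozen=True)
class AvatarFrame:
    seq: int
    weights: tuple[float, ...]  # assumed animation payload (e.g. facial weights)
    audio_pcm: bytes            # assumed raw PCM chunk for the same time slice

    def encode(self) -> bytes:
        """Serialise header, float32 weights, then audio bytes."""
        header = _HEADER.pack(self.seq, len(self.weights), len(self.audio_pcm))
        body = struct.pack(f"<{len(self.weights)}f", *self.weights)
        return header + body + self.audio_pcm

    @classmethod
    def decode(cls, payload: bytes) -> "AvatarFrame":
        """Inverse of encode(); weights come back with float32 precision."""
        seq, count, audio_len = _HEADER.unpack_from(payload, 0)
        offset = _HEADER.size
        weights = struct.unpack_from(f"<{count}f", payload, offset)
        offset += 4 * count
        return cls(seq, weights, payload[offset:offset + audio_len])
```

A frame like this is a few dozen bytes of animation data plus the audio chunk, which illustrates why the worker needs its own GPU: it receives animation parameters and audio, not rendered pixels, and must produce the image itself.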

Fleet — central mini-HPC + edge workers (trigger: crossing ~30 kiosks)

graph TB
    subgraph Central["Datacenter / HQ"]
        HPC["CENTRAL MINI-HPC<br/>NVIDIA 48 GB<br/>heavy LLM (up to 70B) + large RAG<br/>central rendering: 12-15 avatars"]
    end
    subgraph Sites["Edge sites (1..N)"]
        E1["EDGE WORKER 1..N<br/>local UE5 rendering<br/>small STT/TTS"]
    end
    subgraph External["External data"]
        Salsify[Salsify SupplierXM + Akeneo]
        OFF[Open Food Facts]
        Equadis[Equadis / GS1 GDSN]
        WhatsApp[WhatsApp Business API]
        BO["Retailer IT system<br/>nightly SFTP CSV"]
    end
    E1 <-->|RAG/LLM queries| HPC
    HPC --> Salsify
    HPC --> OFF
    HPC --> Equadis
    HPC <--> WhatsApp
    HPC <-.-> BO
    classDef hero fill:#166534,stroke:#22c55e,stroke-width:3px,color:#fff
    classDef edge fill:#a16207,stroke:#fbbf24,stroke-width:2px,color:#fff
    classDef external fill:#1e293b,stroke:#475569,color:#e2e8f0
    class HPC hero
    class E1 edge
    class Salsify,OFF,Equadis,WhatsApp,BO external

Latency-sensitive rendering/STT/TTS stays at the edge; expensive, shareable inference (big LLM + big RAG) consolidates centrally. Each site keeps a local fallback so a degraded WAN link does not take the kiosk down.
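The local-fallback rule can be sketched as "try the central mini-HPC within a deadline, otherwise answer locally". This is a minimal illustration under stated assumptions: `central` and `local` are hypothetical callables standing in for the real inference clients, and the 1.5 s deadline is a placeholder, not a measured budget.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Shared pool so a slow central call does not block the caller past the deadline.
_pool = ThreadPoolExecutor(max_workers=4)


def answer_with_fallback(
    query: str,
    central: Callable[[str], str],
    local: Callable[[str], str],
    timeout_s: float = 1.5,  # placeholder deadline, to be tuned against real latency
) -> tuple[str, str]:
    """Return (answer, source), where source is "central" or "local"."""
    future = _pool.submit(central, query)
    try:
        return future.result(timeout=timeout_s), "central"
    except Exception:
        # Timeout or any error raised by the central call (e.g. WAN down):
        # drop the central attempt and serve from the local model instead.
        future.cancel()
        return local(query), "local"
```

Returning the source alongside the answer lets the kiosk record how often it runs degraded, which feeds the observability decision listed further down.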

Structural decisions to make before Stack 2

| Decision | Options | When to decide |
|---|---|---|
| ~~Local avatar OR cloud stream~~ | Decided (ADR-009): per-site master renders, thin workers display via WebRTC | |
| ~~Local LLM OR cloud LLM~~ | Decided (ADR-011): local per-site until ~30 kiosks, then central mini-HPC for heavy models | |
| Cross-retailer Léa identity | Single Léa everywhere / Léa rebranded per retailer | Before 2nd signed retailer |
| OTA updates | Mender / Balena / Fleet edge K8s / custom | Before 10 kiosks |
| Observability | OTel + Loki self-hosted / Datadog / Grafana Cloud | Before customer POC |
| Kiosk security | TPM + secure boot / hardware key / dedicated VPN | Before prod |

Stack 1A vs 1B decision criteria for Stack 2

(see ../../crosscutting/stack-a-vs-b-comparison.md)

Critical:

  • Cost per kiosk per month (LLM + avatar + telecom + supervision)
  • Intrinsic GDPR compliance
  • Offline robustness
  • Packaging effort from demo to 100 kiosks
  • Market differentiation (cloud vs sovereign)
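The first criterion is simple arithmetic, and writing it down makes the 1A vs 1B comparison repeatable. The sketch below sums the four components named above and adds one assumption that is not in the list: a `shared_site` term for costs split across a site's kiosks (for instance an amortised master node). All names are hypothetical, and any figures passed in are placeholders to be replaced with real quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyCosts:
    llm: float                # per-kiosk LLM cost per month
    avatar: float             # per-kiosk avatar/rendering cost per month
    telecom: float            # per-kiosk connectivity cost per month
    supervision: float        # per-kiosk monitoring/support cost per month
    shared_site: float = 0.0  # assumption: site-level cost split across its kiosks


def cost_per_kiosk(costs: MonthlyCosts, kiosks_on_site: int) -> float:
    """Monthly cost of one kiosk, including its share of site-level costs."""
    if kiosks_on_site < 1:
        raise ValueError("a site needs at least one kiosk")
    per_kiosk = costs.llm + costs.avatar + costs.telecom + costs.supervision
    return per_kiosk + costs.shared_site / kiosks_on_site
```

Keeping the shared term separate shows how the per-kiosk figure moves with the number of kiosks on a site, which is the scale effect this page is concerned with.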

To clarify

  • Kiosk form factor: integrated (one-piece kiosk) vs split (independent sensors + screen + hidden compute)
  • Connectivity: retailer Wi-Fi / dedicated SIM 4G / Ethernet PoE / all three with failover
  • Commercial model with retailer: €/month/kiosk SaaS + hardware as CapEx or rental
  • Service scope: full-service Praesenz (deployment, maintenance) vs self-service with integrator partner

ADRs referencing this stack

  • 003 / Inverted magnet
  • 004 / Stack 1B local-first as Stack 2 prod target
  • 005 / PIM-native architecture
  • 006 / 3-tier data supply (superseded by 013)
  • 007 / ESL broadcast as price truth
  • 008 / Picard pilot (extended by 012)
  • 009 / Per-site master + thin workers
  • 011 / Two-phase scale model
  • 012 / Multi-enseigne go-to-market
  • 013 / PIM source tiers (extended)