System architecture

This document is the single comprehensive view of how KVM Fleet is built, deployed and operated. It is intended for prospects with a security or platform team who want to understand the trust model, for auditors evaluating the compliance posture, and for future maintainers who need to navigate the codebase.

The companion documents in this section drill into specific subsystems:

Threat model — adversarial view, comparison with VPN / exposed BMC / shared passwords
WebSocket multiplexing — how the agent tunnel carries HTTP and WS sessions
Audit chain — the SHA-256 hash chain on audit_events
Row-Level Security — how Postgres RLS isolates tenants

At a glance

flowchart LR
    subgraph operator["Operator (anywhere)"]
        browser["Browser<br/>SPA + WebRTC"]
    end

    subgraph hetzner["Hetzner CX22 (Falkenstein, DE) — EU only"]
        caddy["Caddy 2<br/>TLS 1.3 terminator"]
        platform["FastAPI platform<br/>audit + RBAC + JIT + approval"]
        pg[("Postgres 16<br/>RLS + SHA-256 audit chain")]
        redis[("Redis 7<br/>presence + sessions")]
        coturn["coturn<br/>TURN relay"]
    end

    subgraph customer["Customer network"]
        agent["Go agent<br/>on PiKVM<br/>(outbound-only)"]
        kvmd["kvmd<br/>localhost:80"]
        bmc["Dell iDRAC / HPE iLO<br/>Supermicro / Lenovo XCC<br/>(Redfish-managed)"]
    end

    browser <==>|"HTTPS + WSS"| caddy
    caddy --> platform
    platform <--> pg
    platform <--> redis
    agent ==>|"outbound WSS<br/>(agent dials in)"| caddy
    agent --> kvmd
    platform -->|"Redfish HTTPS<br/>session or basic auth"| bmc
    browser -. "WebRTC P2P<br/>DTLS-SRTP" .-> agent
    browser -. "WebRTC fallback" .-> coturn
    coturn -. "fallback only" .-> agent

Three trust domains: the operator's identity (Google SSO + 2FA), the platform (KVM Fleet — audit, authorisation, signaling), and the customer's BMC / PiKVM. The platform sits in the middle of the data path for everything except WebRTC media (which terminates DTLS-SRTP at the agent, peer-to-peer to the operator's browser). No inbound ports are opened on the customer side — the agent dials out over WSS. The full adversarial breakdown is in Threat model.

1. Product overview in one paragraph

KVM Fleet is a B2B SaaS that adds a fleet dashboard, browser-based remote console, tamper-evident audit log, role-based access control, Google SSO, alerting, ISO library, per-device monitoring and one-click compliance reports on top of customer-owned PiKVM hardware. A small Go agent (static binary, arm/arm64/amd64) runs on each PiKVM and dials out to the platform over an authenticated WebSocket. From there an operator at app.kvmfleet.io gets unified control of every device, with all data-at-rest and most data-in-flight inside the European Union (Hetzner, Falkenstein).

2. Components

2.1 The marketing site (`kvmfleet.io`)

Static HTML/CSS, no JavaScript framework. Served by Caddy from deploy/landing/. Pages: /, /terms.html, /privacy.html, /dpa.html, plus /.well-known/security.txt, /robots.txt, /sitemap.xml, /llms.txt. Includes JSON-LD structured data (SoftwareApplication, Organization) so search engines and LLM crawlers can index the product correctly. Heavy use of CSS variables and a single embedded <style> block per page; no build step.

2.2 The dashboard SPA (`app.kvmfleet.io`)

React 18 + TypeScript + Tailwind, bundled by Vite. Built once per deploy by the web-build Docker service into a Caddy-served static volume. State is managed by @tanstack/react-query. Real-time updates use a WebSocket to the platform at /v1/ws for presence + heartbeat broadcasts. Charts use Recharts. WebRTC console uses the browser's native RTCPeerConnection. No SSR — fully client-side after the initial bundle download.

2.3 The platform (FastAPI, Python 3.12)

Single FastAPI process behind Caddy at app.kvmfleet.io/v1/*. Routers under platform/app/routers/:

Router	Surface
`auth`	Local signup/login, password change/reset, refresh-token rotation
`sso`	Google OIDC SSO callback flow
`twofa`	TOTP enrolment, recovery codes, MFA challenge
`team`	Org member roster, invites, role changes, account deletion
`org`	Org-level settings (compliance frameworks, country)
`devices`	Device list, enrolment, rotate token, remove
`power`	ATX power control via the agent tunnel
`console`	The legacy iframe-tunnelled kvmd console
`webrtc`	Phase 1 WebRTC signalling (preview)
`isos`	ISO library catalogue + per-device mount/unmount
`support`	Customer-side support tickets
`admin`	Support-admin-only routes (errors, ticket cross-org view)
`audit`	Per-org audit event query + chain integrity check
`audit_webhooks`	Per-org SIEM webhook endpoints + dispatcher
`alerts`	Alert rules + history
`billing`	Stripe Checkout + Customer Portal + webhook + plan info
`reports`	PDF compliance report generation
`api_tokens`	Long-lived bearer tokens for the MCP server / scripts
`ws`	Operator presence stream + agent tunnel endpoint
`public`	Public, unauth endpoints (plans, beta status)
`downloads`	Signed agent binary downloads (pinned version)
`beta`	Private-beta gate + waitlist

2.4 The agent (Go 1.24, pure Go, no CGO)

Single static binary installed at /usr/local/bin/kvmfleet-agent on the PiKVM. Configured via env vars (KVMFLEET_API, KVMFLEET_TOKEN, KVMFLEET_KVMD_USER, KVMFLEET_KVMD_PASS, etc.) and a small JSON state file at /var/lib/kvmfleet/state.json. Dials out to wss://app.kvmfleet.io/v1/agent/ws on boot, authenticates with the per-device token, then services multiplexed HTTP requests and WebSocket channels from the platform side.

Internal HTTP routes the agent exposes through the tunnel:

/health — agent self-info, always reachable
/api/... — proxied to local kvmd (browser console + power + ATX state)
/streamer/... — proxied to local uStreamer (video frames for legacy console)
/extras/webterm/ttyd/... — direct Unix-socket route to ttyd (serial console)
/internal/iso/{mount,unmount} — ISO library handlers (download, sha256-verify, mount via kvmd MSD)
/internal/webrtc/offer — pion-based WebRTC PeerConnection negotiation (HID DataChannel; video track in phase 2)

2.5 Postgres (16-alpine)

Single instance. Two databases on the same cluster:

Database	Used by	Notes
`kvmfleet`	platform	All operational data
`glitchtip`	GlitchTip	Error tracker (separate database)

Roles:

Role	Privilege	Used by
`kvmfleet`	superuser	Migrations, admin scripts, glitchtip's bootstrap
`kvmfleet_app`	NOSUPERUSER NOBYPASSRLS	The platform's runtime connection — RLS applies to it

Most org-scoped tables have forced row-level security with an org_iso policy keyed on current_setting('app.current_org'). The platform sets that GUC on every authenticated request via set_org_context(session, org_id). See rls.md for the full table list.

2.6 Redis (7-alpine)

Two databases on one instance:

DB	Used by
0	platform: agent presence, alert dedup, error-alert dedup
2	GlitchTip: Celery broker + result backend

No persistence is enabled: all keyed data is short-lived (presence TTL 30s, dedup TTL 1h). A Redis crash loses presence state, which the next agent heartbeat repopulates.

2.7 Caddy 2 (TLS terminator + static server + reverse proxy)

A single Caddy process serves four sites across three hostnames (the docs site is mounted under a path on the marketing host):

Host	Backed by	Purpose
`kvmfleet.io`	static `/srv/landing`	Marketing pages
`kvmfleet.io/docs/`	static `/srv/docs`	This documentation site
`app.kvmfleet.io`	reverse-proxy `platform:8000` for `/v1/*`, static `/srv/web` for the SPA	Operator dashboard + API
`errors.kvmfleet.io`	reverse-proxy `glitchtip:8000`	Internal error tracker

In addition, the legacy eurokvm.io and app.eurokvm.io hosts 308-redirect to the new domain.

Caddy auto-provisions TLS certificates via Let's Encrypt and sets HSTS with preload, a strict per-host CSP, long-lived Cache-Control: immutable on hashed assets, and a short cache lifetime on HTML.

2.8 coturn (TURN server for WebRTC)

Runs on the Hetzner host network (not in a docker bridge — TURN's UDP relay range needs raw access). Listens on UDP/TCP 3478 (STUN/TURN) and 5349 (TLS), with a relay range of 49152-49200. Uses --use-auth-secret so the platform mints ephemeral 24h credentials without a per-user database. Only used as a WebRTC fallback; most connections traverse direct host or STUN-discovered paths.
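
As a rough illustration of how ephemeral credentials can be minted without a per-user database, the sketch below follows the shared-secret scheme that coturn enables with --use-auth-secret: the username is a Unix expiry timestamp plus a label, and the password is the base64-encoded HMAC-SHA1 of that username. The function name, the label format and the returned dict shape are assumptions made for this example, not the platform's actual code.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def mint_turn_credentials(shared_secret: str, label: str,
                          ttl_seconds: int = 24 * 3600,
                          now: Optional[float] = None) -> dict:
    """Return a time-limited TURN username/credential pair.

    username   = "<unix expiry>:<label>"
    credential = base64(HMAC-SHA1(shared_secret, username))
    """
    issued_at = time.time() if now is None else now
    expiry = int(issued_at) + ttl_seconds
    username = f"{expiry}:{label}"
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    return {
        "username": username,
        "credential": base64.b64encode(digest).decode(),
        "ttl": ttl_seconds,
    }
```

Because the TURN server recomputes the same HMAC from the username it receives, rotating TURN_SHARED_SECRET invalidates every credential that is still in flight.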

2.9 GlitchTip (error tracker, Sentry-SDK compatible)

Self-hosted, EU-resident replacement for Sentry. Three docker services (glitchtip-migrate, glitchtip, glitchtip-worker) sharing the cluster's Postgres + Redis. Public registration disabled (ENABLE_USER_REGISTRATION=false). The platform's app/observability.py initialises sentry-sdk against it; the SPA initialises @sentry/react. Aggressive PII scrubbing in before_send strips JWTs, refresh tokens, Stripe keys, sensitive headers and request bodies before any event leaves the platform.
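
A before_send hook of the kind described above might look roughly like the following. sentry-sdk calls the hook with (event, hint) and sends whatever it returns; everything else here (the regular expressions, the header list, the redaction marker) is an illustrative assumption rather than the contents of app/observability.py.

```python
import re

# Illustrative patterns only; the real scrubber may match differently.
_JWT_RE = re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")
_REFRESH_RE = re.compile(r"ekvrf_[A-Za-z0-9_-]+")
_STRIPE_RE = re.compile(r"\b(?:(?:sk|rk|pk)_(?:live|test)|whsec)_[A-Za-z0-9]+")
_SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}
_REDACTED = "[redacted]"


def _scrub(value):
    """Recursively replace token-looking substrings in strings, dicts and lists."""
    if isinstance(value, str):
        for pattern in (_JWT_RE, _REFRESH_RE, _STRIPE_RE):
            value = pattern.sub(_REDACTED, value)
        return value
    if isinstance(value, dict):
        return {key: _scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        return [_scrub(item) for item in value]
    return value


def before_send(event, hint):
    """Drop request bodies, mask sensitive headers, then scrub all strings."""
    request = event.get("request")
    if isinstance(request, dict):
        request.pop("data", None)  # never forward request bodies
        headers = request.get("headers")
        if isinstance(headers, dict):
            request["headers"] = {
                name: (_REDACTED if name.lower() in _SENSITIVE_HEADERS else value)
                for name, value in headers.items()
            }
    return _scrub(event)
```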

2.10 External services

Service	Used for	Region
Stripe Payments Europe Ltd.	Checkout, Customer Portal, recurring subscriptions, dunning	Ireland
Brevo (ex-Sendinblue)	Outbound transactional email (welcome, invite, password reset, alerts, signup notification)	France
ImprovMX	Inbound email forwarding for `*@kvmfleet.io` aliases (privacy@, security@, support@, etc.)	Belgium
Google Cloud Identity Platform	OIDC SSO	Multi-region; we use the EU OAuth client
Hetzner Online GmbH	Compute (CX22), DNS lookups, network egress	Germany (Falkenstein)
INWX	Domain registrar + authoritative DNS for `kvmfleet.io`	Germany
Cloudflare STUN	NAT discovery for WebRTC (`stun.cloudflare.com:3478`)	global anycast
npmjs.com	Distribution for `@kvmfleet/mcp`	US (read-only on customer side)

Each is disclosed as a sub-processor in the DPA.

3. Hosting topology

                               Hetzner CX22  (FSN1, Falkenstein DE)
                       ┌─────────────────────────────────────────────────┐
                       │                                                 │
                       │  ┌────────────┐  ┌────────────┐  ┌───────────┐  │
                       │  │  Caddy 2   │  │  platform  │  │ glitchtip │  │
                       │  │  TLS / SNI │  │  FastAPI   │  │  + worker │  │
                       │  └─────┬──────┘  └─────┬──────┘  └─────┬─────┘  │
                       │        │               │               │         │
                       │        │               ▼               │         │
                       │        │      ┌────────────────┐       │         │
                       │        │      │  Postgres 16   │◄──────┘         │
                       │        │      │  + Redis 7     │                 │
                       │        │      └────────────────┘                 │
                       │        │                                         │
                       │  ┌─────┴──────┐  ┌────────────┐                  │
                       │  │ web-build  │  │  coturn    │                  │
                       │  │ (one-shot) │  │  net=host  │                  │
                       │  └────────────┘  └────────────┘                  │
                       │                                                  │
                       └──────────────┬───────────────────────────────────┘
                                      │
                                      │ public IP 46.225.227.71
                                      │ TCP 443 (TLS), 80 (LE), 22 (ssh)
                                      │ UDP 3478, 5349, 49152-49200 (TURN)
                                      ▼
                          ┌────────────────────────────┐
                          │      THE INTERNET          │
                          └────────────────────────────┘
                            │              │            │
                            ▼              ▼            ▼
                 ┌──────────────┐ ┌─────────────────┐ ┌───────────┐
                 │  Operator    │ │   PiKVM agent   │ │ Customer  │
                 │  browser     │ │ (anywhere)      │ │ stripe /  │
                 │  (anywhere)  │ │ outbound WSS    │ │ google    │
                 └──────────────┘ └─────────────────┘ │ callbacks │
                                                      └───────────┘

One CX22 (€3.99/mo, 2 vCPU, 4 GB RAM, 40 GB NVMe) hosts everything: Caddy, the platform, Postgres, Redis, GlitchTip, coturn, the static SPA bundle, the docs site, the marketing site. There is no horizontal scaling today; that's a deliberate choice for a pre-revenue product. The path to scale is documented in the open-questions section at the bottom.

4. Walkthroughs of the major data flows

4.1 Local login

Browser POSTs /v1/auth/login with {email, password}.
Platform looks up the user (RLS off — app.current_org not yet set), bcrypt-verifies the password, checks account lockout state.
If TOTP is enabled, returns a short-lived mfa_token (5 min, audience mfa); otherwise issues an access JWT (15 min, audience access) + a refresh token (random opaque ekvrf_… string, sha256-hashed and stored in refresh_tokens).
Refresh token returned via HttpOnly Secure SameSite=Strict cookie scoped to /v1/auth. Access token returned in JSON body.
SPA stores the access token in memory and sends it as Authorization: Bearer … on every API call.

Refresh: the SPA POSTs /v1/auth/refresh (cookie auto-attached). The platform sha256-hashes the cookie value and looks up the row. If that row has already been rotated out, the request is treated as reuse: the platform returns 401 and revokes the entire token family. Otherwise it rotates: it marks the current row as replaced and issues a new access + refresh pair.

4.2 Google SSO

Browser navigates to /v1/auth/google → Authlib redirects to Google with the configured redirect_uri.
Google authenticates the user, redirects back to /v1/auth/google/callback?code=….
Platform exchanges the code for an ID token, validates email_verified=true, looks up or creates the user. If a non-expired invite exists for that email, joins them to the inviter's org; otherwise creates a fresh personal org.
Issues access + refresh tokens (same as local login). Redirects to https://app.kvmfleet.io/login#access_token=… so the SPA picks them up.
New users trigger a one-shot AcceptTerms gate before they can use the dashboard (Terms + B2B confirmation).

4.3 Device enrolment

Operator clicks "Add device" in the SPA → POST /v1/devices/enrollment with optional suggested_name. Platform issues an enrollment_token (random, 24h TTL) bound to the org.
SPA shows the operator a one-line install command: curl -sSL https://app.kvmfleet.io/install | KVMFLEET_TOKEN=<plaintext> sh.
Operator runs the command on the PiKVM via SSH. The install script downloads the agent binary from /downloads/kvmfleet-agent.linux-arm64, writes a systemd unit, sets KVMFLEET_TOKEN in /etc/kvmfleet/agent.env, starts the service.
Agent boots, POSTs /v1/agent/register with the enrollment token + hardware ID + agent version + kvmd version. Platform consumes the enrollment token, creates a devices row, returns a long-lived per-device auth_token (sha256-hashed at rest in devices.auth_token_hash).
Agent persists the auth_token to /var/lib/kvmfleet/state.json and switches to the long-lived authenticated WebSocket: wss://app.kvmfleet.io/v1/agent/ws?token=<auth_token>.
Heartbeats every 30s update devices.last_seen_at, cpu_temp_c, uptime_s, and append a device_metrics row for the historical-data graph.
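
The token hand-off in the enrolment steps above can be sketched with an in-memory store. This is a simplified model under stated assumptions: the class and method names are hypothetical, dictionaries stand in for the enrollment_tokens and devices tables, and hashing the enrollment token at rest is an assumption (the source only states that the per-device auth_token is stored as a SHA-256 hash).

```python
import hashlib
import secrets
import time
from typing import Dict, Optional


def sha256_hex(token: str) -> str:
    return hashlib.sha256(token.encode()).hexdigest()


class EnrollmentStore:
    """In-memory stand-in for enrollment_tokens and devices (illustrative)."""

    def __init__(self, ttl_seconds: int = 24 * 3600) -> None:
        self.ttl_seconds = ttl_seconds
        self.enrollment: Dict[str, dict] = {}  # token hash -> {org_id, expires_at}
        self.devices: Dict[str, dict] = {}     # auth_token_hash -> device row

    def issue(self, org_id: str, now: Optional[float] = None) -> str:
        """Create an org-bound enrollment token and return its plaintext once."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self.enrollment[sha256_hex(token)] = {
            "org_id": org_id,
            "expires_at": now + self.ttl_seconds,
        }
        return token

    def register(self, token: str, hardware_id: str,
                 now: Optional[float] = None) -> Optional[str]:
        """Consume the enrollment token and mint a long-lived device token."""
        now = time.time() if now is None else now
        row = self.enrollment.pop(sha256_hex(token), None)  # one-shot: always consumed
        if row is None or row["expires_at"] <= now:
            return None
        auth_token = "ekvt_" + secrets.token_urlsafe(32)
        self.devices[sha256_hex(auth_token)] = {
            "org_id": row["org_id"],
            "hardware_id": hardware_id,
        }
        return auth_token

    def authenticate(self, auth_token: str) -> Optional[dict]:
        """Look up a device by the hash of the token it presents."""
        return self.devices.get(sha256_hex(auth_token))
```

Only hashes are kept on the server side in this model, so a database leak would not directly yield usable device tokens.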

4.4 Browser console (legacy iframe path)

Operator clicks "Console" → SPA POSTs /v1/console/start for a console_session row + a 60-min aud=console JWT in an HttpOnly cookie scoped to /v1/devices/{id}/console/.
SPA opens an <iframe> to https://app.kvmfleet.io/v1/devices/{id}/console/ (which serves the kvmd web UI through the proxy).
Every iframe HTTP request hits the platform's routers/console.py proxy. The proxy:
Validates the console JWT cookie and checks that the device belongs to the caller's org
Strips kvmfleet's own cookies from the outbound Cookie header
Path-allowlists requests (kvmd's /api/, /streamer/, /share/, /login, /logout, /auth/)
Rewrites response HTML/JS so resource URLs resolve under /v1/devices/{id}/console/…
Rewrites Set-Cookie attributes (Path, Domain, Secure)
Platform forwards the request through the agent's WebSocket tunnel (multiplexed channel; see ws-multiplex.md). Agent serves the request locally from kvmd, returns the response.
WebSocket subscriptions for the live video stream use a multiplexed WS channel through the same tunnel, with extra Origin checks at handshake.

This whole path is being retired in favour of the WebRTC console (next section).
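
The path allowlist and cookie handling performed by the proxy can be sketched as three small helpers. This is a minimal illustration, not the code in routers/console.py: the cookie names are hypothetical, and treating /login and /logout as exact matches (rather than prefixes) is an assumption.

```python
ALLOWED_PREFIXES = ("/api/", "/streamer/", "/share/", "/auth/")
ALLOWED_EXACT = {"/login", "/logout"}
PLATFORM_COOKIES = {"kvmfleet_console", "kvmfleet_refresh"}  # hypothetical names


def is_allowed_path(path: str) -> bool:
    """Allow only known kvmd paths and refuse dot-dot traversal segments."""
    if ".." in path.split("/"):
        return False
    return path in ALLOWED_EXACT or path.startswith(ALLOWED_PREFIXES)


def strip_platform_cookies(cookie_header: str) -> str:
    """Remove the platform's own cookies before forwarding to kvmd."""
    kept = []
    for part in cookie_header.split(";"):
        name = part.split("=", 1)[0].strip()
        if name and name not in PLATFORM_COOKIES:
            kept.append(part.strip())
    return "; ".join(kept)


def rewrite_set_cookie(set_cookie: str, device_id: str) -> str:
    """Re-scope a kvmd Set-Cookie value to the per-device console path."""
    attrs = [attr.strip() for attr in set_cookie.split(";") if attr.strip()]
    if not attrs:
        return set_cookie
    rewritten = [attrs[0]]  # the name=value pair stays untouched
    for attr in attrs[1:]:
        key = attr.split("=", 1)[0].strip().lower()
        if key in ("path", "domain", "secure"):
            continue  # replaced below
        rewritten.append(attr)
    rewritten += [f"Path=/v1/devices/{device_id}/console/", "Secure"]
    return "; ".join(rewritten)
```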

4.5 Browser console (WebRTC, preview, support-admin only)

Operator clicks "Console (RTC)". SPA fetches /v1/devices/{id}/webrtc/ice-config to get STUN + ephemeral TURN credentials.
SPA creates an RTCPeerConnection with those ICE servers, attaches a recvonly video transceiver and an unordered HID DataChannel.
SPA generates an SDP offer, waits 2s for ICE gathering, POSTs the offer to /v1/devices/{id}/webrtc/offer.
Platform proxies the offer through the agent's WebSocket tunnel to the agent's /internal/webrtc/offer.
Agent (pion v4) accepts the offer, builds its own PeerConnection with the same ICE servers, creates an SDP answer, gathers ICE for up to 5s, returns the answer JSON.
SPA applies the remote description. ICE traversal runs concurrently on both sides — direct host pair, STUN-discovered pair, or TURN-relayed pair (whichever wins).
After ICE settles, DTLS-SRTP handshake runs end-to-end between browser and agent. The platform plays no role from this point on.
HID events flow as JSON strings over the DataChannel. The agent forwards each one as an HTTP POST to kvmd's /api/hid/events/send_… endpoints using the same Basic auth and session cookies as the rest of the proxy.
Video track is currently a placeholder (recvonly, no source bound) — Phase 2 wires it to kvmd's H.264 stream (/streamer/stream).

The signalling pipe is platform-mediated; the media pipe is end-to-end DTLS-SRTP encrypted. The platform sees SDPs (~10 KB) and ICE candidate metadata; it cannot decrypt media.

4.6 ISO mount

Operator registers an ISO at app.kvmfleet.io/isos: name, source URL (HTTPS), sha256, optional size. Stored in isos table with FORCE RLS.
Operator opens a device, picks an ISO, clicks "Mount" → POST /v1/devices/{id}/iso:mount. Platform audit-logs before firing, rate-limits to 1 mount per device per hour.
Platform forwards {name, source_url, sha256, size_bytes, media_type} to the agent's /internal/iso/mount.
Agent streams the URL to a tmp file with concurrent SHA-256 hashing. Refuses to proceed on hash or size mismatch.
Agent disconnects any existing MSD, multipart-uploads to kvmd's /api/msd/write?image=<name>, calls /api/msd/set_params and /api/msd/set_connected=1.
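ISO bytes never pass through the platform, so platform egress stays flat and no image data is processed or stored on the platform side, which keeps the GDPR footprint small.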
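
The agent itself is written in Go, but the download-and-verify step is easy to illustrate in Python. The sketch below hashes each chunk as it is written to a temporary file and rejects the result on a size or digest mismatch. The function and exception names are hypothetical, and a file-like object stands in for the HTTPS response body.

```python
import hashlib
import os
import tempfile
from typing import BinaryIO, Optional


class IsoVerificationError(Exception):
    """Raised when the downloaded image does not match its declared metadata."""


def download_and_verify(source: BinaryIO, expected_sha256: str,
                        expected_size: Optional[int] = None,
                        chunk_size: int = 1024 * 1024) -> str:
    """Stream source to a temp file while hashing; return the temp file path."""
    hasher = hashlib.sha256()
    written = 0
    tmp = tempfile.NamedTemporaryFile(prefix="iso-", suffix=".part", delete=False)
    try:
        with tmp:
            while True:
                chunk = source.read(chunk_size)
                if not chunk:
                    break
                hasher.update(chunk)  # hash and write in the same pass
                tmp.write(chunk)
                written += len(chunk)
                if expected_size is not None and written > expected_size:
                    raise IsoVerificationError("image is larger than declared")
        if expected_size is not None and written != expected_size:
            raise IsoVerificationError("image size mismatch")
        if hasher.hexdigest() != expected_sha256.lower():
            raise IsoVerificationError("sha256 mismatch")
    except Exception:
        os.unlink(tmp.name)  # never leave an unverified image behind
        raise
    return tmp.name
```

Hashing during the write avoids a second read of a multi-gigabyte file on a device with slow storage.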

4.7 Compliance report generation

Operator picks a framework (NIS2, GDPR, SOC 2, ISO 27001, HIPAA, NIST 800-53, PCI DSS, ISO 9001) → POST /v1/reports/{framework} with optional from_date / to_date.
Platform's _collect_evidence runs a single async pass that queries device counts, audit-event counts by action, hash-chain integrity, member roster, failed logins, 2FA failures, password resets, refresh-token reuse and other org-scoped facts.
The framework spec maps each control to an evaluator function that classifies the control as COVERED / PARTIAL / NOT COVERED / N/A based on the evidence.
ReportLab renders the PDF: cover sheet, executive summary with coverage roll-up, per-control table with status badges, framework-specific sections (Art. 30 ROPA template for GDPR, CUECs for SOC 2, Annex A theme breakdown for ISO 27001, etc.), per-framework "this is not an attestation" disclaimer.
PDF returned with Content-Disposition: attachment; filename=kvmfleet-{framework}-{from}-{to}.pdf. Audit row written.
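
The mapping from controls to evaluator functions could be structured along these lines. Everything specific in this sketch is hypothetical: the Evidence fields, the control identifiers and the two evaluators are invented for illustration and do not reflect the real framework specs or the output of _collect_evidence.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Status(str, Enum):
    COVERED = "COVERED"
    PARTIAL = "PARTIAL"
    NOT_COVERED = "NOT COVERED"
    NOT_APPLICABLE = "N/A"


@dataclass(frozen=True)
class Evidence:
    """A tiny, made-up subset of the org-scoped facts a report would collect."""
    device_count: int
    audit_chain_ok: bool
    members_total: int
    members_with_2fa: int


Evaluator = Callable[[Evidence], Status]


def eval_audit_integrity(ev: Evidence) -> Status:
    return Status.COVERED if ev.audit_chain_ok else Status.NOT_COVERED


def eval_mfa(ev: Evidence) -> Status:
    if ev.members_total == 0:
        return Status.NOT_APPLICABLE
    if ev.members_with_2fa == ev.members_total:
        return Status.COVERED
    return Status.PARTIAL if ev.members_with_2fa > 0 else Status.NOT_COVERED


# Hypothetical control ids; a real spec would carry one entry per control.
EXAMPLE_SPEC: Dict[str, Evaluator] = {
    "LOG-1": eval_audit_integrity,
    "AUTH-1": eval_mfa,
}


def evaluate(spec: Dict[str, Evaluator], evidence: Evidence) -> Dict[str, Status]:
    """Run every control's evaluator against the same evidence snapshot."""
    return {control: evaluator(evidence) for control, evaluator in spec.items()}


def coverage_rollup(results: Dict[str, Status]) -> Dict[Status, int]:
    """Count controls per status for the executive summary."""
    return dict(Counter(results.values()))
```

Collecting evidence once and passing the same snapshot to every evaluator keeps the per-control logic pure and easy to test.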

4.8 Audit log + tamper-evidence

Every meaningful action (login, console open, power cycle, ISO mount, settings change, role change, deletion, etc.) writes a row to audit_events. Each row stores a SHA-256 hash computed over the previous row's hash together with the canonical JSON of its own payload, which links the rows into a cryptographic chain. See audit-chain.md.

The table is append-only at the DB level:

REVOKE UPDATE, DELETE, TRUNCATE ON audit_events FROM kvmfleet_app so the runtime role can't tamper with history.
A BEFORE UPDATE OR DELETE trigger raises an exception even from a superuser, defending against accidents.
Account deletion's pseudonymisation pass runs inside a SAVEPOINT and is expected to fail against this table; the deletion itself still proceeds and the historical email stays in the audit log. The Privacy Policy carves this out under the immutable-log clause.

Operators verify the chain via GET /v1/audit/integrity, which re-walks the org's events and returns {ok: true, checked: N} or {ok: false, first_break_id: …, message: …}. The verification is also exposed through the MCP verify_audit_integrity tool.
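
A minimal model of the chain and of the integrity walk is shown below. The exact canonicalisation, the genesis value and the row layout are assumptions made for this sketch (audit-chain.md is the authoritative description); the result shapes mirror the two responses quoted above.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # assumed previous-hash value for the first row


def canonical_json(payload: dict) -> str:
    """Serialise deterministically: sorted keys, no insignificant whitespace."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def row_hash(prev_hash: str, payload: dict) -> str:
    material = prev_hash + canonical_json(payload)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def append_event(chain: List[dict], payload: dict) -> dict:
    """Append a row whose hash commits to the previous row and its own payload."""
    prev = chain[-1]["hash"] if chain else GENESIS
    row = {
        "id": len(chain) + 1,
        "prev_hash": prev,
        "payload": payload,
        "hash": row_hash(prev, payload),
    }
    chain.append(row)
    return row


def verify_chain(chain: List[dict]) -> dict:
    """Re-walk the chain and report the first row that does not verify."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return {"ok": False, "first_break_id": row["id"], "message": "hash mismatch"}
        prev = row["hash"]
    return {"ok": True, "checked": len(chain)}
```

Editing any historical payload changes that row's recomputed hash, so the walk fails at the edited row unless every later row is rewritten as well.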

Optionally, an org can configure SIEM webhooks (audit_webhooks table). An async dispatcher fans out HMAC-SHA256-signed POSTs (X-KVMFleet-Signature: sha256=…) to each registered URL. After 10 consecutive failures the webhook auto-disables.
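
For SIEM integrators, signing and verifying a delivery might look like the following sketch. The header name and the sha256= prefix come from the description above; the helper names and the WebhookHealth counter class are hypothetical, and resetting the counter on a successful delivery is an assumption implied by the word "consecutive".

```python
import hashlib
import hmac

SIGNATURE_HEADER = "X-KVMFleet-Signature"


def sign_body(secret: bytes, body: bytes) -> str:
    """Return the header value for a raw request body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Receiver side: recompute the HMAC and compare in constant time."""
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected.encode(), header_value.encode())


class WebhookHealth:
    """Tracks consecutive delivery failures and disables the hook at the limit."""

    def __init__(self, max_failures: int = 10) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0
        self.enabled = True

    def record(self, delivered: bool) -> None:
        if delivered:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.enabled = False
```

The receiver should verify against the raw bytes it received, since re-serialising the JSON can change whitespace or key order and break the signature.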

4.9 Per-device monitoring

Every agent heartbeat (~30s) writes a device_metrics row: (device_id, org_id, ts, cpu_temp_c, uptime_s, online).
SPA's DeviceMonitoring page calls GET /v1/devices/{id}/metrics?hours=&max_points= for the selected range.
Platform returns raw rows when count ≤ max_points; otherwise downsamples server-side via epoch / bucket_seconds grouping with AVG() aggregation, so the chart never has to render more than ~200 points regardless of range.
Recharts renders CPU temp + uptime line charts with hover tooltips and adaptive x-axis tick formatting (HH:MM for ≤24h, day+time for ≤7d, date for >7d).
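
The server-side downsampling is done in SQL, but the bucketing arithmetic is the same as in this Python sketch: divide each epoch timestamp by the bucket width, group, and average. How the platform derives bucket_seconds is not stated; computing it as the requested range divided by max_points is an assumption, as are the function name and row shape.

```python
import math
from collections import defaultdict
from typing import Dict, List


def downsample(rows: List[dict], range_seconds: int, max_points: int = 200) -> List[dict]:
    """Return rows unchanged when few enough, else one averaged row per bucket.

    Each input row needs "ts" (epoch seconds), "cpu_temp_c" and "uptime_s".
    """
    if len(rows) <= max_points:
        return list(rows)
    bucket_seconds = max(1, math.ceil(range_seconds / max_points))
    buckets: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        buckets[int(row["ts"]) // bucket_seconds].append(row)
    result = []
    for bucket in sorted(buckets):
        group = buckets[bucket]
        result.append({
            "ts": bucket * bucket_seconds,  # start of the bucket
            "cpu_temp_c": sum(r["cpu_temp_c"] for r in group) / len(group),
            "uptime_s": sum(r["uptime_s"] for r in group) / len(group),
        })
    return result
```

With 30-second heartbeats, a 24-hour range holds about 2,880 raw rows; a 432-second bucket collapses them to roughly 200 averaged points.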

4.10 Stripe billing

Operator clicks "Upgrade" → POST /v1/billing/checkout {plan}. Platform creates a Stripe Checkout session with the price id mapped from the plan, sets metadata {org_id, plan}, returns the Checkout URL.
Stripe collects payment, redirects back to the SPA. Platform receives checkout.session.completed at /v1/billing/webhook.
Webhook is HMAC-verified against STRIPE_WEBHOOK_SECRET. Idempotency: every event_id is inserted into stripe_events with ON CONFLICT DO NOTHING; duplicate posts return {duplicate: true} without re-processing.
On checkout.session.completed the platform flips org.plan + org.max_devices and stores stripe_subscription_id. On invoice.payment_failed it sets payment_failed_at (starting the 7-day dunning clock). On invoice.paid or invoice.payment_succeeded it clears payment_failed_at. On customer.subscription.deleted it downgrades to free and clears all billing fields.
A janitor sweeps every 60s: orgs with payment_failed_at < NOW() - 7 days get auto-downgraded to free.

Soft enforcement: an over-cap org's existing devices keep working; new enrolment returns 402 Payment Required.
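
The idempotency check and the dunning cut-off can be illustrated with an in-memory SQLite database, which accepts the same ON CONFLICT ... DO NOTHING clause (SQLite 3.24 or newer). This is a sketch, not the production handler: the platform uses Postgres, and the function names and return shapes other than {duplicate: true} are assumptions.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import Optional

DUNNING_WINDOW = timedelta(days=7)


def open_store() -> sqlite3.Connection:
    """Create a throwaway stand-in for the stripe_events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stripe_events (event_id TEXT PRIMARY KEY)")
    return conn


def record_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Insert the event id; return False when it was already recorded."""
    cur = conn.execute(
        "INSERT INTO stripe_events (event_id) VALUES (?) "
        "ON CONFLICT (event_id) DO NOTHING",
        (event_id,),
    )
    return cur.rowcount == 1


def handle_webhook(conn: sqlite3.Connection, event: dict) -> dict:
    """Process each Stripe event at most once."""
    if not record_event(conn, event["id"]):
        return {"duplicate": True}
    return {"processed": event["type"]}


def should_downgrade(payment_failed_at: Optional[datetime], now: datetime) -> bool:
    """Janitor predicate: the dunning clock has run for more than seven days."""
    return payment_failed_at is not None and payment_failed_at < now - DUNNING_WINDOW
```

Letting the unique constraint arbitrate duplicates avoids a separate read-then-write race when Stripe retries a delivery concurrently.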

5. Authentication and authorization

5.1 Token types and audiences

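Every JWT carries an explicit aud claim. Decoding always passes the expected audience, so a console token can never be replayed where an access token is expected. The refresh and api rows in the table below are opaque random tokens stored as SHA-256 hashes rather than JWTs; they are listed alongside the JWT audiences for completeness.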

Audience	TTL	Use
`access`	15 min	API calls from the SPA
`refresh`	7 days	Mint new access tokens (HttpOnly cookie)
`mfa`	5 min	Carry user identity through TOTP challenge
`reset`	1 hour	Password reset flow (legacy; the new flow uses opaque tokens in `password_resets`)
`console`	60 min	Iframe console session (HttpOnly cookie scoped to `/v1/devices/{id}/console/`)
`api`	configurable (no expiry / 30/90/180/365 days)	Long-lived programmatic tokens prefixed `kvmf_…`, sha256-hashed in `api_tokens`
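
To make the audience check concrete, here is a deliberately small HS256 encoder and decoder built on the standard library. It is for illustration only: the signing algorithm is an assumption (the source only says tokens are signed with JWT_SECRET), the function names are hypothetical, and production code would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


class TokenError(Exception):
    pass


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> str:
    return _b64url(hmac.new(secret, signing_input.encode(), hashlib.sha256).digest())


def encode_token(secret: bytes, sub: str, aud: str, ttl_seconds: int,
                 now: Optional[float] = None) -> str:
    issued = int(time.time() if now is None else now)
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64url(json.dumps({"sub": sub, "aud": aud, "exp": issued + ttl_seconds}).encode())
    return f"{header}.{claims}.{_sign(secret, header + '.' + claims)}"


def decode_token(secret: bytes, token: str, expected_aud: str,
                 now: Optional[float] = None) -> dict:
    """Verify signature, audience and expiry; the audience is never optional."""
    current = int(time.time() if now is None else now)
    parts = token.split(".")
    if len(parts) != 3:
        raise TokenError("malformed token")
    header_b64, claims_b64, signature = parts
    expected = _sign(secret, header_b64 + "." + claims_b64)
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        raise TokenError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise TokenError("unexpected algorithm")
    claims = json.loads(_b64url_decode(claims_b64))
    if claims.get("aud") != expected_aud:
        raise TokenError("wrong audience")
    if claims.get("exp", 0) <= current:
        raise TokenError("expired")
    return claims
```

Making expected_aud a required argument means a caller cannot forget the check, which is the property the paragraph above relies on.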

5.2 RBAC

Four roles per OrgUser:

Role	Powers
`org_admin`	Everything: members, billing, devices, settings
`operator`	Console + power + ISO mount + audit-read
`auditor`	Read-only access to audit, sessions, devices
`read_only`	List-only access to devices and audit

In addition, a global users.is_support_admin boolean (currently true only for the platform operator's two personal accounts) gates the cross-tenant admin surface (/v1/admin/*) and the WebRTC preview button.
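
One simple way to encode the role table is a static mapping from role to permission set. The permission strings below are hypothetical labels chosen to mirror the "Powers" column literally; the real platform may slice permissions differently.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Role(str, Enum):
    ORG_ADMIN = "org_admin"
    OPERATOR = "operator"
    AUDITOR = "auditor"
    READ_ONLY = "read_only"


# Hypothetical permission names derived from the role table.
_OPERATOR = frozenset({"console.open", "power.control", "iso.mount", "audit.read"})
_AUDITOR = frozenset({"audit.read", "sessions.read", "devices.read"})
_READ_ONLY = frozenset({"devices.list", "audit.list"})
_ADMIN_ONLY = frozenset({"members.manage", "billing.manage",
                         "devices.manage", "settings.manage"})

ROLE_POWERS: Dict[Role, FrozenSet[str]] = {
    Role.ORG_ADMIN: _ADMIN_ONLY | _OPERATOR | _AUDITOR | _READ_ONLY,  # everything
    Role.OPERATOR: _OPERATOR,
    Role.AUDITOR: _AUDITOR,
    Role.READ_ONLY: _READ_ONLY,
}


def can(role: Role, permission: str) -> bool:
    """Org-scoped check: does this role include the permission?"""
    return permission in ROLE_POWERS[role]


def can_use_admin_surface(is_support_admin: bool) -> bool:
    """The cross-tenant surface depends on the global flag, not on any org role."""
    return is_support_admin
```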

5.3 Tenant isolation (RLS)

Every org-scoped table enables FORCE ROW LEVEL SECURITY with an org_iso policy. The runtime role kvmfleet_app has NOSUPERUSER NOBYPASSRLS so the policy can't be bypassed by setting row_security = off at session level.

Each authenticated request flows through authed_principal(), which calls set_org_context(session, org_id) to set the app.current_org GUC. Policies key off current_setting('app.current_org', true)::uuid. A handful of "system mode" tables (e.g. devices) treat the unset GUC as the zero UUID and allow all rows — needed for the agent heartbeat path which has no org context.

Cross-tenant admin reads (e.g. the support-admin tickets view) flip a separate app.bypass_rls GUC inside an explicit helper, with the policy's OR clause checking it. See rls.md for the table list and policy text.
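
A sketch of what setting the org context involves is shown below. The policy text is an assumed example for the isos table (the org_id column name and the exact form of the bypass clause are guesses; rls.md has the real text). The helper only builds the statement: it uses Postgres's set_config function with is_local = true so that the value lasts for the current transaction only, which is also an assumption, and the %s placeholder follows psycopg-style drivers.

```python
import uuid
from typing import Tuple

# Assumed policy shape, shown for context only.
EXAMPLE_POLICY_SQL = """
ALTER TABLE isos ENABLE ROW LEVEL SECURITY;
ALTER TABLE isos FORCE ROW LEVEL SECURITY;
CREATE POLICY org_iso ON isos
    USING (
        org_id = current_setting('app.current_org', true)::uuid
        OR current_setting('app.bypass_rls', true) = 'on'
    );
"""


def org_context_statement(org_id: str) -> Tuple[str, Tuple[str, ...]]:
    """Build the parametrised statement that sets app.current_org.

    The value is parsed as a UUID first, so malformed input raises ValueError
    before anything reaches the database.
    """
    normalized = str(uuid.UUID(org_id))
    return ("SELECT set_config('app.current_org', %s, true)", (normalized,))
```

A caller would execute the returned statement at the start of each request's transaction, before any org-scoped query runs.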

5.4 Refresh-token rotation chain

refresh_tokens rows are linked into a tree by parent_id. On every successful refresh:

The presented row's revoked_at is set to NOW.
A new row is inserted with parent_id = old.id and a fresh hash.
The new plaintext goes back to the client in the cookie.

If a previously-replaced token is presented again (i.e. theft + replay), the entire family (every row sharing the root parent) is revoked atomically and the user is forced to log in. Audit row written for forensics.
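
Rotation with reuse detection can be modelled in a few lines. In this sketch a dictionary stands in for refresh_tokens, a boolean stands in for revoked_at, and each row carries a denormalised family key (the hash of the root token) so that the whole family can be revoked without walking parent_id links; that key and all names are assumptions made for illustration.

```python
import hashlib
import secrets
from typing import Dict, Optional


class ReuseDetected(Exception):
    """Raised when an already-rotated refresh token is presented again."""


def _hash(token: str) -> str:
    return hashlib.sha256(token.encode()).hexdigest()


class RefreshStore:
    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}  # token hash -> row

    def _insert(self, user_id: str, parent: Optional[str], family: Optional[str]) -> str:
        token = "ekvrf_" + secrets.token_urlsafe(32)
        digest = _hash(token)
        self.rows[digest] = {
            "user_id": user_id,
            "parent_id": parent,
            "family": family or digest,  # a root token starts its own family
            "revoked": False,
        }
        return token

    def issue(self, user_id: str) -> str:
        """Start a new family at login."""
        return self._insert(user_id, None, None)

    def rotate(self, token: str) -> str:
        """Exchange a live token for its successor, or revoke the family on reuse."""
        digest = _hash(token)
        row = self.rows.get(digest)
        if row is None:
            raise KeyError("unknown refresh token")
        if row["revoked"]:
            for other in self.rows.values():
                if other["family"] == row["family"]:
                    other["revoked"] = True
            raise ReuseDetected("refresh token reuse; family revoked")
        row["revoked"] = True
        return self._insert(row["user_id"], digest, row["family"])
```

After a replay, even the newest token in the family stops working, so both the thief and the legitimate user must log in again.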

6. Trust model

KVM Fleet is a centralised SaaS. The platform is in the data path for almost everything; the main exception is the new WebRTC media flow.

What the platform sees:

Operator email addresses, display names, hashed passwords (bcrypt cost 12), TOTP secrets (plaintext in DB), bcrypt-hashed recovery codes
Device names, hardware IDs, agent versions, last-seen timestamps, CPU temp + uptime time-series
Console session metadata (start/end, viewer, device)
Every audit-event row (action, target, result, IP, timestamp)
Live console traffic (legacy iframe path) — every video frame and every keystroke pass through the platform
ISO catalogue metadata (name, URL, sha256) — but never the bytes of the ISOs themselves
Stripe customer IDs, subscription IDs, plan, billing-failed timestamps — never card numbers
For the WebRTC console: SDP offers/answers (~10 KB each), ICE candidate metadata; not the media bytes (DTLS-SRTP encrypted end-to-end)

What the platform does not see:

kvmd's actual administrator password — set at agent install time via KVMFLEET_KVMD_PASS, known only to the agent on disk in /etc/kvmfleet/agent.env (mode 0600)
Plaintext payment cards — handled entirely by Stripe
Plaintext refresh tokens — only sha256 hashes after the response leaves
Anything inside the target server itself — the OS the operator is managing has no agent, no introspection, no telemetry

What an attacker who fully compromises the platform can and cannot do:

Mint JWTs for any user (they have JWT_SECRET)
Inject themselves into any active console session — see the screen, send keystrokes
Push commands through every connected agent's tunnel (e.g. mount a malicious ISO, power-cycle a server)
Read every audit row, every email, every device name, and all console-session metadata
Stop new audit rows from being written (existing chain stays tamper-evident; gaps would be visible to an integrity check)
Cannot decrypt past WebRTC sessions retroactively (we don't record media)
Cannot silently modify past audit records (the SHA-256 chain breaks visibly under any mutation)

This is the same trust model as any centralised IT-management SaaS. The honest framing is "you trust us with live device control + audit history; you don't trust us with the OS inside the target server, its credentials, or its data".

Hardening currently in place:

bcrypt cost 12 on passwords + recovery codes
Short-lived access JWTs (15 min) with explicit audience claims
Refresh-token rotation with reuse detection + family revocation
Audit chain immutability enforced at the DB role layer + a refusal trigger
Postgres FORCE RLS so unscoped queries return zero rows
Console-token TTL 60 min, scoped to a single device path
Per-device rate limits on power (60s) and ISO mount (1h)
Containers run as non-root, capability-dropped, no-new-privileges
Strict CSP per host, HSTS preload, signed signup acceptance + B2B confirmation
Stripe webhook idempotency keyed by event_id
Sub-resource error capture via GlitchTip with PII scrubbing in before_send

Open hardening still pending (tracked in TODO):

Sign agent binaries (currently curl | sh with no signature/checksum)
Pin the platform's TLS cert SPKI inside the agent (defends against MitM via a different valid LE issuance)
Pin kvmd's self-signed cert from the agent (currently InsecureSkipVerify: true for the local kvmd)
Mirror JWT_SECRET + DB password + Brevo + Stripe + Google secrets to a real secret store (currently in /opt/kvmfleet/deploy/.env plaintext, root-owned)
Encrypt backups with age + ship to off-site Hetzner Storage Box (currently plain .sql.gz on the same disk as the DB)
mTLS the agent → platform WebSocket with per-device certs (limits damage if platform is compromised but attacker doesn't have device key)

7. Storage

7.1 Postgres tables (one-line summaries)

Table	Purpose
`users`	Account identity, password hash, TOTP, recovery codes, support-admin flag
`orgs`	Tenant root; plan, max_devices, billing state, compliance frameworks
`org_users`	Membership join with role + optional expires_at for time-limited contractors
`invites`	Pending team invites with role + expires_at + (optional) membership_expires_at
`devices`	Enrolled PiKVMs; auth_token_hash, last_seen, latest cpu_temp + uptime
`device_metrics`	Heartbeat-driven time-series for trend graphs
`enrollment_tokens`	One-shot tokens consumed by `agent/register`
`audit_events`	SHA-256 hash-chained immutable log
`audit_webhooks`	Per-org SIEM destinations + secret + auto-disable counter
`console_sessions`	Open + historical console windows with viewer + device
`support_tickets` + `support_ticket_messages`	Customer-side ticket inbox with admin bypass clause
`alert_rules` + `alert_history`	User-defined alerting + dedup history
`refresh_tokens`	Rotation chain with `parent_id` + `revoked_at`
`password_resets`	Single-use opaque tokens, 15 min TTL
`stripe_events`	Webhook idempotency log
`beta_waitlist`	Captured emails from the marketing landing
`platform_errors`	`ErrorCaptureMiddleware` writes here for the support-admin Errors page
`api_tokens`	Long-lived `kvmf_*` tokens for MCP / scripts (sha256-hashed)
`isos`	ISO library catalogue

7.2 File / volume layout on the server

/opt/kvmfleet/                        # rsync target from operator's laptop
  agent/                              # Go source (not the binary)
  bin/                                # built agent binaries (per-arch)
  deploy/
    .env                              # production secrets (mode 0600)
    docker-compose.prod.yml
    Caddyfile
    landing/                          # served by Caddy at kvmfleet.io
      hero.mp4                        # 832 KB compressed; original kept as hero-original.mp4
      fonts/                          # Alliance No.1 will land here
    setup.sh                          # one-shot bootstrap script
    backup.sh                         # nightly pg_dump
    admin.sh                          # break-glass CLI
  platform/                           # FastAPI source
  web/                                # SPA source + public/
  mcp/                                # @kvmfleet/mcp source
  docs/                               # mkdocs source
  postgres-init/                      # init scripts for the postgres container

/var/lib/docker/volumes/              # docker-managed volumes
  deploy_pgdata                       # main + glitchtip databases
  deploy_caddy_data                   # ACME state, issued certs
  deploy_caddy_config                 # caddy state
  deploy_web_build                    # built SPA (consumed by caddy)
  deploy_docs_build                   # built docs (consumed by caddy)
  deploy_glitchtip_uploads            # GlitchTip user-uploaded source maps

On each PiKVM:

/usr/local/bin/kvmfleet-agent         # binary (arm/arm64), mode 0755
/etc/kvmfleet/agent.env               # KVMFLEET_TOKEN, KVMFLEET_KVMD_PASS, mode 0600
/var/lib/kvmfleet/state.json          # device id + auth_token, persisted across reboots
/etc/systemd/system/kvmfleet-agent.service

7.3 Secrets and where they live

Secret	Format	Stored	Rotation
`JWT_SECRET`	48-byte hex	`/opt/kvmfleet/deploy/.env`	manual; new value invalidates all tokens
`SESSION_SECRET`	48-byte hex	`/opt/kvmfleet/deploy/.env`	manual
Postgres `kvmfleet` + `kvmfleet_app` passwords	32-byte hex	`/opt/kvmfleet/deploy/.env`, `postgres-init/`	manual
Stripe live secret + webhook secret	as Stripe issues	`.env`	rotated at Stripe side, mirror here
Brevo API key	as Brevo issues	`.env`	rotated at Brevo
Google OAuth client secret	as Google issues	`.env`	rotated at Google Cloud Console
`GLITCHTIP_SECRET_KEY`	48-byte hex	`.env`	manual
`TURN_SHARED_SECRET`	32-byte hex	`.env`	manual; rotation invalidates all in-flight TURN credentials
Per-device `auth_token`	random with `ekvt_` prefix	DB hash, agent on-disk plaintext at `/var/lib/kvmfleet/state.json` (mode 0600)	per-device endpoint `/v1/devices/{id}/rotate-token`
User refresh tokens	random with `ekvrf_` prefix	DB hash, browser HttpOnly cookie	rotated on every refresh
User API tokens	random with `kvmf_` prefix	DB hash, user copies plaintext on creation	revoked via Account → API tokens
`SENTRY_DSN`, `SENTRY_DSN_WEB`	URLs	`.env` (platform), baked into SPA bundle (web)	manual via GlitchTip UI
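
The "DB hash" pattern used for the prefixed tokens in the table above can be sketched as follows. This is a minimal illustration, not the platform's code: it assumes "48-byte hex" means 48 random bytes rendered as hex, that token bodies carry 32 random bytes, and that every token type is hashed with SHA-256 (the source states this only for `api_tokens`). The function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_hex_secret(n_bytes: int = 48) -> str:
    """Return n_bytes of CSPRNG output as hex (2 * n_bytes characters)."""
    return secrets.token_hex(n_bytes)


def mint_token(prefix: str) -> tuple:
    """Return (plaintext, sha256_hex). Only the digest would be stored in the DB."""
    # Assumption: 32 random bytes in the body; the real length is not documented.
    plaintext = prefix + secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, digest


def verify_token(presented: str, stored_digest: str) -> bool:
    """Hash the presented token and compare digests in constant time."""
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)
```

The plaintext exists only at creation time (shown once to the user, or written to the agent's state file), so a database leak exposes digests rather than usable tokens.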

8. Deployment workflow

There is no GitOps. The current deployment loop:

Operator edits code on their laptop, runs tests via make test.
Operator commits + pushes to github.com/patrickattard/KVMFleet. CI runs the platform test suite, agent build matrix (amd64, arm64, arm/v7), web typecheck, MCP typecheck, security scans.
Operator runs rsync from their laptop to /opt/kvmfleet/ on Hetzner (excluding node_modules, .git, build artefacts, .env, hero.mp4).
Operator SSHes in and runs docker compose -f deploy/docker-compose.prod.yml build platform web-build, followed by docker compose -f deploy/docker-compose.prod.yml up -d --no-deps platform web-build caddy.
Migrations run automatically as the platform container's startup command (alembic upgrade head then uvicorn app.main:app …).
Caddy reads its Caddyfile through a bind mount of the single file. rsync replaces the file via rename, which leaves the container's mount pointing at the old inode, so Caddy must be restarted (not reloaded) whenever the Caddyfile changes.
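
The rename behaviour behind the Caddyfile caveat can be reproduced locally. The sketch below is only an analogy: an open file descriptor stands in for the container's bind-mount reference, and `replace_via_rename` is a hypothetical helper mimicking rsync's default write-temp-then-rename update. It assumes a POSIX filesystem.

```python
import os
import tempfile


def replace_via_rename(path: str, new_text: str) -> None:
    """Write new_text to a temp file, then rename it over path (rsync-style)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(new_text)
    os.replace(tmp, path)  # atomic rename; path now names a different inode


def demo() -> dict:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "Caddyfile")
        with open(path, "w") as f:
            f.write("old config\n")

        # Stand-in for the container's reference to the originally mounted file.
        held = open(path)
        inode_before = os.fstat(held.fileno()).st_ino

        replace_via_rename(path, "new config\n")

        inode_after = os.stat(path).st_ino
        seen_by_holder = held.read()  # still reads the old inode's content
        held.close()
        with open(path) as f:
            seen_by_path = f.read()

    return {
        "inode_changed": inode_before != inode_after,
        "holder_sees": seen_by_holder,
        "path_sees": seen_by_path,
    }
```

Anything that resolved the file before the rename keeps seeing the old content, which is why a reload inside the container picks up nothing new until the container is restarted and the mount is re-established.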

Agent builds run on the Hetzner host in a throwaway golang:1.24-alpine Docker container (make agent-linux-arm64 etc.), cross-compiling for amd64 / arm64 / arm-v7 with CGO_ENABLED=0.

9. Backup and disaster recovery

Honest state: minimal.

deploy/backup.sh runs nightly via cron: pg_dump | gzip > /opt/kvmfleet/backups/kvmfleet-YYYY-MM-DD-HHMM.sql.gz. Keeps 7 days locally, prunes older.
No off-site copy. A disk failure on the CX22 is currently a total-loss event for billing and account data.
The audit chain is logically protected against tampering but physically unprotected against disk failure.
Source code lives on GitHub (off-site copy of every file except .env).
Agent binaries on PiKVMs are not backed up — they're regenerable from source.

Documented as an open security item in TODO. Planned fix: encrypt nightly dump with age, ship via restic to a Hetzner Storage Box (€3.45/mo for 1 TB).
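
The 7-day retention rule can be sketched as follows. The real deploy/backup.sh is a shell script and may prune by file mtime instead; this illustration assumes pruning by the timestamp embedded in the kvmfleet-YYYY-MM-DD-HHMM.sql.gz filename, and `dumps_to_prune` is a hypothetical name.

```python
import re
from datetime import datetime, timedelta

# Matches the dump naming scheme described above.
NAME_RE = re.compile(r"^kvmfleet-(\d{4}-\d{2}-\d{2}-\d{4})\.sql\.gz$")


def dumps_to_prune(filenames, now, keep_days=7):
    """Return dump filenames whose embedded timestamp is older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    stale = []
    for name in filenames:
        m = NAME_RE.match(name)
        if m is None:
            continue  # never touch files that do not match the dump pattern
        taken = datetime.strptime(m.group(1), "%Y-%m-%d-%H%M")
        if taken < cutoff:
            stale.append(name)
    return sorted(stale)
```

Keying on the filename rather than mtime keeps the rule stable if dumps are later copied to another host, which resets modification times.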

10. Observability

Signal	Where it lands
Unhandled platform exceptions	`platform_errors` table + GlitchTip via `sentry-sdk` with PII scrubbing
Browser exceptions in the SPA	GlitchTip via `@sentry/react`
Agent log lines	systemd journal on each PiKVM
Docker container logs	`docker compose logs` on Hetzner
Audit-relevant business events	`audit_events` table
Heartbeat / device state	`devices` row + `device_metrics` time series
Stripe webhook deliveries	`stripe_events` + Stripe dashboard
Outbound email	Brevo dashboard
TURN relay traffic	`coturn` log to stdout

Smoke tests run hourly via GitHub Actions cron (scripts/smoke.py): /healthz, /v1/billing/plans, login bad-creds returns 401, landing page returns 200. On failure, GitHub's workflow-failure notification email goes to the operator.
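
A smoke script of this shape might look like the sketch below. It is not the contents of scripts/smoke.py. The API base URL, the login path and payload, and the expectation that /healthz and /v1/billing/plans return 200 are all assumptions; only the endpoint list, the 401 on bad credentials and the 200 on the landing page come from the description above.

```python
from typing import Callable, List, NamedTuple, Optional
import urllib.error
import urllib.request


class Check(NamedTuple):
    name: str
    method: str
    url: str
    expected_status: int
    body: Optional[bytes] = None


API = "https://api.example.invalid"  # hypothetical API base URL
LANDING = "https://kvmfleet.io/"

CHECKS = [
    Check("healthz", "GET", f"{API}/healthz", 200),
    Check("billing plans", "GET", f"{API}/v1/billing/plans", 200),
    # Hypothetical login path and payload.
    Check("login rejects bad creds", "POST", f"{API}/v1/auth/login", 401,
          b'{"email": "smoke@example.invalid", "password": "wrong"}'),
    Check("landing page", "GET", LANDING, 200),
]


def http_status(check: Check) -> int:
    """Perform the request and return the HTTP status code."""
    req = urllib.request.Request(check.url, data=check.body, method=check.method)
    if check.body is not None:
        req.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx are raised as exceptions; we want the code


def run_checks(fetch: Callable[[Check], int] = http_status) -> List[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures = []
    for check in CHECKS:
        got = fetch(check)
        if got != check.expected_status:
            failures.append(
                f"{check.name}: expected {check.expected_status}, got {got}"
            )
    return failures
```

In a cron workflow, exiting non-zero when `run_checks()` returns a non-empty list is what would mark the run as failed and trigger the notification email. Connection errors are deliberately left to propagate, since an unreachable host should also fail the run.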

There is no centralised log aggregator (no ELK, no Loki). At current scale this is a deliberate "wait until pain emerges" decision.

11. Tech stack summary

Layer	Choice	Why
Backend	FastAPI + Python 3.12	Async I/O for the WebSocket tunnel + agent fan-out; rich ecosystem for compliance / PDF / Stripe
ORM	SQLAlchemy 2.0 async + asyncpg	Mature async driver; FORCE RLS works cleanly with the per-request GUC pattern
Auth	python-jose for JWT, bcrypt for passwords, pyotp for TOTP	Standard primitives; pyjwt migration is on the security TODO
HTTP	uvicorn 0.30 with `--proxy-headers --forwarded-allow-ips '*'`	Behind Caddy; needs to trust X-Forwarded-*
DB	Postgres 16	RLS, partial indexes, full-text search if needed later
Cache / dedup	Redis 7	Presence TTL + dedup keys; no persistence needed
Frontend	React 18 + TypeScript + Tailwind + Vite	Single-page dashboard; small bundle (~250 KB gzipped after the recent compression pass)
Charts	Recharts 2.13	Declarative API; ~40 KB gzipped
Charts dependencies	none added directly	No direct D3 dependency; Recharts pulls in the D3 modules it needs
Agent	Go 1.24 + pion/webrtc/v4	Pure Go; cross-compiles trivially; static binary; pion is the canonical pure-Go WebRTC stack
Static binary HTTP	net/http stdlib + nhooyr.io/websocket	nhooyr is more idiomatic than gorilla and supports HTTP/2
MCP server	TypeScript + @modelcontextprotocol/sdk	Published as `@kvmfleet/mcp` on npmjs.com
Reverse proxy	Caddy 2	Auto-TLS, simple Caddyfile, HTTP/3 by default
TURN	coturn 4.6	Industry standard, --use-auth-secret obviates per-user DB
Error tracker	GlitchTip 4.1 (Sentry-SDK compatible, EU-hosted, MIT-licensed)	Sentry SaaS would be ~€26/mo and out-of-EU; GlitchTip self-host is free + EU-resident
Email	Brevo (transactional) + ImprovMX (inbound)	Both EU-resident; Brevo has a generous free tier; ImprovMX is free for ≤5 aliases
Billing	Stripe (Ireland)	Industry standard, EU-resident
Compliance reports	ReportLab	Pure-Python PDF generation; no headless Chrome
Tests	pytest + httpx + ephemeral Postgres	Async-native; 76 tests at last count
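
The coturn row notes that --use-auth-secret removes the need for a per-user database. Under that mode, short-lived credentials are derived from the shared secret: the username carries an expiry timestamp, and the password is the base64-encoded HMAC-SHA1 of that username. A minimal sketch, assuming the platform mints credentials this way; the function name, the one-hour TTL and the `user_id` suffix are illustrative choices, not taken from the codebase:

```python
import base64
import hashlib
import hmac
import time
from typing import Optional, Tuple


def turn_credentials(shared_secret: str, user_id: str, ttl_seconds: int = 3600,
                     now: Optional[float] = None) -> Tuple[str, str]:
    """Return (username, password) for coturn's shared-secret authentication."""
    current = time.time() if now is None else now
    expiry = int(current + ttl_seconds)
    # coturn reads the leading number as the credential's expiry (Unix time).
    username = f"{expiry}:{user_id}"
    mac = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1)
    password = base64.b64encode(mac.digest()).decode()
    return username, password
```

Because coturn recomputes the same HMAC from `TURN_SHARED_SECRET`, rotating that secret invalidates every credential already handed out, which matches the rotation note in section 7.3.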

12. Open questions and likely future direction

These are not promises, just where the operator currently expects the architecture to evolve:

Horizontal scale: today's single CX22 handles a few hundred concurrent agents comfortably. Beyond that the agent registry needs to move from per-process in-memory to Redis (so multiple platform replicas can route to the same agent), and the Postgres connection pool needs sizing review. Documented in security-hardening TODO.
WebRTC video: Phase 2 (#106) wires kvmd's H.264 stream as an actual track. Once stable, the iframe console (routers/console.py + the URL-rewriting glue) is retired entirely, removing ~1000 LOC of brittle code.
Cross-KVM support via Redfish: the agent currently only knows kvmd. Adding a Redfish "translator" mode would let it bridge to Dell iDRAC, HPE iLO, Supermicro IPMI, etc. This would turn KVM Fleet from "wrapper around hobbyist hardware" into "single pane of glass over every BMC the enterprise owns".
Multi-tenant / MSP mode: nested orgs so a consultant can manage many customer tenants from one account. Needs a parent_org_id column, a token claim for the active sub-tenant, and an org-switcher UI.
On-prem deployable bundle: a tarball with docker-compose.prod.yml + Caddyfile.template + offline licence-key check, for enterprise customers who can't accept SaaS.
Cross-customer telemetry intelligence: anonymised baselines on the customer's own dashboard ("your average uptime: 28 days, fleet p50: 43 days"). Requires a separate telemetry warehouse (probably ClickHouse), an opt-in flag, lawyer-reviewed copy on the privacy implications, and an aggregation function that enforces a minimum bucket size to prevent re-identification.
Hardware-bound audit-chain anchoring: HMAC-sign the daily audit-tail hash with a key the platform doesn't hold + anchor externally (S3 object-lock, Bitcoin testnet, etc.) so platform compromise can't silently change the official chain head.
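
The minimum-bucket-size rule mentioned for cross-customer telemetry can be illustrated with a small sketch. Nothing here exists in the codebase yet: the function name, the input shape and the default threshold of 20 orgs are placeholders.

```python
from statistics import median
from typing import Dict, Optional


def fleet_p50(uptime_days_by_org: Dict[str, float],
              min_bucket: int = 20) -> Optional[float]:
    """Median uptime across opted-in orgs, or None if the cohort is too small."""
    if len(uptime_days_by_org) < min_bucket:
        # Suppress the figure rather than risk re-identifying a small cohort.
        return None
    return median(uptime_days_by_org.values())
```

Returning None instead of a number forces the dashboard to handle the "not enough data" case explicitly, rather than showing a baseline that effectively describes one or two identifiable customers.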

13. Glossary

PiKVM: an open-source KVM-over-IP appliance built on Raspberry Pi or compatible single-board computers. Provides remote video, keyboard/mouse, virtual mass storage, and ATX power control over IP. KVM Fleet's primary supported hardware.
kvmd: the daemon running on each PiKVM that exposes the KVM functions as an HTTP + WebSocket API. KVM Fleet's agent talks to kvmd locally on the device.
uStreamer: kvmd's video-source process. Provides MJPEG or H.264 over an internal interface.
agent: KVM Fleet's small Go binary that runs alongside kvmd on each device. The bridge between kvmd and the platform.
platform: KVM Fleet's central FastAPI service. Hosts the API, processes webhooks, mints tokens, runs the audit chain.
org: a single tenant. Every paying customer is one or more orgs.
operator: an authenticated user of the dashboard, with one of the four roles.
support admin: a user with is_support_admin = true. Can read tickets across orgs and access the WebRTC console preview. Currently only the platform operator's two personal accounts.
MCP: Model Context Protocol — Anthropic's standard for exposing tools to AI assistants. KVM Fleet ships an MCP server (@kvmfleet/mcp) that lets Claude Desktop and similar tools query fleet state in plain English.
RLS: Row-Level Security. The Postgres feature that enforces per-row visibility based on a session-level setting. KVM Fleet uses it as defence-in-depth for tenant isolation.
audit chain: the SHA-256 hash chain over audit_events rows. Each row's hash covers the previous row's hash, so altering or deleting any row invalidates every hash after it and makes tampering detectable.
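
The hash-chain idea in the last glossary entry can be shown in a few lines. This is a generic sketch, not the schema or hashing scheme of audit_events: the all-zero genesis value, the canonical-JSON encoding and the field names are assumptions made for illustration.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # hypothetical prev_hash for the very first row


def row_hash(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def append(chain: List[dict], event: dict) -> None:
    """Append an event, linking it to the hash of the current chain tail."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev, "hash": row_hash(prev, event)})


def first_broken_index(chain: List[dict]) -> Optional[int]:
    """Return the index of the first row that fails verification, or None."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["event"]):
            return i
        prev = row["hash"]
    return None
```

Verification only needs the rows themselves, so an auditor can re-run it against an export without trusting the platform's own integrity check.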