
System architecture

This document is the single comprehensive view of how KVM Fleet is built, deployed and operated. It is intended for prospects with a security or platform team who want to understand the trust model, for auditors evaluating the compliance posture, and for future maintainers who need to navigate the codebase.

The companion documents in this section drill into specific subsystems, among them rls.md, audit-chain.md, ws-multiplex.md and the threat model.


At a glance

flowchart LR
    subgraph operator["Operator (anywhere)"]
        browser["Browser<br/>SPA + WebRTC"]
    end

    subgraph hetzner["Hetzner CX22 (Falkenstein, DE) — EU only"]
        caddy["Caddy 2<br/>TLS 1.3 terminator"]
        platform["FastAPI platform<br/>audit + RBAC + JIT + approval"]
        pg[("Postgres 16<br/>RLS + SHA-256 audit chain")]
        redis[("Redis 7<br/>presence + sessions")]
        coturn["coturn<br/>TURN relay"]
    end

    subgraph customer["Customer network"]
        agent["Go agent<br/>on PiKVM<br/>(outbound-only)"]
        kvmd["kvmd<br/>localhost:80"]
        bmc["Dell iDRAC / HPE iLO<br/>Supermicro / Lenovo XCC<br/>(Redfish-managed)"]
    end

    browser <==>|"HTTPS + WSS"| caddy
    caddy --> platform
    platform <--> pg
    platform <--> redis
    agent ==>|"outbound WSS<br/>(agent dials in)"| caddy
    agent --> kvmd
    platform -->|"Redfish HTTPS<br/>session or basic auth"| bmc
    browser -. "WebRTC P2P<br/>DTLS-SRTP" .-> agent
    browser -. "WebRTC fallback" .-> coturn
    coturn -. "fallback only" .-> agent

Three trust domains: the operator's identity (Google SSO + 2FA), the platform (KVM Fleet — audit, authorisation, signaling), and the customer's BMC / PiKVM. The platform sits in the middle of the data path for everything except WebRTC media (which terminates DTLS-SRTP at the agent, peer-to-peer to the operator's browser). No inbound ports are opened on the customer side — the agent dials out over WSS. The full adversarial breakdown is in Threat model.


1. Product overview in one paragraph

KVM Fleet is a B2B SaaS that adds a fleet dashboard, browser-based remote console, tamper-evident audit log, role-based access control, Google SSO, alerting, ISO library, per-device monitoring and one-click compliance reports on top of customer-owned PiKVM hardware. A small Go agent (~10 MB static binary, arm/arm64/amd64) runs on each PiKVM and dials out to the platform over an authenticated WebSocket. From there an operator at app.kvmfleet.io gets unified control of every device, with all data-at-rest and most data-in-flight inside the European Union (Hetzner, Falkenstein).


2. Components

2.1 The marketing site (kvmfleet.io)

Static HTML/CSS, no JavaScript framework. Served by Caddy from deploy/landing/. Pages: /, /terms.html, /privacy.html, /dpa.html, plus /.well-known/security.txt, /robots.txt, /sitemap.xml, /llms.txt. Includes JSON-LD structured data (SoftwareApplication, Organization) so search engines and LLM crawlers can index the product correctly. Heavy use of CSS variables and a single embedded <style> block per page; no build step.

2.2 The dashboard SPA (app.kvmfleet.io)

React 18 + TypeScript + Tailwind, bundled by Vite. Built once per deploy by the web-build Docker service into a Caddy-served static volume. State is managed by @tanstack/react-query. Real-time updates use a WebSocket to the platform at /v1/ws for presence + heartbeat broadcasts. Charts use Recharts. WebRTC console uses the browser's native RTCPeerConnection. No SSR — fully client-side after the initial bundle download.

2.3 The platform (FastAPI, Python 3.12)

Single FastAPI process behind Caddy at app.kvmfleet.io/v1/*. Routers under platform/app/routers/:

| Router | Surface |
| --- | --- |
| auth | Local signup/login, password change/reset, refresh-token rotation |
| sso | Google OIDC SSO callback flow |
| twofa | TOTP enrolment, recovery codes, MFA challenge |
| team | Org member roster, invites, role changes, account deletion |
| org | Org-level settings (compliance frameworks, country) |
| devices | Device list, enrolment, rotate token, remove |
| power | ATX power control via the agent tunnel |
| console | The legacy iframe-tunnelled kvmd console |
| webrtc | Phase 1 WebRTC signalling (preview) |
| isos | ISO library catalogue + per-device mount/unmount |
| support | Customer-side support tickets |
| admin | Support-admin-only routes (errors, ticket cross-org view) |
| audit | Per-org audit event query + chain integrity check |
| audit_webhooks | Per-org SIEM webhook endpoints + dispatcher |
| alerts | Alert rules + history |
| billing | Stripe Checkout + Customer Portal + webhook + plan info |
| reports | PDF compliance report generation |
| api_tokens | Long-lived bearer tokens for the MCP server / scripts |
| ws | Operator presence stream + agent tunnel endpoint |
| public | Public, unauth endpoints (plans, beta status) |
| downloads | Signed agent binary downloads (pinned version) |
| beta | Private-beta gate + waitlist |

2.4 The agent (Go 1.24, pure Go, no CGO)

Single static binary installed at /usr/local/bin/kvmfleet-agent on the PiKVM. Configured via env vars (KVMFLEET_API, KVMFLEET_TOKEN, KVMFLEET_KVMD_USER, KVMFLEET_KVMD_PASS, etc.) and a small JSON state file at /var/lib/kvmfleet/state.json. Dials out to wss://app.kvmfleet.io/v1/agent/ws on boot, authenticates with the per-device token, then services multiplexed HTTP requests and WebSocket channels from the platform side.

Internal HTTP routes the agent exposes through the tunnel:

  • /health — agent self-info, always reachable
  • /api/... — proxied to local kvmd (browser console + power + ATX state)
  • /streamer/... — proxied to local uStreamer (video frames for legacy console)
  • /extras/webterm/ttyd/... — direct Unix-socket route to ttyd (serial console)
  • /internal/iso/{mount,unmount} — ISO library handlers (download, sha256-verify, mount via kvmd MSD)
  • /internal/webrtc/offer — pion-based WebRTC PeerConnection negotiation (HID DataChannel; video track in phase 2)

2.5 Postgres (16-alpine)

Single instance. Two databases on the same cluster:

| Database | Used by | Notes |
| --- | --- | --- |
| kvmfleet | platform | All operational data |
| glitchtip | GlitchTip | Error tracker (separate database) |

Roles:

| Role | Privilege | Used by |
| --- | --- | --- |
| kvmfleet | superuser | Migrations, admin scripts, glitchtip's bootstrap |
| kvmfleet_app | NOSUPERUSER NOBYPASSRLS | The platform's runtime connection (RLS applies to it) |

Most org-scoped tables have forced row-level security with an org_iso policy keyed on current_setting('app.current_org'). The platform sets that GUC on every authed request via set_org_context(session, org_id). See rls.md for the full table list.

2.6 Redis (7-alpine)

Two databases on one instance:

| DB | Used by |
| --- | --- |
| 0 | platform: agent presence, alert dedup, error-alert dedup |
| 2 | GlitchTip: Celery broker + result backend |

No persistence is enabled; all keyed data is short-lived (presence TTL 30s, dedup TTL 1h). A Redis crash loses presence state, which the next agent heartbeat repopulates.

2.7 Caddy 2 (TLS terminator + static server + reverse proxy)

A single Caddy process serves the following hosts and paths:

| Host | Backed by | Purpose |
| --- | --- | --- |
| kvmfleet.io | static /srv/landing | Marketing pages |
| kvmfleet.io/docs/ | static /srv/docs | This documentation site |
| app.kvmfleet.io | reverse-proxy platform:8000 for /v1/*, static /srv/web for the SPA | Operator dashboard + API |
| errors.kvmfleet.io | reverse-proxy glitchtip:8000 | Internal error tracker |

In addition, the legacy eurokvm.io and app.eurokvm.io hosts 308-redirect to the new domain.

Caddy auto-provisions TLS certs via Let's Encrypt, sets HSTS preload, strict CSP per host, long-lived Cache-Control: immutable on hashed assets, short cache on HTML.

2.8 coturn (TURN server for WebRTC)

Runs on the Hetzner host network (not in a docker bridge — TURN's UDP relay range needs raw access). Listens on UDP/TCP 3478 (STUN/TURN) and 5349 (TLS), with a relay range of 49152-49200. Uses --use-auth-secret so the platform mints ephemeral 24h credentials without a per-user database. Only used as a WebRTC fallback; most connections traverse direct host or STUN-discovered paths.
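
The ephemeral credentials mentioned above follow the shared-secret scheme that coturn's use-auth-secret mode expects: the username embeds an expiry timestamp and the password is an HMAC-SHA1 of that username. The sketch below is illustrative only; the function name mint_turn_credentials and the label argument are assumptions, not the platform's actual code.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def mint_turn_credentials(shared_secret: str, label: str,
                          ttl_seconds: int = 24 * 3600,
                          now: Optional[float] = None) -> dict:
    """Build time-limited TURN credentials for coturn's use-auth-secret mode."""
    issued_at = time.time() if now is None else now
    expiry = int(issued_at) + ttl_seconds
    # The username is "<unix expiry>:<label>"; coturn rejects it once the timestamp has passed.
    username = f"{expiry}:{label}"
    # The password is base64(HMAC-SHA1(shared secret, username)).
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    return {"username": username, "credential": base64.b64encode(digest).decode()}
```

Because the TURN server recomputes the same HMAC from the username it receives, no per-user credential table is needed, which matches the "without a per-user database" property described above.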

2.9 GlitchTip (error tracker, Sentry-SDK compatible)

Self-hosted, EU-resident replacement for Sentry. Three docker services (glitchtip-migrate, glitchtip, glitchtip-worker) sharing the cluster's Postgres + Redis. Public registration disabled (ENABLE_USER_REGISTRATION=false). The platform's app/observability.py initialises sentry-sdk against it; the SPA initialises @sentry/react. Aggressive PII scrubbing in before_send strips JWTs, refresh tokens, Stripe keys, sensitive headers and request bodies before any event leaves the platform.

2.10 External services

| Service | Used for | Region |
| --- | --- | --- |
| Stripe Payments Europe Ltd. | Checkout, Customer Portal, recurring subscriptions, dunning | Ireland |
| Brevo (ex-Sendinblue) | Outbound transactional email (welcome, invite, password reset, alerts, signup notification) | France |
| ImprovMX | Inbound email forwarding for *@kvmfleet.io aliases (privacy@, security@, support@, etc.) | Belgium |
| Google Cloud Identity Platform | OIDC SSO | Multi-region; we use the EU OAuth client |
| Hetzner Online GmbH | Compute (CX22), DNS lookups, network egress | Germany (Falkenstein) |
| INWX | Domain registrar + authoritative DNS for kvmfleet.io | Germany |
| Cloudflare STUN | NAT discovery for WebRTC (stun.cloudflare.com:3478) | global anycast |
| npmjs.com | Distribution for @kvmfleet/mcp | US (read-only on customer side) |

Each is disclosed as a sub-processor in the DPA.


3. Hosting topology

                               Hetzner CX22  (FSN1, Falkenstein DE)
                       ┌─────────────────────────────────────────────────┐
                       │                                                 │
                       │  ┌────────────┐  ┌────────────┐  ┌───────────┐  │
                       │  │  Caddy 2   │  │  platform  │  │ glitchtip │  │
                       │  │  TLS / SNI │  │  FastAPI   │  │  + worker │  │
                       │  └─────┬──────┘  └─────┬──────┘  └─────┬─────┘  │
                       │        │               │               │         │
                       │        │               ▼               │         │
                       │        │      ┌────────────────┐       │         │
                       │        │      │  Postgres 16   │◄──────┘         │
                       │        │      │  + Redis 7     │                 │
                       │        │      └────────────────┘                 │
                       │        │                                         │
                       │  ┌─────┴──────┐  ┌────────────┐                  │
                       │  │ web-build  │  │  coturn    │                  │
                       │  │ (one-shot) │  │  net=host  │                  │
                       │  └────────────┘  └────────────┘                  │
                       │                                                  │
                       └──────────────┬───────────────────────────────────┘
                                      │ public IP 46.225.227.71
                                      │ TCP 443 (TLS), 80 (LE), 22 (ssh)
                                      │ UDP 3478, 5349, 49152-49200 (TURN)
                          ┌────────────────────────────┐
                          │      THE INTERNET          │
                          └────────────────────────────┘
                            │              │            │
                            ▼              ▼            ▼
                 ┌──────────────┐ ┌─────────────────┐ ┌───────────┐
                 │  Operator    │ │   PiKVM agent   │ │ Customer  │
                 │  browser     │ │ (anywhere)      │ │ stripe /  │
                 │  (anywhere)  │ │ outbound WSS    │ │ google    │
                 └──────────────┘ └─────────────────┘ │ callbacks │
                                                      └───────────┘

One CX22 (€3.99/mo, 2 vCPU, 4 GB RAM, 40 GB NVMe) hosts everything: Caddy, the platform, Postgres, Redis, GlitchTip, coturn, the static SPA bundle, the docs site, the marketing site. There is no horizontal scaling today; that's a deliberate choice for a pre-revenue product. The path to scale is documented in the open-questions section at the bottom.


4. Walkthroughs of the major data flows

4.1 Operator login (local credentials)

  1. Browser POSTs /v1/auth/login with {email, password}.
  2. Platform looks up the user (RLS off — app.current_org not yet set), bcrypt-verifies the password, checks account lockout state.
  3. If TOTP is enabled, returns a short-lived mfa_token (5 min, audience mfa); otherwise issues an access JWT (15 min, audience access) + a refresh token (random opaque ekvrf_… string, sha256-hashed and stored in refresh_tokens).
  4. Refresh token returned via HttpOnly Secure SameSite=Strict cookie scoped to /v1/auth. Access token returned in JSON body.
  5. SPA stores the access token in memory and sends it as Authorization: Bearer … on every API call.

Refresh: the SPA POSTs /v1/auth/refresh (cookie auto-attached). The platform sha256-hashes the cookie value and looks up the row. If that row was already replaced (token reuse), it returns 401 and revokes the entire token family. Otherwise it rotates: it marks the current row as replaced and issues a new access + refresh pair.

4.2 Google SSO

  1. Browser navigates to /v1/auth/google → Authlib redirects to Google with the configured redirect_uri.
  2. Google authenticates the user, redirects back to /v1/auth/google/callback?code=….
  3. Platform exchanges the code for an ID token, validates email_verified=true, looks up or creates the user. If a non-expired invite exists for that email, joins them to the inviter's org; otherwise creates a fresh personal org.
  4. Issues access + refresh tokens (same as local login). Redirects to https://app.kvmfleet.io/login#access_token=… so the SPA picks them up.
  5. New users trigger a one-shot AcceptTerms gate before they can use the dashboard (Terms + B2B confirmation).

4.3 Device enrolment

  1. Operator clicks "Add device" in the SPA → POST /v1/devices/enrollment with optional suggested_name. Platform issues an enrollment_token (random, 24h TTL) bound to the org.
  2. SPA shows the operator a one-line install command: curl -sSL https://app.kvmfleet.io/install | KVMFLEET_TOKEN=<plaintext> sh.
  3. Operator runs the command on the PiKVM via SSH. The install script downloads the agent binary from /downloads/kvmfleet-agent.linux-arm64, writes a systemd unit, sets KVMFLEET_TOKEN in /etc/kvmfleet/agent.env, starts the service.
  4. Agent boots, POSTs /v1/agent/register with the enrollment token + hardware ID + agent version + kvmd version. Platform consumes the enrollment token, creates a devices row, returns a long-lived per-device auth_token (sha256-hashed at rest in devices.auth_token_hash).
  5. Agent persists the auth_token to /var/lib/kvmfleet/state.json and switches to the long-lived authenticated WebSocket: wss://app.kvmfleet.io/v1/agent/ws?token=<auth_token>.
  6. Heartbeats every 30s update devices.last_seen_at, cpu_temp_c, uptime_s, and append a device_metrics row for the historical-data graph.
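
Step 4 stores only a SHA-256 hash of the per-device token. A minimal sketch of that pattern follows, assuming a URL-safe random body after the ekvt_ prefix; the helper names new_device_token and token_matches are hypothetical and the real token length is not stated here.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def new_device_token(prefix: str = "ekvt_") -> Tuple[str, str]:
    """Return (plaintext, sha256_hex). Only the hash would be kept server-side."""
    plaintext = prefix + secrets.token_urlsafe(32)
    return plaintext, hashlib.sha256(plaintext.encode()).hexdigest()


def token_matches(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare it to the stored hash in constant time."""
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)
```

The plaintext is returned to the agent exactly once; a database leak then exposes hashes that cannot be replayed as tokens.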

4.4 Browser console (legacy iframe path)

  1. Operator clicks "Console" → SPA POSTs /v1/console/start for a console_session row + a 60-min aud=console JWT in an HttpOnly cookie scoped to /v1/devices/{id}/console/.
  2. SPA opens an <iframe> to https://app.kvmfleet.io/v1/devices/{id}/console/ (which serves the kvmd web UI through the proxy).
  3. Every iframe HTTP request hits the platform's routers/console.py proxy. The proxy:
     • Validates the console JWT cookie and ACL-checks the device against the org
     • Strips kvmfleet's own cookies from the outbound Cookie header
     • Path-allowlists requests (kvmd's /api/, /streamer/, /share/, /login, /logout, /auth/)
     • Rewrites response HTML/JS so resource URLs resolve under /v1/devices/{id}/console/…
     • Rewrites Set-Cookie attributes (Path, Domain, Secure)
  4. Platform forwards the request through the agent's WebSocket tunnel (multiplexed channel; see ws-multiplex.md). Agent serves the request locally from kvmd, returns the response.
  5. WebSocket subscriptions for the live video stream use a multiplexed WS channel through the same tunnel, with extra Origin checks at handshake.
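
Two of the proxy checks listed above, path allowlisting and cookie stripping, can be sketched as follows. The path prefixes come from the list; the cookie names in PLATFORM_COOKIES and the exact matching rules are assumptions made for illustration.

```python
# Prefixes and exact paths taken from the allowlist above.
ALLOWED_PREFIXES = ("/api/", "/streamer/", "/share/", "/auth/")
ALLOWED_EXACT = ("/login", "/logout")

# Hypothetical cookie names; the platform's real names are not given in this document.
PLATFORM_COOKIES = {"kvmfleet_console", "kvmfleet_refresh"}


def is_allowed_path(path: str) -> bool:
    """Accept only allowlisted kvmd paths and reject any '..' path segment."""
    if ".." in path.split("/"):
        return False
    return path in ALLOWED_EXACT or path.startswith(ALLOWED_PREFIXES)


def strip_platform_cookies(cookie_header: str) -> str:
    """Drop platform-owned cookies before the Cookie header is forwarded to kvmd."""
    parts = [p.strip() for p in cookie_header.split(";") if p.strip()]
    kept = [p for p in parts if p.split("=", 1)[0] not in PLATFORM_COOKIES]
    return "; ".join(kept)
```

Stripping the platform's cookies keeps the console JWT from ever reaching kvmd, while the allowlist limits what a compromised iframe page could request through the tunnel.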

This whole path is being retired in favour of the WebRTC console (next section).

4.5 Browser console (WebRTC, preview, support-admin only)

  1. Operator clicks "Console (RTC)". SPA fetches /v1/devices/{id}/webrtc/ice-config to get STUN + ephemeral TURN credentials.
  2. SPA creates an RTCPeerConnection with those ICE servers, attaches a recvonly video transceiver and an unordered HID DataChannel.
  3. SPA generates an SDP offer, waits 2s for ICE gathering, POSTs the offer to /v1/devices/{id}/webrtc/offer.
  4. Platform proxies the offer through the agent's WebSocket tunnel to the agent's /internal/webrtc/offer.
  5. Agent (pion v4) accepts the offer, builds its own PeerConnection with the same ICE servers, creates an SDP answer, gathers ICE for up to 5s, returns the answer JSON.
  6. SPA applies the remote description. ICE traversal runs concurrently on both sides — direct host pair, STUN-discovered pair, or TURN-relayed pair (whichever wins).
  7. After ICE settles, DTLS-SRTP handshake runs end-to-end between browser and agent. The platform plays no role from this point on.
  8. HID events flow as JSON strings over the DataChannel. The agent forwards each one as an HTTP POST to kvmd's /api/hid/events/send_… endpoints using the same Basic auth and session cookies as the rest of the proxy.
  9. Video track is currently a placeholder (recvonly, no source bound) — Phase 2 wires it to kvmd's H.264 stream (/streamer/stream).

The signalling pipe is platform-mediated; the media pipe is end-to-end DTLS-SRTP encrypted. The platform sees SDPs (~10 KB) and ICE candidate metadata; it cannot decrypt media.

4.6 ISO mount

  1. Operator registers an ISO at app.kvmfleet.io/isos: name, source URL (HTTPS), sha256, optional size. Stored in isos table with FORCE RLS.
  2. Operator opens a device, picks an ISO, clicks "Mount" → POST /v1/devices/{id}/iso:mount. Platform audit-logs before firing, rate-limits to 1 mount per device per hour.
  3. Platform forwards {name, source_url, sha256, size_bytes, media_type} to the agent's /internal/iso/mount.
  4. Agent streams the URL to a tmp file with concurrent SHA-256 hashing. Refuses to proceed on hash or size mismatch.
  5. Agent disconnects any existing MSD, multipart-uploads to kvmd's /api/msd/write?image=<name>, then calls /api/msd/set_params and /api/msd/set_connected?connected=1.
  6. The ISO bytes never touch the platform: the agent pulls them directly from the source URL, so platform egress stays flat and the GDPR posture stays clean.
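
The hash-while-streaming step (step 4) can be sketched as below. The agent itself is written in Go; this Python version only illustrates the technique, and the names copy_and_verify and IsoVerificationError are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO, Optional


class IsoVerificationError(Exception):
    """Raised when the streamed bytes do not match the declared size or hash."""


def copy_and_verify(src: BinaryIO, dst: BinaryIO, expected_sha256: str,
                    expected_size: Optional[int] = None,
                    chunk_size: int = 1 << 20) -> int:
    """Copy src to dst chunk by chunk, hashing as we go; raise on any mismatch."""
    hasher = hashlib.sha256()
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)
        dst.write(chunk)
        total += len(chunk)
        # Abort early instead of filling the disk with an oversized download.
        if expected_size is not None and total > expected_size:
            raise IsoVerificationError("stream is larger than the declared size")
    if expected_size is not None and total != expected_size:
        raise IsoVerificationError("stream is smaller than the declared size")
    if hasher.hexdigest() != expected_sha256.lower():
        raise IsoVerificationError("sha256 mismatch")
    return total


# Usage with in-memory streams standing in for the HTTP body and the temp file.
payload = b"example image bytes"
written = copy_and_verify(io.BytesIO(payload), io.BytesIO(),
                          hashlib.sha256(payload).hexdigest(), len(payload))
```

Hashing during the copy avoids a second pass over a multi-gigabyte file; the caller is expected to delete the temp file when an IsoVerificationError is raised.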

4.7 Compliance report generation

  1. Operator picks a framework (NIS2, GDPR, SOC 2, ISO 27001, HIPAA, NIST 800-53, PCI DSS, ISO 9001) → POST /v1/reports/{framework} with optional from_date / to_date.
  2. Platform's _collect_evidence runs a single async pass that queries device counts, audit-event counts by action, hash-chain integrity, member roster, failed logins, 2FA failures, password resets, refresh-token reuse and other org-scoped facts.
  3. The framework spec maps each control to an evaluator function that classifies the control as COVERED / PARTIAL / NOT COVERED / N/A based on the evidence.
  4. ReportLab renders the PDF: cover sheet, executive summary with coverage roll-up, per-control table with status badges, framework-specific sections (Art. 30 ROPA template for GDPR, CUECs for SOC 2, Annex A theme breakdown for ISO 27001, etc.), per-framework "this is not an attestation" disclaimer.
  5. PDF returned with Content-Disposition: attachment; filename=kvmfleet-{framework}-{from}-{to}.pdf. Audit row written.

4.8 Audit log + tamper-evidence

Every meaningful action (login, console open, power cycle, ISO mount, settings change, role change, deletion, etc.) writes a row to audit_events. Each row carries a SHA-256 hash computed over the previous row's hash plus the canonical JSON of its own payload, so the rows form a cryptographic chain. See audit-chain.md.

The table is append-only at the DB level:

  • REVOKE UPDATE, DELETE, TRUNCATE ON audit_events FROM kvmfleet_app so the runtime role can't tamper with history.
  • A BEFORE UPDATE OR DELETE trigger raises an exception even from a superuser, defending against accidents.
  • Account deletion's pseudonymisation pass runs inside a SAVEPOINT and is expected to fail against this table. The deletion still proceeds and the historical email stays in the audit log; the Privacy Policy carves this out under the immutable-log clause.

Operators verify the chain via GET /v1/audit/integrity, which re-walks the org's events and returns {ok: true, checked: N} or {ok: false, first_break_id: …, message: …}. The verification is also exposed through the MCP verify_audit_integrity tool.
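
A minimal sketch of a hash chain and its integrity walk is shown below. The exact byte layout (concatenating the previous hash with sorted-key compact JSON) and the all-zero genesis hash are assumptions; audit-chain.md is the authority on the real format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first row


def canonical_json(payload: dict) -> bytes:
    """Serialise with sorted keys and no whitespace so equal payloads hash equally."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def row_hash(prev_hash: str, payload: dict) -> str:
    return hashlib.sha256(prev_hash.encode("ascii") + canonical_json(payload)).hexdigest()


def append_event(chain: list, payload: dict) -> dict:
    """Append a row whose hash commits to the previous row and its own payload."""
    prev = chain[-1]["hash"] if chain else GENESIS
    row = {"id": len(chain) + 1, "prev_hash": prev, "payload": payload,
           "hash": row_hash(prev, payload)}
    chain.append(row)
    return row


def verify_chain(chain: list) -> dict:
    """Re-walk the chain and report the first row that no longer matches."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return {"ok": False, "first_break_id": row["id"], "message": "hash mismatch"}
        prev = row["hash"]
    return {"ok": True, "checked": len(chain)}
```

Changing any historical payload alters that row's recomputed hash, so the walk stops at the first modified row, which mirrors the {ok, first_break_id} response shape described above.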

Optionally, an org can configure SIEM webhooks (audit_webhooks table). An async dispatcher fans out HMAC-SHA256-signed POSTs (X-KVMFleet-Signature: sha256=…) to each registered URL. After 10 consecutive failures the webhook auto-disables.
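
On the receiving side, a SIEM endpoint would verify the signature header before trusting the body. The sketch below assumes the header value is "sha256=" followed by the hex HMAC-SHA256 of the raw request body under the per-webhook secret; that encoding is an assumption, as are the function names.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Compute the value expected in X-KVMFleet-Signature (assumed hex encoding)."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare the received header against a freshly computed one in constant time."""
    expected = sign_body(secret, body).encode()
    return hmac.compare_digest(expected, header_value.strip().encode())
```

The check must run over the raw bytes as received; re-serialising the parsed JSON first can change whitespace or key order and make valid signatures fail.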

4.9 Per-device monitoring

  1. Every agent heartbeat (~30s) writes a device_metrics row: (device_id, org_id, ts, cpu_temp_c, uptime_s, online).
  2. SPA's DeviceMonitoring page calls GET /v1/devices/{id}/metrics?hours=&max_points= for the selected range.
  3. Platform returns raw rows when count ≤ max_points; otherwise downsamples server-side via epoch / bucket_seconds grouping with AVG() aggregation, so the chart never has to render more than ~200 points regardless of range.
  4. Recharts renders CPU temp + uptime line charts with hover tooltips and adaptive x-axis tick formatting (HH:MM for ≤24h, day+time for ≤7d, date for >7d).
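
The server-side downsampling in step 3 happens in SQL; the same bucketing logic can be sketched in Python as below. The Point type, the downsample name and the way bucket_seconds is derived from the range are assumptions for illustration.

```python
from collections import defaultdict
from typing import List, NamedTuple, Sequence


class Point(NamedTuple):
    ts: int            # unix seconds
    cpu_temp_c: float


def downsample(points: Sequence[Point], range_seconds: int,
               max_points: int = 200) -> List[Point]:
    """Return raw points when few enough, otherwise one averaged point per time bucket."""
    if len(points) <= max_points:
        return list(points)
    # Ceiling division, so the whole range fits in roughly max_points buckets.
    bucket_seconds = max(1, -(-range_seconds // max_points))
    buckets = defaultdict(list)
    for p in points:
        buckets[p.ts // bucket_seconds].append(p.cpu_temp_c)  # epoch / bucket_seconds grouping
    return [Point(ts=index * bucket_seconds, cpu_temp_c=sum(vals) / len(vals))
            for index, vals in sorted(buckets.items())]
```

With 30-second heartbeats, a 24-hour range holds 2,880 raw rows; bucketing at 432 seconds reduces that to about 200 averaged points.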

4.10 Stripe billing

  1. Operator clicks "Upgrade" → POST /v1/billing/checkout {plan}. Platform creates a Stripe Checkout session with the price id mapped from the plan, sets metadata {org_id, plan}, returns the Checkout URL.
  2. Stripe collects payment, redirects back to the SPA. Platform receives checkout.session.completed at /v1/billing/webhook.
  3. Webhook is HMAC-verified against STRIPE_WEBHOOK_SECRET. Idempotency: every event_id is inserted into stripe_events with ON CONFLICT DO NOTHING; duplicate posts return {duplicate: true} without re-processing.
  4. On checkout.session.completed the platform flips org.plan + org.max_devices and stores stripe_subscription_id. On invoice.payment_failed it sets payment_failed_at (starting the 7-day dunning clock). On invoice.paid or invoice.payment_succeeded it clears payment_failed_at. On customer.subscription.deleted it downgrades to free and clears all billing fields.
  5. A janitor sweeps every 60s: orgs with payment_failed_at < NOW() - 7 days get auto-downgraded to free.
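
The idempotency guard in step 3 can be sketched with an in-memory SQLite table standing in for Postgres. The table and column names follow the text; handle_event, make_store and the apply_change callback are hypothetical, and SQLite 3.24 or newer is assumed for ON CONFLICT DO NOTHING.

```python
import sqlite3
from typing import Callable


def make_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the stripe_events idempotency table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stripe_events (event_id TEXT PRIMARY KEY, type TEXT NOT NULL)")
    return conn


def handle_event(conn: sqlite3.Connection, event: dict,
                 apply_change: Callable[[dict], None]) -> dict:
    """Process an event at most once, keyed by its event id."""
    cur = conn.execute(
        "INSERT INTO stripe_events (event_id, type) VALUES (?, ?) ON CONFLICT DO NOTHING",
        (event["id"], event["type"]),
    )
    if cur.rowcount == 0:
        # The row already existed, so this delivery is a retry.
        return {"duplicate": True}
    apply_change(event)
    conn.commit()
    return {"duplicate": False}
```

Letting the primary key arbitrate duplicates avoids a separate read-then-write race when Stripe retries a delivery concurrently.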

Soft enforcement: an over-cap org's existing devices keep working; new enrolment returns 402 Payment Required.


5. Authentication and authorization

5.1 Token types and audiences

Every JWT carries an explicit aud claim. Decoding always passes the expected audience, so a console token can never be replayed where an access token is expected.

| Audience | TTL | Use |
| --- | --- | --- |
| access | 15 min | API calls from the SPA |
| refresh | 7 days | Mint new access tokens (HttpOnly cookie) |
| mfa | 5 min | Carry user identity through TOTP challenge |
| reset | 1 hour | Password reset flow (also legacy; new flow uses opaque tokens in password_resets) |
| console | 60 min | Iframe console session (HttpOnly cookie scoped to /v1/devices/{id}/console/) |
| api | configurable (no expiry / 30/90/180/365 days) | Long-lived programmatic tokens prefixed kvmf_…, sha256-hashed in api_tokens |
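
To illustrate why a console token cannot stand in for an access token, the sketch below implements audience-checked decoding with only the standard library. It assumes HS256 signing with a shared secret; the platform's actual JWT library and algorithm are not stated in this document, and the encode/decode helpers are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def encode(claims: dict, secret: bytes) -> str:
    """Produce a compact HS256 JWT for the given claims."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":")).encode())
    body = _b64(json.dumps(claims, separators=(",", ":")).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64(sig)}"


def decode(token: str, secret: bytes, audience: str, now: Optional[float] = None) -> dict:
    """Verify signature, algorithm, audience and expiry; raise ValueError on any failure."""
    header, body, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _unb64(sig)):
        raise ValueError("bad signature")
    if json.loads(_unb64(header)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_unb64(body))
    # The caller states which audience it expects; anything else is rejected.
    if claims.get("aud") != audience:
        raise ValueError("wrong audience")
    if claims.get("exp", 0) <= (time.time() if now is None else now):
        raise ValueError("expired")
    return claims
```

Because the expected audience is a required argument, a route that accepts access tokens rejects a validly signed console token instead of silently accepting it.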

5.2 RBAC

Four roles per OrgUser:

| Role | Powers |
| --- | --- |
| org_admin | Everything: members, billing, devices, settings |
| operator | Console + power + ISO mount + audit-read |
| auditor | Read-only access to audit, sessions, devices |
| read_only | List-only access to devices and audit |

Plus a global users.is_support_admin boolean, currently true only for the platform operator's two personal accounts. Gates the cross-tenant admin surface (/v1/admin/*) and the WebRTC preview button.

5.3 Tenant isolation (RLS)

Every org-scoped table enables FORCE ROW LEVEL SECURITY with an org_iso policy. The runtime role kvmfleet_app has NOSUPERUSER NOBYPASSRLS so the policy can't be bypassed by setting row_security = off at session level.

Each authenticated request flows through authed_principal(), which calls set_org_context(session, org_id) to set the app.current_org GUC. Policies key off current_setting('app.current_org', true)::uuid. A handful of "system mode" tables (e.g. devices) treat the unset GUC as the zero UUID and allow all rows — needed for the agent heartbeat path which has no org context.

Cross-tenant admin reads (e.g. the support-admin tickets view) flip a separate app.bypass_rls GUC inside an explicit helper, with the policy's OR clause checking it. See rls.md for the table list and policy text.

5.4 Refresh-token rotation chain

refresh_tokens rows are linked into a tree by parent_id. On every successful refresh:

  1. The presented row's revoked_at is set to NOW.
  2. A new row is inserted with parent_id = old.id and a fresh hash.
  3. The new plaintext goes back to the client in the cookie.

If a previously-replaced token is presented again (i.e. theft + replay), the entire family (every row sharing the root parent) is revoked atomically and the user is forced to log in. Audit row written for forensics.
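
The rotation and reuse-detection behaviour can be sketched with an in-memory store. The source links rows by parent_id; the family_id shortcut used here (so a family can be revoked without walking to the root), along with the class and method names, is an assumption for illustration.

```python
import hashlib
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Row:
    id: int
    family_id: int
    parent_id: Optional[int]
    revoked: bool = False


class ReuseDetected(Exception):
    """A token that was already rotated away has been presented again."""


class RefreshStore:
    def __init__(self) -> None:
        self.rows: Dict[str, Row] = {}  # keyed by sha256(plaintext token)
        self._next_id = 1

    def _insert(self, family_id: Optional[int], parent_id: Optional[int]) -> str:
        plaintext = "ekvrf_" + secrets.token_urlsafe(32)
        row_id = self._next_id
        self._next_id += 1
        # A root token starts its own family.
        row = Row(row_id, row_id if family_id is None else family_id, parent_id)
        self.rows[hashlib.sha256(plaintext.encode()).hexdigest()] = row
        return plaintext

    def issue(self) -> str:
        """Issue the first token of a new family (at login)."""
        return self._insert(None, None)

    def rotate(self, presented: str) -> str:
        """Revoke the presented token and return its successor, or revoke the family on reuse."""
        row = self.rows.get(hashlib.sha256(presented.encode()).hexdigest())
        if row is None:
            raise KeyError("unknown token")
        if row.revoked:
            for other in self.rows.values():
                if other.family_id == row.family_id:
                    other.revoked = True
            raise ReuseDetected("token reuse; family revoked")
        row.revoked = True
        return self._insert(row.family_id, row.id)
```

If a stolen token is replayed after the legitimate client has rotated, the whole family is revoked, so both the attacker and the client must authenticate again.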


6. Trust model

KVM Fleet is a centralised SaaS. The platform is in the data path for almost everything except the new WebRTC media flow.

What the platform sees:

  • Operator email addresses, display names, hashed passwords (bcrypt cost 12), TOTP secrets (plaintext in DB), bcrypt-hashed recovery codes
  • Device names, hardware IDs, agent versions, last-seen timestamps, CPU temp + uptime time-series
  • Console session metadata (start/end, viewer, device)
  • Every audit-event row (action, target, result, IP, timestamp)
  • Live console traffic (legacy iframe path) — every video frame and every keystroke pass through the platform
  • ISO catalogue metadata (name, URL, sha256) — but never the bytes of the ISOs themselves
  • Stripe customer IDs, subscription IDs, plan, billing-failed timestamps — never card numbers
  • For the WebRTC console: SDP offers/answers (~10 KB each), ICE candidate metadata; not the media bytes (DTLS-SRTP encrypted end-to-end)

What the platform does not see:

  • kvmd's actual administrator password — set at agent install time via KVMFLEET_KVMD_PASS, known only to the agent on disk in /etc/kvmfleet/agent.env (mode 0600)
  • Plaintext payment cards — handled entirely by Stripe
  • Plaintext refresh tokens — only sha256 hashes after the response leaves
  • Anything inside the target server itself — the OS the operator is managing has no agent, no introspection, no telemetry

What an attacker who fully compromises the platform can do:

  • Mint JWTs for any user (they have JWT_SECRET)
  • Inject themselves into any active console session and see the screen or send keystrokes
  • Push commands through every connected agent's tunnel (e.g. mount a malicious ISO, power-cycle a server)
  • Read every audit row, every email, every device name and all console-session metadata
  • Stop new audit rows from being written (existing chain stays tamper-evident; gaps would be visible to an integrity check)

What that attacker cannot do:

  • Decrypt past WebRTC sessions retroactively (we don't record media)
  • Silently modify past audit records (the SHA-256 chain breaks visibly under any mutation)

This is the same trust model as any centralised IT-management SaaS. The honest framing is "you trust us with live device control + audit history; you don't trust us with the OS inside the target server, its credentials, or its data".

Hardening currently in place:

  • bcrypt cost 12 on passwords + recovery codes
  • Short-lived access JWTs (15 min) with explicit audience claims
  • Refresh-token rotation with reuse detection + family revocation
  • Audit chain immutability enforced at the DB role layer + a refusal trigger
  • Postgres FORCE RLS so unscoped queries return zero rows
  • Console-token TTL 60 min, scoped to a single device path
  • Per-device rate limits on power (60s) and ISO mount (1h)
  • Containers run as non-root, capability-dropped, no-new-privileges
  • Strict CSP per host, HSTS preload, signed signup acceptance + B2B confirmation
  • Stripe webhook idempotency keyed by event_id
  • Sub-resource error capture via GlitchTip with PII scrubbing in before_send

Open hardening still pending (tracked in TODO):

  • Sign agent binaries (currently curl | sh with no signature/checksum)
  • Pin the platform's TLS cert SPKI inside the agent (defends against MitM via a different valid LE issuance)
  • Pin kvmd's self-signed cert from the agent (currently InsecureSkipVerify: true for the local kvmd)
  • Mirror JWT_SECRET + DB password + Brevo + Stripe + Google secrets to a real secret store (currently in /opt/kvmfleet/deploy/.env plaintext, root-owned)
  • Encrypt backups with age + ship to off-site Hetzner Storage Box (currently plain .sql.gz on the same disk as the DB)
  • mTLS the agent → platform WebSocket with per-device certs (limits damage if platform is compromised but attacker doesn't have device key)

7. Storage

7.1 Postgres tables (one-line summaries)

| Table | Purpose |
| --- | --- |
| users | Account identity, password hash, TOTP, recovery codes, support-admin flag |
| orgs | Tenant root; plan, max_devices, billing state, compliance frameworks |
| org_users | Membership join with role + optional expires_at for time-limited contractors |
| invites | Pending team invites with role + expires_at + (optional) membership_expires_at |
| devices | Enrolled PiKVMs; auth_token_hash, last_seen, latest cpu_temp + uptime |
| device_metrics | Heartbeat-driven time-series for trend graphs |
| enrollment_tokens | One-shot tokens consumed by agent/register |
| audit_events | SHA-256 hash-chained immutable log |
| audit_webhooks | Per-org SIEM destinations + secret + auto-disable counter |
| console_sessions | Open + historical console windows with viewer + device |
| support_tickets + support_ticket_messages | Customer-side ticket inbox with admin bypass clause |
| alert_rules + alert_history | User-defined alerting + dedup history |
| refresh_tokens | Rotation chain with parent_id + revoked_at |
| password_resets | Single-use opaque tokens, 15 min TTL |
| stripe_events | Webhook idempotency log |
| beta_waitlist | Captured emails from the marketing landing |
| platform_errors | ErrorCaptureMiddleware writes here for the support-admin Errors page |
| api_tokens | Long-lived kvmf_* tokens for MCP / scripts (sha256-hashed) |
| isos | ISO library catalogue |

7.2 File / volume layout on the server

/opt/kvmfleet/                        # rsync target from operator's laptop
  agent/                              # Go source (not the binary)
  bin/                                # built agent binaries (per-arch)
  deploy/
    .env                              # production secrets (mode 0600)
    docker-compose.prod.yml
    Caddyfile
    landing/                          # served by Caddy at kvmfleet.io
      hero.mp4                        # 832 KB compressed; original kept as hero-original.mp4
      fonts/                          # Alliance No.1 will land here
    setup.sh                          # one-shot bootstrap script
    backup.sh                         # nightly pg_dump
    admin.sh                          # break-glass CLI
  platform/                           # FastAPI source
  web/                                # SPA source + public/
  mcp/                                # @kvmfleet/mcp source
  docs/                               # mkdocs source
  postgres-init/                      # init scripts for the postgres container

/var/lib/docker/volumes/              # docker-managed volumes
  deploy_pgdata                       # main + glitchtip databases
  deploy_caddy_data                   # ACME state, issued certs
  deploy_caddy_config                 # caddy state
  deploy_web_build                    # built SPA (consumed by caddy)
  deploy_docs_build                   # built docs (consumed by caddy)
  deploy_glitchtip_uploads            # GlitchTip user-uploaded source maps

On each PiKVM:

/usr/local/bin/kvmfleet-agent         # binary (arm/arm64), mode 0755
/etc/kvmfleet/agent.env               # KVMFLEET_TOKEN, KVMFLEET_KVMD_PASS, mode 0600
/var/lib/kvmfleet/state.json          # device id + auth_token, persisted across reboots
/etc/systemd/system/kvmfleet-agent.service

7.3 Secrets and where they live

| Secret | Format | Stored | Rotation |
| --- | --- | --- | --- |
| JWT_SECRET | 48-byte hex | /opt/kvmfleet/deploy/.env | manual; new value invalidates all tokens |
| SESSION_SECRET | 48-byte hex | /opt/kvmfleet/deploy/.env | manual |
| Postgres kvmfleet + kvmfleet_app passwords | 32-byte hex | /opt/kvmfleet/deploy/.env, postgres-init/ | manual |
| Stripe live secret + webhook secret | as Stripe issues | .env | rotated at Stripe side, mirror here |
| Brevo API key | as Brevo issues | .env | rotated at Brevo |
| Google OAuth client secret | as Google issues | .env | rotated at Google Cloud Console |
| GLITCHTIP_SECRET_KEY | 48-byte hex | .env | manual |
| TURN_SHARED_SECRET | 32-byte hex | .env | manual; rotation invalidates all in-flight TURN credentials |
| Per-device auth_token | random with ekvt_ prefix | DB hash, agent on-disk plaintext at /var/lib/kvmfleet/state.json (mode 0600) | per-device endpoint /v1/devices/{id}/rotate-token |
| User refresh tokens | random with ekvrf_ prefix | DB hash, browser HttpOnly cookie | rotated on every refresh |
| User API tokens | random with kvmf_ prefix | DB hash, user copies plaintext on creation | revoked via Account → API tokens |
| SENTRY_DSN, SENTRY_DSN_WEB | URLs | .env (platform), baked into SPA bundle (web) | manual via GlitchTip UI |
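
The "DB hash" pattern for prefixed tokens can be illustrated with a minimal Python sketch. This is not the platform's code: the function names are hypothetical, the use of SHA-256 for every token type is an assumption (the schema notes only state it for kvmf_* API tokens), and "48-byte hex" is read here as 48 random bytes rendered as 96 hex characters.

```python
import hashlib
import hmac
import secrets


def new_hex_secret(n_bytes: int) -> str:
    """Return n_bytes of randomness as hex (2 * n_bytes characters)."""
    return secrets.token_hex(n_bytes)


def mint_token(prefix: str) -> tuple[str, str]:
    """Return (plaintext, sha256_hex).

    The plaintext is shown to the caller once; only the digest would be stored.
    """
    plaintext = prefix + secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def verify_token(presented: str, stored_digest: str) -> bool:
    """Hash the presented token and compare against the stored digest in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)
```

Because the tokens are long random strings rather than human-chosen passwords, a fast unsalted hash is a reasonable choice in a sketch like this; a database leak then exposes digests that cannot practically be reversed.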

8. Deployment workflow

There is no GitOps. The current deployment loop:

  1. Operator edits code on their laptop, runs tests via make test.
  2. Operator commits + pushes to github.com/patrickattard/KVMFleet. CI runs the platform test suite, agent build matrix (amd64, arm64, arm/v7), web typecheck, MCP typecheck, security scans.
  3. Operator runs rsync from their laptop to /opt/kvmfleet/ on Hetzner (excluding node_modules, .git, build artefacts, .env, hero.mp4).
  4. Operator SSHes in and runs docker compose -f deploy/docker-compose.prod.yml build platform web-build, followed by docker compose -f deploy/docker-compose.prod.yml up -d --no-deps platform web-build caddy.
  5. Migrations run automatically as the platform container's startup command (alembic upgrade head then uvicorn app.main:app …).
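  6. The Caddyfile is bind-mounted into the Caddy container as a single file. rsync replaces it by writing a temporary file and renaming it over the original, which gives the path a new inode while the container's mount still references the old one. Caddy must therefore be restarted (not reloaded) whenever the Caddyfile changes.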
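
The inode behaviour behind step 6 can be reproduced without Docker. The sketch below is an illustration only, assuming a POSIX filesystem: an open file handle stands in for the container's single-file bind mount, and os.replace stands in for rsync's default write-then-rename. The function name and file contents are hypothetical.

```python
import os
import tempfile


def replace_by_rename_demo():
    """Replace a file by rename and report what a holder of the old inode sees."""
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "Caddyfile")
        with open(target, "w") as f:
            f.write("old config\n")
        old_inode = os.stat(target).st_ino

        # Stand-in for the container: an open handle is pinned to the inode,
        # not to the path, much like a single-file bind mount.
        pinned = open(target)

        # What rsync does by default: write a temporary file, then rename it
        # over the destination.
        tmp = os.path.join(directory, ".Caddyfile.tmp")
        with open(tmp, "w") as f:
            f.write("new config\n")
        os.replace(tmp, target)

        new_inode = os.stat(target).st_ino
        seen_by_holder = pinned.read()
        pinned.close()
        with open(target) as f:
            seen_by_path = f.read()
    return old_inode, new_inode, seen_by_holder, seen_by_path
```

The holder keeps reading the old content while anyone opening the path sees the new content, which is why a reload inside the running container would re-read the stale file.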

Agent builds run on the Hetzner host via a throwaway golang:1.24-alpine Docker container (make agent-linux-arm64 etc). Cross-compiles for amd64 / arm64 / arm-v7 with CGO_ENABLED=0. Binary size: ~10 MB stripped.


9. Backup and disaster recovery

Honest state: minimal.

  • deploy/backup.sh runs nightly via cron: pg_dump | gzip > /opt/kvmfleet/backups/kvmfleet-YYYY-MM-DD-HHMM.sql.gz. Keeps 7 days locally, prunes older.
  • No off-site copy. A disk failure on the CX22 is currently a total-loss event for billing+account data.
  • The audit chain is logically protected against tampering but physically unprotected against disk failure.
  • Source code lives on GitHub (off-site copy of every file except .env).
  • Agent binaries on PiKVMs are not backed up — they're regenerable from source.

Documented as an open security item in TODO. Planned fix: encrypt nightly dump with age, ship via restic to a Hetzner Storage Box (€3.45/mo for 1 TB).
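
The seven-day local retention can be sketched as a pure function over file names. deploy/backup.sh itself is a shell script whose contents are not shown here, so this Python version is only an assumption about the logic: it parses the kvmfleet-YYYY-MM-DD-HHMM.sql.gz naming scheme and returns the dumps older than the retention window. All function names are hypothetical.

```python
from datetime import datetime, timedelta

PREFIX = "kvmfleet-"
SUFFIX = ".sql.gz"
STAMP = "%Y-%m-%d-%H%M"


def backup_name(now: datetime) -> str:
    """Build a dump file name in the kvmfleet-YYYY-MM-DD-HHMM.sql.gz scheme."""
    return f"{PREFIX}{now.strftime(STAMP)}{SUFFIX}"


def to_prune(names, now: datetime, keep_days: int = 7):
    """Return the dump names whose embedded timestamp is older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    stale = []
    for name in names:
        if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
            continue  # ignore anything that is not one of our dumps
        try:
            taken = datetime.strptime(name[len(PREFIX):-len(SUFFIX)], STAMP)
        except ValueError:
            continue  # unparseable timestamp: leave the file alone
        if taken < cutoff:
            stale.append(name)
    return sorted(stale)
```

Deciding from the name rather than from the file's mtime keeps the rule stable if dumps are ever copied or restored, which would matter once an off-site copy exists.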


10. Observability

| Signal | Where it lands |
| --- | --- |
| Unhandled platform exceptions | platform_errors table + GlitchTip via sentry-sdk with PII scrubbing |
| Browser exceptions in the SPA | GlitchTip via @sentry/react |
| Agent log lines | systemd journal on each PiKVM |
| Docker container logs | docker compose logs on Hetzner |
| Audit-relevant business events | audit_events table |
| Heartbeat / device state | devices row + device_metrics time series |
| Stripe webhook deliveries | stripe_events + Stripe dashboard |
| Outbound email | Brevo dashboard |
| TURN relay traffic | coturn log to stdout |

Smoke tests run hourly via GitHub Actions cron (scripts/smoke.py): /healthz, /v1/billing/plans, login with bad credentials returns 401, landing page returns 200. On failure, GitHub Actions emails its failure notification to the operator.
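
A script of this kind could look roughly like the sketch below. It is not the contents of scripts/smoke.py: the login path, the request body, and the assumption that the landing page is served at the root of the same base URL are all hypothetical. The fetch function is injectable so the check logic can be exercised without network access.

```python
import urllib.error
import urllib.request

# Hypothetical login path; the document only says bad credentials return 401.
LOGIN_PATH = "/v1/auth/login"

# (method, path, expected HTTP status)
CHECKS = [
    ("GET", "/healthz", 200),
    ("GET", "/v1/billing/plans", 200),
    ("POST", LOGIN_PATH, 401),
    ("GET", "/", 200),  # landing page, assumed to live at the base URL
]


def http_status(method, url):
    """Return the HTTP status code, treating 4xx/5xx responses as data."""
    body = None
    if method == "POST":
        # Deliberately invalid credentials (hypothetical field names).
        body = b'{"email": "smoke@example.invalid", "password": "wrong"}'
    req = urllib.request.Request(
        url, data=body, method=method,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # connection-level errors still propagate


def run_smoke(base_url, fetch=http_status):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    for method, path, expected in CHECKS:
        got = fetch(method, base_url.rstrip("/") + path)
        if got != expected:
            failures.append(f"{method} {path}: expected {expected}, got {got}")
    return failures
```

A caller would exit non-zero when run_smoke returns a non-empty list, which is what makes the scheduled job fail and triggers the notification email.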

There is no centralised log aggregator (no ELK, no Loki). At current scale this is a deliberate "wait until pain emerges" decision.


11. Tech stack summary

| Layer | Choice | Why |
| --- | --- | --- |
| Backend | FastAPI + Python 3.12 | Async I/O for the WebSocket tunnel + agent fan-out; rich ecosystem for compliance / PDF / Stripe |
| ORM | SQLAlchemy 2.0 async + asyncpg | Mature async driver; FORCE RLS works cleanly with the per-request GUC pattern |
| Auth | python-jose for JWT, bcrypt for passwords, pyotp for TOTP | Standard primitives; PyJWT migration is on the security TODO |
| HTTP | uvicorn 0.30 with --proxy-headers --forwarded-allow-ips '*' | Behind Caddy; needs to trust X-Forwarded-* |
| DB | Postgres 16 | RLS, partial indexes, full-text search if needed later |
| Cache / dedup | Redis 7 | Presence TTL + dedup keys; no persistence needed |
| Frontend | React 18 + TypeScript + Tailwind + Vite | Single-page dashboard; small bundle (~250 KB gzipped after the recent compression pass) |
| Charts | Recharts 2.13 | Declarative API; ~40 KB gzipped |
| Charts dependencies | none added directly | We deliberately did not add D3 as a direct dependency, since Recharts wraps it |
| Agent | Go 1.24 + pion/webrtc/v4 | Pure Go; cross-compiles trivially; static binary; pion is the canonical pure-Go WebRTC stack |
| Static binary HTTP | net/http stdlib + nhooyr.io/websocket | nhooyr is more idiomatic than gorilla and supports HTTP/2 |
| MCP server | TypeScript + @modelcontextprotocol/sdk | Published as @kvmfleet/mcp on npmjs.com |
| Reverse proxy | Caddy 2 | Auto-TLS, simple Caddyfile, HTTP/3 by default |
| TURN | coturn 4.6 | Industry standard; --use-auth-secret obviates a per-user DB |
| Error tracker | GlitchTip 4.1 (Sentry-SDK compatible, EU-hosted, AGPL) | Sentry SaaS would be ~€26/mo and out-of-EU; GlitchTip self-host is free + EU-resident |
| Email | Brevo (transactional) + ImprovMX (inbound) | Both EU-resident; Brevo has a generous free tier; ImprovMX is free for ≤5 aliases |
| Billing | Stripe (Ireland) | Industry standard, EU-resident |
| Compliance reports | ReportLab | Pure-Python PDF generation; no headless Chrome |
| Tests | pytest + httpx + ephemeral Postgres | Async-native; 76 tests at last count |
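
The coturn row relies on the shared-secret ("TURN REST API") scheme: the username carries an expiry timestamp and the password is a base64-encoded HMAC-SHA1 of that username under the shared secret, so coturn can validate credentials without a user database. The sketch below shows how such credentials are derived. It is not the platform's code; the function name, user id format, and one-hour lifetime are hypothetical.

```python
import base64
import hashlib
import hmac
import time


def turn_credentials(shared_secret, user_id, ttl_seconds=3600, now=None):
    """Derive time-limited TURN credentials from the shared secret.

    Returns (username, credential) where username is "<expiry>:<user_id>".
    """
    issued = int(now if now is not None else time.time())
    username = f"{issued + ttl_seconds}:{user_id}"
    mac = hmac.new(
        shared_secret.encode("utf-8"),
        username.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    return username, base64.b64encode(mac).decode("ascii")
```

Since every credential is a function of the secret, changing TURN_SHARED_SECRET makes all previously issued credentials fail validation, which matches the rotation note in the secrets table.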

12. Open questions and likely future direction

These are not promises, just where the operator currently expects the architecture to evolve:

  • Horizontal scale: today's single CX22 handles a few hundred concurrent agents comfortably. Beyond that the agent registry needs to move from per-process in-memory to Redis (so multiple platform replicas can route to the same agent), and the Postgres connection pool needs sizing review. Documented in security-hardening TODO.
  • WebRTC video: Phase 2 (#106) wires kvmd's H.264 stream as an actual track. Once stable, the iframe console (routers/console.py + the URL-rewriting glue) is retired entirely, removing ~1000 LOC of brittle code.
  • Cross-KVM support via Redfish: the agent currently only knows kvmd. Adding a Redfish "translator" mode lets it bridge to Dell iDRAC, HPE iLO, Supermicro IPMI, etc. Turns KVM Fleet from "wrapper around hobbyist hardware" into "single pane of glass over every BMC the enterprise owns".
  • Multi-tenant / MSP mode: nested orgs so a consultant can manage many customer tenants from one account. Needs a parent_org_id column, a token claim for the active sub-tenant, and an org-switcher UI.
  • On-prem deployable bundle: a tarball with docker-compose.prod.yml + Caddyfile.template + offline licence-key check, for enterprise customers who can't accept SaaS.
  • Cross-customer telemetry intelligence: anonymised baselines on the customer's own dashboard ("your average uptime: 28 days, fleet p50: 43 days"). Requires a separate telemetry warehouse (probably ClickHouse), an opt-in flag, lawyer-reviewed copy on the privacy implications, and an aggregation function that enforces a minimum bucket size to prevent re-identification.
  • Hardware-bound audit-chain anchoring: HMAC-sign the daily audit-tail hash with a key the platform doesn't hold + anchor externally (S3 object-lock, Bitcoin testnet, etc.) so platform compromise can't silently change the official chain head.
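
The minimum-bucket-size rule mentioned for cross-customer telemetry could take a shape like the following sketch. Nothing here exists in the codebase: the function name, the tuple layout, and the threshold of five orgs are assumptions chosen for illustration.

```python
from statistics import median


def fleet_p50(samples, min_bucket=5):
    """Return the fleet-wide median uptime in days, or None if suppressed.

    samples: iterable of (org_id, opted_in, uptime_days) tuples, one value per org.
    The statistic is withheld unless at least min_bucket opted-in orgs contribute.
    """
    by_org = {org: days for org, opted_in, days in samples if opted_in}
    if len(by_org) < min_bucket:
        return None  # too few contributors: publishing could identify an org
    return median(by_org.values())
```

Counting distinct orgs rather than devices is the conservative choice here, since one large tenant could otherwise fill a bucket on its own.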

13. Glossary

  • PiKVM: an open-source KVM-over-IP appliance built on Raspberry Pi or compatible single-board computers. Provides remote video, keyboard/mouse, virtual mass storage, and ATX power control over IP. KVM Fleet's primary supported hardware.
  • kvmd: the daemon running on each PiKVM that exposes the KVM functions as an HTTP + WebSocket API. KVM Fleet's agent talks to kvmd locally on the device.
  • uStreamer: kvmd's video-source process. Provides MJPEG or H.264 over an internal interface.
  • agent: KVM Fleet's small Go binary that runs alongside kvmd on each device. The bridge between kvmd and the platform.
  • platform: KVM Fleet's central FastAPI service. Hosts the API, processes webhooks, mints tokens, runs the audit chain.
  • org: a single tenant. Every paying customer has one or more orgs.
  • operator: an authenticated user of the dashboard, with one of the four roles.
  • support admin: a user with is_support_admin = true. Can read tickets across orgs and access the WebRTC console preview. Currently only the platform operator's two personal accounts.
  • MCP: Model Context Protocol — Anthropic's standard for exposing tools to AI assistants. KVM Fleet ships an MCP server (@kvmfleet/mcp) that lets Claude Desktop and similar tools query fleet state in plain English.
  • RLS: Row-Level Security. The Postgres feature that enforces per-row visibility based on a session-level setting. KVM Fleet uses it as defence-in-depth for tenant isolation.
  • audit chain: the SHA-256 hash chain over audit_events rows that makes tampering with the log mathematically detectable.
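
The hash-chain idea in the last glossary entry can be shown in a few lines of Python. This is a generic sketch, not the platform's schema: the field names, the JSON canonicalisation, and the all-zero genesis value are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first row's predecessor


def event_hash(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain, payload):
    """Append an event, linking it to the hash of the row before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev,
                  "hash": event_hash(prev, payload)})


def first_broken_index(chain):
    """Return the index of the first row that fails verification, or None."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev_hash"] != prev or row["hash"] != event_hash(prev, row["payload"]):
            return i
        prev = row["hash"]
    return None
```

Editing any row changes its hash and breaks the link from the next row, so verification fails at the edited position. A chain that is rewritten wholesale and rehashed would still verify, which is why anchoring the chain head outside the platform matters.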