System architecture
This document is the single comprehensive view of how KVM Fleet is built, deployed and operated. It is intended for prospects with a security or platform team who want to understand the trust model, for auditors evaluating the compliance posture, and for future maintainers who need to navigate the codebase.
The companion documents in this section drill into specific subsystems:
- Threat model — adversarial view, comparison with VPN / exposed BMC / shared passwords
- WebSocket multiplexing — how the agent tunnel carries HTTP and WS sessions
- Audit chain — the SHA-256 hash chain on `audit_events`
- Row-Level Security — how Postgres RLS isolates tenants
At a glance
flowchart LR
subgraph operator["Operator (anywhere)"]
browser["Browser<br/>SPA + WebRTC"]
end
subgraph hetzner["Hetzner CX22 (Falkenstein, DE) — EU only"]
caddy["Caddy 2<br/>TLS 1.3 terminator"]
platform["FastAPI platform<br/>audit + RBAC + JIT + approval"]
pg[("Postgres 16<br/>RLS + SHA-256 audit chain")]
redis[("Redis 7<br/>presence + sessions")]
coturn["coturn<br/>TURN relay"]
end
subgraph customer["Customer network"]
agent["Go agent<br/>on PiKVM<br/>(outbound-only)"]
kvmd["kvmd<br/>localhost:80"]
bmc["Dell iDRAC / HPE iLO<br/>Supermicro / Lenovo XCC<br/>(Redfish-managed)"]
end
browser <==>|"HTTPS + WSS"| caddy
caddy --> platform
platform <--> pg
platform <--> redis
agent ==>|"outbound WSS<br/>(agent dials in)"| caddy
agent --> kvmd
platform -->|"Redfish HTTPS<br/>session or basic auth"| bmc
browser -. "WebRTC P2P<br/>DTLS-SRTP" .-> agent
browser -. "WebRTC fallback" .-> coturn
coturn -. "fallback only" .-> agent
Three trust domains: the operator's identity (Google SSO + 2FA), the platform (KVM Fleet — audit, authorisation, signaling), and the customer's BMC / PiKVM. The platform sits in the middle of the data path for everything except WebRTC media (which terminates DTLS-SRTP at the agent, peer-to-peer to the operator's browser). No inbound ports are opened on the customer side — the agent dials out over WSS. The full adversarial breakdown is in Threat model.
1. Product overview in one paragraph
KVM Fleet is a B2B SaaS that adds a fleet dashboard, browser-based remote console, tamper-evident audit log, role-based access control, Google SSO, alerting, ISO library, per-device monitoring and one-click compliance reports on top of customer-owned PiKVM hardware. A small Go agent (~10 MB static binary, arm/arm64/amd64) runs on each PiKVM and dials out to the platform over an authenticated WebSocket. From there an operator at app.kvmfleet.io gets unified control of every device, with all data-at-rest and most data-in-flight inside the European Union (Hetzner, Falkenstein).
2. Components
2.1 The marketing site (kvmfleet.io)
Static HTML/CSS, no JavaScript framework. Served by Caddy from deploy/landing/. Pages: /, /terms.html, /privacy.html, /dpa.html, plus /.well-known/security.txt, /robots.txt, /sitemap.xml, /llms.txt. Includes JSON-LD structured data (SoftwareApplication, Organization) so search engines and LLM crawlers can index the product correctly. Heavy use of CSS variables and a single embedded <style> block per page; no build step.
2.2 The dashboard SPA (app.kvmfleet.io)
React 18 + TypeScript + Tailwind, bundled by Vite. Built once per deploy by the web-build Docker service into a Caddy-served static volume. State is managed by @tanstack/react-query. Real-time updates use a WebSocket to the platform at /v1/ws for presence + heartbeat broadcasts. Charts use Recharts. WebRTC console uses the browser's native RTCPeerConnection. No SSR — fully client-side after the initial bundle download.
2.3 The platform (FastAPI, Python 3.12)
Single FastAPI process behind Caddy at app.kvmfleet.io/v1/*. Routers under platform/app/routers/:
| Router | Surface |
|---|---|
| `auth` | Local signup/login, password change/reset, refresh-token rotation |
| `sso` | Google OIDC SSO callback flow |
| `twofa` | TOTP enrolment, recovery codes, MFA challenge |
| `team` | Org member roster, invites, role changes, account deletion |
| `org` | Org-level settings (compliance frameworks, country) |
| `devices` | Device list, enrolment, rotate token, remove |
| `power` | ATX power control via the agent tunnel |
| `console` | The legacy iframe-tunnelled kvmd console |
| `webrtc` | Phase 1 WebRTC signalling (preview) |
| `isos` | ISO library catalogue + per-device mount/unmount |
| `support` | Customer-side support tickets |
| `admin` | Support-admin-only routes (errors, ticket cross-org view) |
| `audit` | Per-org audit event query + chain integrity check |
| `audit_webhooks` | Per-org SIEM webhook endpoints + dispatcher |
| `alerts` | Alert rules + history |
| `billing` | Stripe Checkout + Customer Portal + webhook + plan info |
| `reports` | PDF compliance report generation |
| `api_tokens` | Long-lived bearer tokens for the MCP server / scripts |
| `ws` | Operator presence stream + agent tunnel endpoint |
| `public` | Public, unauth endpoints (plans, beta status) |
| `downloads` | Signed agent binary downloads (pinned version) |
| `beta` | Private-beta gate + waitlist |
2.4 The agent (Go 1.24, pure Go, no CGO)
Single static binary installed at /usr/local/bin/kvmfleet-agent on the PiKVM. Configured via env vars (KVMFLEET_API, KVMFLEET_TOKEN, KVMFLEET_KVMD_USER, KVMFLEET_KVMD_PASS, etc.) and a small JSON state file at /var/lib/kvmfleet/state.json. Dials out to wss://app.kvmfleet.io/v1/agent/ws on boot, authenticates with the per-device token, then services multiplexed HTTP requests and WebSocket channels from the platform side.
Internal HTTP routes the agent exposes through the tunnel:
- `/health` — agent self-info, always reachable
- `/api/...` — proxied to local kvmd (browser console + power + ATX state)
- `/streamer/...` — proxied to local uStreamer (video frames for legacy console)
- `/extras/webterm/ttyd/...` — direct Unix-socket route to ttyd (serial console)
- `/internal/iso/{mount,unmount}` — ISO library handlers (download, sha256-verify, mount via kvmd MSD)
- `/internal/webrtc/offer` — pion-based WebRTC PeerConnection negotiation (HID DataChannel; video track in phase 2)
2.5 Postgres (16-alpine)
Single instance. Two databases on the same cluster:
| Database | Used by | Notes |
|---|---|---|
| `kvmfleet` | platform | All operational data |
| `glitchtip` | GlitchTip | Error tracker (separate schema) |
Roles:
| Role | Privilege | Used by |
|---|---|---|
| `kvmfleet` | superuser | Migrations, admin scripts, glitchtip's bootstrap |
| `kvmfleet_app` | NOSUPERUSER NOBYPASSRLS | The platform's runtime connection — RLS applies to it |
Most org-scoped tables have forced row-level security with an org_iso policy keyed on current_setting('app.current_org'). The platform sets that GUC on every authed request via set_org_context(session, org_id). See rls.md for the full table list.
2.6 Redis (7-alpine)
Two databases on one instance:
| DB | Used by |
|---|---|
| 0 | platform: agent presence, alert dedup, error-alert dedup |
| 2 | GlitchTip: Celery broker + result backend |
No persistence enabled — all keyed data is short-lived (presence TTL 30s, dedup TTL 1h). A Redis crash loses presence state which the next agent heartbeat repopulates.
2.7 Caddy 2 (TLS terminator + static server + reverse proxy)
Single Caddy process serves four virtual hosts:
| Host | Backed by | Purpose |
|---|---|---|
| `kvmfleet.io` | static `/srv/landing` | Marketing pages |
| `kvmfleet.io/docs/` | static `/srv/docs` | This documentation site |
| `app.kvmfleet.io` | reverse-proxy `platform:8000` for `/v1/*`, static `/srv/web` for the SPA | Operator dashboard + API |
| `errors.kvmfleet.io` | reverse-proxy `glitchtip:8000` | Internal error tracker |
In addition, the legacy eurokvm.io and app.eurokvm.io hosts 308-redirect to the new domain.
Caddy auto-provisions TLS certs via Let's Encrypt, sets HSTS preload, strict CSP per host, long-lived Cache-Control: immutable on hashed assets, short cache on HTML.
2.8 coturn (TURN server for WebRTC)
Runs on the Hetzner host network (not in a docker bridge — TURN's UDP relay range needs raw access). Listens on UDP/TCP 3478 (STUN/TURN) and 5349 (TLS), with a relay range of 49152-49200. Uses --use-auth-secret so the platform mints ephemeral 24h credentials without a per-user database. Only used as a WebRTC fallback; most connections traverse direct host or STUN-discovered paths.
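A minimal sketch of how ephemeral TURN credentials are commonly derived under coturn's shared-secret mode: the username carries an expiry timestamp and the password is the base64-encoded HMAC-SHA1 of that username. The function name `mint_turn_credentials`, the `expiry:user_id` username layout and the returned dictionary shape are assumptions for illustration, not the platform's actual code.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def mint_turn_credentials(shared_secret: str, user_id: str,
                          ttl_s: int = 24 * 3600,
                          now: Optional[float] = None) -> dict:
    """Derive a time-limited TURN username/password from the shared secret."""
    issued_at = time.time() if now is None else now
    expiry = int(issued_at) + ttl_s
    # The TURN server reads the expiry timestamp from the part before the colon.
    username = f"{expiry}:{user_id}"
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    return {
        "username": username,
        "credential": base64.b64encode(digest).decode(),
        "ttl": ttl_s,
    }


if __name__ == "__main__":
    # Hypothetical secret and operator id, for demonstration only.
    print(mint_turn_credentials("example-shared-secret", "operator-1"))
```

Because the TURN server recomputes the same HMAC from the username it receives, no per-user credential table is needed; rotating the shared secret invalidates every outstanding credential at once.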
2.9 GlitchTip (error tracker, Sentry-SDK compatible)
Self-hosted, EU-resident replacement for Sentry. Three docker services (glitchtip-migrate, glitchtip, glitchtip-worker) sharing the cluster's Postgres + Redis. Public registration disabled (ENABLE_USER_REGISTRATION=false). The platform's app/observability.py initialises sentry-sdk against it; the SPA initialises @sentry/react. Aggressive PII scrubbing in before_send strips JWTs, refresh tokens, Stripe keys, sensitive headers and request bodies before any event leaves the platform.
2.10 External services
| Service | Used for | Region |
|---|---|---|
| Stripe Payments Europe Ltd. | Checkout, Customer Portal, recurring subscriptions, dunning | Ireland |
| Brevo (ex-Sendinblue) | Outbound transactional email (welcome, invite, password reset, alerts, signup notification) | France |
| ImprovMX | Inbound email forwarding for `*@kvmfleet.io` aliases (privacy@, security@, support@, etc.) | Belgium |
| Google Cloud Identity Platform | OIDC SSO | Multi-region; we use the EU OAuth client |
| Hetzner Online GmbH | Compute (CX22), DNS lookups, network egress | Germany (Falkenstein) |
| INWX | Domain registrar + authoritative DNS for `kvmfleet.io` | Germany |
| Cloudflare STUN | NAT discovery for WebRTC (`stun.cloudflare.com:3478`) | global anycast |
| npmjs.com | Distribution for `@kvmfleet/mcp` | US (read-only on customer side) |
Each is disclosed as a sub-processor in the DPA.
3. Hosting topology
Hetzner CX22 (FSN1, Falkenstein DE)
┌─────────────────────────────────────────────────┐
│ │
│ ┌────────────┐ ┌────────────┐ ┌───────────┐ │
│ │ Caddy 2 │ │ platform │ │ glitchtip │ │
│ │ TLS / SNI │ │ FastAPI │ │ + worker │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬─────┘ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌────────────────┐ │ │
│ │ │ Postgres 16 │◄──────┘ │
│ │ │ + Redis 7 │ │
│ │ └────────────────┘ │
│ │ │
│ ┌─────┴──────┐ ┌────────────┐ │
│ │ web-build │ │ coturn │ │
│ │ (one-shot) │ │ net=host │ │
│ └────────────┘ └────────────┘ │
│ │
└──────────────┬───────────────────────────────────┘
│
│ public IP 46.225.227.71
│ TCP 443 (TLS), 80 (LE), 22 (ssh)
│ UDP 3478, 5349, 49152-49200 (TURN)
▼
┌────────────────────────────┐
│ THE INTERNET │
└────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌─────────────────┐ ┌───────────┐
│ Operator │ │ PiKVM agent │ │ Customer │
│ browser │ │ (anywhere) │ │ stripe / │
│ (anywhere) │ │ outbound WSS │ │ google │
└──────────────┘ └─────────────────┘ │ callbacks │
└───────────┘
One CX22 (€3.99/mo, 2 vCPU, 4 GB RAM, 40 GB NVMe) hosts everything: Caddy, the platform, Postgres, Redis, GlitchTip, coturn, the static SPA bundle, the docs site, the marketing site. There is no horizontal scaling today; that's a deliberate choice for a pre-revenue product. The path to scale is documented in the open-questions section at the bottom.
4. Walkthroughs of the major data flows
4.1 Operator login (local credentials)
- Browser POSTs `/v1/auth/login` with `{email, password}`.
- Platform looks up the user (RLS off — `app.current_org` not yet set), bcrypt-verifies the password, checks account lockout state.
- If TOTP is enabled, returns a short-lived `mfa_token` (5 min, audience `mfa`); otherwise issues an access JWT (15 min, audience `access`) + a refresh token (random opaque `ekvrf_…` string, sha256-hashed and stored in `refresh_tokens`).
- Refresh token returned via HttpOnly Secure SameSite=Strict cookie scoped to `/v1/auth`. Access token returned in JSON body.
- SPA stores the access token in memory and sends it as `Authorization: Bearer …` on every API call.
Refresh: SPA POSTs /v1/auth/refresh (cookie auto-attached). Platform sha256-hashes the cookie value, looks up the row, checks for prior reuse (returns 401 + revokes the entire family if found), rotates: marks the current row replaced, issues a new pair.
4.2 Google SSO
- Browser navigates to `/v1/auth/google` → Authlib redirects to Google with the configured `redirect_uri`.
- Google authenticates the user, redirects back to `/v1/auth/google/callback?code=…`.
- Platform exchanges the code for an ID token, validates `email_verified=true`, looks up or creates the user. If a non-expired invite exists for that email, joins them to the inviter's org; otherwise creates a fresh personal org.
- Issues access + refresh tokens (same as local login). Redirects to `https://app.kvmfleet.io/login#access_token=…` so the SPA picks them up.
- New users trigger a one-shot `AcceptTerms` gate before they can use the dashboard (Terms + B2B confirmation).
4.3 Device enrolment
- Operator clicks "Add device" in the SPA → POST `/v1/devices/enrollment` with optional `suggested_name`. Platform issues an `enrollment_token` (random, 24h TTL) bound to the org.
- SPA shows the operator a one-line install command: `curl -sSL https://app.kvmfleet.io/install | KVMFLEET_TOKEN=<plaintext> sh`.
- Operator runs the command on the PiKVM via SSH. The install script downloads the agent binary from `/downloads/kvmfleet-agent.linux-arm64`, writes a systemd unit, sets `KVMFLEET_TOKEN` in `/etc/kvmfleet/agent.env`, starts the service.
- Agent boots, POSTs `/v1/agent/register` with the enrollment token + hardware ID + agent version + kvmd version. Platform consumes the enrollment token, creates a `devices` row, returns a long-lived per-device `auth_token` (sha256-hashed at rest in `devices.auth_token_hash`).
- Agent persists the `auth_token` to `/var/lib/kvmfleet/state.json` and switches to the long-lived authenticated WebSocket: `wss://app.kvmfleet.io/v1/agent/ws?token=<auth_token>`.
- Heartbeats every 30s update `devices.last_seen_at`, `cpu_temp_c`, `uptime_s`, and append a `device_metrics` row for the historical-data graph.
4.4 Browser console (legacy iframe path)
- Operator clicks "Console" → SPA POSTs `/v1/console/start` for a console_session row + a 60-min `aud=console` JWT in an HttpOnly cookie scoped to `/v1/devices/{id}/console/`.
- SPA opens an `<iframe>` to `https://app.kvmfleet.io/v1/devices/{id}/console/` (which serves the kvmd web UI through the proxy).
- Every iframe HTTP request hits the platform's `routers/console.py` proxy. The proxy:
    - Validates the console JWT cookie and ACL-checks the device against the org
    - Strips kvmfleet's own cookies from the outbound `Cookie` header
    - Path-allowlists requests (kvmd's `/api/`, `/streamer/`, `/share/`, `/login`, `/logout`, `/auth/`)
    - Rewrites response HTML/JS so resource URLs resolve under `/v1/devices/{id}/console/…`
    - Rewrites Set-Cookie attributes (Path, Domain, Secure)
- Platform forwards the request through the agent's WebSocket tunnel (multiplexed channel; see ws-multiplex.md). Agent serves the request locally from kvmd, returns the response.
- WebSocket subscriptions for the live video stream use a multiplexed WS channel through the same tunnel, with extra Origin checks at handshake.
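Two of the proxy checks listed above, the path allowlist and the stripping of the platform's own cookies, can be sketched as follows. This is an illustration under assumptions: the helper names, the `kvmfleet_` cookie-name prefix and the normalisation strategy are hypothetical, and percent-decoding of the incoming path is assumed to have happened already.

```python
import posixpath

# Allowlisted kvmd prefixes, taken from the list above.
ALLOWED_PREFIXES = ("/api/", "/streamer/", "/share/", "/login", "/logout", "/auth/")


def is_allowed_path(path: str) -> bool:
    """Return True when the normalised path falls under an allowlisted prefix."""
    # Collapse "..", "." and duplicate slashes before comparing.
    normalized = posixpath.normpath("/" + path.lstrip("/"))
    for prefix in ALLOWED_PREFIXES:
        base = prefix.rstrip("/")
        if normalized == base or normalized.startswith(base + "/"):
            return True
    return False


def strip_platform_cookies(cookie_header: str, platform_prefix: str = "kvmfleet_") -> str:
    """Drop cookies whose name starts with the (hypothetical) platform prefix."""
    kept = []
    for part in cookie_header.split(";"):
        part = part.strip()
        if not part:
            continue
        name = part.split("=", 1)[0].strip()
        if not name.startswith(platform_prefix):
            kept.append(part)
    return "; ".join(kept)


if __name__ == "__main__":
    print(is_allowed_path("/streamer/stream"))
    print(strip_platform_cookies("kvmfleet_console=abc; auth_token=xyz"))
```

Normalising before matching matters: a request for `/api/../etc/passwd` would pass a naive prefix check but resolves outside the allowlist.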
This whole path is being retired in favour of the WebRTC console (next section).
4.5 Browser console (WebRTC, preview, support-admin only)
- Operator clicks "Console (RTC)". SPA fetches `/v1/devices/{id}/webrtc/ice-config` to get STUN + ephemeral TURN credentials.
- SPA creates an `RTCPeerConnection` with those ICE servers, attaches a recvonly video transceiver and an unordered HID DataChannel.
- SPA generates an SDP offer, waits 2s for ICE gathering, POSTs the offer to `/v1/devices/{id}/webrtc/offer`.
- Platform proxies the offer through the agent's WebSocket tunnel to the agent's `/internal/webrtc/offer`.
- Agent (pion v4) accepts the offer, builds its own PeerConnection with the same ICE servers, creates an SDP answer, gathers ICE for up to 5s, returns the answer JSON.
- SPA applies the remote description. ICE traversal runs concurrently on both sides — direct host pair, STUN-discovered pair, or TURN-relayed pair (whichever wins).
- After ICE settles, DTLS-SRTP handshake runs end-to-end between browser and agent. The platform plays no role from this point on.
- HID events flow as JSON strings over the DataChannel. The agent forwards each one as an HTTP POST to kvmd's `/api/hid/events/send_…` endpoints using the same Basic auth and session cookies as the rest of the proxy.
- Video track is currently a placeholder (recvonly, no source bound) — Phase 2 wires it to kvmd's H.264 stream (`/streamer/stream`).
The signalling pipe is platform-mediated; the media pipe is end-to-end DTLS-SRTP encrypted. The platform sees SDPs (~10 KB) and ICE candidate metadata; it cannot decrypt media.
4.6 ISO mount
- Operator registers an ISO at `app.kvmfleet.io/isos`: name, source URL (HTTPS), sha256, optional size. Stored in the `isos` table with FORCE RLS.
- Operator opens a device, picks an ISO, clicks "Mount" → POST `/v1/devices/{id}/iso:mount`. Platform audit-logs before firing, rate-limits to 1 mount per device per hour.
- Platform forwards `{name, source_url, sha256, size_bytes, media_type}` to the agent's `/internal/iso/mount`.
- Agent streams the URL to a tmp file with concurrent SHA-256 hashing. Refuses to proceed on hash or size mismatch.
- Agent disconnects any existing MSD, multipart-uploads to kvmd's `/api/msd/write?image=<name>`, calls `/api/msd/set_params` and `/api/msd/set_connected?connected=1`.
- ISO bytes never touch the platform, so platform egress stays flat and the GDPR posture stays clean.
4.7 Compliance report generation
- Operator picks a framework (NIS2, GDPR, SOC 2, ISO 27001, HIPAA, NIST 800-53, PCI DSS, ISO 9001) → POST `/v1/reports/{framework}` with optional `from_date` / `to_date`.
- Platform's `_collect_evidence` runs a single async pass that queries device counts, audit-event counts by action, hash-chain integrity, member roster, failed logins, 2FA failures, password resets, refresh-token reuse and other org-scoped facts.
- The framework spec maps each control to an evaluator function that classifies the control as `COVERED` / `PARTIAL` / `NOT COVERED` / `N/A` based on the evidence.
- ReportLab renders the PDF: cover sheet, executive summary with coverage roll-up, per-control table with status badges, framework-specific sections (Art. 30 ROPA template for GDPR, CUECs for SOC 2, Annex A theme breakdown for ISO 27001, etc.), per-framework "this is not an attestation" disclaimer.
- PDF returned with `Content-Disposition: attachment; filename=kvmfleet-{framework}-{from}-{to}.pdf`. Audit row written.
4.8 Audit log + tamper-evidence
Every meaningful action (login, console open, power cycle, ISO mount, settings change, role change, deletion, etc.) writes a row to audit_events. Each row carries the SHA-256 hash of the previous row's hash plus the canonical-JSON of its own payload — a cryptographic chain. See audit-chain.md.
The table is append-only at the DB level:
- `REVOKE UPDATE, DELETE, TRUNCATE ON audit_events FROM kvmfleet_app` so the runtime role can't tamper with history.
- A `BEFORE UPDATE OR DELETE` trigger raises an exception even from a superuser, defending against accidents.
- Account deletion's pseudonymisation pass uses a SAVEPOINT that's expected to fail; the deletion still proceeds, the historical email stays in the audit log, the Privacy Policy carves this out under the immutable-log clause.
Operators verify the chain via GET /v1/audit/integrity, which re-walks the org's events and returns {ok: true, checked: N} or {ok: false, first_break_id: …, message: …}. The verification is also exposed through the MCP verify_audit_integrity tool.
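The chaining and verification described in this section can be sketched in a few lines. The details here are assumptions for illustration: the all-zero genesis hash, the exact canonical-JSON settings, the concatenation order and the in-memory list standing in for `audit_events` may all differ from the real implementation documented in audit-chain.md.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous row"


def canonical_json(payload: dict) -> str:
    """Serialise with sorted keys and no whitespace so hashing is deterministic."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def row_hash(prev_hash: str, payload: dict) -> str:
    return hashlib.sha256((prev_hash + canonical_json(payload)).encode("utf-8")).hexdigest()


def append_event(chain: list, payload: dict) -> dict:
    """Append a row whose hash commits to the previous row's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    row = {"id": len(chain) + 1, "payload": payload,
           "prev_hash": prev, "hash": row_hash(prev, payload)}
    chain.append(row)
    return row


def verify_chain(chain: list) -> dict:
    """Re-walk the chain and report the first row that no longer matches."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return {"ok": False, "first_break_id": row["id"], "message": "hash mismatch"}
        prev = row["hash"]
    return {"ok": True, "checked": len(chain)}


if __name__ == "__main__":
    events = []
    append_event(events, {"action": "login", "user": "a@example.com"})
    append_event(events, {"action": "power.cycle", "device": "d1"})
    print(verify_chain(events))
```

Because every hash depends on its predecessor, editing any historical payload changes that row's expected hash and the check fails at that row.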
Optionally, an org can configure SIEM webhooks (audit_webhooks table). An async dispatcher fans out HMAC-SHA256-signed POSTs (X-KVMFleet-Signature: sha256=…) to each registered URL. After 10 consecutive failures the webhook auto-disables.
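A receiving SIEM would typically check the signature header before trusting a delivery. The sketch below assumes the signature is computed over the raw request body with the per-webhook secret; that detail and the helper names are assumptions, while the `sha256=` prefix follows the header format quoted above.

```python
import hashlib
import hmac
import json


def sign_body(secret: bytes, body: bytes) -> str:
    """Build the header value: 'sha256=' followed by the hex HMAC of the body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Constant-time comparison of the expected and received header values."""
    return hmac.compare_digest(sign_body(secret, body), header_value)


if __name__ == "__main__":
    secret = b"example-webhook-secret"  # hypothetical value
    body = json.dumps({"action": "console.open", "device": "d1"}).encode()
    header = sign_body(secret, body)
    print(header, verify_signature(secret, body, header))
```

Verification must run against the exact bytes received; re-serialising the parsed JSON can reorder keys or change whitespace and break the comparison.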
4.9 Per-device monitoring
- Every agent heartbeat (~30s) writes a `device_metrics` row: `(device_id, org_id, ts, cpu_temp_c, uptime_s, online)`.
- SPA's `DeviceMonitoring` page calls `GET /v1/devices/{id}/metrics?hours=&max_points=` for the selected range.
- Platform returns raw rows when count ≤ `max_points`; otherwise downsamples server-side via `epoch / bucket_seconds` grouping with `AVG()` aggregation, so the chart never has to render more than ~200 points regardless of range.
- Recharts renders CPU temp + uptime line charts with hover tooltips and adaptive x-axis tick formatting (HH:MM for ≤24h, day+time for ≤7d, date for >7d).
4.10 Stripe billing
- Operator clicks "Upgrade" → POST `/v1/billing/checkout {plan}`. Platform creates a Stripe Checkout session with the price id mapped from the plan, sets metadata `{org_id, plan}`, returns the Checkout URL.
- Stripe collects payment, redirects back to the SPA. Platform receives `checkout.session.completed` at `/v1/billing/webhook`.
- Webhook is HMAC-verified against `STRIPE_WEBHOOK_SECRET`. Idempotency: every event_id is inserted into `stripe_events` with `ON CONFLICT DO NOTHING`; duplicate posts return `{duplicate: true}` without re-processing.
- On `checkout.session.completed` the platform flips `org.plan` + `org.max_devices` and stores `stripe_subscription_id`. On `invoice.payment_failed` it sets `payment_failed_at` (starting the 7-day dunning clock). On `invoice.paid` or `invoice.payment_succeeded` it clears `payment_failed_at`. On `customer.subscription.deleted` it downgrades to `free` and clears all billing fields.
- A janitor sweeps every 60s: orgs with `payment_failed_at < NOW() - 7 days` get auto-downgraded to free.
Soft enforcement: an over-cap org's existing devices keep working; new enrolment returns 402 Payment Required.
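The webhook idempotency step can be demonstrated with an in-memory SQLite database standing in for Postgres. The table layout, the `handle_event` helper and the `{"ok": True}` response for first-time events are assumptions; only the `ON CONFLICT DO NOTHING` insert and the `{duplicate: true}` response come from the flow above.

```python
import sqlite3
from typing import Callable


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stripe_events (event_id TEXT PRIMARY KEY, type TEXT NOT NULL)")
    return conn


def handle_event(conn: sqlite3.Connection, event: dict,
                 apply: Callable[[dict], None]) -> dict:
    """Record the event id first; only process events seen for the first time."""
    cur = conn.execute(
        "INSERT INTO stripe_events (event_id, type) VALUES (?, ?) ON CONFLICT DO NOTHING",
        (event["id"], event["type"]),
    )
    if cur.rowcount == 0:
        # The row already existed, so this delivery is a retry.
        return {"duplicate": True}
    apply(event)
    conn.commit()
    return {"ok": True}


if __name__ == "__main__":
    db = make_db()
    processed = []
    evt = {"id": "evt_example", "type": "checkout.session.completed"}
    print(handle_event(db, evt, processed.append))
    print(handle_event(db, evt, processed.append))
```

Letting the primary key arbitrate duplicates avoids a separate "have I seen this?" query and the race window that would come with it.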
5. Authentication and authorization
5.1 Token types and audiences
Every JWT carries an explicit aud claim. Decoding always passes the expected audience, so a console token can never be replayed where an access token is expected.
| Audience | TTL | Use |
|---|---|---|
| `access` | 15 min | API calls from the SPA |
| `refresh` | 7 days | Mint new access tokens (HttpOnly cookie) |
| `mfa` | 5 min | Carry user identity through TOTP challenge |
| `reset` | 1 hour | Password reset flow (also legacy; new flow uses opaque tokens in password_resets) |
| `console` | 60 min | Iframe console session (HttpOnly cookie scoped to /v1/devices/{id}/console/) |
| `api` | configurable (no expiry / 30/90/180/365 days) | Long-lived programmatic tokens prefixed kvmf_…, sha256-hashed in api_tokens |
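To show why an explicit audience stops cross-use of tokens, here is a standard-library-only sketch of an HS256 token with an `aud` check on decode. It is illustrative: the use of HS256, the claim set and the function names are assumptions, and a production service would normally rely on a maintained JWT library rather than hand-rolled encoding.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


class TokenError(Exception):
    pass


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def encode_token(secret: bytes, sub: str, aud: str, ttl_s: int,
                 now: Optional[float] = None) -> str:
    issued = int(time.time() if now is None else now)
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64url(json.dumps({"sub": sub, "aud": aud, "exp": issued + ttl_s}).encode())
    sig = hmac.new(secret, f"{header}.{claims}".encode(), hashlib.sha256).digest()
    return f"{header}.{claims}.{_b64url(sig)}"


def decode_token(secret: bytes, token: str, expected_aud: str,
                 now: Optional[float] = None) -> dict:
    current = int(time.time() if now is None else now)
    parts = token.split(".")
    if len(parts) != 3:
        raise TokenError("malformed token")
    header, claims_b64, sig = parts
    expected = hmac.new(secret, f"{header}.{claims_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig)):
        raise TokenError("bad signature")
    if json.loads(_b64url_decode(header)).get("alg") != "HS256":
        raise TokenError("unexpected algorithm")
    claims = json.loads(_b64url_decode(claims_b64))
    # The caller always states which audience it expects; anything else is rejected.
    if claims.get("aud") != expected_aud:
        raise TokenError("wrong audience")
    if claims.get("exp", 0) <= current:
        raise TokenError("expired")
    return claims
```

With this shape, a validly signed `console` token presented to an endpoint that decodes with `expected_aud="access"` fails even though the signature checks out.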
5.2 RBAC
Four roles per OrgUser:
| Role | Powers |
|---|---|
| `org_admin` | Everything: members, billing, devices, settings |
| `operator` | Console + power + ISO mount + audit-read |
| `auditor` | Read-only access to audit, sessions, devices |
| `read_only` | List-only access to devices and audit |
Plus a global users.is_support_admin boolean, currently true only for the platform operator's two personal accounts. Gates the cross-tenant admin surface (/v1/admin/*) and the WebRTC preview button.
5.3 Tenant isolation (RLS)
Every org-scoped table enables FORCE ROW LEVEL SECURITY with an org_iso policy. The runtime role kvmfleet_app has NOSUPERUSER NOBYPASSRLS so the policy can't be bypassed by setting row_security = off at session level.
Each authenticated request flows through authed_principal(), which calls set_org_context(session, org_id) to set the app.current_org GUC. Policies key off current_setting('app.current_org', true)::uuid. A handful of "system mode" tables (e.g. devices) treat the unset GUC as the zero UUID and allow all rows — needed for the agent heartbeat path which has no org context.
Cross-tenant admin reads (e.g. the support-admin tickets view) flip a separate app.bypass_rls GUC inside an explicit helper, with the policy's OR clause checking it. See rls.md for the table list and policy text.
5.4 Refresh-token rotation chain
refresh_tokens rows are linked into a tree by parent_id. On every successful refresh:
- The presented row's `revoked_at` is set to NOW.
- A new row is inserted with `parent_id = old.id` and a fresh hash.
- The new plaintext goes back to the client in the cookie.
If a previously-replaced token is presented again (i.e. theft + replay), the entire family (every row sharing the root parent) is revoked atomically and the user is forced to log in. Audit row written for forensics.
6. Trust model
KVM Fleet is a centralised SaaS. The platform is in the data path for almost everything except the new WebRTC media flow.
What the platform sees:
- Operator email addresses, display names, hashed passwords (bcrypt cost 12), TOTP secrets (plaintext in DB), bcrypt-hashed recovery codes
- Device names, hardware IDs, agent versions, last-seen timestamps, CPU temp + uptime time-series
- Console session metadata (start/end, viewer, device)
- Every audit-event row (action, target, result, IP, timestamp)
- Live console traffic (legacy iframe path) — every video frame and every keystroke pass through the platform
- ISO catalogue metadata (name, URL, sha256) — but never the bytes of the ISOs themselves
- Stripe customer IDs, subscription IDs, plan, billing-failed timestamps — never card numbers
- For the WebRTC console: SDP offers/answers (~10 KB each), ICE candidate metadata; not the media bytes (DTLS-SRTP encrypted end-to-end)
What the platform does not see:
- kvmd's actual administrator password — set at agent install time via `KVMFLEET_KVMD_PASS`, known only to the agent on disk in `/etc/kvmfleet/agent.env` (mode 0600)
- Plaintext payment cards — handled entirely by Stripe
- Plaintext refresh tokens — only sha256 hashes after the response leaves
- Anything inside the target server itself — the OS the operator is managing has no agent, no introspection, no telemetry
What an attacker who fully compromises the platform can and cannot do:
- Mint JWTs for any user (they have `JWT_SECRET`)
- Inject themselves into any active console session — see the screen, send keystrokes
- Push commands through every connected agent's tunnel (e.g. mount a malicious ISO, power-cycle a server)
- Read every audit row, every email, every device name, every console-session metadata
- Stop new audit rows from being written (existing chain stays tamper-evident; gaps would be visible to an integrity check)
- Cannot decrypt past WebRTC sessions retroactively (we don't record media)
- Cannot silently modify past audit records (the SHA-256 chain breaks visibly under any mutation)
This is the same trust model as any centralised IT-management SaaS. The honest framing is "you trust us with live device control + audit history; you don't trust us with the OS inside the target server, its credentials, or its data".
Hardening currently in place:
- bcrypt cost 12 on passwords + recovery codes
- Short-lived access JWTs (15 min) with explicit audience claims
- Refresh-token rotation with reuse detection + family revocation
- Audit chain immutability enforced at the DB role layer + a refusal trigger
- Postgres FORCE RLS so unscoped queries return zero rows
- Console-token TTL 60 min, scoped to a single device path
- Per-device rate limits on power (60s) and ISO mount (1h)
- Containers run as non-root, capability-dropped, no-new-privileges
- Strict CSP per host, HSTS preload, signed signup acceptance + B2B confirmation
- Stripe webhook idempotency keyed by `event_id`
- Sub-resource error capture via GlitchTip with PII scrubbing in `before_send`
Open hardening still pending (tracked in TODO):
- Sign agent binaries (currently `curl | sh` with no signature/checksum)
- Pin the platform's TLS cert SPKI inside the agent (defends against MitM via a different valid LE issuance)
- Pin kvmd's self-signed cert from the agent (currently `InsecureSkipVerify: true` for the local kvmd)
- Mirror `JWT_SECRET` + DB password + Brevo + Stripe + Google secrets to a real secret store (currently in `/opt/kvmfleet/deploy/.env` plaintext, root-owned)
- Encrypt backups with `age` + ship to off-site Hetzner Storage Box (currently plain `.sql.gz` on the same disk as the DB)
- mTLS the agent → platform WebSocket with per-device certs (limits damage if platform is compromised but attacker doesn't have device key)
7. Storage
7.1 Postgres tables (one-line summaries)
| Table | Purpose |
|---|---|
| `users` | Account identity, password hash, TOTP, recovery codes, support-admin flag |
| `orgs` | Tenant root; plan, max_devices, billing state, compliance frameworks |
| `org_users` | Membership join with role + optional expires_at for time-limited contractors |
| `invites` | Pending team invites with role + expires_at + (optional) membership_expires_at |
| `devices` | Enrolled PiKVMs; auth_token_hash, last_seen, latest cpu_temp + uptime |
| `device_metrics` | Heartbeat-driven time-series for trend graphs |
| `enrollment_tokens` | One-shot tokens consumed by agent/register |
| `audit_events` | SHA-256 hash-chained immutable log |
| `audit_webhooks` | Per-org SIEM destinations + secret + auto-disable counter |
| `console_sessions` | Open + historical console windows with viewer + device |
| `support_tickets` + `support_ticket_messages` | Customer-side ticket inbox with admin bypass clause |
| `alert_rules` + `alert_history` | User-defined alerting + dedup history |
| `refresh_tokens` | Rotation chain with parent_id + revoked_at |
| `password_resets` | Single-use opaque tokens, 15 min TTL |
| `stripe_events` | Webhook idempotency log |
| `beta_waitlist` | Captured emails from the marketing landing |
| `platform_errors` | ErrorCaptureMiddleware writes here for the support-admin Errors page |
| `api_tokens` | Long-lived `kvmf_*` tokens for MCP / scripts (sha256-hashed) |
| `isos` | ISO library catalogue |
7.2 File / volume layout on the server
/opt/kvmfleet/ # rsync target from operator's laptop
agent/ # Go source (not the binary)
bin/ # built agent binaries (per-arch)
deploy/
.env # production secrets (mode 0600)
docker-compose.prod.yml
Caddyfile
landing/ # served by Caddy at kvmfleet.io
hero.mp4 # 832 KB compressed; original kept as hero-original.mp4
fonts/ # Alliance No.1 will land here
setup.sh # one-shot bootstrap script
backup.sh # nightly pg_dump
admin.sh # break-glass CLI
platform/ # FastAPI source
web/ # SPA source + public/
mcp/ # @kvmfleet/mcp source
docs/ # mkdocs source
postgres-init/ # init scripts for the postgres container
/var/lib/docker/volumes/ # docker-managed volumes
deploy_pgdata # main + glitchtip databases
deploy_caddy_data # ACME state, issued certs
deploy_caddy_config # caddy state
deploy_web_build # built SPA (consumed by caddy)
deploy_docs_build # built docs (consumed by caddy)
deploy_glitchtip_uploads # GlitchTip user-uploaded source maps
On each PiKVM:
/usr/local/bin/kvmfleet-agent # binary (arm/arm64), mode 0755
/etc/kvmfleet/agent.env # KVMFLEET_TOKEN, KVMFLEET_KVMD_PASS, mode 0600
/var/lib/kvmfleet/state.json # device id + auth_token, persisted across reboots
/etc/systemd/system/kvmfleet-agent.service
7.3 Secrets and where they live
| Secret | Format | Stored | Rotation |
|---|---|---|---|
| `JWT_SECRET` | 48-byte hex | `/opt/kvmfleet/deploy/.env` | manual; new value invalidates all tokens |
| `SESSION_SECRET` | 48-byte hex | `/opt/kvmfleet/deploy/.env` | manual |
| Postgres `kvmfleet` + `kvmfleet_app` passwords | 32-byte hex | `/opt/kvmfleet/deploy/.env`, `postgres-init/` | manual |
| Stripe live secret + webhook secret | as Stripe issues | `.env` | rotated at Stripe side, mirror here |
| Brevo API key | as Brevo issues | `.env` | rotated at Brevo |
| Google OAuth client secret | as Google issues | `.env` | rotated at Google Cloud Console |
| `GLITCHTIP_SECRET_KEY` | 48-byte hex | `.env` | manual |
| `TURN_SHARED_SECRET` | 32-byte hex | `.env` | manual; rotation invalidates all in-flight TURN credentials |
| Per-device `auth_token` | random with `ekvt_` prefix | DB hash, agent on-disk plaintext at `/var/lib/kvmfleet/state.json` (mode 0600) | per-device endpoint `/v1/devices/{id}/rotate-token` |
| User refresh tokens | random with `ekvrf_` prefix | DB hash, browser HttpOnly cookie | rotated on every refresh |
| User API tokens | random with `kvmf_` prefix | DB hash, user copies plaintext on creation | revoked via Account → API tokens |
| `SENTRY_DSN`, `SENTRY_DSN_WEB` | URLs | `.env` (platform), baked into SPA bundle (web) | manual via GlitchTip UI |
8. Deployment workflow
There is no GitOps. The current deployment loop:
- Operator edits code on their laptop, runs tests via make test.
- Operator commits + pushes to github.com/patrickattard/KVMFleet. CI runs the platform test suite, agent build matrix (amd64, arm64, arm/v7), web typecheck, MCP typecheck, security scans.
- Operator runs rsync from their laptop to /opt/kvmfleet/ on Hetzner (excluding node_modules, .git, build artefacts, .env, hero.mp4).
- Operator SSHes in, runs docker compose -f deploy/docker-compose.prod.yml build platform web-build then up -d --no-deps platform web-build caddy.
- Migrations run automatically as the platform container's startup command (alembic upgrade head then uvicorn app.main:app …).
- The Caddyfile is bind-mounted into the Caddy container. rsync replaces the file via rename, which leaves the container's mount pointing at the old inode, so Caddy must be restarted (not reloaded) whenever the Caddyfile changes.
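
The Caddyfile caveat in the last step is easier to see in isolation. The sketch below is an analogy, not the actual deploy tooling: an open file handle stands in for the container's bind mount, and `os.replace` stands in for rsync's default write-to-temp-then-rename behaviour. It assumes a POSIX filesystem; the function name is hypothetical.

```python
import os
import tempfile


def demo_rename_strands_open_handle() -> tuple[str, str, bool]:
    """Replace a file by rename while a handle is still open on the old copy."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "Caddyfile")
        with open(path, "w") as f:
            f.write("old config\n")

        held = open(path)  # stands in for the container's view of the file
        inode_before = os.stat(path).st_ino

        # Write a temp file next to the target, then rename it over the target.
        tmp = os.path.join(d, ".Caddyfile.tmp")
        with open(tmp, "w") as f:
            f.write("new config\n")
        os.replace(tmp, path)

        inode_changed = os.stat(path).st_ino != inode_before
        seen_by_holder = held.read()  # still reads the old inode's content
        held.close()
        with open(path) as f:
            seen_by_path = f.read()  # a fresh open sees the new file
        return seen_by_holder, seen_by_path, inode_changed
```

The holder keeps reading the old content because the path now names a different inode. A reload inside the container re-reads the stale inode for the same reason, whereas a restart re-resolves the mount.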
Agent builds run on the Hetzner host via a throwaway golang:1.24-alpine Docker container (make agent-linux-arm64 etc). Cross-compiles for amd64 / arm64 / arm-v7 with CGO_ENABLED=0. Binary size: ~10 MB stripped.
9. Backup and disaster recovery
Honest state: minimal.
- deploy/backup.sh runs nightly via cron: pg_dump | gzip > /opt/kvmfleet/backups/kvmfleet-YYYY-MM-DD-HHMM.sql.gz. Keeps 7 days locally, prunes older.
- No off-site copy. A disk failure on the CX22 is currently a total-loss event for billing+account data.
- The audit chain is logically protected against tampering but physically unprotected against disk failure.
- Source code lives on GitHub (off-site copy of every file except .env).
- Agent binaries on PiKVMs are not backed up; they're regenerable from source.
Documented as an open security item in TODO. Planned fix: encrypt nightly dump with age, ship via restic to a Hetzner Storage Box (€3.45/mo for 1 TB).
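
For reference, the local retention rule (keep 7 days, prune older) can be expressed as a pure function over file names. This is a sketch under assumptions: the real deploy/backup.sh is a shell script whose pruning logic is not shown in this document, and `dumps_to_prune` is a hypothetical name. Only the file-name pattern and the 7-day window come from the text above.

```python
import re
from datetime import datetime, timedelta

# Matches the documented naming scheme kvmfleet-YYYY-MM-DD-HHMM.sql.gz
NAME_RE = re.compile(r"^kvmfleet-(\d{4}-\d{2}-\d{2}-\d{4})\.sql\.gz$")


def dumps_to_prune(filenames, now, keep_days=7):
    """Return the dump files whose embedded timestamp is older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    stale = []
    for name in filenames:
        match = NAME_RE.match(name)
        if not match:
            continue  # never touch files that are not nightly dumps
        taken = datetime.strptime(match.group(1), "%Y-%m-%d-%H%M")
        if taken < cutoff:
            stale.append(name)
    return sorted(stale)
```

Deriving the age from the file name rather than from mtime keeps the rule stable if dumps are ever copied or restored, which matters once an off-site copy exists.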
10. Observability
| Signal | Where it lands |
|---|---|
| Unhandled platform exceptions | platform_errors table + GlitchTip via sentry-sdk with PII scrubbing |
| Browser exceptions in the SPA | GlitchTip via @sentry/react |
| Agent log lines | systemd journal on each PiKVM |
| Docker container logs | docker compose logs on Hetzner |
| Audit-relevant business events | audit_events table |
| Heartbeat / device state | devices row + device_metrics time series |
| Stripe webhook deliveries | stripe_events + Stripe dashboard |
| Outbound email | Brevo dashboard |
| TURN relay traffic | coturn log to stdout |
Smoke tests run hourly via GitHub Actions cron (scripts/smoke.py): /healthz, /v1/billing/plans, login bad-creds returns 401, landing page returns 200. On failure the action notification email is delivered to the operator.
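
The four checks can be pictured as a small table-driven loop. This is an illustrative sketch rather than the contents of scripts/smoke.py: the login path `/v1/auth/login`, the expectation that `/healthz` and `/v1/billing/plans` return 200, and the injected `fetch` callable are assumptions made so the logic can run without a network.

```python
from typing import Callable, NamedTuple


class Check(NamedTuple):
    name: str
    method: str
    path: str
    expected_status: int


CHECKS = [
    Check("health", "GET", "/healthz", 200),
    Check("billing plans", "GET", "/v1/billing/plans", 200),
    # Hypothetical login path; the request would carry deliberately bad credentials.
    Check("login rejects bad creds", "POST", "/v1/auth/login", 401),
    Check("landing page", "GET", "/", 200),
]


def run_smoke(fetch: Callable[[str, str], int]) -> list[str]:
    """Run every check and return a list of human-readable failures."""
    failures = []
    for check in CHECKS:
        try:
            status = fetch(check.method, check.path)
        except Exception as exc:  # a network error is also a failed check
            failures.append(f"{check.name}: {exc!r}")
            continue
        if status != check.expected_status:
            failures.append(
                f"{check.name}: expected {check.expected_status}, got {status}"
            )
    return failures
```

A script built this way would exit non-zero when the returned list is non-empty, which is what makes the GitHub Actions run fail and triggers the notification email.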
There is no centralised log aggregator (no ELK, no Loki). At current scale this is a deliberate "wait until pain emerges" decision.
11. Tech stack summary
| Layer | Choice | Why |
|---|---|---|
| Backend | FastAPI + Python 3.12 | Async I/O for the WebSocket tunnel + agent fan-out; rich ecosystem for compliance / PDF / Stripe |
| ORM | SQLAlchemy 2.0 async + asyncpg | Mature async driver; FORCE RLS works cleanly with the per-request GUC pattern |
| Auth | python-jose for JWT, bcrypt for passwords, pyotp for TOTP | Standard primitives; pyjwt migration is on the security TODO |
| HTTP | uvicorn 0.30 with --proxy-headers --forwarded-allow-ips '*' | Behind Caddy; needs to trust X-Forwarded-* |
| DB | Postgres 16 | RLS, partial indexes, full-text search if needed later |
| Cache / dedup | Redis 7 | Presence TTL + dedup keys; no persistence needed |
| Frontend | React 18 + TypeScript + Tailwind + Vite | Single-page dashboard; small bundle (~250 KB gzipped after the recent compression pass) |
| Charts | Recharts 2.13 | Declarative API; ~40 KB gzipped |
| Charts dependencies | none | We deliberately avoided D3, since recharts wraps it |
| Agent | Go 1.24 + pion/webrtc/v4 | Pure Go; cross-compiles trivially; static binary; pion is the canonical pure-Go WebRTC stack |
| Static binary HTTP | net/http stdlib + nhooyr.io/websocket | nhooyr is more idiomatic than gorilla and supports HTTP/2 |
| MCP server | TypeScript + @modelcontextprotocol/sdk | Published as @kvmfleet/mcp on npmjs.com |
| Reverse proxy | Caddy 2 | Auto-TLS, simple Caddyfile, HTTP/3 by default |
| TURN | coturn 4.6 | Industry standard, --use-auth-secret obviates per-user DB |
| Error tracker | GlitchTip 4.1 (Sentry-SDK compatible, EU-hosted, AGPL) | Sentry SaaS would be ~€26/mo and out-of-EU; GlitchTip self-host is free + EU-resident |
| Email | Brevo (transactional) + ImprovMX (inbound) | Both EU-resident; Brevo has a generous free tier; ImprovMX is free for ≤5 aliases |
| Billing | Stripe (Ireland) | Industry standard, EU-resident |
| Compliance reports | ReportLab | Pure-Python PDF generation; no headless Chrome |
| Tests | pytest + httpx + ephemeral Postgres | Async-native; 76 tests at last count |
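
The TURN row's remark that --use-auth-secret removes the need for a per-user database refers to coturn's time-limited credential scheme. The username carries an expiry timestamp and the password is an HMAC of that username under the shared secret. The sketch below follows that scheme as coturn documents it. The TTL, the `user_id` format and the function name are assumptions, and the platform's actual minting code may differ.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def turn_credentials(
    shared_secret: str,
    user_id: str,
    ttl_seconds: int = 3600,
    now: Optional[float] = None,
) -> tuple[str, str]:
    """Return (username, password) for a short-lived TURN allocation.

    username = "<unix expiry>:<user id>"
    password = base64(HMAC-SHA1(shared_secret, username))
    """
    expiry = int((time.time() if now is None else now) + ttl_seconds)
    username = f"{expiry}:{user_id}"
    mac = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1)
    password = base64.b64encode(mac.digest()).decode()
    return username, password
```

Because coturn recomputes the same HMAC from the username it receives, no per-user state is needed on the relay. This is also why rotating TURN_SHARED_SECRET invalidates every credential already handed out.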
12. Open questions and likely future direction
These are not promises, just where the operator currently expects the architecture to evolve:
- Horizontal scale: today's single CX22 handles a few hundred concurrent agents comfortably. Beyond that the agent registry needs to move from per-process in-memory to Redis (so multiple platform replicas can route to the same agent), and the Postgres connection pool needs sizing review. Documented in security-hardening TODO.
- WebRTC video: Phase 2 (#106) wires kvmd's H.264 stream as an actual track. Once stable, the iframe console (routers/console.py + the URL-rewriting glue) is retired entirely, removing ~1000 LOC of brittle code.
- Cross-KVM support via Redfish: the agent currently only knows kvmd. Adding a Redfish "translator" mode lets it bridge to Dell iDRAC, HPE iLO, Supermicro IPMI, etc. Turns KVM Fleet from "wrapper around hobbyist hardware" into "single pane of glass over every BMC the enterprise owns".
- Multi-tenant / MSP mode: nested orgs so a consultant can manage many customer tenants from one account. Needs a parent_org_id column, a token claim for the active sub-tenant, and an org-switcher UI.
- On-prem deployable bundle: a tarball with docker-compose.prod.yml + Caddyfile.template + offline licence-key check, for enterprise customers who can't accept SaaS.
- Cross-customer telemetry intelligence: anonymised baselines on the customer's own dashboard ("your average uptime: 28 days, fleet p50: 43 days"). Requires a separate telemetry warehouse (probably ClickHouse), an opt-in flag, lawyer-reviewed copy on the privacy implications, and an aggregation function that enforces a minimum bucket size to prevent re-identification.
- Hardware-bound audit-chain anchoring: HMAC-sign the daily audit-tail hash with a key the platform doesn't hold + anchor externally (S3 object-lock, Bitcoin testnet, etc.) so platform compromise can't silently change the official chain head.
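
As a rough picture of the anchoring idea in the last item, the sketch below signs a daily chain-head hash with HMAC-SHA256 and later checks that the platform's current head still matches the signed one. Everything here is hypothetical: the record layout, the function names, and the assumption that the key sits with an external signer rather than on the platform. Publishing the record to an external store is out of scope.

```python
import hashlib
import hmac


def sign_chain_head(anchor_key: bytes, day: str, head_hash_hex: str) -> dict:
    """Produce an anchor record binding a calendar day to the chain head."""
    message = f"{day}:{head_hash_hex}".encode()
    tag = hmac.new(anchor_key, message, hashlib.sha256).hexdigest()
    return {"day": day, "head": head_hash_hex, "hmac": tag}


def verify_anchor(anchor_key: bytes, record: dict, current_head_hex: str) -> bool:
    """True only if the record is authentic AND matches the head being checked."""
    message = f"{record['day']}:{record['head']}".encode()
    expected = hmac.new(anchor_key, message, hashlib.sha256).hexdigest()
    authentic = hmac.compare_digest(expected, record["hmac"])
    unchanged = hmac.compare_digest(record["head"], current_head_hex)
    return authentic and unchanged
```

The value comes from key placement. If the platform cannot compute the tag itself, an attacker who rewrites the chain and its head cannot also forge a matching anchor record.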
13. Glossary
- PiKVM: an open-source KVM-over-IP appliance built on Raspberry Pi or compatible single-board computers. Provides remote video, keyboard/mouse, virtual mass storage, and ATX power control over IP. KVM Fleet's primary supported hardware.
- kvmd: the daemon running on each PiKVM that exposes the KVM functions as an HTTP + WebSocket API. KVM Fleet's agent talks to kvmd locally on the device.
- uStreamer: kvmd's video-source process. Provides MJPEG or H.264 over an internal interface.
- agent: KVM Fleet's small Go binary that runs alongside kvmd on each device. The bridge between kvmd and the platform.
- platform: KVM Fleet's central FastAPI service. Hosts the API, processes webhooks, mints tokens, runs the audit chain.
- org: a single tenant. Every paying customer is one or more orgs.
- operator: an authenticated user of the dashboard, with one of the four roles.
- support admin: a user with is_support_admin = true. Can read tickets across orgs and access the WebRTC console preview. Currently only the platform operator's two personal accounts.
- MCP: Model Context Protocol, Anthropic's standard for exposing tools to AI assistants. KVM Fleet ships an MCP server (@kvmfleet/mcp) that lets Claude Desktop and similar tools query fleet state in plain English.
- RLS: Row-Level Security. The Postgres feature that enforces per-row visibility based on a session-level setting. KVM Fleet uses it as defence-in-depth for tenant isolation.
- audit chain: the SHA-256 hash chain over audit_events rows that makes tampering with the log mathematically detectable.
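
To make the audit chain entry concrete, here is a minimal sketch of how a SHA-256 hash chain detects an edited row. The row layout, the all-zero genesis value and the JSON canonicalisation are assumptions for illustration. The actual audit_events schema is described in the Audit chain companion document and may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first row's prev_hash


def row_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous row's hash together with this row's canonical payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def build_chain(payloads):
    """Link each payload to its predecessor, returning the chained rows."""
    rows, prev = [], GENESIS
    for payload in payloads:
        digest = row_hash(prev, payload)
        rows.append({"payload": payload, "prev_hash": prev, "hash": digest})
        prev = digest
    return rows


def first_broken_index(rows):
    """Return the index of the first row that fails verification, or None."""
    prev = GENESIS
    for i, row in enumerate(rows):
        if row["prev_hash"] != prev or row_hash(prev, row["payload"]) != row["hash"]:
            return i
        prev = row["hash"]
    return None
```

Changing any stored payload makes its recomputed hash disagree with the stored one. Recomputing that row's hash to hide the edit instead breaks the link to the next row, so the tamper point is always visible to a verifier walking the chain.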