Threat model

This page is the adversarial read of KVM Fleet — what attacks it defends against, what it doesn't, and how the design compares to the alternatives infrastructure teams commonly use today.

For the structural detail (which router does what, which port is open where), see System architecture.

The trust diagram

```mermaid
flowchart LR
    subgraph trusted_op["Trusted: operator identity"]
        operator["Operator<br/>SSO + 2FA + RBAC"]
    end

    subgraph trusted_kf["Trusted: KVM Fleet"]
        platform["Platform<br/>authorisation + audit + JIT"]
        agent["Agent / Redfish proxy<br/>outbound-only"]
    end

    subgraph trusted_org["Trusted: customer"]
        bmc["BMC / PiKVM<br/>customer-owned"]
        target["Target server<br/>(no agent, no telemetry)"]
    end

    subgraph untrusted["Untrusted: the internet"]
        attacker["Adversaries"]
    end

    operator -- "TLS 1.3" --> platform
    platform -- "WSS tunnel<br/>or Redfish HTTPS" --> agent
    agent -- "Local HTTP<br/>or LAN HTTPS" --> bmc
    bmc -- "HID / KVM-IP" --> target

    attacker -. "no inbound path" .-> bmc
    attacker -. "no inbound path" .-> agent
    attacker -- "TLS attempts<br/>blocked by HSTS preload<br/>+ Caddy LE" --> platform
```

Read this as: operator → platform → agent → BMC → target is a single connected trust chain. Adversaries sit outside. The platform is in the middle by design — it centralises audit, RBAC, JIT, and policy decisions; that's the SaaS value proposition. Without it, every BMC's auth + audit posture would be each customer's individual problem.

Three common alternatives we're better than

Most small IT teams today are doing one of these. Each has a specific failure mode.

1. Exposing the BMC directly to the internet

The cheapest path: put iDRAC / iLO on a public IP behind a firewall ACL, optionally a VPN.

What this exposes:

Default credentials still in use (calvin / admin / ADMIN)
The vendor's auth surface (SessionService, /cgi-bin/, custom web UIs) — historically a frequent CVE target. iDRAC, iLO and Supermicro have all had high-severity unauthenticated RCEs in the past five years.
Zero per-action accountability. Which admin power-cycled which server at 03:14 last Tuesday? Vendor logs are local, non-chained, deletable.

What KVM Fleet does instead:

The BMC stays inside the customer network, or is firewalled so that only the platform's egress IP (46.225.227.71, hosted at Hetzner) can reach it.
Every action goes through the platform's RBAC + audit + (optional) JIT + approval workflow.
Operator identity, IP, action, target, timestamp, and result are all in a SHA-256 hash chain that breaks visibly on any mutation.

2. Shared admin passwords

The "one admin password the ops team uses, rotated when someone leaves (eventually)" pattern.

Problems:

No per-action accountability — vendor logs only show "admin did X".
Rotation on leaver-events is brittle; teams forget.
Stolen credentials grant indefinite, unaudited access.

What KVM Fleet does instead:

Per-user identity from Google SSO or local credentials + TOTP 2FA.
Operators never see the BMC's admin password — the platform holds an encrypted credential (Fernet, key derived from jwt_secret).
JIT access grants expire automatically. Approval workflows leave a paper trail (access.requested / access.approved / access.denied / access.break_glass audit actions).
Account disabled → all sessions for that user are revocable centrally and visible in the audit log.

3. VPN-based access

The standard enterprise pattern. Sometimes a jump host adds another layer.

Problems:

VPNs grant network access. Once inside, an attacker can scan, lateral-move, hit any management interface on the network.
Audit is at the VPN level (who connected when), not the action level (who power-cycled what).
Hard to enforce JIT or approval — VPN access is usually on-or-off for a whole network segment.
Performance: serial / video over VPN is often choppy when the VPN concentrator is far away.

What KVM Fleet does instead:

Zero-trust per action: every console open, every power action, every ISO mount, every grant request is authorised separately. No broad network grant.
Audit is at the action level, with operator identity attached.
JIT + approval per device per session.
Tunnel is outbound-only from the customer side — no firewall holes, no VPN endpoint to attack.
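
To make "authorised separately" concrete, here is a minimal sketch of a per-action check that combines a role lookup with a JIT grant that expires on its own. Every name in it (`Grant`, `ROLE_ACTIONS`, `authorize`, the role and action strings) is hypothetical and chosen for illustration; it is not KVM Fleet's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """Hypothetical JIT grant: one user, one device, with a hard expiry."""
    user_id: str
    device_id: str
    expires_at: datetime


# Hypothetical role-to-action mapping, for illustration only.
ROLE_ACTIONS = {
    "viewer": {"console.view"},
    "operator": {"console.view", "console.start", "power.cycle"},
    "admin": {"console.view", "console.start", "power.cycle", "iso.mount"},
}


def authorize(user_id, role, action, device_id, jit_required, grants, now):
    """Decide one action on one device. Returns (allowed, reason)."""
    # Step 1: RBAC. The role must permit this specific action.
    if action not in ROLE_ACTIONS.get(role, set()):
        return False, "rbac_denied"
    # Step 2: JIT. On JIT-required devices an unexpired grant must exist.
    if jit_required:
        has_grant = any(
            g.user_id == user_id and g.device_id == device_id and g.expires_at > now
            for g in grants
        )
        if not has_grant:
            return False, "jit_grant_required"
    return True, "allowed"
```

The point of the shape is that nothing is cached across actions: the same operator can be allowed on one device and refused on the next, and a grant stops working the moment `expires_at` passes without anyone revoking it.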

Threat scenarios we explicitly mitigate

Scenario	Without KVM Fleet	With KVM Fleet
Operator's password leaked	Attacker has full BMC access until rotation	2FA + JIT + audit alert on first anomalous use; access revoked centrally
Disgruntled employee	Manually revoke per-system credentials, hope nothing missed	One account-disable revokes all sessions + future grants; audit history retained
Compromised platform host	Full BMC takeover by attacker	Outbound-only tunnel limits blast radius to live sessions; audit chain detects post-hoc rewrite
Insider modifies audit log	Often undetectable	SHA-256 chain breaks visibly on any row mutation; integrity-check API exposes it
Mass enumeration / brute force on login	Vendor surface exposed	Login rate-limited; lockout after 6 failed attempts; every attempt audited
BMC firmware CVE	Direct internet attack surface	BMC unreachable from internet; only platform's egress IP can reach
Stale contractor access	Forgot to remove from BMC	`org_users.expires_at` auto-expires membership; janitor sweeps every minute
Sensitive action without approval	Anyone with the password fires it	JIT-required devices block console-start until grant; approval workflow logs decision + approver
Break-glass abuse	Hard to detect	`access.break_glass` audit event flagged in compliance reports; org-admin email alerted
Lost / stolen laptop with active session	Operator panics, hopes IT can revoke fast	Console-token TTL is 5 min; session row is revocable from the dashboard immediately
MitM via rogue Wi-Fi	Vendor UIs often weak TLS, no HSTS	HSTS preload + TLS 1.3; SPKI pinning on the agent side (in progress on security TODO)
ISO swap / mass-image attack	Hard to rate-limit at the vendor	Per-device 1-hour rate limit on mount, audit row written before the action fires
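
The last row of the table combines two ideas: a per-device rate limit and an audit row that is written before the action fires. A rough sketch of that ordering follows. It assumes the limit means at most one mount per device per hour, which the table does not spell out, and `IsoMountGuard`, its fields, and the `result` values are all hypothetical names.

```python
from typing import Callable

WINDOW_SECONDS = 3600  # "1-hour" from the table; one mount per window is an assumption


class IsoMountGuard:
    """Hypothetical in-memory guard; a real one would persist both structures."""

    def __init__(self) -> None:
        self.last_mount_at: dict = {}  # device_id -> time of last allowed mount
        self.audit_rows: list = []     # stand-in for the audit table

    def mount(self, device_id: str, actor: str, now: float,
              do_mount: Callable[[str], None]) -> bool:
        last = self.last_mount_at.get(device_id)
        if last is not None and now - last < WINDOW_SECONDS:
            # Refusals are audited too, so enumeration attempts stay visible.
            self.audit_rows.append({"action": "iso.mount", "device_id": device_id,
                                    "actor": actor, "result": "rate_limited"})
            return False
        # Audit first: if the mount call crashes, the attempt is still on record.
        self.audit_rows.append({"action": "iso.mount", "device_id": device_id,
                                "actor": actor, "result": "attempted"})
        self.last_mount_at[device_id] = now
        do_mount(device_id)
        return True
```

Writing the audit row before calling `do_mount` is the design choice that matters here: an attacker who can crash or hang the mount path still cannot make the attempt disappear from the log.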

What the platform is NOT designed to defend against

Being honest about the bounds:

1. A determined platform operator with DB superuser access

They can read all org data, read audit history, mint JWTs (they have JWT_SECRET), inject into active sessions. The audit chain stops them from silently rewriting history — any mutation breaks the SHA-256 chain visibly to anyone who runs GET /v1/audit/integrity — but they have full live-system access.

This is the same posture as any centralised IT-management SaaS. The honest framing for customers is:

"You trust us with live device control + audit history. You don't trust us with the OS inside the target server, its credentials, or its data."

Mitigations on the roadmap (see TODO.md):

Customer-owned audit signing keys (Phase 3, trigger-driven): customer's Ed25519 key signs each audit event server-side. Customer can verify offline using only their public key. Platform operator cannot forge new audit rows that pass verification.
Envelope encryption with customer KMS for stored creds.

2. A compromised PiKVM

If an attacker roots a customer's PiKVM (e.g. via kvmd CVE, weak SSH password), they own that one device. The platform can alert on anomalous behaviour (agent reconnects from new IPs, sudden console-session bursts) via the existing alert rules, but cannot prevent damage on the already-compromised host.

3. Network adversaries with valid Let's Encrypt certificate issuance

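A determined adversary that obtains a valid LE cert for app.kvmfleet.io could intercept agent or browser traffic. SPKI pinning in the agent closes this for agent traffic. HSTS preload on the browser side prevents downgrade to plain HTTP, but it does not by itself reject a mis-issued certificate that chains to a trusted CA. HSTS preload is shipped. SPKI pinning is on the security TODO, not yet shipped.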

4. A targeted compromise of the customer's identity provider

If an attacker takes over a customer's Google Workspace, they can SSO into KVM Fleet as that user. Out of scope for us to defend; the customer's IdP is their responsibility. We can reduce blast radius by enforcing 2FA at the KVM Fleet layer (already on).

5. Side-channel on the host running the agent

If an attacker has root on the same host as the agent (a PiKVM running other software, or a customer-side Redfish gateway), they can read the encrypted-creds blob and the platform's session token from /etc/kvmfleet/agent.env. The agent runs as a non-root user and the env file is mode 0600, but an attacker with root can read anything on the host. Mitigation: host hygiene is the customer's responsibility.
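
The 0600 mode only helps if it is actually in place. As an illustration, a startup check along these lines could refuse to load an env file that group or other users can access. `env_file_is_private` is a hypothetical helper, not part of the agent, and it does nothing against an attacker who already has root.

```python
import os
import stat


def env_file_is_private(path: str) -> bool:
    """Return True if no group/other permission bits are set (e.g. mode 0600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```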

What "tamper-evident" actually means

Each row in audit_events stores a prev_hash (the previous row's row_hash) and a row_hash computed as sha256(prev_hash + canonical(payload)) over the row's own fields. The first row's prev_hash is all zeros. The DB role used by the application has explicit REVOKE UPDATE, DELETE, TRUNCATE ON audit_events; a refusal trigger additionally blocks modifications, including by superusers, via the standard SQL surface.

Three properties:

Insert-only. The application role cannot modify a row after writing it. Only insert.
Chained. Each row references the prior row's hash. Mutating any past row breaks every subsequent hash.
Verifiable offline. GET /v1/audit/integrity returns the row count + a first_break_id if any mutation has happened. The same check can be re-run by a customer with read access to the schema (they don't need our code).
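
The three properties can be sketched in a few lines. The example below uses the row formula given in the break-signature table further down, sha256(prev_hash + canonical(payload)), and assumes that `canonical` means sorted-key compact JSON and that the genesis prev_hash is 64 zero hex digits. Both are assumptions made for illustration; the real serialization lives in the platform code. The return shape mirrors the row count and first_break_id that the integrity endpoint reports.

```python
import hashlib
import json

ZERO_HASH = "0" * 64  # assumed genesis value for the first row's prev_hash


def canonical(payload: dict) -> str:
    """Deterministic serialization (assumption: sorted-key compact JSON)."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def compute_row_hash(prev_hash: str, payload: dict) -> str:
    return hashlib.sha256((prev_hash + canonical(payload)).encode("utf-8")).hexdigest()


def append_row(chain: list, payload: dict) -> dict:
    """Insert-only: a new row links to the current chain head."""
    prev_hash = chain[-1]["row_hash"] if chain else ZERO_HASH
    row = {"id": len(chain) + 1, "payload": payload, "prev_hash": prev_hash,
           "row_hash": compute_row_hash(prev_hash, payload)}
    chain.append(row)
    return row


def verify_chain(chain: list) -> dict:
    """Walk the chain once and report the first row that fails either check."""
    expected_prev = ZERO_HASH
    for row in chain:
        linked = row["prev_hash"] == expected_prev
        self_consistent = row["row_hash"] == compute_row_hash(row["prev_hash"], row["payload"])
        if not (linked and self_consistent):
            return {"row_count": len(chain), "first_break_id": row["id"]}
        expected_prev = row["row_hash"]
    return {"row_count": len(chain), "first_break_id": None}
```

Because `verify_chain` needs only the rows and the hash formula, anyone with read access to the table can run the same walk without trusting the platform's verifier.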

What this doesn't prevent: a superuser with SET session_replication_role = replica can bypass the trigger and write directly. They cannot forge hashes that line up with subsequent rows without re-writing the entire chain after the break point, and any verifier downstream sees that. So: an attacker can leave a gap in the chain or modify a row in it, but either change is visible to a verifier.

See Audit chain for the schema-level detail.

Concurrency-break vs. tamper-break

A prev_hash mismatch on the verifier can come from two distinct causes — important to distinguish:

Signature	Cause	What it means
Row N+1's `prev_hash` matches row N's `prev_hash` (not row N's `row_hash`); rows N and N+1 are within milliseconds and look like duplicates (same `action`, `target_id`, `actor_id`)	Concurrency artifact — two simultaneous `record()` calls each read the same chain head before either had inserted	Not tampering. Each individual row's `row_hash` correctly hashes its own payload + the prev_hash it observed at insert time. No data was modified.
Row N's `row_hash` doesn't equal `sha256(prev_hash + canonical(payload))` for its own payload	Tampering — somebody mutated row N's fields after insert	The chain detected a post-hoc modification. Row N's content is suspect.
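
The two signatures in the table can be told apart mechanically. The sketch below looks only at the hash relationships between a row and its predecessor; it does not check the timestamp or duplicate-field hints from the first table row, and it assumes sorted-key compact JSON for `canonical(payload)`. `classify_link` and its return labels are hypothetical names.

```python
import hashlib
import json


def compute_row_hash(prev_hash: str, payload: dict) -> str:
    # Assumption: canonical(payload) is sorted-key compact JSON.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def classify_link(prev_row: dict, row: dict) -> str:
    """Classify `row` against the row stored immediately before it."""
    # Tamper signature: the row no longer hashes to its own stored row_hash.
    if row["row_hash"] != compute_row_hash(row["prev_hash"], row["payload"]):
        return "tamper"
    # Healthy link: the row points at its predecessor's row_hash.
    if row["prev_hash"] == prev_row["row_hash"]:
        return "ok"
    # Concurrency signature: both rows observed the same chain head.
    if row["prev_hash"] == prev_row["prev_hash"]:
        return "concurrency"
    # Anything else (for example a missing row) needs manual review.
    return "unknown"
```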

KVM Fleet's services/audit.py::record() shipped with a concurrency bug from initial release until 2026-05-13 (commit a1b07c1): the read-of-latest-row + INSERT cycle was not serialized. The bug surfaced in production on the founder's own test org as six pairs of device.offline events from rapid agent reconnect cycles racing through the WS handler's finally block. Each pair produced one chain-break signature of the first type above.

Mitigation deployed:

record() now takes a Postgres advisory transaction lock (pg_advisory_xact_lock(hashtext("audit:" + org_id))) before reading the chain head. The lock is released when the transaction commits or rolls back, so the contention window is exactly the audit write itself. Different orgs hash to different keys and don't contend with each other; a rare hashtext collision between two orgs would only add contention, not break the chain.
Regression test (test_chain_intact_under_concurrent_writers): launches 10 simultaneous record() calls via asyncio.gather, asserts the chain is intact and no non-zero prev_hash repeats. Would fail without the lock.
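
The race and the fix can be reproduced without a database. In the sketch below a per-org `asyncio.Lock` stands in for the Postgres advisory lock, and `asyncio.sleep(0)` stands in for the round trip between reading the chain head and inserting. `InMemoryAudit`, `chain_intact`, and `run_writers` are hypothetical names, and this is not the real test_chain_intact_under_concurrent_writers, only the same idea in miniature.

```python
import asyncio
import hashlib
import json
from collections import defaultdict

ZERO_HASH = "0" * 64  # assumed genesis prev_hash


def row_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class InMemoryAudit:
    def __init__(self, serialize: bool) -> None:
        self.serialize = serialize
        self.rows = defaultdict(list)           # org_id -> list of rows
        self.locks = defaultdict(asyncio.Lock)  # org_id -> stand-in for the advisory lock

    async def _append(self, org_id: str, payload: dict) -> None:
        rows = self.rows[org_id]
        prev = rows[-1]["row_hash"] if rows else ZERO_HASH  # read the chain head
        await asyncio.sleep(0)  # yield to other writers, like awaiting the DB
        rows.append({"prev_hash": prev, "payload": payload,
                     "row_hash": row_hash(prev, payload)})

    async def record(self, org_id: str, payload: dict) -> None:
        if self.serialize:
            async with self.locks[org_id]:  # read + insert become one critical section
                await self._append(org_id, payload)
        else:
            await self._append(org_id, payload)


def chain_intact(rows: list) -> bool:
    expected = ZERO_HASH
    for r in rows:
        if r["prev_hash"] != expected:
            return False
        expected = r["row_hash"]
    return True


async def run_writers(serialize: bool, n: int = 10) -> list:
    audit = InMemoryAudit(serialize)
    await asyncio.gather(*(audit.record("org-1", {"action": "device.offline", "seq": i})
                           for i in range(n)))
    return audit.rows["org-1"]
```

With `serialize=False` every writer reads the same empty head before any of them inserts, which reproduces the repeated-prev_hash signature. With `serialize=True` each writer sees the row written by the one before it. Because the lock is keyed by org, writers for different orgs never wait on each other.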

For pre-fix chains where the bug already produced visible breaks, a one-shot repair script (scripts/repair_audit_chain.py) recomputes every row's prev_hash + row_hash chained off the previous row, then inserts a marker audit.chain.repaired event at the tail of the chain. The marker itself is in the chain and explicitly documents that the chain was rewritten: when, why, and how many rows changed. Any auditor reading the chain sees the marker and knows what happened.

The repair script does the rewrite as a superuser (the runtime role still has REVOKE UPDATE); the immutability trigger is dropped and recreated in the same transaction so the window is exactly one statement. After the rewrite the chain verifies green, and the marker event is permanent evidence of the rewrite.
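
In outline, the repair pass is a single walk over the rows followed by one appended marker. The following in-memory sketch shows that shape only. It is not scripts/repair_audit_chain.py, and the marker's payload fields (`reason`, `rows_changed`, `repaired_at`) are assumed names for the "when, why, and how many rows" described above.

```python
import hashlib
import json

ZERO_HASH = "0" * 64  # assumed genesis prev_hash


def row_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def repair_chain(rows: list, reason: str, repaired_at: str) -> int:
    """Re-chain every row off its predecessor, then append a marker row.

    Payloads are never touched; only prev_hash and row_hash are recomputed.
    Returns the number of rows whose hashes changed.
    """
    changed = 0
    prev = ZERO_HASH
    for row in rows:
        new_hash = row_hash(prev, row["payload"])
        if row["prev_hash"] != prev or row["row_hash"] != new_hash:
            row["prev_hash"], row["row_hash"] = prev, new_hash
            changed += 1
        prev = new_hash
    marker = {"action": "audit.chain.repaired", "reason": reason,
              "rows_changed": changed, "repaired_at": repaired_at}
    rows.append({"payload": marker, "prev_hash": prev,
                 "row_hash": row_hash(prev, marker)})
    return changed
```

Note that fixing one broken link changes the hashes of every row after it, which is why the marker records a row count rather than a single row id.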

Compliance frameworks the trust model supports

The arguments above underpin the compliance-report claims. We map controls from:

NIS2 (EU directive)
SOC 2 Type II (AICPA)
ISO 27001:2022
GDPR (EU regulation)
HIPAA Security Rule (US)
NIST SP 800-53 Rev 5 (US federal baseline)
PCI DSS 4.0
Cyber Essentials Plus (UK)

For the full control-by-control coverage, see the org's Compliance page after enabling a framework — the PDF is generated from app/services/report.py. The trust-model claims here are the evidence the report cites; the report itself is the audit-grade artifact.

Related documents

System architecture — comprehensive structural view
WebSocket multiplexing — how the agent tunnel actually works
Audit chain — SHA-256 hash chain detail
Row-Level Security — Postgres-level tenant isolation
