Threat model
This page is the adversarial read of KVM Fleet — what attacks it defends against, what it doesn't, and how the design compares to the alternatives infrastructure teams commonly use today.
For the structural detail (which router does what, which port is open where), see System architecture.
The trust diagram
flowchart LR
subgraph trusted_op["Trusted: operator identity"]
operator["Operator<br/>SSO + 2FA + RBAC"]
end
subgraph trusted_kf["Trusted: KVM Fleet"]
platform["Platform<br/>authorisation + audit + JIT"]
agent["Agent / Redfish proxy<br/>outbound-only"]
end
subgraph trusted_org["Trusted: customer"]
bmc["BMC / PiKVM<br/>customer-owned"]
target["Target server<br/>(no agent, no telemetry)"]
end
subgraph untrusted["Untrusted: the internet"]
attacker["Adversaries"]
end
operator -- "TLS 1.3" --> platform
platform -- "WSS tunnel<br/>or Redfish HTTPS" --> agent
agent -- "Local HTTP<br/>or LAN HTTPS" --> bmc
bmc -- "HID / KVM-IP" --> target
attacker -. "no inbound path" .-> bmc
attacker -. "no inbound path" .-> agent
attacker -- "TLS attempts<br/>blocked by HSTS preload<br/>+ Caddy LE" --> platform
Read this as: operator → platform → agent → BMC → target is a single connected trust chain. Adversaries sit outside. The platform is in the middle by design — it centralises audit, RBAC, JIT, and policy decisions; that's the SaaS value proposition. Without it, every BMC's auth + audit posture would be each customer's individual problem.
Three common alternatives we're better than
Most small IT teams today are doing one of these. Each has a specific failure mode.
1. Exposing the BMC directly to the internet
The cheapest path: put iDRAC / iLO on a public IP behind a firewall ACL, optionally a VPN.
What this exposes:
- Default credentials still in use (calvin / admin / ADMIN)
- The vendor's auth surface (SessionService, /cgi-bin/, custom web UIs), historically a frequent CVE target. iDRAC, iLO and Supermicro have all had high-severity unauthenticated RCEs in the past five years.
- Zero per-action accountability. Which admin power-cycled which server at 03:14 last Tuesday? Vendor logs are local, non-chained, deletable.
What KVM Fleet does instead:
- The BMC stays inside the customer network (or firewalled to allow only Hetzner's IP, 46.225.227.71).
- Every action goes through the platform's RBAC + audit + (optional) JIT + approval workflow.
- Operator identity, IP, action, target, timestamp, and result are all in a SHA-256 hash chain that breaks visibly on any mutation.
2. Shared admin passwords
The "one admin password the ops team uses, rotated when someone leaves (eventually)" pattern.
Problems:
- No per-action accountability — vendor logs only show "admin did X".
- Rotation on leaver-events is brittle; teams forget.
- Stolen credentials grant indefinite, unaudited access.
What KVM Fleet does instead:
- Per-user identity from Google SSO or local credentials + TOTP 2FA.
- Operators never see the BMC's admin password: the platform holds an encrypted credential (Fernet, key derived from jwt_secret).
- JIT access grants expire automatically. Approval workflows leave a paper trail (access.requested / access.approved / access.denied / access.break_glass audit actions).
- Account disabled → all sessions for that user are revokable centrally and visible in the audit log.
3. VPN-based access
The standard enterprise pattern. Sometimes a jump host adds another layer.
Problems:
- VPNs grant network access. Once inside, an attacker can scan, lateral-move, hit any management interface on the network.
- Audit is at the VPN level (who connected when), not the action level (who power-cycled what).
- Hard to enforce JIT or approval — VPN access is usually on-or-off for a whole network segment.
- Performance: serial / video over VPN is often choppy when the VPN concentrator is far away.
What KVM Fleet does instead:
- Zero-trust per action: every console open, every power action, every ISO mount, every grant request is authorised separately. No broad network grant.
- Audit is at the action level, with operator identity attached.
- JIT + approval per device per session.
- Tunnel is outbound-only from the customer side — no firewall holes, no VPN endpoint to attack.
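The per-action model above can be illustrated with a minimal sketch. It is not the platform's code: the role map, the Grant type, the action names and the authorise() signature are all hypothetical, chosen only to show an RBAC check and a JIT-grant expiry check being applied to every single action.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical role -> allowed-actions map; the real RBAC model is not shown here.
ROLE_ACTIONS = {
    "viewer": {"console.view"},
    "operator": {"console.view", "console.open", "power.cycle"},
    "admin": {"console.view", "console.open", "power.cycle", "iso.mount"},
}


@dataclass(frozen=True)
class Grant:
    """A JIT grant for one user on one device (hypothetical shape)."""
    user_id: str
    device_id: str
    expires_at: datetime


def authorise(user_id, role, action, device_id, jit_required, grants, now):
    """Decide one action on one device. Returns (allowed, reason)."""
    # Step 1: RBAC. The role must allow this specific action.
    if action not in ROLE_ACTIONS.get(role, set()):
        return False, "rbac_denied"
    # Step 2: JIT. On JIT-required devices a live, unexpired grant must exist.
    if jit_required:
        live = any(
            g.user_id == user_id and g.device_id == device_id and g.expires_at > now
            for g in grants
        )
        if not live:
            return False, "jit_grant_missing_or_expired"
    return True, "ok"
```

Because the check takes the device and the current time as inputs, a grant for one device says nothing about any other device, and an expired grant fails without any cleanup job having run.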
Threat scenarios we explicitly mitigate
| Scenario | Without KVM Fleet | With KVM Fleet |
|---|---|---|
| Operator's password leaked | Attacker has full BMC access until rotation | 2FA + JIT + audit alert on first anomalous use; access revoked centrally |
| Disgruntled employee | Manually revoke per-system credentials, hope nothing missed | One account-disable revokes all sessions + future grants; audit history retained |
| Compromised platform host | Full BMC takeover by attacker | Outbound-only tunnel limits blast radius to live sessions; audit chain detects post-hoc rewrite |
| Insider modifies audit log | Often undetectable | SHA-256 chain breaks visibly on any row mutation; integrity-check API exposes it |
| Mass enumeration / brute force on login | Vendor surface exposed | Login rate-limited; lockout after 6 failed attempts; every attempt audited |
| BMC firmware CVE | Direct internet attack surface | BMC unreachable from internet; only platform's egress IP can reach |
| Stale contractor access | Forgot to remove from BMC | org_users.expires_at auto-expires membership; janitor sweeps every minute |
| Sensitive action without approval | Anyone with the password fires it | JIT-required devices block console-start until grant; approval workflow logs decision + approver |
| Break-glass abuse | Hard to detect | access.break_glass audit event flagged in compliance reports; org-admin email alerted |
| Lost / stolen laptop with active session | Operator panics, hopes IT can revoke fast | Console-token TTL is 5 min; session row is revocable from the dashboard immediately |
| MitM via rogue Wi-Fi | Vendor UIs often weak TLS, no HSTS | HSTS preload + TLS 1.3; SPKI pinning on the agent side (in progress on security TODO) |
| ISO swap / mass-image attack | Hard to rate-limit at the vendor | Per-device 1-hour rate limit on mount, audit row written before the action fires |
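As an illustration of the brute-force row in the table above, here is a minimal in-memory sketch of a lockout after 6 failed attempts with every attempt audited. The threshold comes from the table; the lockout duration, the LoginGuard class and the audit action names are assumptions, not the platform's implementation.

```python
from dataclasses import dataclass, field

MAX_FAILED_ATTEMPTS = 6      # from the table above
LOCKOUT_SECONDS = 15 * 60    # assumption: the document does not state a duration


@dataclass
class LoginGuard:
    failures: dict = field(default_factory=dict)      # account -> consecutive failures
    locked_until: dict = field(default_factory=dict)  # account -> unlock time (epoch seconds)
    audit: list = field(default_factory=list)         # (time, account, action) tuples

    def attempt(self, account, password_ok, now):
        """Process one login attempt and return 'ok', 'failed' or 'locked'."""
        # A locked account is refused before the password is even considered.
        if self.locked_until.get(account, 0) > now:
            self.audit.append((now, account, "login.locked_out"))
            return "locked"
        if password_ok:
            self.failures.pop(account, None)
            self.audit.append((now, account, "login.success"))
            return "ok"
        count = self.failures.get(account, 0) + 1
        self.failures[account] = count
        self.audit.append((now, account, "login.failed"))
        # The sixth consecutive failure starts the lockout window.
        if count >= MAX_FAILED_ATTEMPTS:
            self.locked_until[account] = now + LOCKOUT_SECONDS
            self.failures.pop(account, None)
        return "failed"
```

Every branch appends an audit tuple before returning, so refused attempts during a lockout are recorded as well as ordinary failures.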
What the platform is NOT designed to defend against
Being honest about the bounds:
1. A determined platform operator with DB superuser access
They can read all org data, read audit history, mint JWTs (they have JWT_SECRET), inject into active sessions. The audit chain stops them from silently rewriting history — any mutation breaks the SHA-256 chain visibly to anyone who runs GET /v1/audit/integrity — but they have full live-system access.
This is the same posture as any centralised IT-management SaaS. The honest framing for customers is:
"You trust us with live device control + audit history. You don't trust us with the OS inside the target server, its credentials, or its data."
Mitigations on the roadmap (see TODO.md):
- Customer-owned audit signing keys (Phase 3, trigger-driven): customer's Ed25519 key signs each audit event server-side. Customer can verify offline using only their public key. Platform operator cannot forge new audit rows that pass verification.
- Envelope encryption with customer KMS for stored creds.
2. A compromised PiKVM
If an attacker roots a customer's PiKVM (e.g. via kvmd CVE, weak SSH password), they own that one device. The platform can alert on anomalous behaviour (agent reconnects from new IPs, sudden console-session bursts) via the existing alert rules, but cannot prevent damage on the already-compromised host.
3. Network adversaries with valid Let's Encrypt certificate issuance
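A determined adversary that obtains a valid LE cert for app.kvmfleet.io could intercept agent or browser traffic. SPKI pinning in the agent closes this for agent traffic. HSTS preload on the browser side blocks plaintext downgrade and click-through on certificate warnings, but a browser still accepts a validly issued rogue certificate. HSTS preload is shipped. SPKI pinning is on the security TODO, not yet shipped.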
4. A targeted compromise of the customer's identity provider
If an attacker takes over a customer's Google Workspace, they can SSO into KVM Fleet as that user. Out of scope for us to defend; the customer's IdP is their responsibility. We can reduce blast radius by enforcing 2FA at the KVM Fleet layer (already on).
5. Side-channel on the host running the agent
If an attacker has root on the same host as the agent (a PiKVM running other software, or a customer-side Redfish gateway), they can read the encrypted-creds blob and the platform's session token from /etc/kvmfleet/agent.env. Mitigation: host hygiene is the customer's responsibility. The agent runs as a non-root user, the env file is mode 0600, but a root attacker can read anything.
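As part of that host hygiene, a customer could check the env file's permissions with something like the sketch below. The helper name env_file_problems is hypothetical and this is not a tool the agent ships; it only compares the file's permission bits with the 0600 mode stated above.

```python
import os
import stat


def env_file_problems(path, expected_mode=0o600):
    """Return a list of human-readable permission problems for the env file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)  # permission bits only
    problems = []
    if mode != expected_mode:
        problems.append(f"mode is {mode:04o}, expected {expected_mode:04o}")
    return problems
```

On a customer host this would be called as env_file_problems("/etc/kvmfleet/agent.env"). An empty list means the mode matches. It says nothing about a root attacker, who can read the file regardless of its mode.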
What "tamper-evident" actually means
The audit chain links every row of audit_events to its predecessor. Each row stores prev_hash, the row_hash of the row before it, and its own row_hash = sha256(prev_hash + canonical(payload)). The chain starts from an all-zero prev_hash. The DB role used by the application has explicit REVOKE UPDATE, DELETE, TRUNCATE ON audit_events; a refusal trigger blocks even superuser modifications via the standard SQL surface.
Three properties:
- Insert-only. The application role cannot modify a row after writing it. Only insert.
- Chained. Each row references the prior row's hash. Mutating any past row breaks every subsequent hash.
- Verifiable offline. GET /v1/audit/integrity returns the row count + a first_break_id if any mutation has happened. The same check can be re-run by a customer with read access to the schema (they don't need our code).
What this doesn't prevent: a superuser with SET session_replication_role = replica can bypass the trigger and write directly. They cannot forge hashes that line up with subsequent rows without rewriting the entire chain after the break point, and any verifier downstream sees that. So an attacker can leave a gap in the chain or modify a row, but not silently: either one shows up as a break.
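A minimal sketch of the offline check, assuming row_hash = sha256(prev_hash + canonical(payload)) as stated in the concurrency section of this page. The canonical JSON form, the hex-encoded all-zero genesis value, the row dictionaries and the return shape are assumptions made for illustration, not the platform's schema or API response.

```python
import hashlib
import json

ZERO_HASH = "0" * 64  # assumption: the all-zero genesis value is hex-encoded


def canonical(payload):
    # Assumption: canonical form is compact JSON with sorted keys.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def row_hash(prev_hash, payload):
    return hashlib.sha256((prev_hash + canonical(payload)).encode("utf-8")).hexdigest()


def append(chain, payload):
    """Append a row whose prev_hash is the current chain head."""
    prev = chain[-1]["row_hash"] if chain else ZERO_HASH
    row = {"id": len(chain) + 1, "payload": payload,
           "prev_hash": prev, "row_hash": row_hash(prev, payload)}
    chain.append(row)
    return row


def verify(chain):
    """Walk the chain in order and report the first row that does not verify."""
    expected_prev = ZERO_HASH
    for row in chain:
        linked = row["prev_hash"] == expected_prev
        self_consistent = row["row_hash"] == row_hash(row["prev_hash"], row["payload"])
        if not (linked and self_consistent):
            return {"rows": len(chain), "first_break_id": row["id"]}
        expected_prev = row["row_hash"]
    return {"rows": len(chain), "first_break_id": None}
```

In this sketch, editing a past row's payload makes that row fail its own hash check. If the attacker also recomputes that row's row_hash, the break moves to the next row, whose prev_hash no longer matches.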
See Audit chain for the schema-level detail.
Concurrency-break vs. tamper-break
A prev_hash mismatch on the verifier can come from two distinct causes — important to distinguish:
| Signature | Cause | What it means |
|---|---|---|
| Row N+1's prev_hash matches row N's prev_hash (not row N's row_hash); rows N and N+1 are within milliseconds and look like duplicates (same action, target_id, actor_id) | Concurrency artifact: two simultaneous record() calls each read the same chain head before either had inserted | Not tampering. Each individual row's row_hash correctly hashes its own payload + the prev_hash it observed at insert time. No data was modified. |
| Row N's row_hash doesn't equal sha256(prev_hash + canonical(payload)) for its own payload | Tampering: somebody mutated row N's fields after insert | The chain detected a post-hoc modification. Row N's content is suspect. |
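The two signatures in the table can be told apart mechanically. The sketch below is one way to do it, assuming rows are dictionaries with payload, prev_hash and row_hash keys and that canonical() is sorted-key JSON; classify_break is a hypothetical helper, not part of the platform. It leaves out the timestamp and duplicate-field heuristics from the first row of the table.

```python
import hashlib
import json


def canonical(payload):
    # Assumption: canonical form is compact JSON with sorted keys.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def row_hash(prev_hash, payload):
    return hashlib.sha256((prev_hash + canonical(payload)).encode("utf-8")).hexdigest()


def classify_break(prev_row, row):
    """Classify the link between `prev_row` (row N) and `row` (row N+1)."""
    # Tamper signature: the row no longer hashes to its own stored row_hash.
    if row["row_hash"] != row_hash(row["prev_hash"], row["payload"]):
        return "tamper"
    # Concurrency signature: both rows observed the same chain head.
    if row["prev_hash"] == prev_row["prev_hash"]:
        return "concurrency"
    # Normal case: the row points at its predecessor's row_hash.
    if row["prev_hash"] == prev_row["row_hash"]:
        return "intact"
    return "unknown"
```

The self-consistency check runs first, so a row that was both written concurrently and later mutated is reported as tampering.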
KVM Fleet's services/audit.py::record() shipped with a concurrency bug from initial release until 2026-05-13 (commit a1b07c1): the read-of-latest-row + INSERT cycle was not serialized. The bug surfaced in production on the founder's own test org as six pairs of device.offline events from rapid agent reconnect cycles racing through the WS handler's finally block. Each pair produced one chain-break signature of the first type above.
Mitigation deployed:
- record() now takes a Postgres advisory transaction lock (pg_advisory_xact_lock(hashtext("audit:" + org_id))) before reading the chain head. The lock releases on commit, so the contention window is exactly the audit write itself. Different orgs hash to different keys and don't contend with each other.
- Regression test (test_chain_intact_under_concurrent_writers): launches 10 simultaneous record() calls via asyncio.gather, asserts the chain is intact and no non-zero prev_hash repeats. Would fail without the lock.
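The race and the fix can be reproduced without a database. The sketch below swaps in a per-org asyncio.Lock for pg_advisory_xact_lock, so it is an in-memory analogue and not the code in services/audit.py. The AuditLog class, chain_intact and run_writers are hypothetical names, and the await asyncio.sleep(0) stands in for the database round trip between reading the chain head and inserting.

```python
import asyncio
import hashlib
import json
from collections import defaultdict

ZERO_HASH = "0" * 64  # assumption: hex-encoded all-zero genesis value


class AuditLog:
    def __init__(self, use_lock=True):
        self.use_lock = use_lock
        self.chains = defaultdict(list)         # org_id -> rows (stand-in for audit_events)
        self.locks = defaultdict(asyncio.Lock)  # one lock per org: stand-in for the advisory lock

    async def _write(self, org_id, payload):
        chain = self.chains[org_id]
        prev = chain[-1]["row_hash"] if chain else ZERO_HASH  # read the chain head
        await asyncio.sleep(0)  # yield to other writers, like a DB round trip would
        body = prev + json.dumps(payload, sort_keys=True)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        chain.append({"prev_hash": prev, "row_hash": digest, "payload": payload})

    async def record(self, org_id, payload):
        if not self.use_lock:
            await self._write(org_id, payload)  # the racy read-then-insert cycle
            return
        async with self.locks[org_id]:          # read + insert are serialized per org
            await self._write(org_id, payload)


def chain_intact(chain):
    """True if every row's prev_hash equals the previous row's row_hash."""
    prev = ZERO_HASH
    for row in chain:
        if row["prev_hash"] != prev:
            return False
        prev = row["row_hash"]
    return True


async def run_writers(log, org_id, n=10):
    """Fire n concurrent record() calls for one org and return its chain."""
    await asyncio.gather(
        *(log.record(org_id, {"action": "device.offline", "seq": i}) for i in range(n))
    )
    return log.chains[org_id]
```

Running asyncio.run(run_writers(AuditLog(use_lock=False), "org-1")) lets all ten writers observe the same empty head, which gives the duplicate-prev_hash signature described above. With use_lock=True the same call yields an intact chain.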
For pre-fix chains where the bug already produced visible breaks, a one-shot repair script (scripts/repair_audit_chain.py) recomputes every row's prev_hash + row_hash chained off the previous row, then inserts a marker audit.chain.repaired event at the tail of the chain. The marker itself is in the chain and explicitly documents that the chain was rewritten: when, why, and how many rows changed. Any auditor reading the chain sees the marker and knows what happened.
The repair script does the rewrite as a superuser (the runtime role still has REVOKE UPDATE); the immutability trigger is dropped and recreated in the same transaction so the window is exactly one statement. After the rewrite the chain verifies green, and the marker event is permanent evidence of the rewrite.
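The shape of such a repair can be sketched in memory. This is not scripts/repair_audit_chain.py: the row dictionaries, the marker's payload fields and the repair_chain signature are assumptions, and the database side (superuser role, trigger drop and recreate) is left out entirely.

```python
import hashlib
import json

ZERO_HASH = "0" * 64  # assumption: hex-encoded all-zero genesis value


def row_hash(prev_hash, payload):
    # Assumption: canonical form is compact JSON with sorted keys.
    body = prev_hash + json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def repair_chain(rows, reason, repaired_at):
    """Re-chain `rows` in place, append a marker row, return how many rows changed."""
    prev = ZERO_HASH
    changed = 0
    for row in rows:
        new_hash = row_hash(prev, row["payload"])
        if row["prev_hash"] != prev or row["row_hash"] != new_hash:
            changed += 1
        row["prev_hash"], row["row_hash"] = prev, new_hash
        prev = new_hash
    # The marker is itself a chained row, so the rewrite is recorded in the chain.
    marker = {"action": "audit.chain.repaired", "reason": reason,
              "repaired_at": repaired_at, "rows_changed": changed}
    rows.append({"payload": marker, "prev_hash": prev, "row_hash": row_hash(prev, marker)})
    return changed
```

Payloads are never touched, only the two hash columns. Note that one concurrency break changes every row after it, because each later row_hash depends on the corrected prev_hash.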
Compliance frameworks the trust model supports
The arguments above underpin the compliance-report claims. We map controls from:
- NIS2 (EU directive)
- SOC 2 Type II (AICPA)
- ISO 27001:2022
- GDPR (EU regulation)
- HIPAA Security Rule (US)
- NIST SP 800-53 Rev 5 (US federal baseline)
- PCI DSS 4.0
- Cyber Essentials Plus (UK)
For the full control-by-control coverage, see the org's Compliance page after enabling a framework — the PDF is generated from app/services/report.py. The trust-model claims here are the evidence the report cites; the report itself is the audit-grade artifact.
Related documents
- System architecture — comprehensive structural view
- WebSocket multiplexing — how the agent tunnel actually works
- Audit chain — SHA-256 hash chain detail
- Row-Level Security — Postgres-level tenant isolation