Linux hardening for AI workloads

GPU clusters are high-value targets. They run expensive hardware, host valuable models, and often sit on networks designed for research convenience rather than production security. Before any model touches production, I run through the same hardening checklist. Here is the short version.

1. Secure Boot and measured boot

Enable Secure Boot with a measured boot chain, and enforce kernel module signature verification (for example, module.sig_enforce=1 or kernel lockdown) so that unsigned modules cannot load. If an attacker can load a kernel module, they own the host. If they can persist in the boot chain, they own it across reboots. I verify boot measurements, meaning the TPM PCR values, against a known-good baseline before scheduling workloads.
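
The baseline comparison can be sketched in a few lines of Python. This is a minimal illustration, not a full attestation flow: it assumes the kernel exposes per-PCR files under /sys/class/tpm/tpm0/pcr-sha256 (available on recent kernels with a TPM 2.0 device), and the function names and the choice of PCR indexes are hypothetical.

```python
from pathlib import Path

# Assumption: per-PCR sysfs files exist at this path on the node.
PCR_DIR = Path("/sys/class/tpm/tpm0/pcr-sha256")


def read_pcrs(indexes, pcr_dir=PCR_DIR):
    """Read the selected PCR values as lowercase hex strings."""
    return {i: (pcr_dir / str(i)).read_text().strip().lower() for i in indexes}


def pcr_mismatches(baseline, current):
    """Return the sorted PCR indexes whose current value is missing
    or differs from the known-good baseline (case-insensitive)."""
    return sorted(
        index
        for index, expected in baseline.items()
        if current.get(index, "").lower() != expected.lower()
    )
```

On a node, something like pcr_mismatches(baseline, read_pcrs(baseline)) would return an empty list when every recorded PCR matches. A scheduler hook could refuse to place workloads on any node that returns a non-empty list.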

2. Encrypt data at rest

Models, training data, and checkpoints belong on encrypted volumes. I use LUKS for block-level encryption and manage keys through a KMS or seal them to a TPM. This protects data if a physical drive leaves the data center and, when keys are scoped per node or per volume, limits the blast radius if a node is compromised.
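
To check that the expected mount points really sit on encrypted devices, one option is to parse the JSON output of lsblk and look for a dm-crypt ("crypt") device in each mount's device chain. The sketch below assumes lsblk from util-linux is installed. The function names and the example mount points (/models, /data) are hypothetical, and the check matches mount points exactly rather than arbitrary paths beneath them.

```python
import json
import subprocess


def load_lsblk():
    """Run lsblk and return its device tree as a dict."""
    result = subprocess.run(
        ["lsblk", "--json", "-o", "NAME,TYPE,MOUNTPOINT"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def encrypted_mountpoints(tree):
    """Return mount points that have a 'crypt' device in their device chain."""
    found = set()

    def walk(device, under_crypt):
        under_crypt = under_crypt or device.get("type") == "crypt"
        mountpoint = device.get("mountpoint")
        if under_crypt and mountpoint:
            found.add(mountpoint)
        for child in device.get("children", []):
            walk(child, under_crypt)

    for device in tree.get("blockdevices", []):
        walk(device, False)
    return found


def unencrypted(required, tree):
    """Return the required mount points that are not backed by dm-crypt."""
    return sorted(set(required) - encrypted_mountpoints(tree))
```

Calling unencrypted(["/models", "/data"], load_lsblk()) on a node would list any required mount point that is missing or not encrypted.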

3. Least-privilege SSH

Interactive SSH on GPU nodes should be rare. I disable password auth, enforce key-based access, restrict allowed users through the AllowUsers directive, and gate admin access via a bastion with audit logging. Better yet, I treat nodes as immutable and redeploy rather than debug them remotely.
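
These SSH settings are easy to audit automatically. The sketch below checks the three items named above against the effective configuration text, for example the output of sshd -T, which prints one lowercase keyword and value per line. The function names and finding messages are hypothetical. Missing keywords fall back to the OpenSSH defaults of "yes" for both authentication methods.

```python
def parse_sshd_config(text):
    """Map each lowercase keyword to the list of values seen for it."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        settings.setdefault(parts[0].lower(), []).append(value)
    return settings


def audit_sshd(text):
    """Return a list of findings; an empty list means all checks passed."""
    settings = parse_sshd_config(text)
    findings = []
    if settings.get("passwordauthentication", ["yes"])[0].lower() != "no":
        findings.append("password authentication is not disabled")
    if settings.get("pubkeyauthentication", ["yes"])[0].lower() != "yes":
        findings.append("public key authentication is not enabled")
    if not settings.get("allowusers"):
        findings.append("no AllowUsers restriction is set")
    return findings
```

Running this in a node image build or a periodic compliance job would catch configuration drift before anyone depends on the node.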

4. Network segmentation

Training traffic, inference traffic, and management traffic live on separate networks. I use VLANs or VXLANs, restrict east-west movement with firewall rules, and expose inference endpoints only through a reverse proxy with TLS termination and rate limiting. The storage backend should not be reachable from the public inference tier.
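
It helps to write the segmentation policy down as data that can be tested before it is translated into firewall rules. The following is a minimal default-deny sketch. The tier CIDR ranges and the allowed flows are made-up examples, not a recommended layout, and real enforcement still happens in the firewall and the network fabric.

```python
import ipaddress

# Hypothetical address plan: one network per tier.
TIERS = {
    "training": ipaddress.ip_network("10.10.0.0/16"),
    "inference": ipaddress.ip_network("10.20.0.0/16"),
    "management": ipaddress.ip_network("10.30.0.0/16"),
    "storage": ipaddress.ip_network("10.40.0.0/16"),
}

# Hypothetical allow-list of (source tier, destination tier) flows.
# Anything not listed here is denied, including inference -> storage.
ALLOWED_FLOWS = {
    ("training", "training"),
    ("training", "storage"),
    ("management", "training"),
    ("management", "inference"),
    ("management", "storage"),
}


def tier_of(ip):
    """Return the tier name that contains this address, or None."""
    address = ipaddress.ip_address(ip)
    for name, network in TIERS.items():
        if address in network:
            return name
    return None


def is_allowed(src_ip, dst_ip):
    """Default deny: a flow is allowed only if both ends are in known
    tiers and the (source, destination) pair is on the allow-list."""
    src, dst = tier_of(src_ip), tier_of(dst_ip)
    if src is None or dst is None:
        return False
    return (src, dst) in ALLOWED_FLOWS
```

A policy like this can be replayed against captured flow logs to confirm that the deployed rules match the intended design.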

5. Audit logging and alerting

Every privileged command, every module load, every authentication attempt, and every network connection gets logged and shipped off-host. I alert on anomalies: new kernel modules, unexpected outbound connections, privilege escalation, and large data transfers. Logs are evidence, not just diagnostics.
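
As one concrete example, the "new kernel modules" alert can be approximated by comparing the loaded modules against an approved baseline. The sketch below parses text in the /proc/modules format, where the first field on each line is the module name. The function names and the baseline contents are hypothetical, and a production setup would more likely rely on audit rules than on polling.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def unexpected_modules(baseline, proc_modules_text):
    """Return the sorted names of loaded modules absent from the baseline."""
    return sorted(loaded_modules(proc_modules_text) - set(baseline))
```

On a node, the input would come from reading /proc/modules, and any non-empty result would be shipped off-host as an alert.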

Hardening is not a one-time checklist. It is a baseline that you continuously verify. If you are preparing an AI cluster for production, I can help.