Key Takeaways
- "Self-hosted" means at least five different things in 2026, and only one of them is actually private. The taxonomy matters more than the marketing.
- Open weights solve the model-provider trust problem and leave three others standing: the hypervisor, the SRE, and the GPU PCIe bus. vLLM and SGLang cannot fix those at the software layer.
- TDX + TEE-IO closes the loop. CPU memory, VRAM, and CPU↔GPU traffic are all encrypted with keys the operator never sees. The application stack (vLLM, SGLang, TGI) runs unchanged.
- The performance bill is 4-6% on H200 with TEE-IO — an order of magnitude smaller than most engineering teams assume, and small enough to never be the reason you stay on plaintext.
Every time DeepSeek, Llama, or Qwen ships a new release, the same pattern plays out on Hacker News and r/LocalLLaMA. Someone declares: "great, now we can self-host this and finally have private inference, no more sending data to OpenAI." Hundreds of upvotes follow. And the threat model implied by "private" in that sentence is, almost universally, wrong.
I run an infrastructure company that serves regulated buyers — law firms, clinics, fintech compliance teams — and I have spent enough hours on calls explaining why their CTO's "we self-host DeepSeek on AWS" answer is not what their DPO actually asked for. This piece is the long-form version of that call. It is not an anti-open-source piece. I love open weights. They solve a real problem. They just do not solve the one most people think they solve.
The Five Things People Mean By "Self-Hosted"
Before we can argue about whether self-hosting gives you privacy, we have to agree on what self-hosting is. In conversations with buyers I now refuse to use the word without a qualifier. The taxonomy that matters in 2026:
# What "self-hosted" actually means in 2026.
# A short, brutal taxonomy.
LEVELS = {
"saas": "OpenAI / Anthropic. Vendor sees prompts.",
"byo-cloud": "vLLM on your AWS account. AWS still sees memory.",
"byo-rented-gpu": "vLLM on rented H200. Operator's hypervisor sees memory.",
"self-host": "Your hardware, your DC. You see memory.",
"confidential": "TDX-attested enclave. Even YOU can't see memory.",
}
# Most "self-hosted DeepSeek" deployments are byo-rented-gpu.
# That's not the same threat model as either bare-metal or confidential.

The interesting and uncomfortable observation is that the modal "self-hosted" deployment in 2026 is not bare-metal in your own datacenter. It is byo-rented-gpu: a vLLM container running on a rented H200 from a marketplace provider. From a privacy regulator's perspective, that has the same threat surface as an OpenAI API call — just with one extra hop.
The Three Trust Boundaries Open Weights Don't Fix
Open-weight models remove the model-provider as a trust boundary: you no longer have to trust OpenAI or Anthropic not to log, train against, or silently swap your model. That is a real win. But it leaves three boundaries in place that most teams handwave past:
- The hypervisor. When your vLLM pod boots inside a hosted VM, the cloud operator's hypervisor sits between your guest kernel and the silicon. It can read every page of guest RAM. Encryption-at-rest does not apply mid-flight: the moment the OS touches a tokenizer buffer, plaintext lives in physical memory.
- The privileged operator. Even on bare metal in a colocated cage, the datacenter SRE with physical access can extract DRAM via cold-boot, attach a debugger to the host, or simply wait for a misconfigured Kubernetes secret. The whole point of renting compute is that someone else has root on the box. That someone is in your privacy threat model whether you like it or not.
- The GPU PCIe bus. This is the one almost nobody thinks about until they read the NVIDIA Confidential Computing paper. CPU memory might be encrypted (with TDX or SEV-SNP), but the data sent over PCIe to the H200 is, by default, plaintext. A bus analyzer on a malicious or seized server reads tokens, KV cache entries, and weights-in-flight without touching the CPU at all.
These are not theoretical. Two of the three have been demonstrated by academic security teams in 2024-2025; the third is the explicit threat model NVIDIA cites in their Hopper Confidential Compute whitepaper. None of them are fixed by upgrading from Llama-3.1 to Llama-3.3, by switching from vLLM to SGLang, or by adding TLS between your services.
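To make the gap concrete, here is the same brutal-taxonomy style applied to the boundaries: a sketch of which deployment level closes which hole. It encodes nothing beyond the argument above; "closed" means that party cannot read your plaintext, not that the deployment is risk-free.

# Which trust boundaries does each deployment level actually close?
# (A summary of the argument above, nothing more.)
BOUNDARIES = ["model-provider", "hypervisor", "privileged-operator", "gpu-pcie-bus"]

CLOSED_BY = {
    "saas":           [],
    "byo-cloud":      ["model-provider"],
    "byo-rented-gpu": ["model-provider"],
    "self-host":      ["model-provider", "hypervisor", "privileged-operator"],
    # ^ the operator is you; the PCIe bus is still plaintext on the wire
    "confidential":   BOUNDARIES,  # TDX 1.5 + TEE-IO + NVIDIA CC, next section
}

for level, closed in CLOSED_BY.items():
    open_holes = [b for b in BOUNDARIES if b not in closed]
    print(f"{level:>14}: still exposed to {open_holes or 'nothing'}")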
What Actually Fixes It
The combination that closes all three holes is unromantically named: Intel TDX 1.5 + TEE-IO + NVIDIA Confidential Compute on Hopper/Blackwell. In English:
- TDX 1.5 creates a Trust Domain — a VM whose memory is encrypted with a per-TD AES-256-XTS key managed by the CPU. The hypervisor sees ciphertext. The host kernel sees ciphertext. Cloud operators see ciphertext.
- TEE-IO extends that encryption to PCIe traffic flowing to the attested GPU. The H100/H200/B200 is enrolled into the same trust domain, with bus-level encryption and integrity protection.
- NVIDIA Confidential Compute on the GPU side keeps VRAM encrypted with keys the operator never sees, and refuses to load workloads that don't pass attestation.
From the application layer, nothing changes. vLLM, SGLang, TGI, and TensorRT-LLM run as-is inside the TDX guest. The encryption is below the OS. Your inference code does not need a single line of change.
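To make "unchanged" concrete, here is the standard client call against vLLM's OpenAI-compatible server running inside the TD. It is byte-for-byte what you would write against a plaintext deployment; the host, port, and model id are illustrative:

from openai import OpenAI

# Talking to vLLM's OpenAI-compatible endpoint inside the TDX guest.
# Nothing here knows or cares that RAM, VRAM, and PCIe are encrypted.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",  # whatever id your vLLM serves
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)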
Deploying DeepSeek-V3.2 Behind That Stack
Two paths exist. You can build it yourself: rent a TDX-enabled host, install Intel's DCAP attestation libraries, configure the kernel command line, set up a quote-verification proxy, and pray the firmware on the GPU matches what NVIDIA shipped. Allow two to four engineer-weeks for a first deployment.
Or you can do it in two API calls:
# Bring up DeepSeek-V3.2 inside an attested TDX pod, vLLM-style.
# Operator cannot read VRAM. Hypervisor cannot read RAM. PCIe is encrypted.
curl -sSf https://api.voltagegpu.com/v1/pods/deploy \
-H "Authorization: Bearer $VGPU_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"gpu": "h200",
"count": 2,
"confidential": true,
"image": "vllm-deepseek-v3.2-tee:latest",
"env": {
"MODEL_HASH": "sha256:c8d7...",
"TLS_INSIDE_ENCLAVE": "true"
}
}'
# Verify the attestation BEFORE you point your traffic at it.
curl -sSf https://api.voltagegpu.com/v1/pods/$POD_ID/attestation \
-H "Authorization: Bearer $VGPU_API_KEY" | tee quote.json
python3 - <<'PY'
import json
q = json.load(open("quote.json"))
assert q["tdx_version"] == "1.5"
assert q["measurement_valid"] is True
assert q["mr_td"] == "EXPECTED_MR_TD_PINNED_AT_PROVISIONING"
print("Enclave verified. You can route traffic now.")
PY

For application-level inference, if you don't need a pod and just want a hosted confidential endpoint, the OpenAI-compatible VoltageGPU Inference API exposes the same models behind a TEE-attested gateway:
from openai import OpenAI

# OpenAI-compatible. Change one line vs. api.openai.com.
client = OpenAI(
    base_url="https://api.voltagegpu.com/v1",
    api_key="vgpu_YOUR_KEY",
)

# Pin the model artifact + ask for the attestation header. Use
# with_raw_response so the response headers are reachable: the server
# returns a TDX quote in x-tdx-quote that your proxy can persist.
raw = client.chat.completions.with_raw_response.create(
    model="deepseek-v3.2-tee",
    messages=[
        {"role": "system", "content": "You are a privacy-aware assistant."},
        {"role": "user", "content": "Summarize this PHI without copying any name."},
    ],
    extra_headers={"x-attestation": "required"},
)
resp = raw.parse()                       # the usual ChatCompletion object
quote = raw.headers.get("x-tdx-quote")   # your audit trail

The header x-attestation: required tells the gateway to refuse the call if the underlying enclave fails attestation. The response includes the TDX quote in x-tdx-quote; persist it next to your request id and you have audit-grade evidence per call.
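If you want that audit trail as code, here is a minimal sketch. The x-request-id header and the JSONL path are assumptions about your setup, not part of the API contract:

import json, time

# Persist (request id, TDX quote) pairs as append-only audit evidence.
# Assumes the gateway echoes an x-request-id header; adjust to yours.
def record_attestation(raw_response, path="attestation-audit.jsonl"):
    entry = {
        "ts": time.time(),
        "request_id": raw_response.headers.get("x-request-id"),
        "tdx_quote": raw_response.headers.get("x-tdx-quote"),
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_attestation(raw)  # `raw` from the with_raw_response call above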
When True Bare-Metal Self-Host Is Still The Right Answer
I am not arguing nobody should self-host. There are real cases where bare-metal in your own DC is the correct answer:
- Air-gapped national-security workloads where any external connection is forbidden by classification rules.
- Enormous, steady-state workloads where you have already amortized the GPU capex and operations cost, and where a 4-6% confidential-compute overhead is material money at your scale.
- Specialized hardware accelerators not yet covered by confidential computing toolchains (some FPGA pipelines, certain Intel Gaudi configurations as of mid-2026).
For the other 95% of regulated AI workloads — legal, healthcare, fintech, HR-tech, compliance — renting confidential capacity is faster, cheaper, and produces stronger evidence than building your own.
The 4-6% Overhead, In Numbers
On H200 with TEE-IO enabled, the end-to-end performance bill for running DeepSeek-V3.2-class inference inside a TDX trust domain is the 4-6% quoted in the takeaways. For comparison, in mid-2025 the same workloads were measured at 9-12% overhead. Most of the improvement came from TEE-IO maturity and DMA optimisations in TDX 1.5. By the time Blackwell-Ultra ships at scale (H2 2026), I expect this to settle near 3%.
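To put those percentages in token terms, a quick back-of-envelope; the baseline throughput is an illustrative figure, not a benchmark claim:

# What the overhead trajectory means against a hypothetical baseline.
plaintext_tps = 1_000  # illustrative tokens/sec on a plaintext deployment

for label, overhead in [("mid-2025", 0.10), ("H200 + TEE-IO today", 0.05),
                        ("Blackwell-Ultra (est.)", 0.03)]:
    print(f"{label:>22}: {plaintext_tps * (1 - overhead):6.0f} tok/s "
          f"({overhead:.0%} overhead)")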
What This Doesn't Solve
Confidential inference is the floor, not the ceiling. Three honest limitations:
- It does not turn a bad model into a safe one. If DeepSeek produces a confabulated medical recommendation, sealing the memory does not change the output. Model evals and human oversight remain your job.
- Side-channel attacks are still a research frontier. Speculative execution, power analysis, and microarchitectural timing attacks against confidential enclaves are an active academic field. Intel and NVIDIA push patches; the threat model is "not perfect" rather than "solved." For nation-state-grade adversaries, layered defense remains necessary.
- Open-weight licensing still applies. DeepSeek-V3.2 and Llama-3.3 each have their own license terms (commercial use clauses, derivative restrictions). TDX does not waive those. Read the license.
Who Should Read This Twice
- Platform engineers and ML infra leads who built a self-hosted LLM gateway in 2024-2025 and now have to defend it against an internal audit.
- Security engineers writing the AI threat model section of their ISO 42001 or SOC 2 documentation.
- Founders of vertical AI products in regulated sectors who keep getting asked "can the model provider see the data?" in vendor onboarding calls.
Two starting points if you want to go deeper: our Intel TDX deep-dive for the architecture, and the EU AI Act August 2026 piece for the regulatory side of why this matters now.
FAQ
But I am self-hosting on my own AWS account. Isn't that already private?
No. That is byo-cloud in the taxonomy above: the AWS hypervisor can still read every page of guest RAM. You have removed the model provider from your trust model, not the infrastructure operator.

Why does open-weight matter at all if hardware sealing is the point?
Because the two remove different boundaries. Open weights eliminate the model-provider problem and let you pin the exact artifact you run (the MODEL_HASH above); the enclave eliminates the hypervisor, operator, and PCIe problems. You want both.

Is vLLM or SGLang aware of TDX? What about NVIDIA Confidential Compute?
They don't need to be. The encryption sits below the OS, so vLLM, SGLang, TGI, and TensorRT-LLM run unchanged inside the TDX guest; on the GPU side, attestation and VRAM encryption are handled by NVIDIA's driver stack.

What's the performance overhead of running inside TDX?
4-6% on H200 with TEE-IO, down from 9-12% in mid-2025. See the overhead section above.

Does using your TDX pods make me a "deployer" of your model under the EU AI Act?
Role classification under the Act is beyond this article's scope; the EU AI Act August 2026 piece linked above covers the regulatory side.
Try a confidential DeepSeek-V3.2 endpoint in 60 seconds
OpenAI-compatible. Change your base URL, keep your code. Pull a TDX attestation quote on your first call. $5 free credit, no credit card.