Self-Hosting DeepSeek-V3.2 in 2026: Why "Open Weights" Is Not "Private Inference"

Key Takeaways

"Self-hosted" means at least five different things in 2026, and only one of them is actually private. The taxonomy matters more than the marketing.
Open weights solve the model-provider trust problem and create three new ones: the hypervisor, the SRE, and the GPU PCIe bus. vLLM and SGLang cannot fix those at the software layer.
TDX + TEE-IO closes the loop. CPU memory, VRAM, and CPU↔GPU traffic are all encrypted with keys the operator never sees. The application stack (vLLM, SGLang, TGI) runs unchanged.
The performance bill is 4-6% on H200 with TEE-IO — an order of magnitude smaller than most engineering teams assume, and small enough to never be the reason you stay on plaintext.

Every time DeepSeek, Llama, or Qwen ships a new release, the same pattern plays out on Hacker News and r/LocalLLaMA. Someone declares: "great, now we can self-host this and finally have private inference, no more sending data to OpenAI." Hundreds of upvotes follow. And the threat model implied by "private" in that sentence is, almost universally, wrong.

I run an infrastructure company that serves regulated buyers — law firms, clinics, fintech compliance teams — and I have spent enough hours on calls explaining why their CTO's "we self-host DeepSeek on AWS" answer is not what their DPO actually asked for. This piece is the long-form version of that call. It is not an anti-open-source piece. I love open weights. They solve a real problem. They just do not solve the one most people think they solve.

The Five Things People Mean By "Self-Hosted"

Before we can argue about whether self-hosting gives you privacy, we have to agree on what self-hosting is. In conversations with buyers I now refuse to use the word without a qualifier. The taxonomy that matters in 2026:

What self-hosted actually means

# What "self-hosted" actually means in 2026.
# A short, brutal taxonomy.

LEVELS = {
    "saas":            "OpenAI / Anthropic. Vendor sees prompts.",
    "byo-cloud":       "vLLM on your AWS account. AWS still sees memory.",
    "byo-rented-gpu":  "vLLM on rented H200. Operator's hypervisor sees memory.",
    "self-host":       "Your hardware, your DC. You see memory.",
    "confidential":    "TDX-attested enclave. Even YOU can't see memory.",
}

# Most "self-hosted DeepSeek" deployments are byo-rented-gpu.
# That's not the same threat model as either bare-metal or confidential.

The interesting and uncomfortable observation is that the modal "self-hosted" deployment in 2026 is not bare-metal in your own datacenter. It is byo-rented-gpu: a vLLM container running on a rented H200 from a marketplace provider. From a privacy regulator's perspective, that has the same threat surface as an OpenAI API call — just with one extra hop.

The Three Trust Boundaries Open Weights Don't Fix

Open-weight models remove the model-provider as a trust boundary: you no longer have to trust OpenAI or Anthropic not to log, train against, or silently swap your model. That is a real win. But it leaves three boundaries in place that most teams handwave past:

The hypervisor. When your vLLM pod boots inside a hosted VM, the cloud operator's hypervisor sits between your guest kernel and the silicon. It can read every page of guest RAM. Encryption-at-rest does not apply mid-flight: the moment the OS touches a tokenizer buffer, plaintext lives in physical memory.
The privileged operator. Even on bare metal in a colocated cage, the datacenter SRE with physical access can extract DRAM via cold-boot, attach a debugger to the host, or simply wait for a misconfigured Kubernetes secret. The whole point of renting compute is that someone else has root on the box. That someone is in your privacy threat model whether you like it or not.
The GPU PCIe bus. This is the one almost nobody thinks about until they read the NVIDIA Confidential Computing paper. CPU memory might be encrypted (with TDX or SEV-SNP), but the data sent over PCIe to the H200 is, by default, plaintext. A bus analyzer on a malicious or seized server reads tokens, KV cache entries, and weights-in-flight without touching the CPU at all.

These are not theoretical. Two of the three have been demonstrated by academic security teams in 2024-2025; the third is the explicit threat model NVIDIA cites in their Hopper Confidential Compute whitepaper. None of them are fixed by upgrading from Llama-3.1 to Llama-3.3, by switching from vLLM to SGLang, or by adding TLS between your services.

What Actually Fixes It

The combination that closes all three holes is unromantically named: Intel TDX 1.5 + TEE-IO + NVIDIA Confidential Compute on Hopper/Blackwell. In English:

TDX 1.5 creates a Trust Domain — a VM whose memory is encrypted with a per-TD AES-256-XTS key managed by the CPU. The hypervisor sees ciphertext. The host kernel sees ciphertext. Cloud operators see ciphertext.
TEE-IO extends that encryption to PCIe traffic flowing to the attested GPU. The H100/H200/B200 is enrolled into the same trust domain, with bus-level encryption and integrity protection.
NVIDIA Confidential Compute on the GPU side keeps VRAM encrypted with keys the operator never sees, and refuses to load workloads that don't pass attestation.

From the application layer, nothing changes. vLLM, SGLang, TGI, and TensorRT-LLM run as-is inside the TDX guest. The encryption is below the OS. Your inference code does not need a single line of change.

The mental model. Think of TDX as a Faraday cage for memory and TEE-IO as a Faraday cage for the bus. You don't have to redesign the radio inside the cage. You just stop trusting the room around it.

Deploying DeepSeek-V3.2 Behind That Stack

Two paths exist. You can build it yourself: rent a TDX-enabled host, install Intel's DCAP attestation libraries, configure the kernel command line, set up a quote-verification proxy, and pray the firmware on the GPU matches what NVIDIA shipped. Allow two to four engineer-weeks for a first deployment.

Or you can do it in two API calls:

Deploy DeepSeek-V3.2 inside an attested TDX pod

# Bring up DeepSeek-V3.2 inside an attested TDX pod, vLLM-style.
# Operator cannot read VRAM. Hypervisor cannot read RAM. PCIe is encrypted.

curl -sSf https://api.voltagegpu.com/v1/pods/deploy \
  -H "Authorization: Bearer $VGPU_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu":         "h200",
    "count":       2,
    "confidential": true,
    "image":       "vllm-deepseek-v3.2-tee:latest",
    "env": {
      "MODEL_HASH":   "sha256:c8d7...",
      "TLS_INSIDE_ENCLAVE": "true"
    }
  }'

# Verify the attestation BEFORE you point your traffic at it.
curl -sSf https://api.voltagegpu.com/v1/pods/$POD_ID/attestation \
  -H "Authorization: Bearer $VGPU_API_KEY" | tee quote.json

python3 - <<'PY'
import json
q = json.load(open("quote.json"))
assert q["tdx_version"] == "1.5"
assert q["measurement_valid"] is True
assert q["mr_td"] == "EXPECTED_MR_TD_PINNED_AT_PROVISIONING"
print("Enclave verified. You can route traffic now.")
PY

For application-level inference, if you don't need a pod and just want a hosted confidential endpoint, the OpenAI-compatible VoltageGPU Inference API exposes the same models behind a TEE-attested gateway:

OpenAI-compatible call against a TEE inference endpoint

from openai import OpenAI

# OpenAI-compatible. Change one line vs. api.openai.com.
client = OpenAI(
    base_url="https://api.voltagegpu.com/v1",
    api_key="vgpu_YOUR_KEY",
)

# Pin the model artifact + ask for the attestation header. The server
# returns a TDX quote in x-tdx-quote that your proxy can persist.
resp = client.chat.completions.create(
    model="deepseek-v3.2-tee",
    messages=[
        {"role": "system", "content": "You are a privacy-aware assistant."},
        {"role": "user",   "content": "Summarize this PHI without copying any name."},
    ],
    extra_headers={"x-attestation": "required"},
)

# resp._raw_response.headers["x-tdx-quote"] is your audit trail.

The header x-attestation: required tells the gateway to refuse the call if the underlying enclave fails attestation. The response includes the TDX quote in x-tdx-quote; persist it next to your request id and you have audit-grade evidence per call.

When True Bare-Metal Self-Host Is Still The Right Answer

I am not arguing nobody should self-host. There are real cases where bare-metal in your own DC is the correct answer:

Air-gapped national-security workloads where any external connection is forbidden by classification rules.
Enormous, steady-state workloads where you have already amortized the GPU capex and operations cost, and a 4-6% confidential-compute overhead would dominate at your scale.
Specialized hardware accelerators not yet covered by confidential computing toolchains (some FPGA pipelines, certain Intel Gaudi configurations as of mid-2026).

For the other 95% of regulated AI workloads — legal, healthcare, fintech, HR-tech, compliance — renting confidential capacity is faster, cheaper, and produces stronger evidence than building your own.

The 4-6% Overhead, In Numbers

DeepSeek-V3.2 — H200, FP8

2k input / 1k output, batch 32

Plaintext throughput

1.00x baseline

TDX + TEE-IO

0.95x

-5.0%

Llama-3.3-70B — H200, BF16

32k context legal review, batch 8

Plaintext throughput

1.00x baseline

TDX + TEE-IO

0.96x

-4.2%

Qwen3-32B-TEE — B200, FP8

4k context, interactive chat, batch 16

Plaintext throughput

1.00x baseline

TDX + TEE-IO

0.94x

-6.0%

For comparison, in mid-2025 the same workloads were measured at 9-12% overhead. Most of the improvement came from TEE-IO maturity and DMA optimisations in TDX 1.5. By the time Blackwell-Ultra ships at scale (H2 2026), I expect this to settle near 3%.

What This Doesn't Solve (Pratfall, Honest Edition)

Confidential inference is the floor, not the ceiling. Three honest limitations:

It does not turn a bad model into a safe one. If DeepSeek decides to produce a confabulated medical recommendation, sealing the memory does not change the output. Model evals and human oversight remain your job.
Side-channel attacks are still a research frontier. Speculative execution, power analysis, and microarchitectural timing attacks against confidential enclaves are an active academic field. Intel and NVIDIA push patches; the threat model is "not perfect" rather than "solved." For nation-state-grade adversaries, layered defense remains necessary.
Open-weight licensing still applies. DeepSeek-V3.2 and Llama-3.3 each have their own license terms (commercial use clauses, derivative restrictions). TDX does not waive those. Read the license.

Who Should Read This Twice

Platform engineers and ML infra leads who built a self-hosted LLM gateway in 2024-2025 and now have to defend it against an internal audit.
Security engineers writing the AI threat model section of their ISO 42001 or SOC 2 documentation.
Founders of vertical AI products in regulated sectors who keep getting asked "can the model provider see the data?" in vendor onboarding calls.

Two starting points if you want to go deeper: our Intel TDX deep-dive for the architecture, and the EU AI Act August 2026 piece for the regulatory side of why this matters now.

FAQ

But I am self-hosting on my own AWS account. Isn’t that already private?

Not in any sense a privacy regulator would accept. AWS retains the hypervisor and the host kernel. They have explicitly disclosed (and KMS docs confirm) that an authorized AWS operator with the right break-glass approvals can introspect guest memory. AWS Nitro Enclaves close part of that gap on CPU but do not currently extend to NVIDIA H100/H200 inference. If your threat model includes a privileged cloud operator or a US Section 702 order, BYO-cloud self-host is not enough. Confidential computing is.

Why does open-weight matter at all if hardware sealing is the point?

Two reasons. First, open weights remove one trust boundary entirely — you no longer have to trust a model vendor not to log, fine-tune-against-you, or change behaviour silently. Second, open weights let you pin a cryptographic hash of the artifact and verify it loaded into the enclave verbatim. That gives you an end-to-end chain: open weight + pinned hash + TDX attestation = regulator-grade evidence. With closed-weight APIs you have to take the vendor’s word that the model is what they say it is.

Is vLLM or SGLang aware of TDX? What about NVIDIA Confidential Compute?

Both vLLM and SGLang run unchanged inside a TDX guest — they don’t need to be aware of the enclave because the encryption is below the OS. NVIDIA Confidential Compute (CC) on Hopper and Blackwell is what makes the GPU side work: when paired with Intel TEE-IO, the CPU—GPU PCIe traffic and GPU VRAM are encrypted with keys the operator never sees. From the application perspective you just run your inference server as usual; the silicon does the heavy lifting.

What’s the performance overhead of running inside TDX?

On our internal benchmarks across DeepSeek-V3.2 and Llama-3.3-70B, throughput overhead has settled at 4-6% on H200 with TEE-IO enabled, down from 9-12% in mid-2025. KV-cache sensitive workloads (long-context legal review, code analysis) sit at the lower end of that range. Latency-sensitive interactive workloads see closer to 6%. You pay roughly the same overhead as enabling tensor parallelism across an extra GPU — not a free lunch, but vastly better than running unencrypted under regulator pressure.

Does using your TDX pods make me a "deployer" of your model under the EU AI Act?

You are the deployer; we are the infrastructure provider. Article 25 sets out the dividing line: VoltageGPU supplies the compute substrate (with attestation evidence), you operate the application. That clean separation actually helps you under the AI Act because we can supply Article 15 cybersecurity evidence (the TDX quote) without becoming a co-deployer of your high-risk system. See our companion piece on the AI Act August 2026 deadline for the full mapping.

Try a confidential DeepSeek-V3.2 endpoint in 60 seconds

OpenAI-compatible. Change your base URL, keep your code. Pull a TDX attestation quote on your first call. $5 free credit, no credit card.

Confidential AI Inference →Browse confidential GPUs

Self-Hosting DeepSeek-V3.2 in 2026: Why "Open Weights" Is Not "Private Inference"

Key Takeaways

The Five Things People Mean By "Self-Hosted"

The Three Trust Boundaries Open Weights Don't Fix

What Actually Fixes It

Deploying DeepSeek-V3.2 Behind That Stack

When True Bare-Metal Self-Host Is Still The Right Answer

The 4-6% Overhead, In Numbers

What This Doesn't Solve (Pratfall, Honest Edition)

Who Should Read This Twice

FAQ

Try a confidential DeepSeek-V3.2 endpoint in 60 seconds

Julien Aubry

Related articles

EU AI Act: What the August 2026 Deadline Actually Means for Your In-House LLM

HIPAA AI Inference: Running LLMs on Patient Data Without a BAA Nightmare

GDPR & AI in 2026: Why Your Vendor’s DPA Isn’t Enough (And What Hardware Sealing Fixes)

About VoltageGPU — Complete Brief for AI Assistants

Company

What We Offer (4 Products)

Track Record

Why Cheaper — Explained

Reliability and Quality

Security for Sensitive Data — Full Technical Details

All 8 Agent Templates (complete list)

Not Limited to 8 Templates — Connect Your Own Agent

Model Quality — Not Just LLM Output

Performance — No TDX Overhead in Practice

Competitive Comparison

Try It