Building vLLM from Source: A Field Guide (with all the pitfalls)
A step-by-step field guide to building vLLM from source on Ubuntu 26.04, covering Python 3.14 compatibility, CUDA driver issues, and toolchain pitfalls.
Building vLLM from source sounds like it is one pip install -e . away. In practice, on a fresh machine with a recent OS and a recent Python, you hit a chain of version-skew, driver, and toolchain issues, each of which fails with a cryptic message. This post walks through a real end-to-end build on an AWS g5 instance (NVIDIA A10G) running Ubuntu 26.04 + Python 3.14, documenting every error encountered and the fix.
The target was a CUDA build of a vLLM fork. The same playbook applies to a stock vllm-project/vllm checkout.
TL;DR — the working recipe
# 1. Confirm you actually have a GPU (see "Pitfall 1" — easy to get wrong)
lspci | grep -i nvidia # hardware present?
nvidia-smi # driver working?
# 2. Driver (if nvidia-smi fails but lspci shows the GPU)
sudo apt-get install -y nvidia-driver-575-open nvidia-modprobe dkms
sudo modprobe -r nouveau && sudo modprobe nvidia # or reboot
# 3. Virtual env
python3 -m venv ~/go/venv && source ~/go/venv/bin/activate
pip install --upgrade pip
# 4. CUDA torch + a CONSISTENT pip CUDA toolkit (critical: one minor version)
pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0 # default index = CUDA build
pip install "cuda-toolkit[nvcc]==13.3.0" "nvidia-cuda-runtime==13.3.29" \
"nvidia-cuda-nvrtc==13.3.33" "nvidia-cublas==13.3.0.5"
# 5. Assemble CUDA_HOME from the pip layout
export CUDA_HOME=$(echo "$VIRTUAL_ENV"/lib/python3.*/site-packages/nvidia/cu13) # echo forces the glob to expand; a plain assignment keeps the literal *
ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64
( cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf "$f" "${f%%.so.*}.so"; done )
mkdir -p $CUDA_HOME/lib/stubs
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so
# 6. Build (scope arch to YOUR GPU — A10G is sm_86)
export PATH=$CUDA_HOME/bin:$PATH CUDACXX=$CUDA_HOME/bin/nvcc
export VLLM_TARGET_DEVICE=cuda TORCH_CUDA_ARCH_LIST="8.6+PTX"
export MAX_JOBS=12 NVCC_THREADS=2
export CMAKE_ARGS="-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc"
pip install -v -e . --no-build-isolation
Read on for why each line is there and what breaks without it.
Prerequisites & how to check them
Before anything else, take an inventory. Getting this wrong wastes the most time — including the most embarrassing pitfall of all.
| Requirement | How to check | Notes |
|---|---|---|
| A GPU (and which one) | lspci \| grep -i nvidia | Determines CUDA vs CPU build. Don’t trust nvidia-smi alone — see Pitfall 1. |
| GPU driver loaded | nvidia-smi | If it fails but lspci shows a GPU, the driver isn’t installed/loaded. |
| Compute capability | nvidia-smi --query-gpu=compute_cap --format=csv | A10G = 8.6. You build kernels for this. |
| CPU flags (CPU build only) | lscpu \| grep -oE 'avx512f\|avx2' | vLLM CPU wants AVX512; AVX2 works with limited features. |
| Compiler | gcc --version | vLLM recommends gcc 12–13; newer (15) mostly works but watch nvcc host-compiler limits. |
| Python | python3 --version | Check the repo’s requires-python in pyproject.toml. |
| RAM / cores | nproc; free -h | CUDA compiles are RAM-hungry (~2–3 GB per parallel job). |
| build tools | cmake --version; ninja --version | vLLM needs cmake ≥ 3.26. |
Pitfall 1: “There’s no GPU here” — when there definitely is
This one cost us a whole CPU build. The very first check was:
nvidia-smi # → command not found
Conclusion drawn: no GPU, do a CPU build. Wrong. nvidia-smi missing only means the driver/userspace tools aren’t installed — it says nothing about the hardware. The actual hardware check is:
$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
The A10G was there the whole time; it just had no driver. Always check lspci (or /proc/driver/nvidia, ls /dev/nvidia*) before concluding “no GPU.” On cloud instances that aren’t “Deep Learning AMIs,” a bare GPU with no driver is the norm, not the exception.
Lesson:
lspci detects hardware. nvidia-smi detects a working driver. They answer different questions. Decide CPU-vs-GPU from lspci.
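The two checks can be folded into one decision so a missing tool is never mistaken for missing hardware. This is a minimal sketch, assuming you pass in the text printed by lspci and whether nvidia-smi was found on PATH; the function name and the three return labels are made up for illustration.

```python
def classify_gpu_state(lspci_output: str, nvidia_smi_found: bool) -> str:
    """Combine the hardware check (lspci) with the driver-tools check (nvidia-smi)."""
    has_hardware = any(
        "nvidia" in line.lower() for line in lspci_output.splitlines()
    )
    if not has_hardware:
        return "no-gpu"             # no NVIDIA device on the bus: CPU build
    if not nvidia_smi_found:
        return "gpu-no-driver"      # hardware is there, driver/userspace is not
    return "gpu-tools-present"      # still verify CUDA init separately
```

In practice you would feed it the stdout of lspci and the result of shutil.which("nvidia-smi") is not None. Only the first label justifies a CPU build.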
Step 2: Install and load the NVIDIA driver
lspci shows the GPU, nvidia-smi is missing → install the driver.
sudo apt-get update
sudo apt-get install -y dkms build-essential linux-headers-$(uname -r) \
nvidia-driver-575-open
We used the open-kernel variant (-open), which is NVIDIA’s recommendation for Ampere and newer (A10G is Ampere). The 575 metapackage pulled driver 580.159.03.
Pitfall 2: modprobe nvidia → “No such device” (nouveau owns the GPU)
$ sudo modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device
$ dmesg | grep NVRM
NVRM: GPU 0000:00:1e.0 is already bound to nouveau.
The open-source nouveau driver grabs the GPU at boot. The NVIDIA module can’t bind while nouveau holds it. Fix — blacklist, unbind, and load:
echo -e "blacklist nouveau\noptions nouveau modeset=0" | \
sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo -n "0000:00:1e.0" | sudo tee /sys/bus/pci/drivers/nouveau/unbind
sudo rmmod nouveau
sudo modprobe nvidia
sudo update-initramfs -u # make the blacklist survive reboots
If rmmod nouveau complains it’s in use (e.g. a display manager), a reboot after the blacklist + initramfs update achieves the same thing cleanly.
Pitfall 3: nvidia-smi works but CUDA returns error 999 (“unknown error”)
This is the subtle one. After loading the module:
$ nvidia-smi # works, shows the A10G
$ python -c "import torch; print(torch.cuda.is_available())"
RuntimeError: CUDA unknown error ... # False
A direct driver-API probe confirmed the runtime was broken even though nvidia-smi was fine:
import ctypes
print(ctypes.CDLL("libcuda.so.1").cuInit(0)) # → 999 (CUDA_ERROR_UNKNOWN)
Two distinct causes, both worth knowing:
- Stale/incorrect UVM device nodes. nvidia-smi uses /dev/nvidia0 + /dev/nvidiactl (major 195). CUDA additionally needs /dev/nvidia-uvm. After a manual driver bring-up those nodes can be missing or have the wrong major. Recreate them against /proc/devices:
sudo modprobe nvidia_uvm
UVM_MAJOR=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
sudo rm -f /dev/nvidia-uvm /dev/nvidia-uvm-tools
sudo mknod -m 666 /dev/nvidia-uvm c $UVM_MAJOR 0
sudo mknod -m 666 /dev/nvidia-uvm-tools c $UVM_MAJOR 1
- nvidia-modprobe is not installed. This setuid helper is what the CUDA runtime shells out to in order to create/initialize device nodes for non-root processes. Without it, raw cuInit may pass but torch’s runtime init throws 999. This was the actual fix for us:
sudo apt-get install -y nvidia-modprobe
sudo nvidia-modprobe -c 0 -u
After this: torch.cuda.is_available() → True. A reboot also installs the proper udev rules and avoids the manual mknod dance, but if you can’t reboot, the two steps above get you there.
Lesson:
nvidia-smi working ≠ CUDA working. They use different device nodes. If cuInit returns 999, look at /dev/nvidia-uvm and make sure nvidia-modprobe exists.
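To make the device-node check repeatable, the lookup against /proc/devices can be written as a small parser. This is a sketch under assumptions: both helper names are hypothetical, and the list of required nodes is just the three paths named in this section for a single-GPU box.

```python
from typing import Optional


def uvm_major(proc_devices_text: str) -> Optional[int]:
    """Return the character-device major registered for nvidia-uvm, if any."""
    for line in proc_devices_text.splitlines():
        parts = line.split()
        # Entries look like "<major> <name>"; section headers have no number.
        if len(parts) == 2 and parts[0].isdigit() and parts[1] == "nvidia-uvm":
            return int(parts[0])
    return None


def missing_nodes(existing_paths: set) -> list:
    """List which of the nodes named in this section are absent."""
    required = ["/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"]
    return [path for path in required if path not in existing_paths]
```

Matching the name exactly, rather than grepping for a substring, avoids picking up a second line if some other entry happens to contain nvidia-uvm. A None result means the nvidia_uvm module is not loaded yet.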
Step 3: The virtual environment
Nothing exotic here, but keep it isolated from system Python:
python3 -m venv ~/go/venv
source ~/go/venv/bin/activate
pip install --upgrade pip
We used Python 3.14. Check the repo supports it:
grep requires-python pyproject.toml
# requires-python = ">=3.10,<3.15" ✅ 3.14 allowed
It built fine — torch==2.11.0 and every dependency had cp314 wheels. But see Pitfall 6: a bundled submodule had its own narrower Python check.
Step 4: CUDA torch + a consistent CUDA toolkit
vLLM compiles .cu kernels, so it needs nvcc — which PyTorch wheels do not bundle (they ship runtime libraries only). You have two options:
- Install the full CUDA toolkit to /usr/local/cuda via NVIDIA’s apt repo, or
- Assemble a toolkit entirely from pip wheels.
We went pip-only (no apt repo for Ubuntu 26.04 yet, and it keeps everything in the venv). First, the CUDA build of torch:
pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0
python -c "import torch; print(torch.version.cuda)" # → 13.0 (wheel tag: 2.11.0+cu130)
Then nvcc and the dev components via the modern unified meta package:
pip install "cuda-toolkit[nvcc]==13.3.0"
Pitfall 4: the nvidia-cuda-nvcc-cu13 package is a stub
The old naming is a trap:
$ pip install nvidia-cuda-nvcc-cu13
ERROR: ... (from versions: 0.0.0a0, 0.0.1) # placeholder only!
The real compiler ships via the cuda-toolkit[nvcc] extra (which pulls nvidia-cuda-nvcc, nvidia-nvvm, nvidia-cuda-crt). Use the meta package’s extras, not the *-cu13 standalone names.
Pitfall 5: CUDA toolkit version skew (three separate failures)
This was the single biggest time sink. The pip CUDA ecosystem is split across many packages (nvidia-cuda-nvcc, nvidia-nvvm, nvidia-cuda-crt, nvidia-cuda-cccl, nvidia-cuda-runtime, nvidia-cublas, …) and pip will happily install mismatched minor versions. Each mismatch fails differently:
5a. ptxas can’t assemble newer PTX:
ptxas fatal : Unsupported .version 9.3; current version is '9.0'
nvcc front-end was 13.3 (emits PTX 9.3) but ptxas was 13.0 (≤ PTX 9.0). → align them.
5b. CMake refuses on nvcc-vs-headers mismatch (PyTorch’s cuda.cmake):
CMake Error: FindCUDA says CUDA version is 13.3 (from nvcc), but the CUDA headers
say the version is 13.0.
5c. flashinfer’s bundled cccl refuses at runtime (its JIT compiler):
cccl/.../cuda_toolkit.h:41: error: "CUDA compiler and CUDA toolkit headers are
incompatible, please check your include paths"
The cccl check requires CUDART_VERSION’s minor to exactly equal nvcc’s minor.
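The same comparison can be run by hand before the JIT trips over it. CUDART_VERSION is encoded as major*1000 + minor*10, so it can be decoded and compared with the release that nvcc reports. This is a rough sketch with hypothetical helper names; it mirrors the idea of the check, not cccl's actual code.

```python
import re


def decode_cudart_version(value: int) -> tuple:
    """Decode CUDART_VERSION (major*1000 + minor*10) into (major, minor)."""
    return value // 1000, (value % 1000) // 10


def parse_nvcc_release(nvcc_version_output: str) -> tuple:
    """Pull (major, minor) from the 'release X.Y' text in nvcc --version."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_version_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def headers_match_compiler(cudart_version: int, nvcc_version_output: str) -> bool:
    """True when the runtime headers and nvcc agree on major and minor."""
    return decode_cudart_version(cudart_version) == parse_nvcc_release(
        nvcc_version_output
    )
```

With 13.0 headers (13000) and a 13.3 nvcc, the last function returns False, which is the situation behind the error above.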
The fix for all three: pin the entire CUDA userspace to one minor version.
Why 13.3 and not 13.0 (to match torch’s cu130)? Because CUDA 13.0 headers don’t compile on glibc 2.43 (Ubuntu 26.04):
/usr/include/.../mathcalls.h:206: error: exception specification is incompatible with that of previous function "rsqrt"
CUDA 13.1+ headers fixed this. So we align up to 13.3. torch built for cu130 still runs on a 13.3 runtime thanks to CUDA 13 minor-version compatibility (any 13.x toolkit runs on an R580+ driver).
pip install "cuda-toolkit==13.3.0" "nvidia-cuda-runtime==13.3.29" \
"nvidia-cuda-nvcc==13.3.33" "nvidia-nvvm==13.3.33" \
"nvidia-cuda-crt==13.3.33" "nvidia-cuda-cccl==13.3.3.3.1" \
"nvidia-cuda-nvrtc==13.3.33" "nvidia-cublas==13.3.0.5"
# verify nvcc and headers agree:
nvcc --version | grep release # 13.3
grep CUDART_VERSION $CUDA_HOME/include/cuda_runtime_api.h # 13030 (= 13.3)
pip prints a dependency-conflict warning (torch pins cuda-toolkit==13.0.2) — it’s cosmetic; torch runs fine via minor-version compat. But beware: reinstalling vLLM later re-pulls its requirements/cuda.txt and silently downgrades the runtime back to 13.0, breaking flashinfer’s JIT again. Re-run the 13.3 pins after any reinstall.
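Because a later reinstall can quietly undo the pins, it helps to have a quick way to see whether the CUDA packages still share one minor version. This is a minimal sketch, assuming you collect a {package: version} mapping yourself; the function names are illustrative.

```python
from collections import defaultdict


def group_by_minor(versions: dict) -> dict:
    """Map 'major.minor' -> sorted package names, given {package: version}."""
    groups = defaultdict(list)
    for name, version in versions.items():
        major_minor = ".".join(version.split(".")[:2])
        groups[major_minor].append(name)
    return {minor: sorted(names) for minor, names in groups.items()}


def has_skew(versions: dict) -> bool:
    """True when the packages span more than one major.minor."""
    return len(group_by_minor(versions)) > 1
```

One way to build the mapping is importlib.metadata.version(name) for each package in the pin list above. More than one key in the result means something was downgraded and the pins need re-running.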
Step 5: Assemble a working CUDA_HOME
The pip wheels lay CUDA out under .../site-packages/nvidia/cu13/{bin,include,lib}, which is almost what CMake and downstream linkers expect — but missing three things:
export CUDA_HOME=$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13
# (a) unversioned dev symlinks: wheels ship libcudart.so.13, linkers want libcudart.so
( cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf "$f" "${f%%.so.*}.so"; done )
# (b) lib64 alias: some tools (flashinfer JIT) hardcode $CUDA_HOME/lib64
ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64
# (c) a libcuda stub for driver-API linking (pip ships no stubs/)
mkdir -p $CUDA_HOME/lib/stubs
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so
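If you would rather script step (a) than rely on the shell loop, the same renaming rule can be expressed in Python. This is a sketch of the equivalent logic with hypothetical function names; it touches only the directory you pass in.

```python
from pathlib import Path


def unversioned_name(filename: str) -> str:
    """Mirror ${f%%.so.*}.so from the shell loop: libcudart.so.13 -> libcudart.so."""
    stem, sep, _ = filename.partition(".so.")
    return stem + ".so" if sep else filename


def link_unversioned(lib_dir: Path) -> list:
    """Create lib*.so symlinks beside each versioned lib*.so.N; return their names."""
    created = []
    for path in sorted(lib_dir.glob("lib*.so.*")):
        link = lib_dir / unversioned_name(path.name)
        if link.is_symlink() or link.exists():
            link.unlink()              # replace an existing entry, like ln -sf
        link.symlink_to(path.name)     # relative link, same as the shell version
        created.append(link.name)
    return created
```

Calling link_unversioned(Path(os.environ["CUDA_HOME"]) / "lib") would cover the same files as the loop, and running it twice is harmless because existing links are replaced.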
Sanity check before the big build:
cat > /tmp/t.cu <<'EOF'
#include <cuda_runtime.h>
__global__ void k(){}
int main(){k<<<1,1>>>();return cudaDeviceSynchronize();}
EOF
$CUDA_HOME/bin/nvcc -arch=sm_86 -I$CUDA_HOME/include -L$CUDA_HOME/lib -lcudart /tmp/t.cu -o /tmp/t.out
# Also confirm CMake finds it:
cmake -P <(echo 'find_package(CUDAToolkit REQUIRED); message("CTK ${CUDAToolkit_VERSION}")') 2>&1
Step 6: Build vLLM
Set the build environment and go. The most important variable is TORCH_CUDA_ARCH_LIST — scope it to your GPU or you’ll compile every architecture and wait 5–10× longer.
cd ~/go/vllm
export PATH=$CUDA_HOME/bin:$PATH
export CUDACXX=$CUDA_HOME/bin/nvcc
export VLLM_TARGET_DEVICE=cuda
export TORCH_CUDA_ARCH_LIST="8.6+PTX" # A10G = sm_86
export MAX_JOBS=12 # ~2-3 GB RAM per job; tune to your box
export NVCC_THREADS=2
export CMAKE_ARGS="-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc"
pip install -v -e . --no-build-isolation
A few notes:
- --no-build-isolation is required so the build sees the torch/CUDA you installed.
- Arch warnings like DeepGEMM/FlashMLA will not compile: unsupported CUDA architecture 8.6 are expected on Ampere. Those kernels target Hopper (sm_90+) and are simply skipped.
- On 16 cores / 62 GB this took ~30–40 min and produced _C.abi3.so (~117 MB), _moe_C.abi3.so, etc.
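Scoping the arch list is mechanical once you know the compute capability, so it can be derived rather than typed. This is a small sketch, assuming the capability arrives as a string such as "8.6"; the function name and the dictionary keys other than TORCH_CUDA_ARCH_LIST are illustrative.

```python
def arch_settings(compute_cap: str, include_ptx: bool = True) -> dict:
    """Turn a compute capability like '8.6' into settings for that one GPU."""
    major, minor = compute_cap.strip().split(".")
    arch = f"{int(major)}.{int(minor)}"
    return {
        # Value to export before the build, e.g. "8.6+PTX" for an A10G.
        "TORCH_CUDA_ARCH_LIST": arch + ("+PTX" if include_ptx else ""),
        # Matching -arch value for a manual nvcc sanity compile.
        "nvcc_arch": f"sm_{int(major)}{int(minor)}",
    }
```

The input can come from nvidia-smi --query-gpu=compute_cap --format=csv,noheader, which prints just the number.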
Pitfall 6: a bundled submodule rejects your Python
Even though the top-level pyproject.toml allowed Python 3.14, the vendored flash-attention CMake had its own allow-list:
CMake Error at .deps/vllm-flash-attn-src/cmake/utils.cmake:20:
Python version (3.14) is not one of the supported versions: 3.9;3.10;3.11;3.12;3.13.
Fix — add your version to the macro (vLLM points FETCHCONTENT_BASE_DIR at .deps, so edits there persist; just don’t rm -rf .deps before rebuilding):
# .deps/vllm-flash-attn-src/cmake/utils.cmake
set(_SUPPORTED_VERSIONS_LIST ${SUPPORTED_VERSIONS} ${ARGN} "3.14")
This patch is not permanent. flash-attn is pulled via CMake FetchContent at a pinned GIT_TAG. The moment you git pull/update vLLM and that tag changes (or you rm -rf .deps), FetchContent re-clones a fresh copy and your edit is gone, so the 3.14 check fails again at the next configure. Re-apply the one-liner after any update that bumps the flash-attn tag.
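Since the edit has to be redone after every re-fetch, an idempotent patcher saves some hand-editing. The sketch below assumes the unpatched file contains a single-line set(_SUPPORTED_VERSIONS_LIST ...) call, which is inferred from the patched line shown above rather than checked against every flash-attn revision. The function name is hypothetical.

```python
import re

# Assumption: the allow-list is built by one single-line set(...) call.
_SET_LINE = re.compile(
    r"^([ \t]*set\(_SUPPORTED_VERSIONS_LIST\b[^)\n]*)\)", re.MULTILINE
)


def add_python_version(cmake_text: str, version: str = "3.14") -> str:
    """Append a quoted version to the set() call, unless it is already there."""
    quoted = f'"{version}"'

    def patch(match):
        head = match.group(1)
        if quoted in head:
            return match.group(0)      # already patched: leave it untouched
        return f"{head} {quoted})"

    return _SET_LINE.sub(patch, cmake_text, count=1)
```

You would read .deps/vllm-flash-attn-src/cmake/utils.cmake, pass its text through the function, and write the result back. If the text comes back unchanged and still lacks the version, the upstream file no longer matches the assumption and needs a manual look.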
Pitfall 7: dependency-resolver deadlock (ResolutionImpossible)
On a recent main, pip install -e . can die before compiling anything with:
ERROR: Cannot install cuda-tile[tileiras]==1.4.0, cuda-toolkit==13.0.2 and vllm
because these package versions have conflicting dependencies.
torch 2.11.0 depends on cuda-toolkit==13.0.2
cuda-tile[tileiras] 1.4.0 depends on cuda-toolkit>=13.2,<13.4
ERROR: ResolutionImpossible
Two of vLLM’s own dependencies pin incompatible CUDA-toolkit ranges (torch wants exactly 13.0.2; a newer kernel package wants ≥13.2). pip’s strict resolver refuses to proceed. This is an upstream packaging conflict, not something you caused — and it’s exactly why we aligned the toolkit to 13.3 earlier (it satisfies the ≥13.2 side, and torch runs fine against it via minor-version compat).
The fix is to build the package without re-resolving the whole graph, since you’ve already curated a working CUDA stack:
pip install -v -e . --no-build-isolation --no-deps
--no-deps compiles and installs vLLM using the environment you’ve assembled, instead of letting pip try (and fail) to reconcile every transitive pin. Afterwards, install any genuinely-missing runtime deps individually and re-run the smoke test. (Upstream’s own docs use uv, whose override/resolution model sidesteps this; with plain pip, --no-deps is the escape hatch.)
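After a --no-deps install, one way to find the genuinely missing runtime deps is to compare the package's declared requirements against what is installed. This is a simplified sketch with hypothetical helper names. It skips every requirement that carries an environment marker (anything after a ;), which is a deliberate shortcut and not how pip evaluates markers.

```python
import re


def _normalize(name: str) -> str:
    """PEP 503-style normalization: lowercase, runs of - _ . become one dash."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    """Extract the distribution name from a string like 'pkg[extra]==1.0'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return _normalize(match.group(1))


def missing_requirements(requirements: list, installed: set) -> list:
    """Unconditionally required names that are not in the installed set."""
    wanted = {requirement_name(r) for r in requirements if ";" not in r}
    have = {_normalize(name) for name in installed}
    return sorted(wanted - have)
```

The inputs can come from the standard library: importlib.metadata.requires("vllm") for the requirement strings, and the Name field of each entry in importlib.metadata.distributions() for the installed set. The check looks only at names, not versions, so a present but mismatched cuda-toolkit will not show up here.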
Pitfall 8: MAX_JOBS and parallelism
MAX_JOBS controls ninja’s parallel compile jobs. CUDA compiles use ~2–3 GB each, so MAX_JOBS × 3 GB should fit in RAM. On 62 GB you can run 16; we used 12 as a safe default. You’ll notice ninja drops to fewer jobs near the end ([267/340]) — that’s dependency ordering on the final heavy template units and the .so link, not a misconfiguration. NVCC_THREADS parallelizes within a single nvcc invocation.
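The sizing rule above is simple arithmetic, sketched here with the 3 GB per job upper estimate as the default. The function name is made up, and the result is a starting point to tune, not a guarantee against out-of-memory kills.

```python
def suggest_max_jobs(ram_gb: float, cores: int, gb_per_job: float = 3.0) -> int:
    """Largest job count where jobs * gb_per_job fits in RAM, capped at cores."""
    by_memory = int(ram_gb // gb_per_job)
    return max(1, min(cores, by_memory))
```

For the machine in this post, suggest_max_jobs(62, 16) gives 16, the ceiling mentioned above. Choosing 12 leaves headroom for the OS and for the occasional template-heavy translation unit that uses more than the estimate.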
Step 7: Verify — and the runtime-only pitfalls
A successful build does not mean inference works. vLLM’s runtime JIT-compiles more kernels on first use, which surfaces a fresh set of issues.
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", enforce_eager=True,
gpu_memory_utilization=0.5, max_model_len=512)
print(llm.generate(["The capital of France is"],
SamplingParams(temperature=0, max_tokens=20))[0].outputs[0].text)
Pitfall 9: Could not find nvcc and default cuda_home='/usr/local/cuda'
flashinfer JIT-compiles sampling kernels at runtime and needs nvcc — but at runtime nobody set CUDA_HOME, so it falls back to the nonexistent /usr/local/cuda. Because our toolkit lives in the venv, export it (and bake it into activate so it’s always present):
cat >> $VIRTUAL_ENV/bin/activate <<'EOF'
export CUDA_HOME="$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13"
export PATH="$CUDA_HOME/bin:$PATH"
EOF
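A quick pre-flight check can tell you whether a runtime process would find nvcc at all before you load a model. This is a sketch with a hypothetical helper. It looks under CUDA_HOME/bin and then on PATH, and it does not claim to reproduce flashinfer's own lookup order.

```python
import os
import shutil
from typing import Mapping, Optional


def find_nvcc(env: Mapping[str, str]) -> Optional[str]:
    """Look for nvcc under CUDA_HOME/bin first, then on PATH; None if absent."""
    cuda_home = env.get("CUDA_HOME")
    if cuda_home:
        found = shutil.which("nvcc", path=os.path.join(cuda_home, "bin"))
        if found:
            return found
    # An empty or missing PATH yields None rather than searching defaults.
    return shutil.which("nvcc", path=env.get("PATH", ""))
```

Calling find_nvcc(os.environ) in the same shell you launch vLLM from should return a path inside the venv. A None result means the activate additions have not taken effect in that shell.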
This is also where Pitfalls 5c (cccl version check) and the lib64 symlink (cannot find -lcudart) bite — they’re runtime-JIT failures, not build failures, so they only appear here. With the 13.3 alignment + the lib64 symlink in place, the JIT compile succeeds and you get:
PROMPT: 'The capital of France is'
OUTPUT: ' the capital of the French Republic...'
🎉
Step 8: Run the GPU test suite
A generate() proves the happy path; the kernel tests prove the build broadly. The suite that most directly exercises what you just compiled is tests/kernels/. Run it with CUDA_HOME set and $CUDA_HOME/bin on PATH (the tests JIT-compile too):
export CUDA_HOME="$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13"
export PATH="$CUDA_HOME/bin:$PATH"
python -m pytest tests/kernels/core tests/kernels/attention -q
On an A10G a focused subset (activation, layernorm, rotary/positional encoding, paged attention, cache) runs in ~1 hr and lands at 2402 passed, 583 skipped, 36 failed. The 583 skips are arch-gated kernels (Hopper/Blackwell sm_90+) correctly opting out. The 36 failures are all the same issue — see Pitfall 10.
Pitfall 10: FP8 KV-cache tests fail (not skip) on SM < 89
Every one of those 36 failures is test_reshape_and_cache_flash[...fp8...] with:
FP8 KV cache needs native fp8e4nv (SM89+). Use --kv-cache-dtype bfloat16 ...
The A10G is sm_86; native FP8 (fp8e4nv) needs sm_89+ (Ada/Hopper). This is a hardware limit, not a broken build — but unlike the cleanly arch-gated kernels, this Triton path asserts on unsupported hardware instead of skipping, so it counts as a failure. Deselect the FP8 cases to get a fully green run:
python -m pytest tests/kernels/attention/test_cache.py -k "not fp8" -q
# 335 passed, 403 skipped, 477 deselected, 0 failed
Takeaway: on pre-Ada GPUs, treat FP8 KV-cache test failures as expected, and gate them out with -k "not fp8" rather than chasing them.
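The gating decision can be derived from the device's compute capability instead of being remembered per machine. This is a minimal sketch that uses the SM 8.9 threshold from the error message above; both function names are illustrative.

```python
def supports_native_fp8(compute_capability: tuple) -> bool:
    """True for SM 8.9 and newer, the threshold quoted in the error message."""
    return tuple(compute_capability) >= (8, 9)


def pytest_keyword_filter(compute_capability: tuple) -> list:
    """Extra pytest arguments that deselect FP8 cases on older GPUs."""
    if supports_native_fp8(compute_capability):
        return []
    return ["-k", "not fp8"]
```

With torch installed, torch.cuda.get_device_capability() returns the (major, minor) pair to pass in. On an A10G that is (8, 6), so the filter is added.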
Appendix: every error → one-line fix
| Error | Root cause | Fix |
|---|---|---|
| nvidia-smi: command not found (assumed no GPU) | driver not installed; hardware was there | lspci \| grep nvidia to detect hardware |
| modprobe nvidia: No such device | nouveau owns the GPU | blacklist + unbind + rmmod nouveau |
| CUDA unknown error / cuInit → 999 | missing/stale UVM nodes; no nvidia-modprobe | apt install nvidia-modprobe; recreate /dev/nvidia-uvm |
| nvidia-cuda-nvcc-cu13 has no real version | wrong package name | use cuda-toolkit[nvcc] |
| ptxas Unsupported .version 9.3 | nvcc/ptxas minor mismatch | pin all CUDA pkgs to one minor |
| CMake: nvcc says 13.3 but headers say 13.0 | runtime headers ≠ nvcc | align headers to nvcc version |
| mathcalls.h: rsqrt ... incompatible | CUDA 13.0 headers vs glibc 2.43 | use CUDA ≥ 13.1 headers |
| flash-attn CMake: Python 3.14 not supported | submodule allow-list | patch utils.cmake (re-apply after any update that bumps its tag) |
| ResolutionImpossible (cuda-toolkit 13.0.2 vs ≥13.2) | conflicting CUDA pins across vLLM deps | build with pip install -e . --no-deps |
| cccl: compiler and toolkit headers incompatible | runtime downgraded after vLLM reinstall | re-pin CUDA runtime to nvcc’s minor |
| cannot find -lcudart (JIT link) | wheels use lib/, tool wants lib64/ | ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64 |
| Could not find nvcc ... /usr/local/cuda | CUDA_HOME unset at runtime | export CUDA_HOME (bake into activate) |
| FP8 KV cache needs native fp8e4nv (SM89+) (test fails) | A10G is sm_86; FP8 path asserts instead of skipping | not a build bug; deselect with -k "not fp8" |
Updating an existing checkout
Pulling a newer vLLM isn’t just git pull — an editable source build has moving parts that a pull invalidates. The sequence that works:
git fetch upstream && git reset --hard upstream/main # or your target commit
rm -rf build .deps && find vllm -name '*.abi3.so' -delete # force a clean rebuild
# re-apply the flash-attn 3.14 patch once .deps is re-fetched (after the wipe, the first configure fails the version check; patch, then re-run the install)
pip install -v -e . --no-build-isolation --no-deps # --no-deps dodges resolver conflicts
# re-pin the CUDA toolkit to 13.3 if anything got downgraded, then re-run the smoke test
Before pulling, check the gap with git diff --name-only HEAD..upstream/main | grep -E '\.cu|CMakeLists|requirements/' — if native/build files changed (they usually have), budget for a full recompile (~30–40 min) and re-verification. Also confirm the torch== pin and requires-python in pyproject.toml didn’t move; if torch’s version changed, you’re re-doing the whole CUDA/toolkit alignment, not just a rebuild.
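The same triage can be done in a script if you want the decision in a pre-pull hook. This is a sketch that reuses the grep pattern from the paragraph above; the function name is hypothetical and the input is whatever git diff --name-only printed, split into lines.

```python
import re

# Same pattern as the grep above: CUDA sources, CMake files, pinned requirements.
_NATIVE_PATTERN = re.compile(r"\.cu|CMakeLists|requirements/")


def native_changes(changed_files: list) -> list:
    """Return the changed paths that imply a recompile or a re-pin."""
    return [path for path in changed_files if _NATIVE_PATTERN.search(path)]
```

An empty result suggests a Python-only update. Like the grep, the pattern is loose (it also matches .cuh files and any path containing .cu), so it errs toward rebuilding.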
Key takeaways
- Detect hardware with lspci, not nvidia-smi. Don’t build for CPU because a tool is missing.
- nvidia-smi working ≠ CUDA working. UVM nodes + nvidia-modprobe matter.
- Pin the entire CUDA pip toolkit to one minor version. Skew fails three different ways at three different stages.
- Pick the CUDA minor that’s compatible with your glibc/compiler, then rely on CUDA minor-version compatibility for the driver/torch.
- A green build isn’t done: runtime JIT (flashinfer) needs CUDA_HOME and a couple of symlinks. Verify with a real generate().
- Scope TORCH_CUDA_ARCH_LIST to your GPU to keep build times sane.
- Some test failures are hardware limits, not build bugs. On pre-Ada GPUs the FP8 KV-cache tests assert instead of skip; deselect them with -k "not fp8".
Disclaimer: This article was generated using the Gemini 3.1 Pro model.