How JAX Shards a Computation Across a Mesh

I have eight devices. I want a matmul whose contracting dimension is split across some of them. So I write the matmul normally, say where the two operands live, and compile: mesh = Mesh(np.array(j...

Jul 30, 2026 Compilers, ML-Systems

When XLA Isn't Enough: Pallas, Mosaic, and Triton

This is the last post on the JAX/XLA stack. We went from Python to a jaxpr, from a jaxpr to optimized single-device HLO, and from single-device HLO to a mesh of devices. One door left, and it is th...

Jul 29, 2026 Compilers, ML-Systems

What JAX Traces, and What It Refuses

If you come from PyTorch, the first thing JAX does is offend you. You write a simple function with an if statement, wrap it in jax.jit, and call it. Instead of running, it throws a TracerBoolConversion...

Jul 28, 2026 Compilers, ML-Systems

Programming the TPU: What Its Open-Source Compiler Already Tells You

The TPU has less public hardware documentation than a GPU. There is no vendor ISA manual, no die-level memory-model spec; you get peak FLOPs, HBM capacity, and a block diagram. But if you write JAX...

Jul 27, 2026 Compilers, ML-Systems

XLA Up Close: The Performance Bargain and Its Rigid Price

Here is a small function and the entire program XLA compiled it into:
    def simp(x):
        y = (x + 0.0) * 1.0
        return jnp.transpose(jnp.transpose(y))
    ENTRY %main (x: f32[3,3]) -> f32[3,3] { ...

Jul 25, 2026 Compilers, ML-Systems

A Tour of XLA: The Compiler Beneath JAX, TensorFlow, and PyTorch

If you train or serve models, XLA is probably compiling them, whether or not you have ever named it. It is the compiler behind JAX, behind TensorFlow, and — through PyTorch/XLA — behind much of PyT...

Jul 23, 2026 Systems, Compilers

What torch.compile Sees, and What It's Blind To

The first time I put torch.compile in front of an LLM inference server, it spent the better part of a minute compiling and handed back a 6% speedup. I nearly wrote the feature off. The next morning...

Jul 21, 2026 Compilers, ML-Systems

Triton: The Compiler That Pretends to Be a Library

Triton is a compiler with a Python frontend. The @triton.jit decorator does not decorate a function. It parses the function’s AST, runs it through an MLIR pipeline, and emits a GPU binary. The Pyth...

Jul 19, 2026 Compilers, ML-Systems

Anatomy of a CUDA Binary

When you compile a CUDA kernel, the final artifact is a cubin — a CUDA binary. It is a standard ELF64 file with NVIDIA-specific sections that encode everything the CUDA driver needs to load and lau...

Jul 14, 2026 Systems, GPU Architecture

Where Should Your Code Live?

A survey of code repository hosting in 2026. Most developers do not think about where their Git repository lives until something goes wrong. An account gets suspended. A DMCA notice takes down a pr...

Jul 13, 2026 Systems, Infrastructure