ForgeTrain vs FlashAttention-3 / 4
Forward-pass attention benchmark on H100 PCIe
/fa-bench · bf16 · causal · D=128
NVIDIA H100 PCIe
Compute ceilings, memory hierarchy, and Hopper features (wgmma, TMA, cluster DSM) on an H100 PCIe, each paired with an Nsight saturation proof, a measured-vs-datasheet comparison, and a GEMM roofline.
/h100-pcie · Hopper · sm_90 · 4 cards × 3 rounds
NVIDIA GeForce RTX 4090
The same microbench suite on a consumer Ada (sm_89) card as a reference point, where the portable mma.sync path is the tensor ceiling and FP32-accumulate throughput is halved.
/rtx-4090 · Ada · sm_89 · consumer reference
NVIDIA B200
Blackwell (sm_100): the tcgen05 + TMEM tensor path, measured, with the single-SM microbench floor framed against cuBLAS at 96% of the dense datasheet figure.
/b200 · Blackwell · sm_100 · first run