meshy

Notes & experiments

Working notes on GPU kernels, training systems, and related experiments.

Entries

ForgeTrain vs FlashAttention-3 / 4→

Forward-pass attention benchmark on H100 PCIe

/fa-bench · bf16 · causal · D=128

NVIDIA H100 PCIe→

Compute ceilings, memory hierarchy, and Hopper features (wgmma, TMA, cluster DSM) on an H100 PCIe — each paired with Nsight saturation proof, a measured-vs-datasheet comparison, and a GEMM roofline.

/h100-pcie · Hopper · sm_90 · 4 cards × 3 rounds

NVIDIA GeForce RTX 4090→

The same microbench suite on a consumer Ada (sm_89) card as a reference point, where the portable mma.sync path itself is the tensor ceiling and FP32-accumulate tensor throughput is halved.

/rtx-4090 · Ada · sm_89 · consumer reference

NVIDIA B200→

Blackwell (sm_100): the tcgen05 + TMEM tensor path measured, with the single-SM microbench floor framed against cuBLAS at 96% of the dense datasheet peak.

/b200 · Blackwell · sm_100 · first run