DeepGEMM
DeepGEMM is a library designed for clean and efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, as proposed in DeepSeek-V3. It supports both normal and Mix-of-Experts (MoE) grouped GEMMs. Written in CUDA, the library requires no compilation during installation, compiling all kernels at runtime using a lightweight Just-In-Time (JIT) module.
Currently, DeepGEMM exclusively supports NVIDIA Hopper tensor cores. To address the imprecise FP8 tensor core accumulation, it employs CUDA-core two-level accumulation (promotion). While it borrows some concepts from CUTLASS and CuTe, it avoids heavy reliance on their templates or algebras. Instead, the library is designed for simplicity, with only one core kernel function comprising around ~300 lines of code. This makes it a clean and accessible resource for learning Hopper FP8 matrix multiplication and optimization techniques.
Despite its lightweight design, DeepGEMM's performance matches or exceeds expert-tuned libraries across various matrix shapes.
Performance
We test all shapes potentially used in DeepSeek-V3/R1 inference (including both prefilling and decoding, but without tensor parallelism) on H800 with NVCC 12.8. All speedup metrics are calculated in comparison with our internally and carefully optimized implementation based on CUTLASS 3.6.
DeepGEMM does not behave very well on some shapes; optimization PRs are welcome if you are interested.
Normal GEMMs for dense models
M | N | K | Computation | Memory bandwidth | Speedup |
---|---|---|---|---|---|
64 | 2112 | 7168 | 206 TFLOPS | 1688 GB/s | 2.7 x |
64 | 24576 | 1536 | 289 TFLOPS | 2455 GB/s | 1.7 x |
64 | 32768 | 512 | 219 TFLOPS | 2143 GB/s | 1.8 x |
64 | 7168 | 16384 | 336 TFLOPS | 2668 GB/s | 1.4 x |
64 | 4096 | 7168 | 287 TFLOPS | 2320 GB/s | 1.4 x |
64 | 7168 | 2048 | 295 TFLOPS | 2470 GB/s | 1.7 x |
128 | 2112 | 7168 | 352 TFLOPS | 1509 GB/s | 2.4 x |
128 | 24576 | 1536 | 535 TFLOPS | 2448 GB/s | 1.6 x |
128 | 32768 | 512 | 358 TFLOPS | 2103 GB/s | 1.5 x |
128 | 7168 | 16384 | 645 TFLOPS | 2604 GB/s | 1.4 x |
128 | 4096 | 7168 | 533 TFLOPS | 2221 GB/s | 2.0 x |
128 | 7168 | 2048 | 510 TFLOPS | 2277 GB/s | 1.7 x |
4096 | 2112 | 7168 | 1058 TFLOPS | 527 GB/s | 1.1 x |
4096 | 24576 | 1536 | 990 TFLOPS | 786 GB/s | 1.0 x |
4096 | 32768 | 512 | 590 TFLOPS | 1232 GB/s | 1.0 x |
4096 | 7168 | 16384 | 1358 TFLOPS | 343 GB/s | 1.2 x |
4096 | 4096 | 7168 | 1304 TFLOPS | 500 GB/s | 1.1 x |
4096 | 7168 | 2048 | 1025 TFLOPS | 697 GB/s | 1.1 x |
Grouped GEMMs for MoE models (contiguous layout)
#Groups | M per group | N | K | Computation | Memory bandwidth | Speedup |
---|---|---|---|---|---|---|
4 | 8192 | 4096 | 7168 | 1297 TFLOPS | 418 GB/s | 1.2 x |
4 | 8192 | 7168 | 2048 | 1099 TFLOPS | 681 GB/s | 1.2 x |
8 | 4096 | 4096 | 7168 | 1288 TFLOPS | 494 GB/s | 1.2 x |
8 | 4096 | 7168 | 2048 | 1093 TFLOPS | 743 GB/s | 1.1 x |
Grouped GEMMs for MoE models (masked layout)
#Groups | M per group | N | K | Computation | Memory bandwidth | Speedup |
---|---|---|---|---|---|---|
1 | 1024 | 4096 | 7168 | 1233 TFLOPS | 924 GB/s | 1.2 x |
1 | 1024 | 7168 | 2048 | 925 TFLOPS | 968 GB/s | 1.2 x |
2 | 512 | 4096 | 7168 | 1040 TFLOPS | 1288 GB/s | 1.2 x |
2 | 512 | 7168 | 2048 | 916 TFLOPS | 1405 GB/s | 1.2 x |
4 | 256 | 4096 | 7168 | 932 TFLOPS | 2064 GB/s | 1.1 x |
4 | 256 | 7168 | 2048 | 815 TFLOPS | 2047 GB/s | 1.2 x |
Quick start
Requirements
- Hopper architecture GPUs, `sm_90a` must be supported
- Python 3.8 or above
- CUDA 12.3 or above
  - We highly recommend 12.8 or above for the best performance
- PyTorch 2.1 or above
- CUTLASS 3.6 or above (could be cloned by Git submodule)
Development
```bash
# Submodule must be cloned
git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git

# Make symbolic links for third-party (CUTLASS and CuTe) include directories
python setup.py develop

# Test JIT compilation
python tests/test_jit.py

# Test all GEMM implementations (normal, contiguous-grouped and masked-grouped)
python tests/test_core.py
```
Installation
Import `deep_gemm` in your Python project, and enjoy!
Interfaces
Notices
This library exclusively contains GEMM kernels. It requires the LHS scaling factor to be TMA-aligned and transposed, and it only supports the NT format (non-transposed LHS and transposed RHS). For transposition or other FP8 casting operations, please implement them or fuse them into prior kernels yourself. While the library provides some simple PyTorch utility functions, these may result in slower performance; our primary focus is on optimizing the GEMM kernels themselves.
Normal dense GEMMs (non-grouped)
To perform a basic non-grouped FP8 GEMM, call the `deep_gemm.gemm_fp8_fp8_bf16_nt` function. For more details, please refer to the function documentation.
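As a rough sketch of the call shape (not taken from the repository's tests; the tensor construction and quantization below are simplified placeholders, and the exact scaling-factor layout should be checked against the function documentation), a non-grouped call might look like:

```python
import torch
import deep_gemm

# Sketch only: D (m x n, BF16) = A (m x k, FP8) @ B^T (n x k, FP8), NT format.
m, n, k = 128, 4096, 7168

# Placeholder FP8 operands and per-128-channel scaling factors; a real caller
# would produce these with a proper fine-grained quantization step.
lhs = torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn)
rhs = torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn)
lhs_scales = torch.ones(m, k // 128, device='cuda', dtype=torch.float32)
rhs_scales = torch.ones(n // 128, k // 128, device='cuda', dtype=torch.float32)

# As stated in the notice, the LHS scaling factors must be TMA-aligned and
# transposed; the utility below is assumed to produce that layout.
lhs_scales = deep_gemm.get_col_major_tma_aligned_tensor(lhs_scales)

out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)
deep_gemm.gemm_fp8_fp8_bf16_nt((lhs, lhs_scales), (rhs, rhs_scales), out)
```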
Grouped GEMMs (contiguous layout)
Unlike traditional grouped GEMMs in CUTLASS, DeepGEMM groups only the M-axis, while N and K must remain fixed. This design is tailored for scenarios where experts in an MoE model share the same shape.
For training forward passes or inference prefilling, where each expert may process a varying number of tokens, we concatenate these tokens into a single tensor, referred to as the "contiguous" layout. Note that each expert segment must be aligned to the GEMM M block size (`get_m_alignment_for_contiguous_layout()`).
For more information, please refer to the `m_grouped_gemm_fp8_fp8_bf16_nt_contiguous` function documentation.
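As a small illustration of the alignment requirement (the per-expert token counts below are hypothetical; the alignment value simply comes from the documented utility):

```python
import deep_gemm

# Group-level alignment for the contiguous layout (the GEMM M block size).
alignment = deep_gemm.get_m_alignment_for_contiguous_layout()

# Hypothetical token counts routed to three experts during prefilling.
tokens_per_expert = [300, 52, 710]

# Pad each expert's segment up to a multiple of the alignment so that no
# M block in the concatenated tensor spans two experts.
padded = [(t + alignment - 1) // alignment * alignment for t in tokens_per_expert]
total_m = sum(padded)
print(alignment, padded, total_m)  # e.g. with an alignment of 128: [384, 128, 768], 1280
```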
Grouped GEMMs (masked layout)
During the inference decoding phase, when CUDA graph is enabled and the CPU is unaware of the number of tokens each expert receives, we support masked grouped GEMMs. By providing a mask tensor, the kernel computes only the valid portions.
Use `m_grouped_gemm_fp8_fp8_bf16_nt_masked` for this purpose and consult the relevant documentation. An example usage is to take the output of low-latency kernels from DeepEP as input.
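A hedged sketch of the masked call follows; the per-group operand layout, the per-group row-count tensor, and the final expected-M hint argument are assumptions based on the function name and the description above, so the real signature should be taken from the documentation:

```python
import torch
import deep_gemm

num_groups, max_m, n, k = 4, 256, 4096, 7168

# Number of valid rows per expert, known only on the GPU when CUDA graphs are
# used; the kernel computes only these valid portions.
masked_m = torch.tensor([130, 7, 250, 0], device='cuda', dtype=torch.int32)

# Placeholder FP8 operands and scaling factors, laid out per group (proper
# quantization and TMA alignment of the LHS scales are omitted for brevity;
# see the notice above).
lhs = torch.randn(num_groups, max_m, k, device='cuda').to(torch.float8_e4m3fn)
lhs_scales = torch.ones(num_groups, max_m, k // 128, device='cuda', dtype=torch.float32)
rhs = torch.randn(num_groups, n, k, device='cuda').to(torch.float8_e4m3fn)
rhs_scales = torch.ones(num_groups, n // 128, k // 128, device='cuda', dtype=torch.float32)
out = torch.empty(num_groups, max_m, n, device='cuda', dtype=torch.bfloat16)

# The last argument is assumed to be a CPU-side hint of the typical number of
# valid rows per group.
deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_masked(
    (lhs, lhs_scales), (rhs, rhs_scales), out, masked_m, 128)
```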
Utilities
The library provides some utility functions besides the above kernels (a brief usage sketch follows the list):
- `deep_gemm.set_num_sms`: set the maximum SM count to use
- `deep_gemm.get_num_sms`: get the current maximum SM count
- `deep_gemm.get_m_alignment_for_contiguous_layout`: get the group-level alignment requirement for the grouped contiguous layout
- `deep_gemm.get_tma_aligned_size`: get the required TMA alignment size
- `deep_gemm.get_col_major_tma_aligned_tensor`: get a column-major TMA-aligned tensor
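A brief sketch of how these utilities fit together (the SM count of 100 and the tensor shape are arbitrary illustrations):

```python
import torch
import deep_gemm

# Cap the number of SMs the GEMM kernels may occupy, e.g. to leave room for
# concurrent communication kernels; 100 is an arbitrary example value.
deep_gemm.set_num_sms(100)
assert deep_gemm.get_num_sms() == 100

# Convert a scaling-factor tensor into the column-major TMA-aligned form that
# the kernels expect for LHS scales (see the notice in the Interfaces section).
scales = torch.ones(128, 7168 // 128, device='cuda', dtype=torch.float32)
aligned_scales = deep_gemm.get_col_major_tma_aligned_tensor(scales)
```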
The library also provides some environment variables, which may be useful (a sketch of setting them from Python follows the list):
- `DG_CACHE_DIR`: string, the cache directory to store compiled kernels, `$HOME/.deep_gemm` by default
- `DG_NVCC_COMPILER`: string, specified NVCC compiler path; will be found in `torch.utils.cpp_extension.CUDA_HOME` by default
- `DG_DISABLE_FFMA_INTERLEAVE`: 0 or 1, disable the FFMA-interleaving optimization
- `DG_PTXAS_VERBOSE`: 0 or 1, show detailed PTXAS compiler output
- `DG_PRINT_REG_REUSE`: 0 or 1, print FFMA-interleaving details
- `DG_JIT_PRINT_NVCC_COMMAND`: 0 or 1, print the NVCC compilation command
- `DG_JIT_DEBUG`: 0 or 1, print more debugging information
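For instance, the variables can be set from Python before the first kernel is JIT-compiled (a sketch; setting them in the shell works just as well, and the cache path here is only an example):

```python
import os

# These must be set before the JIT module compiles the first kernel.
os.environ["DG_CACHE_DIR"] = "/tmp/deep_gemm_cache"    # example cache directory
os.environ["DG_JIT_PRINT_NVCC_COMMAND"] = "1"          # show the NVCC command line
os.environ["DG_PTXAS_VERBOSE"] = "1"                   # show detailed PTXAS output

import deep_gemm  # subsequent kernel compilations honor the settings above
```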
For additional examples and details, please refer to the test code or review the corresponding Python documentation.
Optimizations
Below we highlight our key optimization techniques, some of which are not found in CUTLASS.
Persistent warp-specialization
Following the CUTLASS design, the kernels in DeepGEMM are warp-specialized, enabling overlapping data movement, tensor-core MMA instructions, and CUDA-core promotion (a simplified figure illustrating this process is included in the repository).
Hopper TMA features
The Tensor Memory Accelerator (TMA) is a new hardware feature introduced by the Hopper architecture, designed for faster and asynchronous data movement. Specifically, we utilize TMA for:
- TMA load for LHS, LHS scaling factors, and RHS matrices
- TMA store for the output matrix
- TMA multicast (exclusive to the LHS matrix)
- TMA descriptor prefetching
Common detail optimizations
- Use of the `stmatrix` PTX instruction
- Register count control tailored for different warpgroups
- Overlapping as much as possible, e.g. overlapping TMA store and non-TMA RHS scaling-factor load
A unified and optimized block scheduler
- One scheduler for all non-grouped and grouped kernels
- Rasterization to enhance L2 cache reuse
Fully JIT design
DeepGEMM employs a fully Just-In-Time (JIT) design, with no compilation required at installation. All kernels are compiled at runtime using a lightweight JIT implementation. This approach offers several advantages:
- GEMM shapes, block sizes, and the number of pipeline stages are treated as compile-time constants
  - Saving registers
  - Compilers may do more optimizations
- Automatic selection of block sizes, number of warpgroups, optimal pipeline stages, and TMA cluster size
  - Without auto-tuning, the optimal configuration is deterministically selected
- Full unrolling of the MMA pipelines, providing compilers with more optimization opportunities
  - Very important for small shapes
  - Refer to `launch_k_iterations` in the kernel file for details
Overall, JIT significantly improves performance for small shapes, similar to the approach of the Triton compiler.
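To make the idea concrete, here is a hypothetical sketch (not DeepGEMM's actual JIT module) of how runtime code generation can bake shapes, block sizes, and stage counts into the generated source as compile-time constants:

```python
# Hypothetical illustration only: substitute the runtime-known problem into the
# kernel source so NVCC sees every shape, block size, and stage count as a
# compile-time constant, enabling full unrolling and tight register planning.
KERNEL_TEMPLATE = """
template <int kN, int kK, int kBlockM, int kBlockN, int kNumStages>
__global__ void fp8_gemm_kernel(/* ... */);

// Explicit instantiation for this exact problem
template __global__ void fp8_gemm_kernel<{n}, {k}, {block_m}, {block_n}, {stages}>(/* ... */);
"""

def build_kernel_source(n: int, k: int, block_m: int, block_n: int, stages: int) -> str:
    return KERNEL_TEMPLATE.format(n=n, k=k, block_m=block_m, block_n=block_n, stages=stages)

# The generated source would then be compiled by NVCC at runtime and cached.
print(build_kernel_source(n=4096, k=7168, block_m=128, block_n=128, stages=5))
```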
Unaligned block sizes
For certain shapes, block sizes aligned to powers of 2 can lead to underutilized SMs. For instance, with `M=256, N=7168`, a typical block size assignment of `BLOCK_M=128, BLOCK_N=128` results in only `(256 / 128) * (7168 / 128) = 112` out of 132 SMs being utilized. To address this, we support unaligned block sizes such as 112, enabling `(256 / 128) * (7168 / 112) = 128` SMs to work in such scenarios. Implementing this technique alongside fine-grained scaling requires careful optimization, but it ultimately delivers performance gains.
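The SM-utilization arithmetic in the example above can be checked in a few lines (an illustrative calculation that assumes one output tile keeps one SM busy per wave):

```python
import math

def busy_sms(m: int, n: int, block_m: int, block_n: int, num_sms: int = 132) -> int:
    # Number of output tiles in one wave, capped by the GPU's SM count.
    tiles = math.ceil(m / block_m) * math.ceil(n / block_n)
    return min(tiles, num_sms)

print(busy_sms(256, 7168, 128, 128))  # 112 of 132 SMs busy
print(busy_sms(256, 7168, 128, 112))  # 128 of 132 SMs busy
```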
FFMA SASS interleaving
We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we find that one bit in a series of `FADD` instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls `yield`, which may enhance warp-level parallelism (just a guess: yielding the current warp and letting other warps work).
To leverage this, we developed a similar script to modify the `FFMA` instructions in the compiled binary. Besides modifying the `yield` bit, we also flip the `reuse` bit (registers cannot be reused if the warp is yielded). This adjustment improves performance (10%+ in some cases) for fine-grained scaling FP8 GEMMs by creating more opportunities to overlap MMA instructions with promotion `FFMA` instructions.
Acknowledgement
DeepGEMM is inspired by the CUTLASS project. Thanks and respect to the developers!
License
This code repository is released under the MIT License.
Citation
```bibtex
@misc{deepgemm2025,
  title={DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling},
  author={Chenggang Zhao and Liang Zhao and Jiashi Li and Zhean Xu},
  year={2025},
  publisher={GitHub},
  howpublished={\url{https://github.com/deepseek-ai/DeepGEMM}},
}
```