DeepSeek Open Sources DeepGEMM: Clean and Efficient FP8 GEMM Kernels


DeepGEMM

DeepGEMM is a library designed for clean and efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, as proposed in DeepSeek-V3. It supports both normal and Mix-of-Experts (MoE) grouped GEMMs. Written in CUDA, the library has no compilation requirement during installation, as all kernels are compiled at runtime using a lightweight Just-In-Time (JIT) module.

Currently, DeepGEMM exclusively supports NVIDIA Hopper tensor cores. To address the imprecise FP8 tensor core accumulation, it employs CUDA-core two-level accumulation (promotion). While it leverages some concepts from CUTLASS and CuTe, it avoids heavy reliance on their templates or algebras. Instead, the library is designed for simplicity, with only one core kernel function comprising around ~300 lines of code. This makes it a clean and accessible resource for learning Hopper FP8 matrix multiplication and optimization techniques.

Despite its lightweight design, DeepGEMM's performance matches or exceeds expert-tuned libraries across various matrix shapes.

Performance

We test all shapes potentially used in DeepSeek-V3/R1 inference (including both prefilling and decoding, but without tensor parallelism) on H800 with NVCC 12.8. All speedup metrics are calculated against our internally and carefully optimized implementation based on CUTLASS 3.6.

DeepGEMM does not behave very well on some shapes; optimization PRs are welcome if you are interested.

Normal GEMMs for dense models

M N K Computation Memory bandwidth Speedup
64 2112 7168 206 TFLOPS 1688 GB/s 2.7 x
64 24576 1536 289 TFLOPS 2455 GB/s 1.7 x
64 32768 512 219 TFLOPS 2143 GB/s 1.8 x
64 7168 16384 336 TFLOPS 2668 GB/s 1.4 x
64 4096 7168 287 TFLOPS 2320 GB/s 1.4 x
64 7168 2048 295 TFLOPS 2470 GB/s 1.7 x
128 2112 7168 352 TFLOPS 1509 GB/s 2.4 x
128 24576 1536 535 TFLOPS 2448 GB/s 1.6 x
128 32768 512 358 TFLOPS 2103 GB/s 1.5 x
128 7168 16384 645 TFLOPS 2604 GB/s 1.4 x
128 4096 7168 533 TFLOPS 2221 GB/s 2.0 x
128 7168 2048 510 TFLOPS 2277 GB/s 1.7 x
4096 2112 7168 1058 TFLOPS 527 GB/s 1.1 x
4096 24576 1536 990 TFLOPS 786 GB/s 1.0 x
4096 32768 512 590 TFLOPS 1232 GB/s 1.0 x
4096 7168 16384 1358 TFLOPS 343 GB/s 1.2 x
4096 4096 7168 1304 TFLOPS 500 GB/s 1.1 x
4096 7168 2048 1025 TFLOPS 697 GB/s 1.1 x

Grouped GEMMs for MoE models (contiguous layout)

#Groups M per group N K Computation Memory bandwidth Speedup
4 8192 4096 7168 1297 TFLOPS 418 GB/s 1.2 x
4 8192 7168 2048 1099 TFLOPS 681 GB/s 1.2 x
8 4096 4096 7168 1288 TFLOPS 494 GB/s 1.2 x
8 4096 7168 2048 1093 TFLOPS 743 GB/s 1.1 x

Grouped GEMMs for MoE models (masked layout)

#Groups M per group N K Computation Memory bandwidth Speedup
1 1024 4096 7168 1233 TFLOPS 924 GB/s 1.2 x
1 1024 7168 2048 925 TFLOPS 968 GB/s 1.2 x
2 512 4096 7168 1040 TFLOPS 1288 GB/s 1.2 x
2 512 7168 2048 916 TFLOPS 1405 GB/s 1.2 x
4 256 4096 7168 932 TFLOPS 2064 GB/s 1.1 x
4 256 7168 2048 815 TFLOPS 2047 GB/s 1.2 x

Quick start

Requirements

  • Hopper architecture GPUs, sm_90a must be supported
  • Python 3.8 or above
  • CUDA 12.3 or above
    • We highly recommend 12.8 or above for the best performance
  • PyTorch 2.1 or above
  • CUTLASS 3.6 or above (might be cloned by Git submodule)

Development

# Submodule must be cloned
git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git

# Make symbolic links for third-party (CUTLASS and CuTe) include directories
python setup.py develop

# Test JIT compilation
python tests/test_jit.py

# Test all GEMM implementations (normal, contiguous-grouped and masked-grouped)
python tests/test_core.py

Installation

Import deep_gemm in your Python project, and enjoy!
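
The snippet below is a minimal usage sketch, assuming the setup.py-based build shown in the Development section above (or an equivalent `python setup.py install`); get_num_sms is one of the utility functions listed further below.

import deep_gemm

# Quick sanity check after installation: query the maximum SM count the
# library will use (see the Utilities section below).
print(deep_gemm.get_num_sms())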

Interfaces

Notices

This library exclusively contains GEMM kernels. It requires the LHS scaling factor to be TMA-aligned and transposed, and it only supports the NT format (non-transposed LHS and transposed RHS). For transposition or other FP8 casting operations, please implement or fuse them into preceding kernels independently. While the library provides some simple PyTorch utility functions, these may result in slower performance; our primary focus is on optimizing the GEMM kernels themselves.
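
Below is a minimal sketch of such a preceding casting step, modeled loosely on the repository's test utilities rather than a definitive recipe: the 1x128 tile granularity for the LHS and the 448.0 FP8-E4M3 maximum are assumptions, and the scaling factors are made TMA-aligned and transposed with the library helper listed under Utilities.

import torch
import deep_gemm

def per_token_cast_to_fp8(x: torch.Tensor):
    # Illustrative fine-grained cast: one FP32 scaling factor per 1x128 tile,
    # assuming FP8 E4M3 with a representable maximum of 448.0.
    m, n = x.shape
    assert n % 128 == 0
    x_view = x.view(m, -1, 128)
    amax = x_view.abs().float().amax(dim=2).clamp(1e-4)
    x_fp8 = (x_view * (448.0 / amax.unsqueeze(2))).to(torch.float8_e4m3fn)
    return x_fp8.view(m, n), amax / 448.0

lhs = torch.randn(128, 7168, device='cuda', dtype=torch.bfloat16)
lhs_fp8, lhs_scales = per_token_cast_to_fp8(lhs)
# The LHS scaling factors must be TMA-aligned and transposed:
lhs_scales = deep_gemm.get_col_major_tma_aligned_tensor(lhs_scales)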

Normal dense GEMMs (non-grouped)

To perform a basic non-grouped FP8 GEMM, call the deep_gemm.gemm_fp8_fp8_bf16_nt function. For more details, please refer to the function documentation.
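
A minimal call sketch, assuming the interface used in the repository's tests: the LHS and RHS are each passed as an (FP8 tensor, FP32 scaling-factor tensor) pair together with a preallocated BF16 output, with the RHS stored transposed (NT format). The shapes and scale granularities below are assumptions for illustration; consult the function documentation for the authoritative signature.

import torch
import deep_gemm

m, k, n = 128, 7168, 4096
x_fp8 = torch.empty(m, k, device='cuda', dtype=torch.float8_e4m3fn)
x_scales = torch.empty(m, k // 128, device='cuda', dtype=torch.float32)
y_fp8 = torch.empty(n, k, device='cuda', dtype=torch.float8_e4m3fn)   # RHS is transposed (NT)
y_scales = torch.empty(n // 128, k // 128, device='cuda', dtype=torch.float32)
out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)

# LHS scaling factors must be TMA-aligned and transposed (see Notices above).
x_scales = deep_gemm.get_col_major_tma_aligned_tensor(x_scales)
deep_gemm.gemm_fp8_fp8_bf16_nt((x_fp8, x_scales), (y_fp8, y_scales), out)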

Grouped GEMMs (contiguous layout)

Unlike traditional grouped GEMMs in CUTLASS, DeepGEMM groups only the M-axis, while N and K must remain fixed. This design is tailored for scenarios where experts in an MoE model share the same shape.

For training forward passes or inference prefilling, where each expert may process a varying number of tokens, we concatenate these tokens into a single tensor, referred to as the "contiguous" layout. Note that each expert segment must be aligned to the GEMM M block size (get_m_alignment_for_contiguous_layout()).

For more information, please refer to the m_grouped_gemm_fp8_fp8_bf16_nt_contiguous function documentation.
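
A sketch of the contiguous-layout call under the same assumptions as above; the per-row group-index tensor (m_indices here) is an assumed bookkeeping input for illustration, and in practice every expert segment must be padded to the alignment returned by get_m_alignment_for_contiguous_layout().

import torch
import deep_gemm

num_groups, m_per_group, k, n = 4, 8192, 7168, 4096
m_total = num_groups * m_per_group   # already a multiple of the required alignment here

x_fp8 = torch.empty(m_total, k, device='cuda', dtype=torch.float8_e4m3fn)
x_scales = deep_gemm.get_col_major_tma_aligned_tensor(
    torch.empty(m_total, k // 128, device='cuda', dtype=torch.float32))
y_fp8 = torch.empty(num_groups, n, k, device='cuda', dtype=torch.float8_e4m3fn)
y_scales = torch.empty(num_groups, n // 128, k // 128, device='cuda', dtype=torch.float32)
out = torch.empty(m_total, n, device='cuda', dtype=torch.bfloat16)

# Assumed: map each row of the concatenated LHS to its expert/group index.
m_indices = torch.arange(num_groups, device='cuda', dtype=torch.int32).repeat_interleave(m_per_group)

deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(
    (x_fp8, x_scales), (y_fp8, y_scales), out, m_indices)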

Grouped GEMMs (masked layout)

During the inference decoding phase, when CUDA graph is enabled and the CPU is unaware of the number of tokens each expert receives, we support masked grouped GEMMs. By providing a mask tensor, the kernel computes only the valid portions.

Use m_grouped_gemm_fp8_fp8_bf16_nt_masked for this purpose and consult the relevant documentation. An example use is to take the output of low-latency kernels from DeepEP as input.
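
A sketch under the same assumptions, with masked_m holding the per-group count of valid rows (e.g. taken from DeepEP's low-latency output) and expected_m an assumed scheduling hint; the exact argument names and tensor layout are assumptions, so please verify against the documentation.

import torch
import deep_gemm

num_groups, max_m, k, n = 4, 256, 7168, 4096
x_fp8 = torch.empty(num_groups, max_m, k, device='cuda', dtype=torch.float8_e4m3fn)
x_scales = deep_gemm.get_col_major_tma_aligned_tensor(
    torch.empty(num_groups, max_m, k // 128, device='cuda', dtype=torch.float32))
y_fp8 = torch.empty(num_groups, n, k, device='cuda', dtype=torch.float8_e4m3fn)
y_scales = torch.empty(num_groups, n // 128, k // 128, device='cuda', dtype=torch.float32)
out = torch.empty(num_groups, max_m, n, device='cuda', dtype=torch.bfloat16)

masked_m = torch.full((num_groups,), 128, device='cuda', dtype=torch.int32)  # valid rows per group
expected_m = 128  # assumed hint for scheduling

deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_masked(
    (x_fp8, x_scales), (y_fp8, y_scales), out, masked_m, expected_m)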

Utilities

The library provides some utility functions besides the above kernels (a usage sketch follows the list below):

  • deep_gemm.set_num_sms: set the maximum SM count to use
  • deep_gemm.get_num_sms: get the current maximum SM count
  • deep_gemm.get_m_alignment_for_contiguous_layout: get the group-level alignment requirement for the grouped contiguous layout
  • deep_gemm.get_tma_aligned_size: get the required TMA alignment size
  • deep_gemm.get_col_major_tma_aligned_tensor: get a column-major TMA-aligned tensor
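
A short usage sketch of these helpers; the SM cap value and tensor shapes are arbitrary examples.

import torch
import deep_gemm

deep_gemm.set_num_sms(112)            # e.g. cap the SMs used by the GEMM kernels
print(deep_gemm.get_num_sms())

print(deep_gemm.get_m_alignment_for_contiguous_layout())

scales = torch.randn(128, 56, device='cuda', dtype=torch.float32)
scales = deep_gemm.get_col_major_tma_aligned_tensor(scales)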

The library also provides some environment variables, which may be useful (a sketch of setting them follows the list below):

  • DG_CACHE_DIR: string, the cache directory to store compiled kernels, $HOME/.deep_gemm by default
  • DG_NVCC_COMPILER: string, specified NVCC compiler path; by default one is found from torch.utils.cpp_extension.CUDA_HOME
  • DG_DISABLE_FFMA_INTERLEAVE: 0 or 1, disable the FFMA-interleaving optimization
  • DG_PTXAS_VERBOSE: 0 or 1, show detailed PTXAS compiler output
  • DG_PRINT_REG_REUSE: 0 or 1, print FFMA-interleaving details
  • DG_JIT_PRINT_NVCC_COMMAND: 0 or 1, print the NVCC compilation command
  • DG_JIT_DEBUG: 0 or 1, print more debugging information
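
A sketch of setting these from Python; setting them before importing deep_gemm (and thus before the first JIT compilation) is an assumption about when they are read.

import os

os.environ['DG_CACHE_DIR'] = '/tmp/deep_gemm_cache'
os.environ['DG_JIT_PRINT_NVCC_COMMAND'] = '1'
os.environ['DG_JIT_DEBUG'] = '0'

import deep_gemm  # the first compiled kernel should honor the settings above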

For additional examples and details, please refer to the test code or review the corresponding Python documentation.

Optimizations

Below we highlight the techniques that are not covered in CUTLASS.

Persistent warp-specialization

Following the CUTLASS design, the kernels in DeepGEMM are warp-specialized, enabling overlapping of data movement, tensor-core MMA instructions, and CUDA-core promotion. A simplified figure illustrating this process is shown below:

[Figure: warp-specialization pipeline diagram]

Hopper TMA features

The Tensor Memory Accelerator (TMA) is a new hardware feature introduced by the Hopper architecture, designed for faster and asynchronous data movement. Specifically, we utilize TMA for:

  • TMA load for the LHS, LHS scaling factors, and RHS matrices
  • TMA store for the output matrix
  • TMA multicast (exclusive to the LHS matrix)
  • TMA descriptor prefetching

Common detail optimizations

  • Use of the stmatrix PTX instruction
  • Register count control tailored for different warpgroups
  • Overlapping as much as possible, e.g. overlapping TMA store and non-TMA RHS scaling factor load

A unified and optimized block scheduler

Fully JIT design

DeepGEMM employs a fully Just-In-Time (JIT) design, with no compilation required at installation. All kernels are compiled at runtime using a lightweight JIT implementation. This approach offers several advantages:

  • GEMM shapes, block sizes, and the number of pipeline stages are treated as compile-time constants
    • Saving registers
    • Compilers may perform more optimizations
  • Automatic selection of block sizes, number of warpgroups, optimal pipeline stages, and TMA cluster size
    • Without auto-tuning, the optimal configuration is deterministically selected
  • Full unrolling of the MMA pipelines, providing compilers with more optimization opportunities
    • Very important for small shapes
    • Refer to launch_k_iterations in the kernel file for details

Overall, JIT significantly improves performance for small shapes, similar to the approach of the Triton compiler.
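
The sketch below is purely illustrative of the compile-time-constant idea (it is not DeepGEMM's actual JIT module): shape and tiling parameters are rendered into the kernel source as constants, and the compiled binary would be cached under a key derived from them.

import hashlib

def render_kernel_source(n: int, k: int, block_m: int, block_n: int, num_stages: int) -> str:
    # Bake GEMM shape, block sizes, and pipeline stage count into the source,
    # so the compiler sees them as compile-time constants.
    return (f"constexpr int N = {n}, K = {k};\n"
            f"constexpr int BLOCK_M = {block_m}, BLOCK_N = {block_n};\n"
            f"constexpr int kNumStages = {num_stages};\n"
            "// ... kernel body instantiated with these constants ...\n")

source = render_kernel_source(n=4096, k=7168, block_m=128, block_n=112, num_stages=5)
cache_key = hashlib.md5(source.encode()).hexdigest()
print(cache_key)  # a JIT would invoke NVCC once per key and reuse the cached binary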

Unaligned block sizes

For certain shapes, block sizes aligned to powers of 2 can lead to underutilized SMs. For instance, with M=256, N=7168, a typical block size assignment of BLOCK_M=128, BLOCK_N=128 results in only (256 / 128) * (7168 / 128) = 112 out of 132 SMs being utilized. To address this, we support unaligned block sizes like 112, enabling (256 / 128) * (7168 / 112) = 128 SMs to work in such scenarios. Implementing this technique together with fine-grained scaling requires careful optimization but ultimately delivers performance gains.
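
The arithmetic above can be reproduced directly; a small standalone check, assuming the 132 SMs stated above:

from math import ceil

num_sms, m, n, block_m = 132, 256, 7168, 128
for block_n in (128, 112):
    blocks = ceil(m / block_m) * ceil(n / block_n)
    print(f"BLOCK_N={block_n}: {blocks} blocks -> "
          f"{min(blocks, num_sms)}/{num_sms} SMs busy in the first wave")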

FFMA SASS interleaving

We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we find that one bit in a series of FADD instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls yield, which may enhance warp-level parallelism (just a guess: yielding the current warp and letting other warps work).

To leverage this, we developed a similar script to modify the FFMA instructions in the compiled binary. Besides modifying the yield bit, we also flip the reuse bit (registers cannot be reused if the warp is yielded). This adjustment improves performance (10%+ in some cases) for fine-grained scaling FP8 GEMMs by creating more opportunities to overlap MMA instructions with promotion FFMA instructions.
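
For orientation only, the sketch below dumps SASS with cuobjdump (assumed to be on PATH) and counts FFMA instructions, e.g. to compare NVCC versions; the library's actual interleaving pass rewrites the yield/reuse control bits in the binary itself and is not reproduced here.

import subprocess

def count_ffma(cubin_path: str) -> int:
    # Dump SASS text and count FFMA instructions (illustrative inspection only).
    sass = subprocess.run(['cuobjdump', '-sass', cubin_path],
                          capture_output=True, text=True, check=True).stdout
    return sum(1 for line in sass.splitlines() if 'FFMA' in line)

# Example with a hypothetical path:
# print(count_ffma('kernel.cubin'))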

Acknowledgement

DeepGEMM is inspired by the CUTLASS project. Thanks and respect to the developers!

License

This code repository is released under the MIT License.

Citation

@misc{deepgemm2025,
      title={DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling}, 
      author={Chenggang Zhao and Liang Zhao and Jiashi Li and Zhean Xu},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/DeepGEMM}},
}
