
[CVPR 2026] This is the official PyTorch implementation of "MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping".

MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping


[ Conference Paper ]

Yushi Huang, Zining Wang, Zhihang Yuan📧, Ruihao Gong, Yifu Ding, Jinyang Guo, Xianglong Liu, Jun Zhang📧

(📧 denotes corresponding author.)

This is the official implementation of our paper MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. In extensive experiments on 3 model series across 13 benchmarks, MoDES significantly outperforms prior methods. For example, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, MoDES achieves a performance boost of up to 10.67% (97.33% vs. 86.66%). It also accelerates inference, delivering a 2.16× prefilling speedup and a 1.26× decoding speedup.

📖 Overview

Overview pipeline of the proposed MoDES. At inference, consider a text token (e.g., the blue square in the figure) at the $l$-th FFN layer. (a) We compute importance scores $s^{(l)}_i$ ($i\in\{2, 4, M\}$) by combining the offline-calibrated globally-modulated factor $\alpha^{(l)}$ with the local routing probability $\pi^{(l)}_i$. These scores evaluate the top-$k$ ($k=3$) routed experts for the token. (b) We then apply a modality-specific threshold ($\tau_{\text{t}}$ for text, $\tau_{\text{v}}$ for vision) found by an efficient and effective frontier search. Experts with scores below the threshold are skipped. This method significantly reduces computation while preserving performance for MoE MLLMs.
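The scoring-and-skipping rule above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the names `alpha`, `probs`, `topk_ids`, and `tau` are placeholders for $\alpha^{(l)}$, $\pi^{(l)}_i$, the top-$k$ routed expert ids, and the modality threshold, and combining the two factors by multiplication is shown purely for illustration.

```python
def select_experts(alpha, probs, topk_ids, tau):
    """Keep only routed experts whose modulated score clears the threshold.

    alpha:    offline-calibrated, globally-modulated factor for this layer
    probs:    local routing probabilities pi_i for the routed experts (id -> float)
    topk_ids: ids of the top-k routed experts for this token
    tau:      modality-specific threshold (tau_t for text, tau_v for vision)
    """
    kept = []
    for i in topk_ids:
        # Importance score s_i combines the global factor with the local
        # routing probability (shown here as a product for illustration).
        score = alpha * probs[i]
        if score >= tau:      # experts scoring below tau are skipped
            kept.append(i)
    return kept

# Example with k = 3 routed experts and tau = 0.10: only two survive.
kept = select_experts(alpha=0.8,
                      probs={2: 0.25, 4: 0.15, 7: 0.05},
                      topk_ids=[2, 4, 7],
                      tau=0.10)
print(kept)  # -> [2, 4]
```

Because both $\alpha^{(l)}$ and $\tau$ are fixed offline, the per-token decision at inference is just a comparison per routed expert, which is what keeps the skipping overhead negligible.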

🔥 News

  • Feb 20, 2026: 🔥 We release the Python code for the expert skipping presented in our paper. Have a try!

  • Feb 20, 2026: 🌟 Our paper has been accepted by CVPR 2026! 🎉 Cheers!

✨ Quick Start

Requirements

8× H100/H200/H800/A100/A800 GPUs (adjust scripts as needed for fewer GPUs)

Installation

conda create -n modes python=3.11 -y
conda activate modes
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
pip install -e ./lmms-eval
pip install qwen-vl-utils==0.0.14

For Qwen3-VL, use the latest transformers:

git clone https://github.com/huggingface/transformers.git
pip install -e transformers/

Data and Models

# Login to huggingface-cli first
mkdir -p storage/models/Kimi-VL-A3B-Instruct
# Replace with Qwen3-VL-MoE for that model
huggingface-cli download moonshotai/Kimi-VL-A3B-Instruct --local-dir ./storage/models/Kimi-VL-A3B-Instruct

# Download GQA (or VideoMMMU / COCO similarly)
mkdir -p storage/datasets/GQA/testdev_balanced_images
mkdir -p storage/datasets/GQA/testdev_balanced_instructions
huggingface-cli download lmms-lab/GQA testdev_balanced_images/testdev-00000-of-00001.parquet --repo-type dataset --local-dir ./storage/datasets/GQA/testdev_balanced_images
huggingface-cli download lmms-lab/GQA testdev_balanced_instructions/testdev-00000-of-00001.parquet --repo-type dataset --local-dir ./storage/datasets/GQA/testdev_balanced_instructions

Calibration and Frontier Search

export prefix=/path/to/your/dir
export PYTHONPATH=$prefix:$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Step 1: Calibration (compute layer importance)
accelerate launch --num_processes 8 --main_process_port 5678 \
    get_layer_importance_ddp.py \
    --name_or_path $prefix/storage/models/Kimi-VL-A3B-Instruct \
    --save_dir $prefix/storage/importance \
    --dataset gqa \
    --loss_type kl \
    --batch_size 8 \
    --num_samples 1024

# Step 2: Frontier search (find optimal tau)
export target_skip_proportion=0.8  # or 0.7, 0.6, etc.
accelerate launch --num_processes 8 --main_process_port 5678 \
    grid_search_tau_ddp.py \
    --name_or_path $prefix/storage/models/Kimi-VL-A3B-Instruct \
    --save_dir $prefix/storage/search/ \
    --dataset gqa \
    --loss_type kl \
    --batch_size 8 \
    --num_samples 1024 \
    --layer_importance_path path/to/importance/pkl \
    --target_skip_proportion $target_skip_proportion \
    --grid_num 100 \
    --grid_map exp
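As a rough illustration of the `--grid_num 100 --grid_map exp` options, an exponentially spaced candidate grid for $\tau$ could be generated as below. This is a sketch under assumptions: the actual bounds and mapping are defined in `grid_search_tau_ddp.py`, and the endpoints `lo`/`hi` here are made up for the example.

```python
def exp_grid(lo, hi, n):
    """Generate n threshold candidates between lo and hi, spaced
    exponentially so that small tau values (where fewer experts are
    skipped) are sampled more densely than large ones.

    Assumed mapping for illustration; the repository's exact grid
    construction may differ.
    """
    assert 0 < lo < hi and n >= 2
    ratio = (hi / lo) ** (1.0 / (n - 1))  # constant multiplicative step
    return [lo * ratio ** i for i in range(n)]

# 100 candidates spanning four orders of magnitude (hypothetical bounds).
grid = exp_grid(1e-4, 1.0, 100)
print(len(grid), grid[0])  # -> 100 0.0001
```

The frontier search then evaluates the calibration loss (here, KL on the 1024 GQA samples) at grid points and picks the $(\tau_{\text{t}}, \tau_{\text{v}})$ pair that meets the target skip proportion with the smallest loss.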

Evaluation

export HF_ALLOW_CODE_EVAL=1
export HF_DATASETS_TRUST_REMOTE_CODE=true
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTHONPATH=$prefix:$PYTHONPATH
export DECORD_EOF_RETRY_MAX=204800

accelerate launch --main_process_port 5678 \
    eval/kimi.py \
    --model kimi_vl \
    --model_args pretrained=storage/models/Kimi-VL-A3B-Instruct,tau_skip_path=path/to/tau,layer_importance_path=path/to/importance \
    --tasks videomme,gqa,chartqa \
    --batch_size 1 \
    --output_path $prefix/storage/eval/

Note

For Qwen3-VL, replace eval/kimi.py with eval/qwen3.py, kimi_vl with qwen3_vl, and storage/models/Kimi-VL-A3B-Instruct with storage/models/Qwen3-VL-30B-A3B-Instruct.

💪 TODO

  • Fast CUDA implementation for MoE layers.
  • Code for InternVL series.

🙏 Acknowledgments

Our code is built on top of transformers and lmms-eval.

📝 Citation

If you find MoDES useful in your research, please cite:

@InProceedings{huang2025modesacceleratingmixtureofexpertsmultimodal,
    title = {MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping}, 
    author = {Yushi Huang and Zining Wang and Zhihang Yuan and Yifu Ding and Ruihao Gong and Jinyang Guo and Xianglong Liu and Jun Zhang},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2026},
}
