
[CVPR 2026] This is the official PyTorch implementation of "MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping".

MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping


[ Conference Paper ]

Yushi Huang, Zining Wang, Zhihang Yuan📧, Ruihao Gong, Yifu Ding, Jinyang Guo, Xianglong Liu, Jun Zhang📧

(📧 denotes corresponding author.)

This is the official implementation of our paper MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. In extensive experiments on 3 model series across 13 benchmarks, MoDES significantly outperforms prior methods. For example, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, MoDES achieves a performance boost of up to 10.67% (97.33% vs. 86.66%). It also accelerates inference, delivering a 2.16× prefilling speedup and a 1.26× decoding speedup.

📖 Overview

Overview pipeline of the proposed MoDES. At inference, consider a text token (e.g., the blue square in the figure) at the $l$-th FFN layer. (a) We compute importance scores $s^{(l)}_i$ ($i\in\{2, 4, M\}$) by combining the offline-calibrated globally-modulated factor $\alpha^{(l)}$ with the local routing probability $\pi^{(l)}_i$. These scores evaluate the top-$k$ ($k=3$) routed experts for the token. (b) We then apply a modality-specific threshold ($\tau_{\text{t}}$ for text, $\tau_{\text{v}}$ for vision) found by an efficient and effective frontier search. Experts with scores below the threshold are skipped. This method significantly reduces computation while preserving performance for MoE MLLMs.
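The scoring-and-skipping rule above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the names `alpha`, `probs`, `topk_ids`, and `tau` are placeholders for $\alpha^{(l)}$, $\pi^{(l)}_i$, the top-$k$ routed expert ids, and the modality threshold, and combining the two factors by multiplication is shown purely for illustration.

```python
def select_experts(alpha, probs, topk_ids, tau):
    """Keep only routed experts whose modulated score clears the threshold.

    alpha:    offline-calibrated, globally-modulated factor for this layer
    probs:    local routing probabilities pi_i for the routed experts (id -> float)
    topk_ids: ids of the top-k routed experts for this token
    tau:      modality-specific threshold (tau_t for text, tau_v for vision)
    """
    kept = []
    for i in topk_ids:
        # Importance score s_i combines the global factor with the local
        # routing probability (shown here as a product for illustration).
        score = alpha * probs[i]
        if score >= tau:      # experts scoring below tau are skipped
            kept.append(i)
    return kept

# Example with k = 3 routed experts and tau = 0.10: only two survive.
kept = select_experts(alpha=0.8,
                      probs={2: 0.25, 4: 0.15, 7: 0.05},
                      topk_ids=[2, 4, 7],
                      tau=0.10)
print(kept)  # -> [2, 4]
```

Because both $\alpha^{(l)}$ and $\tau$ are fixed offline, the per-token decision at inference is just a comparison per routed expert, which is what keeps the skipping overhead negligible.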

🔥 News

  • Feb 20, 2026: 🔥 We release the Python code for the expert skipping presented in our paper. Have a try!

  • Feb 20, 2026: 🌟 Our paper has been accepted by CVPR 2026! 🎉 Cheers!

✨ Quick Start

Requirements

8× H100/H200/H800/A100/A800 GPUs (adjust scripts as needed for fewer GPUs)

Installation

conda create -n modes python=3.11 -y
conda activate modes
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
pip install -e ./lmms-eval
pip install qwen-vl-utils==0.0.14

For Qwen3-VL, use the latest transformers:

git clone https://github.com/huggingface/transformers.git
pip install -e transformers/

Data and Models

# Login to huggingface-cli first
mkdir -p storage/models/Kimi-VL-A3B-Instruct
# Replace with Qwen3-VL-MoE for that model
huggingface-cli download moonshotai/Kimi-VL-A3B-Instruct --local-dir ./storage/models/Kimi-VL-A3B-Instruct

# Download GQA (or VideoMMMU / COCO similarly)
mkdir -p storage/datasets/GQA/testdev_balanced_images
mkdir -p storage/datasets/GQA/testdev_balanced_instructions
huggingface-cli download lmms-lab/GQA testdev_balanced_images/testdev-00000-of-00001.parquet --repo-type dataset --local-dir ./storage/datasets/GQA/testdev_balanced_images
huggingface-cli download lmms-lab/GQA testdev_balanced_instructions/testdev-00000-of-00001.parquet --repo-type dataset --local-dir ./storage/datasets/GQA/testdev_balanced_instructions

Calibration and Frontier Search

export prefix=/path/to/your/dir
export PYTHONPATH=$prefix:$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Step 1: Calibration (compute layer importance)
accelerate launch --num_processes 8 --main_process_port 5678 \
    get_layer_importance_ddp.py \
    --name_or_path $prefix/storage/models/Kimi-VL-A3B-Instruct \
    --save_dir $prefix/storage/importance \
    --dataset gqa \
    --loss_type kl \
    --batch_size 8 \
    --num_samples 1024

# Step 2: Frontier search (find optimal tau)
export target_skip_proportion=0.8  # or 0.7, 0.6, etc.
accelerate launch --num_processes 8 --main_process_port 5678 \
    grid_search_tau_ddp.py \
    --name_or_path $prefix/storage/models/Kimi-VL-A3B-Instruct \
    --save_dir $prefix/storage/search/ \
    --dataset gqa \
    --loss_type kl \
    --batch_size 8 \
    --num_samples 1024 \
    --layer_importance_path path/to/importance/pkl \
    --target_skip_proportion $target_skip_proportion \
    --grid_num 100 \
    --grid_map exp
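As a rough illustration of the `--grid_num 100 --grid_map exp` options, an exponentially spaced candidate grid for $\tau$ could be generated as below. This is a sketch under assumptions: the actual bounds and mapping are defined in `grid_search_tau_ddp.py`, and the endpoints `lo`/`hi` here are made up for the example.

```python
def exp_grid(lo, hi, n):
    """Generate n threshold candidates between lo and hi, spaced
    exponentially so that small tau values (where fewer experts are
    skipped) are sampled more densely than large ones.

    Assumed mapping for illustration; the repository's exact grid
    construction may differ.
    """
    assert 0 < lo < hi and n >= 2
    ratio = (hi / lo) ** (1.0 / (n - 1))  # constant multiplicative step
    return [lo * ratio ** i for i in range(n)]

# 100 candidates spanning four orders of magnitude (hypothetical bounds).
grid = exp_grid(1e-4, 1.0, 100)
print(len(grid), grid[0])  # -> 100 0.0001
```

The frontier search then evaluates the calibration loss (here, KL on the 1024 GQA samples) at grid points and picks the $(\tau_{\text{t}}, \tau_{\text{v}})$ pair that meets the target skip proportion with the smallest loss.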

Evaluation

export HF_ALLOW_CODE_EVAL=1
export HF_DATASETS_TRUST_REMOTE_CODE=true
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTHONPATH=$prefix:$PYTHONPATH
export DECORD_EOF_RETRY_MAX=204800

accelerate launch --main_process_port 5678 \
    eval/kimi.py \
    --model kimi_vl \
    --model_args pretrained=storage/models/Kimi-VL-A3B-Instruct,tau_skip_path=path/to/tau,layer_importance_path=path/to/importance \
    --tasks videomme,gqa,chartqa \
    --batch_size 1 \
    --output_path $prefix/storage/eval/

Note

For Qwen3-VL, replace eval/kimi.py with eval/qwen3.py, kimi_vl with qwen3_vl, and storage/models/Kimi-VL-A3B-Instruct with storage/models/Qwen3-VL-30B-A3B-Instruct.

💪 TODO

  • Fast CUDA implementation for MoE layers.
  • Code for InternVL series.

🙏 Acknowledgments

Our code is built on top of transformers and lmms-eval.

📝 Citation

If you find MoDES useful in your research, please cite:

@InProceedings{huang2025modesacceleratingmixtureofexpertsmultimodal,
    title = {MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping}, 
    author = {Yushi Huang and Zining Wang and Zhihang Yuan and Yifu Ding and Ruihao Gong and Jinyang Guo and Xianglong Liu and Jun Zhang},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2026},
}
