Add ACE-Step pipeline for text-to-music generation #13095
base: main
Conversation
## What does this PR do?

This PR adds support for the ACE-Step pipeline, a text-to-music generation model that generates high-quality music with lyrics from text prompts. ACE-Step generates variable-length stereo music at 48kHz from text prompts and optional lyrics.

The implementation includes:

- **AceStepDiTModel**: A Diffusion Transformer (DiT) model that operates in the latent space using flow matching
- **AceStepPipeline**: The main pipeline for text-to-music generation with support for lyrics conditioning
- **AceStepConditionEncoder**: Condition encoder that combines text, lyric, and timbre embeddings
- **Conversion script**: Script to convert ACE-Step checkpoint weights to Diffusers format
- **Comprehensive tests**: Full test coverage for the pipeline and models
- **Documentation**: API documentation for the pipeline and transformer model

## Key Features

- Text-to-music generation with optional lyrics support
- Multi-language lyrics support (English, Chinese, Japanese, Korean, and more)
- Flow matching with custom timestep schedules
- Turbo model variant optimized for 8 inference steps
- Variable-length audio generation (configurable duration)

## Technical Details

ACE-Step comprises three main components:

1. **Oobleck autoencoder (VAE)**: Compresses waveforms into 25Hz latent representations
2. **Qwen3-based text encoder**: Encodes text prompts and lyrics for conditioning
3. **Diffusion Transformer (DiT)**: Operates in the latent space using flow matching

The pipeline supports multiple shift parameters (1.0, 2.0, 3.0) for different timestep schedules, with the turbo model designed for 8 inference steps using `shift=3.0`.

## Testing

All tests pass successfully:

- Model forward pass tests
- Pipeline basic functionality tests
- Batch processing tests
- Latent output tests
- Return dict tests

Run tests with:

```bash
pytest tests/pipelines/ace_step/test_ace_step.py -v
```

## Code Quality

- Code formatted with `make style`
- Quality checks passed with `make quality`
- All tests passing

## References

- Original codebase: [ACE-Step/ACE-Step](https://github.com/ACE-Step/ACE-Step)
- Paper: [ACE-Step: A Step Towards Music Generation Foundation Model](https://github.com/ACE-Step/ACE-Step)
- Add gradient checkpointing test for AceStepDiTModel
- Add save/load config test for AceStepConditionEncoder
- Enhance pipeline tests with PipelineTesterMixin
- Update documentation to reflect ACE-Step 1.5
- Add comprehensive transformer model tests
- Improve test coverage and code quality
- Add support for multiple task types: text2music, repaint, cover, extract, lego, complete
- Add audio normalization and preprocessing utilities
- Add tiled encode/decode for handling long audio sequences
- Add reference audio support for timbre transfer in cover task
- Add repaint functionality for regenerating audio sections
- Add metadata handling (BPM, keyscale, timesignature)
- Add audio code parsing and chunk mask building utilities
- Improve documentation with multi-task usage examples
Hi @ChuxiJ, thanks for the PR! As a preliminary comment, I tried the test script given above but got an error, which I think is due to the fact that the [...]

If I convert the checkpoint locally from a local snapshot of the checkpoint repo with

```bash
python scripts/convert_ace_step_to_diffusers.py \
    --checkpoint_dir /path/to/acestep-v15-repo \
    --dit_config acestep-v15-turbo \
    --output_dir /path/to/acestep-v15-diffusers \
    --dtype bf16
```

and then test it using the following script:

```python
import torch
import soundfile as sf

from diffusers import AceStepPipeline

OUTPUT_SAMPLE_RATE = 48000

model_id = "/path/to/acestep-v15-diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

pipe = AceStepPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(seed)

# Text-to-music generation
audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    bpm=120,
    keyscale="C major",
    generator=generator,
).audios

sf.write("acestep_t2m.wav", audio[0, 0].cpu().numpy(), OUTPUT_SAMPLE_RATE)
```

I get the following sample: [generated audio sample]

The sample quality is lower than expected, so there is probably a bug. Could you look into it?
| def _pack_sequences( |

As `_pack_sequences` is not used in the DiT code but is used in the condition encoder code, could it be moved to `modeling_ace_step.py`?
| class AceStepRMSNorm(nn.Module): |

Would it be possible to use `diffusers.models.normalization.RMSNorm` in place of `AceStepRMSNorm`?

| class RMSNorm(nn.Module): |

I believe the implementations are essentially the same (including the FP32 upcasting).
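A minimal sketch of the swap (the size and epsilon values below are placeholders, not the actual ACE-Step config):

```python
import torch

from diffusers.models.normalization import RMSNorm

hidden_size, rms_norm_eps = 2048, 1e-6  # placeholder values
norm = RMSNorm(hidden_size, eps=rms_norm_eps, elementwise_affine=True)

hidden_states = torch.randn(1, 16, hidden_size)
out = norm(hidden_states)  # same shape; the variance is computed in fp32, as in AceStepRMSNorm
```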
| class AceStepRotaryEmbedding(nn.Module): |

Would it be possible to use `get_1d_rotary_pos_embed` in place of `AceStepRotaryEmbedding`?

diffusers/src/diffusers/models/embeddings.py, line 1120 in 20efb79:

| def get_1d_rotary_pos_embed( |

I believe something like

```python
position_embeddings = get_1d_rotary_pos_embed(
    self.config.head_dim,
    position_ids,
    theta=self.config.rope_theta,
    use_real=True,
    freqs_dtype=torch.float32,
)
```

should be equivalent.
| class AceStepRotaryEmbedding(nn.Module): |
| """Rotary Position Embedding (RoPE) for ACE-Step attention layers.""" |
| def __init__(self, dim: int, max_position_embeddings: int = 32768, base: float = 1000000.0): |

Is `max_position_embeddings` used anywhere? If not, could it be removed?
| class AceStepTimestepEmbedding(nn.Module): |

I think the logic here is already implemented in `Timesteps` for the sinusoidal embedding:

diffusers/src/diffusers/models/embeddings.py, line 1310 in 20efb79:

| class Timesteps(nn.Module): |

and in `TimestepEmbedding` for the MLP:

diffusers/src/diffusers/models/embeddings.py, line 1262 in 20efb79:

| class TimestepEmbedding(nn.Module): |

Could we refactor `AceStepTimestepEmbedding` into `Timesteps` + `TimestepEmbedding` + a custom AdaLayerNormZero implementation (e.g. `AceStepAdaLayerNormZero`)? (I believe none of the existing AdaLN implementations match the one used here.)
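A rough sketch of the suggested composition (the class name, dimensions, and the 6-way modulation projection are assumptions; the real module would have to reproduce `AceStepTimestepEmbedding`'s `temb`/`timestep_proj` outputs exactly):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

from diffusers.models.embeddings import TimestepEmbedding, Timesteps


class AceStepTimestepEmbeddingSketch(nn.Module):
    """Hypothetical recomposition: sinusoidal Timesteps -> TimestepEmbedding MLP -> modulation projection."""

    def __init__(self, embedding_dim: int, inner_dim: int, num_freq_channels: int = 256):
        super().__init__()
        self.time_proj = Timesteps(num_freq_channels, flip_sin_to_cos=True, downscale_freq_shift=0)
        self.timestep_embedder = TimestepEmbedding(in_channels=num_freq_channels, time_embed_dim=embedding_dim)
        # A custom AceStepAdaLayerNormZero would own a projection like this one
        # (shift/scale/gate chunks); the 6x factor here is only an assumption.
        self.proj = nn.Linear(embedding_dim, 6 * inner_dim)

    def forward(self, timestep: torch.Tensor):
        temb = self.timestep_embedder(self.time_proj(timestep))
        timestep_proj = self.proj(F.silu(temb))
        return temb, timestep_proj


temb, timestep_proj = AceStepTimestepEmbeddingSketch(512, 512)(torch.tensor([0.2, 0.8]))
```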
| class AceStepAttention(nn.Module): |

Could we refactor `AceStepAttention` into an `Attention` + `AttnProcessor` design? For example, Flux 2 implements a `Flux2Attention` class which holds the attention state (e.g. Q, K, V projections):

| class Flux2Attention(torch.nn.Module, AttentionModuleMixin): |

and a `Flux2AttnProcessor` class which defines the attention logic:

| class Flux2AttnProcessor: |

This makes it easier to support attention backends such as Flash Attention and operations like fusing/unfusing QKV projections.
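A minimal skeleton of that split (all names here are hypothetical sketches; GQA, RoPE, sliding-window masking, and the `AttentionModuleMixin` plumbing are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AceStepAttnProcessorSketch:
    """Attention *logic* lives in the processor (this sketch just wraps SDPA)."""

    def __call__(self, attn: "AceStepAttentionSketch", hidden_states, attention_mask=None):
        batch_size, seq_len, _ = hidden_states.shape
        query = attn.to_q(hidden_states).view(batch_size, seq_len, attn.heads, -1).transpose(1, 2)
        key = attn.to_k(hidden_states).view(batch_size, seq_len, attn.heads, -1).transpose(1, 2)
        value = attn.to_v(hidden_states).view(batch_size, seq_len, attn.heads, -1).transpose(1, 2)
        # RoPE, GQA (repeat_interleave of K/V), and sliding-window masks from the
        # existing AceStepAttention code would be applied here.
        out = F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)
        out = out.transpose(1, 2).reshape(batch_size, seq_len, -1)
        return attn.to_out(out)


class AceStepAttentionSketch(nn.Module):
    """Attention *state* (projections, head config) lives in the module."""

    def __init__(self, dim: int, heads: int, processor=None):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)
        self.processor = processor if processor is not None else AceStepAttnProcessorSketch()

    def forward(self, hidden_states, **kwargs):
        return self.processor(self, hidden_states, **kwargs)


out = AceStepAttentionSketch(dim=64, heads=4)(torch.randn(1, 10, 64))
```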
| key_states = key_states.repeat_interleave(self.num_key_value_groups, dim=-3) |
| value_states = value_states.repeat_interleave(self.num_key_value_groups, dim=-3) |
| attn_output = F.scaled_dot_product_attention( |

With the refactoring suggested in #13095 (comment), you can support attention backends such as Flash Attention by using the `dispatch_attention_fn` function:

| def dispatch_attention_fn( |

A usage example from Flux 2 is as follows:

diffusers/src/diffusers/models/transformers/transformer_flux2.py, lines 158 to 165 in 20efb79:

```python
hidden_states = dispatch_attention_fn(
    query,
    key,
    value,
    attn_mask=attention_mask,
    backend=self._attention_backend,
    parallel_config=self._parallel_config,
)
```

You can look at the attention backend docs for more info.
| class AceStepEncoderLayer(nn.Module): |

Since `AceStepEncoderLayer` isn't used by the DiT model, can it be moved to `modeling_ace_step.py`?
| class AceStepDiTLayer(nn.Module): |

nit: rename `AceStepDiTLayer` to `AceStepTransformerBlock`, following the usual diffusers naming convention.
| class AceStepDiTModel(ModelMixin, ConfigMixin): |

nit: rename `AceStepDiTModel` to `AceStepTransformer1DModel`, following the usual diffusers naming convention.
| class AceStepDiTModel(ModelMixin, ConfigMixin): |

Suggestion: having `AceStepDiTModel` inherit from `AttentionMixin` will help support model-wide attention operations like fusing QKV projections.

In addition, if the ACE-Step model is compatible with caching techniques like MagCache, you can also consider inheriting from `CacheMixin`:

| class CacheMixin: |
| attention_bias: bool = False, |
| attention_dropout: float = 0.0, |
| rms_norm_eps: float = 1e-6, |
| use_sliding_window: bool = True, |

Is there ever a case where we want `use_sliding_window=False`? If not, perhaps we can remove this argument?
| if torch.is_grad_enabled() and self.gradient_checkpointing: |
| hidden_states = self._gradient_checkpointing_func( |
| layer_module.__call__, |

Suggested change:

```diff
-                layer_module.__call__,
+                layer_module,
```

nit: the usual idiom here is to supply just `layer_module`, such as in Flux 2:

diffusers/src/diffusers/models/transformers/transformer_flux2.py, lines 863 to 872 in 20efb79:

```python
if torch.is_grad_enabled() and self.gradient_checkpointing:
    encoder_hidden_states, hidden_states = self._gradient_checkpointing_func(
        block,
        hidden_states,
        encoder_hidden_states,
        double_stream_mod_img,
        double_stream_mod_txt,
        concat_rotary_emb,
        joint_attention_kwargs,
    )
```
| logger = logging.get_logger(__name__)  # pylint: disable=invalid-name |
| class AceStepLyricEncoder(ModelMixin, ConfigMixin): |

If I understand correctly, the original code supports gradient checkpointing for `AceStepLyricEncoder` and `AceStepTimbreEncoder`, so I think we can support it here as well, in the same manner as `AceStepDiTModel`.
| TASK_TYPES = ["text2music", "repaint", "cover", "extract", "lego", "complete"] |
| # Sample rate used by ACE-Step |
| SAMPLE_RATE = 48000 |

Instead of hardcoding the sample rate, could we read it from the VAE config (`self.vae.config.sampling_rate`)?
| latents = latents.squeeze(0) |
| return latents |
| def _tiled_encode( |

Would it be possible to move the VAE tiled encoding/decoding logic to the VAE (`AutoencoderOobleck`)? Ideally, the VAE defines the tiling logic and then users can enable it if desired with `pipe.vae.enable_tiling()`.
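A rough, simplified sketch of time-axis tiled decoding on the VAE side (the chunk/overlap handling here is hypothetical; the pipeline's existing `_tiled_encode`/`_tiled_decode` logic would be the real reference):

```python
import torch


def tiled_decode_sketch(latents: torch.Tensor, decode_fn, tile_length: int = 512, overlap: int = 64) -> torch.Tensor:
    """Decode long latents chunk by chunk along the time axis and stitch the results.

    `decode_fn` stands in for the VAE decoder (e.g. `vae.decode(...).sample`); the tile and
    overlap sizes are placeholders, not the values used by ACE-Step.
    """
    _, _, length = latents.shape  # (batch, channels, time)
    stride = tile_length - overlap
    chunks = []
    for start in range(0, length, stride):
        chunk = latents[:, :, start : start + tile_length]
        decoded = decode_fn(chunk)
        if start > 0:
            # Drop the overlapping region already produced by the previous tile.
            upsample = decoded.shape[-1] // chunk.shape[-1]
            decoded = decoded[:, :, overlap * upsample :]
        chunks.append(decoded)
        if start + tile_length >= length:
            break
    return torch.cat(chunks, dim=-1)


# Toy usage with a stand-in decoder that upsamples 4x along the time axis:
decoded = tiled_decode_sketch(torch.randn(1, 64, 1000), lambda z: z.repeat_interleave(4, dim=-1))
```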
| model_cpu_offload_seq = "text_encoder->condition_encoder->transformer->vae" |
| def __init__( |

I think including the audio tokenizer and detokenizer (I believe `AceStepAudioTokenizer` and `AudioTokenDetokenizer` in the original code) as registered modules would make the pipeline easier to use and more self-contained. If I understand correctly, this would allow users to supply raw audio waveforms instead of `audio_codes` to `__call__`, and the user would not have to manually tokenize the audio into an audio code string first.
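A hedged sketch of what that registration could look like (the `__init__` signature and the `audio_tokenizer`/`audio_detokenizer` argument names are assumptions):

```python
from diffusers import DiffusionPipeline


class AceStepPipelineSketch(DiffusionPipeline):
    # Only the registration call is sketched here; the real __init__ signature is an assumption.
    def __init__(self, vae, text_encoder, tokenizer, condition_encoder, transformer,
                 audio_tokenizer=None, audio_detokenizer=None):
        super().__init__()
        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            condition_encoder=condition_encoder,
            transformer=transformer,
            audio_tokenizer=audio_tokenizer,
            audio_detokenizer=audio_detokenizer,
        )
```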
| # 0. Default values and input validation |

Arguments to `__call__` should be validated in a `check_inputs` method. Here is an example from the Flux 2 pipeline:

| def check_inputs( |
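A minimal sketch of such a method for this pipeline (the specific checks are illustrative, not taken from the PR):

```python
# Sketch of a check_inputs method on AceStepPipeline; the individual checks are examples only.
def check_inputs(self, prompt, audio_duration, num_inference_steps, callback_steps=None):
    if prompt is not None and not isinstance(prompt, (str, list)):
        raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
    if audio_duration is not None and audio_duration <= 0:
        raise ValueError(f"`audio_duration` has to be positive but is {audio_duration}")
    if num_inference_steps is not None and num_inference_steps <= 0:
        raise ValueError(f"`num_inference_steps` has to be positive but is {num_inference_steps}")
    if callback_steps is not None and callback_steps <= 0:
        raise ValueError(f"`callback_steps` has to be a positive integer but is {callback_steps}")
```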
| # 2. Prepare source latents and latent length |
| latent_length = int(audio_duration * 25) |

Would it be possible to calculate the latents per second instead of hardcoding it to be 25? I believe this can be done from the VAE config values as

```python
latents_per_second = self.vae.config.sampling_rate / math.prod(self.vae.config.downsampling_ratios)
```

| latent_length = src_latent_length |
| else: |
| src_latents = torch.zeros(batch_size, latent_length, acoustic_dim, device=device, dtype=dtype) |

Could the `src_audio` / `audio_codes` to `src_latents` logic be refactored into a method (for example `encode_audio`)? I think this would make the code more readable.
| # 8. Prepare null condition for CFG (if guidance_scale > 1) |
| do_cfg = guidance_scale > 1.0 |

Could this be refactored into a `do_classifier_free_guidance` property? See for example `LTX2Pipeline`:

diffusers/src/diffusers/pipelines/ltx2/pipeline_ltx2.py, lines 760 to 762 in 20efb79:

```python
@property
def do_classifier_free_guidance(self):
    return self._guidance_scale > 1.0
```
| model_cpu_offload_seq = "text_encoder->condition_encoder->transformer->vae" |
| def __init__( |

We should use a scheduler such as `FlowMatchEulerDiscreteScheduler` to handle the timestep schedule and sampling logic.
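For reference, a hedged sketch of constructing such a scheduler (`shift=3.0` mirrors the turbo setting described in this PR; whether ACE-Step's schedule maps exactly onto this scheduler still needs to be verified):

```python
from diffusers import FlowMatchEulerDiscreteScheduler

# shift=3.0 matches the turbo setting (8 steps) described in the PR; the base models
# reportedly also support shift=1.0 and shift=2.0.
scheduler = FlowMatchEulerDiscreteScheduler(shift=3.0)
scheduler.set_timesteps(num_inference_steps=8)
print(scheduler.timesteps)
```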
| # 9. Get timestep schedule |
| t_schedule = self._get_timestep_schedule( |

See #13095 (comment). In particular, `self.scheduler.set_timesteps` should be called here. If we want to use a custom sigma schedule, some schedulers (such as `FlowMatchEulerDiscreteScheduler`) accept it through a `sigmas` argument to `set_timesteps`.
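A sketch of what that could look like (the sigma values below are placeholders, not ACE-Step's actual shift-based schedule):

```python
import numpy as np

from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler(shift=3.0)

# Placeholder schedule: the pipeline's custom sigma schedule would be computed here
# and handed to the scheduler instead of being applied manually in the loop.
num_inference_steps = 8
sigmas = np.linspace(1.0, 1.0 / num_inference_steps, num_inference_steps).tolist()
scheduler.set_timesteps(sigmas=sigmas)
timesteps = scheduler.timesteps
```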
| timestep_ratio = 1.0 - current_timestep  # t=1 -> ratio=0, t=0 -> ratio=1 |
| apply_cfg = do_cfg and (cfg_interval_start <= timestep_ratio <= cfg_interval_end) |
| if apply_cfg: |

Would refactoring the CFG calculation to be batched make sense here? See for example `LTX2Pipeline`:

diffusers/src/diffusers/pipelines/ltx2/pipeline_ltx2.py, lines 1103 to 1104 in 20efb79:

```python
with self.transformer.cache_context("cond_uncond"):
    noise_pred_video, noise_pred_audio = self.transformer(
```
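A generic sketch of the batched CFG pattern (the transformer call signature and tensor shapes here are illustrative, not the actual ACE-Step interfaces):

```python
import torch


def batched_cfg_step(transformer, latents, cond_embeds, uncond_embeds, timestep, guidance_scale):
    # Run the conditional and unconditional branches as one batch, then split
    # the prediction and apply the usual CFG combination.
    latent_input = torch.cat([latents, latents], dim=0)
    context = torch.cat([cond_embeds, uncond_embeds], dim=0)
    t = torch.cat([timestep, timestep], dim=0)
    noise_pred = transformer(latent_input, context, t)
    noise_pred_cond, noise_pred_uncond = noise_pred.chunk(2, dim=0)
    return noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)


# Toy usage with a stand-in "transformer" that just echoes its latent input:
pred = batched_cfg_step(
    lambda x, ctx, t: x,
    torch.randn(1, 750, 64), torch.randn(1, 77, 512), torch.zeros(1, 77, 512),
    torch.tensor([0.5]), guidance_scale=4.5,
)
```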
| # Euler ODE step: x_{t-1} = x_t - v_t * dt |
| next_timestep = t_schedule[step_idx + 1].item() |
| dt = current_timestep - next_timestep |
| dt_tensor = dt * torch.ones((batch_size,), device=device, dtype=dtype).unsqueeze(-1).unsqueeze(-1) |
| xt = xt - vt * dt_tensor |

See #13095 (comment). In particular, the scheduler's `step` method should be called here instead to get the next latent in the denoising loop.
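A toy sketch of the loop body using the scheduler's `step` method (shapes and the zero "prediction" are placeholders):

```python
import torch

from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler(shift=3.0)
scheduler.set_timesteps(num_inference_steps=8)

latents = torch.randn(1, 750, 64)  # toy shape, not the actual ACE-Step latent layout
for t in scheduler.timesteps:
    noise_pred = torch.zeros_like(latents)  # stand-in for the transformer's velocity prediction
    # The scheduler replaces the manual Euler update x_{t-1} = x_t - v_t * dt.
    latents = scheduler.step(noise_pred, t, latents, return_dict=False)[0]
```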
| if callback is not None and step_idx % callback_steps == 0: |
| callback(step_idx, t_curr_tensor, xt) |

Could we support `callback_on_step_end` here? See for example `Flux2Pipeline`:

diffusers/src/diffusers/pipelines/flux2/pipeline_flux2.py, lines 993 to 1004 in 20efb79:

```python
if callback_on_step_end is not None:
    callback_kwargs = {}
    for k in callback_on_step_end_tensor_inputs:
        callback_kwargs[k] = locals()[k]
    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
    latents = callback_outputs.pop("latents", latents)
    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)

# call the callback, if provided
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
    progress_bar.update()
```
| if __name__ == "__main__": |
| parser = argparse.ArgumentParser(description="Convert ACE-Step model weights to Diffusers pipeline format") |
| parser.add_argument( |
| "--checkpoint_dir", |

Could the conversion script support supplying a Hugging Face Hub repo id (such as ACE-Step/Ace-Step1.5) for `--checkpoint_dir`? I think this would make it easier to use (since you don't need to download a local copy of the repo first).
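One hedged way to support that in the script is to fall back to `huggingface_hub.snapshot_download` when the argument is not a local directory, e.g.:

```python
import os

from huggingface_hub import snapshot_download


def resolve_checkpoint_dir(checkpoint_dir: str) -> str:
    # If the argument is not an existing local path, treat it as a Hub repo id
    # (e.g. "ACE-Step/Ace-Step1.5") and download a snapshot of it.
    if os.path.isdir(checkpoint_dir):
        return checkpoint_dir
    return snapshot_download(repo_id=checkpoint_dir)
```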
dg845 left a comment

Hi @ChuxiJ, thanks for the PR! I have left an initial design review. Happy to help with anything :).
What does this PR do?
This PR adds the ACE-Step 1.5 pipeline to Diffusers — a text-to-music generation model that produces high-quality stereo music with lyrics at 48kHz from text prompts.
New Components
- **AceStepDiTModel** (`src/diffusers/models/transformers/ace_step_transformer.py`): A Diffusion Transformer (DiT) model with RoPE, GQA, sliding window attention, and flow matching for denoising audio latents. Includes custom components: `AceStepRMSNorm`, `AceStepRotaryEmbedding`, `AceStepMLP`, `AceStepTimestepEmbedding`, `AceStepAttention`, `AceStepEncoderLayer`, and `AceStepDiTLayer`.
- **AceStepConditionEncoder** (`src/diffusers/pipelines/ace_step/modeling_ace_step.py`): Condition encoder that fuses text, lyric, and timbre embeddings into a unified cross-attention conditioning signal. Includes `AceStepLyricEncoder` and `AceStepTimbreEncoder` sub-modules.
- **AceStepPipeline** (`src/diffusers/pipelines/ace_step/pipeline_ace_step.py`): The main pipeline supporting 6 task types:
  - `text2music` — generate music from text and lyrics
  - `cover` — generate from audio semantic codes or with timbre transfer via reference audio
  - `repaint` — regenerate a time region within existing audio
  - `extract` — extract a specific track (vocals, drums, etc.) from audio
  - `lego` — generate a specific track given audio context
  - `complete` — complete audio with additional tracks
- **Conversion script** (`scripts/convert_ace_step_to_diffusers.py`): Converts original ACE-Step 1.5 checkpoint weights to Diffusers format.

Key Features
- Task instructions via `_get_task_instruction`
- `bpm`, `keyscale`, `timesignature` parameters formatted into the SFT prompt template
- Source audio (`src_audio`) and reference audio (`reference_audio`) inputs with VAE encoding
- Tiled encoding (`_tiled_encode`) and decoding (`_tiled_decode`) for long audio
- CFG with `guidance_scale`, `cfg_interval_start`, and `cfg_interval_end` (primarily for base/SFT models; turbo models have guidance distilled into weights)
- `audio_cover_strength` control for cover tasks
- `_parse_audio_code_string` extracts semantic codes from `<|audio_code_N|>` tokens for cover tasks
- `_build_chunk_mask` creates time-region masks for repaint/lego tasks
- Custom `timesteps` support; the turbo model is designed for 8 inference steps with `shift=3.0`

Architecture
ACE-Step 1.5 comprises three main components:

- Oobleck autoencoder (VAE): compresses waveforms into 25Hz latent representations
- Qwen3-based text encoder: encodes text prompts and lyrics for conditioning
- Diffusion Transformer (DiT): operates in the latent space using flow matching
Tests
Pipeline tests (`tests/pipelines/ace_step/test_ace_step.py`):

- `AceStepDiTModelTests` — forward shape, return dict, gradient checkpointing
- `AceStepConditionEncoderTests` — forward shape, save/load config
- `AceStepPipelineFastTests` (extends `PipelineTesterMixin`) — 39 tests covering basic generation, batch processing, latent output, save/load, float16 inference, CPU/model offloading, encode_prompt, prepare_latents, timestep_schedule, format_prompt, and more

Transformer model tests (`tests/models/transformers/test_models_transformer_ace_step.py`):

- `TestAceStepDiTModel` (extends `ModelTesterMixin`) — forward pass, dtype inference, save/load, determinism
- `TestAceStepDiTModelMemory` (extends `MemoryTesterMixin`) — layerwise casting, group offloading
- `TestAceStepDiTModelTraining` (extends `TrainingTesterMixin`) — training, EMA, gradient checkpointing, mixed precision

All 70 tests pass (39 pipeline + 31 model).
Documentation
- `docs/source/en/api/pipelines/ace_step.md` — Pipeline API documentation with usage examples
- `docs/source/en/api/models/ace_step_transformer.md` — Transformer model documentation

Usage
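A minimal text-to-music example, adapted from the test script earlier in this thread (the checkpoint path is a placeholder):

```python
import soundfile as sf
import torch

from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained("/path/to/acestep-v15-diffusers", torch_dtype=torch.bfloat16).to("cuda")

audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    bpm=120,
    keyscale="C major",
    generator=torch.Generator(device="cuda").manual_seed(42),
).audios

sf.write("acestep_t2m.wav", audio[0, 0].cpu().numpy(), 48000)
```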
Before submitting
Who can review?
References