🎤 YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance


Chunbo Hao1,2 · Junjie Zheng2 · Guobin Ma1 · Yuepeng Jiang1 · Huakang Chen1 · Wenjie Tian1 · Gongyu Chen2 · Zihao Chen2 · Lei Xie1

1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
2 AI Lab, GiantNetwork, China

Abstract

Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model that enables melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs and requires no manual alignment: an optional timbre reference, a singing clip that provides the melody, and the modified lyrics. Trained with curriculum learning and Group Relative Policy Optimization (GRPO), YingMusic-Singer achieves stronger melody preservation and lyric adherence than Vevo2 [1], the most comparable baseline that supports melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for evaluating melody-preserving lyric modification. Code, weights, and the benchmark will be publicly released, with demos available at https://anonymous.4open.science/w/YingMusic-Singer.
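As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below normalizes the rewards of several candidates generated from the same inputs by the group mean and standard deviation. The function name, the reward values, and the use of the population standard deviation are assumptions made for illustration. The reward design used to train YingMusic-Singer is not described on this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidates sampled for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so candidates
    are scored relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidates generated from the same inputs.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Candidates that score above the group mean receive positive advantages and those below receive negative ones. A group with identical rewards yields all-zero advantages.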



[Figure: YingMusic-Singer model]


Sing Edit

The model should preserve the melody from the Original Melody, match the timbre of the Timbre Reference, and faithfully render the Modified Lyrics.

Columns: Original Language | Edit Task | Original Melody | Timbre Reference | Original Lyrics | Modified Lyrics | Vevo2 [1] | Ours
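The three inputs named above (a melody-providing clip, modified lyrics, and an optional timbre reference) could be bundled as in the minimal sketch below. The class name `SingEditRequest`, its field names, and the file paths are hypothetical and are not part of any released YingMusic-Singer API. How the model behaves when no timbre reference is given is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SingEditRequest:
    """Hypothetical container for the three inputs; not the released API."""

    melody_clip: str                        # path to the singing clip that provides the melody
    modified_lyrics: str                    # new lyrics to render
    timbre_reference: Optional[str] = None  # optional path to a timbre reference clip

    def __post_init__(self):
        # The melody clip and the lyrics are mandatory; only the timbre reference is optional.
        if not self.melody_clip:
            raise ValueError("melody_clip is required")
        if not self.modified_lyrics.strip():
            raise ValueError("modified_lyrics must not be empty")

    def has_timbre_reference(self) -> bool:
        return self.timbre_reference is not None


# Example request without a timbre reference (placeholder path and lyrics).
request = SingEditRequest(melody_clip="melody.wav", modified_lyrics="new words to sing")
```

No alignment field appears in the sketch because the method takes the lyrics as plain text without manual alignment.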

Melody Control

The model should preserve the melody from the Original Melody, match the timbre of the Timbre Reference, and faithfully render the Modified Lyrics.

Columns: Original Language | Edit Task | Original Melody | Timbre Reference | Original Lyrics | Modified Lyrics | Vevo2 [1] | Ours

Ethics Statement

YingMusic-Singer enables the creation of singing voices with modified lyrics, supporting applications in artistic creation and entertainment. Potential risks include unauthorized voice cloning and copyright infringement. To ensure responsible deployment, users should obtain consent for voice usage, disclose AI involvement, and verify musical originality.

Reference

[1] X. Zhang, J. Zhang, Y. Wang, C. Wang, Y. Chen, D. Jia, Z. Chen, and Z. Wu, "Vevo2: A unified and controllable framework for speech and singing voice generation," CoRR, vol. abs/2508.16332, 2025.
