MeanVC 2

Robust Low-Latency Streaming Zero-Shot Voice Conversion

Guobin Ma¹,*, Yuxuan Xia¹,*, Yuepeng Jiang¹, Dake Guo¹, Hanke Xie¹, Jingbin Hu¹, Yanbo Wang², Lei Xie¹,**, Pengcheng Zhu³,**

¹Audio, Speech and Language Processing Group (ASLP@NPU), School of Software,
Northwestern Polytechnical University, China
²The University of New South Wales, Australia
³WeNet Open Source Community, China

arXiv | GitHub Repo | Hugging Face

Abstract Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has three limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, its conversion quality degrades under small-chunk settings, and its timbre encoder relies directly on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across the diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC while reducing latency from 211 ms to 110 ms.
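
To make the idea of scheduling past and future receptive fields per decoder layer more concrete, the following is a minimal sketch of chunk-level attention masks. It is not the authors' implementation: the helper names `chunk_mask` and `stacked_reach`, the toy sizes, and the per-layer `schedule` values are all hypothetical and chosen only for illustration. The sketch shows one general property of such schedules: when each layer may look ahead a fixed number of chunks, the total future context of the stack is bounded by the sum of the per-layer look-aheads.

```python
import numpy as np

def chunk_mask(num_frames, chunk_size, past_chunks, future_chunks):
    """Boolean mask where entry [q, k] is True if query frame q may attend to key frame k.

    A frame in chunk c may see chunks c - past_chunks through c + future_chunks.
    """
    chunk_id = np.arange(num_frames) // chunk_size
    diff = chunk_id[None, :] - chunk_id[:, None]  # key chunk index minus query chunk index
    return (diff >= -past_chunks) & (diff <= future_chunks)

def stacked_reach(masks):
    """Which input frames can influence each output frame after applying all layers in order."""
    reach = masks[0]
    for mask in masks[1:]:
        # Layer output q depends on input k if some frame j is visible to q in this
        # layer and j already depended on k in the layers below.
        reach = (mask.astype(int) @ reach.astype(int)) > 0
    return reach

# Hypothetical schedule of (past_chunks, future_chunks) per layer; assumed values,
# not taken from the paper.
schedule = [(4, 1), (4, 1), (4, 0), (4, 0)]
num_frames, chunk_size = 12, 2  # toy sizes: 6 chunks of 2 frames each

masks = [chunk_mask(num_frames, chunk_size, p, f) for p, f in schedule]
reach = stacked_reach(masks)

# Total look-ahead of the stack, in chunks, is the sum of per-layer look-aheads.
total_future_chunks = sum(f for _, f in schedule)
print("total future look-ahead (chunks):", total_future_chunks)
print("frames that influence output frame 0:", np.flatnonzero(reach[0]).tolist())
```

In this toy setting, only the first two layers look ahead, so an output frame in the first chunk depends on at most two later chunks of input. In a real decoder, masks like these would be passed to each layer's attention as its attention mask.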

Contents

Model Overview
Video Demo
Streaming Zero-shot VC
Celebrities Voice Conversion
Ethics Statement
Reference

This page is for research demonstration purposes only.

Model Overview

Figure 1. Overall architecture of our proposed MeanVC 2.

Video Demo

Tutorial

Demo

Streaming Zero-shot VC

StreamVoice+¹: a streaming VC system based on a language model.

MeanVC²: a lightweight streaming zero-shot voice conversion system.

Reference	Source	StreamVoice+(80ms)	MeanVC(80ms)	MeanVC(160ms)	MeanVC 2(40ms)

Celebrities Voice Conversion

Celebrity Voice	Source	MeanVC(80ms)	MeanVC(160ms)	MeanVC 2(40ms)

蔡徐坤 (Xukun Cai)

周杰伦 (Jay Chou)

丁真 (Zhen Ding)

罗永浩 (Yonghao Luo)

余承东 (Chengdong Yu)

Ethics Statement

MeanVC 2 and the demonstration audio on this page are intended solely for academic research and ethical, non-commercial use. Users must obtain explicit consent from all involved speakers (source and target) before any voice conversion, and must not employ the system for impersonation, fraud, disinformation, harassment, or any illegal or deceptive activities. Users are required to comply with all applicable laws regarding privacy, intellectual property, and anti-deepfake regulations. The developers disclaim all liability for unethical or unlawful use and request that any misuse be reported through the designated contact channels.

Reference

[1] Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang, “StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion,” IEEE Signal Process. Lett., vol. 31, pp. 3000–3004, 2024.

[2] Guobin Ma, Jixun Yao, Ziqian Ning, Yuepeng Jiang, Lingxin Xiong, Lei Xie, Pengcheng Zhu, “MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows,” CoRR, vol. abs/2510.08392, 2025.
