MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

Guobin Ma¹, Jixun Yao¹, Ziqian Ning¹, Yuepeng Jiang¹, Lingxin Xiong², Lei Xie¹*, Pengcheng Zhu²*

¹Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
²Geely Automobile Research Institute (Ningbo) Company Ltd, Ningbo, China

arXiv | GitHub Repo | Hugging Face

Abstract

Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters.

Contents

Model Overview
Streaming Zero-shot VC
Celebrities Voice Conversion
Ethics Statement
Reference

This page is for research demonstration purposes only.

Model Overview

Figure 1. Overall architecture of our proposed MeanVC.
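
The abstract states that MeanVC regresses an average velocity field so that a single sampling step maps the start of the flow trajectory directly to its endpoint. The toy sketch below illustrates only that sampling rule and is not the MeanVC implementation. It assumes the common mean-flow convention in which t = 1 is noise and t = 0 is data, with the update z_r = z_t - (t - r) * u(z_t, r, t). The function names, the 4 x 80 "mel" shape, and the closed-form `oracle_mean_velocity` (a stand-in for a trained network) are all hypothetical.

```python
import numpy as np


def oracle_mean_velocity(z_t, r, t, target):
    """Toy stand-in for a trained network u_theta(z_t, r, t).

    Assumption: a straight path z_t = (1 - t) * target + t * noise toward a
    single known target. Its velocity (noise - target) is constant, so the
    average velocity over [r, t] equals (z_t - target) / t for any r < t.
    """
    return (z_t - target) / t


def mean_flow_step(z_t, r, t, mean_velocity):
    """Jump from time t to an earlier time r using the average velocity."""
    return z_t - (t - r) * mean_velocity(z_t, r, t)


rng = np.random.default_rng(0)
target = rng.standard_normal((4, 80))      # hypothetical 4 frames x 80 mel bins
noise = rng.standard_normal(target.shape)  # start of the trajectory (t = 1)


def u(z, r, t):
    # Bind the toy target so the call signature matches u(z_t, r, t).
    return oracle_mean_velocity(z, r, t, target)


# One sampling step: t = 1 (noise) straight to r = 0 (data).
one_step = mean_flow_step(noise, 0.0, 1.0, u)

# Two steps through t = 0.5 reach the same endpoint in this toy setting.
halfway = mean_flow_step(noise, 0.5, 1.0, u)
two_step = mean_flow_step(halfway, 0.0, 0.5, u)
```

In this toy setting the oracle is exact, so one step and two steps land on the same endpoint. With a learned network the average velocity is only approximated, and the conditioning inputs (source content, target speaker, preceding chunks) would enter through the network rather than through a known target.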

Streaming Zero-shot VC

StreamVoice¹: a streaming VC system based on a language model.

Seed-VC²: a VC system based on diffusion transformers. We use its streaming version here.

Source | Reference | StreamVoice | Seed-VC | MeanVC

Celebrities Voice Conversion

Celebrity Voice | Source | Converted

蔡徐坤 (Xukun Cai)

丁真 (Zhen Ding)

赵忠祥 (Zhongxiang Zhao)

甄嬛 (Zhen Huan)

华妃 (Huafei)

Ethics Statement

MeanVC and the demonstration audio on this page are intended solely for academic research and ethical, non-commercial use. Users must obtain explicit consent from all involved speakers (source and target) before any voice conversion, and must not employ the system for impersonation, fraud, disinformation, harassment, or any illegal or deceptive activities. Users are required to comply with all applicable laws and regulations regarding privacy, intellectual property, and deepfakes. The developers disclaim all liability for unethical or unlawful use and request that any misuse be reported through the designated contact channels.

Reference

[1] Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, and Yuping Wang, “StreamVoice: Streamable context-aware language modeling for real-time zero-shot voice conversion,” in ACL (1), Association for Computational Linguistics, 2024, pp. 7328–7338.

[2] Songting Liu, “Zero-shot voice conversion with diffusion transformers,” CoRR, vol. abs/2411.09943, 2024.
