📢: Good news! 21,800 hours of multi-label Cantonese speech data are also available at WenetSpeech-Yue.

WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus With Rich Annotation For Dialectal Speech Processing

Yuhang Dai1,*, Ziyu Zhang1,*, Shuai Wang4,5, Longhao Li1, Zhao Guo1, Tianlun Zuo1, Shuiyuan Wang1, Hongfei Xue1, Chengyou Wang1, Qing Wang3, Xin Xu2, Hui Bu2, Jie Li3, Jian Kang3, Binbin Zhang5, Lei Xie1,†

1 Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
2 Beijing AISHELL Technology Co., Ltd.
3 Institute of Artificial Intelligence (TeleAI), China Telecom
4 School of Intelligence Science and Technology, Nanjing University
5 WeNet Open Source Community

📑 Paper    |    🐙 GitHub    |    🤗 HuggingFace
🎤 Demo Page    |    💬 Contact Us

Abstract

The scarcity of large-scale, open-source data for dialects severely hinders progress in speech technology, a challenge particularly acute for the widely spoken Sichuanese dialects of Chinese. To address this gap, we introduce WenetSpeech-Chuan, a 10,000-hour, richly annotated corpus constructed using our novel Chuan-Pipeline, a complete data processing framework for dialectal speech. To facilitate rigorous evaluation and demonstrate the corpus's effectiveness, we also release high-quality ASR and TTS benchmarks, WenetSpeech-Chuan-Eval, with manually verified transcriptions. Experiments show that models trained on WenetSpeech-Chuan achieve state-of-the-art performance among open-source systems and results comparable to commercial services. As the largest open-source corpus for Sichuanese dialects, WenetSpeech-Chuan not only lowers the barrier to research in dialectal speech processing but also helps promote AI equity and mitigate bias in speech technologies. The corpus, benchmarks, models, and recipes are publicly available. Demos can be found in the supplementary material.

Promotional video

Chuan-Pipeline

Chuan-Pipeline is an automated pipeline designed for building large-scale Sichuanese datasets with multi-dimensional annotation. It consists of six components: (A) Audio Collection, (B) Speaker Attributes Annotation, (C) Speech Quality Annotation, (D) Automatic Speech Recognition, (E) Text Post-Processing, and (F) Recognizer Output Voting. The figure below provides an overview of the Chuan-Pipeline.

[Figure: Overview of the Chuan-Pipeline]
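The idea behind step (F) can be illustrated with a deliberately simplified voting scheme. The sketch below is not the Chuan-Pipeline implementation and is not full ROVER, which aligns the hypotheses and votes slot by slot. It only picks, among several ASR hypotheses for one utterance, the one that agrees most with the others, using Python's standard `difflib`. The function name `vote_hypotheses` and the agreement score are illustrative assumptions.

```python
from difflib import SequenceMatcher


def vote_hypotheses(hypotheses):
    """Return (best_text, agreement) for a list of ASR hypotheses.

    Simplified stand-in for recognizer output voting (an assumption, not the
    pipeline's actual algorithm): each hypothesis is scored by its mean
    character-level similarity to all other hypotheses, and the highest
    scoring one wins. The score lies in [0, 1].
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    if len(hypotheses) == 1:
        return hypotheses[0], 1.0

    best_text, best_score = None, -1.0
    for i, hyp in enumerate(hypotheses):
        others = [h for j, h in enumerate(hypotheses) if j != i]
        # Mean similarity of this hypothesis to every other system's output.
        score = sum(SequenceMatcher(None, hyp, o).ratio() for o in others) / len(others)
        if score > best_score:
            best_text, best_score = hyp, score
    return best_text, best_score


if __name__ == "__main__":
    # Hypothetical outputs of three recognizers for the same audio segment.
    outputs = ["来,哥哥再给你唱首歌", "来,哥哥再给你唱首歌", "来,哥哥在给你唱歌"]
    text, agreement = vote_hypotheses(outputs)
    print(text, round(agreement, 4))
```

An agreement score of this kind could serve as a rough filter for low-consensus segments. Whether the corpus's released text confidence is computed this way is not stated here.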

WenetSpeech-Chuan

Dataset Overview

  • Contains 10,000 hours of richly annotated Chuan-Yu dialect speech, making it the largest open-source resource for Chuan-Yu dialect speech research.
  • Stores metadata in a single JSON file, including audio path, duration, text confidence, speaker identity, SNR, DNSMOS, age, gender, and character-level timestamps. Additional metadata tags may be added in the future.
  • Covers ten domains, including short videos, entertainment, live streams, documentary, audiobook, drama, interview, news, and others.
  • [Figure: quality distribution]
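Because every utterance carries confidence and quality annotations, subsets for a given task can be selected directly from the metadata. The following is a minimal sketch under stated assumptions: the key names (`audio_path`, `duration`, `text_confidence`, `snr`, `dnsmos`) and the sample records are hypothetical and may not match the released JSON schema, and the thresholds are arbitrary.

```python
import json

# Hypothetical records for illustration only; the real schema may differ.
SAMPLE_METADATA = json.loads("""
[
  {"audio_path": "audio/0001.wav", "duration": 6.0, "text_confidence": 0.9792,
   "snr": 24.5, "dnsmos": 3.4, "gender": "Male", "age": "Middle Age"},
  {"audio_path": "audio/0002.wav", "duration": 4.0, "text_confidence": 0.9103,
   "snr": 30.0, "dnsmos": 3.6, "gender": "Female", "age": "Youth"},
  {"audio_path": "audio/0003.wav", "duration": 3.0, "text_confidence": 1.0,
   "snr": 9.0, "dnsmos": 2.1, "gender": "Female", "age": "Youth"},
  {"audio_path": "audio/0004.wav", "duration": 3.0, "text_confidence": 0.9545,
   "snr": 18.0, "dnsmos": 3.0, "gender": "Female", "age": "Youth"}
]
""")


def select_utterances(records, min_confidence=0.95, min_snr=15.0):
    """Keep records whose text confidence and SNR both meet the thresholds."""
    return [
        r for r in records
        if r["text_confidence"] >= min_confidence and r["snr"] >= min_snr
    ]


def total_hours(records):
    """Sum utterance durations (assumed to be in seconds) and convert to hours."""
    return sum(r["duration"] for r in records) / 3600.0


if __name__ == "__main__":
    subset = select_utterances(SAMPLE_METADATA)
    print(len(subset), "utterances,", total_hours(subset), "hours")
```

For the real corpus, `SAMPLE_METADATA` would be replaced by the contents of the released metadata file, loaded with `json.load`, after checking the actual key names.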

    Data Samples

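    Domain | Sample 1 | Sample 2
    (Each domain name is followed by its two samples; each sample lists the transcript, its confidence, and its emotion, gender, and age labels.)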
    Short Videos
    来,哥哥再给你唱首歌。好儿,哎呦,把伴奏给我放起来,放就放嘛,还要躲人家钩子。
    Confidence: 0.9792
    Emotion: Happiness | Gender: Male | Age: Middle Age

    对不起,只有二娃才能让我真正体会作为女人的快乐。
    Confidence: 0.9545
    Emotion: Sadness | Gender: Female | Age: Youth
    Live Stream
    我想去逛街,欢迎进入直播间,晚上好,那我的名字是怎么说的呢?
    Confidence: 1.0000
    Emotion: Happiness | Gender: Female | Age: Youth

    梦见的就是你,不行啊,有四川话根本唱不起来,根本唱不起来呀!
    Confidence: 0.9103
    Emotion: Happiness | Gender: Female | Age: Youth
    Drama
    就临走那天挑了个飘了一下嗨呀,弟弟灵魂儿就飞上九霄云,就飘着一下魂都飞了,对不对?
    Confidence: 0.9369
    Emotion: Surprise | Gender: Male | Age: Middle Age

    是不是给人感觉后头是青花亮色的然后说话是很平和的眼神是不慌乱的不散的。
    Confidence: 0.9608
    Emotion: Anger | Gender: Male | Age: Old
    Documentary
    他坐在椅子上,挺直起腰杆,脸上展现出灿烂的笑容。
    Confidence: 0.9841
    Emotion: Neutral | Gender: Male | Age: Middle Age

    唤起路由无限的感慨,使他,更加痛恨官场的欺诈污浊。
    Confidence: 0.9091
    Emotion: Sadness | Gender: Male | Age: Middle Age
    Audiobook
    看面貌约五十左右却自称活了两百多岁,在清顺治时出家当过和尚,还有杜蝶为证。
    Confidence: 0.9608
    Emotion: Neutral | Gender: Female | Age: Youth

    其言曰,士大夫以其见闻之广反各有所偏,自有负担杀者有负良骑者。
    Confidence: 0.938
    Emotion: Neutral | Gender: Female | Age: Youth
    News
    据说有网友坐飞机的时候呢,广播全程播报。
    Confidence: 1.0000
    Emotion: Neutral | Gender: Male | Age: Youth

    将溃疡两周以上都应该及时就医,据了解啊小云平时呢都喜欢吃比较烫的饭菜,也喜欢吃麻辣烫火锅之类的高温食物。
    Confidence: 0.9932
    Emotion: Happiness | Gender: Male | Age: Youth
    Entertainment
    绝佳好位置好像我被看到了,就问你敢不敢进来吧你,一套带走猪脚亮。
    Confidence: 0.987
    Emotion: Neutral | Gender: Female | Age: Youth

    两岸猿声啼不住,有家难回车里住。
    Confidence: 0.962
    Emotion: Neutral | Gender: Male | Age: Youth
    Reading
    杨大人一律就退还会再要求,以关注货币,来补助这个差额,天宝年间杨胜坚转任。
    Confidence: 0.9091
    Emotion: Anger | Gender: Male | Age: Middle Age

    做钱的速度还快,这真的是,一个经济爆发式增长的时代。
    Confidence: 0.9565
    Emotion: Surprise | Gender: Male | Age: Youth

    ASR Leaderboard

    The leaderboard reports ASR results (CER %, lower is better) on Sichuanese datasets.

    Note: Bold indicates best performance, underlined indicates second-best performance, and light green background indicates models finetuned on a high-quality internal corpus (to show the system's potential as a foundation model).

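    | Model | Model Size | WSC-Eval-ASR Easy | WSC-Eval-ASR Hard | WSC-Eval-ASR Total | Magicdata Conversation | Magicdata Daily-Use | Avg. |
    |---|---|---|---|---|---|---|---|
    | with LLM | | | | | | | |
    | Kimi-Audio | 7B | 16.65 | 28.66 | 17.66 | 24.67 | 5.77 | 18.68 |
    | FireRedASR-LLM | 8.3B | 12.80 | 25.27 | 14.40 | 17.68 | 6.69 | 15.37 |
    | Qwen2.5-omni | 3B | 16.94 | 26.01 | 18.20 | 20.40 | 6.32 | 17.69 |
    | Qwen2.5-omni-WSC-Finetune⭐ | 3B | 14.36 | 24.14 | 15.61 | 18.45 | 6.15 | 15.74 |
    | Qwen2.5-omni+internal data⭐ | 3B | 13.17 | 23.36 | 14.81 | 18.50 | 5.88 | 15.14 |
    | Qwen2.5-omni-WSC-Finetune + internal data⭐ | 3B | 12.93 | 23.19 | 14.25 | 17.95 | 5.89 | 14.84 |
    | without LLM | | | | | | | |
    | SenseVoice-small | 234M | 17.43 | 28.38 | 18.39 | 23.50 | 8.77 | 19.29 |
    | Whisper | 244M | 52.06 | 63.99 | 53.59 | 55.88 | 52.03 | 55.51 |
    | FireRedASR-AED | 1.1B | 13.29 | 23.64 | 14.62 | 17.84 | 6.69 | 15.14 |
    | Paraformer | 220M | 14.34 | 24.61 | 15.66 | 19.81 | 8.16 | 16.52 |
    | Paraformer-WSC-Finetune⭐ | 220M | 12.15 | 22.60 | 13.51 | 16.60 | 8.02 | 14.58 |
    | Paraformer + internal data⭐ | 220M | 11.93 | 21.82 | 13.14 | 15.61 | 6.77 | 13.85 |
    | Paraformer-WSC-Finetune + internal data⭐ | 220M | 11.59 | 21.59 | 12.87 | 14.59 | 6.28 | 13.38 |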
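For readers unfamiliar with the metric, character error rate (CER) is the character-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The sketch below shows the standard definition only. It is not the benchmark's official scoring script, and it applies no text normalization (punctuation removal, digit conversion), which real evaluations typically perform first. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent; the reference must be non-empty."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Made-up reference/hypothesis pair with one substituted character.
    print(cer("四川话很好听", "四川话真好听"))
```

Because insertions are counted against the reference length, CER can exceed 100% for very poor hypotheses.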

    TTS Evaluation

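    | Model | Easy CER(%)↓ | Easy SIM(%)↑ | Easy IMOS↑ | Easy SMOS↑ | Easy AMOS↑ | Hard CER(%)↓ | Hard SIM(%)↑ | Hard IMOS↑ | Hard SMOS↑ | Hard AMOS↑ |
    |---|---|---|---|---|---|---|---|---|---|---|
    | Step-Audio-TTS[21] | 10.83 | 67.66 | 3.81 | 2.86 | 3.15 | 12.52 | 54.52 | 3.75 | 2.77 | 3.06 |
    | CosyVoice 2.0[22] | 7.14 | 70.27 | 3.88 | 3.10 | 3.69 | 9.06 | 60.10 | 3.96 | 2.73 | 3.81 |
    | Qwen-TTS | 4.13 | - | 3.95 | - | 3.90 | 7.35 | - | 4.02 | - | 3.88 |
    | CosyVoice2-WSC⭐ | 4.28 | 72.78 | 4.13 | 3.94 | 4.05 | 8.78 | 62.59 | 3.85 | 2.78 | 3.92 |
    | CosyVoice2-WSC-SFT⭐ | 4.08 | 78.84 | 4.10 | 4.16 | 4.20 | 7.22 | 67.96 | 4.01 | 3.03 | 3.98 |

    Easy and Hard refer to WSC-Eval-TTS-easy and WSC-Eval-TTS-hard. "-": commercial system with a single fixed speaker; speaker similarity is not evaluated.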

    TTS Demo

    Model Comparison

    CosyVoice2-WSC is a CosyVoice2 model fine-tuned on WenetSpeech-Chuan.

    Llasa-1B-WSC is a Llasa-1B model fine-tuned on WenetSpeech-Chuan.


    CosyVoice2-WSC-SFT

    CosyVoice2-WSC-SFT is obtained by further supervised fine-tuning of CosyVoice2-WSC on 100 hours of internal high-quality data from two fixed speakers.

    Text Reference Audio Synthetic Audio
    我跟你说哦,这几天重庆的天气真的是又热又湿,简直没法忍受,出门走几步就全身湿透了。
    我们就当做好事一样,绝对不要让这样子的事情发生。领导答应之后哈,我就联系了他的妈老汉儿,把事情都跟他们的讲清楚了,然后把银行卡号拿到之后,按流程办理了退款。
    晚上去吃烧烤,朋友非要点个变态辣的鸡翅,说是过瘾,结果吃了一口眼泪都出来了,嘴巴像被火烧一样,服务员都吓得给我们送牛奶来。
    你说话能不能不要那么冲,我跟你讲不是每个人都惯着你耍性子的,我脾气也有点儿爆!