WenetSpeech-Yue: A Large-Scale Cantonese Speech Corpus with Multi-dimensional Annotation

Longhao Li¹*, Zhao Guo¹*, Hongjie Chen², Yuhang Dai¹, Ziyu Zhang¹, Hongfei Xue¹, Tianlun Zuo¹, Chengyou Wang¹, Shuiyuan Wang¹, Xin Xu³, Hui Bu³, Jie Li², Jian Kang², Binbin Zhang⁴, Ruibin Yuan⁵, Ziya Zhou⁵, Wei Xue⁵, Lei Xie¹

¹ Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
² Institute of Artificial Intelligence (TeleAI), China Telecom
³ Beijing AISHELL Technology Co., Ltd.
⁴ WeNet Open Source Community
⁵ Hong Kong University of Science and Technology

📑 Paper | 🐙 GitHub | 🤗 HuggingFace
🖥️ HuggingFace Space | 🎤 Demo Page | 💬 Contact Us

Abstract

The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these tasks, automatic speech recognition (ASR) and text-to-speech (TTS) are regarded as the most established and fundamental. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpora with multi-dimensional annotation tailored for speech understanding and generation. It comprises six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Post-Processing, and Recognizer Output Voting, which together enable rich, high-quality annotations. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains, with annotations including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline. The dataset, benchmark, and the ASR and TTS models built upon WenetSpeech-Yue will be open-sourced. Demos can be found in the supplementary material.

Promotional video

Cantonese

English

WenetSpeech-Pipe

WenetSpeech-Pipe is an automated pipeline designed for building large-scale Cantonese datasets with multi-dimensional annotation. It consists of six components: (A) Audio Collection, (B) Speaker Attributes Annotation, (C) Speech Quality Annotation, (D) Automatic Speech Recognition, (E) Text Post-Processing, and (F) Recognizer Output Voting. The figure below provides an overview of WenetSpeech-Pipe.
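
This page names the Recognizer Output Voting stage without describing its algorithm. Purely as an illustration of the general idea of combining several recognizers, the sketch below picks the hypothesis that agrees most with the others and reports that agreement as a rough confidence. It is a simplified stand-in, not ROVER and not the authors' implementation; the function names and the scoring rule are assumptions.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def vote_by_consensus(hypotheses):
    """Return (text, agreement), where text is the hypothesis closest to the rest.

    Agreement is the mean of 1 - normalized edit distance to every other
    hypothesis, so it lies in [0, 1]. This scoring rule is an assumption.
    """
    if len(hypotheses) < 2:
        raise ValueError("need at least two hypotheses to vote")
    best_text, best_score = None, -1.0
    for i, cand in enumerate(hypotheses):
        sims = []
        for j, other in enumerate(hypotheses):
            if i == j:
                continue
            longest = max(len(cand), len(other)) or 1
            sims.append(1.0 - edit_distance(cand, other) / longest)
        score = sum(sims) / len(sims)
        if score > best_score:  # ties keep the earliest hypothesis
            best_text, best_score = cand, score
    return best_text, best_score


# Three hypothetical recognizer outputs for one utterance.
hyps = [
    "香港嘅消费市场从此不一样",
    "香港嘅消费市场从此不一样",
    "香港的消费市场从此不一样",
]
text, agreement = vote_by_consensus(hyps)
print(text, round(agreement, 3))
```

Selecting a whole hypothesis keeps the sketch short; a slot-level scheme would first align the hypotheses and then vote token by token.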

WenetSpeech-Yue

Dataset Overview

Contains 21,800 hours of Cantonese speech with rich annotations, making it the largest open-source resource for Cantonese speech research.
Stores metadata in a single JSON file, including audio path, duration, text confidence, speaker identity, SNR, DNSMOS, age, gender, and character-level timestamps. Additional metadata tags may be added in the future.
Covers ten domains: Storytelling, Entertainment, Drama, Culture, Vlog, Commentary, Education, Podcast, News, and Others.
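
Because every annotation lives in one JSON file, selecting a task-specific subset can be a simple filter over the records. The sketch below is a minimal illustration. The key names (`audio_path`, `duration`, `confidence`, `snr`, `dnsmos`) are assumptions about the schema rather than the released field names, the file names and durations are invented, and the confidence, SNR, and DNSMOS values are borrowed from the data samples on this page.

```python
import json

# Hypothetical records; the real key names in the released JSON may differ.
SAMPLE_METADATA = json.loads("""
[
  {"audio_path": "a.wav", "duration": 2.1, "confidence": 0.900, "snr": 13.12, "dnsmos": 2.76},
  {"audio_path": "b.wav", "duration": 9.8, "confidence": 0.953, "snr": 48.8, "dnsmos": 3.87},
  {"audio_path": "c.wav", "duration": 3.0, "confidence": 0.912, "snr": 72.7, "dnsmos": 3.85}
]
""")


def select_utterances(records, min_confidence=0.0, min_dnsmos=0.0, min_snr=float("-inf")):
    """Keep records that meet every threshold on transcript and audio quality."""
    return [
        r for r in records
        if r["confidence"] >= min_confidence
        and r["dnsmos"] >= min_dnsmos
        and r["snr"] >= min_snr
    ]


# A stricter cut, as one might want for TTS training (thresholds are illustrative).
tts_subset = select_utterances(SAMPLE_METADATA, min_confidence=0.9, min_dnsmos=3.5, min_snr=30.0)
total_seconds = sum(r["duration"] for r in tts_subset)
print([r["audio_path"] for r in tts_subset], total_seconds)
```

ASR training can usually tolerate noisier audio than TTS, so in practice the thresholds would be tuned per task.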

Data Samples

Domain	Sample 1	Sample 2
Storytelling	两只小企鹅都有嘢食 Confidence: 0.900 Speaker: gd0006277_SPEAKER_01 \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 2.76 \| SNR: 13.12 dB	刘备仲马鞭一指蜀兵一齐掩杀过去打到吴兵大败唉刘备八路兵马以雷霆万钧之势啊杀到吴兵啊尸横遍野血流成河 Confidence: 0.953 Speaker: gd0046360_SPEAKER_01 \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 3.87 \| SNR: 48.8 dB
Entertainment	叫做诶诶直入式你个脑部里边咧记得呢一个嘅以前香港有一个广告好出名嘅佢乜嘢都冇噶净系影住喺弥敦道佢哋间铺头嘅啫但系就不停有人嗌啦平平吧平吧 Confidence: 0.807 Speaker: multispk \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 3.88 \| SNR: 22.9 dB	原来王力宏咧系佢家中里面咧成就最低个吓哇 Confidence: 0.850 Speaker: multispk \| Gender: Male \| Age: Old Sampling rate: 16kHz \| DNSMOS: 3.38 \| SNR: 19.8 dB
Drama	忽然从光线死角嘅阴影度窜出一只大猫 Confidence: 0.912 Speaker: gd0040831_SPEAKER_00 \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 3.85 \| SNR: 72.7 dB	无论你提出任何嘅要求 Confidence: 0.950 Speaker: gd0039300_SPEAKER_00 \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 3.83 \| SNR: 65.6 dB
Vlog	今日我带大家去见识一位九零后嘅靓仔咧 Confidence: 0.944 Speaker: gd0015582_SPEAKER_01 \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 3.52 \| SNR: 6.18 dB	咁咁多样材料咁我哋首先第一步处理咗一件 Confidence: 0.868 Speaker: gd0008289_SPEAKER_00 \| Gender: Male \| Age: Old Sampling rate: 16kHz \| DNSMOS: 2.95 \| SNR: 25.4 dB
Commentary	香港嘅消费市场从此不一样 Confidence: 1.000 Speaker: xg0011541_SPEAKER_02 \| Gender: Male \| Age: Youth Sampling rate: 16kHz \| DNSMOS: 2.86 \| SNR: 15.1 dB	啲点样对于佢哋嘅服务态度啊不透过呢一年左右嘅时间啦其实大家都静一静啦咁你就会见到香港嘅经济其实 Confidence: 0.938 Speaker: xg0011541_SPEAKER_02 \| Gender: Male \| Age: Youth Sampling rate: 16kHz \| DNSMOS: 2.76 \| SNR: 15.3 dB
Podcast	景天谂唔到呢个守门嘅弟子竟然咁无礼霎时间面色都变埋 Confidence: 0.98 Speaker: gd0039538_SPEAKER_02 \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 3.74 \| SNR: 70.6 dB	就即刻会同贵正两位八代长老带埋五名七代弟子前啲灵蛇岛想话生擒谢信抢咗屠龙宝刀翻嚟献俾帮主嘅 Confidence: 0.856 Speaker: gd0048640_SPEAKER_00 \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 3.56 \| SNR: 49.9 dB
Education	六个星期嘅课程包括六堂课同两个测验你唔掌握到基本嘅十九个声母五十六个韵母同九个声调我哋仲针对咗广东话学习者会遇到嘅大樽颈啊以国语为母语人士最难掌握嘅五大韵母教课书唔会教你嘅七种变音同十种变调说话生硬唔自然嘅根本性问题提供全新嘅学习方向等你突破难关 Confidence: 0.987 Speaker: xg0054024_SPEAKER_01 \| Gender: Male \| Age: Youth Sampling rate: 16kHz \| DNSMOS: 3.41 \| SNR: 17.9 dB	我知道我的观众大部分都是对广东话有兴趣想学广东话的人 Confidence: 0.962 Speaker: xg0054024_SPEAKER_01 \| Gender: Male \| Age: Youth Sampling rate: 16kHz \| DNSMOS: 2.99 \| SNR: 24.4 dB
Culture	同意嘅累积唔系阴同阳嘅累积可以讲三既融合咗一同意融合咗阴同阳 Confidence: 0.900 Speaker: 204826042729_SPEAKER_05 \| Gender: Female \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 3.71 \| SNR: 69.9 dB	诶原来啊我哋中国人呢讲究物极必反 Confidence: 0.833 Speaker: 00960807120_SPEAKER_08 \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 2.90 \| SNR: 22.1 dB
News	而较早前已经复航嘅氹仔北安码头星期五开始增设夜间航班不过两个码头暂时都冇凌晨班次有旅客希望尽快恢复可以留喺澳门长啲时间 Confidence: 0.994 Speaker: xg0055639_SPEAKER_01 \| Gender: Female \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 3.11 \| SNR: 25.9 dB	如果东边道建成咁丹东呢就会成为最近嘅出海港同埋经过哈大线出海相比绥分河则会减少运渠三百五十六公里 Confidence: 0.924 Speaker: 230636120099_SPEAKER_09 \| Gender: Male \| Age: Middle Age Sampling rate: 16kHz \| DNSMOS: 4.12 \| SNR: 70.5 dB

ASR Leaderboard

Model	#Params (M)	In-House Dialogue	In-House Reading	Open-Source yue	Open-Source HK	Open-Source MDCC	Open-Source Daily_Use	Open-Source Commands	WSYue-eval Short	WSYue-eval Long
w/o LLM
Conformer-Yue⭐	130	16.57	7.82	7.72	11.42	5.73	5.73	8.97	5.05	8.89
Paraformer	220	83.22	51.97	70.16	68.49	47.67	79.31	69.32	73.64	89.00
SenseVoice-small	234	21.08	6.52	8.05	7.34	6.34	5.74	6.65	6.69	9.95
SenseVoice-s-Yue⭐	234	19.19	6.71	6.87	8.68	5.43	5.24	6.93	5.23	8.63
Dolphin-small	372	59.20	7.38	39.69	51.29	26.39	7.21	9.68	32.32	58.20
TeleASR	700	37.18	7.27	7.02	7.88	6.25	8.02	5.98	6.23	11.33
Whisper-medium	769	75.50	68.69	59.44	62.50	62.31	64.41	80.41	80.82	50.96
Whisper-m-Yue⭐	769	18.69	6.86	6.86	11.03	5.49	4.70	8.51	5.05	8.05
FireRedASR-AED-L	1100	73.70	18.72	43.93	43.33	34.53	48.05	49.99	55.37	50.26
Whisper-large-v3	1550	45.09	15.46	12.85	16.36	14.63	17.84	20.70	12.95	26.86
w/ LLM
Qwen2.5-Omni-3B	3000	72.01	7.49	12.59	11.75	38.91	10.59	25.78	67.95	88.46
Kimi-Audio	7000	68.65	24.34	40.90	38.72	30.72	44.29	45.54	50.86	33.49
FireRedASR-LLM-L	8300	73.70	18.72	43.93	43.33	34.53	48.05	49.99	49.87	45.92
Conformer-LLM-Yue⭐	4200	17.22	6.21	6.23	9.52	4.35	4.57	6.98	4.73	7.91

TTS Evaluation

The table below presents both objective and subjective evaluation results of different TTS systems on the WSYue-TTS-eval benchmark. Objective metrics include Mixed Error Rate (MER) and speaker similarity (SIM) on both the Base and Coverage test sets. Subjective metrics include UTMOSv2, Intelligibility MOS (I-MOS), Speaker Similarity MOS (S-MOS), and Audio Naturalness MOS (A-MOS).
Llasa-1B-Yue is our model trained on large-scale Cantonese data and achieves the best performance on most metrics.
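
The page does not define how MER is tokenized. A common convention for mixed Chinese-English text, assumed here, is to count each Chinese character and each English word as one token and divide the token-level edit distance by the reference length. The sketch below follows that convention; the function names and the tokenization regex are illustrative, not the benchmark's official scoring script.

```python
import re

# Assumed tokenization: one token per CJK character (basic block only) or
# per run of Latin letters, digits, and apostrophes. Punctuation is dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize_mixed(text):
    """Split text into Chinese characters and lowercased English words."""
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mixed_error_rate(reference, hypothesis):
    """MER in percent: token-level edits divided by reference token count."""
    ref = tokenize_mixed(reference)
    hyp = tokenize_mixed(hypothesis)
    if not ref:
        raise ValueError("reference contains no tokens")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# The reference has 5 tokens (4 characters + 1 English word); one is deleted.
print(mixed_error_rate("我哋去 shopping 啦", "我哋去 shopping"))
```

Characters outside the basic CJK block (for example, rare Cantonese characters in the extension blocks) would need a wider character class than the one assumed here.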

Objective and subjective evaluation results of different TTS systems on the WSYue-TTS-eval benchmark.
Model	Base MER (%)	Base SIM	Coverage MER (%)	Coverage SIM	UTMOSv2	I-MOS	S-MOS	A-MOS
Llasa-1B	53.31	0.732	43.68	0.754	2.360	2.60 ± 1.01	3.05 ± 0.87	2.32 ± 0.98
Step-Audio-TTS-3B	27.79	0.762	24.25	0.781	2.496	3.22 ± 0.70	3.14 ± 0.58	2.82 ± 0.69
CosyVoice2	14.38	0.812	13.74	0.826	2.989	3.72 ± 0.50	3.52 ± 0.36	3.22 ± 0.60
Edge-TTS^†	8.30	-	9.27	-	2.997	4.12 ± 0.28	-	3.48 ± 0.56
Llasa-1B-Yue	10.89	0.762	12.78	0.772	2.696	4.30 ± 0.23	4.11 ± 0.37	4.34 ± 0.34
CosyVoice2-Yue	10.33	0.821	9.49	0.834	3.021	4.45 ± 0.16	3.78 ± 0.53	4.21 ± 0.27
Llasa-1B-Yue-Updated^*	12.25	0.502	8.18	0.537	2.889	-	-	-

^† Commercial system with a single fixed speaker; speaker similarity is not evaluated.

^* Llasa-1B-Yue-Updated is newly added, so subjective metrics (I-MOS, S-MOS, A-MOS) were not evaluated in this version.

TTS Demo

The presented audio samples are outputs of the evaluated models on the WSYue-TTS-eval test set.

Model	Reference Audio	Text	Synthetic Audio
CosyVoice2-Yue-Databaker		爲咗配合市區道路維修工程，運輸署宣佈，由下星期一凌晨零時起，彌敦道部分路段會分階段封閉，預計工程會持續至下月十五號，期間請駕駛人士留意交通改道安排，並遵從現場工作人員指示，以免影響行車安全。
CosyVoice2-Yue-ZoengJyutGaai		严嵩，明朝历史上唯一一个廿年牢牢掌控住内阁嘅首辅。明史认定佢系奸臣之首，佢喺朝廷党羽遍布，权势滔天，但系就算最后家产畀人抄晒、个仔畀人斩咗头，佢都能够独身保命，一直安然无恙，直到寿终正寝。

Model Descriptions:
1. CosyVoice2-Yue-Databaker: CosyVoice2-Yue fine-tuned on studio-quality anchor data provided by DataBaker.
2. CosyVoice2-Yue-ZoengJyutGaai: Fine-tuned on the Zoeng Jyut Gaai Story-telling Speech Dataset.
