
Speaker-Reasoner

Scaling Interaction Turns and Reasoning Patterns
for Timestamped Speaker-Attributed ASR

Zhennan Lin¹, Shuai Wang², Zhaokai Sun¹, Pengyuan Xie³, Chuan Xie³, Jie Liu³, Qiang Zhang³, Lei Xie¹†

1 ASLP@NPU, Northwestern Polytechnical University  |  2 Nanjing University  |  3 Shanghai Lingguang Zhaxian Technology


Agentic Multi-Turn Temporal Reasoning

Speaker-Reasoner is an end-to-end Speech LLM for timestamped speaker-attributed ASR. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis — jointly modeling speaker identity, gender, timestamps, and transcription.
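
A minimal sketch of how this global-to-local loop can be orchestrated; the `model.chat` interface, the prompt wording, and the turn structure are illustrative assumptions, not the released Speaker-Reasoner API:

```python
# Illustrative sketch of the agentic global-to-local inference loop.
# The `model.chat` interface and prompts are assumptions made for
# exposition; they are not the released Speaker-Reasoner API.

def agentic_infer(model, audio):
    # Turn 1: global analysis of the whole recording -- how many speakers,
    # their genders, and the rough conversational structure.
    summary = model.chat(audio, prompt="Summarize the speakers and structure.")

    # Turn 2: the model autonomously proposes temporal boundaries,
    # splitting the recording into segments worth analyzing closely.
    boundaries = model.chat(audio, prompt="Propose segment boundaries.",
                            context=summary)

    # Turns 3..N: fine-grained decoding of each segment, conditioned on
    # the global summary so speaker labels stay consistent throughout.
    results = []
    for start, end in boundaries:
        seg = model.chat(audio, context=summary,
                         prompt=f"Transcribe {start:.2f}-{end:.2f}s with "
                                f"speaker, gender, and timestamps.")
        results.append((start, end, seg))
    return results
```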

🔄 Agentic Multi-Turn Reasoning: iterative global-to-local inference (global speaker summary → boundary prediction → fine-grained segment decoding).
🧠 Speaker-Aware Context Cache: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks (see the sketch after this list).
📈 Three-Stage Progressive Training: multi-task foundation → temporal interaction learning → cache-conditioned decoding, for robust performance.
🏆 State-of-the-Art Performance: outperforms strong baselines, including the closed-source Gemini-2.5-Pro, on AliMeeting and AISHELL-4.
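
The speaker-aware context cache admits a short operational description: audio longer than the training context window is decoded chunk by chunk, and a profile of every speaker identified so far is carried into the next chunk so that labels stay consistent. A minimal sketch under those assumptions (`decode_chunk`, the audio object, and the cache layout are all hypothetical):

```python
from typing import Dict, List, Tuple

def long_form_decode(model, audio, chunk_sec: float = 30.0):
    """Chunked long-form decoding with a speaker-aware cache (illustrative).

    `model.decode_chunk` and the cache layout are hypothetical stand-ins
    for whatever Speaker-Reasoner actually carries between chunks.
    """
    cache: Dict[str, dict] = {}  # speaker label -> profile reused downstream
    transcript: List[Tuple[float, float, str, str]] = []

    start = 0.0
    while start < audio.duration:
        chunk = audio.slice(start, start + chunk_sec)
        # Conditioning on the cache keeps "Speaker A" in a late chunk
        # identical to "Speaker A" from the first chunk.
        segments, new_profiles = model.decode_chunk(chunk, speaker_cache=cache)
        cache.update(new_profiles)
        for seg_start, seg_end, speaker, text in segments:
            transcript.append((start + seg_start, start + seg_end,
                               speaker, text))
        start += chunk_sec
    return transcript
```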

Comprehensive Multi-Domain Evaluation

Speaker-Reasoner-4194h demonstrates superior performance across diverse scenarios: video domains, meeting datasets, and cross-lingual benchmarks. All metrics are lower-is-better (↓), and ∆cp denotes cpWER − WER, the extra error introduced by speaker attribution.
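
For readers reproducing the speaker-attributed scores: cpWER concatenates each speaker's utterances and scores them under the speaker permutation that minimizes total word errors. A minimal sketch using the jiwer package, assuming equal speaker counts on both sides and pre-concatenated per-speaker transcripts:

```python
from itertools import permutations

import jiwer  # assumed available: pip install jiwer

def cp_wer(ref_by_spk: dict, hyp_by_spk: dict) -> float:
    """Concatenated-minimum-permutation WER (cpWER), sketch version.

    Each dict maps a speaker label to that speaker's full concatenated
    transcript; assumes equal speaker counts on both sides.
    """
    ref_ids = list(ref_by_spk)
    total_ref_words = sum(len(t.split()) for t in ref_by_spk.values())
    best_errors = float("inf")
    # Score every assignment of hypothesis speakers to reference speakers
    # and keep the cheapest one (fine for meeting-scale speaker counts).
    for perm in permutations(hyp_by_spk):
        errors = 0.0
        for ref_id, hyp_id in zip(ref_ids, perm):
            n_ref = len(ref_by_spk[ref_id].split())
            # jiwer.wer returns errors / reference-words for this pair.
            errors += jiwer.wer(ref_by_spk[ref_id], hyp_by_spk[hyp_id]) * n_ref
        best_errors = min(best_errors, errors)
    return best_errors / total_ref_words
```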

Video Domain Evaluation (each cell: WER↓ / cpWER↓ / DER↓ / ∆cp↓)

| Model | Video-Internal-Eval (ZH/EN) | Video-Internal-Eval-zh (ZH) | Video-Internal-Eval-en (EN) |
|---|---|---|---|
| Gemini-2.5-Pro | 22.47 / 44.13 / 74.05 / 21.66 | 18.28 / 40.97 / 69.35 / 22.69 | 55.40 / 68.82 / 100.95 / 13.42 |
| VibeVoice-ASR [1] | 16.45 / 58.60 / 47.18 / 42.15 | 17.70 / 62.06 / 47.65 / 44.36 | 7.11 / 32.65 / 44.62 / 25.54 |
| Speaker-Reasoner (Ours, 4194h) | 6.27 / 24.43 / 15.33 / 18.16 | 6.50 / 25.81 / 16.68 / 19.31 | 4.42 / 16.31 / 7.58 / 11.89 |
Meeting Benchmark Evaluation (each cell: WER↓ / cpWER↓ / DER↓ / ∆cp↓)

| Model | AISHELL4-Eval [2] (ZH) | AliMeeting-Far [3] (ZH) | AMI-SDM [5] (EN) |
|---|---|---|---|
| Gemini-2.5-Pro | 19.81 / 25.11 / 36.07 / 5.30 | 30.16 / 39.29 / 56.39 / 9.13 | 31.66 / 39.98 / 50.28 / 8.32 |
| VibeVoice-ASR [1] | 22.19 / 26.16 / 8.94 / 3.97 | 34.31 / 39.92 / 19.62 / 5.61 | 30.53 / 35.86 / 21.00 / 5.33 |
| Speaker-Reasoner (Ours, 4194h) | 7.13 / 8.14 / 3.38 / 1.01 | 19.72 / 19.92 / 6.70 / 0.20 | 23.29 / 25.16 / 13.56 / 1.87 |
MLC-SLM Challenge Evaluation (each cell: WER↓ / cpWER↓ / DER↓ / ∆cp↓)

| Model | MLC-SLM Track 1 [4] (EN, multi-accent) | MLC-SLM Track 2 [4] (EN, multi-accent) |
|---|---|---|
| Gemini-2.5-Pro | 36.87 / 41.88 / 42.33 / 5.01 | 26.73 / 32.19 / 46.19 / 5.46 |
| VibeVoice-ASR [1] | 10.30 / 13.45 / 6.27 / 3.15 | 7.97 / 11.38 / 3.14 / 3.41 |
| Speaker-Reasoner (Ours, 4194h) | 9.17 / 11.74 / 4.76 / 2.57 | 8.54 / 11.76 / 4.35 / 3.22 |

Live Transcription Examples

Real-world audio processed by Speaker-Reasoner-4194h, with timestamped speaker-attributed transcription. Transcripts are synchronized with audio playback, and overlapping speech from multiple speakers is highlighted simultaneously.
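
Each transcript follows the timestamped, speaker-attributed format described above; a hypothetical rendering of two overlapping lines (the exact syntax is illustrative, not the model's literal output):

```
[00:12.40 - 00:15.10] Speaker A (female): <transcribed text>
[00:14.85 - 00:16.20] Speaker B (male):   <overlapping turn>
```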

1. 《三国演义》: Models rapid turn-taking and frequent backchannels in multi-talker scenarios. Chinese (ZH), 3 speakers.
2. 《乌海》: Resolves overlapping speech and dynamic acoustic variation in argumentative dialogues. Chinese (ZH), 2 speakers.
3. 《征服》: Transcribes regional accents and informal slang in noisy environments. Chinese (ZH), 2 speakers.
4. 《Found A Homeless Billionaire Husband》: Processes highly dynamic emotional speech under noisy acoustic conditions. English (EN), 3 speakers.

References

[1] Z. Peng, J. Yu, Y. Chang et al., “VibeVoice-ASR Technical Report,” CoRR, vol. abs/2601.18184, 2026.
[2] Y. Fu, L. Cheng, S. Lv et al., “AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario,” in Proc. Interspeech, pp. 3665–3669, 2021.
[3] Y. Liang, M. Shi, F. Yu et al., “The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT 2.0): A Benchmark for Speaker-Attributed ASR,” in Proc. ASRU, pp. 1–8, IEEE, 2023.
[4] B. Mu, P. Guo, Z. Sun et al., “Summary on the Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods,” in Proc. ICASSP, pp. 19442–19446, IEEE, 2026.
[5] J. Carletta, S. Ashby, S. Bourban et al., “The AMI Meeting Corpus: A Pre-announcement,” in Machine Learning for Multimodal Interaction (MLMI), LNCS, pp. 28–39, Springer, 2005.