
Speaker-Reasoner

Scaling Interaction Turns and Reasoning Patterns
for Timestamped Speaker-Attributed ASR

Zhennan Lin¹, Shuai Wang², Zhaokai Sun¹, Pengyuan Xie³, Chuan Xie³, Jie Liu³, Qiang Zhang³, Lei Xie¹†

1 ASLP@NPU, Northwestern Polytechnical University  |  2 Nanjing University  |  3 Shanghai Lingguang Zhaxian Technology


Agentic Multi-Turn Temporal Reasoning

Speaker-Reasoner is an end-to-end Speech LLM for timestamped speaker-attributed ASR. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis — jointly modeling speaker identity, gender, timestamps, and transcription.
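
A minimal sketch of how this global-to-local loop can be orchestrated; the `model.chat` interface, the prompt wording, and the turn structure are illustrative assumptions, not the released Speaker-Reasoner API:

```python
# Illustrative sketch of the agentic global-to-local inference loop.
# The `model.chat` interface and prompts are assumptions made for
# exposition; they are not the released Speaker-Reasoner API.

def agentic_infer(model, audio):
    # Turn 1: global analysis of the whole recording -- how many speakers,
    # their genders, and the rough conversational structure.
    summary = model.chat(audio, prompt="Summarize the speakers and structure.")

    # Turn 2: the model autonomously proposes temporal boundaries,
    # splitting the recording into segments worth analyzing closely.
    boundaries = model.chat(audio, prompt="Propose segment boundaries.",
                            context=summary)

    # Turns 3..N: fine-grained decoding of each segment, conditioned on
    # the global summary so speaker labels stay consistent throughout.
    results = []
    for start, end in boundaries:
        seg = model.chat(audio, context=summary,
                         prompt=f"Transcribe {start:.2f}-{end:.2f}s with "
                                f"speaker, gender, and timestamps.")
        results.append((start, end, seg))
    return results
```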

🔄 Agentic Multi-Turn Reasoning: iterative global-to-local inference (global speaker summary → boundary prediction → fine-grained segment decoding).
🧠 Speaker-Aware Context Cache: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks (see the sketch after this list).
📈 Three-Stage Progressive Training: multi-task foundation → temporal interaction learning → cache-conditioned decoding, for robust performance.
🏆 State-of-the-Art Performance: outperforms strong baselines, including the closed-source Gemini-2.5-Pro, on AliMeeting and AISHELL-4.
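
The speaker-aware context cache admits a short operational description: audio longer than the training context window is decoded chunk by chunk, and a profile of every speaker identified so far is carried into the next chunk so that labels stay consistent. A minimal sketch under those assumptions (`decode_chunk`, the audio object, and the cache layout are all hypothetical):

```python
from typing import Dict, List, Tuple

def long_form_decode(model, audio, chunk_sec: float = 30.0):
    """Chunked long-form decoding with a speaker-aware cache (illustrative).

    `model.decode_chunk` and the cache layout are hypothetical stand-ins
    for whatever Speaker-Reasoner actually carries between chunks.
    """
    cache: Dict[str, dict] = {}  # speaker label -> profile reused downstream
    transcript: List[Tuple[float, float, str, str]] = []

    start = 0.0
    while start < audio.duration:
        chunk = audio.slice(start, start + chunk_sec)
        # Conditioning on the cache keeps "Speaker A" in a late chunk
        # identical to "Speaker A" from the first chunk.
        segments, new_profiles = model.decode_chunk(chunk, speaker_cache=cache)
        cache.update(new_profiles)
        for seg_start, seg_end, speaker, text in segments:
            transcript.append((start + seg_start, start + seg_end,
                               speaker, text))
        start += chunk_sec
    return transcript
```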

Comprehensive Multi-Domain Evaluation

Speaker-Reasoner-4194h demonstrates superior performance across diverse scenarios: video domains, meeting datasets, and cross-lingual benchmarks. All metrics are lower-is-better (↓), and ∆cp denotes cpWER − WER, the extra error introduced by speaker attribution.
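
For readers reproducing the speaker-attributed scores: cpWER concatenates each speaker's utterances and scores them under the speaker permutation that minimizes total word errors. A minimal sketch using the jiwer package, assuming equal speaker counts on both sides and pre-concatenated per-speaker transcripts:

```python
from itertools import permutations

import jiwer  # assumed available: pip install jiwer

def cp_wer(ref_by_spk: dict, hyp_by_spk: dict) -> float:
    """Concatenated-minimum-permutation WER (cpWER), sketch version.

    Each dict maps a speaker label to that speaker's full concatenated
    transcript; assumes equal speaker counts on both sides.
    """
    ref_ids = list(ref_by_spk)
    total_ref_words = sum(len(t.split()) for t in ref_by_spk.values())
    best_errors = float("inf")
    # Score every assignment of hypothesis speakers to reference speakers
    # and keep the cheapest one (fine for meeting-scale speaker counts).
    for perm in permutations(hyp_by_spk):
        errors = 0.0
        for ref_id, hyp_id in zip(ref_ids, perm):
            n_ref = len(ref_by_spk[ref_id].split())
            # jiwer.wer returns errors / reference-words for this pair.
            errors += jiwer.wer(ref_by_spk[ref_id], hyp_by_spk[hyp_id]) * n_ref
        best_errors = min(best_errors, errors)
    return best_errors / total_ref_words
```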

Video Domain Evaluation (each cell: WER↓ / cpWER↓ / DER↓ / ∆cp↓)

| Model | Video-Internal-Eval (ZH/EN) | Video-Internal-Eval-zh (ZH) | Video-Internal-Eval-en (EN) |
|---|---|---|---|
| Gemini-2.5-Pro | 22.47 / 44.13 / 74.05 / 21.66 | 18.28 / 40.97 / 69.35 / 22.69 | 55.40 / 68.82 / 100.95 / 13.42 |
| VibeVoice-ASR [1] | 16.45 / 58.60 / 47.18 / 42.15 | 17.70 / 62.06 / 47.65 / 44.36 | 7.11 / 32.65 / 44.62 / 25.54 |
| Speaker-Reasoner (Ours, 4194h) | 6.27 / 24.43 / 15.33 / 18.16 | 6.50 / 25.81 / 16.68 / 19.31 | 4.42 / 16.31 / 7.58 / 11.89 |
Meeting Benchmark Evaluation (each cell: WER↓ / cpWER↓ / DER↓ / ∆cp↓)

| Model | AISHELL4-Eval [2] (ZH) | AliMeeting-Far [3] (ZH) | AMI-SDM [5] (EN) |
|---|---|---|---|
| Gemini-2.5-Pro | 19.81 / 25.11 / 36.07 / 5.30 | 30.16 / 39.29 / 56.39 / 9.13 | 31.66 / 39.98 / 50.28 / 8.32 |
| VibeVoice-ASR [1] | 22.19 / 26.16 / 8.94 / 3.97 | 34.31 / 39.92 / 19.62 / 5.61 | 30.53 / 35.86 / 21.00 / 5.33 |
| Speaker-Reasoner (Ours, 4194h) | 7.13 / 8.14 / 3.38 / 1.01 | 19.72 / 19.92 / 6.70 / 0.20 | 23.29 / 25.16 / 13.56 / 1.87 |
MLC-SLM Challenge Evaluation (each cell: WER↓ / cpWER↓ / DER↓ / ∆cp↓)

| Model | MLC-SLM Track 1 [4] (EN, multi-accent) | MLC-SLM Track 2 [4] (EN, multi-accent) |
|---|---|---|
| Gemini-2.5-Pro | 36.87 / 41.88 / 42.33 / 5.01 | 26.73 / 32.19 / 46.19 / 5.46 |
| VibeVoice-ASR [1] | 10.30 / 13.45 / 6.27 / 3.15 | 7.97 / 11.38 / 3.14 / 3.41 |
| Speaker-Reasoner (Ours, 4194h) | 9.17 / 11.74 / 4.76 / 2.57 | 8.54 / 11.76 / 4.35 / 3.22 |

Live Transcription Examples

Real-world audio processed by Speaker-Reasoner-4194h, with timestamped speaker-attributed transcription. Transcripts are synchronized with audio playback, and overlapping speech from multiple speakers is highlighted simultaneously.
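
Each transcript follows the timestamped, speaker-attributed format described above; a hypothetical rendering of two overlapping lines (the exact syntax is illustrative, not the model's literal output):

```
[00:12.40 - 00:15.10] Speaker A (female): <transcribed text>
[00:14.85 - 00:16.20] Speaker B (male):   <overlapping turn>
```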

1. 《三国演义》: Models rapid turn-taking and frequent backchannels in multi-talker scenarios. Chinese (ZH), 3 speakers.
2. 《乌海》: Resolves overlapping speech and dynamic acoustic variation in argumentative dialogues. Chinese (ZH), 2 speakers.
3. 《征服》: Transcribes regional accents and informal slang in noisy environments. Chinese (ZH), 2 speakers.
4. 《Found A Homeless Billionaire Husband》: Processes highly dynamic emotional speech under noisy acoustic conditions. English (EN), 3 speakers.

References

[1] Z. Peng, J. Yu, Y. Chang et al., “VibeVoice-ASR Technical Report,” CoRR, vol. abs/2601.18184, 2026.
[2] Y. Fu, L. Cheng, S. Lv et al., “AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario,” in Proc. Interspeech, pp. 3665–3669, 2021.
[3] Y. Liang, M. Shi, F. Yu et al., “The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT 2.0): A Benchmark for Speaker-Attributed ASR,” in Proc. ASRU, pp. 1–8, IEEE, 2023.
[4] B. Mu, P. Guo, Z. Sun et al., “Summary on the Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods,” in Proc. ICASSP, pp. 19442–19446, IEEE, 2026.
[5] J. Carletta, S. Ashby, S. Bourban et al., “The AMI Meeting Corpus: A Pre-announcement,” in Machine Learning for Multimodal Interaction (MLMI), LNCS, pp. 28–39, Springer, 2005.