Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
1 ASLP@NPU, Northwestern Polytechnical University | 2 Nanjing University | 3 Shanghai Lingguang Zhaxian Technology
Speaker-Reasoner is an end-to-end Speech LLM for timestamped speaker-attributed ASR. Instead of running single-pass inference, the model works over multiple turns: it analyzes the global audio structure, autonomously predicts temporal boundaries, and then performs fine-grained analysis of each segment, jointly modeling speaker identity, gender, timestamps, and transcription.
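The multi-turn flow described above can be pictured as a small driver loop. The following is only a sketch under stated assumptions, not the authors' implementation: the `Segment` fields mirror the four jointly modeled outputs, while `run_multi_turn`, the three callables, and the fixed 30-second stand-in boundary are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One timestamped, speaker-attributed utterance."""
    speaker: str   # speaker label, e.g. "spk1"
    gender: str    # e.g. "female", "male"
    start: float   # seconds from the start of the recording
    end: float     # seconds from the start of the recording
    text: str      # transcription of the utterance


def run_multi_turn(
    duration: float,
    analyze_global: Callable[[float], str],
    predict_boundary: Callable[[float, float, str], float],
    analyze_segment: Callable[[float, float, str], List[Segment]],
) -> List[Segment]:
    """Hypothetical driver loop: one global turn, then one turn per chunk."""
    context = analyze_global(duration)  # first turn: global audio structure
    results: List[Segment] = []
    cursor = 0.0
    while cursor < duration:
        # The model itself proposes where the next chunk ends.
        end = min(predict_boundary(cursor, duration, context), duration)
        if end <= cursor:
            raise ValueError("predicted boundary must move forward")
        # Fine-grained turn: speaker, gender, timestamps, text for this chunk.
        results.extend(analyze_segment(cursor, end, context))
        cursor = end
    return results


# Stand-in callables so the sketch runs without a model (all hypothetical).
def fake_global(duration: float) -> str:
    return f"{duration:.0f}s recording, 2 speakers"


def fake_boundary(cursor: float, duration: float, context: str) -> float:
    return cursor + 30.0  # fixed 30 s chunks; a real model would choose


def fake_segment(start: float, end: float, context: str) -> List[Segment]:
    return [Segment("spk1", "unknown", start, end, "<transcript>")]


segments = run_multi_turn(70.0, fake_global, fake_boundary, fake_segment)
for seg in segments:
    print(f"[{seg.start:.1f}-{seg.end:.1f}] {seg.speaker}: {seg.text}")
```

In a real system each callable would be one interaction turn with the Speech LLM. The loop only shows how predicted boundaries drive the number of turns.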
Speaker-Reasoner-4194h is evaluated across diverse scenarios: video domains, meeting datasets, and cross-lingual benchmarks. It outperforms Gemini-2.5-Pro and VibeVoice-ASR on every metric except on MLC-SLM-Track2, where VibeVoice-ASR has lower WER, cpWER, and DER. All metrics are lower-is-better (↓).

| Model | Video-Internal-Eval (ZH/EN) | | | | Video-Internal-Eval-zh (ZH) | | | | Video-Internal-Eval-en (EN) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | WER↓ | cpWER↓ | DER↓ | ∆cp↓ | WER↓ | cpWER↓ | DER↓ | ∆cp↓ | WER↓ | cpWER↓ | DER↓ | ∆cp↓ |
| Gemini-2.5-Pro | 22.47 | 44.13 | 74.05 | 21.66 | 18.28 | 40.97 | 69.35 | 22.69 | 55.40 | 68.82 | 100.95 | 13.42 |
| VibeVoice-ASR[1] | 16.45 | 58.60 | 47.18 | 42.15 | 17.70 | 62.06 | 47.65 | 44.36 | 7.11 | 32.65 | 44.62 | 25.54 |
| Speaker-Reasoner (Ours, 4194h) | 6.27 | 24.43 | 15.33 | 18.16 | 6.50 | 25.81 | 16.68 | 19.31 | 4.42 | 16.31 | 7.58 | 11.89 |

| Model | AISHELL4-Eval [2] (ZH) | | | | Alimeeting-Far [3] (ZH) | | | | AMI-SDM [5] (EN) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | WER↓ | cpWER↓ | DER↓ | ∆cp↓ | WER↓ | cpWER↓ | DER↓ | ∆cp↓ | WER↓ | cpWER↓ | DER↓ | ∆cp↓ |
| Gemini-2.5-Pro | 19.81 | 25.11 | 36.07 | 5.30 | 30.16 | 39.29 | 56.39 | 9.13 | 31.66 | 39.98 | 50.28 | 8.32 |
| VibeVoice-ASR[1] | 22.19 | 26.16 | 8.94 | 3.97 | 34.31 | 39.92 | 19.62 | 5.61 | 30.53 | 35.86 | 21.00 | 5.33 |
| Speaker-Reasoner (Ours, 4194h) | 7.13 | 8.14 | 3.38 | 1.01 | 19.72 | 19.92 | 6.70 | 0.20 | 23.29 | 25.16 | 13.56 | 1.87 |

| Model | MLC-SLM-Track1 [4] (EN, Multi-Accent) | | | | MLC-SLM-Track2 [4] (EN, Multi-Accent) | | | |
|---|---|---|---|---|---|---|---|---|
| | WER↓ | cpWER↓ | DER↓ | ∆cp↓ | WER↓ | cpWER↓ | DER↓ | ∆cp↓ |
| Gemini-2.5-Pro | 36.87 | 41.88 | 42.33 | 5.01 | 26.73 | 32.19 | 46.19 | 5.46 |
| VibeVoice-ASR[1] | 10.30 | 13.45 | 6.27 | 3.15 | 7.97 | 11.38 | 3.14 | 3.41 |
| Speaker-Reasoner (Ours, 4194h) | 9.17 | 11.74 | 4.76 | 2.57 | 8.54 | 11.76 | 4.35 | 3.22 |
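
The page does not define ∆cp, but every reported value is consistent with ∆cp = cpWER − WER, which would make it the extra error introduced by speaker attribution on top of plain recognition. That reading is an inference from the numbers, not a definition given by the authors. The short check below recomputes it for the Speaker-Reasoner rows, with values copied from the tables above; the helper name `delta_cp` is illustrative.

```python
# (WER, cpWER, reported ∆cp) for Speaker-Reasoner, copied from the tables.
ROWS = {
    "Video-Internal-Eval": (6.27, 24.43, 18.16),
    "AISHELL4-Eval": (7.13, 8.14, 1.01),
    "Alimeeting-Far": (19.72, 19.92, 0.20),
    "AMI-SDM": (23.29, 25.16, 1.87),
    "MLC-SLM-Track1": (9.17, 11.74, 2.57),
    "MLC-SLM-Track2": (8.54, 11.76, 3.22),
}


def delta_cp(wer: float, cpwer: float) -> float:
    """Assumed definition: gap between cpWER and WER, in percentage points."""
    return round(cpwer - wer, 2)


for name, (wer, cpwer, reported) in ROWS.items():
    computed = delta_cp(wer, cpwer)
    status = "matches" if abs(computed - reported) < 0.005 else "differs"
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f} ({status})")
```

Under this reading, the 0.20 on Alimeeting-Far means almost none of the remaining error comes from assigning words to the wrong speaker, whereas the 18.16 on Video-Internal-Eval points to speaker attribution as the main source of error there.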
The demos show real-world audio processed by Speaker-Reasoner-4194h with timestamped, speaker-attributed transcription. During playback the transcript scrolls in sync with the audio, and overlapping speech from multiple speakers is highlighted simultaneously.
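
Synchronized highlighting of this kind amounts to asking which transcript entries are active at the current playback time. The sketch below is one minimal way to do that, not the code behind the demo; the `Cue` type, `active_cues`, and the sample cues are hypothetical.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def active_cues(cues: List[Cue], t: float) -> List[Cue]:
    """Return every cue whose [start, end) interval contains time t."""
    return [c for c in cues if c.start <= t < c.end]


# Made-up cues; spk1 and spk2 overlap between 2.0 s and 3.0 s.
CUES = [
    Cue("spk1", 0.0, 3.0, "first utterance"),
    Cue("spk2", 2.0, 5.0, "overlapping reply"),
    Cue("spk1", 5.5, 7.0, "follow-up"),
]

for t in (0.5, 2.5, 5.2):
    speakers = [c.speaker for c in active_cues(CUES, t)]
    print(f"t={t:.1f}s -> highlight {speakers}")
```

Because the query returns a list instead of a single entry, overlapping speech needs no special handling: at 2.5 s both speakers' lines are returned and can be highlighted together.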