Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

Guojian Li1, Chengyou Wang1, Hongfei Xue1, Shuiyuan Wang1, Dehui Gao1, Zihan Zhang2, Yuke Lin2, Wenjie Li2, Longshuai Xiao2, Zhonghua Fu1,†, Lei Xie1,†

1 Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
2 Huawei Technologies, China

🎤 GitHub 🤖 Easy Turn Model 📑 Paper 🌐 Huggingface

Abstract

Full-duplex interaction is crucial for natural human–machine communication, yet it remains challenging because it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Existing solutions either rely on dedicated turn-taking models, most of which are not open-sourced and whose few available counterparts are limited by large parameter sizes or by supporting only a single modality (acoustic or linguistic), or they finetune LLM backbones to enable full-duplex capability, which requires large amounts of full-duplex data that remain scarce in open-source form. To address these issues, we propose Easy Turn, an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete, incomplete, backchannel, and wait. We also release the Easy Turn trainset, a 1,145-hour speech dataset designed for training turn-taking detection models. Compared with existing open-source models such as TEN Turn Detection and Smart Turn V2, our model achieves state-of-the-art turn-taking detection accuracy on our open-source Easy Turn testset. The data and model will be made publicly available on GitHub.

Demo: Interacting with OSUM-Echat while Easy Turn is running

Easy Turn

To accelerate research in the field of full-duplex spoken dialogue, we introduce Easy Turn, an open-source and modular turn-taking detection model. The model accepts the user's speech as input and outputs both the corresponding ASR transcription and the dialogue turn state, effectively integrating acoustic and linguistic information. Easy Turn comprises three key components: an audio encoder, an audio adaptor, and an LLM. Its design draws inspiration from Qwen-Audio, employing Whisper as the audio encoder and Qwen2.5 as the LLM. To better integrate the acoustic and linguistic modalities, we adopt an ASR + Turn-Detection paradigm, in which the LLM generates the ASR transcription and fuses it with the acoustic features to sequentially predict the dialogue turn state label (complete, incomplete, backchannel, or wait).

[Figure: Overview of the Easy Turn model architecture.]
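The sketch below illustrates this three-component design (Whisper encoder, linear adaptor, Qwen2.5 LLM) and the ASR + Turn-Detection decoding order, written against the Hugging Face transformers API. It is a minimal approximation: the adaptor shape, prompt wording, and class name are our own illustrative assumptions, not the released implementation.

```python
# Minimal structural sketch of the Easy Turn design described above.
# NOTE: illustrative only; the adaptor is untrained here, so outputs are
# structurally but not semantically meaningful.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, WhisperModel


class EasyTurnSketch(nn.Module):
    def __init__(self,
                 whisper_id="openai/whisper-medium",
                 llm_id="Qwen/Qwen2.5-0.5B-Instruct"):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(whisper_id).encoder  # acoustic encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_id)          # text LLM backbone
        self.tokenizer = AutoTokenizer.from_pretrained(llm_id)
        # Adaptor: maps Whisper hidden states to the LLM embedding size.
        self.adaptor = nn.Linear(self.encoder.config.d_model,
                                 self.llm.config.hidden_size)

    @torch.no_grad()
    def generate(self, input_features, prompt, max_new_tokens=64):
        # 1) Encode speech (log-mel features) into acoustic embeddings.
        audio_emb = self.adaptor(self.encoder(input_features).last_hidden_state)
        # 2) Embed the text prompt and prepend the acoustic embeddings.
        prompt_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        prompt_emb = self.llm.get_input_embeddings()(prompt_ids)
        inputs_embeds = torch.cat([audio_emb, prompt_emb], dim=1)
        # 3) The LLM first emits the ASR transcription, then the turn state
        #    tag, e.g. "...<incomplete>" (ASR + Turn-Detection paradigm).
        out = self.llm.generate(inputs_embeds=inputs_embeds,
                                max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)
```

Input features in this sketch would come from the matching WhisperFeatureExtractor on 16 kHz speech; the prompt string is likewise a placeholder rather than the model's actual instruction template.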

Easy Turn Trainset

The Easy Turn trainset is a large-scale speech dataset designed for training turn-taking detection models, containing both real and synthetic data. It covers the four dialogue turn states (complete, incomplete, backchannel, wait) and totals approximately 1,145 hours. Detailed statistics are shown in the table below, and the data processing pipeline is illustrated in the following figure. For more details, please refer to the original paper.

[Figure: Data processing pipeline of the Easy Turn trainset.]
[Figure: Statistics of the Easy Turn trainset.]
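For reference, one way to organize a single training entry under the ASR + Turn-Detection paradigm is sketched below. The field names and file layout are purely hypothetical assumptions; the released metadata format may differ.

```python
# Hypothetical layout of one trainset entry (field names are illustrative,
# not the released format).
sample = {
    "audio": "wavs/real/000001.wav",   # path to the user speech segment
    "text": "因为小时候",                # reference ASR transcription
    "state": "incomplete",              # complete / incomplete / backchannel / wait
    "source": "real",                   # real or synthetic speech
}

# Under the ASR + Turn-Detection paradigm, the supervision target is the
# transcription followed by the turn state tag.
target = f"{sample['text']}<{sample['state']}>"   # "因为小时候<incomplete>"
```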

Data Samples

| State | Real | Synthetic |
| --- | --- | --- |
| Complete | `你有没有发生过一些童年趣事啊<complete>` (Did you ever have any funny childhood stories?)<br>`我觉得画画太难了<complete>` (I think drawing is just too hard.) | `请描述古埃及金字塔的设计与用途<complete>` (Please describe the design and purpose of the ancient Egyptian pyramids.)<br>`初中学生的语文学习有什么特点<complete>` (What are the characteristics of Chinese-language learning for junior high school students?) |
| Incomplete | `因为小时候<incomplete>` (Because when I was little...)<br>`我那会儿就会在嗯<incomplete>` (Back then I would, um...) | `我本来想<incomplete>` (I originally wanted to...)<br>`其实主要想问就是<incomplete>` (Actually, what I mainly want to ask is...) |
| Backchannel | `嗯对对对<backchannel>` (Mm, right, right, right.)<br>`啊是啊<backchannel>` (Ah, yes.) | `哦对<backchannel>` (Oh, right.)<br>`挺好的<backchannel>` (Pretty good.) |
| Wait | `太吵了马上停下<wait>` (It's too noisy, stop right now.)<br>`立即静音<wait>` (Mute immediately.) | `我有急事要处理先停下<wait>` (I have something urgent to deal with, stop for now.)<br>`真受不了你别说了<wait>` (I really can't stand it, stop talking.) |
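Since the model emits the transcription followed by a turn state tag, as in the samples above, downstream dialogue logic needs to split the two. Below is a minimal parsing sketch; the function name is ours, not part of the release.

```python
# Split a model output of the form "<transcription><state-tag>" into its parts.
import re

STATES = ("complete", "incomplete", "backchannel", "wait")
STATE_TAG = re.compile(r"<(%s)>\s*$" % "|".join(STATES))

def parse_output(output: str):
    """Return (transcription, state) from an Easy Turn style output string."""
    m = STATE_TAG.search(output)
    if m is None:
        return output, None          # no recognizable state tag
    return output[:m.start()], m.group(1)

print(parse_output("因为小时候<incomplete>"))  # -> ('因为小时候', 'incomplete')
```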

Experiments

Main Results

We evaluate Easy Turn against two open-source turn-taking detection models, TEN Turn Detection and Smart Turn V2, on the Easy Turn testset. All experiments are conducted on a single NVIDIA RTX 4090 GPU. Notably, since TEN Turn Detection does not accept speech input directly, we use Paraformer as the ASR model to transcribe the speech into text and feed the transcribed text to it. The table below reports the results: ACC_cp, ACC_incp, ACC_bc, and ACC_wait denote the turn-taking detection accuracy for the complete, incomplete, backchannel, and wait states, respectively (higher is better). Params, Latency, and Memory denote total model size, average inference time, and GPU memory usage, respectively (lower is better). The symbol "–" indicates that the corresponding model does not support detection of that state.

| Model | Params (M) | Latency (ms) | Memory (MB) | ACC_cp (%) | ACC_incp (%) | ACC_bc (%) | ACC_wait (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Paraformer + TEN Turn Detection | 7220 | 204 | 15419 | 86.67 | 89.30 | – | 91.00 |
| Smart Turn V2 | 95 | 27 | 370 | 78.67 | 62.00 | – | – |
| Easy Turn (Proposed) | 850 | 263 | 2559 | 96.33 | 97.67 | 91.00 | 98.00 |
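For clarity, the sketch below shows one straightforward way to compute the per-state accuracies and the average per-utterance latency reported above; `detect` stands in for any of the evaluated models and is a placeholder, not an actual API.

```python
# Per-state accuracy and average latency, as used in the table above.
import time
from collections import defaultdict

def per_state_accuracy(pairs):
    """pairs: iterable of (predicted_state, reference_state) strings."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, ref in pairs:
        total[ref] += 1
        correct[ref] += int(pred == ref)
    # Accuracy (%) for each reference state, e.g. ACC_cp for "complete".
    return {state: 100.0 * correct[state] / total[state] for state in total}

def average_latency_ms(detect, utterances):
    """Average wall-clock inference time per utterance, in milliseconds."""
    start = time.perf_counter()
    for utt in utterances:
        detect(utt)                      # placeholder for one model call
    return 1000.0 * (time.perf_counter() - start) / len(utterances)
```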

Ablation Study

We conduct ablation experiments on Easy Turn to evaluate the contribution of each modality within its architecture and to assess the impact of the ASR + Turn-Detection paradigm on performance. The primary metric is ACC_avg, the average detection accuracy across the four dialogue turn states (complete, incomplete, backchannel, and wait). Easy Turn (only-state) uses the same architecture as Easy Turn but omits the ASR + Turn-Detection paradigm, directly predicting the dialogue turn state label without first generating the ASR transcription. Finetuned Whisper + Linear fine-tunes the Whisper-Medium audio encoder with an additional linear classifier on our Easy Turn trainset, taking only speech as input and directly predicting the dialogue turn state label; it represents the acoustic-only modality. Finetuned Qwen2.5-0.5B-Instruct is fine-tuned on the text transcriptions from our Easy Turn trainset, taking only text as input and outputting the dialogue turn state label; it represents the linguistic-only modality. Detailed results are shown in the table below.

| Model | Modality | ACC_avg (%) ↑ |
| --- | --- | --- |
| Easy Turn (Proposed) | Acoustic + Linguistic | 95.75 |
| Easy Turn (only-state) | Acoustic + Linguistic | 87.88 |
| Finetuned Qwen2.5-0.5B-Instruct | Linguistic-only | 86.25 |
| Finetuned Whisper + Linear | Acoustic-only | 85.50 |
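ACC_avg is simply the unweighted mean of the four per-state accuracies; as a quick check against the tables above:

```python
# ACC_avg = mean of the four per-state accuracies
# (complete, incomplete, backchannel, wait).
def acc_avg(acc_cp, acc_incp, acc_bc, acc_wait):
    return (acc_cp + acc_incp + acc_bc + acc_wait) / 4.0

# Easy Turn (Proposed), using its per-state accuracies from the Main Results table.
print(acc_avg(96.33, 97.67, 91.00, 98.00))  # -> 95.75, matching the ablation table
```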