SLT2026

SmartGlasses Challenge: Egocentric Speech Interaction on AI Glasses

Benchmarking Egocentric Speech Interaction for Next-Generation AI Glasses in Real-World Environments


Challenge Call

Challenge and Scenario Description

Driven by the rapid advancement of Large Language Models (LLMs) and Multimodal LLMs, AI-powered smart glasses are emerging as a next-generation platform for human-computer interaction. Equipped with microphone arrays and cameras, smart glasses naturally capture the wearer’s egocentric (first-person) perspective, enabling hands-free multimodal communication throughout daily life.

However, deploying robust speech-centric interaction systems on smart glasses introduces distinct challenges compared with stationary devices such as smart speakers or handheld devices such as smartphones. Smart glasses operate under highly dynamic acoustic conditions, contending with environmental noise, user-generated motion noise, and speech from surrounding people.

To address these challenges, the SmartGlasses Challenge introduces a new benchmark for evaluating Automatic Speech Recognition (ASR) and Spoken Language Understanding (SLU) in real-world egocentric interaction scenarios, including human–machine dialogue, dyadic conversation, and multi-party meetings.

Latest News

Stay updated with the latest announcements from the organizers

Here are the latest updates and news from the SmartGlasses Challenge organizers:

  1. [Date TBD] Challenge registration opens.
  2. [Date TBD] Dataset release announcement.
  3. [Date TBD] Workshop details announced.

Track Overview

Details of the three challenge tracks

Track 1: Human–Machine Command Dialogue


Scenario: Daily human–AI glasses dialogue recorded on-device, covering both single-turn commands and multi-turn conversational interactions grounded in audio and visual context.

ASR Task: Evaluate speech recognition robustness under diverse real-world environments including noisy streets, cycling, and whisper speech. The evaluation metric is Word Error Rate (WER).

SLU Task: Evaluate the system’s ability to interpret the user’s intent from spoken commands. In multi-turn interactions, the system must leverage dialogue history to resolve contextual dependencies.

Track 2: Dyadic Dialogue Understanding


Scenario: Face-to-face two-person conversations in everyday settings, involving overlapping speech, background interference, topic shifts, and complex semantic structures.

TSA-ASR Task: Evaluate speaker-attributed transcription with time alignment in overlapping speech scenarios. The metric is time-constrained minimum permutation WER (tcpWER).

SLU Task: Evaluate the system’s ability to capture factual details, track logical flow, and understand relationships between speakers within dyadic dialogues.

Track 3: Multi-Party Meeting Understanding


Scenario: Multi-speaker meetings with varying numbers of participants, frequent turn-taking, long conversational contexts, and domain-specific vocabulary.

TSA-ASR Task: Evaluate multi-speaker speech recognition with speaker diarization and temporal alignment in highly overlapped environments. The metric is tcpWER.

SLU Task: Evaluate the system’s ability to understand complex meeting discussions, extract key information, and summarize speaker-wise viewpoints from long-form conversations.

Data Description

Explore the datasets and evaluation metrics for the challenge

The SmartGlasses dataset comprises about 200 hours of 4-channel audio-visual recordings, captured entirely in real-life scenarios using commercial-grade AI glasses. To ensure a comprehensive and fair evaluation, the dataset covers diverse acoustic environments, varying speech volumes, and rich user profiles, and includes highly challenging multi-speaker conversation scenarios.

Track   | Scene                              | Data Modality | Audio Channels | Total Duration | Tasks
Track 1 | Human–Machine Command Dialogue     | Audio & Video | 4              | TBD            | ASR & SLU
Track 2 | Dyadic Dialogue Understanding      | Audio         | 4              | TBD            | TSA-ASR & SLU
Track 3 | Multi-Party Meeting Understanding  | Audio         | 4              | TBD            | TSA-ASR & SLU
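
Since every track ships 4-channel audio, a first practical step is verifying the channel layout when loading a recording. Below is a minimal loading sketch in Python; the file name and WAV format are assumptions for illustration, as the official data layout will be specified at dataset release.

    import soundfile as sf

    # Read one multi-channel recording; soundfile returns an array of shape
    # (num_frames, num_channels) plus the sample rate.
    audio, sr = sf.read("example_recording.wav")  # hypothetical file name
    assert audio.ndim == 2 and audio.shape[1] == 4, "expected 4-channel audio"
    print(f"{audio.shape[0] / sr:.1f} s at {sr} Hz, {audio.shape[1]} channels")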

Evaluation Metric

Understand how your results will be evaluated

ASR Evaluation: Track 1 adopts the standard Word Error Rate (WER). Tracks 2 and 3 involve multi-speaker scenarios and adopt Time-Stamped Speaker-Attributed ASR (TSA-ASR). The evaluation metric is time-constrained minimum permutation WER (tcpWER).
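
As a reference point, a minimal WER computation is sketched below: the standard Levenshtein alignment over words, with substitutions, deletions, and insertions divided by the reference length. The official scoring pipeline and text normalization rules will be released by the organizers; this sketch is illustrative only.

    def wer(reference: str, hypothesis: str) -> float:
        """Word Error Rate via word-level edit distance (no text normalization)."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("turn on the camera", "turn on camera"))  # 1 deletion / 4 words = 0.25

For tcpWER, the open-source MeetEval toolkit provides an implementation; a typical invocation looks like `meeteval-wer tcpwer -r ref.stm -h hyp.stm --collar 5`, though the collar value and file formats used for official scoring are TBD.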

SLU Evaluation: All tracks evaluate semantic understanding using objective Question Answering (QA) accuracy based on multiple-choice questions constructed from the dialogue context.
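
Because SLU is scored as multiple-choice QA accuracy, scoring reduces to exact match over selected options. A minimal sketch, assuming predictions and references are keyed by question ID (the actual submission format is TBD):

    def qa_accuracy(predictions: dict, references: dict) -> float:
        """Fraction of questions whose predicted option matches the reference."""
        correct = sum(predictions.get(qid) == ans for qid, ans in references.items())
        return correct / len(references)

    refs = {"q1": "A", "q2": "C", "q3": "B"}   # hypothetical question IDs and answers
    preds = {"q1": "A", "q2": "B", "q3": "B"}
    print(qa_accuracy(preds, refs))  # 2/3 ≈ 0.667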

Registration Guidelines

Follow the steps below to complete your registration for SmartGlasses

Step 1: Registration

Teams wishing to participate in the challenge should register via the provided Registration Form (TBD). Please submit the following details:

  • Team name
  • Name of each team member
  • Organization
  • Email address

Step 2: Dataset Access

After successful registration, teams will be provided with access to the 200-hour SmartGlasses dataset, including training and validation sets.

Contact Information

If you have any questions, please contact the organizers.

Submission Guidelines

  • Each team should submit its predictions for both evaluation tasks (recognition and understanding).
  • Track 1 submissions are scored with WER for recognition and QA accuracy for understanding.
  • Tracks 2 and 3 are scored with tcpWER for recognition and QA accuracy for understanding; an illustrative hypothesis segment format is sketched after this list.
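
For Tracks 2 and 3, tcpWER scoring requires speaker-attributed hypotheses with timestamps. The official submission schema is TBD; purely as an illustration, tools such as MeetEval consume segment lists along these lines:

    # Hypothetical speaker-attributed hypothesis segment (SegLST-style);
    # the field names mirror those used by the MeetEval toolkit and are
    # assumptions, not the official SmartGlasses submission schema.
    segment = {
        "session_id": "meeting_001",
        "speaker": "spk_1",
        "start_time": 12.3,  # seconds
        "end_time": 15.8,
        "words": "let's move on to the next agenda item",
    }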

Challenge Timeline

The tentative timeline for running the challenge

The tentative timeline is as follows (all dates TBD):

  1. [Date TBD] Challenge begins. Release of training and validation data.
  2. [Date TBD] Release of testing data.
  3. [Date TBD] Result submission deadline.
  4. [Date TBD] Release of challenge results and rankings.

Leaderboard

Rankings will be available once the challenge results are released.


License

The SmartGlasses dataset is available for academic research only. Use of the dataset is subject to the following conditions:

  • Any work that uses the dataset must include a reference to the SmartGlasses dataset.
  • For the baseline system, please cite the paper listed on our website.
  • You may not use the SmartGlasses dataset or any derivative works for any other purpose.

All rights not expressly granted to you are reserved by the organizers of this challenge.

Organizers

Lei Xie

Northwestern Polytechnical University, China

Longshuai Xiao

Huawei, China

Zhaohong Ni

Meta, USA

Xie Chen

Shanghai Jiao Tong University, China

Jun Du

USTC, China

Eng-Siong Chng

Nanyang Technological University, Singapore

Jun Zhou

Rokid, China

Dehui Gao

Northwestern Polytechnical University, China

Zhaokai Sun

Northwestern Polytechnical University, China

Zhixian Zhao

Northwestern Polytechnical University, China

Runduo Han

Northwestern Polytechnical University, China

Yujie Liao

Northwestern Polytechnical University, China