SLT2026

SmartGlasses Challenge: Egocentric Speech Interaction on AI Glasses

Benchmarking Egocentric Speech Interaction for Next-Generation AI Glasses in Real-World Environments


Challenge Call

Driven by the rapid advancement of Large Language Models (LLMs) and Multimodal LLMs, AI-powered smart glasses are emerging as a next-generation platform for human-computer interaction. Equipped with microphone arrays and cameras, smart glasses naturally capture the wearer’s egocentric (first-person) perspective, enabling hands-free multimodal communication throughout daily life.

However, deploying robust speech-centric interaction systems on smart glasses introduces distinct challenges compared with traditional stationary devices such as smart speakers or handheld devices such as smartphones. Smart glasses operate in highly dynamic acoustic environments, where the wearer's speech is mixed with environmental noise, user-generated motion noise, and speech from surrounding people.

To address these challenges, the SmartGlasses Challenge introduces a new benchmark for evaluating Time-Stamped Speaker-Attributed ASR (TSA-ASR) and Spoken Language Understanding (SLU) in real-world egocentric interaction scenarios, including dyadic conversations and multi-party meetings.

Latest News

Here are the latest updates and news from the SmartGlasses Challenge organizers:

  • 2026-05-15 (New!): The training data and validation data (Part 1) have now been released! All participating teams, please check the email address used for challenge registration to obtain the data download link. The rest of the training data and validation data will be released in the next few days.
  • 2026-05-07: The release of the training set and validation set has been rescheduled to 2026-05-14.
  • 2026-04-15: Registration Opens

Track Overview

Track 1: Dyadic Dialogue Understanding

Track 1 Illustration

Scenario: Face-to-face two-person conversations in everyday settings, involving overlapping speech, background interference, topic shifts, and complex semantic structures.

TSA-ASR Task: Evaluate speaker-attributed transcription with time alignment in overlapping speech scenarios. The metric is time-constrained minimum permutation WER (tcpWER).

SLU Task: Evaluate the system’s ability to capture factual details, track logical flow, and understand relationships between speakers within dyadic dialogues.

Track 2: Multi-Party Meeting Understanding

Track 2 Illustration

Scenario: Multi-speaker meetings with varying numbers of participants, frequent turn-taking, long conversational contexts, and domain-specific vocabulary.

TSA-ASR Task: Evaluate multi-speaker speech recognition with speaker diarization and temporal alignment in highly overlapped environments. The metric is tcpWER.

SLU Task: Evaluate the system’s ability to understand complex meeting discussions, extract key information, and summarize speaker-wise viewpoints from long-form conversations.

Data Description

The organizing committee provides a separate download link for each track. After downloading and extracting the files, the dataset's root directory is organized as shown below for Track 1 (SmartGlasses-Track1); the structure for Track 2 is identical:

SmartGlasses-Track1/
├── Train/                 # Training set
│   └── Part1/
│       ├── audio/         # Audio files (.wav)
│       └── textgrid/      # Text and timestamp annotation files (.TextGrid)
├── Dev/                   # Development (Validation) set
│   └── Part1/
│       ├── audio/
│       ├── textgrid/
│       └── QA/            # QA annotation files (.json) - Provided only in the Dev set
└── data.jsonl             # Global data index and metadata file
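
The layout above can be walked with a few lines of Python. The following is a minimal sketch, assuming that each audio file and its annotation share the same file stem (for example `X.wav` and `X.TextGrid`); the page does not state the naming scheme, so that pairing rule and the function name `pair_sessions` are assumptions.

```python
from pathlib import Path


def pair_sessions(part_dir):
    """Map each file stem to its (wav, TextGrid) paths within one Part folder.

    Assumption: audio and annotation files share the same stem.
    Files that exist in only one of the two folders are skipped.
    """
    part_dir = Path(part_dir)
    wavs = {p.stem: p for p in (part_dir / "audio").glob("*.wav")}
    grids = {p.stem: p for p in (part_dir / "textgrid").glob("*.TextGrid")}
    shared = sorted(wavs.keys() & grids.keys())
    return {stem: (wavs[stem], grids[stem]) for stem in shared}
```

For example, `pair_sessions("SmartGlasses-Track1/Train/Part1")` would return one entry per session that has both files, which makes unmatched files easy to spot before training.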

Detailed folder descriptions

  • audio/: Contains the four-channel dialogue audio files (.wav format) for the respective track.
  • textgrid/: Contains the .TextGrid annotation files corresponding to the audio. These files include speaker-level timestamp boundaries and their corresponding text transcriptions.
  • QA/: (Provided only in the Dev set.) This folder contains the .json files used for the objective multiple-choice evaluation. Each JSON file corresponds to an audio segment and includes the question, options, and ground-truth answer required for the evaluation.

    Special note on QA data: The multiple-choice QA data in the Dev set is automatically generated by large language models (LLMs) and has not undergone strict human verification; therefore, it may contain minor flaws or noise. We are fully open-sourcing this data and its answers to serve as reference examples, helping participating teams build their pipelines and debug their models. For the final hidden test set, we will use a similar approach to construct complex speech understanding and reasoning multiple-choice questions. However, all test questions and ground-truth answers will undergo strict human review and refinement to ensure absolute fairness and scientific rigor in the final evaluation.
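
To illustrate how the .TextGrid annotations might be read, here is a minimal sketch of a parser for Praat's long ("full text") TextGrid format using only the standard library. It assumes UTF-8 text files with one interval tier per speaker, where the tier name is the speaker label; neither detail is confirmed on this page, so check a real file first. The function name `parse_textgrid` is hypothetical.

```python
import re

# Each tier starts with a line such as "item [1]:" (the "item []:" header has no index).
ITEM_SPLIT = re.compile(r"\n\s*item \[\d+\]:")
TIER_NAME = re.compile(r'name = "(.*?)"')
# Praat escapes a double quote inside a text by doubling it.
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.eE+-]+)\s*xmax = ([\d.eE+-]+)'
    r'\s*text = "((?:[^"]|"")*)"'
)


def parse_textgrid(content):
    """Return (tier_name, start_sec, end_sec, text) tuples sorted by start time.

    Assumption: tier names identify speakers. Empty intervals are dropped.
    """
    segments = []
    for block in ITEM_SPLIT.split(content)[1:]:
        name_match = TIER_NAME.search(block)
        tier = name_match.group(1) if name_match else ""
        for start, end, text in INTERVAL.findall(block):
            text = text.replace('""', '"').strip()
            if text:
                segments.append((tier, float(start), float(end), text))
    return sorted(segments, key=lambda seg: seg[1])
```

A file can be parsed with `parse_textgrid(open(path, encoding="utf-8").read())`. Short-format or UTF-16 TextGrid files would need a different reader.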

Global index file: data.jsonl

A data.jsonl file is provided in the root directory of each track. This file serves as the global index for the entire dataset, where each line represents the metadata for a single data sample.
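
Because the field names in data.jsonl are not listed here, a reasonable first step is to load the file generically and inspect which keys occur. The sketch below makes no assumption about the schema; `read_jsonl` and `field_counts` are hypothetical helper names.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err


def field_counts(records):
    """Count how often each top-level key appears across all records."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts
```

Calling `field_counts(read_jsonl("SmartGlasses-Track1/data.jsonl"))` shows which metadata fields are present and whether any are missing from some samples.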

Note: The additionally collected data will be released later as Part 2. Part 2 is merely a chronological update in the release schedule; its data format and folder structure will be exactly identical to the current Part 1.

Dataset statistics

Below is the statistical overview of the currently released Part 1 data (this table will be updated upon the release of Part 2):

Track 1: Dyadic Dialogue Understanding

Split            Sessions   Total Duration (hrs)   Avg. Duration (sec)
Train (Part 1)   332        29.12                  315.71
Dev (Part 1)     66         5.59                   304.82
Total            398        34.71                  313.90

Track 2: Multi-Party Meeting Understanding

Split            Sessions   Total Duration (hrs)   Avg. Duration (sec)
Train (Part 1)   105        34.13                  1170.09
Dev (Part 1)     21         7.06                   1210.47
Total            126        41.19                  1176.82
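
As a quick sanity check, the average duration can be approximated from the other two columns as total hours × 3600 / sessions. The sketch below recomputes it for the Part 1 rows; small differences from the listed averages are expected because the hour totals are rounded to two decimals. The helper name `average_duration_sec` is hypothetical.

```python
def average_duration_sec(total_hours, sessions):
    """Average session length in seconds from total hours and session count."""
    return total_hours * 3600.0 / sessions


# (sessions, total hours, listed average in seconds), copied from the Part 1 tables.
PART1_ROWS = {
    "Track 1 Train": (332, 29.12, 315.71),
    "Track 1 Dev": (66, 5.59, 304.82),
    "Track 2 Train": (105, 34.13, 1170.09),
    "Track 2 Dev": (21, 7.06, 1210.47),
}

for name, (sessions, hours, listed) in PART1_ROWS.items():
    estimate = average_duration_sec(hours, sessions)
    print(f"{name}: estimated {estimate:.2f} s, listed {listed:.2f} s")
```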

Smart Glasses Microphone Array Layout

SmartGlasses Microphone Array

Channel-to-Microphone Mapping

The 4 channels of the audio files are mapped to the physical micro-electro-mechanical systems (MEMS) microphone array integrated onto the smart glasses frames as follows:

  • Channel 1 (mic1): Right temple, rear position
  • Channel 2 (mic2): Right temple, front position
  • Channel 3 (mic3): Left temple, front position
  • Channel 4 (mic4): Left temple, rear position

Physical Array Geometry

The spatial coordinates and geometric constraints of the acoustic centers of the four microphones are specified below:

Horizontal Projection Displacements:

  • Intra-temple separation (Right): The axial distance between mic1 and mic2 is 47 mm.
  • Intra-temple separation (Left): The axial distance between mic3 and mic4 is 50 mm.
  • Inter-temple span (Front): The cross-lateral distance between mic2 and mic3 is 145 mm.
  • Inter-temple span (Rear): The cross-lateral distance between mic1 and mic4 is 146 mm.

Vertical & Lateral Offsets:

  • Compared to the right-rear microphone (mic1), the right-front microphone (mic2) features a positive vertical elevation of 10 mm and an outward lateral offset of 1 mm along the frame thickness direction.
  • The acoustic centers of mic1, mic3, and mic4 reside on the same horizontal reference plane (zero vertical offset).
  • The left and right temples are orthogonal to the plane of the lenses.
  • The baseline connecting mic1 and mic4 is strictly parallel to the plane of the lenses.
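
For front-end processing such as beamforming, the geometry above can be turned into approximate 3D coordinates. The following is one possible reading, not an official coordinate table. It assumes x points toward the wearer's right, y points forward toward the lenses, and z points up; that "axial" distances run along the temple (y); and that "cross-lateral" distances are left-right (x) separations. Under that reading mic3's x position is derived from mic2's. The speed of sound is an approximate room-temperature value.

```python
import math

# Coordinates in millimetres. The axis convention and the reading of "axial" and
# "cross-lateral" are assumptions of this sketch, not part of the official spec.
MIC_POSITIONS_MM = {
    "mic1": (73.0, 0.0, 0.0),    # right temple, rear
    "mic2": (74.0, 47.0, 10.0),  # right temple, front: 47 forward, 1 outward, 10 up
    "mic3": (-71.0, 50.0, 0.0),  # left temple, front: 145 to the left of mic2
    "mic4": (-73.0, 0.0, 0.0),   # left temple, rear: 146 to the left of mic1
}
SPEED_OF_SOUND_M_S = 343.0  # approximate, dry air at about 20 degrees Celsius


def distance_mm(a, b):
    """Straight-line distance between two microphones in millimetres."""
    return math.dist(MIC_POSITIONS_MM[a], MIC_POSITIONS_MM[b])


def max_delay_ms(a, b):
    """Upper bound on the time difference of arrival between two microphones."""
    return distance_mm(a, b) / 1000.0 / SPEED_OF_SOUND_M_S * 1000.0
```

Under these assumptions, `max_delay_ms("mic1", "mic4")` is roughly 0.43 ms, which bounds the delay search range for any TDOA estimate between the two rear microphones.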

Evaluation Metric

TSA-ASR Evaluation: Tracks 1 and 2 involve multi-speaker scenarios and adopt Time-Stamped Speaker-Attributed ASR (TSA-ASR). The evaluation metric is time-constrained minimum permutation WER (tcpWER).
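
To give an intuition for the permutation part of the metric, the sketch below computes a simplified variant, the concatenated minimum-permutation WER (cpWER). It is not tcpWER: it ignores timestamps entirely, whereas tcpWER additionally restricts word matches to those that are temporally plausible. It is also not the official scoring script, and it searches all permutations, which is only practical for a handful of speakers. The function names are hypothetical.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cpwer(reference, hypothesis):
    """Simplified cpWER (no time constraint).

    reference / hypothesis: {speaker_label: full transcript of that speaker}.
    Speaker labels need not match; the best label assignment is searched.
    """
    refs = [text.split() for text in reference.values()]
    hyps = [text.split() for text in hypothesis.values()]
    size = max(len(refs), len(hyps))
    refs += [[] for _ in range(size - len(refs))]  # pad missing speakers
    hyps += [[] for _ in range(size - len(hyps))]
    total_ref_words = sum(len(words) for words in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref_words
```

With a reference of `{"A": "hello world", "B": "good morning everyone"}` and a hypothesis of `{"spk1": "good morning everyone", "spk2": "hello word"}`, the best assignment swaps the labels and leaves one substitution over five reference words.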

SLU Evaluation: Both tracks evaluate semantic understanding using objective Question Answering (QA) accuracy based on multiple-choice questions constructed from the dialogue context.
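
QA accuracy on multiple-choice questions is the fraction of questions answered with the correct option. A minimal sketch follows, assuming predictions and ground-truth answers have already been reduced to option labels keyed by a question identifier. The JSON field names in the QA files are not specified here, so that mapping step is left out, and counting unanswered questions as wrong is an assumption about the scoring.

```python
def qa_accuracy(predictions, answers):
    """Fraction of questions whose predicted option matches the ground truth.

    predictions / answers: {question_id: option label such as "A"}.
    Questions missing from predictions count as incorrect (assumption).
    """
    if not answers:
        raise ValueError("no ground-truth answers given")
    correct = sum(
        1
        for question_id, gold in answers.items()
        if predictions.get(question_id, "").strip().upper() == gold.strip().upper()
    )
    return correct / len(answers)
```

For instance, with four questions, two correct predictions, one wrong prediction, and one missing prediction, the function returns 0.5.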

Registration Guidelines

Step 1: Registration

Please complete your registration by filling out a registration form.

After you submit the registration form, the organizing committee will send a confirmation email within 1 business day. Please check your inbox promptly; if the email has not arrived, check your spam folder first, then contact us via the email addresses below.

Step 2: Dataset Access

After successful registration and agreeing to the challenge rules, participating teams will be granted access to the SmartGlasses dataset, and the download link will be sent via email.

Contact Information

If you have any questions, please contact the organizers:

WeChat Group QR Code
If the QR code is expired, please email us to request the latest QR code.

The organizers of the challenge reserve the right to interpret, modify, and amend the participation terms and challenge rules.

Challenge Timeline

The tentative timeline for running the challenge is as follows:

  • 2026-04-15: Registration Opens
  • 2026-05-07: Release of Training Set, Validation Set (rescheduled to 2026-05-14; Part 1 released on 2026-05-15)
  • 2026-06-01: Registration Closes
  • 2026-06-05: Release of Test Set
  • 2026-06-19: Results Submission Deadline
  • 2026-07-03: System Description Submission Deadline
  • 2026-07-08: SLT Official Paper Submission Deadline
  • 2026-09-01: Paper Notification

Results

Results will be announced after the submission evaluation procedure is completed.

FAQ

1. Data

Q: Why does the current download only include Part 1? Will Part 2 have a different format?

A: We adopted a staged release strategy to allow participating teams to acquire data and set up baseline pipelines as early as possible. Part 1 contains the complete directory structure and sufficient data to run through the entire process. Part 2 will be released gradually within a week. Part 2 simply represents an expansion in data volume; its file format, four-channel audio properties, and directory structure will be exactly identical to Part 1. When released, you will only need to append the new data to the corresponding folders.

Q: I noticed that the multiple-choice QA in the Dev set occasionally contains logical flaws or noise. Will the Test set be the same?

A: No. The QA pairs in the Dev set were automatically generated by Large Language Models (LLMs) without strict human verification, and are provided merely as "reference examples" for teams to debug their pipelines. For the hidden Test set used for final leaderboard ranking, all questions and ground-truth answers will undergo rigorous double human verification and refinement.

Q: The audio is four-channel. Do I have to use all the channels?

A: We provide complete four-channel audio to preserve the most authentic acoustic spatial information. Participating teams are free to decide whether to utilize the multi-channel information for beamforming / front-end signal processing, or simply extract a single channel for model training. This depends entirely on your algorithm design.
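
For teams that start with a single channel, the sketch below extracts one channel from a multi-channel WAV file using only the Python standard library. It assumes 16-bit PCM samples, which this page does not confirm; the function rejects other sample widths rather than guessing. Channel numbers are 1-based to match the channel-to-microphone mapping, and the function name `extract_channel` is hypothetical.

```python
import array
import wave


def extract_channel(src_path, dst_path, channel):
    """Copy one channel (1-based) of a 16-bit PCM WAV into a mono WAV file.

    Returns the number of samples written.
    """
    with wave.open(str(src_path), "rb") as src:
        n_channels = src.getnchannels()
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        if not 1 <= channel <= n_channels:
            raise ValueError(f"channel must be in 1..{n_channels}")
        rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    # Samples are interleaved frame by frame, so a strided slice picks one channel.
    mono = samples[channel - 1::n_channels]
    with wave.open(str(dst_path), "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
    return len(mono)
```

Which channel works best as a single-channel input is an empirical question; comparing a few choices on the Dev set is a cheap experiment before investing in beamforming.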

Q: If I only participate in a single track, can I use the data from the other track for training?

A: Yes, this is fully permitted. You are welcome to use the official datasets across tracks (e.g., using Track 2 data to assist in training a model for Track 1) to augment your training data and improve the model's generalizability.

2. Models & Rules

Q: How is "One System" defined in the rules? Can I use different models to handle their corresponding tasks (e.g., using an if-else script for task routing)?

A: The core principle of this challenge is "One system handles two tasks". The single system submitted by your team may include necessary pre-processing and post-processing modules, and we encourage internal architectural innovations as long as they are part of an integrated system design. However, it is strictly prohibited to use manually hard-coded heuristic rules (e.g., simple if-else scripts) to forcibly route test samples with different characteristics to completely independent and heterogeneous core models for "pseudo-ensembling".

Q: Do I have to use the exact same model to participate in both Track 1 (Two-party) and Track 2 (Multi-party)?

A: This is not mandatory. You can design one specific model for Track 1 and a different model for Track 2. However, we encourage teams with sufficient resources to explore foundational Omni-modal/Audio-Language Models capable of handling both tracks simultaneously.

Q: Can we use external data and pre-trained foundation models?

A: Yes, this is allowed. You may use open-source pre-trained foundation models (e.g., Whisper, LLaMA, Qwen) and external, open-source datasets (e.g., LibriSpeech). However, the use of any private data is strictly prohibited. All external resources and data augmentation methods used must be explicitly disclosed in the final System Description Paper.

Data License

By downloading and using the SmartGlasses Challenge dataset, participating teams agree to and commit to strictly abiding by the following terms:

1.1 Usage Restrictions

  • The authorization of this dataset is strictly limited to non-commercial academic research.
  • The dataset may only be used for participating in the "IEEE SLT 2026 SmartGlasses Challenge" and subsequent related academic research after the challenge concludes. It is strictly prohibited to use this dataset or any of its derivative versions for any commercial purposes, product development, or profitable services.

1.2 Distribution & Confidentiality

  • This dataset is accessible only to officially registered teams. Participating teams must not publish, leak, transfer, or distribute the dataset (including audio files, text annotations, and any subsets) to unregistered third-party individuals or organizations.

1.3 Mandatory Citation

Any research outcomes generated using this dataset (including but not limited to academic papers, technical reports, public presentations, or open-source projects) must comply with the following citation guidelines:

  • During the challenge and before the official paper publication: The official website of the SmartGlasses Challenge (https://aslp-lab.github.io/SmartGlasses) must be explicitly cited in the acknowledgments or references.
  • After the official paper publication: Once the organizing committee officially publishes the Overview Paper or baseline papers for the SmartGlasses Challenge, all subsequent works utilizing this dataset must cite the official paper.

1.4 Rights & Disclaimer

  • All intellectual property rights of the dataset belong to the SmartGlasses Challenge Organizing Committee and its affiliated institutions.
  • The dataset is provided "as is". The organizing committee makes no warranties regarding the dataset's suitability for any specific scenario and shall not be held liable for any direct or indirect damages arising from the use of this data.

Organizers

Lei Xie

Northwestern Polytechnical University, China

Longshuai Xiao

Huawei, China

Xie Chen

Shanghai Jiao Tong University, China

Jun Du

USTC, China

Shuai Wang

Nanjing University, China

Liumeng Xue

Nanjing University, China

Eng‑Siong Chng

Nanyang Technological University, Singapore

Jun Zhou

Rokid, China

Dehui Gao

Northwestern Polytechnical University, China

Zhixian Zhao

Northwestern Polytechnical University, China

Yike Zhu

Northwestern Polytechnical University, China

Yujie Liao

Northwestern Polytechnical University, China