Doctoral defence

Huang Xie: AI system learns to understand everyday sounds through natural language

Tampere University
Location: Korkeakoulunkatu 6, Tampere
Hervanta campus, Konetalo building, auditorium K1702 and remote connection
Time: 8 December 2025, 12:00–16:00
Language: English
Admission: Free event
Huang Xie.
Photo: Haiyuan Lin
In his doctoral dissertation, M.Sc. Huang Xie investigated how machines can better connect what they hear with the words people use to describe sounds. His research develops new methods for linking audio signals and natural language, enabling systems to interpret complex sound scenes more accurately and more like humans.

Everyday technologies—from voice assistants to smart home devices and media search engines—depend increasingly on the ability to understand sounds. Yet teaching machines to relate acoustic events to meaningful language remains a challenge. In his dissertation, Huang Xie explores cross-modal learning, an approach in which models learn a shared semantic space for both audio and text.
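The idea can be illustrated with a small code sketch. This is not taken from the dissertation; the feature dimensions, projection layers and random data below are placeholder assumptions. Audio and text features are projected into one shared embedding space, where the similarity between a clip and a caption can be measured directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbeddingSpace(nn.Module):
    """Toy cross-modal model: projects audio and text features into one space."""
    def __init__(self, audio_dim=128, text_dim=300, embed_dim=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)  # audio encoder output -> shared space
        self.text_proj = nn.Linear(text_dim, embed_dim)    # text encoder output -> shared space

    def forward(self, audio_feat, text_feat):
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return a @ t.T  # cosine-similarity matrix: audio clip i vs. caption j

# Example: 4 audio clips and 4 captions with pre-extracted (here random) features
model = SharedEmbeddingSpace()
audio = torch.randn(4, 128)
text = torch.randn(4, 300)
similarity = model(audio, text)
print(similarity.shape)  # torch.Size([4, 4])
```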

Xie’s research addresses three major obstacles in current audio–language technology: how to match short, time-localized sound events with the words that describe them; how the choice of “negative examples” in contrastive learning influences what a model learns; and how to account for the fact that the relationship between a sound and a caption is not simply correct or incorrect, but often subjective and graded.

To tackle these issues, the dissertation introduces several new methods. One is an unsupervised framework that aligns sound segments with relevant textual phrases without requiring manual annotation—making it easier to train systems on large, unlabelled datasets. Another contribution is a systematic evaluation of negative sampling strategies, showing that sampling design has a much greater impact on model quality than previously recognized.
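To see why the choice of negative examples matters, consider a generic symmetric contrastive (InfoNCE-style) loss over an audio–caption similarity matrix. The sketch below is a standard textbook formulation, not the dissertation's exact method: the diagonal holds the matching pairs, and every off-diagonal pair in the batch acts as a negative, so the way the batch is sampled directly decides which negatives the model learns from.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(similarity, temperature=0.07):
    """Symmetric InfoNCE-style loss over an audio-caption similarity matrix.

    Diagonal entries are the matching (positive) pairs; all off-diagonal
    entries serve as negatives, so negative sampling is set by batch design.
    """
    logits = similarity / temperature
    targets = torch.arange(logits.size(0))
    loss_audio_to_text = F.cross_entropy(logits, targets)    # rank the right caption per clip
    loss_text_to_audio = F.cross_entropy(logits.T, targets)  # rank the right clip per caption
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)

# Example with a random 4x4 similarity matrix (e.g. from the sketch above)
sim = torch.randn(4, 4)
print(contrastive_loss(sim))
```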

The work also places humans at the center of evaluation. Xie created a crowdsourced dataset in which listeners rate how strongly a caption matches a sound. These graded human judgements are then incorporated into a dual-objective training method that allows models to learn more nuanced representations. Additionally, the dissertation proposes a regression-based approach for automatically generating continuous relevance scores, offering a scalable alternative to manual annotation.
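The sketch below shows, purely as an assumed illustration rather than Xie's actual formulation, how graded human ratings could enter a dual-objective training loss: a conventional contrastive term is combined with a regression term that pushes predicted audio–caption similarities towards continuous human relevance scores instead of hard correct/incorrect labels.

```python
import torch
import torch.nn.functional as F

def graded_relevance_loss(similarity, relevance, contrastive_weight=0.5, temperature=0.07):
    """Illustrative dual-objective loss (not the dissertation's exact formulation).

    Combines a contrastive term with a regression term that makes the model's
    audio-caption similarity track continuous human relevance ratings in [0, 1].
    """
    targets = torch.arange(similarity.size(0))
    contrastive = F.cross_entropy(similarity / temperature, targets)
    regression = F.mse_loss(torch.sigmoid(similarity), relevance)
    return contrastive_weight * contrastive + (1.0 - contrastive_weight) * regression

# Example: 3 clips x 3 captions with (hypothetical) crowdsourced relevance ratings
sim = torch.randn(3, 3)
ratings = torch.tensor([[1.0, 0.3, 0.0],
                        [0.2, 0.9, 0.4],
                        [0.0, 0.1, 0.8]])
print(graded_relevance_loss(sim, ratings))
```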

“Understanding sound is more than detecting what noise occurred—it’s about connecting what we hear with the language we use. By modelling these connections more precisely, we can build systems that search, describe, and interpret audio in ways that better match human perception,” Xie explains.

The findings advance both the theory and practice of audio–language modelling. They contribute to ongoing public discussions around trustworthy AI, human-centered evaluation, and the future of multimodal machine learning.

Public defence on Monday 8 December

The doctoral dissertation of M.Sc. Huang Xie in the field of Signal Processing and Machine Learning, titled Integrating Audio and Natural Language through Cross-Modal Learning, will be publicly examined at the Faculty of Information Technology and Communication Sciences at Tampere University on Monday 8 December 2025. The Opponents will be Professor Tomi Kinnunen from the University of Eastern Finland and Director of Research, PhD Emmanouil Benetos from Queen Mary University of London. The Custos will be Professor Tuomas Virtanen from the Faculty of Information Technology and Communication Sciences, Tampere University.