Shuyang Zhao: Clustering analysis and active learning for sound event detection and classification
State-of-the-art sound event detection and classification systems use acoustic models developed with machine learning techniques. Training these models typically relies on a large amount of labeled audio data, and manually assigning labels to audio data is often the most time-consuming part of the model development process. In many practical cases unlabeled data is abundant, but the number of annotations that can be made is limited. The practical problem is therefore to maximize the accuracy of acoustic models with a limited number of annotations.
Shuyang Zhao proposes sample selection based on k-medoids clustering. The medoids are selected for annotation, and the label assigned to each medoid is propagated to the other samples in its cluster. After the medoids have been annotated, further samples are selected using a mismatch-first farthest-traversal criterion. The proposed methods substantially outperformed reference methods based on random sampling and uncertainty sampling.
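As a rough illustration of this selection scheme, the following Python sketch clusters samples with a simple k-medoids iteration, propagates medoid labels to cluster members, and then picks the next sample to annotate. It is not the dissertation's implementation: the function names, the use of a precomputed distance matrix, and in particular the reading of "mismatch-first farthest-traversal" (prefer samples whose propagated label disagrees with a model prediction, then take the one farthest from all annotated samples) are assumptions made for this example.

```python
import numpy as np


def k_medoids(dist, k, n_iter=20, seed=0):
    """Simple alternating k-medoids on a precomputed (n, n) distance matrix.

    Returns the medoid indices and the cluster index of every sample.
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every sample to its nearest medoid.
        assign = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assign == c)
            if members.size == 0:
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    assign = np.argmin(dist[:, medoids], axis=1)
    return medoids, assign


def propagate_labels(assign, medoid_labels):
    """Give every sample the label annotated for its cluster's medoid."""
    return np.asarray(medoid_labels)[assign]


def mismatch_first_farthest(dist, annotated, propagated, predicted):
    """Pick the next sample to annotate (assumed reading of the criterion).

    Samples whose propagated label differs from the model prediction are
    considered first; among them, the one farthest from every annotated
    sample is returned. If nothing mismatches, all unannotated samples
    are considered instead.
    """
    annotated = np.asarray(annotated)
    candidates = np.setdiff1d(np.arange(dist.shape[0]), annotated)
    if candidates.size == 0:
        raise ValueError("every sample is already annotated")
    mismatch = candidates[propagated[candidates] != predicted[candidates]]
    pool = mismatch if mismatch.size else candidates
    # Distance from each pool sample to its nearest annotated sample.
    gap = dist[np.ix_(pool, annotated)].min(axis=1)
    return int(pool[np.argmax(gap)])
```

In use, the annotator would label the samples at the returned medoid indices, pass those labels (ordered by cluster) to `propagate_labels`, train a classifier on the result, and then call `mismatch_first_farthest` repeatedly with the classifier's predictions, adding each newly annotated index to `annotated`.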
The dissertation further investigates active learning methods for sound event detection, where sound events may overlap in time. Sound segments were generated within each recording using change point detection, and segments were selected for annotation using the mismatch-first farthest-traversal criterion. During the training of acoustic models, each recording was used as input to a recurrent convolutional neural network, and the training loss was derived only from the frames belonging to annotated segments. In experiments on a dataset where sound events are rare, the proposed active learning method required annotating only 2% of the training data to reach an accuracy similar to that obtained by annotating all of the training data.
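The idea of computing the training loss only on annotated frames can be sketched as a masked loss. The example below is a minimal NumPy illustration under stated assumptions: the source does not name the loss function or the framework, so binary cross-entropy over per-frame class probabilities is assumed here (a common choice when events may overlap), and `segments_to_mask` and `masked_bce` are hypothetical helper names.

```python
import numpy as np


def segments_to_mask(n_frames, segments):
    """Build a boolean frame mask from annotated (start, end) frame ranges.

    `end` is exclusive, as in Python slicing.
    """
    mask = np.zeros(n_frames, dtype=bool)
    for start, end in segments:
        mask[start:end] = True
    return mask


def masked_bce(probs, targets, frame_mask, eps=1e-7):
    """Mean binary cross-entropy over annotated frames only.

    probs, targets: arrays of shape (frames, classes).
    frame_mask: boolean array of shape (frames,), True for annotated frames.
    Frames outside the mask contribute nothing to the loss.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    per_frame = -(
        targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs)
    ).mean(axis=1)
    mask = frame_mask.astype(float)
    n_annotated = mask.sum()
    if n_annotated == 0:
        return 0.0  # no annotated frames in this recording
    return float((per_frame * mask).sum() / n_annotated)
```

With this formulation the network can still process a whole recording, so unannotated frames provide temporal context to the recurrent layers while leaving the loss unchanged.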
The doctoral dissertation of MSc(Tech) Shuyang Zhao in the field of computing and electrical engineering, titled Clustering Analysis and Active Learning for Sound Event Detection and Classification, will be publicly examined in the Faculty of Information Technology and Communication Sciences at Tampere University at 12 noon on Wednesday 19.1.2022, in auditorium RG202 of the Rakennustalo building, Hervanta campus, Korkeakoulunkatu 5, Tampere. The Opponents will be PhD Frederic Font from Universitat Pompeu Fabra and PhD Karol J. Piczak from Jagiellonian University. The Custos will be PhD Tuomas Virtanen from the Faculty of Information Technology and Communication Sciences.
The event can be followed via remote connection.
The dissertation is available online at http://urn.fi/URN:ISBN:978-952-03-2266-3