Dissertation: Deep Neural Networks for Sound Event Detection

Hearing is an essential part of cognition. Humans, for instance, are very good at interpreting and reacting to sounds even without corresponding visual cues. Nevertheless, the link between sounds and what they represent is challenging to explain through a set of hand-made formulas. In his dissertation, Emre Cakir shows that this link can be learned using deep learning techniques such as convolutional and recurrent neural networks.

A sound event is defined as "an audio segment that can be labeled as a distinctive concept in an audio signal". Examples include a dog barking, a doorbell ringing, a baby crying and footsteps. Automatic detection of sound events can be used in applications including audio surveillance, wildlife monitoring, context-aware devices and smart home systems.

The complex relationship between sound event representations (in both the time and frequency domains) and their labels makes it hard to come up with an engineered solution for sound event detection. This is where deep learning helps. Deep neural networks (DNNs) automatically learn a highly nonlinear input-output relationship through a hierarchical structure of neuron layers. In supervised learning with DNNs, a differentiable loss function is defined to measure the difference between the estimated and target outputs, and the neuron parameters are iteratively updated using gradient-descent-based methods to minimize the loss.
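The training loop described above can be illustrated with a minimal sketch. It is not the dissertation's model: the two-layer network, the binary cross-entropy loss, the toy data and every name and hyperparameter below are assumptions chosen only to show a differentiable loss being minimized by plain gradient descent.

```python
import numpy as np


def train_tiny_network(features, targets, hidden_units=8,
                       learning_rate=0.3, steps=300, seed=0):
    """Train a two-layer network with full-batch gradient descent.

    Illustrative only: layer sizes, learning rate and the toy loss
    setup are assumptions, not values from the dissertation.
    Returns the loss recorded at every step.
    """
    rng = np.random.default_rng(seed)
    n_inputs = features.shape[1]
    n_outputs = targets.shape[1]
    w1 = rng.normal(0.0, 0.5, (n_inputs, hidden_units))
    b1 = np.zeros(hidden_units)
    w2 = rng.normal(0.0, 0.5, (hidden_units, n_outputs))
    b2 = np.zeros(n_outputs)
    eps = 1e-12
    losses = []
    for _ in range(steps):
        # Forward pass: hidden layer (tanh), then sigmoid outputs.
        hidden = np.tanh(features @ w1 + b1)
        probs = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))
        # Differentiable loss: mean binary cross-entropy.
        loss = -np.mean(targets * np.log(probs + eps)
                        + (1.0 - targets) * np.log(1.0 - probs + eps))
        losses.append(float(loss))
        # Backward pass: gradients of the mean loss.
        grad_logits = (probs - targets) / targets.size
        grad_w2 = hidden.T @ grad_logits
        grad_b2 = grad_logits.sum(axis=0)
        grad_pre = (grad_logits @ w2.T) * (1.0 - hidden ** 2)
        grad_w1 = features.T @ grad_pre
        grad_b1 = grad_pre.sum(axis=0)
        # Gradient descent update of all parameters.
        w1 -= learning_rate * grad_w1
        b1 -= learning_rate * grad_b1
        w2 -= learning_rate * grad_w2
        b2 -= learning_rate * grad_b2
    return losses


# Toy data (hypothetical): the label is 1 when the first feature is positive.
data_rng = np.random.default_rng(1)
toy_features = data_rng.normal(size=(64, 4))
toy_targets = (toy_features[:, :1] > 0).astype(float)
losses = train_tiny_network(toy_features, toy_targets)
```

On this toy problem the recorded loss should fall over the course of training, which is the behaviour the paragraph describes. Practical systems use the same principle with much larger networks and more elaborate gradient-based optimizers.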

In his dissertation, Emre Cakir investigates several deep learning techniques such as deep feedforward, convolutional and recurrent nets for sound event detection and evaluates them in various settings such as low vs. high polyphony and real-life vs. synthetic recordings.

“The abstract learning capabilities of deep learning techniques prove to be crucial for a robust sound event detection system. We utilize a low level time-frequency representation of the sound event (such as mel spectrogram) as input and learn a mapping between this and the binary sound event label through deep learning. Especially, when the local-shift invariant feature extraction capabilities of convolutional layers and the temporal modeling capabilities of recurrent layers are combined in a single model, the performance more than doubles compared to established techniques such as Gaussian Mixture Models over a set of 61 sound events recorded in real-life environments.”
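The binary sound event labels mentioned in the quote can be pictured as a frame-by-class matrix that the network learns to predict from the time-frequency input. The sketch below is only an illustration under stated assumptions: the function name, the class names, the annotation format (onset, offset, label in seconds) and the 0.25-second frame hop are all hypothetical and are not taken from the dissertation.

```python
import numpy as np


def events_to_frame_labels(events, class_names, n_frames, hop_seconds):
    """Convert (onset, offset, label) annotations into a binary matrix.

    Row t, column c is 1 when class c is active in frame t. Several
    columns can be 1 in the same row, which is how overlapping
    (polyphonic) events are represented.
    """
    labels = np.zeros((n_frames, len(class_names)), dtype=np.int8)
    for onset, offset, name in events:
        first = int(np.floor(onset / hop_seconds))
        last = int(np.ceil(offset / hop_seconds))
        labels[max(first, 0):min(last, n_frames), class_names.index(name)] = 1
    return labels


# Hypothetical annotation: footsteps for one second, with a dog bark on top.
class_names = ["dog bark", "footsteps"]
events = [(0.0, 1.0, "footsteps"), (0.5, 0.75, "dog bark")]
frame_labels = events_to_frame_labels(events, class_names,
                                      n_frames=5, hop_seconds=0.25)
# Number of simultaneously active events in each frame.
polyphony = frame_labels.sum(axis=1)
```

In this example the third frame contains two active events at once, so its polyphony is 2, while the last frame is silent.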

Public defence of a doctoral dissertation on Tuesday, 22 January 2019

Emre Cakir’s doctoral dissertation "Deep Neural Networks for Sound Event Detection" will be publicly examined at Tampere University in Auditorium TB207 of the Tietotalo building (address: Korkeakoulunkatu 1, Tampere, Finland) at 12 noon on Tuesday, 22 January 2019.

The opponents will be Dr. Dan Stowell from Queen Mary University of London (UK) and Associate Professor Tomi Kinnunen from the University of Eastern Finland. Professor Tuomas Virtanen from Tampere University will act as Chairman.

Emre Cakir (27) comes from Ankara, Turkey. He has been a member of the Audio Research Group (ARG) at Tampere University since 2014, where he has been developing machine hearing algorithms for applications including sound event detection, sound direction estimation, acoustic scene recognition and musical instrument synthesis.

The dissertation is available online at: http://urn.fi/URN:ISBN:978-952-03-0962-6