Sound event recognition based on joint training of separation and classification
* Presenting author
Abstract: Selective listening in complex sound scenes remains a major challenge for machine hearing. With recent advances in deep learning, the separation and recognition of speech have made great progress. The separation and classification of arbitrary environmental sounds, however, still lag far behind speech. In this paper, we propose a network based on joint training of separation and classification to address the recognition of environmental sound events. It consists of three main stages: time-frequency transformation, separation mapping, and classification mapping. It is trained on a large weakly labelled dataset in which audio classes are labelled without onset and offset times. These audio clips were recorded from real acoustic scenes and contain overlapping sound events. Separation and classification are mutually reinforcing: based on the masks estimated in the separation mapping stage, we compute a global probability for each event class as the classification output. The results show that joint training can improve the accuracy of sound event recognition, and that performance can even be comparable with networks trained on strongly labelled datasets.
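To illustrate the described pipeline, the sketch below shows one plausible way the separation mapping (per-class time-frequency masks) can feed the classification mapping (clip-level probabilities pooled over time, matching weak labels without onset/offset). It is a minimal, hypothetical example assuming PyTorch; the layer sizes, pooling choice, and class names are illustrative and not taken from the paper.

```python
# Hypothetical sketch of mask-based joint separation/classification,
# assuming PyTorch; architecture details are illustrative only.
import torch
import torch.nn as nn


class JointSeparationClassification(nn.Module):
    """Toy model: spectrogram -> per-class T-F masks -> clip-level probabilities."""

    def __init__(self, n_mels: int = 64, n_classes: int = 10):
        super().__init__()
        # Separation mapping: predict one time-frequency mask per event class.
        self.separation = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, n_classes, kernel_size=3, padding=1),
            nn.Sigmoid(),  # mask values in [0, 1]
        )
        # Classification mapping: score each masked spectrogram frame.
        self.classifier = nn.Linear(n_mels, 1)

    def forward(self, spec: torch.Tensor):
        # spec: (batch, 1, time, n_mels) log-mel spectrogram of the mixture
        masks = self.separation(spec)                 # (batch, classes, time, mels)
        separated = masks * spec                      # apply masks to the mixture
        frame_logits = self.classifier(separated).squeeze(-1)  # (batch, classes, time)
        frame_probs = torch.sigmoid(frame_logits)
        # Global (clip-level) probability per event: pool over time, which is
        # all that weak labels without onset/offset can supervise.
        clip_probs = frame_probs.mean(dim=-1)         # (batch, classes)
        return masks, clip_probs


if __name__ == "__main__":
    model = JointSeparationClassification()
    dummy = torch.rand(2, 1, 100, 64)                 # two 100-frame clips
    masks, clip_probs = model(dummy)
    print(masks.shape, clip_probs.shape)              # (2, 10, 100, 64) and (2, 10)
```

In such a setup, a single weak-label loss on `clip_probs` can back-propagate through the masks, so the separation stage is trained jointly with the classifier rather than with separate strong supervision.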