Summary: | Surveillance systems based on image analysis can automatically detect road accidents and thus ensure quick intervention by rescue teams. In some situations, however, the visual information is not sufficiently reliable, and adding a sound-based detector can greatly improve the overall reliability of the surveillance system. In this paper, we focus on detecting two classes of anomalous sounds for audio surveillance on roads, namely tire skidding and car crashes, whose occurrence is a clear acoustic indication of road accidents or traffic disruptions. In the proposed method, we extract a deep audio representation (DAR) feature and then use a bidirectional long short-term memory (BiLSTM) classifier to determine the sound class to which each test audio segment belongs. To extract the DAR, we propose a framework based on a multi-stage deep autoencoder network (DAN), which fuses complementary information from several input features and can therefore be more discriminative and robust than any of those input features alone. In the experiments, we examine how the parameter settings of the DAN's hidden layers influence the performance of the DAR and compare the DAR with other features. Furthermore, the proposed method is compared with state-of-the-art methods. When evaluated on data with various signal-to-noise ratios, the results show that the DAR outperforms the other features and that the proposed method is superior to the state-of-the-art methods for detecting anomalous sounds on roads.
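  Since the abstract only outlines the architecture, the following is a minimal sketch of the described pipeline, with PyTorch assumed as the framework and all layer sizes, the fused-feature width, and the three-way label set (skid / crash / background) being illustrative assumptions rather than the paper's actual configuration: fused frame-level features pass through stacked autoencoder stages whose final code is taken as the DAR, and a BiLSTM produces one label per segment.

  ```python
  # Hypothetical sketch of the DAN -> DAR -> BiLSTM pipeline; dimensions
  # and the feature set are assumptions, not the paper's settings.
  import torch
  import torch.nn as nn

  class AutoencoderStage(nn.Module):
      """One DAN stage: encode to a narrower code, decode for reconstruction."""
      def __init__(self, in_dim, code_dim):
          super().__init__()
          self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Tanh())
          self.decoder = nn.Sequential(nn.Linear(code_dim, in_dim), nn.Tanh())

      def forward(self, x):
          code = self.encoder(x)
          return code, self.decoder(code)

  class DARExtractor(nn.Module):
      """Stacked autoencoder stages; the last stage's code serves as the DAR."""
      def __init__(self, in_dim, code_dims=(256, 128, 64)):  # hypothetical sizes
          super().__init__()
          dims = [in_dim, *code_dims]
          self.stages = nn.ModuleList(
              AutoencoderStage(d_in, d_out) for d_in, d_out in zip(dims, dims[1:])
          )

      def forward(self, x):
          for stage in self.stages:
              x, _ = stage(x)  # reconstructions are only needed during training
          return x

  class SegmentClassifier(nn.Module):
      """BiLSTM over the frame-wise DAR sequence, one label per segment."""
      def __init__(self, dar_dim=64, hidden=128, n_classes=3):
          super().__init__()
          self.lstm = nn.LSTM(dar_dim, hidden, batch_first=True,
                              bidirectional=True)
          self.head = nn.Linear(2 * hidden, n_classes)  # e.g. skid/crash/background

      def forward(self, dar_seq):           # (batch, frames, dar_dim)
          out, _ = self.lstm(dar_seq)
          return self.head(out[:, -1])      # logits from the final time step

  # Example: fused input features (e.g. MFCCs plus spectral statistics)
  fused = torch.randn(8, 100, 320)          # 8 segments, 100 frames, width 320
  dar = DARExtractor(in_dim=320)(fused)     # -> (8, 100, 64)
  logits = SegmentClassifier()(dar)         # -> (8, 3)
  ```

  In this sketch each stage would be trained on a reconstruction loss before the codes are passed onward, which is one common way to realize a multi-stage autoencoder; the paper's actual training procedure is not specified in the abstract.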