Audio-Visual Model for Generating Eating Sounds Using Food ASMR Videos
We present an audio-visual model for generating food texture sounds from silent eating videos. We designed a deep network-based model that takes the visual features of detected faces as input and outputs a magnitude spectrogram aligned with the visual stream. Because generating raw waveform samples directly from a given visual stream is challenging, we used the Griffin-Lim algorithm to recover phase from the predicted magnitude spectrogram and generated raw waveform samples via the inverse short-time Fourier transform. Additionally, we produced waveforms from these magnitude spectrograms using an example-based synthesis procedure. To train the model, we created a dataset of food autonomous sensory meridian response (ASMR) videos. We evaluated our model on this dataset and found that the predicted sound features exhibit appropriate temporal synchronization with the visual inputs. Our subjective evaluation experiments demonstrated that the predicted sounds are realistic enough to fool participants in a "real" or "fake" psychophysical experiment.
Main Authors: | Kodai Uchiyama (Graduate School of Science and Engineering, Chiba University, Chiba, Japan); Kazuhiko Kawamoto (Graduate School of Engineering, Chiba University, Chiba, Japan; https://orcid.org/0000-0003-3701-1961) |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2021-01-01 |
Series: | IEEE Access, vol. 9, pp. 50106–50111 |
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2021.3069267 |
Subjects: | Multi-modal deep neural network; autonomous sensory meridian response; eating sound generation |
Online Access: | https://ieeexplore.ieee.org/document/9388653/ |
Record ID: | doaj-799d96852ff649f4b762e772e76d499e (DOAJ) |
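The waveform-generation step described in the abstract (Griffin-Lim phase recovery from a predicted magnitude spectrogram, followed by the inverse short-time Fourier transform) can be illustrated with a short sketch. This is not the authors' implementation: the STFT parameters, sample rate, and the use of librosa are assumptions made purely for illustration.

```python
# Minimal sketch (assumptions, not the paper's code): turn a predicted
# magnitude spectrogram into a waveform with Griffin-Lim phase recovery.
import numpy as np
import librosa
import soundfile as sf


def magnitude_to_waveform(pred_magnitude: np.ndarray,
                          n_fft: int = 1024,
                          hop_length: int = 256,
                          n_iter: int = 60) -> np.ndarray:
    """Estimate phase for a (freq_bins, frames) magnitude spectrogram and
    return a time-domain waveform via the inverse short-time Fourier transform.

    librosa.griffinlim alternates ISTFT and STFT, keeping the given magnitude
    fixed while iteratively refining the phase estimate.
    """
    return librosa.griffinlim(pred_magnitude,
                              n_iter=n_iter,
                              n_fft=n_fft,
                              hop_length=hop_length)


if __name__ == "__main__":
    sr = 16000  # assumed sample rate
    # Stand-in "predicted" magnitude (random values); in the paper this
    # would come from the audio-visual network's output.
    fake_magnitude = np.abs(np.random.randn(513, 200)).astype(np.float32)
    waveform = magnitude_to_waveform(fake_magnitude)
    sf.write("reconstructed.wav", waveform, sr)
```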