Make That Sound More 'Metallic': Towards a Perceptually Relevant Control of the Timbre of Synthesizer Sounds Using a Variational Autoencoder

Bibliographic Details
Main Authors: Fanny Roche, Thomas Hueber, Maëva Garnier, Samuel Limier, Laurent Girin
Format: Article
Language: English
Published: Ubiquity Press 2021-05-01
Series: Transactions of the International Society for Music Information Retrieval
Online Access: https://transactions.ismir.net/articles/76
Description
Summary: In this article, we propose a new method of sound transformation based on control parameters that are intuitive and relevant for musicians. This method uses a variational autoencoder (VAE) model that is first trained in an unsupervised manner on a large dataset of synthesizer sounds. Then, a perceptual regularization term is added to the loss function to be optimized, and a supervised fine-tuning of the model is carried out using a small subset of perceptually labeled sounds. The labels were obtained from a perceptual test of Verbal Attribute Magnitude Estimation in which listeners rated this training sound dataset along eight perceptual dimensions (French equivalents of 'metallic, warm, breathy, vibrating, percussive, resonating, evolving, aggressive'). These dimensions were identified as relevant for the description of synthesizer sounds in a first Free Verbalization test. The resulting VAE model was evaluated by objective reconstruction measures and a perceptual test. Both showed that the model was able, to a certain extent, to capture the acoustic properties of most of the perceptual dimensions and to transform sound timbre along at least two of them ('aggressive' and 'vibrating') in a perceptually relevant manner. Moreover, it was able to generalize to unseen samples even though a small set of labeled sounds was used.
ISSN: 2514-3298
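
Illustrative note: the summary describes a VAE loss extended with a perceptual regularization term that is applied to a small labeled subset during fine-tuning. The exact form of that term is not given in this record, so the following is only a minimal sketch under stated assumptions: the regularizer is taken to be a squared error between the first eight latent means and the eight perceptual ratings, and the function name, the weights `beta` and `alpha`, and the `labeled_mask` argument are all hypothetical.

```python
import numpy as np

def vae_loss_with_perceptual_term(x, x_hat, mu, log_var,
                                  ratings=None, labeled_mask=None,
                                  beta=1.0, alpha=1.0):
    """Hypothetical sketch of a VAE loss with a perceptual regularizer.

    x, x_hat     : (batch, n_features) input and reconstruction
    mu, log_var  : (batch, latent_dim) Gaussian encoder outputs
    ratings      : (batch, k) perceptual ratings (assumed form, k = 8 here)
    labeled_mask : (batch,) booleans marking which samples carry ratings
    """
    # Reconstruction term: squared error summed over features, averaged over batch.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))

    # KL divergence between N(mu, diag(exp(log_var))) and the standard normal prior.
    kl = np.mean(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1))

    loss = recon + beta * kl

    # Assumed perceptual term: pull the first k latent means toward the ratings,
    # using only the labeled samples of the batch.
    if ratings is not None and labeled_mask is not None and labeled_mask.any():
        k = ratings.shape[1]
        diff = mu[labeled_mask, :k] - ratings[labeled_mask]
        loss = loss + alpha * np.mean(np.sum(diff ** 2, axis=1))

    return loss
```

With `ratings=None` the function reduces to a standard VAE objective, which mirrors the unsupervised pre-training stage; passing ratings and a mask corresponds to the supervised fine-tuning stage described in the summary.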