Accompanying website to the paper Synthetic training set generation using text-to-audio models for environmental sound classification, Francesca Ronchini, Luca Comanducci, Fabio Antonacci, accepted for oral presentation @ DCASE Workshop 2024
Abstract
In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper aims to investigate these aspects, specifically focusing on the task of classification of environmental sounds. This study analyzes the performance of two different environmental classification systems when data generated from text-to-audio models is used for training. Two cases are considered: a) when the training dataset is augmented by data coming from two different text-to-audio models; and b) when the training dataset consists solely of generated synthetic audio. In both cases, the performance of the classification task is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas the performance of the models drops when relying on only generated audio.
Audio Examples
On this page, we present audio data generated using AudioLDM2 and AudioGen, both via a simple prompt and via ChatGPT prompts (the latter are labeled AudioLDM2gpt and AudioGengpt). We present results for each of the 10 classes contained in the UrbanSound8K (US8K) dataset: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, street_music. For each class, we present three examples per model.
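Most of the prompts listed on this page share two fixed templates that differ only in the phrase describing the class. The following is a minimal sketch of how such prompts could be assembled; it is not the authors' generation code. The names `SIMPLE_TEMPLATE`, `GPT_TEMPLATE`, `CLASS_PHRASES`, and `build_prompts` are hypothetical, and the wording is copied verbatim from the prompts shown on this page. The classes air_conditioner, children_playing, and street_music deviate slightly from the templates here and are omitted from the sketch.

```python
# Illustrative sketch only: names are hypothetical, prompt wording is copied
# verbatim from the examples listed on this page.
SIMPLE_TEMPLATE = "A clear sound of {phrase} in a urban context."
GPT_TEMPLATE = (
    "Generate a realistic audio representation of the sound of "
    "{phrase} in a urban environment."
)

# Class phrases for the US8K classes whose two prompts fit the templates exactly.
CLASS_PHRASES = {
    "car_horn": "an car horning",
    "dog_bark": "a dog barking",
    "drilling": "a drilling",
    "engine_idling": "an engine idling",
    "gun_shot": "a gun shot",
    "jackhammer": "a jackhammer",
    "siren": "a siren coming from an emergency vehicle",
}


def build_prompts(label: str) -> dict:
    """Return the simple and ChatGPT-style prompt for one class label."""
    phrase = CLASS_PHRASES[label]
    return {
        "simple": SIMPLE_TEMPLATE.format(phrase=phrase),
        "chatgpt": GPT_TEMPLATE.format(phrase=phrase),
    }


if __name__ == "__main__":
    # Print both prompts for every class covered by the sketch.
    for label in CLASS_PHRASES:
        prompts = build_prompts(label)
        print(label, prompts["simple"], prompts["chatgpt"], sep="\n  ")
```

Each returned string could then be passed as the text condition to a text-to-audio model; the model call itself is left out because it depends on the specific library and checkpoint used.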
1) air_conditioner
- Simple prompt: “A clear sound of an air conditiner in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of the sound of an air conditioner in a urban environment.”
US8K example:
AudioGen
AudioGengpt
AudioLDM2
AudioLDM2gpt
2) car_horn
- Simple prompt: “A clear sound of an car horning in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of the sound of an car horning in a urban environment.”
US8K example:
AudioGen
AudioGengpt
AudioLDM2
AudioLDM2gpt
3) children_playing
- Simple prompt: “A clear sound of a children playing between them in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of the sound of children playing between them in a urban environment.”
US8K example:
AudioGen
AudioGengpt
AudioLDM2
AudioLDM2gpt
4) dog_bark
- Simple prompt: “A clear sound of a dog barking in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of the sound of a dog barking in a urban environment.”
US8K example:
AudioGen
AudioGengpt
AudioLDM2
AudioLDM2gpt
5) drilling
- Simple prompt: “A clear sound of a drilling in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of the sound of a drilling in a urban environment.”
US8K example:
AudioGen
AudioGengpt
AudioLDM2
AudioLDM2gpt
6) engine_idling
- Simple prompt: “A clear sound of an engine idling in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of the sound of an engine idling in a urban environment.”
US8K example:
AudioGen
AudioGengpt
AudioLDM2
AudioLDM2gpt
7) gun_shot
- Simple prompt: “A clear sound of a gun shot in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of the sound of a gun shot in a urban environment.”
US8K example:
AudioGen
AudioGengpt
AudioLDM2
AudioLDM2gpt
8) jackhammer
- Simple prompt: “A clear sound of a jackhammer in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of the sound of a jackhammer in a urban environment.”
US8K example:
AudioGen
AudioGengpt
AudioLDM2
AudioLDM2gpt
9) siren
- Simple prompt: “A clear sound of a siren coming from an emergency vehicle in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of the sound of a siren coming from an emergency vehicle in a urban environment.”
US8K example:
AudioGen
AudioGengpt
AudioLDM2
AudioLDM2gpt
10) street_music
- Simple prompt: “A clear sound of street music in a urban context.”
- ChatGPT prompt: “Generate a realistic audio representation of street music in a urban environment.”
US8K example: