Synthetic training set generation using text-to-audio models for environmental sound classification

Accompanying website for the paper “Synthetic training set generation using text-to-audio models for environmental sound classification” by Francesca Ronchini, Luca Comanducci, and Fabio Antonacci, accepted for oral presentation at the DCASE Workshop 2024.

Abstract

In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper aims to investigate these aspects, specifically focusing on the task of classification of environmental sounds. This study analyzes the performance of two different environmental classification systems when data generated from text-to-audio models is used for training. Two cases are considered: a) when the training dataset is augmented by data coming from two different text-to-audio models; and b) when the training dataset consists solely of generated synthetic audio. In both cases, the performance of the classification task is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas the performance of the models drops when relying only on generated audio.

Audio Examples

On this page, we present audio data generated using AudioLDM2 and AudioGen, both via a simple prompt and via ChatGPT prompts (the latter labeled AudioLDM2_gpt and AudioGen_gpt). We present results for each of the 10 classes contained in the UrbanSound8K (US8K) dataset: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, street_music. For each class, we present three examples for each model.
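The two prompt styles listed for each class below follow a regular pattern, which can be sketched as a pair of templates. This is an illustrative reconstruction, not the authors' generation code: the names `SIMPLE_TEMPLATE`, `GPT_TEMPLATE`, `CLASS_PHRASES`, and `build_prompts` are hypothetical, and the class phrases are copied from the ChatGPT-style prompts on this page. A few prompts on the page deviate from the pattern (for example, the street_music ChatGPT prompt omits "the sound of"), so the output is an approximation for those classes.

```python
# Illustrative sketch only; not the authors' generation code.
# Templates are inferred from the prompts listed on this page.
SIMPLE_TEMPLATE = "A clear sound of {phrase} in a urban context."
GPT_TEMPLATE = (
    "Generate a realistic audio representation of the sound of "
    "{phrase} in a urban environment."
)

# Hypothetical mapping from US8K class name to the phrase used in its
# prompts, copied verbatim from the ChatGPT-style prompts on this page.
CLASS_PHRASES = {
    "air_conditioner": "an air conditioner",
    "car_horn": "an car horning",
    "children_playing": "children playing between them",
    "dog_bark": "a dog barking",
    "drilling": "a drilling",
    "engine_idling": "an engine idling",
    "gun_shot": "a gun shot",
    "jackhammer": "a jackhammer",
    "siren": "a siren coming from an emergency vehicle",
    "street_music": "street music",
}


def build_prompts(class_name: str) -> dict[str, str]:
    """Return the simple and ChatGPT-style prompt for one class."""
    phrase = CLASS_PHRASES[class_name]
    return {
        "simple": SIMPLE_TEMPLATE.format(phrase=phrase),
        "gpt": GPT_TEMPLATE.format(phrase=phrase),
    }


if __name__ == "__main__":
    for name in CLASS_PHRASES:
        prompts = build_prompts(name)
        print(name, prompts["simple"], prompts["gpt"], sep="\n  ")
```

Each returned string would then be passed as the text condition to the chosen text-to-audio model. That step is left out here because it depends on the model's own interface.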

1) air_conditioner

Simple prompt: “A clear sound of an air conditiner in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of the sound of an air conditioner in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2

AudioLDM2_gpt

2) car_horn

Simple prompt: “A clear sound of an car horning in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of the sound of an car horning in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2_gpt

3) children_playing

Simple prompt: “A clear sound of a children playing between them in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of the sound of children playing between them in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2_gpt

4) dog_bark

Simple prompt: “A clear sound of a dog barking in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of the sound of a dog barking in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2_gpt

5) drilling

Simple prompt: “A clear sound of a drilling in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of the sound of a drilling in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2_gpt

6) engine_idling

Simple prompt: “A clear sound of an engine idling in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of the sound of an engine idling in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2_gpt

7) gun_shot

Simple prompt: “A clear sound of a gun shot in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of the sound of a gun shot in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2_gpt

8) jackhammer

Simple prompt: “A clear sound of a jackhammer in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of the sound of a jackhammer in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2_gpt

9) siren

Simple prompt: “A clear sound of a siren coming from an emergency vehicle in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of the sound of a siren coming from an emergency vehicle in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2_gpt

10) street_music

Simple prompt: “A clear sound of street music in a urban context.”
ChatGPT prompt: “Generate a realistic audio representation of street music in a urban environment.”

US8K example:

AudioGen

AudioGen_gpt

AudioLDM2_gpt
