Improving Text-To-Audio Models with
Synthetic Captions

Zhifeng Kong*1, Sang-gil Lee*1, Deepanway Ghosal2, Navonil Majumder2,
Ambuj Mehrish2, Rafael Valle1, Soujanya Poria2, Bryan Catanzaro1

2DeCLaRe Lab, Singapore University of Technology and Design, Singapore


It is an open challenge to obtain high-quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods are limited in scale and in the coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new state of the art.
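The abstract does not spell out the internals of the captioning pipeline, so the following is only a minimal sketch of one way such a pipeline could be organized: generate several candidate captions per clip, then keep the candidate that best agrees with the audio. The filtering step, the threshold, and every name here (`build_synthetic_captions`, `toy_captioner`, `toy_scorer`, the toy clips) are assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical toy data standing in for real audio clips and model outputs.
TOY_TAGS = {"clip_a": {"dog", "barking"}, "clip_b": {"rain"}}
TOY_CANDIDATES = {
    "clip_a": ["a dog is barking", "a dog is barking", "music plays"],
    "clip_b": ["people talking", "a car passes"],
}


def toy_captioner(audio_id: str, n: int) -> List[str]:
    """Stand-in for an audio language model that proposes n captions."""
    return TOY_CANDIDATES[audio_id][:n]


def toy_scorer(audio_id: str, caption: str) -> float:
    """Stand-in for an audio-text similarity model: fraction of tags mentioned."""
    tags = TOY_TAGS[audio_id]
    words = set(caption.lower().split())
    return len(tags & words) / len(tags)


def build_synthetic_captions(
    audio_ids: Iterable[str],
    captioner: Callable[[str, int], List[str]],
    scorer: Callable[[str, str], float],
    num_candidates: int = 3,
    min_score: float = 0.5,
) -> Dict[str, str]:
    """Keep, for each clip, the best-scoring caption above min_score (assumed rule)."""
    dataset: Dict[str, str] = {}
    for audio_id in audio_ids:
        # Drop exact duplicates while preserving the order of the candidates.
        candidates = list(dict.fromkeys(captioner(audio_id, num_candidates)))
        scored = [(scorer(audio_id, c), c) for c in candidates]
        kept = [pair for pair in scored if pair[0] >= min_score]
        if kept:
            dataset[audio_id] = max(kept, key=lambda pair: pair[0])[1]
    return dataset


captions = build_synthetic_captions(["clip_a", "clip_b"], toy_captioner, toy_scorer)
```

In this toy run, `clip_b` is dropped because none of its candidates mention its tag, which illustrates how a coherence filter trades dataset size for caption accuracy.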

Model Architecture

Figure 1: TANGO, as depicted in this figure, has three major components: i) a textual-prompt encoder, ii) a latent diffusion model (LDM), and iii) a mel-spectrogram/audio VAE. The textual-prompt encoder (FLAN-T5) encodes the input description of the audio. Subsequently, the textual representation is used to construct a latent representation of the audio, or audio prior, from standard Gaussian noise using reverse diffusion. Thereafter, the decoder of the mel-spectrogram VAE constructs a mel-spectrogram from the latent audio representation. This mel-spectrogram is fed to a vocoder to generate the final audio.
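The data flow in the caption (text encoder, reverse diffusion in latent space, VAE decoder, vocoder) can be traced with a toy NumPy sketch. None of this is the actual TANGO code: the encoder, noise predictor, decoder, and vocoder are small hypothetical stand-ins, the dimensions are arbitrary, and the sampler is a plain DDPM-style ancestral loop chosen for illustration. The noise predictor is an oracle that pretends the clean latent is a fixed linear function of the text embedding, which makes the loop easy to check.

```python
import numpy as np

TEXT_DIM, LATENT_DIM, N_MELS, FRAMES, HOP = 8, 16, 4, 8, 32

# Fixed random matrices standing in for learned weights (hypothetical).
_weights = np.random.default_rng(0)
W_COND = _weights.standard_normal((LATENT_DIM, TEXT_DIM)) / np.sqrt(TEXT_DIM)
W_DEC = _weights.standard_normal((N_MELS * FRAMES, LATENT_DIM)) / np.sqrt(LATENT_DIM)


def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for the FLAN-T5 encoder: a deterministic pseudo-embedding."""
    seed = sum(prompt.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(TEXT_DIM)


def predict_noise(z_t: np.ndarray, abar_t: float, cond: np.ndarray) -> np.ndarray:
    """Toy oracle noise predictor: assumes the clean latent is W_COND @ cond."""
    x0 = W_COND @ cond
    return (z_t - np.sqrt(abar_t) * x0) / np.sqrt(1.0 - abar_t)


def reverse_diffusion(cond: np.ndarray, steps: int = 50, seed: int = 0) -> np.ndarray:
    """DDPM-style ancestral sampling from Gaussian noise to a latent."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abars = np.cumprod(alphas)
    z = rng.standard_normal(LATENT_DIM)  # start from standard Gaussian noise
    for t in range(steps - 1, -1, -1):
        eps = predict_noise(z, abars[t], cond)
        z = (z - betas[t] / np.sqrt(1.0 - abars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            z = z + np.sqrt(betas[t]) * rng.standard_normal(LATENT_DIM)
    return z


def decode_latent(z: np.ndarray) -> np.ndarray:
    """Stand-in for the VAE decoder: latent vector -> (N_MELS, FRAMES) spectrogram."""
    return (W_DEC @ z).reshape(N_MELS, FRAMES)


def vocoder(mel: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Stand-in for the vocoder: one sinusoid per mel bin, HOP samples per frame."""
    t = np.arange(mel.shape[1] * HOP) / sr
    freqs = np.linspace(200.0, 2000.0, mel.shape[0])
    amps = np.repeat(np.abs(mel), HOP, axis=1)
    return (amps * np.sin(2 * np.pi * freqs[:, None] * t[None, :])).sum(axis=0)


def generate(prompt: str) -> np.ndarray:
    """Text -> latent (reverse diffusion) -> mel-spectrogram -> waveform."""
    cond = encode_text(prompt)
    return vocoder(decode_latent(reverse_diffusion(cond)))


waveform = generate("a jazz piece with a saxophone solo")
```

Because the predictor is an oracle, the sampled latent lands on `W_COND @ cond`; in a trained system the noise predictor is a learned network and the latent is only approximately determined by the prompt.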

Comparative Musical Examples (vs. AudioLDM2):

The instrumental music features an ensemble that resembles the orchestra. The melody is being played by a brass section while strings provide harmonic accompaniment. At the end of the music excerpt one can hear a double bass playing a long note and then a percussive noise.
This middle eastern folk song features a male voice. This is accompanied by a string instrument called the oud playing the melody in between lines. A variety of middle-eastern percussion instruments are played in the background. A tambourine is played on every count. A darbuka plays a simple beat. This folk song can be played in a movie scene set in a Moroccan market.
Two female voices are singing in harmony to a ukulele strumming chords. One of the voices is deeper and the other one higher. This song may be jamming together at home.
This is a jazz music piece with a saxophone solo as the lead. There is a clean-toned synth bass in the background creating an epic atmosphere. It would fit perfectly in a movie/TV show setting as a soundtrack opening.
Someone is playing a distorted e-guitar. Finger-tapping a classical music piece. This is an amateur recording and not of the best sound-quality. This song may be playing at home showing off your skills to a friend.
The drum is playing a four on the floor groove with little fill-ins on the snare crash. A synthesizer is playing a 2000s pad sounding lead melody. The distorted e-guitar that is playing underlining the melody with some long sounding chords. This song may be played at an event to animate and pump up the people.
The song is an instrumental. The song is medium tempo with a string section accompaniment, timpani playing and other percussion instruments. There are other funny sounds like ball bouncing and other percussion tones. The song is an ad jingle soundtrack. The audio track is of poor quality.
Someone is playing a distorted e-guitar solo melody with a lot of reverb over another e-guitar finger picking chords. This is an amateur recording. This song may be playing guitar at home.
The instrumental music features a string section playing an ascending long-note melody while at the same time providing harmonic support. A harp is playing arpeggios. Although it feels as if time passes slowly, the tempo is medium. Listening to this music I get a sense of profoundness that's moving me. This music could work well as the soundtrack for a movie.
This music is a percussion instrumental. The tempo is medium fast with a rhythmic tambourine beat. The music is minimal with the sound of the tambourine metal plates clanging and a dull drum head beat. The music is catchy, enthusiastic and has a rhythmic beat.
This song is led by a e-guitar solo playing a lightly distorted guitar on the upper register. An acoustic guitar panned to the left and right side of the speakers is playing chords along with a bass that gets slapped and an acoustic drums in the background. This song may be played at a live concert.
This is an animation theme. There is a brass section playing the melody with the saxophone being the lead. The hi-hat cymbals are played with a feeling of swing on the acoustic drums. The atmosphere is sneaky and filled with intrigue. These characteristics make it a perfect soundtrack for a heist/detective cartoon or an animation movie. It could also work well in the soundtrack of a video game from a similar genre.
This instrumental pop song features an acoustic guitar being played fingerstyle. The notes are being plucked. Guitar harmonics are played in between the plucked notes. This song is played at a moderate tempo. There is no percussion in this song. There are no other instruments in this song. The mood of this song is relaxing. This song can be played in a coffee shop.
This is a jazz piece played in the background of a video game. A trumpet plays the main melody while a xylophone and a bass guitar is supporting the tune in the background. A playful jazz drum beat carries the rhythmic background. Occasional electric guitar fills in the form a strum can be heard. There are a lot of sound effects related to the game such as squeaking, chewing and explosions.
The song is instrumental. The song is medium tempo with a groovy bass line, keyboard harmony, various percussion hits and a steady drumming rhythm. The song has a dance rhythm and is exciting.
The song has a traditional jazzy feel to it. The piano chord progression is bouncy and light. The electric guitar has a chorus applied to it, and we hear various licks being played.
This is an instrumental rock music jam. The only instrument is a clean-sounding electric guitar playing an arpeggio with an added echo effect. The atmosphere is calming. Although the recording quality is a bit low, this piece could be playing in the background of a rock bar after a decent mix is applied.
This is a yodeling music piece. There is a female vocalist that is singing happily in the lead. The melody is provided by medium and high pitch woodwinds. In the background, the bass line is played by an upright bass while the rhythm is provided by an acoustic drum. The atmosphere is very lively. This piece could be used in the soundtrack of a comedy movie or a children's show.
The track features a Christmas song with no vocals. The melody is very simple and it's played by a Schoenhut Piano that's accompanied by subtle strings. The atmosphere is positive and very Christmas-like.
Someone is strumming chords on an e-guitar. The e-guitar is slightly out of tune. This song may be playing at home trying out sounds on the e-guitar.
Animated vocalists singing a choral harmony. The song is medium tempo with no instrumentation, but just voices singing in various harmonic ranges. The song is cheerful and cartoonish. The song is a parody of a modern pop song with western classical music influences.
The ambient song features crickets, frogs and other night insects and animals sound effects, widely spread throughout the stereo image, while there is a reversed dark, mellow piano melody playing. It sounds calming and relaxing.
This music is instrumental. The tempo is fast with an electric guitar playing a bright lead with infectious drumming and syncopated bass lines. The music is upbeat, energetic, youthful, animated, spirited, enthusiastic, groovy and funky.
This is a classical music waltz piece played on a glass harp instrument. The melody is being played on the smaller glasses at a higher pitch while the rhythm is being played on the bigger glass at a medium pitch. The piece is being played at a cathedral which gives a nice resonance and natural reverb effect. The piece has a unique character. It can be played in the soundtracks of children's movies/TV shows.
This is a rock music piece consisting of three instruments. There is a distorted electric lead guitar playing a solo that is increasing from medium to high pitch. The bass guitar is providing tune in the background. There is a loud acoustic drum beat that includes a lot of fills. The song has a vigorous, high energy feel to it. This piece could be played at rock bars.
This song contains a full orchestra playing a melody with long and short strings; flutes adding short melodies on top and a cymbal hit to close the phrase. A mixed choir is singing opera then a male solo singer takes over with a higher voice. This song may be playing in a musical or theater opera performance.
This is a repeated chord progression that's arpeggiated on an electric guitar. The guitar signal has a chorus pedal effect applied to it, and so it doesn't sound like a clean tone. The arpeggio itself is mellow yet uplifting.
This music clip is an instrumental. The tempo is slow with a violin playing sharp, sudden notes. The scale change is random with no harmony or melody. This music sounds like a violin being tuned.
This amateur recording features the main melody being played by a theremin. This is accompanied by a piano. There are no other instruments played in this clip. The mood of this song is relaxed. There are no voices in this song. This song can be played in a slow motion action scene in a superhero movie.
This song is an animated instrumental. The tempo is fast with synthesiser arrangements, intense drumming, groovy bass lines and digital sound effects. The music is upbeat, catchy, compelling, youthful, psychedelic, trance like and intense with an upbeat dance groove. This song is contemporary Synth Pop.
A male vocalist sings this meditative chant. The tempo is slow with keyboard harmony, steady drumming, rhythmic acoustic guitar, tambourine and female vocal backup. The song is a vocal riff of a chant sung melodiously. It is calming, peaceful, meditative, pensive, soothing, prayerful and devotional. This song is sung during Hindu festivities or Pujas.
Someone is playing an acoustic guitar along with someone playing a solo melody on top with another plucked string instrument. They are accompanied by someone playing a djembe or another percussive instrument. This song may be playing sitting around a bonfire.


