On Thursday, researchers from Google announced a new generative AI model called MusicLM that can create 24 kHz musical audio from text descriptions, such as “a calming violin melody backed by a distorted guitar riff.” It can also transform a hummed melody into a different musical style and generate music that runs for several minutes.
MusicLM uses an AI model trained on what Google calls “a large dataset of unlabeled music,” along with captions from MusicCaps, a new dataset composed of 5,521 music-text pairs. MusicCaps gets its text descriptions from human experts and its matching audio clips from Google’s AudioSet, a collection of over 2 million labeled 10-second sound clips pulled from YouTube videos.
Generally speaking, MusicLM works in two main parts. The first takes a sequence of audio tokens (pieces of sound) and maps them to semantic tokens (words that represent meaning) in captions for training. The second receives user captions and/or input audio and generates acoustic tokens (pieces of sound that make up the resulting song output). The system relies on an earlier AI model called AudioLM (introduced by Google in September) along with other components such as SoundStream and MuLan.
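To make the two-part flow a little more concrete, here is a toy Python sketch of a caption passing through a coarse "semantic" stage and then a finer "acoustic" stage. It is not Google's code and does not use any MusicLM, AudioLM, SoundStream, or MuLan API. Every function name, vocabulary size, and token count below is a hypothetical placeholder, and the "tokens" are just hash-derived integers standing in for what trained models would produce.

```python
import hashlib
from typing import List


def _toy_tokens(seed: str, count: int, vocab_size: int) -> List[int]:
    """Derive `count` deterministic pseudo-tokens from a string (illustration only)."""
    tokens = []
    for i in range(count):
        digest = hashlib.sha256(f"{seed}:{i}".encode("utf-8")).digest()
        tokens.append(int.from_bytes(digest[:4], "big") % vocab_size)
    return tokens


def caption_to_semantic_tokens(caption: str, count: int = 8) -> List[int]:
    """Hypothetical stand-in for the first part: caption -> coarse semantic tokens."""
    return _toy_tokens("semantic:" + caption, count, vocab_size=512)


def semantic_to_acoustic_tokens(semantic: List[int], per_semantic: int = 4) -> List[int]:
    """Hypothetical stand-in for the second part: each semantic token is
    expanded into several finer-grained acoustic tokens."""
    acoustic = []
    for position, token in enumerate(semantic):
        acoustic.extend(
            _toy_tokens(f"acoustic:{position}:{token}", per_semantic, vocab_size=1024)
        )
    return acoustic


def generate(caption: str) -> List[int]:
    """Chain the two toy stages: caption -> semantic tokens -> acoustic tokens."""
    return semantic_to_acoustic_tokens(caption_to_semantic_tokens(caption))


if __name__ == "__main__":
    tokens = generate("a calming violin melody backed by a distorted guitar riff")
    print(len(tokens), "toy acoustic tokens")
```

In a real system, a separate audio decoder would turn the acoustic tokens back into a waveform. The sketch stops at the token level because that is where the description above stops.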