
Introduction
Language models, or Large Language Models (LLMs), have disrupted the world around us, especially OpenAI's GPT models, including ChatGPT and Codex. Both of these models can efficiently generate text and code from a given prompt. Trained on large datasets, such models can be used for a variety of NLP tasks, including sentiment analysis, chatbot systems, summarization, machine translation, and document classification. Although these models have limitations, they give us a vision of where LLMs are heading: toward understanding language and powering applications that can enhance human lives. Though many are concerned that they may replace humans in many areas, the idea behind these models is to increase productivity and provide a new way to explore and understand language as a whole.
Since language holds an elemental place in human civilization, it is essential to build language models that decode a given text description and execute the required task, such as generating text, images, audio, or music. This article primarily focuses on the music language model, which is similar to models like ChatGPT and DALL-E, but instead of generating text or images, it creates music.
Music is complicated and very dynamic. It is generally an orchestration of many musical instruments that harmonize together to fit a context. Audio in general spans everything from individual notes to combinations of notes (chords), and from speech sounds such as phonemes and syllables to words and sentences. Creating a mathematical model that can extract information from such a busy dataset is a daunting task. But once such a model is established, we can generate realistic audio similar to what humans produce.
In this article, we will cover the core idea of the music language model and how you can generate music with it. So let's get started.
What is a music language model?
MusicLM, like other language models, uses various machine learning techniques, such as deep learning and natural language processing, to analyze audio and find hidden representations from which to generate music. These models are trained on datasets of music samples to extract information and find patterns, which enables them to learn a wide variety of music styles and genres.
MusicLM can be used to automate various tasks, such as writing a music score by analyzing a piece, recommending a new chord progression for existing music, or generating entirely new sounds. Essentially, it can help introduce new forms of musical expression and creativity. Tools like this can enhance musicians' skills and help them learn in the process.
What is Google MusicLM?
Google MusicLM is a language model capable of generating music when given a text description, for example, "a calming guitar riff in a 6/8 time signature".
MusicLM is similar to other language models, but it is dedicated entirely to music. It was created by the folks at Google and is built on top of AudioLM. AudioLM was developed to produce high-quality, intelligible speech and piano music continuations without using any transcripts or symbolic music representations. It works by converting the input audio into a series of discrete tokens, learning the patterns and structure in those tokens, and then producing audio sequences with long-term consistency.
AudioLM has two tokenizers (see the image below):
- SoundStream tokenizer which produces acoustic tokens
- w2v-BERT tokenizer which produces semantic tokens
Source: https://arxiv.org/pdf/2209.03143.pdf
These tokenizers play a crucial role in information extraction.
Source: https://arxiv.org/pdf/2209.03143.pdf
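To build intuition for what these two tokenizers output, here is a minimal sketch. The functions below are hypothetical stand-ins, not the real SoundStream or w2v-BERT models, and the frame rates and codebook sizes are illustrative assumptions; the point is only that acoustic tokens form a dense, multi-codebook stream while semantic tokens form a coarser single stream.

```python
import numpy as np

SAMPLE_RATE = 24_000  # assumed sample rate; the real models fix their own

def soundstream_tokenize(waveform: np.ndarray, frame_rate: int = 50,
                         n_codebooks: int = 4, codebook_size: int = 1024) -> np.ndarray:
    """Stand-in for the SoundStream tokenizer: returns acoustic tokens,
    one vector of codebook indices per frame, capturing fine acoustic detail."""
    n_frames = int(len(waveform) / SAMPLE_RATE * frame_rate)
    return np.random.randint(0, codebook_size, size=(n_frames, n_codebooks))

def w2v_bert_tokenize(waveform: np.ndarray, frame_rate: int = 25,
                      vocab_size: int = 1024) -> np.ndarray:
    """Stand-in for the w2v-BERT tokenizer: returns semantic tokens,
    one discrete id per frame, capturing high-level, long-term structure."""
    n_frames = int(len(waveform) / SAMPLE_RATE * frame_rate)
    return np.random.randint(0, vocab_size, size=(n_frames,))

audio = np.random.randn(SAMPLE_RATE * 2)   # two seconds of placeholder audio
acoustic_tokens = soundstream_tokenize(audio)
semantic_tokens = w2v_bert_tokenize(audio)
print(acoustic_tokens.shape, semantic_tokens.shape)  # e.g. (100, 4) (50,)
```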
Now, let us look at the three hierarchical stages that AudioLM possesses:
- Semantic modeling: This stage captures long-term structural coherence by extracting the high-level structure of the input signal.
- Coarse acoustic modeling: This stage predicts coarse acoustic tokens conditioned on the semantic tokens from the previous stage.
- Fine acoustic modeling: The third stage adds even more detail by predicting fine acoustic tokens conditioned on the coarse acoustic tokens. Finally, the acoustic tokens are fed to the SoundStream decoder to reconstruct a waveform.
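Putting the three stages together, the generation pipeline can be sketched as below. Everything here is a stand-in with random outputs and made-up shapes; in AudioLM each stage is its own autoregressive model and the final decoder is the trained SoundStream model.

```python
import numpy as np

def predict_semantic_tokens(prefix: np.ndarray, n_new: int) -> np.ndarray:
    """Stage 1: continue the semantic token sequence (long-term structure)."""
    return np.concatenate([prefix, np.random.randint(0, 1024, size=n_new)])

def predict_coarse_acoustic(semantic: np.ndarray) -> np.ndarray:
    """Stage 2: predict coarse acoustic tokens conditioned on semantic tokens
    (a few codebooks per frame; the 2x frame ratio is an assumption)."""
    return np.random.randint(0, 1024, size=(len(semantic) * 2, 4))

def predict_fine_acoustic(coarse: np.ndarray) -> np.ndarray:
    """Stage 3: add the remaining fine codebooks on top of the coarse tokens."""
    fine = np.random.randint(0, 1024, size=(coarse.shape[0], 8))
    return np.concatenate([coarse, fine], axis=1)

def soundstream_decode(acoustic: np.ndarray, frame_rate: int = 50,
                       sample_rate: int = 24_000) -> np.ndarray:
    """Stand-in for the SoundStream decoder mapping tokens back to a waveform."""
    return np.random.randn(acoustic.shape[0] * sample_rate // frame_rate)

prompt_semantic = np.random.randint(0, 1024, size=50)  # tokens of an audio prompt
semantic = predict_semantic_tokens(prompt_semantic, n_new=100)
acoustic = predict_fine_acoustic(predict_coarse_acoustic(semantic))
waveform = soundstream_decode(acoustic)
print(waveform.shape)
```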
MusicLM leverages AudioLM’s multi-stage autoregressive modeling as the generative component while extending it to incorporate text conditioning. See the image below.
Source: https://arxiv.org/pdf/2301.11325.pdf
The audio file is passed into three components: SoundStream, w2v-BERT, and MuLan. We already discussed how SoundStream and w2v-BERT work; they both process and tokenize the input audio signal. MuLan, on the other hand, is a joint embedding model for music and text. It has two embedding towers, one for each modality, i.e., text and audio.
Source: https://arxiv.org/pdf/2301.11325.pdf
So essentially, the audio is fed into all three components, but the text description is fed only to MuLan. The MuLan embeddings are quantized in order to provide a homogeneous representation based on discrete tokens for both the conditioning signal and the audio. The MuLan tokens are then passed to the semantic modeling stage, where the model learns the mapping from MuLan tokens to semantic tokens (at training time the MuLan tokens come from the audio; at inference time they come from the text description). The rest of the process is similar to AudioLM. For a better understanding, see the image below.
Source: https://arxiv.org/pdf/2301.11325.pdf
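To make the text-conditioning step concrete, the sketch below quantizes a MuLan-style embedding into a short sequence of discrete tokens with a residual vector quantizer (RVQ) and prepends them to the semantic token sequence. The codebooks are random; the sizes (a 128-dimensional embedding, 12 quantizers of 1024 entries each) follow the setup reported in the paper, but the code itself is only an illustration, not MusicLM's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, N_QUANTIZERS, CODEBOOK_SIZE = 128, 12, 1024
codebooks = rng.normal(size=(N_QUANTIZERS, CODEBOOK_SIZE, EMBED_DIM))  # random, untrained

def quantize_mulan(embedding: np.ndarray) -> np.ndarray:
    """Residual vector quantization: each quantizer encodes what the previous left over."""
    residual, tokens = embedding.copy(), []
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebook[idx]
    return np.array(tokens)

text_embedding = rng.normal(size=EMBED_DIM)      # stand-in for MuLan's text-tower output
mulan_tokens = quantize_mulan(text_embedding)    # 12 discrete conditioning tokens

# Conditioning: the semantic stage sees [MuLan tokens | semantic tokens] and learns to
# continue the semantic sequence given the text-derived (or audio-derived) prefix.
semantic_tokens = rng.integers(0, 1024, size=100)
model_input = np.concatenate([mulan_tokens, semantic_tokens])
print(model_input.shape)  # (112,)
```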
Because MusicLM is built on top of AudioLM and MuLan, it provides three advantages:
- It can generate music with text description.
- It can take a melody as input to extend its functionality. For instance, if you provide a hummed melody and ask MusicLM to render it as a guitar riff, it can do that.
- It can generate long, consistent sequences across a variety of musical instruments.
Dataset
The dataset used to train MusicLM comprises more than 200,000 hours of music. For evaluation, Google released MusicCaps, a dataset of 5.5k music-text pairs with rich text descriptions provided by human experts, which is available on Kaggle at this link: https://www.kaggle.com/datasets/googleai/musiccaps.
Overview of the dataset.
Source: https://www.kaggle.com/datasets/googleai/musiccaps.
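If you want to explore MusicCaps yourself, a minimal sketch with pandas is shown below. It assumes the Kaggle download contains a single CSV (named musiccaps-public.csv at the time of writing) with a YouTube id, clip start and end times in seconds, and a free-text caption per row; the column names may differ in your copy, so check them first.

```python
import pandas as pd

# Assumed file and column names; verify against your download of the Kaggle dataset.
df = pd.read_csv("musiccaps-public.csv")

print(len(df), "captioned clips")
print(df.columns.tolist())

# Peek at one caption and the 10-second YouTube clip it describes.
row = df.iloc[0]
print(row["ytid"], row["start_s"], row["end_s"])
print(row["caption"])
```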
Generating music using MusicLM
Unfortunately, citing the need for additional work, Google states that it has "no plans to distribute models at this stage." But the paper released by Google contains numerous examples that demonstrate how music can be generated from a text description.
Here are ways in which you can generate music:
- Rich Captions: For example “The main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls.”
- Long Generation: It can generate around five minutes of continuous, consistent, and high-fidelity audio from short text prompts like "heavy metal" or "soothing reggae".
- Story Mode: This is one of the best features of MusicLM: you can instruct the model to generate a music sequence by providing a series of text prompts, each with a time range. For example, "time to meditate (0:00-0:15), time to wake up (0:15-0:30), time to run (0:30-0:45), time to give 100% (0:45-0:60)". A sketch of how such a prompt could be structured follows after this list.
- Text and Melody Conditioning: You can also produce music that adheres to a provided melody (such as humming or whistling) while respecting the text prompt, essentially transforming an input audio sequence into the desired audio sequence.
- Painting Caption Conditioning: This essentially means that you can generate music from a painting's description. For example, "His melting-clock imagery mocks the rigidity of chronometric time. The watches themselves look like soft cheese—indeed, by Dalí's own account, they were inspired by hallucinations after eating Camembert cheese. In the center of the picture, under one of the watches, is a distorted human face in profile. The ants on the plate represent decay." (Gromley, Jessica. "The Persistence of Memory". Encyclopedia Britannica, 14 Apr. 2022.)
- Places: You can generate music through a place description. For instance, “a sunny and peaceful time by the beach”.
- Other examples include:
- 10s Audio Generation From Text
- Musician Experience Level
- Epochs
- Accordion Solos
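MusicLM has no public API, so there is nothing official to call, but the sketch below shows one way a story-mode request could be structured as data: a list of captions with start and end times, rendered into the prompt format used in the paper's examples. The StorySegment class and the rendering helpers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StorySegment:
    caption: str   # text prompt for this part of the piece
    start_s: int   # segment start, in seconds
    end_s: int     # segment end, in seconds

story_prompt = [
    StorySegment("time to meditate", 0, 15),
    StorySegment("time to wake up", 15, 30),
    StorySegment("time to run", 30, 45),
    StorySegment("time to give 100%", 45, 60),
]

def mmss(seconds: int) -> str:
    """Format seconds as m:ss, e.g. 75 -> '1:15'."""
    return f"{seconds // 60}:{seconds % 60:02d}"

def render_prompt(segments: list[StorySegment]) -> str:
    """Render the segments the way the paper's story-mode examples write them."""
    return ", ".join(f"{s.caption} ({mmss(s.start_s)}-{mmss(s.end_s)})" for s in segments)

print(render_prompt(story_prompt))
# time to meditate (0:00-0:15), time to wake up (0:15-0:30),
# time to run (0:30-0:45), time to give 100% (0:45-1:00)
```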
Final Words
The goal of music language models is to enable computers to understand and generate music in a way that is similar to human-made music, and to use this understanding to create new and innovative musical works. MusicLM is capable of generating music with remarkably high fidelity, which says as much about the capability of the human mind as it does about AI itself, because we are now building systems that can handle aspects of general intelligence.
Although this is impressive, it raises ethical concerns and will likely face backlash from the music community, much as we saw with the release of image-generating models such as DALL-E and Midjourney, and with ChatGPT.
The Google researchers are aware of numerous ethical issues that a system like MusicLM raises, including a propensity for incorporating copyrighted content from training data into the produced songs. During an experiment, they discovered that 1% of the music the system produced directly replicated the songs on which it had been trained. This percentage was apparently too high for them to release MusicLM in its current form.
So it is unlikely that we will see MusicLM as a public application anytime soon. But other open-source music models that reverse-engineer this approach will almost certainly appear, and not necessarily from responsible developers.