Improvised Sessions with Real-time Music Generation AI
AI DJ Project#2 Ubiquitous Rhythm is an improvisational DJ performance that uses AI to generate music in real-time. Without preparing music sequences in advance, the DJ controls the music composed (generated) by the AI on the spot, another AI reacts to the sound, and the performance unfolds. The complex interaction between the multiple AI models and the DJ creates a unique and organic musical experience.
Within this performance, one AI model continuously generates two-bar rhythm patterns and corresponding basslines. Another AI model keeps selecting loops that fit the rhythm and bassline. The DJ listens to the AI-generated parts and adjusts the sound of the drum machine and synthesizer on the spot. The DJ also controls the volume and audio effects of each track to build up the musical development.
The DJ can also use turntables to mix records with the AI-generated music. Then, AI selects a rhythm loop that would perfectly fit the mixed sound and uses it as input to the rhythm generation model to produce a new rhythm pattern. This way, the DJ intervenes as a disturbance in the interaction of multiple AI models to create a fluctuating feedback loop for music generation.
The performance realizes a new form of DJing that no human DJ could achieve alone: composing music in real time. It is also an exceptionally improvisational and unpredictable performance. Although it relies on machines, each performance is a one-off, and the same performance can never be created twice. Furthermore, the DJ cannot predict what kind of musical sequences the AI will play. The AI always comes up with unexpected new phrases, as if it were testing the DJ's creativity and musicality.
As a result, the DJ is expected to act like an “AI Jockey” who can “tame” and “ride” machines and AI instead of discs (records). The symbiotic relationship between AI and the human DJ in the performance offers a glimpse of the future of creativity, where all creators will, and perhaps should, be AI Jockeys.
Since around 2015, we have been developing an AI DJ system and giving DJ performances worldwide. By having a human DJ (mainly Tokui, the CEO of Qosmo) and the AI take turns selecting songs one at a time, the project was designed to realize a true “interaction” and “communication” between humans and AI through music.
The motivation behind the project was not to automate DJing in any way but to explore the essence of DJing by imitating human actions as AI. We also use AI’s somewhat surprising music selections to bring the DJs out of their comfort zone. In 2019, we were even invited to give a keynote preshow at Google I/O and had the opportunity to perform AI DJing in front of an audience of over 15,000 people.
Two years have passed since then. While the previous AI DJ was an attempt to let AI take over some of the actions of a human DJ, this time we tried to do something impossible for any human DJ: compose music in real time and play it on the spot during a performance, with the help of AI. It might not be too far-fetched to call it an attempt to redefine the very act of DJing, which has meant selecting and playing precomposed songs fixed on recorded media.
The following three AI models for music generation were used to realize this performance. We explain them in order.
The rhythm generation system, which is the core of the performance, is based on a software plugin that Tokui has been developing since 2019. Created for Ableton Live, one of the most popular digital audio workstations (DAWs) for music production, this plugin allows artists to drag and drop music data (MIDI data) and train their own rhythm generation models without programming or other hassles.
This model, based on the Variational Autoencoder (VAE) architecture, uses a neural network to learn how to compress (encode) complex rhythmic patterns into low-dimensional vectors (two-dimensional vectors in the case of this plugin) and then restore (decode) the original data. Once the training process is complete, the network can generate numerous rhythm patterns by inputting this low-dimensional vector to the decoder.
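The decoding step described above can be sketched as follows. This is only an illustration, not the plugin's code: the decoder here is a stand-in with random weights rather than a trained network, and the 32-step grid, the three-instrument kit, and all function names are assumptions made for the example.

```python
import math
import random

STEPS = 32                                # assumption: two bars of 4/4 at 16th notes
INSTRUMENTS = ["kick", "snare", "hihat"]  # assumption: a reduced drum kit
LATENT_DIM = 2                            # the plugin uses two-dimensional latent vectors


def make_toy_decoder(seed=0):
    """Build a stand-in decoder with random weights (not a trained model)."""
    rng = random.Random(seed)
    n_out = STEPS * len(INSTRUMENTS)
    weights = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(n_out)]
    biases = [rng.gauss(0, 1) for _ in range(n_out)]

    def decode(z):
        # A linear layer followed by a sigmoid gives one hit probability per cell.
        probs = []
        for w, b in zip(weights, biases):
            logit = sum(wi * zi for wi, zi in zip(w, z)) + b
            probs.append(1.0 / (1.0 + math.exp(-logit)))
        return probs

    return decode


def to_pattern(probs, threshold=0.5):
    """Threshold the probabilities into a {instrument: [0/1 per step]} grid."""
    pattern = {}
    for i, name in enumerate(INSTRUMENTS):
        row = probs[i * STEPS:(i + 1) * STEPS]
        pattern[name] = [1 if p >= threshold else 0 for p in row]
    return pattern


decode = make_toy_decoder()
pattern_a = to_pattern(decode([0.0, 0.0]))   # one point in the latent space
pattern_b = to_pattern(decode([1.5, -0.8]))  # another point, another pattern
```

Because the latent space has only two dimensions, every point on a plane corresponds to one rhythm pattern, which is what makes it practical to browse patterns by moving a single 2D control.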
To control the characteristics of the generated patterns to a certain extent, the density of the bass drum and hi-hats can be input as conditions for rhythm generation. In this way, DJs can loosely control the generated rhythms by changing these conditions during the performance. In addition to commercial collections of MIDI data of dance music, we also used drum patterns of experimental electronic music as training data for this performance.
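One common way to implement this kind of conditioning is to append the condition values to the latent vector before decoding. The sketch below assumes that approach; the text does not state how the plugin wires its conditions internally, and `density` and `build_decoder_input` are hypothetical helper names.

```python
def density(steps):
    """Fraction of steps that contain a hit, between 0.0 and 1.0."""
    return sum(1 for s in steps if s) / len(steps)


def clamp01(value):
    """Keep a condition value inside the valid density range."""
    return max(0.0, min(1.0, value))


def build_decoder_input(z, kick_density, hihat_density):
    """Concatenate the latent vector with the two density conditions.

    Assumption: the conditions are simply appended to z, as in a typical
    conditional VAE; the real plugin may combine them differently.
    """
    return list(z) + [clamp01(kick_density), clamp01(hihat_density)]


# During training, densities can be measured from each example pattern.
four_on_the_floor = [1, 0, 0, 0] * 8   # kick on every beat over two bars
offbeat_hats = [0, 0, 1, 0] * 8        # hi-hat on every off-beat

kick_d = density(four_on_the_floor)
hat_d = density(offbeat_hats)

# During performance, the DJ would set these values by hand instead.
decoder_input = build_decoder_input([0.1, -0.2], kick_d, hat_d)
```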
The bassline, along with the rhythm, is an essential element that forms the foundation of a dance music track. It’s no exaggeration to say that the drums and bassline determine most of the groove of a song.
What if we think of a drum pattern and a bassline in a track as both “speaking” the same rhythmic concept on two different instruments, just as “cat” in English and “ネコ” in Japanese refer to the same fluffy animal in two different languages? If we can train an AI model that translates Japanese into English, we can also create a model that “translates” drum patterns into basslines by collecting a large amount of music data. This is the original idea behind the second model.
First, we collected a large number of MIDI files. We then extracted the parts where the drums and bass are playing simultaneously. Then, using the drum patterns as input, we trained a model (seq2seq model using LSTM) that predicts the bassline that would be played simultaneously with the input drum pattern. We have also packaged the trained model as a plugin for Ableton Live, as in the case of the rhythm generation model.
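The data preparation step (keeping only the passages where drums and bass play together) might look roughly like this. It is a minimal sketch under assumed conventions: notes are `(start_beat, pitch)` tuples already parsed from MIDI, windows are two bars of 4/4, and the function names are hypothetical.

```python
BEATS_PER_WINDOW = 8  # assumption: two bars of 4/4


def split_into_windows(notes, window=BEATS_PER_WINDOW):
    """Group (start_beat, pitch) notes by window index.

    Start times are rewritten relative to the beginning of their window.
    """
    windows = {}
    for start, pitch in notes:
        idx = int(start // window)
        windows.setdefault(idx, []).append((start - idx * window, pitch))
    return windows


def paired_examples(drum_notes, bass_notes):
    """Return (drum_window, bass_window) pairs where both parts are playing."""
    drums = split_into_windows(drum_notes)
    bass = split_into_windows(bass_notes)
    shared = sorted(set(drums) & set(bass))
    return [(sorted(drums[i]), sorted(bass[i])) for i in shared]


# Toy data: pitches 36 and 38 follow the General MIDI kick and snare keys.
drum_notes = [(0.0, 36), (1.0, 38), (8.0, 36), (17.0, 36)]
bass_notes = [(8.5, 40), (9.0, 43), (16.0, 40), (24.0, 45)]

# Window 0 has drums only and window 3 has bass only, so both are dropped.
pairs = paired_examples(drum_notes, bass_notes)
```

Each resulting pair gives one training example: the drum window as the input sequence and the bass window as the target sequence.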
The third model selects appropriate loops (melodies, vocal samples, sound effects, i.e., anything other than rhythm and bassline) for the generated rhythm and bass. For example, a funky synth riff or vocal sample might be a good fit for a house music track. In contrast, a distorted guitar riff might be a better fit for a rock-style 8-beat rhythm.
If you’ve ever produced a song in Ableton Live, you’ve probably wondered which of the many loops to select. The role of the third model is to find loops that match the rhythm and bass among the countless options.
To train this model, we first collected a large number of songs and cut them into a large number of two-bar loops (this process also uses another machine learning technique). We then used sound source separation techniques to extract the rhythm, bass, and other tracks (piano, guitar melody, etc.) from the collected loops.
While rhythm/bass and bass/melody pairs taken from the same song should work well, combinations of loops taken from two randomly selected songs are not guaranteed to be a good match. Sometimes the combination can be unbearable to listen to. So, we trained a model to predict the compatibility of a given pair of loops, treating pairs taken from the same song as “good combinations” and pairs taken from two randomly selected songs as “bad combinations” (a CNN-based Siamese Network trained with Triplet Loss).
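For reference, the standard triplet loss can be written in a few lines. The sketch below uses plain lists as embeddings and a margin of 1.0; both are assumptions for illustration, and in the real system the embeddings would come from the CNN rather than being typed in by hand.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor:   embedding of, e.g., a rhythm/bass loop
    positive: embedding of a loop taken from the same song
    negative: embedding of a loop taken from a different song
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]

# The matching loop is already much closer than the mismatched one: no penalty.
loss_ok = triplet_loss(anchor, positive=[0.0, 1.0], negative=[3.0, 4.0])

# The mismatched loop is closer than the matching one: a large penalty.
loss_bad = triplet_loss(anchor, positive=[3.0, 4.0], negative=[0.0, 1.0])
```

Minimizing this loss pushes loops from the same song together in the embedding space and pushes loops from unrelated songs at least `margin` further away.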
With this model, it is possible to search for the most appropriate loops among many.
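Once every loop has an embedding, the search can be a simple nearest-neighbor ranking. The following is a sketch under the assumption that a smaller embedding distance means a better match; the loop names and two-dimensional embeddings are made up for the example.

```python
import math


def rank_loops(query_embedding, loop_embeddings, top_k=3):
    """Return the names of the top_k loops closest to the query embedding."""
    scored = sorted(
        loop_embeddings.items(),
        key=lambda item: math.dist(query_embedding, item[1]),
    )
    return [name for name, _ in scored[:top_k]]


# Hypothetical precomputed embeddings for a tiny loop library.
library = {
    "synth_riff": [0.9, 0.1],
    "vocal_chop": [0.2, 0.8],
    "rain": [-0.5, -0.5],
}

# Hypothetical embedding of the rhythm and bass currently playing.
best = rank_loops([1.0, 0.0], library, top_k=2)
```

Because the library embeddings can be computed ahead of time, only the currently playing rhythm and bass need to be embedded during the performance.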
In this performance, we used three different sets of sound sources: the “other (melody, etc.)” part taken from a publicly available music dataset, field-recorded soundscape sounds (environmental sounds, etc.), and sounds with human voices explicitly selected from the soundscape dataset. We used the same machine learning techniques to extract two-measure loops from each of them.
The performance starts by randomly generating a rhythm pattern. Then, the bassline model “translates” the rhythm into a bassline. Next, the loop selection model picks appropriate loops that fit the rhythm and bass sound from the respective loop source set (ambient sound, voice, music).
The DJ can use turntables to mix in the sound of a record. The DJ can also manipulate the timbre and volume of a drum machine and a synthesizer (in this case, ARTURIA DrumBrute Impact and SEQUENTIAL Prophet 6) to create a musical development. The DJ controls the “Conditioning” function of the rhythm generation model to create a broader flow in the performance. The DJ’s manipulations are reflected in the sounds input to the loop selection model, which indirectly affects the loop selection.
We exported all rhythm patterns used to train the rhythm generation model as audio loops in advance. Hence, the loop selection model can also select suitable rhythm patterns. Then, the selected rhythm pattern is used as an input to the rhythm generation model. It, in turn, triggers the development of the following rhythm (remember that the rhythm generation model uses a VAE model consisting of an encoder and decoder).
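The feedback loop between loop selection and rhythm generation can be illustrated with a toy simulation. Everything here is a stand-in: `encode` and `decode` are trivial density-based functions rather than a trained VAE, the "mix energy" number replaces the actual audio of the mixed sound, and the selection rule ignores the generated pattern, which the real system would hear as part of the mix.

```python
def density(steps):
    """Fraction of steps that contain a hit."""
    return sum(steps) / len(steps)


def encode(loop):
    """Toy encoder: a 2D code made of the density of each half of the loop."""
    half = len(loop) // 2
    return [density(loop[:half]), density(loop[half:])]


def decode(z, steps_per_half=4):
    """Toy decoder: turn each density value back into a run of hits."""
    pattern = []
    for value in z:
        hits = round(value * steps_per_half)
        pattern += [1] * hits + [0] * (steps_per_half - hits)
    return pattern


def select_rhythm_loop(mix_energy, library):
    """Stand-in for the loop selection model: match loop density to the mix."""
    return min(library.values(), key=lambda loop: abs(density(loop) - mix_energy))


def run_feedback_loop(z, mix_energies, library):
    """Generate a pattern, pick a fitting loop, and re-encode it each round."""
    history = []
    for energy in mix_energies:
        history.append(decode(z))
        chosen = select_rhythm_loop(energy, library)
        z = encode(chosen)  # the selected loop seeds the next pattern
    return history


library = {
    "sparse": [1, 0, 0, 0, 1, 0, 0, 0],
    "busy": [1, 1, 1, 0, 1, 1, 1, 1],
}
# The DJ mixes in a busy record, then a quiet one, then a busy one again.
history = run_feedback_loop([0.5, 0.5], [0.9, 0.2, 0.9], library)
```

In this toy run, each change in the mix shows up in the pattern generated one round later, which is the kind of delayed, indirect influence the text describes.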
This way, the direct and indirect interactions and feedback loops between the three AI models and a single DJ produce a suitable degree of unpredictability and fluctuation in the generated music.
The visual in this performance has two layers: an interface layer to directly represent what is happening in the AI system and a visual expression layer. Using Augmented Reality (AR) technology, the interface is displayed virtually in the space in front of the DJ, showing the AI-generated sequences and updating them moment by moment. As the music unfolds, minimalistic visualizations of the sound are projected behind the DJ, enhancing the atmosphere of the performance. (The AR camera system was created and provided by the visual team of Dentsu Craft Tokyo.)
| 2021/10/28 | Creativity4Better (Bucharest) | Online Screening |
Nao Tokui (Qosmo, Inc.)
Shoya Dozono (Qosmo, Inc.)
Ryosuke Nakajima (Qosmo, Inc.)
Hiroyoshi Murata (Dentsu Creative X, Inc.), Yuki Tanabe (Dentsu Creative X, Inc.)
Sota Suzuki (Dentsu Creative X, Inc.)
Ryotaro Omori (Dentsu Creative X, Inc.)
Naoki Ise (Qosmo, Inc.)