The minimum requirement for a DJ is to maintain the “flow” of music, so it is a common practice to select a next track, which sounds somewhat similar to what is being played, but has something new in its rhythm structure/sound texture. . . etc at the same time. Also, DJs usually use instruments or sometimes prominent drum-machine sounds used in tracks as clues for music selection (i.e., a track with piano solo to a track with organ riff, Two tracks both with Roland TR-808 snare)
Based on these observations, we trained three different neural networks. Our models and datasets used for each model are the following:
- Genre Inference (wasabeat dance music dataset)
- Instrument Inference (IRMAS dataset)
- Drum Machine Inference (200.Drum.Machines dataset)
Each model is a convolutional neural network similar to , which takes spectrogram images of sounds and infers genres(minimal techno/tech house/hip-hop. . . ), instruments(piano/trumpet. . . ) and drum machines (TR-808/TR-909. . . ).
Once we got the network trained, we can use the same model to extract auditory features in a high dimensional vector. When human DJ is playing, the system feeds the incoming audio into the model and generate a feature vector. The vector will be compared with those of all tracks in our pre-selected record box (with over 350 tracks for the present), so that the system can select the closest track, which presumably has similar musical tone/mood/texture, as the next track to play.
It’s worth noting that we initially collected and analyzed DJ playlist dataset (visualized in the image) and used it to select the most likely candidate according to the data as in the collaborative filter. We soon realized, however, that it ended up banal music selections, then decided to ignore all metadata associated with the music (genre, artist name, label, etc.) and focus only on the audio data.