First appearance: Nao Tokui on Medium.
This is a blog post on my latest project, Imaginary Soundscape: Cross-Modal Approach to Generate Pseudo Sound Environments [PDF], presented at NIPS 2017 workshop, Machine Learning for Creativity and Design.
Imaginary Soundscape is a web-based sound installation, where viewers can freely walk around Google Street View and immerse themselves in the artificial soundscape “imagined” by our deep learning models. (I know “imagine” is not a scientifically accurate word for the process. That’s why I put it in quotes.)
Please note that the site currently works only in Chrome and Firefox on desktop. Support for Safari and smartphone browsers is coming later.
This is a sound file I recorded myself in 2002. Back then, I was deeply into field recording and always carried a hand-made binaural microphone and an MD recorder. Every time I listen to it, it brings back the memory of my first trip abroad, along with the anxiety, nervousness, and excitement I felt as a twenty-something student.
Needless to say, sound has a unique power to bring scenes and memories to mind (often more potently than vision). When you hear something, you can imagine sights (and even smells) based on the sound, as if you were there. I have long been intrigued by the concept of the “soundscape” and the imagination sound can stir. Back in 2015, I built a microphone for 360-degree cameras based on an old technique called Ambisonics.
A soundscape is any collection of sounds, almost like a painting is a collection of visual attractions. When you listen carefully to the soundscape it becomes quite miraculous.—R. Murray Schafer
In 2016, I was fortunate to get involved with a music video for Brian Eno (he is THE idol for me!!). The Ship — Brian Eno’s Generative Movie was a website where a CNN-based system continuously analyzed incoming photos from various news sources, found “similar” photos in a historical photo archive, and juxtaposed them at a very, very slow pace. The concept was to have an AI look back at the history of humankind.
When we tested the system, something like the following happened frequently (left: input photo / right: photo picked by the AI). It’s easy to say “it’s not correct,” but at the same time we found it fascinating, as if the AI were fantasizing based on the input images.
These interests in the soundscape and fantasizing AI led to my latest project, Imaginary Soundscape.
As I wrote, one can imagine scenes from a sound. Conversely, by taking a glance at a photo, we can imagine sounds we might hear if we were there. Can an AI system do the same? If so, what if we apply the method to images from Google Street View, so that we can walk around with the generated soundscape? This relatively straightforward fantasy ended up as a website called Imaginary Soundscape.
By taking a glance at a photo, we can imagine sounds we might hear if we were there. Can AI do the same?
Many researchers around the world have been working on cross-modal information retrieval (image-to-audio, sound-to-image, sound-to-text, and so on) using deep learning. We have introduced some of this research on our website, createwith.ai (in Japanese). Our implementation of this project was based on research done by researchers at MIT.
In this research, they used two types of convolutional neural networks (CNNs): one for images of video frames and the other for spectrogram images of audio, trained on the Flickr 100M video dataset. For the images, pre-trained standard CNN models for image recognition were used (namely a VGG model trained on ImageNet and PlacesNet trained on the Places dataset). They then trained the second network (SoundNet) so that, for an input sound and its corresponding video frame, the output distribution of SoundNet matches the output of the well-established pre-trained image models as closely as possible.
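The training objective described above is a form of teacher-student distillation: the image CNN acts as a fixed teacher, and SoundNet is pushed to reproduce its output distribution. A minimal sketch of that loss in NumPy, with purely illustrative logits (the real networks, datasets, and training loop are not shown here):

```python
import numpy as np

def softmax(x):
    # Convert raw network outputs (logits) into a probability distribution
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q): how far the student distribution q is from the teacher p
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Teacher: pre-trained image CNN output on a video frame (illustrative logits)
teacher_logits = np.array([2.0, 0.5, -1.0, 0.1])
# Student: SoundNet output on the corresponding audio clip
student_logits = np.array([1.5, 0.7, -0.5, 0.0])

# Training would minimize this value over many (frame, audio) pairs
loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

Minimizing this divergence over millions of (frame, audio) pairs is what lets the sound network "inherit" the visual categories of the image networks without any audio labels.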
Once trained, the rest was straightforward. For a given image from Google Street View, we find the best-matched sound file in a pre-collected dataset: the one whose SoundNet output is most similar to the output of the CNN model for the image. For the sound dataset, we collected 15,000 sound files published on the internet under Creative Commons licenses and filtered them with another CNN model, trained on spectrograms to distinguish environmental/ambient sound from other types of audio (music, speech, etc.).
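The retrieval step can be sketched as a nearest-neighbor search: pre-compute SoundNet outputs for every file in the sound library, then pick the file whose output is closest to the image CNN's output. This toy version uses cosine similarity and made-up three-dimensional features; the file names and vectors are illustrative, not from the actual system:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed SoundNet outputs for a tiny sound library
sound_library = {
    "street_ambience.wav": np.array([0.9, 0.1, 0.0]),
    "church_reverb.wav":   np.array([0.1, 0.8, 0.1]),
    "seaside_waves.wav":   np.array([0.0, 0.2, 0.8]),
}

def best_match(image_features, library):
    # Return the sound whose SoundNet output is closest to the image features
    return max(library, key=lambda name: cosine_sim(image_features, library[name]))

# e.g. image CNN output for a cathedral photo (illustrative)
image_features = np.array([0.15, 0.75, 0.1])
print(best_match(image_features, sound_library))  # → church_reverb.wav
```

In practice the library holds thousands of files, so the similarities would be computed in one batched matrix operation rather than a Python loop.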
Our short paper on this project got accepted at NIPS 2017 workshop, Machine Learning for Creativity and Design. If you are interested in more detail about the project, please refer to the paper.
Here is a short video to show how the website works. I find it interesting to see/hear that the model (apparently) takes into account the acoustic features of the scenery. For example, in a scene of a church (La Sagrada Familia in Barcelona), you can hear sounds with very strong reverberation.
The system cannot always infer a matching sound accurately and makes mistakes (in the video, the model seems to mistake the inside of the Tokyo Dome ballpark for a racing circuit or something similar). Even so, or perhaps for this very reason, I find it intriguing.
Feel free to search for your favorite places and walk around them. I’d love to hear what you think! A special shout-out to Yuma Kajihara for working on the website and the paper!