Imaginary Soundscape #2

Website to Experience Soundscapes "Imagined" by AI



We, as human beings, can imagine the sounds we would hear if we were present at the location depicted in a landscape photograph. For example, one might imagine the sound of rippling waves from a picture of a beach, or the sounds of crowds and traffic signals from a picture of the scramble crossing in Shibuya. "Imaginary Soundscape" was launched in 2017 as a project to have AI "imagine" the soundscapes that people conjure up unconsciously, and it has since been used by many people around the world.

We have updated Imaginary Soundscape to improve the matching algorithm and greatly expand the sound library. This allows for more detailed and nuanced matching of both Street View and user-uploaded images. In addition, the UI has been improved and a Japanese-language interface has been added alongside the existing English one, making the experience more enjoyable for a wider range of users.

You can try it out using this link.


Imaginary Soundscape has used a SoundNet-based model to match images to sounds: an image is fed into the model, and the library is searched for sound files whose features (high-dimensional vectors representing the content) are close to the model's output features. Technology that combines information from different domains, such as images and sound, is called multimodal technology, and it was still in its early stages when this project was launched.
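As a minimal sketch of this feature-based lookup (the function name, feature dimensions, and toy library below are hypothetical, not the project's actual code), the search amounts to a nearest-neighbor query over precomputed sound features:

```python
import numpy as np

def find_closest_sounds(image_features, sound_library_features, top_k=3):
    """Return the indices of the top_k library sounds whose feature
    vectors are closest (by cosine similarity) to the image features."""
    # Normalize so that the dot product equals cosine similarity
    img = image_features / np.linalg.norm(image_features)
    lib = sound_library_features / np.linalg.norm(
        sound_library_features, axis=1, keepdims=True)
    similarities = lib @ img            # one score per library sound
    return np.argsort(similarities)[::-1][:top_k]

# Toy example: 5 library sounds in a 4-dimensional feature space
rng = np.random.default_rng(0)
library = rng.normal(size=(5, 4))
query = library[2] + 0.01 * rng.normal(size=4)   # nearly identical to sound 2
ranked = find_closest_sounds(query, library)
print(ranked[0])   # sound 2 ranks first
```

In practice the library features are computed once offline, so only the query image needs a forward pass through the model at request time.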

This multimodal technology was then greatly improved by CLIP, a technology published by OpenAI in 2021. CLIP learns the relationship between images and text, allowing it to search for images from input text and vice versa. Since CLIP's release, many other multimodal technologies have been published and developed.

In this project, which evokes soundscapes using AI, we thought we could create a more interesting experience by adopting the multimodal training method introduced in CLIP.


This update incorporates Contrastive Pre-Training, the training method used in CLIP. For CLIP, OpenAI collected a large number of image-text pairs from the web and learned their relationships through Contrastive Pre-Training. In this scheme, a feature extractor called an Encoder is prepared for each modality, image and text, and the paired training data are fed into these Encoders, which convert the images and texts into features (high-dimensional vectors). During training, the similarity between images and texts is computed from these features, and the Encoders are trained so that matching image-text pairs have higher similarity while mismatched pairs have lower similarity. After this training, the model is able to understand the relationship between images and text, and by collecting and training on 400 million image-text pairs, CLIP succeeded in building a model with strong general-purpose performance.
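The core of Contrastive Pre-Training can be sketched as a symmetric cross-entropy loss over a batch similarity matrix. The NumPy snippet below is a minimal illustration of that loss, not OpenAI's or our actual training code; the temperature value and toy features are assumptions for demonstration:

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix:
    matching pairs sit on the diagonal and are pushed toward higher
    similarity, while mismatched pairs are pushed toward lower similarity."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (batch, batch) similarities
    loss_img_to_txt = -np.diag(log_softmax(logits, axis=1)).mean()
    loss_txt_to_img = -np.diag(log_softmax(logits, axis=0)).mean()
    return (loss_img_to_txt + loss_txt_to_img) / 2

# Perfectly aligned pairs give a near-zero loss; shuffled pairs do not
feats = np.eye(4)
aligned = contrastive_loss(feats, feats)
shuffled = contrastive_loss(feats, np.roll(feats, 1, axis=0))
print(aligned < shuffled)   # True
```

During real training, the gradient of this loss is backpropagated through both Encoders so the two modalities converge in a shared embedding space.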

Figure: Conceptual diagram of Contrastive Pre-Training

Imaginary Soundscape applies this CLIP mechanism to have AI learn the relationship between images and sounds rather than images and text. By breaking down a large amount of video data into images and sounds and training on these pairs with the same mechanism as CLIP, the system can now select sounds that are close to an input image, and vice versa.
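Once the two Encoders share an embedding space, retrieval works in either direction. The sketch below uses random linear projections as hypothetical stand-ins for trained Encoders (all dimensions and names are illustrative assumptions, not the Img2Sound implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the two trained Encoders: after contrastive
# training, each maps its own modality into a shared embedding space.
W_image = rng.normal(size=(8, 4))    # 8-d image features -> 4-d shared space
W_sound = rng.normal(size=(6, 4))    # 6-d sound features -> 4-d shared space

def embed(features, W):
    """Project features into the shared space and L2-normalize them."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def retrieve(query_emb, candidate_embs):
    """Index of the candidate closest to the query (cosine similarity)."""
    return int(np.argmax(candidate_embs @ query_emb))

# Toy data: a library of 10 sound embeddings and 3 image embeddings
sounds = embed(rng.normal(size=(10, 6)), W_sound)
images = embed(rng.normal(size=(3, 8)), W_image)
best_sound = retrieve(images[0], sounds)           # image -> sound
best_image = retrieve(sounds[best_sound], images)  # sound -> image
```

The same `retrieve` call serves both directions because images and sounds live in one space; only the candidate set changes.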

Figure: system conceptual diagram of Img2Sound (image to sound)

In addition, we have newly added various genres of sound files to our library. This makes it possible to select more appropriate sounds for the various types of images uploaded by users as well as Street View images.

In this update, we aimed to improve the user experience by incorporating the latest multimodal research into Imaginary Soundscape. The improved image-sound matching accuracy allows more appropriate sounds to be selected and raises the quality of the soundscapes imagined by AI. Now that AI can handle the relationship between images and sound, an area that is sensory and difficult to quantify even for humans, we expect it to stimulate our imagination even more and give rise to new forms of expression.

Since the deep learning model developed in this project is compatible with CLIP, it has great potential for further applications, such as matching text and sound. In fact, we are currently working on further applications such as image-music matching and video-music matching. We will continue our efforts to expand human creativity with AI through the creation of such works.

We now offer licenses for the technology developed through this project, which quantifies the relationship between images and sound, as the "Img2Sound (Image to Sound)" engine.



  • Project Direction

    Akira Shibata (Qosmo, Inc.)

  • Technical Direction

    Nao Tokui (Qosmo, Inc.)

  • Front-end

    Robin Jungers (Qosmo, Inc.)

  • Back-end

    Bogdan Teleaga (Qosmo, Inc.)

  • Machine Learning

    Ryosuke Nakajima (Qosmo, Inc.)

  • Web Design

    Tomoyuki Yanagawa


This project uses Qosmo Music & Sound AI.

Get in touch with us here!