Google DeepMind’s new tool can turn text into a movie soundtrack

Google DeepMind, Google’s AI research laboratory, has released details of its latest generative AI model. The video-to-audio technology uses video pixels and text prompts to generate rich soundtracks that match your videos.

DeepMind’s new tool: Video-to-audio

Google published details of its AI model in a blog post on the DeepMind website. The post acknowledged that AI video generation models have advanced at an incredible pace, but many current systems can only generate silent output. Google also said that one of the next major steps toward bringing generated movies to life is creating soundtracks for these silent videos. The company’s video-to-audio (V2A) technology aims to do exactly that.

What can it do?

Google says that the V2A technology can be paired with AI video generation models such as its own Veo. However, it also works with traditional video footage, including archival material and silent films. Presumably, that means it will work just as well with your own video clips. In addition, V2A can create a dramatic score, realistic sound effects or dialogue. Google claims that the audio will match the characters and tone of the video used to create it.

Unlimited soundtracks

Google says that its V2A AI model can create an unlimited number of soundtracks for any video input. In addition, you can use text prompts to help shape the generated audio output. A “positive prompt” can be defined to guide the generated output toward desired sounds. Alternatively, a “negative prompt” will guide the AI to avoid using undesired sounds.
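To make the positive/negative prompt idea concrete, here is a minimal, purely illustrative sketch. DeepMind has not published a public API for V2A, so every name below (`generate_audio`, the `positive_prompt` and `negative_prompt` arguments, the candidate sounds) is a hypothetical stand-in; the scoring is a toy keyword match, not how the real model works.

```python
def generate_audio(video, positive_prompt="", negative_prompt=""):
    # Hypothetical stand-in: pick the candidate "sound" whose description
    # best matches the positive prompt while avoiding the negative one.
    candidates = ["rain falling on leaves", "city traffic", "birdsong"]

    def score(sound):
        s = sum(word in sound for word in positive_prompt.split())
        s -= sum(word in sound for word in negative_prompt.split())
        return s

    return max(candidates, key=score)


best = generate_audio(
    video=None,  # placeholder; the real system conditions on video pixels
    positive_prompt="rain leaves forest",
    negative_prompt="traffic city",
)
```

In this toy version, the positive prompt pulls the output toward rain sounds while the negative prompt pushes it away from traffic noise, mirroring how the two prompt types steer V2A's generated audio in opposite directions.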

How it works

Google says that the V2A system starts by encoding video input into a compressed representation. Then, a diffusion model iteratively refines the audio from random noise. In addition, this process is guided by the visual input from the video and any text prompts given. As a result, the AI model generates synchronized, realistic audio that closely aligns with the prompts and the video content. Finally, the audio output is decoded, turned into an audio waveform and combined with the video data.
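The pipeline described above (encode the video, iteratively refine audio from noise, then decode a waveform) can be sketched in miniature. This is a conceptual toy, not DeepMind's implementation: the "encoder" just averages pixels, and the "diffusion" step simply nudges random noise toward a conditioning target derived from the video.

```python
import math
import random

random.seed(0)

def encode_video(frames):
    # Stand-in encoder: collapse each frame into one number (its mean),
    # playing the role of the compressed video representation.
    return [sum(f) / len(f) for f in frames]

def denoise_step(audio, target):
    # Toy refinement step: move the noisy audio halfway toward a target
    # derived from the video encoding (and, in the real system, text prompts).
    return [a + 0.5 * (t - a) for a, t in zip(audio, target)]

# Fake "video": 4 frames, each flattened to 16 pixel values.
frames = [[random.random() for _ in range(16)] for _ in range(4)]
video_code = encode_video(frames)                 # compressed representation

# Conditioning target: repeat the video code out to the audio length.
target = video_code * 4

audio = [random.gauss(0, 1) for _ in range(16)]   # start from random noise
for _ in range(20):                               # iterative refinement
    audio = denoise_step(audio, target)

waveform = [math.tanh(a) for a in audio]          # "decode" to a bounded waveform
```

After 20 refinement steps the noise has converged onto the video-conditioned target, which is the basic intuition behind diffusion-based generation: each pass removes a little noise under the guidance of the conditioning signal.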

Sample videos

Google shared some sample videos in the blog post to showcase the capabilities of its V2A model. One example, titled “V2A Horror,” shows a figure walking down a dimly lit corridor with suitably eerie music. The text prompt used was “Cinematic, thriller, horror film, music, tension, ambiance, footsteps on concrete.” Another features an animated dinosaur hatching from an egg, with audio generated from the prompt “Cute baby dinosaur chirps, jungle ambiance, egg cracking.”

Synchronized sound

Some of the sample videos give an impressive demonstration of how the V2A technology can generate sound that is synchronized with the visuals. The “V2A Drums” video features a close-up shot of drumsticks on a snare drum. The audio was generated with the prompt “A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd.” Even though the prompt describes a whole concert scene, the drum sounds closely match the stick movements in the video. Similarly, the prompt “Wolf howling at the moon” added audio of a wolf howling in sync with an animated video clip of a moving wolf.

What next

Google said that its research stands out from existing video-to-audio solutions because it can understand raw pixels and can be guided with text prompts. However, the company acknowledges the limitations of the technology in its current form. Video quality matters, as artifacts or distortions in the video can lead to “a noticeable drop in audio quality.” In addition, the company said it is working to improve lip synchronization for videos that involve speech.

What we think

Google DeepMind’s V2A technology is another demonstration of how quickly AI technologies are developing. The sample videos are all under 10 seconds but do show the system’s future potential. The versatility of the AI model is also shown by a clip of a spaceship that has three completely different audio tracks; the only difference in their creation was the text prompt that guided the AI. Importantly, Google has committed to incorporating its SynthID toolkit into its V2A research. This watermarks all AI-generated content to help safeguard against potential misuse of the technology.


Google hasn’t given a date for the release of the V2A AI model. The company said: “Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing.” We will let you know when we have an update.

Pete Tomkies
Pete Tomkies is a freelance cinematographer and camera operator from Manchester, UK. He also produces and directs short films as Duck66 Films. Pete's latest short Once Bitten... won 15 awards and was selected for 105 film festivals around the world.