Add realistic sound effects, dialogs, and soundtracks to your AI-generated videos.
In the past few weeks, we’ve seen a wave of text-to-video and image-to-video tools like Google Veo, Kuaishou’s Kling, Luma Labs’ Dream Machine, and the newly announced Runway Gen-3 Alpha.
These AI video tools generate impressive results, but they share a common limitation — they are all silent.
No dialog, no soundtrack, and no sound effects.
Today, Google shared an update about an internal technology they are developing that can generate audio from video input.
Google’s video-to-audio (V2A) technology combines video pixels with natural language text prompts to generate rich soundscapes for the on-screen action.
V2A not only creates realistic sound effects and dialogue that match the characters and tone of a video, but it can also generate soundtracks for a wide range of traditional footage, including archival material, silent films, and more.
Here are five examples that the Google DeepMind team shared in a blog post:
1. Drums
Prompt for audio: A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd
2. Cars
Prompt for audio: cars skidding, car engine throttling, angelic electronic music
3. Wolf
Prompt for audio: Wolf howling at the moon
4. Underwater Jellyfish
Prompt for audio: jellyfish pulsating under water, marine life, ocean
5. Horror Scene
Prompt for audio: Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete
These are wild!
While there are limitations, such as artifacts and distortions, the overall quality of the output is still high enough to significantly enhance the video experience.
It’s high time these AI-generated videos are paired with an audio generator, and V2A is a promising step in that direction.
Google experimented with various approaches to find the most scalable AI architecture for audio generation, and the diffusion-based method provided the most realistic results for synchronizing video and audio.
Diffusion is the process by which an AI model is trained to reconstruct content, whether still images, video, or, in this case, audio, by progressively refining random “noise,” based on concepts it learned from annotated image-text or video-text pairs during training.
The V2A system begins by encoding video input into a compressed form. Using a diffusion model, the audio is iteratively refined from random noise, guided by the visual input and natural language prompts to generate synchronized, realistic audio. The final audio output is then decoded, turned into an audio waveform, and combined with the video.
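To make that flow concrete, here is a minimal toy sketch in Python of what a diffusion-style video-to-audio loop could look like. Google has not published V2A’s architecture or code, so every function, name, and shape below is a simplified placeholder of my own, not the real system:

```python
import numpy as np

# Toy stand-ins for the real (unpublished) V2A components. Names and shapes
# are illustrative only; Google has not released the actual architecture.

def encode_video(video_frames):
    """Compress video frames into a conditioning vector (toy: mean-pool pixels)."""
    return video_frames.mean(axis=(0, 1, 2))          # shape: (channels,)

def encode_text(prompt):
    """Map a text prompt to a conditioning vector (toy: character histogram)."""
    vec = np.zeros(128)
    for ch in prompt.lower():
        vec[ord(ch) % 128] += 1
    return vec / max(len(prompt), 1)

def denoise_step(latent, video_cond, text_cond):
    """One refinement step: nudge the noisy latent toward the conditioning (toy)."""
    guidance = np.resize(np.concatenate([video_cond, text_cond]), latent.shape)
    return latent + 0.1 * (guidance - latent)          # gradually removes "noise"

def generate_audio(video_frames, prompt, steps=50, latent_len=16_000):
    """Sketch of the described pipeline: encode video, refine noise, return audio."""
    video_cond = encode_video(video_frames)
    text_cond = encode_text(prompt)
    latent = np.random.randn(latent_len)               # start from pure random noise
    for _ in range(steps):
        latent = denoise_step(latent, video_cond, text_cond)
    return latent    # a real system would decode this into a waveform and mux it with the video

# Example: 24 frames of 64x64 RGB video plus a text prompt.
frames = np.random.rand(24, 64, 64, 3)
waveform = generate_audio(frames, "cars skidding, car engine throttling")
print(waveform.shape)
```

The real conditioning, scheduler, and decoder are far more involved, but the overall shape of the process (noisy latent in, video- and prompt-guided refinement, waveform out) matches Google’s description.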
To enhance audio quality and guide the model toward specific sounds, the researchers incorporated AI-generated annotations with detailed sound descriptions and transcripts of spoken dialogue during the training process. This allows the technology to associate specific audio events with various visual scenes based on the provided annotations or transcripts.
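As a rough illustration of how those annotations and transcripts could be attached to training data, here is a small sketch. The field names and helper function are assumptions for illustration; Google has not described its actual data format:

```python
from dataclasses import dataclass

@dataclass
class V2ATrainingExample:
    """Hypothetical structure of one training pair (all field names are illustrative)."""
    video_path: str           # the source video clip
    audio_path: str           # the ground-truth soundtrack to reconstruct
    sound_description: str    # AI-generated annotation, e.g. "waves crashing, seagulls"
    dialogue_transcript: str  # spoken words in the clip, if any

def conditioning_text(example: V2ATrainingExample) -> str:
    """Combine the annotation and transcript into one text conditioning signal."""
    parts = [example.sound_description]
    if example.dialogue_transcript:
        parts.append(f"dialogue: {example.dialogue_transcript}")
    return "; ".join(parts)

# Hypothetical example pair.
example = V2ATrainingExample(
    video_path="clip_001.mp4",
    audio_path="clip_001.wav",
    sound_description="wolf howling at the moon, wind",
    dialogue_transcript="",
)
print(conditioning_text(example))  # -> "wolf howling at the moon, wind"
```

Pairing the audio with descriptions and transcripts like this is what lets the model learn which audio events go with which visual scenes.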
For more details, check out Google’s blog post here.
Despite the advancements, there are still several limitations Google is working to address, such as audio quality dropping when the input video contains artifacts or distortions, and imperfect lip synchronization for generated speech. The team working on this tech says that further research is underway to address these limitations and enhance the capabilities of the V2A system.
Despite being in the preview phase, the initial results of Google’s V2A technology are already impressive, especially with video generators advancing at an unprecedented pace.
I can’t wait to hear the audio of all the memes people are generating with AI video generators.
However, the timeline for public access to V2A remains unclear. According to Google, the technology will have to undergo rigorous testing before it is made available to the public:
“Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing.”
Nevertheless, it’s encouraging to know that such technology is in development, and we could soon see AI video generators seamlessly integrating audio.