DeepMind Announces V2A, an AI That Can Generate Soundtracks and Dialogue for Videos
Key Takeaways:
Researchers at Google’s AI lab, DeepMind, have developed an artificial intelligence model that can generate soundtracks and dialogue for videos. In an official blog post, DeepMind said it sees the Video-to-Audio (V2A) model as an “essential piece of the AI-generated media puzzle”.
While many firms have developed video- and sound-generating AI models, none has managed to create sound effects that sync precisely with the videos they produce. Experts see V2A’s development as a significant step towards using AI to create fully audiovisual experiences.
DeepMind says that V2A is a diffusion model trained on a combination of sounds, dialogue transcripts, and video clips. The way it works is elegantly simple: the model takes a video paired with a text description of the desired soundtrack and generates music, sound effects, and even dialogue that match the characters and tone of the content.
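DeepMind hasn’t released code, model weights, or a public API for V2A, so anything beyond the blog post’s description is guesswork. Purely as an illustration of the pipeline described above, here is a minimal Python sketch of a diffusion-based video-to-audio generator; every class, component, and the simplified denoising update are hypothetical placeholders, not DeepMind’s actual design.

```python
# Hypothetical sketch of a diffusion-based video-to-audio (V2A) pipeline.
# All components are assumed placeholders invented for illustration.

import torch

class V2APipeline:
    def __init__(self, video_encoder, text_encoder, denoiser, audio_decoder):
        self.video_encoder = video_encoder  # raw frames -> visual embeddings
        self.text_encoder = text_encoder    # soundtrack description -> text embeddings
        self.denoiser = denoiser            # diffusion network predicting noise
        self.audio_decoder = audio_decoder  # denoised latent -> audio waveform

    @torch.no_grad()
    def generate(self, frames, prompt=None, steps=50):
        video_emb = self.video_encoder(frames)
        text_emb = self.text_encoder(prompt) if prompt else None

        # Start from pure noise in an audio latent space and iteratively
        # denoise it, conditioning every step on the video (and, optionally,
        # the text prompt) so the result stays aligned with the visuals.
        latent = torch.randn(1, 128, frames.shape[0])  # (batch, channels, time)
        for t in reversed(range(steps)):
            noise_pred = self.denoiser(latent, t, video_emb, text_emb)
            latent = latent - noise_pred / steps  # crude stand-in for a DDPM/DDIM step

        return self.audio_decoder(latent)  # waveform synced to the input video
```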
The video and audio produced by V2A are watermarked with SynthID, the company’s deepfake-combating watermarking technology. The model is said to work well with videos generated by Google Veo, a video-generation tool announced at Google I/O 2024.
V2A combines video information with text prompts. Veo users can also provide additional instructions, guiding the model towards the specific sounds they want for the video. This allows for greater creative control over the generated soundtrack.
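To make that workflow concrete, here is how the optional prompt guidance might look from a user’s perspective, reusing the hypothetical sketch above. The load_video helper, the component objects, and the prompt strings are likewise invented for illustration.

```python
# Hypothetical usage of the V2APipeline sketch above; not a real API.
# video_encoder, text_encoder, denoiser, and audio_decoder are assumed
# to be pretrained components supplied elsewhere.
pipeline = V2APipeline(video_encoder, text_encoder, denoiser, audio_decoder)

frames = load_video("street_scene.mp4")  # assumed helper returning a frame tensor

# Video alone: the model infers plausible ambient sound from the raw pixels.
ambient_audio = pipeline.generate(frames)

# Video plus a prompt: the user steers the soundtrack's content and tone.
scored_audio = pipeline.generate(frames, prompt="tense cinematic strings, distant sirens")
```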
V2A Can Understand Raw Pixels in Videos and Sync Sounds Automatically
DeepMind claims the technology behind V2A is unique because it can understand the raw pixels of a video and sync generated sounds with the footage automatically, using the optional user description only for additional cues.
Some use cases for the model include producing soundtracks for silent videos or traditional footage, such as archival materials and silent movies. The company pitches its V2A technology as a useful tool for archivists and those working with historical footage.
DeepMind also acknowledged that V2A isn’t perfect. Because its underlying model wasn’t trained on many videos containing artifacts or distortions, audio quality drops noticeably for such footage, and in general the audio it generates still isn’t entirely convincing.
Considering these issues, and to prevent any misuse, DeepMind says it won’t be releasing the technology to the public anytime soon. Researchers have also found that the lip movements in videos generated by other models often fail to match the speech V2A produces, since those models aren’t conditioned on the input transcripts.
Public Release for DeepMind’s V2A Not Expected Anytime Soon
AI-powered soundtrack-generating tools are not novel. AI startups like ElevenLabs and Stability AI have launched models that can create video sound effects based on user prompts. Microsoft is reportedly working on an AI technology that can generate talking and singing videos from a still image.
Other platforms like Pika and GenreX have trained their models to be capable of guessing what music or effects are appropriate for a given scene in a video.
DeepMind did not confirm whether any of the video and audio data used to train V2A was copyrighted, or whether the creators of that data were informed of the work the company was doing. The company said the technology will undergo “rigorous safety assessments and testing” before its public release.