Google Gemini Omni transforms text, images, and audio into video content

Google Gemini Omni can generate videos from text, images, and audio inputs, showcasing Google's latest advances in multimodal AI content creation.

Shivangi Yadav

May 21, 2026 - 20:03

Google has unveiled Gemini Omni, a new family of multimodal AI models designed to generate content across multiple formats, marking a major step in its long-term artificial intelligence strategy. The announcement was made during the company’s annual Google I/O developer conference, where executives described the technology as a significant advancement toward creating AI systems capable of understanding and generating information across text, images, audio, and video within a unified framework.

When Google first introduced Gemini three years ago, the company’s vision centred on building a truly multimodal large language model. Rather than training separate systems for different media formats, the goal was to create a single neural network capable of understanding text, images, audio, video, and code simultaneously while generating outputs in any of those formats.

According to Google CEO Sundar Pichai, Gemini Omni represents the next major milestone toward that objective. During the conference, Pichai described the technology as a family of models that will eventually be able to “create anything from any input.”

The first implementation of that vision focuses primarily on video generation. With Gemini Omni, users can combine a variety of inputs—including text prompts, images, audio clips, and video footage—and the model processes those inputs collectively rather than treating them as isolated pieces of information.

Instead of simply stitching media elements together, Omni analyses relationships among all provided inputs and reasons across them to create a coherent output. Google says this allows the model to generate videos that demonstrate an understanding of scientific concepts, historical context, cultural references, and physical behaviours, resulting in more consistent and realistic content.

The platform also expands beyond video creation. Users can edit photographs through simple natural-language instructions rather than relying on traditional image-editing software or complex technical workflows. The experience resembles the functionality introduced through Google’s Nano Banana project, which allows image modifications through conversational prompts.

Google already offers a dedicated video-generation model known as Veo, which enables users to transform text prompts and images into videos and provides options for avatar customisation and creative direction. However, company executives emphasised that Gemini Omni should not be viewed merely as an update to Veo.

Nicole Brichtova, director of product management at Google DeepMind, described the launch as a broader evolution of Google’s AI ecosystem.

“It’s the next step towards the progression of combining the intelligence of Gemini with the rendering capabilities of our media models,” Brichtova explained.

To demonstrate the system’s capabilities, Google DeepMind Chief Technologist Koray Kavukcuoglu shared examples during a media briefing before the announcement. In one demonstration, the model received a simple prompt requesting “a claymation explainer of protein folding.”

Within a short period, Gemini Omni generated a stop-motion-style educational video complete with narration explaining the scientific process. The voice-over described how proteins begin as chains of amino acids and eventually fold into structures such as alpha helices and beta sheets, creating complex three-dimensional forms.

The demonstration highlighted Google’s broader ambition for the technology. Beyond generating videos from mixed inputs, the company envisions a future in which Gemini Omni can seamlessly transform information across media types. Potential future applications include generating images from audio recordings, creating audio from video footage, and producing entirely new forms of multimedia content from a wide range of inputs.

Reflecting on Gemini’s origins, Pichai explained that multimodality has been central to the project since its inception.

“When we first announced Gemini, it was our first AI model to be natively multimodal,” Pichai said. “We knew that training it on a combination of text, code, audio, images, and video would give it a deeper understanding of the world.”

He added that advances in world models are helping artificial intelligence move beyond predicting text and toward simulating aspects of reality.

“With world models, AI is moving from predicting text to simulating reality. Gemini Omni is the next step in that direction,” Pichai said.

As part of the launch, Google is also introducing personalised digital avatar creation tools. Users will be able to generate videos featuring digital versions of themselves, a concept that gained popularity through products such as OpenAI’s former Cameos feature within its now-discontinued Sora application.

To address concerns surrounding deepfakes and identity misuse, Google is implementing a dedicated verification process for avatar creation. According to Brichtova, users must complete an onboarding procedure that includes recording themselves and speaking a sequence of numbers. Once verified, the digital avatar is securely stored for future use.

The company is also integrating content authentication technology into every video generated through Gemini Omni. Each generated video will include Google’s SynthID digital watermark, enabling users and platforms to verify whether the content originated from Gemini-powered products.

Google says the watermarking system serves as an additional safeguard designed to improve transparency as AI-generated media becomes increasingly sophisticated and difficult to distinguish from authentic recordings.

The first release within the Gemini Omni family is called Gemini Omni Flash. Beginning immediately, Flash will be available across several Google products, including the Gemini application, YouTube Shorts, and the company’s AI creative platform, Flow.

Initially, Flash will support creating videos up to 10 seconds long. Brichtova clarified that this limitation is not due to technical constraints within the model itself. Instead, Google intentionally chose the shorter format to make the technology broadly accessible while aligning with the expectation that many users will initially focus on creating brief content.

She also indicated that support for longer video durations is already being developed and is expected to arrive in future updates.

Google appears to be positioning Omni Flash primarily as a consumer-focused product during its early rollout. Examples presented by Brichtova and DeepMind research engineer Gabe Barth-Maron emphasised personal and entertainment-oriented use cases.

Among the demonstrations were videos depicting users winning prestigious awards, travelling to the moon, or removing unwanted individuals from the background of vacation footage. These examples were designed to showcase how ordinary consumers could use the technology to create personalised content.

Barth-Maron summarised the concept succinctly, describing the generated videos as “personalised memes.”

“We definitely did focus on making this easy to use for consumers,” Brichtova said. “Not many video models have breached that chasm with consumers, so this is our play to do that.”

However, the simplified user experience comes with a trade-off. Both Brichtova and Barth-Maron noted that users will need to provide highly detailed editing instructions when modifying content. Vague prompts may cause the model to make broader changes than intended or alter elements users hoped to preserve.

This challenge mirrors issues previously encountered by users of Nano Banana and other AI editing systems, where insufficiently specific instructions occasionally produced unexpected results.

Although Google is emphasising consumer accessibility in the short term, the broader implications for creative professionals and enterprise customers are substantial. The company confirmed that Gemini Omni will become available through APIs in the coming weeks, enabling developers and businesses to integrate its capabilities into their own products and workflows.

Google expects content creators to make significant use of the avatar-generation features already available through YouTube Shorts. More broadly, the company sees end-to-end multimodal content creation as a potentially transformative tool for industries such as filmmaking, advertising, marketing, and digital media production.

The vision aligns with developments occurring elsewhere in the industry. For example, startup Luma AI is building a similar agentic system that can generate complete advertising campaigns from a short creative brief and a single product image, using its own unified AI model.

Brichtova noted that one area in which Google believes Gemini Omni performs particularly well is text rendering in generated media. Accurate text generation is often critical in advertising applications, where slogans, product names, branding elements, and promotional messaging must appear precisely as intended.

“We’re actually pretty proud of the model’s text-rendering capabilities, which is really useful for things like advertising,” Brichtova said. “If you want a product somewhere, or even just a slogan, it needs to be accurate.”

She added that Google expects filmmakers and other professional creators to adopt the technology as its capabilities continue to expand.

For more advanced creative and enterprise workloads, Google is developing a higher-tier model known as Gemini Omni Pro. The company expects Pro to outperform Flash across the full range of Omni capabilities, though an official release date has not yet been announced.

According to Brichtova, Google plans to release the professional-grade version once it achieves what the company considers a substantial performance improvement over the Flash model.

As Google continues integrating multimodal intelligence with advanced media generation technologies, Gemini Omni represents one of the company’s most ambitious efforts yet to create AI systems capable of understanding and generating content across every major form of digital media. The launch brings Google closer to its original vision of a truly unified multimodal artificial intelligence platform.

Shivangi Yadav reports on startups, technology policy, and other significant technology-focused developments in India for TechAmerica.Ai. She previously worked as a research intern at ORF.