Gemini Omni Flash makes video Gemini’s next multimodal output
May 29, 2026

Google introduced Gemini Omni Flash: a new model that can combine text, images, audio and video as input, then generate or conversationally edit video output.
Google has introduced Gemini Omni, a new model family that pushes Gemini’s multimodal capabilities further into video production. The first public member is Gemini Omni Flash. According to Google, the model can combine text, images, audio and video as input and generate new videos from them. At launch, the focus is clearly on video; additional output modalities such as image and audio are planned for later.
The key point is not just video generation, but conversational editing. According to Google, users can take an existing video and modify it with natural language: swap objects, change actions, adjust camera angles, alter lighting or transform the style of a scene. Google emphasizes consistency across multiple prompts: characters are meant to remain recognizable, the scene is meant to retain earlier edits and physical behavior is meant to be more plausible.
That positions Omni not as a simple text-to-video generator, but as a creative multimodal tool. The most important capability is combining different references: an image can define the character, a video can provide motion, audio can provide rhythm and text can define the desired scene. Omni is then supposed to turn those inputs into one coherent clip. For creators, this matters because existing material becomes controllable prompt context rather than just raw footage for traditional editing.
The business case is straightforward: product videos, social clips, training material, explainers and fast visual prototypes could become cheaper and faster to produce. At the same time, the pressure on media literacy and provenance increases. Google says videos created with Omni include SynthID and C2PA Content Credentials. That matters, but it does not solve everything: platforms, newsrooms and companies still need processes to label AI-generated or AI-edited video and detect misuse.
According to Google, Gemini Omni Flash is rolling out first to Google AI Pro and Ultra subscribers in the Gemini app and Google Flow. It is also being made available at no additional cost to users of YouTube Shorts and YouTube Create starting this week. Developers and enterprise customers are expected to get API access in the coming weeks.
The takeaway: Gemini Omni is a clear signal that AI video competition is no longer only about producing pretty clips. The decisive questions are how well models handle real references, how stable iterative editing becomes and whether they can reliably combine world knowledge, physics and style control. That is exactly where Google is now making its move.
💡 In plain English
Gemini Omni Flash is like a video assistant: you give it text, images, audio or an existing video, and it can create a new video or edit the old one through chat.
Key Takeaways
- Gemini Omni Flash combines text, image, audio and video inputs for video output.
- The focus is conversational video editing, not just text-to-video generation.
- Google starts with the Gemini app, Google Flow, YouTube Shorts and YouTube Create; APIs are planned next.
- SynthID and C2PA are intended to make provenance and AI generation more transparent.
FAQ
What is Gemini Omni Flash?
Gemini Omni Flash is the first model in the new Gemini Omni family and can generate or edit videos from multimodal inputs.
Can Gemini Omni already output images and audio?
At launch, the focus is video. Google says additional output modalities such as image and audio will follow later.
Is there an API?
Not yet. Google says developers and enterprise customers are expected to get API access in the coming weeks.