Gemini Omni Is Here! Google’s Edge Is Still in Multimodal Models, Right?!
Google's Gemini Omni is the first natively multimodal model for video understanding and generation. It accepts arbitrary combinations of image, text, video, and audio inputs, supports conversational editing and physics-aware reasoning, and significantly outperforms prior models such as Veo.
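Why it was selected: Gemini Omni accepts any combination of image, text, video, and audio inputs and supports multi-turn conversational video editing without restating the full prompt.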
