
traeai topic radar

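Advances in Robotics, Embodied Intelligence, and Multimodal Models

Aggregates content on robotics, embodied intelligence, spatial understanding, robot foundation models, simulation training, and industry applications.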

What searchers are trying to solve

Tracking new models, new systems, and real-world application cases in robotics and embodied intelligence.

Why this is worth tracking

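Embodied intelligence is bringing model capabilities into the physical world, making it one of the directions most worth sustained attention among long-term AI trends.

Robots, embodied intelligence, robotics, embodied AI, spatial understanding, robot foundation models, multimodal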

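Long-tail combinations

This topic can keep expanding along search intents such as tools, practices, and comparisons, updated with real material rather than empty keyword swaps.

Robot tools, robot practices, robot comparisons, embodied intelligence tools, embodied intelligence practices, embodied intelligence comparisons, robotics tools, robotics practices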

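Automatable content modules

Curated material

Continuously collects high-scoring articles, podcasts, videos, and tweets related to robotics and embodied intelligence.

Trend assessment

Distills recent changes, recurring viewpoints, and points of contention into stable summaries.

Entity linking

Automatically connects related companies, models, products, people, and concepts, forming entry points for deeper exploration.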

Featured content

Filtered by relevance, score, and recency.

Shanghai Jiao Tong University x ShangHai Creation x Rui Jin Hospital Unveil CX-Mind: Chest X-ray Diagnosis Enters the Era of 'Verifiable Reasoning'

CX-Mind applies multimodal large models and reinforcement learning to chest X-ray diagnosis, improving the explainability and clinical utility of medical imaging AI.

Why selected: CX-Mind is the first multimodal large model to bring chest X-ray diagnosis into

Featured | Article | #Medical AI #Chest X-ray Diagnosis #Verifiable Reasoning #Multimodal Large Models #Reinforcement Learning | Chinese
Gemma 4 12B: The Developer Guide

Google Developers Blog, 1171 words (about 5 min)
Score: 92

Gemma 4 12B features an encoder-free multimodal architecture that runs locally on 16GB VRAM devices with native audio support. By eliminating separate vision and audio encoders, it reduces latency and pairs with a dedicated MTP model for faster inference, marking the first mid-sized multimodal model with a macOS desktop app for fully offline interaction.

Why selected: Gemma 4 12B removes separate encoders; vision uses a 35M-param embedder and audi

Featured | Article | #Gemma 4 #Multimodal LLM #Encoder-Free Architecture #Local AI #Google | English
Nearly $200M Raised! VAST Unveils World Model Roadmap with Project Eden

VAST secured nearly $200M in new funding and officially disclosed its world model roadmap, Project Eden, pioneering a decoupled architecture of state evolution and visual rendering to enable persistent multi-user interaction, modular reuse, and linearly scalable compute for AI-native sandboxes and embodied intelligence simulation.

Why selected: VAST raised nearly $200M in A+/A++ rounds, backed by Yancey Capital, China Life

Featured | Article | #VAST #World Model #Project Eden #AI 3D #Embodied Intelligence | Chinese
Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action

NVIDIA Cosmos 3 is the first open-source omni-model for physical AI, integrating world generation, physical reasoning, and action generation into one unified system. Built on MoT architecture, it supports robotics, autonomous driving, and synthetic data pipelines via Hugging Face and Diffusers.

Why selected: Cosmos 3 is the first open model unifying world generation, physical reasoning,

Featured | Article | #NVIDIA #Physical AI #Omni-model #Hugging Face #MoT Architecture | English
Introducing NVIDIA Cosmos 3: Unified Multimodal Model for Physical AI

NVIDIA Developer, 543 words (about 3 min)
Score: 92

NVIDIA launches Cosmos 3, the first unified multimodal model to integrate language, video, sound, and action as both inputs and outputs. It is built on a Mixture-of-Transformers architecture, open-sourced with weights available on Hugging Face, and achieves top scores across physical AI benchmarks including Robo Lab, PiBench, and Vintage.

Why selected: Cosmos 3 is the first omni-model combining language, video, audio, and action mo

Featured | Video | #NVIDIA #Physical AI #Multimodal Model #Mixture of Transformers #Open Source | English
Just Released: The World’s First Event-Level Prediction Embodied AI World Model!

ZiBianLiang Robotics launched WALL-WM, the world’s first event-level prediction embodied world model, replacing frame-based prediction with semantic events (e.g., 'grasp', 'place'), significantly improving cross-scenario generalization and action robustness.

Why selected: WALL-WM uses semantic events (e.g., grasp, lift) as modeling units instead of fi

Featured | Article | #Embodied AI #World Model #VLA #Event Modeling #Robot Learning | Chinese
iFLYTEK’s First AI Glasses: Leveraging 40g to Reshape AI Workflows

爱范儿, 4643 characters (about 19 min)
Score: 92

iFLYTEK’s first AI glasses—weighing only 40g, featuring end-to-end speech translation and lip-motion noise cancellation—embed translation into real-world workflows, directly addressing the industry’s 30%–50% return rate; its success stems from system-level engineering and years of translation scenario data, not hardware spec racing.

Why selected: iFLYTEK’s AI glasses weigh just 40g (with display), the lightest in class, achie

Featured | Article | #AI Glasses #Multimodal Interaction #Edge AI #iFLYTEK #Speech Translation | Chinese
7B Beats o3 and GPT-5! Medical AI Agents Learn ‘Where to Look and How to Look’

Ophiuchus-7B achieves a mean score of 68.0 on 8 medical VQA benchmarks, surpassing OpenAI-o3 (62.2), Gemini 2.5 Pro (61.8), and GPT-5 (59.9). The core breakthrough is the new ‘Think with Images/Videos’ paradigm: models actively invoke tools like SAM2 and BiomedParse during reasoning to re-examine key regions/moments, making visual evidence an integral part of cognition—not just input.

Why selected: Ophiuchus-7B scores 68.0 on 8 medical VQA benchmarks, significantly outperformin

Featured | Article | #Medical AI #Multimodal LLM #Agent #ICML 2026 #Visual Reasoning | Chinese
AI Paper Review: GPT-4 Technical Report (GPT-4)

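freeCodeCamp.org, 9755 words (about 40 min)
Score: 92

GPT-4 marks the shift of large language models from experimental research to a practical AI platform, introducing multimodal processing and alignment techniques.

Why selected: GPT-4 accepts both text and image inputs, pushing AI systems toward greater generality.

Featured | Article | #GPT-4 #AI #Multimodal #OpenAI | Chinese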
The Most Impressive Robot Demo of the Year Just Dropped!

量子位, 2760 characters (about 12 min)
Score: 92

Genesis AI unveiled GENE-26.5, its first general-purpose robot foundation model, capable of complex tasks like cracking eggs, solving Rubik's cubes, and playing piano—all autonomously with minimal real-world fine-tuning data.

Why selected: GENE-26.5 uses a unified model for multi-task control with multimodal inputs, re

Featured | Article | #Robotics #Foundation Model #Embodied Intelligence #Genesis AI #Simulation | Chinese
Beyond Banana and GPT Image: A 15-Person Chinese Team Builds an AI Image Generation Dark Horse

A 15-person Chinese team, Luma AI, launched Uni-1.1, an AI image model that integrates reasoning and generation, slashes costs by 50%, and achieves top-3 global ranking on Arena.ai—offering the most controllable, scalable solution for brand visual production beyond OpenAI and Google.

Why selected: Uni-1.1 unifies reasoning and generation in one model, enabling brand consistenc

Featured | Article | #AI Image Generation #Luma AI #Uni-1.1 #Advertising Automation #Multimodal Reasoning | Chinese
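Most people use vector databases for chatbots and RAG pipelines. Senqi AI uses ...

Senqi AI uses Milvus to give physical robots long-term semantic memory, addressing the core challenges of real-world tasks: dynamic environments, unbounded tasks, ambiguous instructions, and costly errors.

Why selected: Physical robot agents must replan in real time, because the environment keeps changing and tasks have no clear endpoint.

Featured | Tweet | #Milvus #RAG #Robotics #Vector Database #AI Agent | Chinese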
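#519. Princeton's Zhuang Liu on the Truth About Architecture, Data, and Memory

跨国串门儿计划, 1412 characters (about 6 min)
Score: 92

Princeton's Zhuang Liu argues that the bottleneck in AI performance lies not in architectural innovation but in data quality and memory mechanisms; that vision is the hub of multimodality but is constrained by compute; and that language models already possess strong abstract world models.

Why selected: The combined effect of architectural details (normalization, activation functions, and so on) far outweighs the choice of core components.

Featured | Podcast | #AI Architecture #Multimodal #Data-Driven #World Model #Memory Mechanisms | Chinese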
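Chinese-Built Multimodal Agent Achieves SOTA in Medical Segmentation, With No Model Changes and No Added Tokens

IBISAgent reframes medical image segmentation as multi-step interactive decision-making, resolving the reasoning degradation caused by implicit tokens and significantly improving segmentation accuracy.

Why selected: Models the segmentation task as a multi-step Markov decision process while preserving language reasoning ability.

Featured | Article | #Medical Image Segmentation #Multimodal Models #Reinforcement Learning #CVPR | Chinese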
Cosmos 3 is here

NVIDIA Developer, 268 words (about 2 min)
Score: 90

NVIDIA launches Cosmos 3, an open omni-model for physical AI based on a novel mixture-of-transformers architecture, capable of generating physics-accurate synthetic video, serving as a world model and simulator, and enabling training for robotic and mobile intelligent systems.

Why selected: Cosmos 3 uses a novel hybrid Transformer architecture combining autoregressive a

Featured | Video | #NVIDIA #AI #Physical AI #Transformer #World Model | English
Robotics Control Training Enters the Minute-Level Era! Tsinghua AIR Open-Sources UniLab: 3 Minutes to Train Humanoid Robots, 10x Speed Boost, Runs on Mac

Tsinghua University's AIR DISCOVER Lab open-sources UniLab, achieving 3-10x end-to-end training speedup through heterogeneous architecture, supporting local training on Mac and enabling humanoid robot training in minutes, marking the arrival of the minute-level era for embodied intelligence.

Why selected: UniLab uses a CPU-simulation + GPU-training heterogeneous architecture to achiev

Featured | Article | #Robotics #Reinforcement Learning #Embodied Intelligence #Open Source #Heterogeneous Computing | Chinese
FLUX, Open Research, and the Future of Visual AI — Stephen Batifol, Black Forest Labs

Stephen Batifol from Black Forest Labs introduces FLUX, an open-source visual generation model series emphasizing open research for sustainable AI development, with performance rivaling leading closed-source models.

Why selected: FLUX supports 1024×1024 resolution image generation, matching top-tier closed-so

Featured | Video | #FLUX #Visual AI #Open Source Model #Black Forest Labs #Multimodal | English
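Breaking the Compute Bottleneck in Visual Simulation! Next-Generation Embodied Intelligence Simulation Framework Open-Sourced: High-Throughput Parallelism and High-Fidelity Rendering Power Training at Scale

Tsinghua University's AIR DISCOVER Lab and partner institutions jointly released GS-Playground, a next-generation simulation framework designed for vision-centric robot learning. It combines high-throughput parallel physics simulation with high-fidelity visual rendering to support embodied intelligence training at scale, and has been accepted to the top-tier conference RSS 2026.

Why selected: GS-Playground resolves the tension between high-fidelity visual rendering and large-scale training, providing a stable and efficient simulation platform.

Featured | Article | #Embodied Intelligence #Robot Learning #Visual Simulation #Physics Engine #Tsinghua University | Chinese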
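World's First World Unified Model Released: Robots as Family Members Are Here!

量子位, 4359 characters (about 18 min)
Score: 90

ZiBianLiang Robotics (自变量机器人) released WALL-B, the world's first World Unified Model, which connects vision, hearing, language, and touch modules, giving robots native multimodal capabilities and the ability to evolve continuously.

Why selected: Built on the World Unified Model, WALL-B solves the problem of moving data between modules in traditional VLA architectures.

Featured | Article | #Robotics #Artificial Intelligence #Embodied Intelligence #WALL-B | Chinese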
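Breaking the Compute Bottleneck in Visual Simulation! Next-Generation Embodied Intelligence Simulation Framework Open-Sourced: High-Throughput Parallelism and High-Fidelity Rendering Power Training at Scale

Tsinghua AIR and several partner institutions open-sourced the GS-Playground simulation framework, the first to combine high-throughput parallel physics simulation with high-fidelity visual rendering, significantly improving the efficiency of embodied intelligence training at scale.

Why selected: Supports both CPU and GPU backends and runs natively across all operating systems, covering many robot types including quadrupeds, humanoids, and robotic arms.

Featured | Article | #Embodied Intelligence #Simulation Framework #GS-Playground #Tsinghua AIR #RSS | Chinese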
World's Top Metamodel Changes Hands: Crossover Intelligence Claims World Arena

Crossover Intelligence topped the World Arena Track 2 with its DSCFuncWorld, outpacing the second-place model by a significant margin and validating end-to-end data generation, strategy training, and task execution capabilities.

Why selected: Crossover Intelligence's DSCFuncWorld topped World Arena Track 2, outperforming

Featured | Article | #World Arena #Metamodel #Crossover Intelligence #Data Engine #DexWorldModel | Chinese
Gemma-4 12B + Hermes, Google AI Edge: EASY, GOOD & LOCAL!

AICodeKing, 3109 words (about 13 min)
Score: 87

Gemma-4 12B is an encoder-free, unified multimodal model that runs directly on laptops with 16GB VRAM. It matches the performance of the 26B MOE with less than half the memory footprint, ships with Hermes and agent tools, macOS Edge Gallery, and RTLM, and is released under Apache 2.0.

Why selected: Image and audio inputs flow directly into the LLM, eliminating separate encoders

Featured | Video | #Gemma #412B #Multimodal #Local Deployment #Hermes | English
