Native Robot World Action Model Launched! First Spatiotemporally Integrated Architecture, Developed by Fudan-Affiliated Team

May 31, 2026 18:13:21 | Source: QbitAI
Five funding rounds in half a year
By Yunzhong, from Ao Fei Temple
QbitAI | Official WeChat Account
The race toward artificial general intelligence (AGI) has shifted decisively from the virtual, digital realm to the real, physical world.
Embodied intelligence and the robot "brain" have become the most fiercely contested battleground in today's AGI landscape.
Current mainstream solutions — such as VLA (Vision-Language-Action) models, general world models, and video prediction frameworks — suffer from multiple critical shortcomings including insufficient spatial perception accuracy, lack of physical logic constraints, weak long-term planning capabilities, and poor robustness in real robot deployment. These limitations prevent robots from achieving true autonomous perception, reasoning, decision-making, and stable interaction.
At this pivotal moment of rapid iteration in the physical AI industry, Moshen Intelligence, a Fudan-affiliated startup that has spent five years on the foundational technology behind world action models, has officially unveiled STI-WM, its Spatiotemporally Integrated World Model.
As a general embodied brain specifically designed for robots, this model centers on spatiotemporal integrated modeling, physical consistency constraints, and end-to-end native fusion — breaking through traditional model limitations and charting the optimal technical path for deploying AGI in the physical world.
Fudan + Intel + NVIDIA: Cutting-edge Academic Research Ranks Among Global Leaders
Moshen Intelligence’s technological breakthrough stems from sustained academic research and full-stack engineering capabilities.
The company’s core team originates from Fudan University’s Deep Learning Lab, forming a top-tier team structure integrating academic research, engineering implementation, and industrial commercialization:
Professor Chen Tao, Director of Fudan University's Future Information Innovation Academy and Head of the Deep Learning Lab, anchors the research foundation. Dr. Zhang Yimin, former Chief Scientist at Intel China, and technical leads from NVIDIA drive engineering deployment. Mu Zelin, a Fudan alumnus of the post-95 generation (born in 1995 or later) and a serial entrepreneur, spearheads commercial strategy. Together they form the "Fudan Trio" leadership core.
Over 90% of the company’s core R&D staff come from Fudan University, gathering over a hundred master’s and doctoral graduates. Since 2021 — before the industry trend became apparent — they proactively invested in three foundational technologies: world modeling, 3D perception, and temporal action generation, continuously advancing technical breakthroughs.
For years, the team has won numerous global championship titles and top-tier academic honors:
- Introduced the world’s first humanoid action generation large model, MotionGPT;
- Developed the 3D world model HL3DWM;
- Won the ICCV 2023 global 3D object recognition championship and CVPR 2024 3D dense semantic reasoning championship;
- Received the IJCAI 2025 Outstanding Paper Award — the only embodied intelligence team in China in the past five years to win this honor;
- Their technical lead ranked among the Top 20 Newcomers in China’s Embodied Intelligence EAI List for 2025.
Their original technologies have been cited by international leading labs such as NVIDIA’s DAIR, firmly establishing their innovation and engineering execution capabilities among the global elite tier.

Redefining Industry Paradigms: Five Years of R&D, Pioneering the World Action Model Path
Most current industry approaches still rely on modified combinations of general world models and VLA modules, suffering from modality fragmentation, severe information loss, and absence of real-world physical constraints — capable only of generating visually plausible outputs, but failing to meet the practical demands of real robot deployment.
Moshen Intelligence, starting from the essence of AGI, was the first to establish the native integration route for world action models:
All interactions between robots and the physical world ultimately manifest as actions. Only by accurately understanding spatiotemporal evolution patterns, adhering to physical logic, and enabling end-to-end native mapping can we truly solve the industry’s chronic problems of poor generalization and difficult deployment for robots.
In 2022, the team innovatively proposed the world’s first end-to-end spatiotemporal language-action mapping model (MLD), published at CVPR 2023. This core concept was later validated and referenced by NVIDIA’s DAIR lab in May 2025.
After five years of iterative development, the team has completed seven generations of action model upgrades, accumulating deep expertise in multimodal end-to-end fusion, high-precision action generation, and temporal logic inference — consistently leading the industry in action accuracy, inference speed, and task generalization capability.
Four-Dimensional Unified Robot-Native Architecture Solves Core Deployment Pain Points
Unlike industry solutions that retrofit large language models, the STI-WM Spatiotemporally Integrated World Model is a native embodied-intelligence framework designed from the ground up for long-horizon robotic planning, online closed-loop control, and real physical interaction. It unifies four dimensions: spatial structure, temporal evolution, physical consistency, and execution robustness.
The model supports multimodal sensory inputs including RGB images, depth point clouds, and robot body data, encoding complex environmental information into compact, efficient spatiotemporal latent world states. It enables upper-level long-horizon (up to hundreds of seconds) task simulation and global trajectory planning, while outputting precise, controllable fine-grained action segments at the lower level.
Simultaneously, it leverages real-time environmental observation for dynamic correction and online replanning, constructing a complete physical intelligence feedback loop: understand world → simulate future → plan actions → execute and correct.
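The understand, simulate, plan, execute-and-correct loop described above can be illustrated with a minimal Python sketch. This is not Moshen's implementation. The 1-D position state, the additive dynamics, and every name here (`WorldState`, `understand`, `simulate`, `plan`, `run_closed_loop`) are assumptions chosen only to show the receding-horizon pattern: the controller plans several steps ahead, executes just the first action, then re-observes and replans.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WorldState:
    """Toy stand-in for a latent world state: a single 1-D position."""
    position: float


def understand(observation: float) -> WorldState:
    # "Understand world": encode the raw observation into a state estimate.
    return WorldState(position=observation)


def simulate(state: WorldState, actions: List[float]) -> float:
    # "Simulate future": predict the final position under an assumed
    # additive dynamics model (hypothetical, for illustration only).
    return state.position + sum(actions)


def plan(state: WorldState, goal: float, horizon: int, max_step: float) -> List[float]:
    # "Plan actions": at each step, simulate the plan so far and add a
    # bounded step toward the goal.
    actions: List[float] = []
    for _ in range(horizon):
        predicted = simulate(state, actions)
        step = max(-max_step, min(max_step, goal - predicted))
        actions.append(step)
    return actions


def run_closed_loop(
    start: float,
    goal: float,
    disturbance: Callable[[int], float],
    horizon: int = 5,
    max_step: float = 1.0,
    tol: float = 1e-6,
    max_iters: int = 100,
) -> List[float]:
    # "Execute and correct": apply only the first planned action, then
    # observe the (possibly disturbed) result and replan from it.
    true_pos = start
    trace = [true_pos]
    for t in range(max_iters):
        state = understand(true_pos)
        if abs(goal - state.position) <= tol:
            break
        actions = plan(state, goal, horizon, max_step)
        true_pos = true_pos + actions[0] + disturbance(t)
        trace.append(true_pos)
    return trace


if __name__ == "__main__":
    # An unmodeled push of -0.5 at step 1 is absorbed by replanning.
    print(run_closed_loop(0.0, 3.0, lambda t: -0.5 if t == 1 else 0.0))
```

In this toy run the disturbance knocks the system off the original plan, and the loop still reaches the goal because each new plan starts from the latest observation rather than from the earlier prediction.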
Dreamer-style models focus on environment prediction but neglect real robot control; LWM/PWM-type abstract action models fragment the spatiotemporal representation; and video generation models prioritize visual realism over physical feasibility. STI-WM, by contrast, avoids the trap of pure visual simulation. Built on three pillars (3D geometric constraints, dynamics validation, and closed-loop execution on real robots), it addresses the fundamental pain points of traditional models: information distortion, weak generalization, and deployment difficulty. This enables robots to truly understand 3D space, obey physical laws, plan tasks autonomously, and execute stably in a closed loop.
△ Moshen STI-WM 1.0 Spatiotemporally Integrated World Model Architecture
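The article does not describe how STI-WM enforces geometric and dynamic feasibility. As a rough illustration of what a "3D geometric constraint plus dynamics validation" gate can look like, the sketch below rejects a candidate trajectory if any waypoint intersects a spherical obstacle or if finite-difference velocity or acceleration exceeds a limit. The sphere obstacles, the limits, and all function names are assumptions for illustration, not the model's actual engine.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Sphere = Tuple[Point, float]  # (center, radius)


def collides(p: Point, obstacles: Sequence[Sphere], robot_radius: float) -> bool:
    # A waypoint collides if it is closer to any obstacle center than
    # the sum of the obstacle radius and the robot's own radius.
    return any(math.dist(p, center) < radius + robot_radius
               for center, radius in obstacles)


def _diff(seq: Sequence[Point], dt: float) -> List[Point]:
    # Finite-difference derivative between consecutive 3-D samples.
    return [tuple((b[i] - a[i]) / dt for i in range(3))
            for a, b in zip(seq, seq[1:])]


def within_dynamics(traj: Sequence[Point], dt: float,
                    v_max: float, a_max: float) -> bool:
    velocities = _diff(traj, dt)
    accelerations = _diff(velocities, dt)
    return (all(math.hypot(*v) <= v_max for v in velocities)
            and all(math.hypot(*a) <= a_max for a in accelerations))


def is_physically_feasible(traj: Sequence[Point], obstacles: Sequence[Sphere],
                           robot_radius: float, dt: float,
                           v_max: float, a_max: float) -> bool:
    # Note: only waypoints are checked, not the segments between them,
    # so a real system would need denser sampling or swept-volume tests.
    if any(collides(p, obstacles, robot_radius) for p in traj):
        return False
    return within_dynamics(traj, dt, v_max, a_max)
```

A planner could call `is_physically_feasible` on each candidate trajectory and discard those that fail before anything is sent to the robot.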
Six Core Technical Barriers Support Large-Scale Robot Commercialization
Built upon five years of full-stack self-researched accumulation, STI-WM establishes formidable technical advantages difficult for competitors to replicate:
- Native Spatiotemporal Integrated Modeling: Real-time coupling of spatial structure and temporal dynamics eliminates information loss from modular stacking, significantly improving inference efficiency and decision accuracy.
- Native 3D Perception Capability: Direct reconstruction of real physical space from point clouds, eliminating inherent flaws of 2D vision — such as depth ambiguity and spatial misjudgment.
- Built-in Physical Consistency Engine: Integrates collision detection and dynamics constraints to prevent physically impossible actions and environmental collapse, ensuring safe and stable real-world execution.
- Long-Horizon High-Level Planning: Breaks beyond short-segment action limits, supporting autonomous simulation of complex continuous tasks lasting up to hundreds of seconds — suitable for real-world complex operational scenarios.
- Edge Lightweight Deployment: Proprietary model compression, quantization, and distillation technologies enable cost-effective deployment of billion-parameter models on robot edge chips, dramatically lowering industrialization computational barriers.
- Few-Shot Strong Generalization: Achieves efficient adaptation to unfamiliar environments and long-tail tasks via large-scale pretraining in virtual worlds + minimal real-world fine-tuning, drastically reducing data dependency.
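Of the techniques named in the edge-deployment item above, quantization is the easiest to show in a few lines. The article gives no details of Moshen's proprietary pipeline, so the following is only a generic sketch of symmetric per-tensor int8 quantization, with hypothetical function names. It shows the basic trade: each weight is stored as an 8-bit integer plus one shared floating-point scale, at the cost of a bounded rounding error.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    if not weights:
        return [], 1.0
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    w = [0.5, -1.27, 0.0, 1.0]
    q, s = quantize_int8(w)
    restored = dequantize(q, s)
    worst = max(abs(a - b) for a, b in zip(w, restored))
    print(q, s, worst)
```

Because the scale is set by the largest-magnitude weight, the reconstruction error per weight is at most half a quantization step (`scale / 2`). Production systems typically refine this with per-channel scales and calibration data.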
△ Moshen’s “One Brain, Multiple Forms” Cross-Platform General Brain
Capital and Commercial Boom Accelerates Industrialization
Thanks to its original architecture innovation, full-stack self-developed technical moats, and deployable commercial capabilities, Moshen Intelligence is experiencing rapid growth — completing five funding rounds within half a year, with its 300 million RMB Pre-A round receiving fivefold oversubscription. It has earned strong recognition from national investment platforms, top industrial capital, and securities firms.
Commercial deployment is also accelerating rapidly. The company has established deep collaborations with industry leaders such as Unitree Robotics, Hechuan Technology, and Yijia Elderly Care, applying its technology across diverse real-world scenarios including industrial manufacturing, home healthcare, and commercial services.
Moshen Intelligence has so far signed strategic partnerships with nearly ten listed companies, including more than five industry leaders at the trillion-yuan scale. It expects to secure 1 billion RMB in orders over the next three years, a pace of industrialization it describes as far above the industry average.
△ Moshen Intelligence’s Strategic Partnership with Elderly Care Leader Yijia
Today, the AGI competition has officially entered the era of Physical Intelligence. The native embodied brain centered around world action models has become the core infrastructure for general-purpose robots.
Going forward, Moshen Intelligence will continue iterating the STI-WM model ecosystem, empowering all types of hardware including humanoid robots, quadruped robots, industrial arms, and service robots — accelerating the large-scale deployment of general embodied intelligence, driving Chinese-native physical AI technology to lead globally, and ushering in a new era of AGI in the physical world.
*Copyright © 2026. Unauthorized reproduction or use in any form is strictly prohibited. Offenders will be prosecuted.*