Key technologies and typical applications of “AI+” empowered embodied intelligence robots

doi:10.12267/j.issn.2096-5931.2025.08.003

Abstract

The deep convergence of artificial intelligence (AI) and robotics has become a decisive catalyst for the next leap in robot technology, giving rise to new forms of intelligent agents. Among these, embodied intelligence robots stand out for their core emphasis on physical embodiment and environmental interaction. Focusing on this specific form enabled by “AI+”, this paper surveys the conceptual evolution and current development of embodied intelligence robots, highlighting how AI reshapes perception, cognition, decision-making, execution, and data foundations. It examines key technologies, namely multimodal perception, large language models, and deep reinforcement learning, and reviews their deployment in industrial manufacturing, healthcare, and household services to illustrate the concrete achievements of “AI+” empowered embodied intelligence robots. The paper also identifies practical bottlenecks, including high computational demands and limited algorithmic generalization and robustness, and discusses future directions such as more efficient model architectures, cross-modal synergies, and broader domain expansion. These insights are intended to serve as a reference for both technological innovation and industrial adoption of embodied intelligence robots.

Key words: AI+, embodied intelligence, multimodal perception

LI Tengda, ZHU Ziyu, HAN Ziqi. Key technologies and typical applications of “AI+” empowered embodied intelligence robots[J]. Information and Communications Technology and Policy, 2025, 51(8): 15-25.

URL:

http://ictp.caict.ac.cn/EN/10.12267/j.issn.2096-5931.2025.08.003

http://ictp.caict.ac.cn/EN/Y2025/V51/I8/15

Figures/Tables 4

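Feature	Traditional robots	AI + robots (broad sense)	Embodied intelligence robots (subset of "AI + robots")
Core driver	Program control	AI algorithms	AI algorithms
Intelligence level	Low (executes preset tasks)	Medium to high (capable of perception and decision-making)	High (emphasizes the closed loop of perception, cognition, decision-making, and execution)
Interaction depth	Shallow (limited interaction with the environment)	Varied (depends on the specific application)	Deep (active interaction and feedback learning through a physical body)
Environmental adaptability	Low (relies on structured environments)	Medium to high (depends on AI capability)	High (must adapt to unstructured, dynamic environments)
Learning ability	None or weak	Present (data- or model-based)	Strong (emphasizes continual learning through environmental interaction)
Typical examples	Industrial robotic arms (basic functions)	Smart robot vacuum cleaners, intelligent customer-service robots	Humanoid robots, advanced nursing-care robots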

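Control paradigm	Representative methods	Strengths	Limitations
Rule-based	ZMP, PID	High real-time performance; simple to implement	Poor adaptability; struggles with strong nonlinearity
Model-based	MPC, WBC	High precision; physical constraints can be incorporated	High development cost; sensitive to model accuracy
Learning-based	DRL, IL	Autonomous exploration; strong generalization	Large demand for data and simulation resources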
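
To make the rule-based row concrete, the following is a minimal sketch of a discrete PID (proportional-integral-derivative) controller, written in Python purely for illustration. The `PID` class, the `simulate` helper, the gains, and the toy first-order plant dx/dt = -x + u are all assumptions introduced here; none of them come from the paper. The fixed update rule reflects the simplicity and low per-step cost noted in the table, and the hand-picked gains hint at why such controllers adapt poorly when the plant changes.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Discrete PID controller (hypothetical helper, not from the paper)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measured: float, dt: float) -> float:
        """Return the control output for one sampling interval of length dt."""
        error = setpoint - measured
        self.integral += error * dt                     # accumulate the I term
        derivative = (error - self.prev_error) / dt     # finite-difference D term
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 2000, dt: float = 0.01) -> float:
    """Drive an assumed toy plant dx/dt = -x + u toward a setpoint of 1.0."""
    pid = PID(kp=2.0, ki=1.0, kd=0.0)  # illustrative gains, chosen by hand
    x = 0.0
    for _ in range(steps):
        u = pid.step(1.0, x, dt)
        x += (-x + u) * dt  # forward-Euler integration of the plant
    return x
```

Calling `simulate()` runs the loop for 20 simulated seconds, after which the plant state should settle close to the setpoint. Model-based and learning-based controllers in the table replace the fixed rule in `step` with an optimizer over a dynamics model or with a learned policy, respectively.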

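Paradigm	Learning-driven generation (diffusion + Transformer)	Physics-driven generation (GAN + VAE + physical priors)
Representative models	Stable Diffusion, Imagen, DALL·E, Gato	GIRAFFE, Physics-informed GAN, dual-representation Gaussian-particle method
Generation strengths	Strong semantic consistency; high image/video fidelity; flexible text control	Strong physical consistency; can directly output 3D scenes and dynamics data
Training/inference efficiency	Diffusion denoising steps can be parallelized; Transformers offer fast inference pipelines	Adversarial training allows concurrent sampling; VAE encoding compression accelerates rendering
Typical platforms	NVIDIA Cosmos World Foundation Model	High-fidelity digital twin engines, NVIDIA Blueprint
Applicable scenarios	Large-scale 2D/3D synthetic data; zero-shot vision tasks	Large-scale 3D synthetic data; complex industrial assembly; mechanics simulation; "virtual-to-real" closed-loop optimization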