Information and Communications Technology and Policy (信息通信技术与政策)

Information and Communications Technology and Policy, 2026, Vol. 52, Issue 5: 2-9. doi: 10.12267/j.issn.2096-5931.2026.05.001

Special Topic: High-Quality Datasets

Research on high-quality dataset supply from the perspective of full life cycle demand (全生命周期需求视角的高质量数据集供给研究)

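FAN Wei1, LI Sun1, YAN Shu1, WANG Tiantian2, CAO Feng1

  1 Institute of Artificial Intelligence, China Academy of Information and Communications Technology, Beijing 100083, China
  2 Institute of Policy and Economics, China Academy of Information and Communications Technology, Beijing 100083, China
  • Received: 2026-03-29 Online: 2026-05-25 Published: 2026-05-28
  • About the authors:
    FAN Wei (樊威), Senior Engineer at the Institute of Artificial Intelligence, China Academy of Information and Communications Technology. His research focuses on building high-quality datasets for artificial intelligence and on data annotation.
    LI Sun (李荪), Deputy Director of the Platform and Engineering Department, Institute of Artificial Intelligence, China Academy of Information and Communications Technology, and Senior Engineer. Her research focuses on artificial intelligence policy, standards, and industry, covering machine learning, speech perception and cognition technologies, and their integration into products.
    YAN Shu (闫树), Deputy Chief Engineer of the Institute of Artificial Intelligence, China Academy of Information and Communications Technology, and Professor-level Senior Engineer. His research focuses on standards and industry studies for artificial intelligence and big data, covering high-quality dataset construction and core data technologies.
    WANG Tiantian (王甜甜), Principal Engineer of the Regulatory Research Department, Institute of Policy and Economics, China Academy of Information and Communications Technology, and Senior Engineer. Her research focuses on standards and industry studies for artificial intelligence and the platform economy.
    CAO Feng (曹峰), Director of the Platform and Engineering Department, Institute of Artificial Intelligence, China Academy of Information and Communications Technology, and Senior Engineer. His research focuses on artificial intelligence policy, industry, and applications.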

Abstract:

As artificial intelligence technology iterates rapidly, model training has moved from a stage of general-purpose scale expansion to a new stage driven by industry applications and supported by high-quality data. The type, quality, and supply capacity of data directly determine how well models adapt to specific industries and how effectively they are deployed. Data demands differ across the full life cycle of large models, and the supply of high-quality datasets for industry scenarios still has shortcomings. This paper therefore covers all stages of the large-model life cycle: pre-training, supervised fine-tuning, reinforcement alignment, and engineering application. It traces how data demands at each stage evolve from “scale-oriented” to “quality-oriented”, compares the supply of high-quality data in China and abroad, and examines China’s shortcomings in public data utilization, open-source data, and annotation ecosystems. On this basis, it proposes targeted measures to optimize data supply, with the aim of breaking through the training-data bottleneck for large models and supporting the development of the artificial intelligence industry.

Key words: high-quality dataset construction, training data requirements, training data bottlenecks, supply differences

CLC number: