Information and Communications Technology and Policy

Information and Communications Technology and Policy, 2026, Vol. 52, Issue (5): 2-9. doi: 10.12267/j.issn.2096-5931.2026.05.001

Research on high-quality dataset supply from the perspective of full life cycle demand

FAN Wei1, LI Sun1, YAN Shu1, WANG Tiantian2, CAO Feng1   

  1 Institute of Artificial Intelligence, China Academy of Information and Communications Technology, Beijing 100083, China
  2 Institute of Policy and Economics, China Academy of Information and Communications Technology, Beijing 100083, China
  Received: 2026-03-29   Online: 2026-05-25   Published: 2026-05-28

Abstract:

As artificial intelligence technology iterates rapidly, model training has shifted from a stage of general-purpose scale expansion to a new stage driven by industrial applications and supported by high-quality data. The type, quality, and supply capacity of data directly determine how well models adapt to specific industries and how effective they are in practice. Data demands vary across the full life cycle of large models, and the supply of high-quality datasets for industrial scenarios still has shortcomings. This paper therefore focuses on the full life cycle of large models: pre-training, supervised fine-tuning, reinforcement alignment, and engineering application. It sorts out the data demands of each stage and analyzes their evolution from “scale-oriented” to “quality-oriented”, compares the international and domestic supply of high-quality data, and examines China’s shortcomings in public data utilization, data openness, and annotation ecosystems. On this basis, targeted measures are proposed to optimize data supply, aiming to break through the training data bottleneck for large models and support the development of the artificial intelligence industry.

Key words: construction of high-quality data sets, data training requirements, training data bottlenecks, supply differences
