信息通信技术与政策

信息通信技术与政策, 2026, Vol. 52, Issue 5: 41-49. doi: 10.12267/j.issn.2096-5931.2026.05.006

Special topic: High-Quality Datasets

Construction Methods and Practice of High-Quality Datasets for Telecommunications Large Model Training

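XIAO Wenbin¹, LI Yufei², HUANG Yixiao¹, MA Wenda²

  1. China Mobile Communications Group Guangdong Co., Ltd., Guangzhou 510150, China
  2. Institute of Artificial Intelligence, China Academy of Information and Communications Technology, Beijing 100191, China
  • Received: 2026-04-03 Online: 2026-05-25 Published: 2026-05-28
  • About the authors:
    XIAO Wenbin is a senior engineer at China Mobile Communications Group Guangdong Co., Ltd. His research focuses on agent development, data governance, and knowledge management.
    LI Yufei is an engineer at the Institute of Artificial Intelligence, China Academy of Information and Communications Technology. Her research focuses on data assets, data elements, data valuation, and data governance.
    HUANG Yixiao is a senior engineer at China Mobile Communications Group Guangdong Co., Ltd. His research focuses on cloud computing, AI, and big data.
    MA Wenda is an engineer at the Institute of Artificial Intelligence, China Academy of Information and Communications Technology. His research focuses on data governance, data operations, and data trading.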

Abstract:

With the rapid evolution of generative artificial intelligence, data quality has become the core bottleneck limiting the performance of industry large models. Telecom operators hold ZB-scale cross-domain data, which gives them an inherent resource advantage for training vertical large models. However, raw telecommunications data is typically multi-source and heterogeneous, highly redundant, and short of long-tail samples, so applying it directly to model training yields limited results. To address this, this paper systematically proposes a "collection-governance-annotation-evaluation" method for constructing high-quality datasets. The method covers key technologies including deep semantic compression, long-tail data synthesis based on a Wasserstein generative adversarial network (WGAN) and a long short-term memory (LSTM) network, domain ontology construction, and human-machine collaborative annotation. It also includes a coordination mechanism between general-knowledge and domain-specific data that effectively mitigates "catastrophic forgetting" during industry fine-tuning. Practice shows that the method is effective and can serve as a reference for constructing high-quality datasets in the telecommunications industry.

Key words: high-quality dataset, telecommunications industry, data synthesis, domain ontology, data augmentation
