Information and Communications Technology and Policy

Information and Communications Technology and Policy, 2026, Vol. 52, Issue (5): 41-49. doi: 10.12267/j.issn.2096-5931.2026.05.006

Construction methods and practice of high-quality datasets for telecommunications large model training

XIAO Wenbin1, LI Yufei2, HUANG Yixiao1, MA Wenda2   

  1. China Mobile Communications Group Guangdong Co., Ltd., Guangzhou 510150, China
  2. Institute of Artificial Intelligence, China Academy of Information and Communications Technology, Beijing 100191, China
  • Received: 2026-04-03   Online: 2026-05-25   Published: 2026-05-28

Abstract:

With the rapid evolution of generative artificial intelligence, data quality has become the core bottleneck limiting the performance of industry-specific large language models. Telecom operators hold ZB-scale cross-domain data, which gives them an inherent resource advantage for training vertical large language models. However, raw communications data is typically multi-source and heterogeneous, highly redundant, and short of long-tail samples, which limits its effectiveness when it is used directly for model training. To address this, this paper systematically proposes a high-quality dataset construction method covering “collection-governance-annotation-evaluation”, built on key technologies such as deep semantic compression, long-tail data synthesis based on a Wasserstein Generative Adversarial Network (WGAN) and a Long Short-Term Memory (LSTM) network, domain ontology construction, and human-machine collaborative annotation. In addition, a coordination mechanism between general and specialized knowledge data is designed to mitigate catastrophic forgetting during industry fine-tuning. Practical application indicates that the method is effective and can serve as a reference for constructing high-quality datasets in the telecommunications industry.

Key words: high-quality dataset, telecommunications industry, data synthesis, domain ontology, data augmentation
