A methodological framework for high-quality dataset construction based on multimodal fusion and AI assistance

doi:10.12267/j.issn.2096-5931.2026.05.004

Abstract

Building high-quality datasets for AI applications often faces four practical challenges: unclear alignment with business goals, fragmented implementation, limited technical infrastructure, and excessive annotation costs. This paper presents a methodology that addresses these issues through a three-layer framework (demand mapping, intelligent governance, and value realization) implemented on China Telecom’s Knowledge Service Platform. The methodology has been validated in the high-end equipment manufacturing and consumer goods industries, where it shortened dataset construction time, and it offers a practical pathway for enterprise data asset development in the era of data marketization.

Key words: high-quality dataset, multimodal fusion, AI-assisted annotation, data asset management

CLC Number: TP18

WANG Dong, YANG Huafeng, LIU Weichen, LI Kang, LIU Jingqian, LIU Shiwei. A methodological framework for high-quality dataset construction based on multimodal fusion and AI assistance[J]. Information and Communications Technology and Policy, 2026, 52(5): 22-31.

URL:

http://ictp.caict.ac.cn/EN/10.12267/j.issn.2096-5931.2026.05.004

http://ictp.caict.ac.cn/EN/Y2026/V52/I5/22

Figures/Tables 6

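Platform capability module	High-end equipment manufacturing case	Consumer goods case	Common value
Data Center - multi-source access	4 types of industrial data sources; unified access completed in 2 h	More than 10 data sources; unified access completed in 1 week	Standardized connectors significantly reduce access cost and lead time
Data Center - multimodal fusion	Multimodal alignment of time-series signals and maintenance text	Structured user profiles, customer-service dialogue text, and product images/videos encoded into a shared semantic space	Cross-modal semantic alignment supports joint retrieval and reasoning
Data Center - AI-assisted annotation	Pre-annotation accuracy of 91%; manual verification workload reduced by 67%	Annotation efficiency increased by 400%	Cost-sensitive active learning plus a human-machine collaborative closed loop lowers cost and raises efficiency
Knowledge Center - knowledge graph	Fault knowledge graph constructed	Marketing knowledge base accumulated	Systematic accumulation and reuse of domain knowledge
Model Center - training and deployment	Iterative optimization of the fault prediction model	Training of recommendation and copywriting models	One-stop training and deployment
Agent Center - application building	Unplanned downtime reduced by 64.3%; maintenance cost reduced by 33%; OEE increased by 12%	Marketing conversion rate increased by 150%; ROI increased by 81.3%	Low-code orchestration that business staff can take part in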