A review of multimodal deepfake detection technology

doi:10.12267/j.issn.2096-5931.2025.10.011

Abstract:

The rapid development of deepfake technology has deepened the crisis of social trust and heightened security threats, and its abuse has spread from fake news and identity fraud to a wider range of scenarios. To meet these challenges, deepfake detection has gradually evolved from single-modal detection to multimodal fusion detection, which significantly improves detection accuracy and robustness by integrating multi-source information such as audio and visual signals. This paper first analyzes the characteristics and application scenarios of multimodal datasets. It then classifies and describes the technical methods under a detection-localization-interpretation framework, and evaluates the practical performance of existing detection platforms. Finally, it discusses future research directions. The aim of this study is to construct a technical map of multimodal deepfake detection and to provide theoretical support and practical reference for the development of the field.

Key words: deepfake detection, multimodal deepfake detection, datasets

CLC Number: TN927.23

WANG Ling, YAN Kun, NIE Peng. A review of multimodal deepfake detection technology[J]. Information and Communications Technology and Policy, 2025, 51(10): 73-86.

URL:

http://ictp.caict.ac.cn/EN/10.12267/j.issn.2096-5931.2025.10.011

http://ictp.caict.ac.cn/EN/Y2025/V51/I10/73

Figures/Tables 4

References 72

[1]	GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
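[2]	ZHANG L, LU T L, DU Y H. Survey of deepfake detection methods for face videos[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(1): 1-26. (in Chinese) doi: 10.3778/j.issn.1673-9418.2205035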
[3]	RADFORD A, METZ L, CHINTALA S. Unsupervised representation learning with deep convolutional generative adversarial networks[J/OL]. arXiv Preprint, arXiv:1511.06434, 2016. http://arxiv.org/abs/1511.06434.
[4]	ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[C]// 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, 2017: 2242-2251.
[5]	SUWAJANAKORN S, SEITZ S M, KEMELMACHER-SHLIZERMAN I. Synthesizing Obama: learning lip sync from audio[J]. ACM Transactions on Graphics, 2017, 36(4): 1-13.
[6]	THIES J, ZOLLHOFER M, STAMMINGER M, et al. Face2Face: real-time face capture and reenactment of RGB videos[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE, 2016: 2387-2395.
[7]	KARRAS T, LAINE S, AITTALA M, et al. Analyzing and improving the image quality of StyleGAN[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020: 8107-8116.
[8]	SIAROHIN A, LATHUILIÈRE S, TULYAKOV S, et al. First order motion model for image animation[J/OL]. arXiv Preprint, arXiv:2003.00196, 2020. http://arxiv.org/abs/2003.00196.
[9]	LIU W, PIAO Z, TU Z, et al. Liquid warping GAN with attention: a unified framework for human image synthesis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(9): 5115-5133.
[10]	PRAJWAL K R, MUKHOPADHYAY R, NAMBOODIRI V P, et al. A lip sync expert is all you need for speech to lip generation in the wild[C]// Proceedings of the 28th ACM International Conference on Multimedia. Seattle: ACM, 2020: 484-492.
[11]	WU F, LIU L, HAO F, et al. Text-to-image synthesis based on object-guided joint-decoding transformer[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022: 18092-18101.
[12]	SAUER A, KARRAS T, LAINE S, et al. StyleGAN-T: unlocking the power of GANs for fast large-scale text-to-image synthesis[C]// Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023: 30105-30118.
[13]	LIU X, REN J, SIAROHIN A, et al. HyperHuman: hyper-realistic human generation with latent structural diffusion[J/OL]. arXiv Preprint, arXiv:2310.08579, 2024. http://arxiv.org/abs/2310.08579.
[14]	WANG Q, BAI X, WANG H, et al. InstantID: zero-shot identity-preserving generation in seconds[J/OL]. arXiv Preprint, arXiv:2401.07519, 2024. http://arxiv.org/abs/2401.07519.
[15]	ZHANG C, WANG C, ZHANG J, et al. DREAM-Talk: diffusion-based realistic emotional audio-driven method for single image talking face generation[J/OL]. arXiv Preprint, arXiv:2312.13578, 2023. http://arxiv.org/abs/2312.13578.
[16]	LI Y, LYU S. Exposing deepfake videos by detecting face warping artifacts[J/OL]. arXiv Preprint, arXiv:1811.00656, 2019. http://arxiv.org/abs/1811.00656.
[17]	LI Y, CHANG M C, LYU S. In ictu oculi: exposing AI created fake videos by detecting eye blinking[C]// 2018 IEEE International Workshop on Information Forensics and Security (WIFS). Hong Kong: IEEE, 2018: 1-7.
[18]	AGARWAL S, FARID H, FRIED O, et al. Detecting deep-fake videos from phoneme-viseme mismatches[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle: IEEE, 2020: 2814-2822.
[19]	LIU H, LI X, ZHOU W, et al. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville: IEEE, 2021: 772-781.
[20]	CIFTCI U A, DEMIR I, YIN L. FakeCatcher: detection of synthetic portrait videos using biological signals[J/OL]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024: 1. DOI:10.1109/TPAMI.2020.3009287.
[21]	HALIASSOS A, VOUGIOUKAS K, PETRIDIS S, et al. Lips don’t lie: a generalisable and robust approach to face forgery detection[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville: IEEE, 2021: 5037-5047.
[22]	ZHAO H, WEI T, ZHOU W, et al. Multi-attentional deepfake detection[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville: IEEE, 2021: 2185-2194.
[23]	WANG Z, BAO J, ZHOU W, et al. DIRE for diffusion-generated image detection[C]// 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris: IEEE, 2023: 22388-22398.
[24]	ZHANG R, WANG H, LIU H, et al. Generalized face forgery detection with self-supervised face geometry information analysis network[J]. Applied Soft Computing, 2024, 166: 112143.
[25]	NAGRANI A, CHUNG J S, ZISSERMAN A. VoxCeleb: a large-scale speaker identification dataset[C]// Interspeech 2017. ISCA, 2017: 2616-2620.
[26]	CHUNG J S, NAGRANI A, ZISSERMAN A. VoxCeleb2: deep speaker recognition[C]// Interspeech 2018. ISCA, 2018: 1086-1090.
[27]	KHALID H, TARIQ S, KIM M, et al. FakeAVCeleb: a novel audio-video multimodal deepfake dataset[J/OL]. arXiv Preprint, arXiv:2108.05080, 2022. http://arxiv.org/abs/2108.05080.
[28]	KOWALSKI M. FaceSwap[EB/OL]. 2020[2025-02-20]. https://github.com/deepfakes/faceswap.
[29]	Iperov. DeepFaceLab[EB/OL]. (2020-04-09)[2025-02-20]. https://github.com/iperov/DeepFaceLab.
[30]	Rudrabha. Wav2Lip[EB/OL]. (2020-08-18)[2025-02-20]. https://github.com/Rudrabha/Wav2Lip.
[31]	YANG W, ZHOU X, CHEN Z, et al. AVoiD-DF: audio-visual joint learning for detecting deepfake[J]. IEEE Transactions on Information Forensics and Security, 2023, 18: 2015-2029.
[32]	SANDERSON C. The VIDTIMIT database[EB/OL]. 2002[2025-02-20]. https://infoscience.epfl.ch/record/82748?ln=fr&v=%5B%27pdf%27%5D.
[33]	JIA Y, ZHANG Y, WEISS R J, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 4485-4495.
[34]	HOU Y, FU H, CHEN C, et al. PolyGlotFake: a novel multilingual and multimodal deepfake dataset[C]// Pattern Recognition, 27th International Conference. Kolkata: Springer Nature Switzerland, 2025: 180-193.
[35]	LI J, TU W, XIAO L. FreeVC: towards high-quality text-free one-shot voice conversion[C]// ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rhodes Island: IEEE, 2023: 1-5.
[36]	CHEN S, WANG C, WU Y, et al. Neural codec language models are zero-shot text to speech synthesizers[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2025, 33: 705-718.
[37]	CHANDRA N A, MURTFELDT R, QIU L, et al. Deepfake-Eval-2024: a multi-modal in-the-wild benchmark of deepfakes circulated in 2024[J/OL]. arXiv Preprint, arXiv:2503.02857, 2025. http://arxiv.org/abs/2503.02857.
[38]	QI P, BU Y, CAO J, et al. FakeSV: a multimodal benchmark with rich social context for fake news detection on short video platforms[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(12): 14444-14452.
[39]	CAI Z, STEFANOV K, DHALL A, et al. Do you really mean that? content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization[C]// 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA). Sydney: IEEE, 2022: 1-10.
[40]	SHAO R, WU T, LIU Z. Detecting and grounding multi-modal media manipulation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver: IEEE, 2023: 6904-6913.
[41]	LIAN J, LIU L, WANG Y, et al. A large-scale interpretable multi-modality benchmark for facial image forgery localization[J/OL]. arXiv Preprint, arXiv:2412.19685, 2024. http://arxiv.org/abs/2412.19685.
[42]	LEE C H, LIU Z, WU L, et al. MaskGAN: towards diverse and interactive facial image manipulation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020: 5548-5557.
[43]	NVIDIA Labs. FFHQ[EB/OL]. (2019-03-19)[2025-02-20]. https://github.com/NVlabs/ffhq-dataset.
[44]	XU Z, ZHANG X, LI R, et al. FakeShield: explainable image forgery detection and localization via multi-modal large language models[J/OL]. arXiv Preprint, arXiv:2410.02761, 2025. http://arxiv.org/abs/2410.02761.
[45]	OpenAI. GPT-4o[EB/OL]. (2024-05-13)[2025-02-20]. https://openai.com.
[46]	LEWIS J K, TOUBAL I E, CHEN H, et al. Deepfake video detection based on spatial, spectral, and temporal inconsistencies using multimodal deep learning[C]// 2020 IEEE Applied Imagery Pattern Recognition Workshop (AIPR). Washington: IEEE, 2020: 1-9.
[47]	OpenDataLab. DeepFake detection challenge (DFDC)[EB/OL]. (2020-06-09)[2025-02-20]. https://opendatalab.org.cn/OpenDataLab/DFDC.
[48]	WANG J, WU Z, OUYANG W, et al. M2TR: multi-modal multi-scale transformers for deepfake detection[C]// Proceedings of the 2022 International Conference on Multimedia Retrieval. Newark: ACM, 2022: 615-623.
[49]	SALVI D, LIU H, MANDELLI S, et al. A robust approach to multimodal deepfake detection[J]. Journal of Imaging, 2023, 9(6): 122.
[50]	ANAS R M, MAHMOOD M K. MultimodalTrace: deepfake detection using audiovisual representation learning[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Vancouver: IEEE, 2023: 993-1000.
[51]	MUPPALLA S, JIA S, LYU S. Integrating audio-visual features for multimodal deepfake detection[C]// 2023 IEEE MIT Undergraduate Research Technology Conference (URTC). Cambridge: IEEE, 2023: 1-5.
[52]	SHIRLEY C P, JINGLE B J, ABISHA M B, et al. Deepfake detection using multi-modal fusion combined with attention mechanism[C]// 2024 4th International Conference on Sustainable Expert Systems (ICSES). IEEE, 2024: 1194-1199.
[53]	GANDHI K, KULKARNI P, SHAH T, et al. A multimodal framework for deepfake detection[J/OL]. Journal of Electrical Systems, 2024. DOI:10.53555/jes.v20i10s.6126.
[54]	NIE F, NI J, ZHANG J, et al. FRADE: forgery-aware audio-distilled multimodal learning for deepfake detection[C]// Proceedings of the 32nd ACM International Conference on Multimedia. Melbourne: ACM, 2024: 6297-6306.
[55]	WU Y, ZHAN P, ZHANG Y, et al. Multimodal fusion with co-attention networks for fake news detection[C]// Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics, 2021: 2560-2569.
[56]	ZENGIN A Z, GÜNDÜZ Ö. Identifying topical influencers on Twitter based on user behavior and network topology[J]. Knowledge-Based Systems, 2018, 141: 211-221.
[57]	CAO Q, SHEN H, CEN K, et al. DeepHawkes: bridging the gap between prediction and understanding of information cascades[C]// Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. Singapore: ACM, 2017: 1149-1158.
[58]	JABEEN S, KHAN U G, IQBAL R, et al. A deep multimodal system for provenance filtering with universal forgery detection and localization[J]. Multimedia Tools and Applications, 2021, 80(11): 17025-17044.
[59]	DONG J, WANG W, TAN T. CASIA image tampering detection evaluation database[C]// 2013 IEEE China Summit and International Conference on Signal and Information Processing. Beijing: IEEE, 2013: 422-426.
[60]	ZHANG R, WANG H, DU M, et al. UMMAFormer: a universal multimodal-adaptive transformer framework for temporal forgery localization[C]// Proceedings of the 31st ACM International Conference on Multimedia. Ottawa: ACM, 2023: 8749-8759.
[61]	TRIARIDIS K, MEZARIS V. Exploring multi-modal fusion for image manipulation detection and localization[C]// MultiMedia Modeling, 30th International Conference. Amsterdam: Springer Nature Switzerland, 2024: 198-211.
[62]	SHUAI C, ZHONG J, WU S, et al. Locate and verify: a two-stream network for improved deepfake detection[C]// Proceedings of the 31st ACM International Conference on Multimedia. Ottawa: ACM, 2023: 7131-7142.
[63]	ROSSLER A, COZZOLINO D, VERDOLIVA L, et al. FaceForensics++: learning to detect manipulated facial images[C]// 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul: IEEE, 2019: 1-11.
[64]	LI Y, LIU X, WANG X, et al. FakeBench: probing explainable fake image detection via large multimodal models[J/OL]. arXiv Preprint, arXiv:2404.13306, 2024. http://arxiv.org/abs/2404.13306.
[65]	HUANG Z, HU J, LI X, et al. SIDA: social media image deepfake detection, localization and explanation with large multimodal model[J/OL]. arXiv Preprint, arXiv:2412.04292, 2025. http://arxiv.org/abs/2412.04292.
[66]	HE X, ZHOU Y, FAN B, et al. VLForgery face triad: detection, localization and attribution via multimodal large language models[J/OL]. arXiv Preprint, arXiv:2503.06142, 2025. http://arxiv.org/abs/2503.06142.
[67]	LIU J, ZHANG F, ZHU J, et al. ForgeryGPT: multimodal large language model for explainable image forgery detection and localization[J/OL]. arXiv Preprint, arXiv:2410.10238, 2025. http://arxiv.org/abs/2410.10238.
[68]	YU P, FEI J, GAO H, et al. Unlocking the capabilities of vision-language models for generalizable and explainable deepfake detection[J/OL]. arXiv Preprint, arXiv:2503.14853, 2025. http://arxiv.org/abs/2503.14853.
[69]	WEN S, YE J, FENG P, et al. Spot the fake: large multimodal model-based synthetic image detection with artifact explanation[J/OL]. arXiv Preprint, arXiv:2503.14905, 2025. http://arxiv.org/abs/2503.14905.
[70]	ZHENG L, CHIANG W L, SHENG Y, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena[C]// Proceedings of the 37th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2023: 46595-46623.
[71]	HAQ I U, MALIK K M, MUHAMMAD K. Multimodal neurosymbolic approach for explainable deepfake detection[J]. ACM Transactions on Multimedia Computing Communications and Applications, 2024, 20(11): 1-16.
[72]	ARUNA S, MATTHEW G, ROSALIND P, et al. The presidential deepfakes dataset[EB/OL]. (2021-09-01)[2025-02-20]. https://www.media.mit.edu/publications/presidential-deepfakes-dataset.

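Dataset	Modality	Sample size	Annotation granularity	Application scenario
VoxCeleb	Audio	100,000+ utterances	Speaker identity labels	Speaker identification
VoxCeleb2	Audio	1,000,000+ utterances	Speaker identity labels	Extended speaker identification
FakeAVCeleb	Video + audio	20,000	Real/fake labels	Label detection
DefakeAVMiT	Video + audio	7,020	Real/fake labels	Label detection
PolyGlotFake	Video + audio	15,238	Real/fake labels + multilingual	Label detection
Deepfake-Eval-2024	Video + audio + image	101.5 h of audio/video + 1,975 images	Real/fake labels + multilingual	Label detection
LAV-DF	Video + audio	136,304	Real/fake labels + timestamps	Localization detection
DGM⁴	Image + text	230,000	Real/fake labels + manipulation-type labels + localization coordinates	Multi-type localization detection
MMTT	Image + text	128,303	Real/fake labels + explanatory text on the reason for manipulation	Explainable forgery detection
MMTD-Set	Image + text + mask	—	Real/fake labels + masks + descriptions	Explainable forgery detection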

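Method	Dataset	Metric	Performance	Key innovation	Year
Lewis et al.	DFDC	ACC	61.95%	DCT-based spectral analysis of facial features	2020
M2TR	SR-DF / FF++	ACC	91.2% / 99.5%	Multi-scale transformer that captures forgery features at different scales	2021
Salvi et al.	—	—	—	Training on features extracted from the video modality alone	2023
AVoiD-DF	DefakeAVMiT	ACC	83.70%	Multimodal joint learning with spatiotemporal feature fusion	2023
Multimodaltrace	FakeAVCeleb	ACC	92.9%	Cross-modal multi-level hybrid learning	2023
Muppalla et al.	FakeAVCeleb	ACC	90.51%	Cross-modal multi-task learning	2023
Shirley et al.	Mixed datasets	ACC	96.8%	Multimodal fusion with an attention mechanism	2024
Gandhi et al.	Mixed datasets	ACC	94%	Combination of facial feature extraction and mel-spectrogram analysis	2024
FRADE	FakeAVCeleb	AUC	93.1%	Audio-distilled cross-modal interaction	2024
MCAN	Twitter / Weibo	ACC	80.9% / 89.9%	Stacked co-attention mechanism	2021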
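
The table above mixes two metrics: most rows report ACC, while FRADE reports AUC, so the figures are not directly comparable. The following is a minimal sketch of how the two are usually computed from detector scores. The function names, the 0.5 threshold, and the toy labels and scores are illustrative assumptions and are not taken from any of the listed papers.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label (1 = fake)."""
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random fake scores above a random real (ties count half)."""
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((f > r) + 0.5 * (f == r) for f in fake for r in real)
    return wins / (len(fake) * len(real))


# Hypothetical detector outputs: 1 = fake, 0 = real.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]

print(f"ACC = {accuracy(labels, scores):.3f}")
print(f"AUC = {auc(labels, scores):.3f}")
```

ACC depends on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason a 93.1% AUC and a 92.9% ACC on the same dataset should not be read as equivalent.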

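Method	Dataset	Metric	Performance	Key innovation	Year
Jabeen et al.	CASIA V2.0	ACC / Precision	93.04% / 85.47%	Error level analysis	2020
UMMAFormer	LAV-DF	AP@0.95 / AR@100	77.72% / 97.34%	Multimodal adaptation for temporal data	2023
DGM⁴	DGM⁴	ACC / Precision	93.44% / 70.9%	Multi-task detection with shallow and deep reasoning	2023
Triaridis et al.	—	F1	0.750 (early fusion) / 0.751 (late fusion)	Late fusion and early fusion	2024
Shuai et al.	FF++	ACC	Above 70%	Two-stream network	2023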
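
Localization metrics such as AP@0.95 in the table above are commonly defined through temporal intersection-over-union (IoU): a predicted forged segment counts as correct only if its IoU with a ground-truth segment reaches the stated threshold. The sketch below illustrates that matching rule under this assumption. The function names and the example segments are hypothetical, and the exact evaluation protocol of each listed paper may differ.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Intersection-over-union of two time segments."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred: Segment, truth: Segment, threshold: float = 0.95) -> bool:
    """True if the prediction matches the ground truth at the given IoU threshold."""
    return temporal_iou(pred, truth) >= threshold


# Hypothetical ground-truth forged segment and two predictions.
truth = (2.0, 4.0)
close = (2.05, 4.0)  # boundary off by 0.05 s
loose = (2.3, 4.0)   # boundary off by 0.3 s

for name, pred in (("close", close), ("loose", loose)):
    print(name, round(temporal_iou(pred, truth), 3),
          is_hit(pred, truth), is_hit(pred, truth, threshold=0.5))
```

A threshold of 0.95 is strict: in this toy case a boundary error of 0.3 s on a 2 s segment already fails the match, even though it would pass at a threshold of 0.5.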