[1] LU D, PANG T, DU C, et al. Test-time backdoor attacks on multimodal large language models[J]. arXiv Preprint, arXiv: 2402.08577, 2024.
[2] LIU H, LIU Z, TANG R, et al. LoRA-as-an-attack! piercing LLM safety under the share-and-play scenario[J]. arXiv Preprint, arXiv: 2403.00108, 2024.
[3] WEI C, MENG W, ZHANG Z, et al. LMSanitator: defending prompt-tuning against task-agnostic backdoors[J]. arXiv Preprint, arXiv: 2308.13904, 2023.
[4] CHEN D, WANG H, HUO Y, et al. GameGPT: multi-agent collaborative framework for game development[J]. arXiv Preprint, arXiv: 2310.08067, 2023.
[5] DU Y, LI S, TORRALBA A, et al. Improving factuality and reasoning in language models through multiagent debate[J]. arXiv Preprint, arXiv: 2305.14325, 2023.
[6] ANDERSON M, AMIT G, GOLDSTEEN A. Is my data in your retrieval database? membership inference attacks against retrieval augmented generation[J]. arXiv Preprint, arXiv: 2405.20446, 2024.
[7] LI H, GUO D, FAN W, et al. Multi-step jailbreaking privacy attacks on ChatGPT[J]. arXiv Preprint, arXiv: 2304.05197, 2023.
[8] LI H, XU M, SONG Y. Sentence embedding leaks more information than you expect: generative embedding inversion attack to recover the whole sentence[J]. arXiv Preprint, arXiv: 2305.03010, 2023.
[9] BORGNIA E, GEIPING J, CHEREPANOVA V, et al. DP-InstaHide: provably defusing poisoning and backdoor attacks with differentially private data augmentations[J]. arXiv Preprint, arXiv: 2103.02079, 2021.
[10] BAUMGÄRTNER T, GAO Y, ALON D, et al. Best-of-venom: attacking RLHF by injecting poisoned preference data[J]. arXiv Preprint, arXiv: 2404.05530, 2024.
[11] HUANG T, HU S, LIU L. Vaccine: perturbation-aware alignment for large language model[J]. arXiv Preprint, arXiv: 2402.01109, 2024.
[12] HUANG T, HU S, ILHAN F, et al. Lazy safety alignment for large language models against harmful fine-tuning[J]. arXiv Preprint, arXiv: 2405.18641, 2024.
[13] HUANG T, BHATTACHARYA G, JOSHI P, et al. Antidote: post-fine-tuning safety alignment for large language models against harmful fine-tuning[J]. arXiv Preprint, arXiv: 2408.09600, 2024.
[14] LI S, SUN T, CHENG Q, et al. Agent alignment in evolving social norms[J]. arXiv Preprint, arXiv: 2401.04620, 2024.
[15] LIN B, BOUNEFFOUF D, CECCHI G, et al. Towards healthy AI: large language models need therapists too[J]. arXiv Preprint, arXiv: 2304.00416, 2023.
[16] GRESHAKE K, ABDELNABI S, MISHRA S, et al. Not what you’ve signed up for: compromising real-world LLM-integrated applications with indirect prompt injection[J]. arXiv Preprint, arXiv: 2302.12173, 2023.