[1] LU D, PANG T, DU C, et al. Test-time backdoor attacks on multimodal large language models[J]. arXiv Preprint, arXiv: 2402.08577, 2024.
[2] LIU H, LIU Z, TANG R, et al. LoRA-as-an-attack! piercing LLM safety under the share-and-play scenario[J]. arXiv Preprint, arXiv: 2403.00108, 2024.
[3] WEI C, MENG W, ZHANG Z, et al. LMSanitator: defending prompt-tuning against task-agnostic backdoors[J]. arXiv Preprint, arXiv: 2308.13904, 2023.
[4] CHEN D, WANG H, HUO Y, et al. GameGPT: multi-agent collaborative framework for game development[J]. arXiv Preprint, arXiv: 2310.08067, 2023.
[5] DU Y, LI S, TORRALBA A, et al. Improving factuality and reasoning in language models through multiagent debate[J]. arXiv Preprint, arXiv: 2305.14325, 2023.
[6] ANDERSON M, AMIT G, GOLDSTEEN A. Is my data in your retrieval database? membership inference attacks against retrieval augmented generation[J]. arXiv Preprint, arXiv: 2405.20446, 2024.
[7] LI H, GUO D, FAN W, et al. Multi-step jailbreaking privacy attacks on ChatGPT[J]. arXiv Preprint, arXiv: 2304.05197, 2023.
[8] LI H, XU M, SONG Y. Sentence embedding leaks more information than you expect: generative embedding inversion attack to recover the whole sentence[J]. arXiv Preprint, arXiv: 2305.03010, 2023.
[9] BORGNIA E, GEIPING J, CHEREPANOVA V, et al. DP-InstaHide: provably defusing poisoning and backdoor attacks with differentially private data augmentations[J]. arXiv Preprint, arXiv: 2103.02079, 2021.
[10] BAUMGÄRTNER T, GAO Y, ALON D, et al. Best-of-venom: attacking RLHF by injecting poisoned preference data[J]. arXiv Preprint, arXiv: 2404.05530, 2024.
[11] HUANG T, HU S, LIU L. Vaccine: perturbation-aware alignment for large language model[J]. arXiv Preprint, arXiv: 2402.01109, 2024.
[12] HUANG T, HU S, ILHAN F, et al. Lazy safety alignment for large language models against harmful fine-tuning[J]. arXiv Preprint, arXiv: 2405.18641, 2024.
[13] HUANG T, BHATTACHARYA G, JOSHI P, et al. Antidote: post-fine-tuning safety alignment for large language models against harmful fine-tuning[J]. arXiv Preprint, arXiv: 2408.09600, 2024.
[14] LI S, SUN T, CHENG Q, et al. Agent alignment in evolving social norms[J]. arXiv Preprint, arXiv: 2401.04620, 2024.
[15] LIN B, BOUNEFFOUF D, CECCHI G, et al. Towards healthy AI: large language models need therapists too[J]. arXiv Preprint, arXiv: 2304.00416, 2023.
[16] GRESHAKE K, ABDELNABI S, MISHRA S, et al. Not what you’ve signed up for: compromising real-world LLM-integrated applications with indirect prompt injection[J]. arXiv Preprint, arXiv: 2302.12173, 2023.