Research on the development of compute-storage collaboration driven by large model inference
ZHOU Lan, CHEN Lei
Informatization and Industrialization Integration Research Institute, China Academy of Information and Communications Technology, Beijing 100191, China
ZHOU Lan, CHEN Lei. Research on the development of compute-storage collaboration driven by large model inference[J]. Information and Communications Technology and Policy, 2025, 51(10): 2-6.