Analysis of Core Architecture Evolution Trends in Large Language Models

doi:10.12267/j.issn.2096-5931.2024.12.003

Abstract

This paper systematically reviews the major directions of innovation built on the Transformer architecture. It analyzes the evolution of large language model algorithms along three dimensions: innovation within the Transformer architecture itself, fusion of the Transformer with other architectures, and non-Transformer algorithmic innovation. It closes with an outlook on the future development of large models.

Key words: large model architecture, Transformer, attention mechanism, architectural innovation

CLC number:

TP183

WANG Yuntao (王蕴韬). Analysis of large language model architecture evolution[J]. Information and Communications Technology and Policy (信息通信技术与政策), 2024, 50(12): 13-20.

Link to this article:

http://ictp.caict.ac.cn/CN/10.12267/j.issn.2096-5931.2024.12.003

http://ictp.caict.ac.cn/CN/Y2024/V50/I12/13

References (52)

[1]	LIN T, WANG Y, LIU X, et al. A survey of transformers[J]. arXiv Preprint, arXiv: 2106.04554, 2021.
[2]	WANG S, LI J, SHI X, et al. TimeMixer++: a general time series pattern machine for universal predictive analysis[J]. arXiv Preprint, arXiv: 2410.16032, 2024.
[3]	WU Z, LIU Z, LIN J, et al. Lite transformer with long-short range attention[J]. arXiv Preprint, arXiv: 2004.11886, 2020.
[4]	DAI Z, LAI G, YANG Y, et al. Funnel-transformer: filtering out sequential redundancy for efficient language processing[J]. arXiv Preprint, arXiv: 2006.03236, 2020.
[5]	MEHTA S, GHAZVININEJAD M, IYER S, et al. DeLighT: deep and light-weight transformer[J]. arXiv Preprint, arXiv: 2008.00623, 2021.
[6]	HE R, RAVULA A, KANAGAL B, et al. RealFormer: transformer likes residual attention[J]. arXiv Preprint, arXiv: 2012.11747, 2021.
[7]	DEHGHANI M, GOUWS S, VINYALS O, et al. Universal transformers[J]. arXiv Preprint, arXiv: 1807.03819, 2019.
[8]	BAPNA A, ARIVAZHAGAN N, FIRAT O. Controlling computation versus quality for neural sequence models[J]. arXiv Preprint, arXiv: 2002.07106, 2020.
[9]	XIN J, TANG R, LEE J, et al. DeeBERT: dynamic early exiting for accelerating BERT inference[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020: 2246-2251.
[10]	DAI Z, YANG Z, YANG Y, et al. Transformer-XL: attentive language models beyond a fixed-length context[J]. arXiv Preprint, arXiv: 1901.02860, 2019.
[11]	RAE J W, POTAPENKO A, JAYAKUMAR S M, et al. Compressive transformers for long-range sequence modelling[J]. arXiv Preprint, arXiv: 1911.05507, 2020.
[12]	WU Q, LAN Z, GU J, et al. Memformer: the memory-augmented transformer[J]. arXiv Preprint, arXiv: 2010.06891, 2020.
[13]	WU C, WU F, QI T, et al. Hi-transformer: hierarchical interactive transformer for efficient and effective long document modeling[J]. arXiv Preprint, arXiv: 2106.01040, 2021.
[14]	ZHANG X, WEI F, ZHOU M. Hibert: document level pre-training of hierarchical bidirectional transformers for document summarization[J]. arXiv Preprint, arXiv: 1905.06566, 2019.
[15]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[J]. arXiv Preprint, arXiv: 2010.11929, 2021.
[16]	BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[J]. arXiv Preprint, arXiv: 2102.05095, 2021.
[17]	ZHAO Y, DONG L, SHEN Y, et al. Memory-efficient differentiable transformer architecture search[J]. arXiv Preprint, arXiv: 2105.14669, 2021.
[18]	LU Y, LI Z, HE D, et al. Understanding and improving transformer from a multi-particle dynamic system point of view[J]. arXiv Preprint, arXiv: 1906.02762, 2019.
[19]	PRESS O, SMITH N A, LEVY O. Improving transformer models by reordering their sublayers[J]. arXiv Preprint, arXiv: 1911.03864, 2020.
[20]	TAY Y, DEHGHANI M, BAHRI D, et al. Efficient transformers: a survey[J]. arXiv Preprint, arXiv: 2009.06732, 2020.
[21]	WANG S, LI B, KHABSA M, et al. Linformer: self-attention with linear complexity[J]. arXiv Preprint, arXiv: 2006.04768, 2020.
[22]	CHOROMANSKI K, LIKHOSHERSTOV V, DOHAN D, et al. Rethinking attention with performers[J]. arXiv Preprint, arXiv: 2009.14794, 2021.
[23]	XIONG Y, ZENG Z, CHAKRABORTY R, et al. Nyströmformer: a Nyström-based algorithm for approximating self-attention[J]. arXiv Preprint, arXiv: 2102.03902, 2021.
[24]	TAY Y, BAHRI D, METZLER D, et al. Synthesizer: rethinking self-attention in transformer models[J]. arXiv Preprint, arXiv: 2005.00743, 2021.
[25]	BELTAGY I, PETERS M E, COHAN A. Longformer: the long-document transformer[J]. arXiv Preprint, arXiv: 2004.05150, 2020.
[26]	KITAEV N, KAISER L, LEVSKAYA A. Reformer: the efficient transformer[J]. arXiv Preprint, arXiv: 2001.04451, 2020.
[27]	ZAHEER M, GURUGANESH G, DUBEY K A, et al. Big bird: transformers for longer sequences[J]. arXiv Preprint, arXiv: 2007.14062, 2020.
[28]	CHOROMANSKI K, LIKHOSHERSTOV V, DOHAN D, et al. Rethinking attention with performers[J]. arXiv Preprint, arXiv: 2009.14794, 2021.
[29]	ROY A, SAFFAR M, VASWANI A, et al. Efficient content-based sparse attention with routing transformers[J]. arXiv Preprint, arXiv: 2003.05997, 2020.
[30]	CORDONNIER J B, LOUKAS A, JAGGI M. Multi-head attention: collaborate instead of concatenate[J]. arXiv Preprint, arXiv: 2006.16362, 2020.
[31]	SHAZEER N M, LAN Z, CHENG Y, et al. Talking-heads attention[J]. arXiv Preprint, arXiv: 2003.02436, 2020.
[32]	SUBRAMANIAN S, COLLOBERT R, RANZATO M, et al. Multi-scale transformer language models[J]. arXiv Preprint, arXiv: 2005.00581, 2020.
[33]	JIN P, ZHU B, YUAN L, et al. MoH: multi-head attention as mixture-of-head attention[J]. arXiv Preprint, arXiv: 2410.11842, 2024.
[34]	SU J, LU Y, PAN S, et al. Roformer: enhanced transformer with rotary position embedding[J]. arXiv Preprint, arXiv: 2104.09864, 2021.
[35]	WANG Z, MA Y, LIU Z, et al. R-transformer: recurrent neural network enhanced transformer[J]. arXiv Preprint, arXiv: 1907.05572, 2019.
[36]	LOSHCHILOV I, HSIEH C P, SUN S, et al. nGPT: normalized transformer with representation learning on the hypersphere[J]. arXiv Preprint, arXiv: 2410.01131, 2024.
[37]	XIONG Z, WANG Z, LIU Y, et al. A hybrid model of bi-directional LSTM and transformer for text classification[C]// Proceedings of the 2020 International Conference on Artificial Intelligence and Big Data, 2020.
[38]	LI Z, YANG J, WANG J, et al. Integrating LSTM and BERT for long-sequence data analysis in intelligent tutoring systems[J]. arXiv Preprint, arXiv: 2405.05136, 2024.
[39]	ZHANG T, LIU S, LI T, et al. Boundary information matters more: accurate temporal action detection with temporal boundary network[C]// 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019: 1642-1646.
[40]	DONG L, XU S, XU B. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018: 5884-5888.
[41]	CSORDÁS R, IRIE K, SCHMIDHUBER J, et al. MoEUT: mixture-of-experts universal transformers[J]. arXiv Preprint, arXiv: 2405.16039, 2024.
[42]	PENG B, ALCAIDE E, ANTHONY Q, et al. RWKV: reinventing RNNs for the transformer era[J]. arXiv Preprint, arXiv: 2305.13048, 2023.
[43]	GU A, GOEL K, RÉ C. Efficiently modeling long sequences with structured state spaces[J]. arXiv Preprint, arXiv: 2111.00396, 2021.
[44]	GU A, DAO T. Mamba: linear-time sequence modeling with selective state spaces[J]. arXiv Preprint, arXiv: 2312.00752, 2023.
[45]	BECK M, PÖPPEL K, SPANRING M, et al. xLSTM: extended long short-term memory[J]. arXiv Preprint, arXiv: 2405.04517, 2024.
[46]	FENG L, TUNG F, AHMED M O, et al. Were RNNs all we needed?[J]. arXiv Preprint, arXiv: 2410.01201, 2024.
[47]	POLI M, MASSAROLI S, NGUYEN E, et al. Hyena hierarchy: towards larger convolutional language models[J]. arXiv Preprint, arXiv: 2302.10866, 2023.
[48]	SUN Y, DONG L, HUANG S, et al. Retentive network: a successor to transformer for large language models[J]. arXiv Preprint, arXiv: 2307.08621, 2023.
[49]	TOLSTIKHIN I, HOULSBY N, KOLESNIKOV A, et al. MLP-mixer: an all-MLP architecture for vision[J]. arXiv Preprint, arXiv: 2105.01601, 2021.
[50]	TROCKMAN A, KOLTER J Z. Patches are all you need?[J]. arXiv Preprint, arXiv: 2201.09792, 2022.
[51]	LEE-THORP J, AINSLIE J, ECKSTEIN I, et al. FNet: mixing tokens with fourier transforms[J]. arXiv Preprint, arXiv: 2105.03824, 2021.
[52]	LIU Z, WANG Y, VAIDYA S, et al. KAN: Kolmogorov-Arnold networks[J]. arXiv Preprint, arXiv: 2404.19756, 2024.