[1] LIN T, WANG Y, LIU X, et al. A survey of transformers[J]. arXiv Preprint, arXiv: 2106.04554, 2021.
[2] WANG S, LI J, SHI X, et al. TimeMixer++: a general time series pattern machine for universal predictive analysis[J]. arXiv Preprint, arXiv: 2410.16032, 2024.
[3] WU Z, LIU Z, LIN J, et al. Lite transformer with long-short range attention[J]. arXiv Preprint, arXiv: 2004.11886, 2020.
[4] DAI Z, LAI G, YANG Y, et al. Funnel-transformer: filtering out sequential redundancy for efficient language processing[J]. arXiv Preprint, arXiv: 2006.03236, 2020.
[5] MEHTA S, GHAZVININEJAD M, IYER S, et al. DeLighT: deep and light-weight transformer[J]. arXiv Preprint, arXiv: 2008.00623, 2021.
[6] HE R, RAVULA A, KANAGAL B, et al. RealFormer: transformer likes residual attention[J]. arXiv Preprint, arXiv: 2012.11747, 2021.
[7] DEHGHANI M, GOUWS S, VINYALS O, et al. Universal transformers[J]. arXiv Preprint, arXiv: 1807.03819, 2019.
[8] BAPNA A, ARIVAZHAGAN N, FIRAT O. Controlling computation versus quality for neural sequence models[J]. arXiv Preprint, arXiv: 2002.07106, 2020.
[9] XIN J, TANG R, LEE J, et al. DeeBERT: dynamic early exiting for accelerating BERT inference[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020: 2246-2251.
[10] DAI Z, YANG Z, YANG Y, et al. Transformer-XL: attentive language models beyond a fixed-length context[J]. arXiv Preprint, arXiv: 1901.02860, 2019.
[11] RAE J W, POTAPENKO A, JAYAKUMAR S M, et al. Compressive transformers for long-range sequence modelling[J]. arXiv Preprint, arXiv: 1911.05507, 2020.
[12] WU Q, LAN Z, GU J, et al. Memformer: the memory-augmented transformer[J]. arXiv Preprint, arXiv: 2010.06891, 2020.
[13] WU C, WU F, QI T, et al. Hi-transformer: hierarchical interactive transformer for efficient and effective long document modeling[J]. arXiv Preprint, arXiv: 2106.01040, 2021.
[14] ZHANG X, WEI F, ZHOU M. HIBERT: document level pre-training of hierarchical bidirectional transformers for document summarization[J]. arXiv Preprint, arXiv: 1905.06566, 2019.
[15] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[J]. arXiv Preprint, arXiv: 2010.11929, 2021.
[16] BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[J]. arXiv Preprint, arXiv: 2102.05095, 2021.
[17] ZHAO Y, DONG L, SHEN Y, et al. Memory-efficient differentiable transformer architecture search[J]. arXiv Preprint, arXiv: 2105.14669, 2021.
[18] LU Y, LI Z, HE D, et al. Understanding and improving transformer from a multi-particle dynamic system point of view[J]. arXiv Preprint, arXiv: 1906.02762, 2019.
[19] PRESS O, SMITH N A, LEVY O. Improving transformer models by reordering their sublayers[J]. arXiv Preprint, arXiv: 1911.03864, 2020.
[20] TAY Y, DEHGHANI M, BAHRI D, et al. Efficient transformers: a survey[J]. arXiv Preprint, arXiv: 2009.06732, 2020.
[21] WANG S, LI B, KHABSA M, et al. Linformer: self-attention with linear complexity[J]. arXiv Preprint, arXiv: 2006.04768, 2020.
[22] CHOROMANSKI K, LIKHOSHERSTOV V, DOHAN D, et al. Rethinking attention with performers[J]. arXiv Preprint, arXiv: 2009.14794, 2021.
[23] XIONG Y, ZENG Z, CHAKRABORTY R, et al. Nyströmformer: a Nyström-based algorithm for approximating self-attention[J]. arXiv Preprint, arXiv: 2102.03902, 2021.
[24] TAY Y, BAHRI D, METZLER D, et al. Synthesizer: rethinking self-attention in transformer models[J]. arXiv Preprint, arXiv: 2005.00743, 2021.
[25] BELTAGY I, PETERS M E, COHAN A. Longformer: the long-document transformer[J]. arXiv Preprint, arXiv: 2004.05150, 2020.
[26] KITAEV N, KAISER L, LEVSKAYA A. Reformer: the efficient transformer[J]. arXiv Preprint, arXiv: 2001.04451, 2020.
[27] ZAHEER M, GURUGANESH G, DUBEY K A, et al. Big bird: transformers for longer sequences[J]. arXiv Preprint, arXiv: 2007.14062, 2020.
[28] CHOROMANSKI K, LIKHOSHERSTOV V, DOHAN D, et al. Rethinking attention with performers[J]. arXiv Preprint, arXiv: 2009.14794, 2021.
[29] ROY A, SAFFAR M, VASWANI A, et al. Efficient content-based sparse attention with routing transformers[J]. arXiv Preprint, arXiv: 2003.05997, 2020.
[30] CORDONNIER J B, LOUKAS A, JAGGI M. Multi-head attention: collaborate instead of concatenate[J]. arXiv Preprint, arXiv: 2006.16362, 2020.
[31] SHAZEER N M, LAN Z, CHENG Y, et al. Talking-heads attention[J]. arXiv Preprint, arXiv: 2003.02436, 2020.
[32] SUBRAMANIAN S, COLLOBERT R, RANZATO M, et al. Multi-scale transformer language models[J]. arXiv Preprint, arXiv: 2005.00581, 2020.
[33] JIN P, ZHU B, YUAN L, et al. MoH: multi-head attention as mixture-of-head attention[J]. arXiv Preprint, arXiv: 2410.11842, 2024.
[34] SU J, LU Y, PAN S, et al. Roformer: enhanced transformer with rotary position embedding[J]. arXiv Preprint, arXiv: 2104.09864, 2021.
[35] WANG Z, MA Y, LIU Z, et al. R-transformer: recurrent neural network enhanced transformer[J]. arXiv Preprint, arXiv: 1907.05572, 2019.
[36] LOSHCHILOV I, HSIEH C P, SUN S, et al. nGPT: normalized transformer with representation learning on the hypersphere[J]. arXiv Preprint, arXiv: 2410.01131, 2024.
[37] XIONG Z, WANG Z, LIU Y, et al. A hybrid model of bi-directional LSTM and transformer for text classification[C]// Proceedings of the 2020 International Conference on Artificial Intelligence and Big Data, 2020.
[38] LI Z, YANG J, WANG J, et al. Integrating LSTM and BERT for long-sequence data analysis in intelligent tutoring systems[J]. arXiv Preprint, arXiv: 2405.05136, 2024.
[39] ZHANG T, LIU S, LI T, et al. Boundary information matters more: accurate temporal action detection with temporal boundary network[C]// 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019: 1642-1646.
[40] DONG L, XU S, XU B. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018: 5884-5888.
[41] CSORDÁS R, IRIE K, SCHMIDHUBER J, et al. MoEUT: mixture-of-experts universal transformers[J]. arXiv Preprint, arXiv: 2405.16039, 2024.
[42] PENG B, ALCAIDE E, ANTHONY Q, et al. RWKV: reinventing RNNs for the transformer era[J]. arXiv Preprint, arXiv: 2305.13048, 2023.
[43] GU A, GOEL K, RÉ C. Efficiently modeling long sequences with structured state spaces[J]. arXiv Preprint, arXiv: 2111.00396, 2021.
[44] GU A, DAO T. Mamba: linear-time sequence modeling with selective state spaces[J]. arXiv Preprint, arXiv: 2312.00752, 2023.
[45] BECK M, PÖPPEL K, SPANRING M, et al. xLSTM: extended long short-term memory[J]. arXiv Preprint, arXiv: 2405.04517, 2024.
[46] FENG L, TUNG F, AHMED M O, et al. Were RNNs all we needed?[J]. arXiv Preprint, arXiv: 2410.01201, 2024.
[47] POLI M, MASSAROLI S, NGUYEN E, et al. Hyena hierarchy: towards larger convolutional language models[J]. arXiv Preprint, arXiv: 2302.10866, 2023.
[48] SUN Y, DONG L, HUANG S, et al. Retentive network: a successor to transformer for large language models[J]. arXiv Preprint, arXiv: 2307.08621, 2023.
[49] TOLSTIKHIN I, HOULSBY N, KOLESNIKOV A, et al. MLP-mixer: an all-MLP architecture for vision[J]. arXiv Preprint, arXiv: 2105.01601, 2021.
[50] TROCKMAN A, KOLTER J Z. Patches are all you need?[J]. arXiv Preprint, arXiv: 2201.09792, 2022.
[51] LEE-THORP J, AINSLIE J, ECKSTEIN I, et al. FNet: mixing tokens with Fourier transforms[J]. arXiv Preprint, arXiv: 2105.03824, 2021.
[52] LIU Z, WANG Y, VAIDYA S, et al. KAN: Kolmogorov-Arnold networks[J]. arXiv Preprint, arXiv: 2404.19756, 2024.