Fastformer: Additive Attention Can Be All You Need
Transformer Architecture: Self-Attention

Time complexity: O(n² · d), where n is the sequence length and d is the hidden dimension.
This quadratic complexity makes self-attention a bottleneck when the Transformer handles long sequences.
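As a rough illustration of where the quadratic cost comes from, the sketch below materializes the full n × n attention matrix; names and shapes are mine, not from any particular codebase.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (n, d) token embeddings; w_*: (d, d) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # each (n, d)
    scores = q @ k.T / (q.shape[-1] ** 0.5)    # (n, n): quadratic in sequence length
    attn = F.softmax(scores, dim=-1)           # (n, n) attention weights
    return attn @ v                            # (n, d)

n, d = 512, 64
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)         # shape: (512, 64)
```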
Fastformer
An efficient Transformer variant based on additive attention that achieves effective context modeling with linear complexity.
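A minimal single-head sketch of the idea, assuming the formulation described in the paper: the queries are summarized into a global query vector, combined element-wise with each key, the result is summarized into a global key vector, and that is combined element-wise with each value. Variable names are mine, and the 1/√d scaling and multi-head split are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Single-head additive attention; every step is linear in sequence length."""
    def __init__(self, d):
        super().__init__()
        self.w_q = nn.Linear(d, d)
        self.w_k = nn.Linear(d, d)
        self.w_v = nn.Linear(d, d)
        self.alpha = nn.Linear(d, 1)  # scores for pooling queries into a global query
        self.beta = nn.Linear(d, 1)   # scores for pooling interactions into a global key
        self.out = nn.Linear(d, d)

    def forward(self, x):                               # x: (batch, n, d)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        a = F.softmax(self.alpha(q), dim=1)             # (batch, n, 1)
        global_q = (a * q).sum(dim=1, keepdim=True)     # (batch, 1, d) global query vector
        p = global_q * k                                # element-wise query-key interaction
        b = F.softmax(self.beta(p), dim=1)              # (batch, n, 1)
        global_k = (b * p).sum(dim=1, keepdim=True)     # (batch, 1, d) global key vector
        u = global_k * v                                # element-wise key-value interaction
        return self.out(u) + q                          # residual connection with the query

x = torch.randn(2, 128, 64)
print(AdditiveAttention(64)(x).shape)                   # torch.Size([2, 128, 64])
```

No n × n matrix is ever formed, which is where the linear complexity comes from.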

Efficiency

Architecture



Dataset

Experimental Results


Influence of Interaction Function

To combine the global query vector with the key vectors and the global key vector with the value vectors:
- Element-wise multiplication produces the best results compared with the concatenation and addition operations (see the sketch below).
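A hedged sketch of the three interaction functions being compared, combining a global query vector with per-token key vectors; the shapes and the projection layer for concatenation are illustrative assumptions.

```python
import torch
import torch.nn as nn

d = 64
global_q = torch.randn(1, d)      # global query vector
k = torch.randn(128, d)           # per-token key vectors

p_mul = global_q * k              # element-wise multiplication (best-performing variant)
p_add = global_q + k              # addition
proj = nn.Linear(2 * d, d)        # concatenation needs a projection back to d dimensions
p_cat = proj(torch.cat([global_q.expand_as(k), k], dim=-1))
```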
Influence of Parameter Sharing

Fastformer incorporates both query-value parameter sharing and layer-wise sharing strategies to improve model performance while reducing the number of parameters.
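A rough sketch of the two strategies under my own naming (not the official implementation; the per-layer computation is a placeholder): one projection matrix serves both queries and values, and one layer module is reused across the whole stack so the parameter count does not grow with depth.

```python
import torch
import torch.nn as nn

class QueryValueSharedLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w_qv = nn.Linear(d, d)   # query-value sharing: one matrix for q and v
        self.w_k = nn.Linear(d, d)

    def forward(self, x):
        q = self.w_qv(x)
        v = self.w_qv(x)              # values reuse the query projection
        k = self.w_k(x)
        return q + k + v              # placeholder mixing; the real layer uses additive attention

class LayerwiseSharedStack(nn.Module):
    def __init__(self, d, n_layers):
        super().__init__()
        layer = QueryValueSharedLayer(d)
        # Layer-wise sharing: every slot refers to the same module instance.
        self.layers = nn.ModuleList([layer] * n_layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

stack = LayerwiseSharedStack(64, 4)
print(sum(p.numel() for p in stack.parameters()))   # equals a single layer's parameter count
print(stack(torch.randn(2, 16, 64)).shape)          # torch.Size([2, 16, 64])
```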
Conclusion
- Fastformer: a Transformer variant based on additive attention that handles long sequences efficiently with linear complexity.
- Extensive experiments on five benchmark datasets show that Fastformer is much more efficient than many existing Transformer models.