Fastformer: Additive Attention Can Be All You Need

Mahendra Singh Thapa
2 min read · Sep 18, 2021

Transformer Architecture: Self-Attention

Time complexity: O(n²·d)

This quadratic complexity has become a bottleneck for the Transformer when handling long sequences.
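To see where the quadratic cost comes from, here is a minimal sketch of single-head scaled dot-product self-attention (variable names and sizes are illustrative, not from the paper). The n × n score matrix is what makes both time and memory grow quadratically with sequence length.

```python
import torch

def self_attention(Q, K, V):
    # Q, K, V: (n, d) for one head; the (n, n) score matrix is the bottleneck.
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5              # (n, n) — O(n^2 * d) time, O(n^2) memory
    weights = torch.softmax(scores, dim=-1)  # (n, n)
    return weights @ V                       # (n, d) — another O(n^2 * d)

n, d = 512, 64                               # illustrative sizes
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = self_attention(Q, K, V)                # doubling n roughly quadruples the cost
```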

Fastformer

An efficient Transformer variant based on additive attention that achieves effective context modeling with linear complexity.

Efficiency

Training and inference speed of different methods. The y-axis (time) is on a logarithmic scale.

Architecture
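The architecture replaces pairwise query–key interactions with two additive attention pooling steps: the queries are summarized into a single global query, which modulates the keys element-wise; the result is pooled into a global key, which modulates the values. Below is a minimal single-head sketch under my own naming and shape conventions; multi-head splitting, masking, and dropout are omitted.

```python
import torch

def fastformer_attention(Q, K, V, w_q, w_k, W_out):
    # Q, K, V: (n, d); w_q, w_k: (d,) additive attention parameters; W_out: (d, d).
    d = Q.shape[-1]
    # Pool the queries into one global query with additive attention — O(n * d).
    alpha = torch.softmax(Q @ w_q / d ** 0.5, dim=0)   # (n,)
    q_global = (alpha.unsqueeze(-1) * Q).sum(dim=0)    # (d,)
    # Model the query–key interaction by element-wise product — O(n * d).
    P = q_global * K                                   # (n, d)
    # Pool the interaction vectors into one global key.
    beta = torch.softmax(P @ w_k / d ** 0.5, dim=0)    # (n,)
    k_global = (beta.unsqueeze(-1) * P).sum(dim=0)     # (d,)
    # Key–value interaction, output transform, residual with the queries.
    U = k_global * V                                   # (n, d)
    return U @ W_out + Q                               # (n, d)

n, d = 512, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = fastformer_attention(Q, K, V, torch.randn(d), torch.randn(d), torch.randn(d, d))
```

Every step touches only n·d entries, so the whole block is linear in the sequence length.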

Dataset

Experimental Results

Influence of Interaction Function

To combine the global query vector with the key vectors, and the global key vector with the value vectors:

  • Element-wise multiplication produces the best results, compared with the concatenation and addition operations (see the sketch below).
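A small sketch of the three interaction functions compared in this ablation, applied to the global query and the key matrix. The `W_cat` projection in the concatenation variant is my assumption, since concatenation doubles the feature dimension.

```python
import torch

def interact(q_global, K, mode="multiply", W_cat=None):
    # q_global: (d,) global query; K: (n, d) key vectors.
    if mode == "multiply":        # element-wise product (best in the ablation)
        return q_global * K
    if mode == "add":             # element-wise addition
        return q_global + K
    if mode == "concat":          # concatenation, projected back to d dims
        n = K.shape[0]
        paired = torch.cat([q_global.expand(n, -1), K], dim=-1)  # (n, 2d)
        return paired @ W_cat     # W_cat: (2d, d), assumed projection
    raise ValueError(f"unknown mode: {mode}")
```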

Influence of Parameter Sharing

Fastformer incorporates both query–value sharing and layer-wise sharing strategies to improve model performance while reducing the model's parameter count.
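A minimal sketch of how the two strategies cut parameters (module and variable names are mine): a single projection produces both the queries and the values, and the same block, with the same weights, is reused at every layer.

```python
import torch
import torch.nn as nn

class SharedFastformerLayer(nn.Module):
    # Query–value sharing: one projection matrix serves both queries and values.
    def __init__(self, d):
        super().__init__()
        self.qv_proj = nn.Linear(d, d)           # shared by queries and values
        self.k_proj = nn.Linear(d, d)
        self.w_q = nn.Parameter(torch.randn(d))  # additive attention parameters
        self.w_k = nn.Parameter(torch.randn(d))
        self.out = nn.Linear(d, d)

    def forward(self, x):                        # x: (n, d)
        d = x.shape[-1]
        Q = self.qv_proj(x)
        K = self.k_proj(x)
        V = self.qv_proj(x)                      # values reuse the query projection
        alpha = torch.softmax(Q @ self.w_q / d ** 0.5, dim=0)
        q_global = (alpha.unsqueeze(-1) * Q).sum(0)
        P = q_global * K
        beta = torch.softmax(P @ self.w_k / d ** 0.5, dim=0)
        k_global = (beta.unsqueeze(-1) * P).sum(0)
        return self.out(k_global * V) + Q

# Layer-wise sharing: the same layer (same weights) is applied at every depth.
layer = SharedFastformerLayer(d=64)
x = torch.randn(128, 64)
for _ in range(4):
    x = layer(x)
```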

Conclusion

  • Fastformer is a Transformer variant based on additive attention that can handle long sequences efficiently with linear complexity.
  • Extensive experiments on five benchmark datasets show that Fastformer is much more efficient than many existing Transformer models.

Resources

  • Wu, C., Wu, F., Qi, T., & Huang, Y. (2021). Fastformer: Additive Attention Can Be All You Need. arXiv:2108.09084.
