Fastformer: Additive Attention Can Be All You Need

Mahendra Singh Thapa
2 min read · Sep 18, 2021

Transformer Architecture: Self-Attention

Time complexity: O(n²·d)

This quadratic complexity has become a bottleneck for the Transformer when handling long sequences.
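To see where the quadratic cost comes from, here is a minimal sketch of single-head scaled dot-product self-attention (variable names and sizes are illustrative, not from the paper). The n × n score matrix is what makes both time and memory grow quadratically with sequence length.

```python
import torch

def self_attention(Q, K, V):
    # Q, K, V: (n, d) for one head; the (n, n) score matrix is the bottleneck.
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5              # (n, n) — O(n^2 * d) time, O(n^2) memory
    weights = torch.softmax(scores, dim=-1)  # (n, n)
    return weights @ V                       # (n, d) — another O(n^2 * d)

n, d = 512, 64                               # illustrative sizes
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = self_attention(Q, K, V)                # doubling n roughly quadruples the cost
```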

Fastformer

An efficient Transformer variant based on additive attention that achieves effective context modeling with linear complexity.

Efficiency

Training and inference speed of different methods. The y-axis (time) is on a logarithmic scale.

Architecture
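The architecture replaces pairwise query–key interactions with two additive attention pooling steps: the queries are summarized into a single global query, which modulates the keys element-wise; the result is pooled into a global key, which modulates the values. Below is a minimal single-head sketch under my own naming and shape conventions; multi-head splitting, masking, and dropout are omitted.

```python
import torch

def fastformer_attention(Q, K, V, w_q, w_k, W_out):
    # Q, K, V: (n, d); w_q, w_k: (d,) additive attention parameters; W_out: (d, d).
    d = Q.shape[-1]
    # Pool the queries into one global query with additive attention — O(n * d).
    alpha = torch.softmax(Q @ w_q / d ** 0.5, dim=0)   # (n,)
    q_global = (alpha.unsqueeze(-1) * Q).sum(dim=0)    # (d,)
    # Model the query–key interaction by element-wise product — O(n * d).
    P = q_global * K                                   # (n, d)
    # Pool the interaction vectors into one global key.
    beta = torch.softmax(P @ w_k / d ** 0.5, dim=0)    # (n,)
    k_global = (beta.unsqueeze(-1) * P).sum(dim=0)     # (d,)
    # Key–value interaction, output transform, residual with the queries.
    U = k_global * V                                   # (n, d)
    return U @ W_out + Q                               # (n, d)

n, d = 512, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = fastformer_attention(Q, K, V, torch.randn(d), torch.randn(d), torch.randn(d, d))
```

Every step touches only n·d entries, so the whole block is linear in the sequence length.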

Dataset

Experimental Results

Influence of Interaction Function

To combine the global query vector with the key vectors, and the global key vector with the value vectors:

  • Element-wise multiplication produces the best results, compared with the concatenation and addition operations (see the sketch below).
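A small sketch of the three interaction functions compared in this ablation, applied to the global query and the key matrix. The `W_cat` projection in the concatenation variant is my assumption, since concatenation doubles the feature dimension.

```python
import torch

def interact(q_global, K, mode="multiply", W_cat=None):
    # q_global: (d,) global query; K: (n, d) key vectors.
    if mode == "multiply":        # element-wise product (best in the ablation)
        return q_global * K
    if mode == "add":             # element-wise addition
        return q_global + K
    if mode == "concat":          # concatenation, projected back to d dims
        n = K.shape[0]
        paired = torch.cat([q_global.expand(n, -1), K], dim=-1)  # (n, 2d)
        return paired @ W_cat     # W_cat: (2d, d), assumed projection
    raise ValueError(f"unknown mode: {mode}")
```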

Influence of Parameter Sharing

Fastformer incorporates both query–value sharing and layer-wise sharing strategies to improve model performance while reducing the model's parameter count.
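A minimal sketch of how the two strategies cut parameters (module and variable names are mine): a single projection produces both the queries and the values, and the same block, with the same weights, is reused at every layer.

```python
import torch
import torch.nn as nn

class SharedFastformerLayer(nn.Module):
    # Query–value sharing: one projection matrix serves both queries and values.
    def __init__(self, d):
        super().__init__()
        self.qv_proj = nn.Linear(d, d)           # shared by queries and values
        self.k_proj = nn.Linear(d, d)
        self.w_q = nn.Parameter(torch.randn(d))  # additive attention parameters
        self.w_k = nn.Parameter(torch.randn(d))
        self.out = nn.Linear(d, d)

    def forward(self, x):                        # x: (n, d)
        d = x.shape[-1]
        Q = self.qv_proj(x)
        K = self.k_proj(x)
        V = self.qv_proj(x)                      # values reuse the query projection
        alpha = torch.softmax(Q @ self.w_q / d ** 0.5, dim=0)
        q_global = (alpha.unsqueeze(-1) * Q).sum(0)
        P = q_global * K
        beta = torch.softmax(P @ self.w_k / d ** 0.5, dim=0)
        k_global = (beta.unsqueeze(-1) * P).sum(0)
        return self.out(k_global * V) + Q

# Layer-wise sharing: the same layer (same weights) is applied at every depth.
layer = SharedFastformerLayer(d=64)
x = torch.randn(128, 64)
for _ in range(4):
    x = layer(x)
```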

Conclusion

  • Fastformer is a Transformer variant based on additive attention that can handle long sequences efficiently with linear complexity.
  • Extensive experiments on five benchmark datasets show that Fastformer is much more efficient than many existing Transformer models.

Resources

  • Wu, C., Wu, F., Qi, T., & Huang, Y. (2021). Fastformer: Additive Attention Can Be All You Need. arXiv:2108.09084.
