Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
- Submitting institution
-
University of Edinburgh
- Unit of assessment
- 11 - Computer Science and Informatics
- Output identifier
- 178562818
- Type
- E - Conference contribution
- DOI
-
10.18653/v1/P19-1580
- Title of conference / published proceedings
- Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (long papers)
- First page
- 5797
- Volume
- -
- Issue
- -
- ISSN
- -
- Open access status
- -
- Month of publication
- July
- Year of publication
- 2019
- URL
-
-
- Supplementary information
-
-
- Request cross-referral to
- -
- Output has been delayed by COVID-19
- No
- COVID-19 affected output statement
- -
- Forensic science
- No
- Criminology
- No
- Interdisciplinary
- No
- Number of additional authors
-
4
- Research group(s)
-
D - Language, Interaction and Robotics
- Citation count
- 18
- Proposed double-weighted
- No
- Reserve for an output with double weighting
- No
- Additional information
- The paper is the first to demonstrate redundancy and emerging specialization in an extremely popular and effective class of models, Transformers. Transformers are large neural networks consisting of many attention sub-components, called 'heads'. We study machine translation and show that only a small subset of heads is important, and that these heads are mostly linguistically interpretable. These findings have motivated a large body of follow-up work (e.g., from the University of Washington, MIT, and Google), including (1) methods that hand-craft 'heads', producing cheap and small models; (2) techniques for 'pruning' or adaptively changing model size; (3) follow-on studies on other tasks and models.
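The pruning idea summarized above can be sketched as a per-head gate. The following is a minimal illustrative sketch, not the paper's implementation: the function names and toy one-dimensional 'heads' are assumptions for illustration, and the paper itself learns continuous stochastic gates under an L0-style penalty rather than applying fixed binary masks.

```python
# Toy sketch of head pruning: each head's output is multiplied by a gate,
# so a head whose gate is 0 contributes nothing and can be removed.

def head_output(x, w):
    # hypothetical toy "head": a dot-product projection of the input vector x
    return sum(wi * xi for wi, xi in zip(w, x))

def gated_multi_head(x, head_weights, gates):
    # one scalar output per head; a gate of 0 prunes that head entirely
    return [g * head_output(x, w) for g, w in zip(gates, head_weights)]

x = [1.0, 2.0]
heads = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
print(gated_multi_head(x, heads, [1, 0, 1]))  # prints [-0.5, 0.0, 2.0]
```

In a real Transformer the gate would scale each head's output vector before the heads are concatenated and projected, but the gating principle is the same.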
- Author contribution statement
- -
- Non-English
- No
- English abstract
- -