Analyzing and Controlling Inter-Head Diversity in Multi-Head Attention
Multi-head attention, a powerful strategy for the Transformer, is assumed to utilize information from diverse representation subspaces. However, measuring the diversity between heads' representations, or exploiting that diversity, has rarely been studied. In this paper, we quantitatively analyze the inter-head diversity of multi-head attention by applying recently developed similarity measures between two deep representations: Singular Vector Canonical Correlation Analysis (SVCCA) and Centered Kernel Alignment (CKA). By doing so, we empirically show that multi-head attention does diversify the representation subspaces of the heads as the number of heads increases. Based on our analysis, we hypothesize that there exists an optimal inter-head diversity with which a model can achieve better performance. To examine this hypothesis, we closely inspect three techniques for controlling inter-head diversity: (1) a Hilbert-Schmidt Independence Criterion (HSIC) regularizer among representation subspaces, (2) an orthogonality regularizer, and (3) Drophead, which randomly zeroes out each head in every training step. In experiments on various machine translation and language modeling tasks, we show that controlling inter-head diversity leads to the best performance among baselines.
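A minimal sketch of the kind of similarity measurement described in the abstract: linear CKA between the outputs of two attention heads. This follows the standard linear-CKA formula rather than the authors' released code; the array shapes and the function name are illustrative assumptions.

```python
# Linear Centered Kernel Alignment (CKA) between two head representations.
# Sketch only; `head_a` and `head_b` are hypothetical arrays of shape
# (num_tokens, head_dim), not objects defined by the paper.
import numpy as np

def linear_cka(head_a: np.ndarray, head_b: np.ndarray) -> float:
    # Center each representation over the token (example) dimension.
    a = head_a - head_a.mean(axis=0, keepdims=True)
    b = head_b - head_b.mean(axis=0, keepdims=True)
    # CKA(A, B) = ||B^T A||_F^2 / (||A^T A||_F * ||B^T B||_F)
    cross = np.linalg.norm(b.T @ a, ord="fro") ** 2
    norm_a = np.linalg.norm(a.T @ a, ord="fro")
    norm_b = np.linalg.norm(b.T @ b, ord="fro")
    return float(cross / (norm_a * norm_b))

# Example: two independent random 8-dimensional heads evaluated on 512 tokens
# score near 0, while a head compared with itself scores exactly 1.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 8))
y = rng.normal(size=(512, 8))
print(linear_cka(x, y), linear_cka(x, x))
```

A low score indicates that two heads occupy dissimilar subspaces; averaging such pairwise scores over all head pairs is one way to summarize inter-head diversity.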
Main Authors: | Hyeongu Yun, Taegwan Kang, Kyomin Jung |
---|---|
Affiliation: | Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-02-01 |
Series: | Applied Sciences, Vol. 11, Issue 4, Article 1548 |
ISSN: | 2076-3417 |
DOI: | 10.3390/app11041548 |
Subjects: | multi-head attention; inter-head similarity; Transformer; machine translation; language modeling; Natural Language Processing |
Online Access: | https://www.mdpi.com/2076-3417/11/4/1548 |
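As a companion illustration for the third control technique named in the abstract, the sketch below treats Drophead as head-level dropout: each head's output is independently zeroed with probability p during training. The tensor layout, the 1/(1 - p) rescaling, and the function name are assumptions made for this sketch, not details taken from the paper.

```python
# Drophead as head-level dropout (illustrative sketch, not the paper's code).
# `head_outputs` is a hypothetical tensor of shape
# (batch, num_heads, seq_len, head_dim).
import torch

def drophead(head_outputs: torch.Tensor, p: float = 0.1, training: bool = True) -> torch.Tensor:
    if not training or p == 0.0:
        return head_outputs
    batch, num_heads = head_outputs.shape[:2]
    # Bernoulli keep-mask per (example, head); rescale so the expected
    # magnitude of the attention output is unchanged.
    keep = torch.rand(batch, num_heads, 1, 1, device=head_outputs.device) > p
    return head_outputs * keep.to(head_outputs.dtype) / (1.0 - p)

# Example usage on a dummy 8-head attention output.
x = torch.randn(2, 8, 16, 64)
print(drophead(x, p=0.25).shape)  # torch.Size([2, 8, 16, 64])
```

A larger p applies stronger regularization across heads and serves as one knob for influencing inter-head diversity during training.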