Analyzing and Controlling Inter-Head Diversity in Multi-Head Attention

Multi-head attention, a powerful strategy in the Transformer, is assumed to utilize information from diverse representation subspaces. However, measuring the diversity between heads’ representations, or exploiting that diversity, has rarely been studied. In this paper, we quantitatively analyze the inter-head diversity of multi-head attention by applying recently developed similarity measures between two deep representations: Singular Vector Canonical Correlation Analysis (SVCCA) and Centered Kernel Alignment (CKA). In doing so, we empirically show that multi-head attention does diversify the representation subspaces of each head as the number of heads increases. Based on our analysis, we hypothesize that there exists an optimal inter-head diversity with which a model can achieve better performance. To examine this hypothesis, we closely examine three techniques for controlling inter-head diversity: (1) a Hilbert-Schmidt Independence Criterion (HSIC) regularizer among representation subspaces, (2) an orthogonality regularizer, and (3) Drophead, which randomly zeroes out each head at every training step. In experiments on various machine translation and language modeling tasks, we show that controlling inter-head diversity leads to better performance than the baselines.
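As a rough illustration of the measurement side of the abstract, the sketch below computes linear CKA between the output representations of two attention heads, assumed to be collected over the same set of token positions. This is not the authors' code: the function name, the tensor shapes, and the choice of the linear (rather than kernel) variant of CKA are illustrative assumptions.

import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices of shape (n, d1) and (n, d2)."""
    # Center each head's representations across the token (sample) dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")))

# Hypothetical usage: two 64-dimensional heads evaluated on 512 token positions.
# Values near 1 indicate highly similar subspaces; values near 0 indicate diverse heads.
rng = np.random.default_rng(0)
head_a, head_b = rng.normal(size=(512, 64)), rng.normal(size=(512, 64))
print(linear_cka(head_a, head_b))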

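The abstract describes the third control technique, Drophead, only as randomly zeroing out each head at every training step. The sketch below is one plausible reading of that idea rather than the paper's implementation; the tensor layout, the per-example dropping, and the 1/(1 - p) rescaling (borrowed from standard dropout) are assumptions.

import torch

def drophead(head_outputs: torch.Tensor, p: float = 0.1, training: bool = True) -> torch.Tensor:
    # head_outputs: (batch, n_heads, seq_len, head_dim).
    if not training or p == 0.0:
        return head_outputs
    batch, n_heads = head_outputs.shape[:2]
    # One Bernoulli keep/drop decision per head, drawn independently for each example.
    keep = (torch.rand(batch, n_heads, 1, 1, device=head_outputs.device) >= p)
    keep = keep.to(head_outputs.dtype)
    # Rescale the surviving heads so the expected total contribution is unchanged.
    return head_outputs * keep / (1.0 - p)

In this reading, Drophead plays the same role for whole heads that dropout plays for individual units, which is one way a model can be pushed toward more independent, and hence more diverse, head representations.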

Bibliographic Details
Main Authors: Hyeongu Yun, Taegwan Kang, Kyomin Jung (Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea)
Format: Article
Language: English
Published: MDPI AG, 2021-02-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app11041548
Subjects: multi-head attention; inter-head similarity; Transformer; machine translation; language modeling; Natural Language Processing
Online Access: https://www.mdpi.com/2076-3417/11/4/1548