Transformer Introduction

Warning
This article was last updated on 2023-07-25; its contents may be outdated.

References:
[1] The Transformer Family
[2] Attention
[3] 细节考究

| Symbol | Meaning |
| --- | --- |
| $d$ | The model size / hidden state dimension / positional encoding size. |
| $h$ | The number of heads in the multi-head attention layer. |
| $L$ | The segment length of the input sequence. |
| $X \in \mathbb{R}^{L \times d}$ | The input sequence, where each element has been mapped into an embedding vector of dimension $d$, the same as the model size. |
| $W^k \in \mathbb{R}^{d \times d_k}$ | The key weight matrix. |
| $W^q \in \mathbb{R}^{d \times d_k}$ | The query weight matrix. |
| $W^v \in \mathbb{R}^{d \times d_v}$ | The value weight matrix. Often we have $d_k = d_v = d$. |
| $W^q_i, W^k_i \in \mathbb{R}^{d \times d_k/h};\; W^v_i \in \mathbb{R}^{d \times d_v/h}$ | The weight matrices per head. |
| $W^o \in \mathbb{R}^{d_v \times d}$ | The output weight matrix. |
| $Q = XW^q \in \mathbb{R}^{L \times d_k}$ | The query embedding inputs. |
| $K = XW^k \in \mathbb{R}^{L \times d_k}$ | The key embedding inputs. |
| $V = XW^v \in \mathbb{R}^{L \times d_v}$ | The value embedding inputs. |
| $S_i$ | A collection of key positions for the $i$-th query to attend to. |
| $A \in \mathbb{R}^{L \times L}$ | The self-attention matrix between an input sequence of length $L$ and itself, $A = \text{softmax}(QK^\top / \sqrt{d_k})$. |
| $a_{ij} \in A$ | The scalar attention score between query $q_i$ and key $k_j$. |
| $P \in \mathbb{R}^{L \times d}$ | The positional encoding matrix, where the $i$-th row is the positional encoding for input $x_i$. |

Attention is a mechanism in neural networks by which a model learns to make predictions by selectively attending to a given set of data. The amount of attention is quantified by learned weights, and thus the output is usually formed as a weighted average.

Self-attention is a type of attention mechanism where the model makes a prediction for one part of a data sample using other parts of the observation of the same sample. Conceptually, it feels quite similar to non-local means. Also note that self-attention is permutation-invariant; in other words, it is an operation on sets.

There are various forms of attention / self-attention; the Transformer (Vaswani et al., 2017) relies on the scaled dot-product attention: given a query matrix $Q$, a key matrix $K$ and a value matrix $V$, the output is a weighted sum of the value vectors, where the weight assigned to each value slot is determined by the dot-product of the query with the corresponding key:

$$\text{Attention}(Q, K, V) = \text{softmax}\Big(\frac{QK^\top}{\sqrt{d_k}}\Big)V$$

And for a query and a key vector $q_i, k_j \in \mathbb{R}^d$ (row vectors in the query and key matrices), we have a scalar score:

$$a_{ij} = \text{softmax}\Big(\frac{q_i k_j^\top}{\sqrt{d_k}}\Big) = \frac{\exp(q_i k_j^\top / \sqrt{d_k})}{\sum_{r \in S_i} \exp(q_i k_r^\top / \sqrt{d_k})}$$

where $S_i$ is a collection of key positions for the $i$-th query to attend to.
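As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention; the function name and the random toy inputs are illustrative choices, not code from the cited papers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (L_q, L_k) attention logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax -> attention matrix A
    return weights @ V                              # weighted sum of the value vectors

# Toy example: L = 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```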

See [2] for other types of attention if interested.

The multi-head self-attention module is a key component in the Transformer. Rather than only computing the attention once, the multi-head mechanism splits the inputs into smaller chunks and then computes the scaled dot-product attention over each subspace in parallel. The independent attention outputs are simply concatenated and linearly transformed into the expected dimensions.

$$\text{MultiHeadAttention}(X_q, X_k, X_v) = [\text{head}_1; \dots; \text{head}_h] W^o, \quad \text{where head}_i = \text{Attention}(X_q W^q_i, X_k W^k_i, X_v W^v_i)$$

where $[\cdot\,;\cdot]$ is a concatenation operation. $W^q_i, W^k_i \in \mathbb{R}^{d \times d_k / h}$ and $W^v_i \in \mathbb{R}^{d \times d_v / h}$ are weight matrices that map input embeddings of size $L \times d$ into query, key and value matrices, and $W^o \in \mathbb{R}^{d_v \times d}$ is the output linear transformation. All the weights should be learned during training.

Fig. 1. Illustration of the multi-head scaled dot-product attention mechanism.
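The sketch below mirrors the multi-head equation above under the common simplification $d_k = d_v = d$. Instead of keeping $h$ separate matrices $W^q_i, W^k_i, W^v_i$, it slices one full $d \times d$ projection into $h$ column blocks, which is an equivalent way to express the per-head weights; all names and toy shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X_q, X_k, X_v, W_q, W_k, W_v, W_o, h):
    """[head_1; ...; head_h] W^o, where head i uses the i-th column block of each projection."""
    d = W_q.shape[1]
    d_head = d // h
    heads = []
    for i in range(h):
        cols = slice(i * d_head, (i + 1) * d_head)       # column block == per-head W_i
        Q_i, K_i, V_i = X_q @ W_q[:, cols], X_k @ W_k[:, cols], X_v @ W_v[:, cols]
        A_i = softmax(Q_i @ K_i.T / np.sqrt(d_head))     # (L, L) attention for head i
        heads.append(A_i @ V_i)                          # (L, d/h)
    return np.concatenate(heads, axis=-1) @ W_o          # concatenate, then output projection

# Toy shapes: L = 5, d = d_k = d_v = 16, h = 4 heads
rng = np.random.default_rng(0)
L, d, h = 5, 16, 4
X = rng.normal(size=(L, d))
W_q, W_k, W_v, W_o = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(X, X, X, W_q, W_k, W_v, W_o, h).shape)  # (5, 16)
```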

The Transformer (which will be referred to as the “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) model has an encoder-decoder architecture, as commonly used in many NMT models. Later, encoder-only (e.g. BERT) and decoder-only (e.g. GPT) Transformer variants were shown to achieve great performance in language modeling tasks.

Encoder-Decoder Architecture

The encoder generates an attention-based representation with the capability to locate a specific piece of information in a large context. It consists of a stack of 6 identical modules, each containing two submodules, a multi-head self-attention layer and a point-wise fully connected feed-forward network. Point-wise means that the same linear transformation (with the same weights) is applied to each element in the sequence; this can also be viewed as a convolutional layer with filter size 1. Each submodule has a residual connection and layer normalization. All the submodules output data of the same dimension $d$.
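To make “point-wise” concrete, here is an illustrative NumPy sketch of the feed-forward sublayer with its residual connection and a simplified (parameter-free) layer normalization; the inner dimension $4d$ follows the common choice in the vanilla Transformer, and the helper names are assumptions, not code from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Simplified layer normalization (without the learned scale and shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    """The same linear -> ReLU -> linear transform, applied to every position independently."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection around the point-wise FFN, followed by layer normalization."""
    return layer_norm(x + position_wise_ffn(x, W1, b1, W2, b2))

# Toy shapes: L = 6 positions, d = 8, inner dimension 4 * d
rng = np.random.default_rng(0)
L, d = 6, 8
x = rng.normal(size=(L, d))
W1, b1 = 0.1 * rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = 0.1 * rng.normal(size=(4 * d, d)), np.zeros(d)
print(ffn_sublayer(x, W1, b1, W2, b2).shape)  # (6, 8)
```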

The function of the Transformer decoder is to retrieve information from the encoded representation. Its architecture is quite similar to that of the encoder, except that each identical repeating module of the decoder contains two multi-head attention submodules instead of one. The first multi-head attention submodule is masked to prevent positions from attending to the future.

Fig. 2. The architecture of the vanilla Transformer model
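The masking in the decoder’s first attention submodule is usually implemented by adding $-\infty$ to the attention logits above the diagonal before the softmax, so that each position attends only to itself and earlier positions. The snippet below is an illustrative NumPy sketch of that idea, not code from the paper.

```python
import numpy as np

def causal_mask(L):
    """Additive mask: -inf above the diagonal, so position i cannot attend to j > i."""
    return np.triu(np.full((L, L), -np.inf), k=1)

def masked_attention_weights(Q, K):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize; exp(-inf) becomes 0
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = K = rng.normal(size=(4, 8))
print(np.round(masked_attention_weights(Q, K), 2))  # row i is zero after column i
```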

Positional Encoding

Because the self-attention operation is permutation invariant, it is important to use proper positional encoding to provide order information to the model. The positional encoding $P \in \mathbb{R}^{L \times d}$ has the same dimension as the input embedding, so it can be added to the input directly. The vanilla Transformer considered two types of encodings:

(1). Sinusoidal positional encoding is defined as follows, given the token position $i = 1, \dots, L$ and the dimension $\delta = 1, \dots, d$:

$$\text{PE}(i, \delta) = \begin{cases} \sin\big(\frac{i}{10000^{2\delta'/d}}\big) & \text{if } \delta = 2\delta' \\ \cos\big(\frac{i}{10000^{2\delta'/d}}\big) & \text{if } \delta = 2\delta' + 1 \end{cases}$$

In this way, each dimension of the positional encoding corresponds to a sinusoid, with wavelengths forming a geometric progression from $2\pi$ to $10000 \cdot 2\pi$ across dimensions.

Fig. 3. Sinusoidal positional encoding.
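A short NumPy sketch of this encoding is given below; it follows the $\text{PE}(i, \delta)$ definition above, except that positions and dimensions are indexed from 0 (and $d$ is assumed to be even), as is typical in implementations.

```python
import numpy as np

def sinusoidal_positional_encoding(L, d):
    """P[i, 2k] = sin(i / 10000^(2k/d)), P[i, 2k+1] = cos(i / 10000^(2k/d)); assumes d is even."""
    positions = np.arange(L)[:, None]                  # (L, 1), positions indexed from 0 here
    rates = np.power(10000.0, np.arange(0, d, 2) / d)  # 10000^(2k/d) for k = 0 .. d/2 - 1
    P = np.zeros((L, d))
    P[:, 0::2] = np.sin(positions / rates)             # even dimensions
    P[:, 1::2] = np.cos(positions / rates)             # odd dimensions
    return P

P = sinusoidal_positional_encoding(L=50, d=16)
print(P.shape)  # (50, 16) -- same shape as the input embeddings, so X + P is well-defined
```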

(2). Learned positional encoding, as its name suggests, assigns each element a learned column vector that encodes its absolute position (Gehring, et al. 2017).
