Transformer Introduction
References:
[1]. The Transformer Family
[2]. Attention
[3]. 细节考究 (A Close Look at the Details)
Transformer Family
Notations
| Symbol | Meaning |
| --- | --- |
| $d$ | The model size / hidden state dimension / positional encoding size. |
| $h$ | The number of heads in the multi-head attention layer. |
| $L$ | The segment length of the input sequence. |
| $\mathbf{X} \in \mathbb{R}^{L \times d}$ | The input sequence, where each element has been mapped into an embedding vector of dimension $d$, same as the model size. |
| $\mathbf{W}^k \in \mathbb{R}^{d \times d_k}$ | The key weight matrix. |
| $\mathbf{W}^q \in \mathbb{R}^{d \times d_k}$ | The query weight matrix. |
| $\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$ | The value weight matrix. Often we have $d_k = d_v = d$. |
| $\mathbf{W}^q_i, \mathbf{W}^k_i \in \mathbb{R}^{d \times d_k/h}$; $\mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$ | The weight matrices per head. |
| $\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$ | The output weight matrix. |
| $\mathbf{Q} = \mathbf{X}\mathbf{W}^q \in \mathbb{R}^{L \times d_k}$ | The query embedding inputs. |
| $\mathbf{K} = \mathbf{X}\mathbf{W}^k \in \mathbb{R}^{L \times d_k}$ | The key embedding inputs. |
| $\mathbf{V} = \mathbf{X}\mathbf{W}^v \in \mathbb{R}^{L \times d_v}$ | The value embedding inputs. |
| $S_i$ | A collection of key positions for the $i$-th query $\mathbf{q}_i$ to attend to. |
| $\mathbf{A} \in \mathbb{R}^{L \times L}$ | The self-attention matrix between an input sequence of length $L$ and itself. |
| $a_{ij} \in \mathbf{A}$ | The scalar attention score between query $\mathbf{q}_i$ and key $\mathbf{k}_j$. |
| $\mathbf{P} \in \mathbb{R}^{L \times d}$ | The positional encoding matrix, where the $i$-th row $\mathbf{p}_i$ is the positional encoding for input $\mathbf{x}_i$. |
Attention and Self-Attention
Attention is a mechanism in neural networks by which a model learns to make predictions by selectively attending to a given set of data. The amount of attention is quantified by learned weights, and thus the output is usually formed as a weighted average.
Self-attention is a type of attention mechanism where the model makes predictions for one part of a data sample using other parts of the observation about the same sample. Conceptually, it feels quite similar to non-local means. Also note that self-attention is permutation-invariant; in other words, it is an operation on sets.
There are various forms of attention / self-attention; Transformer (Vaswani et al., 2017) relies on the scaled dot-product attention: given a query matrix $\mathbf{Q}$, a key matrix $\mathbf{K}$ and a value matrix $\mathbf{V}$, the output is a weighted sum of the value vectors, where the weight assigned to each value slot is determined by the dot-product of the query with the corresponding key:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\Big)\mathbf{V}$$

And for a query vector $\mathbf{q}_i$ and a key vector $\mathbf{k}_j$ (row vectors in the query and key matrices), we have a scalar score:

$$a_{ij} = \text{softmax}\Big(\frac{\mathbf{q}_i \mathbf{k}_j^\top}{\sqrt{d_k}}\Big) = \frac{\exp(\mathbf{q}_i \mathbf{k}_j^\top / \sqrt{d_k})}{\sum_{r \in S_i} \exp(\mathbf{q}_i \mathbf{k}_r^\top / \sqrt{d_k})}$$

where $S_i$ is a collection of key positions for the $i$-th query to attend to.
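As a concrete illustration, here is a minimal PyTorch sketch of scaled dot-product attention. The function name and the optional `mask` argument are my own additions; the mask plays the role of restricting attention to the positions in $S_i$.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (..., L, d_k); V: (..., L, d_v). Leading batch/head
    dimensions are allowed and broadcast through.
    mask: optional boolean tensor, True where attention is allowed.
    """
    d_k = Q.size(-1)
    # Pairwise query-key dot products, scaled by sqrt(d_k).
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., L, L)
    if mask is not None:
        # Disallowed positions (outside S_i) get -inf before the softmax.
        scores = scores.masked_fill(~mask, float("-inf"))
    A = F.softmax(scores, dim=-1)  # the attention matrix A
    return A @ V                   # weighted sum of value vectors
```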
See [2] for other types of attention if interested.
Multi-Head Self-Attention
The multi-head self-attention module is a key component in the Transformer. Rather than computing the attention only once, the multi-head mechanism splits the inputs into smaller chunks and then computes the scaled dot-product attention over each subspace in parallel. The independent attention outputs are simply concatenated and linearly transformed into the expected dimension.

$$\begin{aligned} \text{MultiHeadAttn}(\mathbf{X}_q, \mathbf{X}_k, \mathbf{X}_v) &= [\text{head}_1; \dots; \text{head}_h] \mathbf{W}^o \\ \text{where head}_i &= \text{Attention}(\mathbf{X}_q\mathbf{W}^q_i, \mathbf{X}_k\mathbf{W}^k_i, \mathbf{X}_v\mathbf{W}^v_i) \end{aligned}$$

where $[\cdot; \cdot]$ is a concatenation operation. $\mathbf{W}^q_i, \mathbf{W}^k_i \in \mathbb{R}^{d \times d_k/h}$ and $\mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$ are weight matrices that map input embeddings of size $L \times d$ into query, key and value matrices. $\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$ is the output linear transformation. All the weights should be learned during training.
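A sketch of how this might look in PyTorch, assuming $d_k = d_v = d$; the class name and the fused per-head projections are my own choices, and it reuses the `scaled_dot_product_attention` function from above.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head attention sketch, assuming d_k = d_v = d."""

    def __init__(self, d, h):
        super().__init__()
        assert d % h == 0, "model size must be divisible by the number of heads"
        self.d, self.h = d, h
        # W^q, W^k, W^v for all h heads, fused into single d x d projections.
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.w_o = nn.Linear(d, d, bias=False)  # output projection W^o

    def forward(self, x_q, x_k, x_v, mask=None):
        B, L, _ = x_q.shape

        def split(t):
            # Split the last dimension into h heads of size d/h.
            return t.view(B, -1, self.h, self.d // self.h).transpose(1, 2)

        Q, K, V = split(self.w_q(x_q)), split(self.w_k(x_k)), split(self.w_v(x_v))
        out = scaled_dot_product_attention(Q, K, V, mask)  # attention per head
        out = out.transpose(1, 2).reshape(B, L, self.d)    # concatenate heads
        return self.w_o(out)                               # final linear map
```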
Transformer
The Transformer (which will be referred to as “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) model has an encoder-decoder architecture, as commonly used in many NMT models. Later the decoder-only Transformer was shown to achieve great performance in language modeling tasks, as in GPT, while the encoder-only variant underlies models like BERT.
Encoder-Decoder Architecture
The encoder generates an attention-based representation with the capability to locate a specific piece of information in a large context. It consists of a stack of 6 identical modules, each containing two submodules: a multi-head self-attention layer and a point-wise fully connected feed-forward network. Point-wise means that the same linear transformation (with the same weights) is applied to each element in the sequence; this can also be viewed as a convolutional layer with filter size 1. Each submodule has a residual connection and layer normalization. All the submodules output data of the same dimension $d$.
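To make the structure concrete, here is a minimal sketch of one encoder module, reusing the `MultiHeadAttention` class above. The post-norm layout follows the vanilla Transformer; `d_ff=2048` and the dropout rate follow the paper's defaults, while the class and attribute names are my own.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder module: multi-head self-attention + point-wise FFN,
    each wrapped in a residual connection followed by layer normalization."""

    def __init__(self, d, h, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d, h)
        # Point-wise FFN: the same two linear maps applied at every position,
        # equivalent to two convolutions with filter size 1.
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))  # sublayer 1
        x = self.norm2(x + self.drop(self.ffn(x)))               # sublayer 2
        return x  # output keeps the same dimension d
```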
The function of the Transformer decoder is to retrieve information from the encoded representation. Its architecture is quite similar to the encoder's, except that each identical repeating module contains two multi-head attention submodules instead of one. The first multi-head attention submodule is masked to prevent positions from attending to the future.
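This future-masking can be implemented with a simple lower-triangular boolean mask (a sketch; it plugs into the `mask` argument of the attention code above):

```python
import torch

def causal_mask(L):
    """True where attention is allowed: position i may attend to j <= i."""
    return torch.tril(torch.ones(L, L, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```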
Positional Encoding
Because the self-attention operation is permutation-invariant, it is important to use proper positional encoding to provide order information to the model. The positional encoding has the same dimension as the input embedding, so it can be added to the input directly. The vanilla Transformer considered two types of encodings:
(1). Sinusoidal positional encoding is defined as follows, given the token position $i = 1, \dots, L$ and the dimension $\delta = 1, \dots, d$:
$$\text{PE}(i, \delta) = \begin{cases} \sin\big(\frac{i}{10000^{2\delta'/d}}\big) & \text{if } \delta = 2\delta' \\ \cos\big(\frac{i}{10000^{2\delta'/d}}\big) & \text{if } \delta = 2\delta' + 1 \end{cases}$$
In this way each dimension of the positional encoding corresponds to a sinusoid of a different wavelength in a different dimension, ranging from $2\pi$ to $10000 \cdot 2\pi$.
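A small sketch of building the sinusoidal encoding matrix $\mathbf{P}$; the function name is mine, and it assumes an even model size $d$:

```python
import torch

def sinusoidal_positional_encoding(L, d):
    """Build P of shape (L, d): sin on even dimensions, cos on odd ones."""
    position = torch.arange(L, dtype=torch.float).unsqueeze(1)         # (L, 1)
    # 10000^(2*delta'/d) for delta' = 0 .. d/2 - 1
    div_term = torch.pow(10000.0, torch.arange(0, d, 2).float() / d)
    pe = torch.zeros(L, d)
    pe[:, 0::2] = torch.sin(position / div_term)  # delta = 2 * delta'
    pe[:, 1::2] = torch.cos(position / div_term)  # delta = 2 * delta' + 1
    return pe

pe = sinusoidal_positional_encoding(L=128, d=64)
print(pe.shape)  # torch.Size([128, 64])
```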
(2). Learned positional encoding, as its name suggests, assigns each element a learned column vector which encodes its absolute position (Gehring, et al. 2017).
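In code this is usually just an embedding table indexed by position (a sketch; the sizes are illustrative):

```python
import torch
import torch.nn as nn

L, d = 512, 64                      # illustrative maximum length and model size
pos_embedding = nn.Embedding(L, d)  # the L x d matrix P, learned with the model

x = torch.randn(1, L, d)                  # a batch of input embeddings
positions = torch.arange(L).unsqueeze(0)  # (1, L) absolute position indices
x = x + pos_embedding(positions)          # add positional information
```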

