What is DistilBERT?
DistilBERT is a model introduced in the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter".
DistilBERT is a Transformer model based on BERT. It has 40% fewer parameters, runs 60% faster, and retains 97% of BERT's performance as measured on the GLUE benchmark.
DistilBERT is trained using Knowledge Distillation, a technique that compresses a large model (the Teacher) into a smaller model (the Student). Distilling BERT yields a Transformer model that is very similar to the original BERT model but lighter and faster to run.
Background of DistilBERT
In recent NLP, it has become mainstream practice to build highly accurate language models by pre-training models with huge numbers of parameters on unlabeled data. The BERT and GPT families, for example, keep growing larger, and even running a single sample through them is so computationally intensive that they are out of reach for the average user.
Therefore, the goal of this paper is to create a more lightweight model while maintaining accuracy.
Knowledge Distillation
Knowledge Distillation is a technique introduced in 2015 in the paper Distilling the Knowledge in a Neural Network. The idea is to compress and transfer the knowledge of a large, computationally expensive model (the Teacher model) into a smaller model (the Student model) while preserving as much of its accuracy as possible.
The Student is trained to reproduce the behavior of the Teacher. However, because the Student does not have as much capacity (as many parameters) as the Teacher, it tries to approximate the Teacher's behavior as closely as possible with its limited capacity.
- Teacher model: an ensemble of very large models, or an individually trained model with strong regularization such as dropout.
- Student model: a small model that relies on the distilled knowledge of the Teacher model.
(Figure from Knowledge Distillation: A Survey)
In DistilBERT, Teacher refers to BERT and Student refers to DistilBERT.
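To make the Teacher–Student setup concrete, here is a minimal sketch in PyTorch (the toy `teacher` and `student` classifiers below are placeholders for illustration, not the actual DistilBERT training code): the frozen Teacher provides soft targets, and the Student is optimized to match them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a big Teacher and a small Student classifier over 10 classes.
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distillation_step(batch):
    """One Knowledge Distillation step: the Student is trained to reproduce
    the output distribution of the (frozen) Teacher."""
    teacher.eval()
    with torch.no_grad():                                   # the Teacher is not updated
        teacher_probs = F.softmax(teacher(batch), dim=-1)   # soft targets

    student_log_probs = F.log_softmax(student(batch), dim=-1)

    # Cross-entropy against the soft targets: the Student tries to match the
    # whole Teacher distribution, not just the single most likely label.
    loss = -(teacher_probs * student_log_probs).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = distillation_step(torch.randn(8, 32))  # one step on a random batch
```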
Structure of DistilBERT
Student architecture
DistilBERT as Student follows the same general architecture as BERT. DistilBERT takes the following measures to reduce parameters:
- Removal of the token-type embeddings (segment embeddings)
- Removal of the Pooler layer (the layer for classification after the Transformer Encoder)
- Halving the number of Transformer layers (BERT_BASE: 12, DistilBERT: 6)
These measures result in roughly a 40% reduction in the number of parameters.
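As a rough sanity check of that 40% figure, the sketch below compares randomly initialized BERT-base and DistilBERT models using their default configurations from the Hugging Face `transformers` library (assuming `transformers` and PyTorch are installed; exact counts depend on vocabulary size and other configuration details).

```python
from transformers import BertConfig, BertModel, DistilBertConfig, DistilBertModel

def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

# Randomly initialized models with default BERT-base-like settings:
# 12 vs. 6 Transformer layers, same hidden size (768) and 12 attention heads.
bert = BertModel(BertConfig())                      # 12 layers, with pooler
distilbert = DistilBertModel(DistilBertConfig())    # 6 layers, no pooler / token-type embeddings

n_bert = count_parameters(bert)
n_distil = count_parameters(distilbert)
print(f"BERT:       {n_bert:,} parameters")
print(f"DistilBERT: {n_distil:,} parameters")
print(f"Reduction:  {100 * (1 - n_distil / n_bert):.1f}%")
```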
Initial values for Student
The parameters of the Teacher, BERT, are used as the initial values for the Student. However, since the Student has half as many layers as the Teacher, one out of every two Teacher layers is used to initialize the corresponding Student layer, as sketched below.
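The sketch below illustrates this initialization scheme with generic PyTorch encoder layers standing in for BERT layers (an illustration, not the authors' actual code; which of the two candidate Teacher layers is kept is an implementation detail, here layer 2*i is used).

```python
import torch.nn as nn

def make_layer():
    # Stand-in for a BERT Transformer encoder layer (hidden size 768, 12 heads).
    return nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                      dim_feedforward=3072, batch_first=True)

teacher_layers = nn.ModuleList([make_layer() for _ in range(12)])  # BERT_BASE: 12 layers
student_layers = nn.ModuleList([make_layer() for _ in range(6)])   # DistilBERT: 6 layers

# Initialize the Student by taking one layer out of every two from the Teacher:
# Student layer i starts from Teacher layer 2*i.
for i, student_layer in enumerate(student_layers):
    student_layer.load_state_dict(teacher_layers[2 * i].state_dict())
```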
Distillation
DistilBERT uses the following training setup:
- A very large batch size (up to 4,000 examples per batch)
- Dynamic masking, i.e. new masked positions are sampled on the fly (see the sketch after this list)
- No Next Sentence Prediction objective
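As an illustration of dynamic masking, here is a simplified sketch (not the actual preprocessing code; the real MLM procedure also replaces some selected tokens with random tokens or keeps them unchanged). Because masking happens every time a batch is built, each epoch sees different masked positions.

```python
import torch

def dynamic_mask(token_ids, mask_token_id, mask_prob=0.15, special_ids=(101, 102, 0)):
    """Randomly select ~15% of non-special tokens to mask. Called on every pass,
    so each epoch sees different masked positions (dynamic masking)."""
    token_ids = token_ids.clone()
    labels = torch.full_like(token_ids, -100)               # -100 = ignored by the MLM loss

    candidates = ~torch.isin(token_ids, torch.tensor(special_ids))
    mask = (torch.rand(token_ids.shape) < mask_prob) & candidates

    labels[mask] = token_ids[mask]                           # targets: the original tokens
    token_ids[mask] = mask_token_id                          # inputs: replaced with [MASK]
    return token_ids, labels

# Example with typical BERT special token ids ([CLS]=101, [SEP]=102, [PAD]=0, [MASK]=103)
# and arbitrary word-piece ids in between.
ids = torch.tensor([[101, 2023, 2003, 1037, 3231, 102]])
masked_ids, labels = dynamic_mask(ids, mask_token_id=103)
```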
Training Loss Function
The loss function of DistilBERT is a linear combination of the distillation loss ($L_{ce}$), the masked language modeling loss ($L_{mlm}$), and the cosine embedding loss ($L_{cos}$):

$$L = \alpha L_{ce} + \beta L_{mlm} + \gamma L_{cos}$$

where $\alpha$, $\beta$, and $\gamma$ are the weights of the combination. Each term is described below.
Distillation Loss
Distillation Loss ($L_{ce}$) is defined as

$$L_{ce} = -\sum_i t_i \log(s_i)$$

where $t_i$ (resp. $s_i$) is the probability estimated by the Teacher (resp. the Student) for class $i$.
This loss lets the Student mimic not only the word the Teacher predicts with the highest probability, but also the words with the next highest probabilities, so the Student learns the full distribution predicted by the Teacher.
For example, consider the following sentence.
I watched [MASK] yesterday.
Suppose that the predicted probabilities of the Teacher, BERT, are as follows:
| Predicted word | Predicted probability |
| --- | --- |
| movie | 0.8 |
| TV | 0.1 |
| baseball | 0.05 |
| MMA | 0.05 |
In this case, the Student learns not only "movie" but also "TV", "baseball", and "MMA", so that its prediction probabilities move closer to those of the Teacher, BERT.
However, the prediction probabilities here are computed with a softmax-temperature:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $T$ controls the smoothness of the output distribution and $z_i$ is the model score (logit) for class $i$. If $T$ is increased, the distribution becomes smoother and flatter; $T = 1$ recovers the standard softmax. By using the softmax-temperature, the Student also learns to predict words with low probability. The same temperature is applied to the Teacher and the Student during training, and $T$ is set back to 1 at inference.
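The snippet below applies the softmax-temperature to the Teacher distribution from the table above and sketches the resulting distillation loss (the probabilities are the example values from the table; T = 2 and T = 5 are arbitrary illustrative choices, not values from the paper).

```python
import torch
import torch.nn.functional as F

# Teacher probabilities for "I watched [MASK] yesterday." from the table above.
words = ["movie", "TV", "baseball", "MMA"]
teacher_probs = torch.tensor([0.80, 0.10, 0.05, 0.05])
teacher_logits = teacher_probs.log()       # logits that reproduce these probabilities

for T in (1.0, 2.0, 5.0):
    softened = F.softmax(teacher_logits / T, dim=-1)
    print(f"T={T}: " + ", ".join(f"{w}={p:.2f}" for w, p in zip(words, softened.tolist())))
# Larger T flattens the distribution, so low-probability words such as
# "baseball" and "MMA" carry more weight in the distillation loss.

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """L_ce: cross-entropy between the temperature-softened Teacher and
    Student distributions (the same T is applied to both)."""
    t = F.softmax(teacher_logits / T, dim=-1)
    log_s = F.log_softmax(student_logits / T, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()
```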
Masked Language Model Loss
Masked Language Modeling Loss ($L_{mlm}$) is BERT's original masked language modeling objective: some input tokens are replaced with [MASK], and the loss is the cross-entropy between the model's predictions at the masked positions and the original tokens.
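A minimal sketch of this loss, assuming the labels tensor holds the original token id at masked positions and -100 everywhere else (as produced by the dynamic-masking sketch above):

```python
import torch.nn.functional as F

def mlm_loss(prediction_logits, labels):
    """L_mlm: cross-entropy over the masked positions only.
    prediction_logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    with the original token id at masked positions and -100 elsewhere."""
    return F.cross_entropy(
        prediction_logits.view(-1, prediction_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,     # non-masked positions do not contribute
    )
```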
Cosine Embedding Loss
Cosine Embedding Loss ($L_{cos}$) aligns the directions of the Teacher's and the Student's hidden state vectors; it is the cosine distance (one minus the cosine similarity) between the two sets of hidden states.
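A sketch of this loss on Teacher and Student hidden states of the same shape (PyTorch's built-in `nn.CosineEmbeddingLoss` with a target of +1 computes the same one-minus-cosine quantity):

```python
import torch
import torch.nn.functional as F

def cosine_loss(student_hidden, teacher_hidden):
    """L_cos: aligns the directions of Student and Teacher hidden state
    vectors (1 - cosine similarity, averaged over all positions)."""
    cos = F.cosine_similarity(student_hidden, teacher_hidden, dim=-1)
    return (1.0 - cos).mean()

# Example: hidden states of shape (batch, seq_len, hidden_size).
student_h = torch.randn(2, 8, 768)
teacher_h = torch.randn(2, 8, 768)
print(cosine_loss(student_h, teacher_h))
```

The overall training loss is then the linear combination of $L_{ce}$, $L_{mlm}$, and $L_{cos}$ described above.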
References