2023-01-27

DistilBERT

What is DistilBERT?

DistilBERT is a model introduced in the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter".

DistilBERT is a Transformer model based on BERT. It has 40% fewer parameters, runs 60% faster, and retains 97% of BERT's performance as measured by the GLUE benchmark.
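As a quick sanity check of these numbers, the parameter counts of the two models can be compared with the Hugging Face transformers library. A minimal sketch, assuming the standard "bert-base-uncased" and "distilbert-base-uncased" checkpoints:

    from transformers import AutoModel

    # Load the pre-trained Teacher (BERT) and Student (DistilBERT)
    bert = AutoModel.from_pretrained("bert-base-uncased")
    distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

    def count_parameters(model):
        return sum(p.numel() for p in model.parameters())

    n_bert = count_parameters(bert)          # roughly 110M parameters
    n_distil = count_parameters(distilbert)  # roughly 66M parameters
    print(f"BERT:       {n_bert:,}")
    print(f"DistilBERT: {n_distil:,}")
    print(f"Reduction:  {1 - n_distil / n_bert:.0%}")  # about 40%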

DistilBERT is trained using Knowledge Distillation, a technique that compresses a large model (the Teacher) into a smaller model (the Student). Distilling BERT yields a Transformer model that is very similar to the original BERT model, but lighter and faster to run.

Background of DistilBERT

In recent NLP, it has become mainstream practice to build highly accurate language models by pre-training models with huge numbers of parameters on unlabeled data. The BERT and GPT families, for example, keep growing larger, and even processing a single sample has become too computationally expensive for the average user.

(Figure: parameter counts of recent pre-trained language models)

The goal of the DistilBERT paper is therefore to create a much lighter model while maintaining accuracy.

Knowledge Distillation

Knowledge Distillation is a technique introduced in 2015 in the paper "Distilling the Knowledge in a Neural Network". The idea is to compress and transfer the knowledge of a large, computationally expensive model (the Teacher model) into a smaller model (the Student model) while preserving performance.

The Student is trained to reproduce the behavior of the Teacher. However, the Student model does not have as much capacity (as many parameters) as the Teacher model, so it approximates the Teacher as closely as possible within its smaller capacity.

  • Teacher Model
    A very large model, such as an ensemble of models or a single model trained with strong regularization such as dropout.

  • Student Model
    A small model that relies on the distilled knowledge of the Teacher model.

(Figure: overview of knowledge distillation, from "Knowledge Distillation: A Survey")

In DistilBERT, Teacher refers to BERT and Student refers to DistilBERT.

Structure of DistilBERT

Student architecture

DistilBERT, as the Student, follows the same general architecture as BERT and takes the following measures to reduce the number of parameters:

  • Removal of the token-type (segment) embeddings layer
  • Removal of the Pooler layer (the classification layer after the Transformer Encoder)
  • Halving the number of Transformer layers (BERT_BASE: 12, DistilBERT: 6)

These measures result in about a 40% reduction in the number of parameters.
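The halving of the Transformer layers shows up directly in the default configurations that ship with the transformers library. A small sketch (attribute names follow the library's BertConfig and DistilBertConfig classes):

    from transformers import BertConfig, DistilBertConfig

    bert_cfg = BertConfig()          # defaults match BERT_BASE
    distil_cfg = DistilBertConfig()  # defaults match DistilBERT

    print(bert_cfg.num_hidden_layers)            # 12 Transformer layers
    print(distil_cfg.n_layers)                   # 6 Transformer layers
    print(bert_cfg.hidden_size, distil_cfg.dim)  # 768 768 (hidden size unchanged)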

Initial values for Student

The parameters of BERT, the Teacher, are used as the initial values for the Student. However, since the Student has only half as many layers as the Teacher, it is initialized by taking one Teacher layer out of every two, as sketched below.
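This initialization can be illustrated as follows. Note this is a simplified sketch with generic encoder layers, not the exact module layout of BERT and DistilBERT in the transformers library (their internal parameter names differ, so a real conversion script has to map names explicitly):

    import torch.nn as nn

    # Stand-ins for the Teacher's 12 and the Student's 6 Transformer layers
    teacher_layers = nn.ModuleList(
        [nn.TransformerEncoderLayer(d_model=768, nhead=12) for _ in range(12)]
    )
    student_layers = nn.ModuleList(
        [nn.TransformerEncoderLayer(d_model=768, nhead=12) for _ in range(6)]
    )

    # Initialize the Student with every other Teacher layer (0, 2, 4, ...)
    for student_layer, teacher_layer in zip(student_layers, teacher_layers[::2]):
        student_layer.load_state_dict(teacher_layer.state_dict())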

Distillation

DistilBERT sets up the training as follows:

  • Use very large batches (up to 4,000 examples)
  • Apply masking dynamically (see the sketch below)
  • Drop the Next Sentence Prediction objective
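Dynamic masking means the [MASK] positions are re-sampled every time a batch is built, rather than fixed once during preprocessing as in the original BERT. A minimal sketch using the transformers data collator, which masks on the fly (for a short sentence, some calls may happen to mask nothing):

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    # The same sentence receives different masks on each call
    encoding = tokenizer("I watched a movie yesterday.", return_tensors="pt")
    batch = collator([{"input_ids": encoding["input_ids"][0]}])
    print(tokenizer.decode(batch["input_ids"][0]))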

Training Loss Function

The loss function of DistilBERT is a linear combination of the Distillation Loss (L_{ce}), the Masked Language Modelling Loss (L_{mlm}), and the Cosine Embedding Loss (L_{cos}); in its simplest, equally weighted form:

L = L_{ce} + L_{mlm} + L_{cos}
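In code, the objective is just a weighted sum of the three terms. A minimal sketch, where the weights are hyperparameters (names such as alpha_ce are my own, not from the paper):

    def distilbert_loss(loss_ce, loss_mlm, loss_cos,
                        alpha_ce=1.0, alpha_mlm=1.0, alpha_cos=1.0):
        # Linear combination of the three DistilBERT training losses;
        # the alpha weights are hyperparameters (1.0 gives the plain sum above).
        return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos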

Distillation Loss

Distillation Loss (L_{ce}) is the loss function that makes the Student imitate the Teacher, and is expressed as

L_{ce} = - \sum_i t_i \log(s_i)

where t_i is the probability the Teacher (BERT) assigns to the i-th word and s_i is the probability the Student (DistilBERT) assigns to the i-th word.

This loss function makes the Student mimic not only the word with the highest predicted probability but also the words with the next highest probabilities, so the Student learns the whole distribution predicted by the Teacher.
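A minimal PyTorch sketch of this soft-target cross-entropy, assuming Teacher and Student logits over the vocabulary for one masked position are available (the temperature, introduced below, is omitted here):

    import torch
    import torch.nn.functional as F

    def distillation_loss(teacher_logits, student_logits):
        # L_ce = -sum_i t_i * log(s_i), averaged over positions
        t = F.softmax(teacher_logits, dim=-1)          # Teacher probabilities t_i
        log_s = F.log_softmax(student_logits, dim=-1)  # log Student probabilities s_i
        return -(t * log_s).sum(dim=-1).mean()

    # Toy example: a vocabulary of 4 words at one masked position
    teacher_logits = torch.tensor([[4.0, 2.0, 1.0, 1.0]])
    student_logits = torch.tensor([[3.0, 2.5, 1.0, 0.5]])
    print(distillation_loss(teacher_logits, student_logits))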

For example, consider the following sentence.

I watched [MASK] yesterday.

Suppose the Teacher, BERT, predicts the following probabilities:

Predicted word    Predicted probability
movie             0.80
TV                0.10
baseball          0.05
MMA               0.05

In this case, the Student learns not only "movie" but also "TV", "baseball", and "MMA", so that its predicted probabilities come closer to those of the Teacher, BERT.

Here, however, the prediction probabilities are computed with the softmax-temperature function:

p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

If T is set to 1, this reduces to the ordinary softmax function; if T is greater than 1, the distribution is softened, giving words with lower predicted probabilities more weight than the ordinary softmax would.

By using softmax-temperature during training, the Student also learns from the words the Teacher predicts with low probability. (At inference time, T is set back to 1, recovering the standard softmax.)
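To see the effect of the temperature, here is what happens to the Teacher distribution from the example above, [0.8, 0.1, 0.05, 0.05], when it is re-softened with T = 2 (a small numeric sketch):

    import torch
    import torch.nn.functional as F

    probs = torch.tensor([0.80, 0.10, 0.05, 0.05])  # movie, TV, baseball, MMA
    logits = probs.log()  # logits consistent with these probabilities

    for T in (1.0, 2.0):
        print(T, F.softmax(logits / T, dim=-1))
    # T=1 recovers the original distribution; T=2 yields roughly
    # [0.54, 0.19, 0.13, 0.13], so low-probability words contribute more.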

Masked Language Model Loss

Masked Language Modelling Loss (L_{mlm}) is the same masked language modeling loss function used in BERT.

Cosine Embedding Loss

Cosine Embedding Loss (L_{cos}) is a loss based on the cosine similarity between the hidden-layer vectors of the Teacher (BERT) and the Student (DistilBERT). It pulls the Student's embedding vectors closer in direction to the Teacher's embedding vectors.
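A minimal sketch using PyTorch's built-in CosineEmbeddingLoss, assuming matching batches of Teacher and Student hidden-state vectors are available (random stand-ins here):

    import torch
    import torch.nn as nn

    cos_loss = nn.CosineEmbeddingLoss()

    # Hidden-state vectors for 5 tokens (hidden size 768)
    teacher_hidden = torch.randn(5, 768)
    student_hidden = torch.randn(5, 768)

    # target = 1 pulls each Student vector toward the corresponding Teacher vector
    target = torch.ones(5)
    print(cos_loss(student_hidden, teacher_hidden, target))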

References

Sanh et al., "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter": https://arxiv.org/abs/1910.01108
Hinton et al., "Distilling the Knowledge in a Neural Network": https://arxiv.org/abs/1503.02531
Gou et al., "Knowledge Distillation: A Survey": https://arxiv.org/abs/2006.05525
https://arunm8489.medium.com/understanding-distil-bert-in-depth-5f2ca92cf1ed
https://www.analyticsvidhya.com/blog/2022/11/introduction-to-distilbert-in-student-model/
https://medium.com/huggingface/distilbert-8cf3380435b5
https://data-analytics.fun/2020/10/06/understanding-distilbert/
