2023-01-20

What is BERT

BERT stands for Bidirectional Encoder Representations from Transformers and is a natural language processing model announced by Google on October 11, 2018.

BERT is a Transformer-based model that is pre-trained bidirectionally on large unlabeled datasets. Because BERT is a pre-training model rather than a predictive model, it cannot solve a task by itself. Fine tuning it for an individual task such as document classification or sentiment analysis turns it into a predictive model.

BERT features

BERT has the following features:

  • Contextual understanding
  • High versatility
  • Overcoming data shortages

Contextual understanding

Before BERT, there were language processing models such as ELMo and OpenAI GPT, but they could not fully capture context: ELMo is only shallowly bidirectional, and OpenAI GPT is unidirectional. Here is an example from Google.

Prior to the introduction of BERT, Google Search could not properly handle words like "to" that express the relationship between the other words in a query. For example, suppose we have the following search query.

2019 brazil traveler to usa need a visa

For the above query, what the user wants to know is whether a traveler from Brazil needs a visa to go to the United States. Before BERT, however, the word "to" could not be processed correctly, so the query could also be interpreted as an American traveler going to Brazil, resulting in search results that did not match the need.

BERT is a context-aware model built on a bidirectional Transformer. After BERT was deployed in Search, the query was correctly interpreted as a Brazilian traveler going to the U.S., and the U.S. Embassy's visa information page for Brazilian travelers could be displayed at the top of the results.

2019 brazil traveler to usa need a visa
Understanding searches better than ever before

High versatility

BERT can be pre-trained on large amounts of text from Wikipedia, BooksCorpus, and other sources, and the resulting model can then be applied to various tasks such as sentiment analysis and translation through fine tuning.
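For example, the same pre-trained weights can be reused for many different tasks simply by attaching different output heads. The sketch below illustrates this with the Hugging Face Transformers library; the library, checkpoint name, and label counts are illustrative assumptions, not something specified in this article.

```python
# A minimal sketch, assuming the Hugging Face Transformers library: one
# pre-trained checkpoint, several task-specific heads. Checkpoint name and
# label counts are illustrative.
from transformers import (
    AutoModelForSequenceClassification,  # e.g. sentiment analysis
    AutoModelForTokenClassification,     # e.g. named entity recognition
    AutoModelForQuestionAnswering,       # e.g. extractive question answering
)

checkpoint = "bert-base-uncased"  # the same pre-trained weights every time

# Each call loads the shared BERT encoder and adds a freshly initialized,
# task-specific output layer that is trained during fine tuning.
sentiment_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
ner_model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=9)
qa_model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
```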

Overcoming data shortages

Unlike traditional supervised models, BERT can learn from unlabeled data. Labeled datasets are often difficult to obtain, and BERT overcomes this shortage by using abundant unlabeled text as material for pre-training.

How BERT works

Learning BERT involves the following two steps:

  1. Pre-training
    Pre-training with unlabeled data
  2. Fine tuning
    Fine tuning with labeled data using the pre-trained weights as initial values

As an example, the spam classification task is illustrated below.

BERT transfer learning
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)

BERT uses only the Encoder part of the Transformer architecture, and comes in two sizes, BASE and LARGE.

Model        L (# Transformer blocks)   H (hidden size)   A (# self-attention heads)   # parameters
BERT_BASE    12                         768               12                           110M
BERT_LARGE   24                         1024              16                           340M
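As a rough sketch, the table above can be mapped onto model hyperparameters. The snippet below assumes the Hugging Face Transformers library and its BertConfig, where L corresponds to num_hidden_layers, H to hidden_size, and A to num_attention_heads; the parameter counts it prints should land close to the 110M and 340M reported in the paper.

```python
# A rough sketch, assuming the Hugging Face Transformers library: the table
# above expressed as BertConfig hyperparameters (L = num_hidden_layers,
# H = hidden_size, A = num_attention_heads).
from transformers import BertConfig, BertModel

base_config = BertConfig(
    num_hidden_layers=12,    # L
    hidden_size=768,         # H
    num_attention_heads=12,  # A
    intermediate_size=3072,  # feed-forward size, 4 * H in the paper
)
large_config = BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
)

# Randomly initialized models are enough to check the approximate parameter
# counts (~110M and ~340M in the paper).
for name, config in [("BASE", base_config), ("LARGE", large_config)]:
    model = BertModel(config)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"BERT_{name}: ~{n_params / 1e6:.0f}M parameters")
```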

Pre-training

The data used in BERT pre-training is unlabeled raw text data.

The document data is converted to a sequence of vectors as shown in the figure below.

BERT input representation
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

The top row, "Input," shows that the input is a single sequence: a special [CLS] token at the beginning, followed by the two sentences A and B, each terminated by a special [SEP] token.

In the rows below, three embedding layers map each token, its sentence (segment), and its position in the sequence to H-dimensional representations that are learned during pre-training; the element-wise sum of these three vectors becomes the input to the Transformer.

  • Token Embeddings: the token ID
  • Segment Embeddings: whether the token belongs to sentence A or B
  • Position Embeddings: the position in the sequence
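The sketch below, which assumes the Hugging Face Transformers library, shows what this input representation looks like for a sentence pair: the tokenizer produces token IDs and segment IDs, and the three embeddings are summed to form the Transformer input. The example sentences are arbitrary.

```python
# A minimal sketch, assuming the Hugging Face Transformers library, of the
# [CLS] sentence A [SEP] sentence B [SEP] input representation. The example
# sentences are arbitrary.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("my dog is cute", "he likes to play", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
# roughly: ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'to', 'play', '[SEP]']
print(encoded["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B tokens

# The embedding layer sums token, segment (token_type), and position
# embeddings (followed by LayerNorm/dropout in this implementation) to form
# the vector sequence fed to the Transformer.
emb = model.embeddings
input_ids = encoded["input_ids"]
positions = torch.arange(input_ids.size(1)).unsqueeze(0)
summed = (
    emb.word_embeddings(input_ids)
    + emb.token_type_embeddings(encoded["token_type_ids"])
    + emb.position_embeddings(positions)
)
print(summed.shape)  # (1, sequence_length, H=768)
```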

BERT performs pre-training on this input by combining the following two objectives:

  • Masked Language Model
  • Next Sentence Prediction

Masked Language Model

Conventional natural language processing models process sentences in a single direction and must predict a word based only on the words that precede it. BERT, however, trains using a bidirectional Transformer, resulting in a significant improvement in accuracy over conventional methods. This is made possible by the Masked Language Model.

The Masked Language Model randomly selects 15% of the words in the input sentence and trains the model to predict the original word from the surrounding context. Of the 15% selected, 80% are replaced by the [MASK] token, 10% are replaced by a random word, and the remaining 10% are left unchanged.

  • 80% of the 15% selected words are converted to [MASK].
    • my dog is hairy ➡︎ my dog is [MASK]
  • 10% of the 15% selected words are converted to random words.
    • my dog is hairy ➡︎ my dog is apple
  • 10% of the 15% selected words are kept the same.
    • my dog is hairy ➡︎ my dog is hairy

By solving the task of guessing these substituted words from the surrounding context, the contextual information corresponding to the words can be learned.
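A simplified sketch of this 80% / 10% / 10% masking rule is shown below. It is an illustration in plain Python, not BERT's actual implementation, and the toy vocabulary is made up.

```python
# A simplified sketch of the 80% / 10% / 10% masking rule, written in plain
# Python for illustration; it is not BERT's actual implementation, and the
# vocabulary here is a made-up toy.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    masked, labels = [], []
    for token in tokens:
        if random.random() < mask_prob:            # select ~15% of tokens
            labels.append(token)                   # the model must recover this token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")            # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random word
            else:
                masked.append(token)               # 10%: keep the original word
        else:
            masked.append(token)
            labels.append(None)                    # not included in the prediction loss
    return masked, labels

toy_vocab = ["apple", "banana", "runs", "blue"]
print(mask_tokens("my dog is hairy".split(), toy_vocab))
```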

Next Sentence Prediction

While the Masked Language Model learns about words, it does not learn the relationship between sentences. Next Sentence Prediction learns the relationship between two sentences.

In Next Sentence Prediction, sentence B is replaced with a random sentence from the corpus with 50% probability, and the model learns to determine whether B actually follows A (isNext) or not (notNext). The [CLS] token is provided for this classification.

Example sentence                                                                       Judgement
[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]          isNext
[CLS] the man went to [MASK] store [SEP] penguin [MASK] are flightless birds [SEP]     notNext

In this way, BERT learns not only representations of individual words but also representations of whole sentences.
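The following toy sketch shows one way such sentence pairs can be generated; it is illustrative only, and the tiny corpus is made up.

```python
# A toy sketch of Next Sentence Prediction data construction: with 50%
# probability sentence B is the true next sentence (isNext), otherwise it is
# swapped for a random other sentence (notNext). The corpus is made up.
import random

def make_nsp_example(corpus, i):
    sent_a = corpus[i]
    if random.random() < 0.5 and i + 1 < len(corpus):
        sent_b, label = corpus[i + 1], "isNext"
    else:
        # pick any sentence except the true next one
        candidates = [s for j, s in enumerate(corpus) if j != i + 1]
        sent_b, label = random.choice(candidates), "notNext"
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

corpus = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
]
print(make_nsp_example(corpus, 0))
```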

Fine tuning

BERT's pre-trained model is fine-tuned according to the task to be solved. Fine tuning involves training on a relatively small amount of labeled data.

When fine tuning is performed, the parameters obtained from pre-training are used as the model's initial values and are then updated on the labeled data. Because of this initialization, a high-performing model can be obtained even with a relatively small amount of training data.
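As a concrete illustration, the sketch below fine-tunes a pre-trained BERT for a binary classification task such as spam detection, assuming the Hugging Face Transformers library and PyTorch; the example texts, labels, and hyperparameters are placeholders.

```python
# A hedged sketch, assuming the Hugging Face Transformers library and PyTorch:
# the pre-trained weights initialize the encoder, and a small labeled dataset
# trains the classification head. The texts, labels, and hyperparameters are
# placeholders (a tiny spam-classification example).
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["win a free prize now", "see you at the meeting tomorrow"]  # toy data
labels = torch.tensor([1, 0])  # 1 = spam, 0 = not spam

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)  # a small learning rate is typical

model.train()
for _ in range(3):  # a few passes over the toy batch
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```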

The figure below shows an example of fine tuning.

  • (a) Sentence Pair Classification Tasks
  • (b) Single Sentence Classification Tasks
  • (c) Question Answering Tasks
  • (d) Single Sentence Tagging Tasks

Fine tuning
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT performance benchmarks

The performance of natural language processing models is often measured with GLUE (General Language Understanding Evaluation), a benchmark that aggregates scores across a set of natural language understanding tasks. The following table shows the GLUE results reported in the paper.

GLUE test results
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT scores higher than existing models such as OpenAI GPT on all of the GLUE datasets.
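For reference, individual GLUE tasks can be loaded programmatically, for example with the Hugging Face datasets library (an assumption on my part, not something used in the paper); a fine-tuned model is scored on each task and the per-task scores are combined into the overall benchmark result.

```python
# A small sketch, assuming the Hugging Face `datasets` library (not used in
# the paper or this article), of loading one GLUE task, MRPC.
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")
print(mrpc["validation"][0])  # fields: sentence1, sentence2, label, idx
```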

References

https://arxiv.org/abs/1810.04805
https://www.blog.google/products/search/search-language-understanding-bert/
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
https://towardsdatascience.com/question-answering-with-a-fine-tuned-bert-bc4dafd45626
https://jalammar.github.io/illustrated-bert/
