2023-02-03

Hugging Face Transformers Tokenizer

Transformer models cannot take raw text directly as input, so the text must first be converted into numerical data. The Tokenizer performs this conversion.

The Tokenizer performs the following processing on the input text:

  • Splits the text into the smallest units, such as words, sub-words, and symbols
  • Assigns an ID to each token
  • Adds the special tokens that the model requires to the input
    • For example, BERT requires a [CLS] token to mark the beginning of a sequence and [SEP] tokens to mark sentence boundaries.

You can use the from_pretrained method of the AutoTokenizer class to load the Tokenizer that matches a specified model.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)

>> {'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
>>  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The meaning of the return values is as follows:

Return          Description
input_ids       The input text split into tokens, with each token replaced by its ID
attention_mask  Indicates whether each token should be attended to (1) or ignored, such as padding (0)
token_type_ids  Indicates which sentence each token belongs to when a single input consists of multiple sentences
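The returned object is a BatchEncoding, which behaves like a dict but also supports attribute access; a quick check using the encoding above:

print(encoded_input["input_ids"] == encoded_input.input_ids)

>> True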

Basic usage

Use the AutoTokenizer class to instantiate a Tokenizer as follows.

from transformers import AutoTokenizer

checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

You can encode text with the Tokenizer as follows.

text = "The quick brown fox jumps over the lazy dog."
encoded_text = tokenizer(text)
print(encoded_text)

>> {
>>     'input_ids': [101, 1109, 3613, 3058, 17594, 15457, 1166, 1204, 1103, 16688, 3676, 119, 102],
>>     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>> }

The convert_ids_to_tokens method can also be used to convert from IDs to tokens.

tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)

>> ['[CLS]', 'The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', '[SEP]']

[CLS] is the token that marks the beginning of a sequence in the BERT model, and [SEP] is the token that marks sentence boundaries. In input_ids, ID 101 corresponds to [CLS] and 102 to [SEP].
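You can confirm these IDs through the tokenizer's special-token attributes; a quick check with the tokenizer loaded above:

print(tokenizer.cls_token, tokenizer.cls_token_id)
print(tokenizer.sep_token, tokenizer.sep_token_id)

>> [CLS] 101
>> [SEP] 102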

The convert_tokens_to_string method can be used to convert tokens back into text.

print(tokenizer.convert_tokens_to_string(tokens))

>> [CLS] The quick brown fox jumps over the lazy dog. [SEP]

The decode method converts IDs directly back into text.

decoded_string = tokenizer.decode(encoded_text["input_ids"], skip_special_tokens=True)
print(decoded_string)

>> The quick brown fox jumps over the lazy dog.

With skip_special_tokens=True, the special tokens are removed, which makes the output easy to use when presenting inference results.

Encoding of multiple sentences

Multiple sentences can be encoded together by passing them to the Tokenizer as a list.

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

>> {'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
>>                [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
>>                [101, 1327, 1164, 5450, 23434, 136, 102]],
>>  'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
>>                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>                     [0, 0, 0, 0, 0, 0, 0]],
>>  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
>>                     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
>>                     [1, 1, 1, 1, 1, 1, 1]]}

The following options can be set for encoding:

  • Pad each sentence to the length of the longest sentence in the batch
  • Truncate each sentence to the maximum length the model can accept
  • Return framework-specific tensors

Padding

The tensors used as model input must have a uniform shape, so sentences of different lengths can be problematic. Padding is a strategy for making the tensor rectangular by adding a special padding token to shorter sentences.

If padding=True is set, shorter sequences in a batch are padded to match the longest sequence.

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)

>> {'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
>>                [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
>>                [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
>>  'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
>>  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
>>                     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
>>                     [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
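Besides padding=True, which pads to the longest sequence in the batch, you can pad to a fixed length with padding="max_length"; a minimal sketch (the length 12 is an arbitrary value for illustration):

encoded_fixed = tokenizer(batch_sentences, padding="max_length", max_length=12)
print([len(ids) for ids in encoded_fixed["input_ids"]])

>> [12, 15, 12]

Note that the second sentence is 15 tokens long and is not shortened, because truncation was not requested.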

Truncation

On the other hand, a sequence may be too long for the model to handle. In that case, the sequence must be truncated to a shorter length.

Setting truncation=True will truncate the sequence to the maximum length the model will accept.

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)

>> {'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
>>                [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
>>                [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
>>  'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
>>  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
>>                     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
>>                     [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
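Combining truncation=True with max_length gives explicit control over the cut-off point; a minimal sketch (max_length=8 is an arbitrary value for illustration; the limit includes the special tokens):

encoded_truncated = tokenizer(batch_sentences, truncation=True, max_length=8)
print([len(ids) for ids in encoded_truncated["input_ids"]])

>> [8, 8, 7]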

Return tensors

The Tokenizer can also return the actual tensors that will be fed to the model.

Set the return_tensors parameter to pt for PyTorch or tf for TensorFlow.

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)

>> {'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
>>                       [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
>>                       [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
>>  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>                            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>                            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
>>  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
>>                            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
>>                            [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
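These tensors can be fed directly to a model by unpacking the dictionary; a minimal sketch, assuming PyTorch and the same bert-base-cased checkpoint:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)
with torch.no_grad():
    outputs = model(**encoded_input)  # unpacks input_ids, token_type_ids, attention_mask
print(outputs.last_hidden_state.shape)

>> torch.Size([3, 15, 768])

Setting return_tensors="tf" returns TensorFlow tensors instead.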
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
print(encoded_input)

>> {'input_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
>> array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
>>        [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
>>        [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
>>       dtype=int32)>,
>>  'token_type_ids': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
>> array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
>>        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>,
>>  'attention_mask': <tf.Tensor: shape=(3, 15), dtype=int32, numpy=
>> array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
>>        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
>>        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}

The recommended options for padding and truncation are summarized in a table in the Hugging Face "Preprocessing data" documentation.

See below for detailed specifications of padding and truncation.

https://huggingface.co/transformers/v3.2.0/preprocessing.html

Sentence pairs

Sometimes it is necessary to provide sentence pairs to the model, such as when one wants to classify whether two sentences are similar.

In BERT, the input is represented as follows.

[CLS] Sequence A [SEP] Sequence B [SEP]

To encode a pair of sentences in the format the model expects, pass the two sentences as two separate arguments.

encoded_input = tokenizer("How old are you?", "I'm 6 years old")
print(encoded_input)

>> {'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
>> 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
>> 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The token_type_ids indicate which part of the input is sentence A and which part is sentence B. Note that not all models require token_type_ids.

By default, tokenization returns only the inputs that the associated model expects. You can use return_input_ids or return_token_type_ids to force the return (or non-return) of one of these arguments.
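For example, a minimal sketch that suppresses token_type_ids for a sentence pair (encoded_without_types is an illustrative variable name):

encoded_without_types = tokenizer("How old are you?", "I'm 6 years old", return_token_type_ids=False)
print(list(encoded_without_types.keys()))

>> ['input_ids', 'attention_mask']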

Decoding the retrieved token ID will show that the special token has been properly added.

print(tokenizer.decode(encoded_input["input_ids"]))

>> [CLS] How old are you? [SEP] I'm 6 years old [SEP]

If you have a list of sentence pairs you wish to process, pass those two lists to the Tokenizer.

batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
                             "And I should be encoded with the second sentence",
                             "And I go with the very last one"]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)

>> {'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
>>                [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
>>                [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
>> 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
>>                    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
>>                    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
>> 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
>>                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
>>                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

You can also decode the lists of input_ids one by one to check what is actually given to the model.

for ids in encoded_inputs["input_ids"]:
    print(tokenizer.decode(ids))

>> [CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
>> [CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
>> [CLS] And the very very last one [SEP] And I go with the very last one [SEP]

Pre-Tokenized input

You can also pass pre-tokenized input to the Tokenizer. Pre-tokenized here means the input is already split into words; it has not yet been converted into sub-word tokens or IDs.

If you want to use pre-tokenized input, specify is_split_into_words=True (named is_pretokenized in Transformers v3.x) when passing input to the Tokenizer.

encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
print(encoded_input)

>> {'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
>> 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
>> 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Note that the Tokenizer adds special tokens unless you specify add_special_tokens=False.
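For example, a short sketch with add_special_tokens=False, so that no [CLS] or [SEP] is added (encoded_plain is an illustrative variable name):

encoded_plain = tokenizer(["Hello", "I'm", "a", "single", "sentence"],
                          is_split_into_words=True, add_special_tokens=False)
print(tokenizer.decode(encoded_plain["input_ids"]))

>> Hello I'm a single sentence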

A batch of sentences can be encoded as follows.

batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
                   ["And", "another", "sentence"],
                   ["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)

You can also encode sentence pairs as follows.

batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
                             ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
                             ["And", "I", "go", "with", "the", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)

References

https://huggingface.co/docs/tokenizers/quicktour#build-a-tokenizer-from-scratch
https://huggingface.co/docs/transformers/preprocessing
https://huggingface.co/transformers/v3.2.0/preprocessing.html
