What is word embedding?
For a computer to process natural language, words and documents must be turned into numbers that reflect their properties and meanings. Representing a word as a fixed-length vector for this purpose is called word embedding.
By representing words as vectors, for example, semantically similar words (such as "boat" and "ship") or semantically related words (such as "boat" and "water") become closer together in vector space. To capture semantic relatedness, documents are used as context; to capture semantic similarity, words are used as context.
The following methods have been proposed to represent words as vectors.
- One-hot vector
- Word2Vec
- fastText
- GloVe
- ELMo
- BERT
One-hot vector
One technique for vectorizing words is the one-hot vector: a vector in which exactly one element is 1 and all the other elements are 0.
For example, vectors can be assigned to words as follows:
Word | One-hot vector |
---|---|
I | (1, 0, 0) |
play | (0, 1, 0) |
basketball | (0, 0, 1) |
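Such an assignment can be produced mechanically; the following is a minimal sketch using NumPy and the three-word vocabulary from the table above.
```python
import numpy as np

vocabulary = ["I", "play", "basketball"]

# One vector per word: a 1 at that word's index, 0 everywhere else
one_hot = {word: np.eye(len(vocabulary), dtype=int)[i] for i, word in enumerate(vocabulary)}
print(one_hot["play"])  # [0 1 0]
```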
While this method can mechanically assign vectors, it has the following challenges:
- The dimension is too large: if the vocabulary contains 1 million words, each vector has 1 million dimensions, which is computationally expensive.
- It cannot encode meaning: the vector can only tell whether two words are identical; it says nothing about what a word means.
Word embeddings
Word embedding was proposed to solve the problem of huge vector dimensionality. It embeds each word in a continuous vector space and treats the word as a single point in that space. This makes it possible to represent words as vectors that capture their meanings, and to perform vector arithmetic between different words.
The following methods have been proposed for word embedding.
- Word2Vec
- fastText
- GloVe
- ELMo
- BERT
Word2vec
Word2vec is a method that overcomes the challenges of one-hot vectors: the excessive dimensionality and the inability to encode meaning. It was released in 2013 by a research team at Google.
Word2vec is a set of two architectures: Skip-gram and CBOW (Continuous Bag-Of-Words).
Both architectures are described in the paper "Efficient Estimation of Word Representations in Vector Space".
Architecture | Description |
---|---|
Skip-gram | Predicts the surrounding context words from a given word |
CBOW | Predicts a target word from its surrounding context words |
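To make the two objectives concrete, the following sketch (not from the original article) lists the (input, target) training pairs each architecture would generate from a toy sentence with a window size of 1.
```python
sentence = "i play basketball every day".split()
window = 1

skipgram_pairs = []  # (center word -> one context word)
cbow_pairs = []      # (all context words -> center word)

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    skipgram_pairs.extend((center, c) for c in context)
    cbow_pairs.append((context, center))

print(skipgram_pairs[:4])  # [('i', 'play'), ('play', 'i'), ('play', 'basketball'), ('basketball', 'play')]
print(cbow_pairs[1])       # (['i', 'basketball'], 'play')
```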
Word2vec can be easily implemented using a library called gensim.
```python
import gensim
import gensim.downloader as gendl

# Load the text8 corpus (a cleaned English Wikipedia excerpt) and train a 200-dimensional model.
# gensim >= 4.0 uses vector_size/epochs; older versions use size/iter instead.
corpus = gendl.load("text8")
model = gensim.models.Word2Vec(corpus, vector_size=200, window=5, epochs=10, min_count=1)
```
The word "tea" is represented by the following 200-dimensional vector.
```python
model.wv['tea']
```
```
array([ 1.0338242 , -0.5497605 , 0.85095 , -1.9308733 , 0.02577365,
-0.04760092, -0.11763089, -0.88888925, 1.0477808 , -0.8593776 ,
-0.18352145, 0.44907862, 0.7292549 , -0.989956 , -0.31382865,
-0.326568 , 1.0326288 , -1.015825 , 1.1364412 , 2.2554684 ,
-0.46170548, -0.87449044, -0.92171615, 1.3384876 , 0.71479833,
1.5579652 , -0.79532343, -0.01963346, -0.09595346, 0.2980923 ,
-0.77367973, -0.21674974, -0.16363882, 0.7561723 , -0.9837275 ,
0.7700488 , 0.31943208, 1.1382285 , 0.7920231 , 0.4009208 ,
-0.12014329, 0.45739177, -0.8750017 , 0.9846294 , 1.0761734 ,
0.9916462 , -1.9893979 , 0.63205105, 1.6679561 , 0.23037305,
-0.09124516, 0.20652458, 2.5330052 , 0.46803576, -0.52581394,
-0.04794846, -0.8805762 , -0.19912305, 0.4005117 , -0.7323035 ,
1.1342882 , -2.2171109 , 2.2963653 , 0.30365032, 1.0434904 ,
-0.04669679, 0.4062761 , 0.5528872 , 0.47095624, -1.6239247 ,
1.557434 , -0.9619251 , 0.9666863 , -0.2241764 , -0.75970024,
1.8893499 , -1.0962995 , -1.0411807 , 0.4480955 , 0.5210397 ,
1.6454376 , 0.65203476, -0.04012801, -0.40056226, 2.852509 ,
-0.32553425, -0.20229222, 0.7245843 , -1.9557822 , -0.12104818,
0.28175735, -0.895713 , 0.58786476, 1.3742826 , -0.41480052,
0.07302658, 0.09337772, -0.34411722, -0.00724911, -0.9056035 ,
0.12258726, -0.15003532, 0.04658122, -2.2854378 , 0.28233293,
0.77172595, 1.3786261 , -1.2632967 , -1.426814 , -0.51365477,
-1.2197368 , 0.04790526, 0.62031317, 1.1224078 , -0.1804243 ,
0.49400517, -0.02745727, -1.0031232 , -0.10298221, 0.91482383,
0.98645395, 0.2342329 , 0.02064842, -0.33413368, -1.1859212 ,
1.1475669 , -1.0501987 , -0.7069197 , -0.12736289, 1.7058631 ,
-0.74710023, 0.48769948, -0.7129323 , 0.49225873, -1.3105804 ,
-2.1176062 , 0.5835398 , -0.01676613, 0.40714362, 1.6942219 ,
0.8474393 , -0.7914968 , 1.8470286 , 0.4587502 , -1.2789383 ,
0.48545036, -0.50352156, -0.04223507, -0.35754788, -0.60754126,
0.05735195, 0.32261458, -0.09268601, 0.702604 , 1.1815772 ,
0.17023656, -0.46019134, -1.0920937 , 0.26714015, -0.06735314,
-0.16602936, 0.6498549 , -0.35616133, 0.20689702, 0.7797428 ,
0.14901382, -0.71886814, 1.2617997 , 0.43995163, 1.0300183 ,
-0.81545556, -0.06593582, -0.23527797, -1.3182799 , 0.41896763,
1.8256154 , -0.04706338, -1.106722 , -0.47043508, -0.30877873,
0.27309516, -0.5845974 , 0.12854019, 0.44951373, 0.46647298,
-0.30743253, -0.7417909 , 1.0234478 , -1.4167138 , -0.6474553 ,
0.9093568 , 0.17825471, -1.3186976 , -1.7007768 , -1.4912218 ,
0.02761938, -1.5523437 , -0.30878946, -0.5883677 , 0.35952488],
dtype=float32)
```
```python
len(model.wv['tea'])
```
```
200
```
Thus, Word2vec reduces the vector representation of a word from as many dimensions as there are vocabulary words (potentially millions) to just 200 dimensions.
Words similar to the word "tea" are as follows.
```python
model.wv.most_similar("tea")
```
```
[('coffee', 0.7221519947052002),
('beef', 0.6720874309539795),
('roast', 0.6600600481033325),
('spices', 0.653677225112915),
('drinks', 0.6486484408378601),
('sweets', 0.6478144526481628),
('cheese', 0.6444395780563354),
('pork', 0.6424828767776489),
('sausage', 0.6399528980255127),
('vegetables', 0.6373381614685059)]
```
Thus, Word2vec captures the meaning of words: the nearest neighbors of "tea" are other foods and drinks.
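The similarity between two specific words can also be queried directly. The sketch below uses the model trained above; gensim's similarity returns the cosine of the two word vectors.
```python
# Cosine similarity between word vectors: related words are expected to score higher
print(model.wv.similarity('tea', 'coffee'))
print(model.wv.similarity('tea', 'mathematics'))
```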
See what happens when we combine the following words.
queen + man - woman
```python
model.wv.most_similar(positive=['queen', 'man'], negative=['woman'])
```
```
[('king', 0.5715309977531433),
('lord', 0.5021377801895142),
('crown', 0.4470388889312744),
('wight', 0.43937280774116516),
('prince', 0.43743711709976196),
('duke', 0.43583834171295166),
('regent', 0.4239676296710968),
('valdemar', 0.4121330976486206),
('scotland', 0.4113471210002899),
('vii', 0.3981912136077881)]
```
The top result is "king".
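What most_similar does with its positive and negative arguments is, roughly, plain vector arithmetic followed by a cosine-similarity search. A simplified sketch (gensim additionally normalizes each input vector before combining them):
```python
import numpy as np

# Build the analogy vector and compare it with the vector for "king"
analogy = model.wv['queen'] + model.wv['man'] - model.wv['woman']
king = model.wv['king']
cosine = np.dot(analogy, king) / (np.linalg.norm(analogy) * np.linalg.norm(king))
print(cosine)
```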
GloVe
Both Word2Vec architectures are predictive: they ignore the fact that some words co-occur far more often than others, and because they look only at a local context window, they cannot capture global, corpus-wide statistics.
GloVe (Global Vectors for Word Representation) uses global statistics on word co-occurrence to learn vector representations of words.
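Concretely, GloVe fits word and context vectors so that their dot product approximates the logarithm of the co-occurrence count. The weighted least-squares objective from the GloVe paper is, roughly:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

where $X_{ij}$ counts how often word $j$ appears in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that down-weights rare co-occurrences.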
Below is sample code for a GloVe implementation.
```python
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

import numpy as np

# Parse the 50-dimensional GloVe vectors into a {word: vector} dictionary
embeddings_dict = {}
with open("glove.6B.50d.txt", 'r') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector
```
```python
embeddings_dict['water']
```
```
array([ 0.53507 , 0.5761 , -0.054351, -0.208 , -0.7882 , -0.17592 ,
-0.21255 , -0.14388 , 1.0344 , -0.079253, 0.27696 , 0.37951 ,
1.2139 , -0.34032 , -0.18118 , 0.72968 , 0.89373 , 0.82912 ,
-0.88932 , -1.4071 , 0.55571 , -0.017453, 1.2524 , -0.57916 ,
0.43 , -0.77935 , 0.4977 , 1.2746 , 1.0448 , 0.36433 ,
3.7921 , 0.083653, -0.45044 , -0.063996, -0.19866 , 0.75252 ,
-0.27811 , 0.42783 , 1.4755 , 0.37735 , 0.079519, 0.024462,
0.5013 , 0.33565 , 0.051406, 0.39879 , -0.35603 , -0.78654 ,
0.61563 , -0.95478 ], dtype=float32)
```
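Unlike a gensim model, this plain dictionary has no built-in most_similar, but nearest neighbors can be found by brute-force cosine similarity. A minimal sketch (the helper most_similar_glove below is not part of GloVe itself):
```python
import numpy as np

def most_similar_glove(word, topn=5):
    """Return the topn words closest to `word` by cosine similarity."""
    target = embeddings_dict[word]
    sims = {}
    for other, vec in embeddings_dict.items():
        if other == word:
            continue
        sims[other] = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:topn]

print(most_similar_glove('water'))
```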
fastText
Word2Vec and GloVe share the limitation that they cannot produce vectors for unknown (out-of-vocabulary) words. To address this, Facebook proposed a model called fastText.
While Word2Vec learns a vector for each individual word, fastText splits a word into several subwords (character n-grams) and learns vectors for those.
For example, the word "what" is split into the character tri-grams "&lt;wh", "wha", "hat", "at&gt;", where "&lt;" and "&gt;" mark the beginning and end of the word.
Based on this idea, fastText can generate embedding vectors even for words that did not appear in the training text, by composing them from the vectors of their character n-grams. For example, the unknown word "googling" still gets a vector representation because it shares subwords with "google" and with other words ending in "ing".
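A simplified sketch of how such subwords can be generated (fastText itself uses a range of n-gram lengths, typically 3 to 6, plus the whole word):
```python
def char_ngrams(word, n=3):
    """Character n-grams with '<' and '>' marking the word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("what"))  # ['<wh', 'wha', 'hat', 'at>']
```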
The following is sample code for a fastText implementation.
```python
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
!unzip wiki-news-300d-1M.vec.zip

import io

def load_vectors(target_word):
    """Return the 300-dimensional fastText vector for target_word from the .vec file."""
    fin = io.open('wiki-news-300d-1M.vec', 'r', encoding='utf-8', newline='\n', errors='ignore')
    for line in fin:
        tokens = line.rstrip().split(' ')
        if target_word == tokens[0]:
            return [float(s) for s in tokens[1:]]

load_vectors('googling')
```
```
[-0.1151,
0.2652,
-0.13,
.
.
.
0.123,
0.2461,
0.0373]
```
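Note that the .vec file above only stores precomputed vectors for words seen in training. To actually compose a vector for an unseen word from its subwords, a subword-aware model is needed; the following is a minimal sketch that trains gensim's FastText on the text8 corpus used earlier (the hyperparameters are illustrative, not from the original article).
```python
import gensim
import gensim.downloader as gendl

# Train a subword-aware FastText model on text8
corpus = gendl.load("text8")
ft_model = gensim.models.FastText(corpus, vector_size=100, window=5, min_count=5, epochs=5)

# Even if "googling" never appeared in the corpus, a vector is composed from its character n-grams
print(ft_model.wv['googling'][:5])
```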
ELMo
Word2vec, GloVe, and fastText all convert each word into exactly one vector, so they cannot distinguish the different senses of a polysemous word. For example, the word "bank" can refer not only to a financial institution but also to a river bank.
ELMo was therefore developed to address the fact that the meaning of a given word depends on its context.
ELMo does not generate a fixed vector representation for a single word. Instead, ELMo considers the entire sentence before generating an embedding for each word in the sentence. This context-sensitive embedding framework generates a vector representation of a word that takes into account the context in which the word is actually used.
ELMo uses a deep bi-directional LSTM language model, which allows it to take into account both the preceding and the following words.
Below is sample code for an ELMo implementation.
```python
!pip install allennlp

from allennlp.modules.elmo import Elmo, batch_to_ids

# Option and weight files for the pretrained ELMo model
options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

# Two output representations, no dropout
elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = ['By working faithfully eight hours a day, you may eventually get to be a boss and work twelve hours a day.'.split(' ')]
character_ids = batch_to_ids(sentences)

embeddings = elmo(character_ids)
print(embeddings['elmo_representations'])
```
```
[tensor([[[ 0.2935, 0.2494, -0.4810, ..., -0.2546, -0.2394, 0.2540],
[ 0.2875, 0.2179, -0.2757, ..., 0.0597, -0.1851, -0.2132],
[-0.4081, 0.1770, 0.5752, ..., -0.3343, 0.4417, 0.6988],
...,
[ 0.2734, 0.0159, -0.1510, ..., -0.0459, -0.0769, 0.8115],
[ 0.0628, -0.0689, -0.3339, ..., -0.1212, -0.0101, 0.0247],
[-0.0577, 0.8521, -0.3685, ..., 0.0323, -0.1151, 0.2783]]],
grad_fn=<CopySlices>), tensor([[[ 0.2935, 0.2494, -0.4810, ..., -0.2546, -0.2394, 0.2540],
[ 0.2875, 0.2179, -0.2757, ..., 0.0597, -0.1851, -0.2132],
[-0.4081, 0.1770, 0.5752, ..., -0.3343, 0.4417, 0.6988],
...,
[ 0.2734, 0.0159, -0.1510, ..., -0.0459, -0.0769, 0.8115],
[ 0.0628, -0.0689, -0.3339, ..., -0.1212, -0.0101, 0.0247],
[-0.0577, 0.8521, -0.3685, ..., 0.0323, -0.1151, 0.2783]]],
grad_fn=<CopySlices>)]
```
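For this pretrained model, each entry of elmo_representations should be a tensor of shape (batch_size, sequence_length, 1024), i.e. one 1024-dimensional, context-dependent vector per token of the input sentence.
```python
# One context-dependent 1024-dimensional vector per token
print(embeddings['elmo_representations'][0].shape)  # e.g. torch.Size([1, 21, 1024])
```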
BERT
BERT is an NLP model published by Google on October 11, 2018. It is not only used for solving tasks such as translation and classification, but can also provide distributed representations (embeddings) of words. Specifically, BERT is pre-trained with a Masked Language Model (MLM) objective, which predicts masked words, and the distributed representations of words are learned in the process.
While ELMo trains its forward and backward language models independently, BERT's MLM conditions on both directions simultaneously. Therefore, in theory, BERT's distributed representations are more accurate than ELMo's.
Below is sample code for a BERT implementation.
```python
!pip install transformers

import torch
from transformers import BertTokenizer, BertModel

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encode the sentence (adding the special [CLS] and [SEP] tokens) and add a batch dimension
input_ids = torch.tensor(tokenizer.encode('The optimist sees the doughnut, the pessimist sees the hole.', add_special_tokens=True)).unsqueeze(0)

# The first element of the output is the last hidden state:
# one context-dependent 768-dimensional vector per token
outputs = model(input_ids)
last_hidden_states = outputs[0]
print(last_hidden_states)
```
```
tensor([[[-0.4030, 0.3356, -0.0636, ..., -0.6573, 0.6247, 0.6181],
[-0.6247, 0.1251, -0.3543, ..., 0.0818, 1.1095, 0.0013],
[-0.1742, -0.0781, 0.4559, ..., -0.3268, 0.2127, 0.1130],
...,
[ 0.3077, 0.2755, -0.1591, ..., 0.3503, 0.3849, 0.6388],
[ 0.6037, 0.0777, -0.2810, ..., 0.1952, -0.3510, -0.3897],
[ 0.7766, 0.1315, -0.1458, ..., 0.1757, -0.4855, -0.3783]]],
grad_fn=<NativeLayerNormBackward0>)
```
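Each row of last_hidden_states is the context-dependent vector of one token (768 dimensions for bert-base-uncased). As a small sketch, and assuming the word "hole" is kept as a single WordPiece token, its embedding can be picked out as follows.
```python
# Map the input ids back to tokens and locate "hole"
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
idx = tokens.index('hole')  # assumes "hole" is a single WordPiece token

# The contextual embedding of "hole" in this sentence
hole_vector = last_hidden_states[0, idx]
print(hole_vector.shape)  # torch.Size([768])
```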
Summary
The following table summarizes the methods of word vector representation.
Method (Major category) | Method (Minor category) | Meaning understanding | Context understanding | Number of dimensions |
---|---|---|---|---|
One-hot representation | One-hot vector | × | × | 1M |
Word embedding without taking context into account | Word2vec, GloVe, fastText | ● | × | 100~300 |
Word embedding taking context into account | ELMo, BERT | ● | ● | 100~1000 |
References