2023-02-04

Vector Similarity

What Is Vector Similarity?

Vector similarity is a measure that quantifies how closely two or more vectors resemble each other. In many applications, especially in data science and machine learning, it is crucial to compare vectors in terms of their direction, their magnitude, or both; this comparison is what we mean by similarity.

Measures of Vector Similarity

Below, I introduce the common measures used to quantify similarity between vectors. These measures are fundamental for comparing data in multidimensional spaces and are pivotal in fields such as natural language processing (NLP), recommendation systems, and image recognition.

Euclidean Distance

Euclidean distance is one of the most intuitive measures for vector similarity. It calculates the straight-line distance between two points in Euclidean space.

For two points in 2-dimensional space, P(x_1, y_1) and Q(x_2, y_2), the Euclidean distance is calculated using the Pythagorean theorem:

D_{euclidean} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}

In an n-dimensional space, for vectors a and b, the Euclidean distance is generalized as:

D_{euclidean}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
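
As a quick illustration, here is a minimal Python sketch of the n-dimensional formula using NumPy (the example vectors are arbitrary values chosen for the demo):

import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Square the element-wise differences, sum them, and take the square root.
    return float(np.sqrt(np.sum((a - b) ** 2)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(euclidean_distance(a, b))  # 5.0

In practice, np.linalg.norm(a - b) computes the same quantity in a single call.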

Cosine Similarity

Unlike Euclidean distance, cosine similarity measures the cosine of the angle between two non-zero vectors. It is particularly useful when the magnitude of the vectors is irrelevant and we are interested only in their orientation.

Cosine similarity is calculated as the dot product of the two vectors divided by the product of the magnitudes of each vector.

\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \times ||\mathbf{b}||} = \frac{\sum_{i=1}^{n} a_i \cdot b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \times \sqrt{\sum_{i=1}^{n} b_i^2}}
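
The following sketch implements this formula directly with NumPy; the second vector is deliberately a scaled copy of the first to show that magnitude does not affect the result (example values are arbitrary):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector magnitudes (L2 norms).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude
print(cosine_similarity(a, b))  # 1.0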

Dot Product Similarity

Dot product similarity is a measure that calculates the dot product between two vectors. It is closely related to cosine similarity: while cosine similarity normalizes by the vector magnitudes to measure only the angle between vectors, the dot product is unbounded and reflects both magnitude and direction.

The dot product of two vectors a and b is calculated as:

\text{dot\_product}(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{n} a_i \cdot b_i

Dot product similarity can be positive, negative, or zero. A positive value indicates that the vectors point in a generally similar direction, a negative value indicates that they point in opposite directions, and a zero value indicates that the vectors are orthogonal.

In high-dimensional spaces such as those common in NLP, the dot product can be read as a rough measure of how many features two vectors share, with higher values indicating more features in common.
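
A small sketch with hand-picked 2-dimensional vectors makes the three sign cases concrete:

import numpy as np

a = np.array([1.0, 2.0])
print(np.dot(a, np.array([2.0, 1.0])))    #  4.0 -> broadly similar direction
print(np.dot(a, np.array([-1.0, -2.0])))  # -5.0 -> opposite direction
print(np.dot(a, np.array([2.0, -1.0])))   #  0.0 -> orthogonal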

Jaccard Similarity

Jaccard similarity measures the resemblance between sets. When vectors are used to represent sets (e.g., binary vectors with one element per possible item), Jaccard similarity is a natural choice.

It is defined as the size of the intersection of the sets divided by the size of the union of the sets.

J(\mathbf{A}, \mathbf{B}) = \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}

For binary vectors a and b, it can also be expressed as:

J(\mathbf{a}, \mathbf{b}) = \frac{\sum_{i=1}^{n} \min(a_i, b_i)}{\sum_{i=1}^{n} \max(a_i, b_i)}
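
Here is a minimal sketch of the binary-vector form, where each position marks the presence (1) or absence (0) of an item (the example vectors are arbitrary):

import numpy as np

def jaccard_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # For binary vectors, the element-wise min counts the intersection
    # and the element-wise max counts the union.
    return float(np.minimum(a, b).sum() / np.maximum(a, b).sum())

a = np.array([1, 1, 0, 1])
b = np.array([1, 0, 0, 1])
print(jaccard_similarity(a, b))  # 0.666... (intersection 2, union 3)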

References

https://www.pinecone.io/learn/vector-similarity/
https://www.learndatasci.com/glossary/jaccard-similarity/
