What Is Vector Similarity?
Vector similarity is a measure used to quantify how close two vectors are to each other. In many applications, especially in data science and machine learning, vectors are compared by direction, by magnitude, or by both, and the result of that comparison is treated as their similarity.
Measures of Vector Similarity
I will introduce common measures used to quantify the similarity between vectors. These measures are fundamental in comparing and contrasting data in multidimensional spaces, and they are pivotal in several fields such as natural language processing (NLP), recommendation systems, and image recognition.
Euclidean Distance
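Euclidean distance is one of the most intuitive measures for vector similarity. It calculates the straight-line distance between two points in Euclidean space, so a smaller distance means the two vectors are more similar.
For two points in a 2-dimensional space, $a = (a_1, a_2)$ and $b = (b_1, b_2)$, the Euclidean distance is $d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$.
In an n-dimensional space, for vectors $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$, it generalizes to $d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$.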
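As a rough illustration of the n-dimensional definition, the distance can be sketched in a few lines of Python. The function name `euclidean_distance` and the sample vectors are assumptions made for this example, not part of any particular library.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# 2-dimensional case: the legs 3 and 4 give a hypotenuse of 5.
print(euclidean_distance([0, 0], [3, 4]))        # 5.0
# 3-dimensional case: the same function works for any dimension.
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```

In Python 3.8 and later, the standard library's `math.dist` computes the same quantity for two points of equal dimension.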
Cosine Similarity
Unlike the Euclidean distance, cosine similarity measures the cosine of the angle between two non-zero vectors. It is particularly useful when the magnitude of the vectors is not relevant, and we are more interested in their orientation.
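Cosine similarity is calculated as the dot product of the two vectors divided by the product of their magnitudes: $\cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|}$. The result ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction).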
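The calculation can be sketched as follows. This is a minimal version under the assumption that both inputs are plain Python sequences of numbers; the name `cosine_similarity` is chosen for this example. Zero vectors are rejected because the definition only covers non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Orthogonal vectors have similarity 0.
print(cosine_similarity([1, 0], [0, 1]))
# Same direction, different magnitude: similarity is (approximately) 1.
print(cosine_similarity([1, 2], [2, 4]))
# Opposite directions: similarity is -1.
print(cosine_similarity([3, 0], [-2, 0]))
```

The second call shows why the measure suits cases where magnitude is irrelevant: doubling a vector does not change its similarity to the original, apart from floating-point rounding.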
Dot Product Similarity
Dot product similarity is a measure that calculates the dot product between two vectors. It is closely related to cosine similarity, but while cosine similarity normalizes the result to provide a measure of the angle between vectors, dot product similarity is unbounded and takes both magnitude and direction into account.
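The dot product of two vectors $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$ is $a \cdot b = \sum_{i=1}^{n} a_i b_i = \|a\| \, \|b\| \cos(\theta)$, where $\theta$ is the angle between them.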
Dot product similarity can be positive, negative, or zero. A positive value indicates that the vectors point in a generally similar direction (the angle between them is less than 90 degrees), a negative value indicates that they point in generally opposite directions (the angle is greater than 90 degrees), and a zero value indicates that the vectors are orthogonal.
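In the context of high-dimensional spaces like those common in NLP, the dot product can be thought of as a measure of how many features the vectors have in common, with higher values indicating more common features. This reading is exact for binary vectors, where the dot product counts the positions in which both vectors are 1; for real-valued vectors, each shared feature is weighted by the product of its two values.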
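A small sketch may help make the sign cases and the sensitivity to magnitude concrete. The helper name `dot_product` and the sample vectors are illustrative assumptions.

```python
def dot_product(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


print(dot_product([1, 2], [3, 4]))    # 11  -> generally similar direction
print(dot_product([1, 2], [-3, -4]))  # -11 -> generally opposite direction
print(dot_product([1, 0], [0, 5]))    # 0   -> orthogonal

# Unlike cosine similarity, the result grows with magnitude:
# doubling the first vector doubles the dot product.
print(dot_product([2, 4], [3, 4]))    # 22

# For binary feature vectors, the dot product counts shared features.
print(dot_product([1, 1, 0, 1], [1, 0, 0, 1]))  # 2
```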
Jaccard Similarity
Jaccard similarity is a measure used for comparing the similarity between sets. When vectors are used to represent sets (e.g., binary vectors), Jaccard similarity can be very useful.
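It is defined as the size of the intersection of the sets divided by the size of the union of the sets: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.
For binary vectors $a$ and $b$, where a 1 in position $i$ means that element $i$ belongs to the set, this becomes the number of positions where both vectors are 1 divided by the number of positions where at least one of them is 1: $J(a, b) = \frac{\sum_{i} \min(a_i, b_i)}{\sum_{i} \max(a_i, b_i)}$.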
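The binary-vector case can be sketched like this. The function name `jaccard_similarity` is an assumption for this example, and returning 1.0 when both vectors are all zeros is one possible convention chosen here, since the ratio is otherwise undefined for two empty sets.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two equal-length binary (0/1) vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Positions where both vectors are 1 (set intersection).
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    # Positions where at least one vector is 1 (set union).
    union = sum(1 for x, y in zip(a, b) if x or y)
    if union == 0:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return intersection / union


# Two positions are shared out of three that are set in either vector.
print(jaccard_similarity([1, 1, 0, 1], [1, 0, 0, 1]))  # 2 / 3

# The same measure computed directly on Python sets.
A = {"a", "b", "c"}
B = {"b", "c", "d"}
print(len(A & B) / len(A | B))  # 0.5
```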
References