2023-06-11

Pinecone Sparse-Dense Vectors

Pinecone is a vector database that specializes in handling high-dimensional vector data, designed to offer advanced search capabilities. A distinctive feature of Pinecone is its support for sparse-dense vectors, which enable a hybrid approach combining keyword and semantic search.

Semantic search and keyword search have different strengths. Semantic search, which uses dense vectors, excels at returning similar results even without exact keyword matches. Keyword search, which uses sparse vectors, can provide highly relevant results when exact keyword matches are present. The sparse-dense vectors feature in Pinecone harnesses the strengths of both approaches, enhancing the relevance of search results even for out-of-domain queries.

Sparse Versus Dense Vectors in Pinecone

Pinecone supports the use of dense vectors, which are numerical representations of semantic meaning. Dense vectors can enable semantic search, returning the most similar results based on a specific distance metric, even if no exact keyword matches are present. The dense vectors are often generated by embedding models like SBERT (Sentence-BERT), which create meaningful representations of textual data.

In contrast to dense vectors, sparse vectors are high-dimensional data structures where only a small proportion of values are non-zero. In the context of keyword search, each sparse vector represents a document; the dimensions represent words from a dictionary, and the non-zero values represent the importance of these words in the document.

Keyword search algorithms, such as the BM25 algorithm, leverage sparse vectors to compute the relevance of text documents. These algorithms evaluate the number of keyword matches, their frequency, and other factors to deliver relevant search results.
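To make the ranking idea concrete, here is a minimal sketch of BM25 scoring over pre-tokenized documents. The function name `bm25_scores`, the toy corpus, and the parameter defaults `k1=1.5` and `b=0.75` are illustrative assumptions (common textbook choices), not anything Pinecone prescribes.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "pinecone stores dense vectors".split(),
    "keyword search uses sparse vectors".split(),
    "bm25 ranks documents by keyword matches".split(),
]
scores = bm25_scores("sparse keyword".split(), docs)
```

The second document matches both query terms and therefore scores highest, the third matches only "keyword", and the first matches neither and scores zero.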

By combining the strengths of dense and sparse vectors, Pinecone provides a robust and versatile vector database that can accommodate a variety of search scenarios, enhancing both the precision and recall of search results.

Steps in Creating and Using Sparse-Dense Vectors

Here is the typical workflow when using sparse-dense vectors in Pinecone:

  1. Create dense vectors using an external embedding model. This usually involves a pre-processing step where raw data, such as text, is converted into numerical representations using models like SBERT.

  2. Create sparse vectors using an external model. These representations often incorporate TF-IDF or other keyword frequency measures to create a high-dimensional representation where each dimension corresponds to a specific word or feature.

  3. Create an index that supports sparse-dense vectors (s1 or p1 with the dotproduct metric). This index will store your vectors and enable efficient similarity search.

  4. Upsert dense and sparse vectors to your index. This involves adding your vectors to the index, where they are stored and made available for querying.

  5. Search the index using sparse-dense vectors. The query includes both sparse and dense vector values.

  6. Pinecone returns sparse-dense vectors. The results of the query will be a ranked list of vectors from your index, scored by their similarity to the query vector.

Creating Sparse Vector Embeddings

Pinecone indexes accept sparse vectors (indices and values) rather than raw documents. Pinecone does not generate sparse vectors for you, so it is your responsibility to produce the sparse vector that represents each document.

Representation of Documents with Sparse Vectors

To effectively implement keyword-aware semantic search, each document needs to be represented as a vector. The vectorization process typically involves extracting keywords and other features from the document, quantifying their importance, and encoding these into a high-dimensional vector where each dimension corresponds to a feature.
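As one possible illustration of this vectorization step, the sketch below encodes a tokenized document as TF-IDF weights in index/value form. The helpers `build_vocab` and `tfidf_sparse` are hypothetical names, the whitespace tokenization is a simplification, and the code assumes the document being encoded is part of the corpus used to build the vocabulary.

```python
import math
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign each distinct token a fixed dimension index."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def tfidf_sparse(doc, tokenized_docs, vocab):
    """Encode one tokenized document as TF-IDF weights in index/value form."""
    n_docs = len(tokenized_docs)
    tf = Counter(doc)
    indices, values = [], []
    for token, count in sorted(tf.items(), key=lambda kv: vocab[kv[0]]):
        df = sum(1 for d in tokenized_docs if token in d)
        # The +1 keeps a non-zero weight for terms found in every document.
        idf = math.log(n_docs / df) + 1.0
        indices.append(vocab[token])
        values.append(count / len(doc) * idf)
    return {"indices": indices, "values": values}


# Hypothetical two-document corpus.
corpus = [
    "hybrid search combines keyword and semantic search".split(),
    "semantic search uses dense vectors".split(),
]
vocab = build_vocab(corpus)
sparse_doc = tfidf_sparse(corpus[0], corpus, vocab)
```

Each distinct word in the document becomes one non-zero dimension, and words that appear in fewer documents (such as "hybrid") receive a larger weight than words shared across the corpus (such as "semantic").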

Pinecone's Support for Sparse Vector Sizes

Pinecone supports sparse vectors with up to 1000 non-zero values. This leaves considerable room for representing documents, even when they contain a large number of distinct features or keywords.
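If a document produces more non-zero weights than the limit allows, one option is to keep only the heaviest entries before upserting. The sketch below is an assumption about how you might do that on the client side; `truncate_sparse` and `MAX_NONZERO` are hypothetical names, not part of the Pinecone client.

```python
MAX_NONZERO = 1000  # limit stated in the text above


def truncate_sparse(sparse, max_nonzero=MAX_NONZERO):
    """Keep the max_nonzero entries with the largest absolute weight."""
    pairs = [(i, v) for i, v in zip(sparse["indices"], sparse["values"]) if v != 0]
    if len(pairs) > max_nonzero:
        pairs = sorted(pairs, key=lambda p: abs(p[1]), reverse=True)[:max_nonzero]
    pairs.sort(key=lambda p: p[0])  # restore index order for readability
    return {"indices": [i for i, _ in pairs], "values": [v for _, v in pairs]}


# Hypothetical oversized vector: 1500 non-zero weights, increasing with the index.
big = {"indices": list(range(1500)), "values": [float(i + 1) for i in range(1500)]}
small = truncate_sparse(big)
```

Dropping the smallest weights discards the least informative keywords first, at the cost of losing some recall for rare query terms.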

Sparse and Dense Values in Vectors

When creating vectors in Pinecone, each vector can consist of both dense and sparse values. The dense values represent the semantic meaning of the content, while the sparse values represent specific keyword information. Note, however, that Pinecone does not support vectors with only sparse values.

Sparse-Dense Queries

To query with sparse-dense vectors, you provide a query vector that contains both sparse and dense values. Pinecone scores each stored vector by the dot product over the entire vector. In other words, the score is the sum of two terms: the dot product of the vector's dense values with the dense part of the query, and the dot product of its sparse values with the sparse part of the query.
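The scoring rule described above can be sketched in plain Python. This is only an illustration of the arithmetic, not Pinecone's implementation; `hybrid_dot` and the sample vectors are hypothetical.

```python
def hybrid_dot(query_dense, query_sparse, doc_dense, doc_sparse):
    """Dense dot product plus sparse dot product over shared indices."""
    dense_part = sum(q * d for q, d in zip(query_dense, doc_dense))
    # Only indices present in both sparse vectors contribute to the sum.
    doc_lookup = dict(zip(doc_sparse["indices"], doc_sparse["values"]))
    sparse_part = sum(
        v * doc_lookup.get(i, 0.0)
        for i, v in zip(query_sparse["indices"], query_sparse["values"])
    )
    return dense_part + sparse_part


query_dense = [0.1, 0.2, 0.3]
query_sparse = {"indices": [10, 45, 16], "values": [0.5, 0.5, 0.2]}
doc_dense = [0.2, 0.3, 0.4]
doc_sparse = {"indices": [10, 40, 16], "values": [0.4, 0.5, 0.2]}
score = hybrid_dot(query_dense, query_sparse, doc_dense, doc_sparse)
```

Here the dense part contributes about 0.20 and the sparse part about 0.24 (indices 10 and 16 overlap, index 45 does not), so the total score is roughly 0.44.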

Representation of Sparse Values in Pinecone

Pinecone represents sparse values as a dictionary of two arrays:

  • indices: the positions of non-zero values in the vector
  • values: the non-zero values themselves

Pass this dictionary as the sparse_values field of a vector when upserting to store a sparse-dense vector.
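If your sparse model produces a mapping from dimension index to weight, a small helper can turn it into the two-array form. The sketch below is one way to do it; `sparse_from_mapping` is a hypothetical name, and the checks it performs are sanity checks chosen for illustration rather than documented Pinecone constraints.

```python
def sparse_from_mapping(weights):
    """Convert {dimension_index: weight} into parallel indices/values lists."""
    indices, values = [], []
    for index, weight in sorted(weights.items()):
        if not isinstance(index, int) or index < 0:
            raise ValueError(f"invalid index: {index!r}")
        if weight == 0:
            continue  # zeros carry no information in a sparse vector
        indices.append(index)
        values.append(float(weight))
    return {"indices": indices, "values": values}


# The entry for index 99 is dropped because its weight is zero.
sparse_values = sparse_from_mapping({45: 0.5, 10: 0.5, 16: 0.2, 99: 0.0})
```

The two lists stay aligned position by position, so the value at position n always belongs to the index at position n.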

Upserting Vectors with Sparse and Dense Values

To upsert vectors that contain both sparse and dense values, pass them to the index's upsert method. The following Python example adds two vectors to an existing index, each with an id, dense values, metadata, and sparse_values.

python
import pinecone

# Assumes pinecone.init(api_key=..., environment=...) has already been called.
index = pinecone.Index('example-index')

upsert_response = index.upsert(
    vectors=[
        {
            'id': 'vec1',
            'values': [0.1, 0.2, 0.3],         # dense values
            'metadata': {'genre': 'drama'},
            'sparse_values': {                  # sparse values
                'indices': [10, 45, 16],
                'values': [0.5, 0.5, 0.2]
            }
        },
        {
            'id': 'vec2',
            'values': [0.2, 0.3, 0.4],
            'metadata': {'genre': 'action'},
            'sparse_values': {
                'indices': [15, 40, 11],
                'values': [0.4, 0.5, 0.2]
            }
        }
    ],
    namespace='example-namespace'
)

Querying an Index Using a Sparse-Dense Vector

To query a Pinecone index with a sparse-dense vector, provide a dense vector and a sparse vector in the same query. Pinecone returns the vectors in your index that are most similar to the query, ranked by the combined dense and sparse dot product.

The following example queries an index using a sparse-dense vector.

python
query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=[0.1, 0.2, 0.3], # dense vector
    sparse_vector={ # sparse vector
        'indices': [10, 45, 16],
        'values': [0.5, 0.5, 0.2]
    }
)

Sparse-Dense Weighting

Pinecone's index treats a sparse-dense vector as a single vector, so it has no built-in way to weight the dense part of a query against the sparse part. The index is agnostic to which coordinates are dense and which are sparse. You can still apply a linear weighting by scaling the query vector yourself before querying.

The following function scales the dense values by an alpha parameter and the sparse values by (1 - alpha).

python
def hybrid_score_norm(dense, sparse, alpha: float):
    """Hybrid score using a convex combination

    alpha * dense + (1 - alpha) * sparse

    Args:
        dense: array of floats representing the dense vector
        sparse: a dict of `indices` and `values`
        alpha: scale between 0 and 1
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # Scale the sparse values by (1 - alpha); the indices are unchanged.
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    # Scale the dense values by alpha.
    return [v * alpha for v in dense], hs

The following example applies this function to a query vector, then queries a Pinecone index with the transformed vector.

python
sparse_vector = {
    'indices': [10, 45, 16],
    'values': [0.5, 0.5, 0.2]
}
dense_vector = [0.1, 0.2, 0.3]

hdense, hsparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=0.75)

query_response = index.query(
    namespace="example-namespace",
    top_k=10,
    vector=hdense,
    sparse_vector=hsparse
)
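To see why scaling the query is enough, the sketch below checks locally that the dot product with a scaled query equals alpha times the dense score plus (1 - alpha) times the sparse score. It runs without a Pinecone index; `scale_query` is a local copy of the scaling logic above, and `dot_parts` and the sample document vector are hypothetical.

```python
def scale_query(dense, sparse, alpha):
    """Same scaling as hybrid_score_norm: alpha on dense, (1 - alpha) on sparse."""
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return [v * alpha for v in dense], scaled_sparse


def dot_parts(q_dense, q_sparse, d_dense, d_sparse):
    """Return the dense and sparse dot products separately."""
    lookup = dict(zip(d_sparse["indices"], d_sparse["values"]))
    dense = sum(q * d for q, d in zip(q_dense, d_dense))
    sparse = sum(
        v * lookup.get(i, 0.0)
        for i, v in zip(q_sparse["indices"], q_sparse["values"])
    )
    return dense, sparse


q_dense = [0.1, 0.2, 0.3]
q_sparse = {"indices": [10, 45, 16], "values": [0.5, 0.5, 0.2]}
d_dense = [0.2, 0.3, 0.4]
d_sparse = {"indices": [15, 40, 16], "values": [0.4, 0.5, 0.2]}

alpha = 0.75
dense_raw, sparse_raw = dot_parts(q_dense, q_sparse, d_dense, d_sparse)
h_dense, h_sparse = scale_query(q_dense, q_sparse, alpha)
dense_w, sparse_w = dot_parts(h_dense, h_sparse, d_dense, d_sparse)
```

With alpha = 1 the sparse part drops out and the query behaves like a purely semantic search; with alpha = 0 only the keyword part contributes.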

References

https://docs.pinecone.io/docs/hybrid-search

Ryusei Kakujo
