Traffine I/O

English

2023-08-31

What is Approximate Nearest Neighbors (ANN)

Vector Database

Vector Database

What is Approximate Nearest Neighbors (ANN)

Approximate Nearest Neighbors (ANN) is a technique used to approximately find the nearest neighbors of high-dimensional vector data, such as embeddings of text or images, without performing exact calculations.

In high-dimensional vector data, calculating the distance to every data point to find the nearest one can become computationally expensive. Particularly in large datasets or real-time applications, this computational cost can pose a challenge. ANN employs fast and efficient algorithms to swiftly locate the nearest neighbors, allowing for practical-speed neighbor searches. Such algorithms find application in various fields, including similar image retrieval and recommendation systems.

This article introduces the following ANN algorithms:

LSH (Locality-Sensitive Hashing)
HNSW (Hierarchical Navigable Small World)
IVF (Inverted File)

LSH (Locality-Sensitive Hashing)

LSH uses specialized hash functions that ensure similar items are hashed into the same "bucket" with high probability.

LSH Algorithm

The LSH algorithm consists of the following key steps:

Hash Function Selection: Choose hash functions that maximize the probability of similar items being placed in the same bucket.
Data Hashing: Assign each data point to a bucket using the hash functions.
Search within Buckets: Search for data points assigned to the same bucket as the query point.
Candidate Refinement: Select the closest data point from the candidates obtained through bucket-based search using further calculations.

HNSW (Hierarchical Navigable Small World)

HNSW represents data using a multi-layered graph structure to efficiently search for nearby data points from query points.

HNSW Algorithm

The HNSW algorithm is composed of the following main steps:

Graph Construction: Build a multi-layered graph with data points as vertices.
Edge Assignment: Assign edges between close data points to enable efficient exploration on the graph.
Hierarchical Search: Begin search from the upper layers of the graph, gradually moving down to find the nearest data point to the query point.

IVF (Inverted File)

Inverted File (IVF) is an efficient method for searching high-dimensional data through clustering.

IVF Algorithm

The basic IVF algorithm is structured as follows:

Clustering: Divide data points into several clusters and assign IDs to each cluster.
Indexing: Store the relationship between cluster IDs and individual data points in an index (inverted file).
Search: When a query is given, identify the nearest cluster and perform detailed distance calculations within that cluster to find the nearest neighbors.

Comparison of Each Method

Speed and Accuracy

LSH: Offers fast search, but often has lower accuracy compared to other methods.
HNSW: Balances high accuracy with efficiency, but may require higher memory usage.
IVF: Enables fast search through indexing, but accuracy can be unstable due to clustering quality.

Memory Usage

LSH: Generally exhibits good memory efficiency due to the use of hash tables.
HNSW: Advanced graph structures may lead to higher memory usage.
IVF: Memory usage increases with data volume due to the need to store index information.

Implementation Complexity

LSH: Algorithm is simple and can be easily implemented in many programming languages.
HNSW: Implementation is somewhat challenging due to the complex graph structure.
IVF: Data structure is relatively simple, but efficient clustering and indexing are required.

References

Context Size of LLM and Vector Database

Pinecone

AlloyDB

Amazon Cognito

Amazon EC2

Amazon ECS

Amazon QuickSight

Amazon QuickSight

Amazon RDS

Amazon Redshift

Amazon Redshift

Amazon S3

API

Autonomous Vehicle

Autonomous Vehicle

AWS

AWS API Gateway

AWS API Gateway

AWS Chalice

AWS Control Tower

AWS Control Tower

AWS IAM

AWS Lambda

AWS VPC

BERT

BigQuery

Causal Inference

Causal Inference

ChatGPT

Chrome Extension

Chrome Extension

CircleCI

Classification

Cloud Functions

Cloud Functions

Cloud IAM

Cloud Run

Cloud Storage

Clustering

CSS

Data Engineering

Data Engineering

Data Modeling

Database

dbt

Decision Tree

Deep Learning

Descriptive Statistics

Descriptive Statistics

Differential Equation

Differential Equation

Dimensionality Reduction

Dimensionality Reduction

Discrete Choice Model

Discrete Choice Model

Docker

Economics

FastAPI

Firebase

GIS

git

GitHub

GitHub Actions

Google

Google Cloud

Google Search Console

Google Search Console

Hugging Face

Hypothesis Testing

Hypothesis Testing

Inferential Statistics

Inferential Statistics

Interval Estimation

Interval Estimation

JavaScript

Jinja

Kedro

Kubernetes

LightGBM

Linux

LLM

Mac

Machine Learning

Machine Learning

Macroeconomics

Marketing

Mathematical Model

Mathematical Model

Meltano

MLflow

MLOps

MySQL

NextJS

NLP

Nodejs

NoSQL

ONNX

OpenAI

Optimization Problem

Optimization Problem

Optuna

Pandas

Pinecone

PostGIS

PostgreSQL

Probability Distribution

Probability Distribution

Product

Project

Psychology

Python

PyTorch

QGIS

ReactJS

Regression

Rideshare

SEO

Singer

sklearn

Slack

Snowflake

Software Development

Software Development

SQL

Statistical Model

Statistical Model

Statistics

Streamlit

Tabular

Tailwind CSS

TensorFlow

Terraform

Transportation

TypeScript

Urban Planning

Vector Database

Vector Database

Vertex AI

VSCode

XGBoost

Ryusei Kakujo

researchgate

github

Weave the future of cities through data

Transportation modeling/ Urban planning/ Machine learning/ Computer science/ GIS