Traffine I/O

2023-08-30

Chunking in LLM Applications

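When developing applications built on LLMs (Large Language Models), you need to understand chunking. Chunking is the process of splitting a large text into smaller segments so that the content retrieved from a vector database is as relevant as possible to the query.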

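Chunking methods

Common chunking methods include the following.

Fixed-size chunking

The text is split into chunks of a fixed size. This is the simplest and fastest method, but context can be lost when a chunk boundary falls in the middle of a sentence or paragraph.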

python

from langchain.text_splitter import CharacterTextSplitter

text = "..."  # your text

# Split on blank lines, then merge the pieces into chunks of up to about 256 characters
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])
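
To make the mechanism concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The helper name `fixed_size_chunks` is hypothetical and is not part of LangChain; it cuts purely by character count, whereas `CharacterTextSplitter` first splits on the separator and then merges pieces, so the two will not produce identical output.

```python
def fixed_size_chunks(text: str, chunk_size: int = 256, chunk_overlap: int = 20) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap repeats the tail of each chunk at the head of the next one, which is one way to soften the loss of context at chunk boundaries.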

Sentence-based chunking

The text is split along sentence boundaries. This preserves more context, but the chunk sizes become uneven.

python

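import re
text = "..."  # your text
# Split after sentence-ending punctuation so each sentence keeps its period
docs = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]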
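
One way to reduce the unevenness is to pack consecutive sentences into chunks up to a character budget. The sketch below assumes English-style punctuation followed by whitespace (Japanese text ending in 「。」 would need a different pattern), and the name `sentence_chunks` and the `max_chars` parameter are hypothetical.

```python
import re


def sentence_chunks(text: str, max_chars: int = 256) -> list[str]:
    """Group consecutive sentences into chunks of at most max_chars characters where possible."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # Still fits, or a single oversized sentence that is kept whole
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(sentence_chunks("One. Two is here. Three!", max_chars=17))
# ['One. Two is here.', 'Three!']
```

A sentence longer than `max_chars` is deliberately left intact rather than cut, so the budget is a target and not a hard limit.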

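Recursive chunking

The text is split hierarchically and iteratively into smaller chunks: it is first split on a coarse separator, and any piece that is still too large is split again on a finer one.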

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # your text

text_splitter = RecursiveCharacterTextSplitter(
    # A small chunk size, chosen only for demonstration.
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])
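
The recursive idea can be sketched without any library. The function below is a simplified illustration, not LangChain's implementation: the name `recursive_split` and the separator order (paragraph, line, word) are assumptions, and it does not apply any overlap.

```python
def recursive_split(text: str, chunk_size: int = 256,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing with finer ones only where needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, *rest = separators
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = f"{current}{head}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaaa bbbb\n\ncccc dddd eeee", chunk_size=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
```

In the example, the first paragraph fits as a whole, while the second is too long and is therefore split again on spaces.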

Specialized chunking

There are also chunking methods tailored to specific formats such as Markdown and LaTeX.

python

from langchain.text_splitter import MarkdownTextSplitter
markdown_text = "..."

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
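
As an illustration of what format awareness means, the following sketch splits Markdown at heading lines so that each chunk is one section. It is a simplified stand-in, not how `MarkdownTextSplitter` works internally: the name `split_markdown_by_heading` is hypothetical, and it only recognizes `#`-style headings and backtick code fences.

```python
import re

FENCE = "`" * 3  # marker that opens and closes a fenced code block


def split_markdown_by_heading(markdown_text: str) -> list[str]:
    """Split Markdown into sections, each starting at a heading line (#, ##, ...)."""
    sections, current = [], []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        # A "# ..." line inside a code block is a comment, not a heading
        is_heading = not in_fence and re.match(r"#{1,6}\s", line) is not None
        if is_heading and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


print(split_markdown_by_heading("# A\nintro\n\n## B\nbody\n"))
# ['# A\nintro', '## B\nbody']
```

Sections produced this way can still be long, so in practice they could be passed through a size-based splitter afterwards.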

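Which chunking strategy should you choose?

When selecting a chunking strategy, consider the following factors.

Nature of the content
Whether the content to be indexed is long (for example, academic papers or books) or short (for example, tweets or chat messages) affects which model and chunking strategy you should choose.
Embedding model
It also matters which embedding model is used and at what chunk size that model performs best.
Expected user queries
Depending on whether user queries are short and specific or long and complex, the chunking method should be adapted accordingly.
Use of the results
How the retrieved results will be used (for example, semantic search, question answering, or summarization) also influences the choice of chunking strategy.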
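
Before committing to a setting, it can help to look at how a sample of your own content breaks up at a few candidate sizes. The sketch below only reports chunk counts and length statistics; it does not measure retrieval quality, which would require an embedding model and test queries. The helpers `chunk_words` and `summarize_chunk_sizes` are hypothetical names, and the word-based splitter is a placeholder for whichever method you are evaluating.

```python
from statistics import mean


def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def summarize_chunk_sizes(text: str, candidate_sizes: list[int]) -> dict[int, dict[str, float]]:
    """Report chunk count and mean/min/max length for each candidate chunk size."""
    summary = {}
    for size in candidate_sizes:
        lengths = [len(c) for c in chunk_words(text, size)]
        summary[size] = {
            "count": len(lengths),
            "mean": mean(lengths) if lengths else 0.0,
            "min": min(lengths, default=0),
            "max": max(lengths, default=0),
        }
    return summary


sample = "..."  # a representative sample of your content
for size, stats in summarize_chunk_sizes(sample, [128, 256, 512]).items():
    print(size, stats)
```

Many very short chunks suggest the size is too small for the content, while a handful of chunks near the limit suggests each one may mix several topics.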


Ryusei Kakujo
