Chunking in LLM Applications
When developing applications built on large language models (LLMs), it's essential to understand the concept of chunking. Chunking is the process of dividing a large text into smaller segments so that the content retrieved from a vector database is as relevant as possible to a query.
Methods of Chunking
Several methods of chunking are available:
Fixed-Size Chunking
The text is divided into chunks of a fixed size. This is the simplest and fastest method, but context can be lost when a chunk boundary cuts through a sentence or paragraph.
from langchain.text_splitter import CharacterTextSplitter

text = "..."  # your text

# chunk_size and chunk_overlap are measured in characters by default
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])
Sentence-Based Chunking
The text is split at sentence boundaries. This retains more context within each chunk, but chunk sizes can be uneven.
text = "..."  # your text

# Naive approach: split on periods (ignores abbreviations, "?" and "!", etc.)
docs = text.split(".")
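The split(".") call is only a rough illustration. A more robust sketch uses a sentence tokenizer, for example NLTK's sent_tokenize (NLTK is an assumption here; any sentence tokenizer works), and then groups sentences into chunks up to a target size:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the sentence tokenizer model

text = "..."  # your text
sentences = sent_tokenize(text)

# Group consecutive sentences into chunks of at most ~256 characters
# (256 is an arbitrary example value).
chunks, current = [], ""
for sentence in sentences:
    if current and len(current) + len(sentence) > 256:
        chunks.append(current.strip())
        current = ""
    current += " " + sentence
if current.strip():
    chunks.append(current.strip())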
Recursive Chunking
The text is split recursively using a hierarchy of separators (e.g., paragraphs, then lines, then words): if a chunk is still too large after splitting on one separator, the splitter moves on to the next, finer-grained one.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # your text

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])
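By default, RecursiveCharacterTextSplitter falls back through progressively finer separators until the chunks fit. The same hierarchy can be made explicit through the separators parameter; the list below is a sketch mirroring the commonly documented default:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # your text

# Try paragraph breaks first, then line breaks, then spaces,
# then individual characters as a last resort.
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])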
Specialized Chunking
There are also chunking methods tailored to specific formats such as Markdown or LaTeX.
from langchain.text_splitter import MarkdownTextSplitter
markdown_text = "..."
markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
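LaTeX documents can be handled the same way with the dedicated LaTeX splitter, which splits along LaTeX structure such as sections and environments. A minimal sketch, reusing the chunk parameters from the Markdown example:

from langchain.text_splitter import LatexTextSplitter

latex_text = "..."  # your LaTeX source
latex_splitter = LatexTextSplitter(chunk_size=100, chunk_overlap=0)
docs = latex_splitter.create_documents([latex_text])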
Choosing the Right Chunking Strategy
When selecting the optimal chunking strategy, consider the following factors:
- Nature of the Content: Whether the content to be indexed is long-form (e.g., academic papers or books) or short-form (e.g., tweets or chat messages) influences the choice of model and chunking strategy.
- Embedding Model in Use: The embedding model you use, and the chunk size at which it performs best, are crucial considerations (see the token-based sketch after this list).
- User Query Expectations: Depending on whether user queries are short and specific or long and complex, the chunking method may need to be adapted.
- Intended Use of Results: How search results will be used (e.g., semantic search, question answering, summarization) also affects the choice of chunking strategy.
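As an example of matching chunk size to the embedding model, chunks can be measured in tokens rather than characters so they stay within the model's limit. A minimal sketch using LangChain's tiktoken-based splitter (requires the tiktoken package; the 256-token size is an arbitrary example, adjust it to your model):

from langchain.text_splitter import CharacterTextSplitter

text = "..."  # your text

# chunk_size and chunk_overlap are counted in tokens (via tiktoken)
# rather than characters.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])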