Chunking in LLM Applications
When developing applications built on large language models (LLMs), it's essential to understand the concept of chunking. Chunking is the process of dividing a large text into smaller segments so that the content retrieved from a vector database is as relevant as possible to a query.
Methods of Chunking
Several methods of chunking are available:
Fixed-Size Chunking
The text is divided into chunks of a fixed size. This is the simplest and fastest method, but context can be lost when a chunk boundary cuts through a sentence or paragraph.
from langchain.text_splitter import CharacterTextSplitter

text = "..."  # your text

# chunk_size and chunk_overlap are measured in characters by default
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])
Sentence-Based Chunking
The text is split at sentence boundaries. This retains more context within each chunk, but chunk sizes can be uneven.
text = "..."  # your text

# Naive approach: split on periods (ignores abbreviations, "?" and "!", etc.)
docs = text.split(".")
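The split(".") call is only a rough illustration. A more robust sketch uses a sentence tokenizer, for example NLTK's sent_tokenize (NLTK is an assumption here; any sentence tokenizer works), and then groups sentences into chunks up to a target size:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the sentence tokenizer model

text = "..."  # your text
sentences = sent_tokenize(text)

# Group consecutive sentences into chunks of at most ~256 characters
# (256 is an arbitrary example value).
chunks, current = [], ""
for sentence in sentences:
    if current and len(current) + len(sentence) > 256:
        chunks.append(current.strip())
        current = ""
    current += " " + sentence
if current.strip():
    chunks.append(current.strip())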
Recursive Chunking
The text is split recursively using a hierarchy of separators (e.g., paragraphs, then lines, then words): if a chunk is still too large after splitting on one separator, the splitter moves on to the next, finer-grained one.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # your text

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])
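By default, RecursiveCharacterTextSplitter falls back through progressively finer separators until the chunks fit. The same hierarchy can be made explicit through the separators parameter; the list below is a sketch mirroring the commonly documented default:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # your text

# Try paragraph breaks first, then line breaks, then spaces,
# then individual characters as a last resort.
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])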
Specialized Chunking
There are also chunking methods tailored to specific formats such as Markdown or LaTeX.
from langchain.text_splitter import MarkdownTextSplitter
markdown_text = "..."
markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
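LaTeX documents can be handled the same way with the dedicated LaTeX splitter, which splits along LaTeX structure such as sections and environments. A minimal sketch, reusing the chunk parameters from the Markdown example:

from langchain.text_splitter import LatexTextSplitter

latex_text = "..."  # your LaTeX source
latex_splitter = LatexTextSplitter(chunk_size=100, chunk_overlap=0)
docs = latex_splitter.create_documents([latex_text])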
Choosing the Right Chunking Strategy
When selecting the optimal chunking strategy, consider the following factors:
- Nature of the Content: Whether the content to be indexed is long-form (e.g., academic papers or books) or short-form (e.g., tweets or chat messages) influences the choice of model and chunking strategy.
- Embedding Model in Use: The embedding model you use, and the chunk size at which it performs best, are crucial considerations (see the token-based sketch after this list).
- User Query Expectations: Depending on whether user queries are short and specific or long and complex, the chunking method may need to be adapted.
- Intended Use of Results: How search results will be used (e.g., semantic search, question answering, summarization) also affects the choice of chunking strategy.
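As an example of matching chunk size to the embedding model, chunks can be measured in tokens rather than characters so they stay within the model's limit. A minimal sketch using LangChain's tiktoken-based splitter (requires the tiktoken package; the 256-token size is an arbitrary example, adjust it to your model):

from langchain.text_splitter import CharacterTextSplitter

text = "..."  # your text

# chunk_size and chunk_overlap are counted in tokens (via tiktoken)
# rather than characters.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=256,
    chunk_overlap=20,
)
docs = text_splitter.create_documents([text])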