2023-08-30

Chunking in LLM Applications

When developing applications built on large language models (LLMs), it is essential to understand the concept of chunking. Chunking is the process of dividing a large text into smaller segments so that the most relevant content can be retrieved from a vector database.

Methods of Chunking

Several methods of chunking are available:

Fixed-Size Chunking

The text is divided into chunks of a fixed size. This is the simplest and fastest method, but context can be lost where a chunk boundary cuts through related text.

python
from langchain.text_splitter import CharacterTextSplitter

text = "..."  # your text
text_splitter = CharacterTextSplitter(
    separator="\n\n",   # split on paragraph boundaries
    chunk_size=256,     # maximum characters per chunk
    chunk_overlap=20,   # characters shared between adjacent chunks
)
docs = text_splitter.create_documents([text])

Sentence-Based Chunking

The text is split along sentence boundaries. This preserves more context within each chunk, but chunk sizes can be uneven.

python
text = "..."  # your text
docs = text.split(".")  # naive: breaks on abbreviations ("e.g.") and decimal numbers
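
A more robust alternative is a dedicated sentence tokenizer. A minimal sketch using NLTK (my addition, not part of the original snippet), which assumes the nltk package is installed and its punkt tokenizer data downloaded:

python
import nltk

nltk.download("punkt")  # one-time download of the sentence tokenizer model

text = "..."  # your text
# sent_tokenize splits on true sentence boundaries, handling abbreviations
# such as "e.g." that a plain str.split(".") would break on.
docs = nltk.sent_tokenize(text)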

Recursive Chunking

The text is split hierarchically: the splitter tries an ordered list of separators (paragraph breaks, then line breaks, then spaces) and recurses on any piece that is still too large. This tends to keep semantically related text together.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # your text
text_splitter = RecursiveCharacterTextSplitter(
    # A deliberately small chunk size, just for illustration.
    chunk_size=256,
    chunk_overlap=20,
)

docs = text_splitter.create_documents([text])

Specialized Chunking

There are also chunking methods tailored to specific formats such as Markdown or LaTeX; these respect the format's structure (for example, Markdown headings or LaTeX sections) when choosing split points.

python
from langchain.text_splitter import MarkdownTextSplitter
markdown_text = "..."  # your Markdown source

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
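
LangChain provides an analogous splitter for LaTeX that splits along structural markers such as sections and environments. A minimal sketch:

python
from langchain.text_splitter import LatexTextSplitter

latex_text = "..."  # your LaTeX source

latex_splitter = LatexTextSplitter(chunk_size=100, chunk_overlap=0)
docs = latex_splitter.create_documents([latex_text])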

Choosing the Right Chunking Strategy

When selecting the optimal chunking strategy, consider the following factors:

  • Nature of the Content
    Whether the content to be indexed is long-form (e.g., academic papers or books) or short-form (e.g., tweets or chat messages) will influence the choice of model and chunking strategy.

  • Embedding Model in Use
    The choice of embedding model, and the chunk size at which that model performs best, are crucial considerations; most models also impose a hard token limit per input (see the token-counting sketch after this list).

  • User Query Expectations
    Depending on whether user queries are short and specific or long and complex, the chunking method might need to be adapted.

  • Intended Use of Results
    How search results will be utilized (e.g., semantic search, question answering, summarization) also impacts the choice of chunking strategy.
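
For the embedding-model factor above, it helps to check how many tokens each chunk actually consumes for the chosen model. A minimal sketch, assuming the tiktoken package and OpenAI's text-embedding-ada-002 as the embedding model (docs is the output of any splitter above):

python
import tiktoken

# Look up the tokenizer that the embedding model uses.
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

for doc in docs:
    n_tokens = len(enc.encode(doc.page_content))
    # Compare against the model's per-input limit (8,191 tokens for ada-002).
    print(n_tokens)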

