2023-01-21

N-grams

What are N-grams

The world of natural language processing (NLP) is rich and diverse, offering a plethora of techniques and approaches to help us understand and analyze text. Among these, n-grams have emerged as a fundamental tool for studying language patterns and predicting linguistic sequences. N-grams are contiguous sequences of n items from a given sample of text or speech, where n is a positive integer. Typically, the items in question are either characters or words.

In essence, n-grams can be thought of as building blocks of language, representing various combinations of characters, words, or phrases. By studying these patterns, we can gain insights into the structure and nature of languages. N-grams have found widespread use in various NLP applications, ranging from text generation and language identification to sentiment analysis and plagiarism detection.

N-gram Models in NLP

N-grams have become an essential tool in the study of language patterns and the development of natural language processing models. In this article, I will explore the different types of n-gram models and their specific applications in NLP, providing examples to illustrate their use.

Character N-grams

Character n-grams are sequences of n contiguous characters taken from a given text. These models are particularly useful in the analysis of morphological structures and the identification of language-specific patterns at the character level. Character n-grams have been applied to tasks such as language identification, authorship attribution, and text compression.

Given the text "language processing," the 3-character n-grams (also called trigrams) would be:

_ represents a space character.

Word N-grams

Word n-grams consist of n contiguous words in a text, making them suitable for capturing dependencies between adjacent words in a sentence. Word n-grams play a central role in the development of statistical language models, which can be used to estimate the probability of a word given its context. These models are widely employed in a variety of NLP tasks, including text generation, machine translation, and speech recognition.

Given the sentence "The quick brown fox jumps over the lazy dog," the 3-word n-grams (also called trigrams) would be:

The quick brown
quick brown fox
brown fox jumps
fox jumps over
jumps over the
over the lazy
the lazy dog

Syntactic N-grams

Syntactic n-grams extend the concept of word n-grams by considering the syntactic relationships between words in a sentence. These models are built upon the syntactic structure of a text, such as its dependency or constituency parse tree. By incorporating syntax, syntactic n-grams provide a more nuanced understanding of language, making them particularly useful for tasks that require a deeper analysis of sentence structure, like sentiment analysis and information extraction.

Given the sentence "The cat chased the mouse," a syntactic n-gram based on the dependency parse tree might be:

chased_det(cat, The)
chased_nsubj(chased, cat)
chased_det(mouse, the)
chased_dobj(chased, mouse)

The syntactic n-gram captures the relationships between the words and their syntactic roles, such as the determiner (det), subject (nsubj), and object (dobj).

N	N-gram Name	Example	Description
1	Unigram	dog	Single character or word
2	Bigram	lazy dog	Sequence of two characters/words
3	Trigram	the lazy dog	Sequence of three characters/words
4	4-gram	over the lazy dog	Sequence of four characters/words
5	5-gram	...	Sequence of five characters/words
...	...	...	...
N	N-gram	...	Sequence of N characters/words

N-grams

What are N-grams

N-gram Models in NLP

Character N-grams

Word N-grams

Syntactic N-grams

N-gram Terminology

References

What is Bag of Words (BoW)

NLP with NLTK

Ryusei Kakujo