What Are N-grams?
The world of natural language processing (NLP) is rich and diverse, offering a plethora of techniques and approaches to help us understand and analyze text. Among these, n-grams have emerged as a fundamental tool for studying language patterns and predicting linguistic sequences. N-grams are contiguous sequences of n items from a given sample of text or speech, where n is a positive integer. Typically, the items in question are either characters or words.
In essence, n-grams can be thought of as building blocks of language, representing various combinations of characters, words, or phrases. By studying these patterns, we can gain insights into the structure and nature of languages. N-grams have found widespread use in various NLP applications, ranging from text generation and language identification to sentiment analysis and plagiarism detection.
N-gram Models in NLP
N-grams come in several varieties depending on the unit of analysis and the structure considered. In this article, I will explore the main types of n-gram models and their specific applications in NLP, providing examples to illustrate their use.
Character N-grams
Character n-grams are sequences of n contiguous characters taken from a given text. These models are particularly useful in the analysis of morphological structures and the identification of language-specific patterns at the character level. Character n-grams have been applied to tasks such as language identification, authorship attribution, and text compression.
Given the text "language processing," the 3-character n-grams (also called trigrams) would be:
- lan
- ang
- ngu
- gua
- uag
- age
- ge_
- e_p
- _pr
- pro
- roc
- oce
- ces
- ess
- ssi
- sin
- ing
The underscore (_) represents a space character.
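Extracting character n-grams takes only a sliding window over the text. A minimal sketch (the function name `char_ngrams` is my own), which reproduces the trigram list above:

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of the text.

    Spaces are replaced with underscores to make them visible,
    matching the convention used in the list above.
    """
    text = text.replace(" ", "_")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = char_ngrams("language processing", 3)
print(grams)  # ['lan', 'ang', 'ngu', ..., 'sin', 'ing'] -- 17 trigrams
```

A string of length L yields L - n + 1 n-grams, so the 19-character input produces 17 trigrams.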
Word N-grams
Word n-grams consist of n contiguous words in a text, making them suitable for capturing dependencies between adjacent words in a sentence. Word n-grams play a central role in the development of statistical language models, which can be used to estimate the probability of a word given its context. These models are widely employed in a variety of NLP tasks, including text generation, machine translation, and speech recognition.
Given the sentence "The quick brown fox jumps over the lazy dog," the 3-word n-grams (also called trigrams) would be:
- The quick brown
- quick brown fox
- brown fox jumps
- fox jumps over
- jumps over the
- over the lazy
- the lazy dog
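Word n-grams can be extracted the same way, splitting on whitespace instead of iterating over characters. The sketch below (helper names `word_ngrams` and `bigram_prob` are my own) also shows the kind of maximum-likelihood probability estimate, P(word | previous word), that statistical language models build from n-gram counts:

```python
from collections import Counter

def word_ngrams(sentence, n):
    """Return all contiguous word n-grams of a sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
print(word_ngrams(sentence, 3))  # the seven trigrams listed above

# Statistical language models estimate P(word | context) from counts.
tokens = sentence.lower().split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("lazy", "the"))  # 0.5: "the" occurs twice, once before "lazy"
```

In practice such estimates are computed over large corpora and combined with smoothing; this toy corpus is only meant to show the counting mechanics.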
Syntactic N-grams
Syntactic n-grams extend the concept of word n-grams by considering the syntactic relationships between words in a sentence. These models are built upon the syntactic structure of a text, such as its dependency or constituency parse tree. By incorporating syntax, syntactic n-grams provide a more nuanced understanding of language, making them particularly useful for tasks that require a deeper analysis of sentence structure, like sentiment analysis and information extraction.
Given the sentence "The cat chased the mouse," syntactic n-grams derived from the dependency parse tree might be:
- det(cat, The)
- nsubj(chased, cat)
- det(mouse, the)
- dobj(chased, mouse)
These syntactic n-grams capture the relationships between the words and their syntactic roles, such as the determiner (det), nominal subject (nsubj), and direct object (dobj).
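In practice, the dependency triples come from a parser such as spaCy or Stanza. The sketch below sidesteps parsing and starts from hand-written (relation, head, dependent) triples for this sentence, formatting each one in the relation(head, dependent) style shown above; the function name `syntactic_ngrams` is my own:

```python
# Hand-written dependency triples for "The cat chased the mouse".
# In a real pipeline these would come from a dependency parser.
parse = [
    ("det", "cat", "The"),
    ("nsubj", "chased", "cat"),
    ("det", "mouse", "the"),
    ("dobj", "chased", "mouse"),
]

def syntactic_ngrams(triples):
    """Format each (relation, head, dependent) triple as relation(head, dependent)."""
    return [f"{rel}({head}, {dep})" for rel, head, dep in triples]

for sng in syntactic_ngrams(parse):
    print(sng)  # det(cat, The), nsubj(chased, cat), ...
```

Because the n-grams follow parse-tree edges rather than surface word order, words that are far apart in the sentence can still form an n-gram, which is what gives syntactic n-grams their extra expressive power.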
N-gram Terminology
The table below is an overview of the terminology used to describe n-grams of varying lengths.
| N | N-gram Name | Example | Description |
|---|---|---|---|
| 1 | Unigram | dog | Single character or word |
| 2 | Bigram | lazy dog | Sequence of two characters/words |
| 3 | Trigram | the lazy dog | Sequence of three characters/words |
| 4 | 4-gram | over the lazy dog | Sequence of four characters/words |
| 5 | 5-gram | ... | Sequence of five characters/words |
| ... | ... | ... | ... |
| N | N-gram | ... | Sequence of N characters/words |