2023-01-20

NLP with NLTK

What is NLTK

The Natural Language Toolkit (NLTK) is a Python library for working with human language data. It provides easy-to-use interfaces for over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

NLTK offers several advantages, including:

  • A wide array of text processing functions and tools
  • Easy access to over 50 corpora and lexical resources
  • Comprehensive, easy-to-understand documentation
  • An active community of users and developers
  • Integration with other popular Python libraries, such as NumPy, pandas, and scikit-learn

Installing and Setting up NLTK

Prerequisites

Before you begin installing NLTK, ensure that your system meets the following requirements:

  • Python 3.6 or later installed
  • pip (the package installer for Python) installed

Installing NLTK

To install NLTK, open a terminal or command prompt, and run the following command:

bash
$ pip install nltk

Downloading NLTK Data

NLTK comes with a variety of corpora, datasets, and resources to work with. To download these resources, you need to use the NLTK Data Downloader. In your Python interpreter or script, run the following commands:

python
import nltk
nltk.download()

A new window called NLTK Downloader will open. Here, you can choose which resources you want to download. For beginners, it is recommended to download the popular collection, which includes a subset of the most commonly used resources.

To download the popular collection, click on the Collections tab, select popular, and then click the Download button. The download may take a few minutes, depending on your internet connection.

Alternatively, you can download specific resources by selecting them from the Corpora and Models tabs.

nltk.download() Options

You can use different options with the nltk.download() function to download specific resources or collections.

Here are some options that you can use with the nltk.download() function:

  • Downloading a specific resource
    To download a specific resource, pass the identifier of the resource as a parameter to the nltk.download() function. For example, to download the punkt tokenizer models, you can use the following code:
python
import nltk
nltk.download('punkt')
  • Downloading a collection
    NLTK provides pre-defined collections, such as book, popular, and all. To download a specific collection, pass the identifier of the collection as a parameter to the nltk.download() function. For example, to download the popular collection, you can use the following code:
python
import nltk
nltk.download('popular')
  • Downloading the data a specific feature needs
    Some NLTK features require additional data to function correctly. Pass the resource's identifier on its own (not a path such as corpora/wordnet) to the nltk.download() function. For example, to download the WordNet data used by the lemmatizer, you can use the following code:
python
import nltk
nltk.download('wordnet')
  • Downloading all resources
    To download all available NLTK resources, pass the all identifier as a parameter to the nltk.download() function. Note that downloading all resources may take a long time and require significant disk space.
python
import nltk
nltk.download('all')
  • Downloading resources to a specific location
    By default, the nltk.download() function downloads resources to the NLTK data directory. If you want to download the resources to a specific location, you can use the download_dir parameter. For example, to download the punkt tokenizer models to a custom directory, you can use the following code:
python
import nltk
nltk.download('punkt', download_dir='/path/to/your/custom/directory')
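When you download to a custom directory, NLTK will not find the data automatically at load time. One way to handle this (a minimal sketch; the path is a placeholder) is to append the location to nltk.data.path before loading any resources:

```python
import nltk

# Placeholder path; use the same directory you passed to download_dir.
custom_dir = '/path/to/your/custom/directory'

# NLTK searches the directories in nltk.data.path when loading
# resources, so registering the custom location makes it visible.
if custom_dir not in nltk.data.path:
    nltk.data.path.append(custom_dir)
```

Alternatively, you can set the NLTK_DATA environment variable to the same path before starting Python.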

Text Preprocessing with NLTK

Tokenization

Tokenization is the process of splitting a text into individual words or tokens. It is a crucial step in NLP, as it helps in understanding and analyzing the structure and content of the text. NLTK provides two types of tokenization: word tokenization and sentence tokenization.

Word Tokenization

To tokenize a text into words using NLTK, you can use the word_tokenize function:

python
from nltk.tokenize import word_tokenize

text = "NLTK provides various tools for text preprocessing."
tokens = word_tokenize(text)
print(tokens)

The output will be:

['NLTK', 'provides', 'various', 'tools', 'for', 'text', 'preprocessing', '.']

Sentence Tokenization

To tokenize a text into sentences, you can use the sent_tokenize function:

python
from nltk.tokenize import sent_tokenize

text = "NLTK is an amazing library. It provides various tools for NLP."
sentences = sent_tokenize(text)
print(sentences)

The output will be:

['NLTK is an amazing library.', 'It provides various tools for NLP.']

Stopwords Removal

Stopwords are common words such as 'a', 'an', 'the', 'in', and 'is', which do not contribute much to the overall meaning of a text. Removing stopwords can help in reducing the noise and dimensionality of the text data. NLTK provides a predefined list of stopwords for various languages.

To remove stopwords from a list of tokens, you can use the following code:

python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "NLTK provides various tools for text preprocessing."
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)

The output will be:

['NLTK', 'provides', 'various', 'tools', 'text', 'preprocessing', '.']

Stemming and Lemmatization

Stemming and lemmatization are techniques used to reduce words to their base or root form. This helps in reducing the dimensionality of the text data and grouping similar words together.

Stemming

Stemming removes suffixes from words to obtain their base form. NLTK provides various stemming algorithms, such as the Porter Stemmer and the Snowball Stemmer. Here's an example using the Porter Stemmer:

python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = "The cats are playing with their toys."
tokens = word_tokenize(text)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)

The output will be:

['the', 'cat', 'are', 'play', 'with', 'their', 'toy', '.']

Lemmatization

Lemmatization reduces words to their base or dictionary form, known as the lemma. Unlike stemming, it looks each word up in a dictionary (WordNet) and can take the word's part of speech into account. To perform lemmatization in NLTK, you can use the WordNetLemmatizer:

python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "The cats are playing with their toys."
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmatized_tokens)

The output will be:

['The', 'cat', 'are', 'playing', 'with', 'their', 'toy', '.']

Note that 'are' and 'playing' are unchanged in the output above: WordNetLemmatizer treats every token as a noun unless you pass its part of speech via the pos argument. Lemmatization is generally more accurate than stemming because it uses a dictionary and part-of-speech information, but it may also be slower.

Text Normalization

Text normalization is the process of transforming a text into a standard or canonical form. It can include various tasks such as converting text to lowercase, removing special characters, expanding contractions, and correcting spelling errors. Here are a few examples:

Converting Text to Lowercase

To convert a text to lowercase, you can simply use Python's lower() method:

python
text = "NLTK provides various tools for text preprocessing."
lowercase_text = text.lower()
print(lowercase_text)

The output will be:

nltk provides various tools for text preprocessing.

Removing Special Characters

To remove special characters from a text, you can use Python's re module:

python
import re

text = "NLTK provides various tools for text preprocessing!."
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)

The output will be:

NLTK provides various tools for text preprocessing


Ryusei Kakujo
