2023-02-03

Hugging Face Datasets

Hugging Face Datasets provides a website that lists publicly available datasets and a Python library, datasets, for downloading and working with them.

Datasets

Publicly available datasets can be found at the following link.

https://huggingface.co/datasets

Datasets can be keyword-searched, as well as filtered by task category, data size, license, etc.

datasets API

The datasets library can be installed with the following command.

$ pip install datasets
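
To confirm that the installation worked, the library version can be printed (a minimal check):

import datasets

print(datasets.__version__)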

List of available datasets

A list of available datasets can be viewed using list_datasets.

from pprint import pprint
from datasets import list_datasets

pprint(list_datasets())

As of February 3, 2023, there are 20,267 datasets.

>>> len(list_datasets())

20267
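
Note that in more recent releases, list_datasets in the datasets library has been deprecated in favor of the huggingface_hub client. A rough equivalent, assuming huggingface_hub is installed, looks like the following.

from huggingface_hub import list_datasets

# Print the IDs of a few datasets; iterating over the entire Hub can take a while
for dataset_info in list_datasets(limit=5):
    print(dataset_info.id)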

The same list can also be browsed on the Hugging Face Datasets website linked above.

Load datasets

Datasets can be loaded with load_dataset.

from datasets import load_dataset

dataset = load_dataset(
  'wikitext', # dataset identifier
  'wikitext-103-v1' # configuration (subset) name
)

https://huggingface.co/docs/datasets/v1.11.0/package_reference/loading_methods.html#datasets.load_dataset
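
load_dataset can also return a single split instead of the whole DatasetDict. A small sketch using the same wikitext configuration:

from datasets import load_dataset

# `split` selects one split and returns a Dataset instead of a DatasetDict
train_only = load_dataset('wikitext', 'wikitext-103-v1', split='train')
print(train_only)     # Dataset
print(train_only[0])  # first example as a dict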

Dataset type

The loaded dataset is of type DatasetDict.

from datasets import load_dataset

dataset = load_dataset('glue', 'cola')
type(dataset)

datasets.dataset_dict.DatasetDict

The format of the examples returned from a dataset can be switched to various types as follows.

dataset.set_format(type='python') # (default) Python object
dataset.set_format(type='torch') # PyTorch tensor
dataset.set_format(type='tensorflow') # TensorFlow tensor
dataset.set_format(type='jax') # JAX
dataset.set_format(type='numpy') # NumPy
dataset.set_format(type='pandas') # pandas DataFrame

https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.set_format
https://pypi.org/project/datasets/
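
As an illustration, the following sketch switches the GLUE CoLA dataset loaded above to NumPy output for the numeric columns and then restores the default format.

# Return 'label' and 'idx' as NumPy values instead of plain Python objects
dataset.set_format(type='numpy', columns=['label', 'idx'])
print(type(dataset['train'][0]['label']))  # e.g. <class 'numpy.int64'>

dataset.reset_format()  # back to the default Python objects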

Generate DatasetDict class from CSV file

DatasetDict classes can be created from local files or pandas DataFrames.

First, prepare CSV files. Download the cola configuration of the glue dataset and save each split as a CSV.

from datasets import load_dataset
from pprint import pprint

dataset = load_dataset('glue', 'cola')
dataset
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

Save the dataset locally as CSV with the to_csv method.

dataset.set_format(type="pandas") # convert to pandas
dataset["train"][:].to_csv("train.csv", index=None) # save
dataset["validation"][:].to_csv("validation.csv", index=None) # save
dataset["test"][:].to_csv("test.csv", index=None) # save
dataset.reset_format() # reset format

train.csv contains the following data.

$ head -n 5 train.csv

sentence,label,idx
"Our friends won't buy this analysis, let alone the next one we propose.",1,0
One more pseudo generalization and I'm giving up.,1,1
One more pseudo generalization or I'm giving up.,1,2
"The more we study verbs, the crazier they get.",1,3

CSV file

To create a DatasetDict class from a local CSV file, execute the following code.

from datasets import load_dataset

dataset_files = {
    "train": ["train.csv"],
    "validation": ["validation.csv"],
    "test": ["test.csv"],
}
dataset_from_csv = load_dataset("csv", data_files=dataset_files)
print(dataset_from_csv)
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
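
If a single file is passed instead of a dictionary, load_dataset puts everything into one train split. A minimal example with the train.csv created above:

# Loading a single CSV yields a DatasetDict with only a 'train' split
single_file = load_dataset("csv", data_files="train.csv")
print(single_file)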

pandas DataFrame

To create a DatasetDict class from pandas, execute the following code.

import pandas as pd
from datasets import Dataset, DatasetDict

train_df = pd.read_csv("train.csv")
validation_df = pd.read_csv("validation.csv")
test_df = pd.read_csv("test.csv")

train_ds = Dataset.from_pandas(train_df)
validation_ds = Dataset.from_pandas(validation_df)
test_ds = Dataset.from_pandas(test_df)

dataset_from_pandas = DatasetDict({
    "train": train_ds,
    "validation": validation_ds,
    "test": test_ds,
})
print(dataset_from_pandas)
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
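
One detail worth noting: if a DataFrame has a non-default index, Dataset.from_pandas may carry it over as an extra column. Passing preserve_index=False avoids this, as in the sketch below.

# preserve_index=False drops the DataFrame index instead of storing it as a column
train_ds_no_index = Dataset.from_pandas(train_df, preserve_index=False)
print(train_ds_no_index.features.keys())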

Cast label column to ClassLabel

The label column of a dataset loaded from the Hub is of type ClassLabel, while the label column of dataset_from_pandas is of type Value.

dataset['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
 'idx': Value(dtype='int32', id=None)}
dataset_from_pandas['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': Value(dtype='int64', id=None),
 'idx': Value(dtype='int64', id=None)}

You can cast a column to a different feature type with the cast_column method as follows.

from datasets import ClassLabel

# convert "label" to ClassLabel
class_label = ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'])
dataset_from_pandas = dataset_from_pandas.cast_column("label", class_label)
dataset_from_pandas['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
 'idx': Value(dtype='int64', id=None)}
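
With a ClassLabel feature, integer labels can be mapped to their names and back, for example:

label_feature = dataset_from_pandas['train'].features['label']
print(label_feature.int2str(1))               # 'acceptable'
print(label_feature.str2int('unacceptable'))  # 0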

Processing a dataset

The entire dataset can be processed at once using the map method.

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset('glue', 'cola')

# add "length" column that is the length of "sentence"
dataset_with_length = dataset.map(lambda x: {"length": len(x["sentence"])})

# tokenize "sentence" column
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
dataset_tokenized = dataset.map(lambda x: tokenizer(x['sentence']), batched=True)
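
A quick way to check the result of the two map calls above is to look at the new columns they produced.

print(dataset_with_length['train'][0]['length'])
# The tokenizer adds columns such as 'input_ids' and 'attention_mask'
print(dataset_tokenized['train'].column_names)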

References

https://huggingface.co/docs/datasets/main/index
https://huggingface.co/datasets
https://huggingface.co/docs/datasets/v1.11.0/package_reference/loading_methods.html#datasets.load_dataset
https://huggingface.co/docs/datasets/package_reference/main_classes
https://huggingface.co/docs/datasets/about_dataset_features
https://pypi.org/project/datasets/
https://huggingface.co/docs/datasets/v1.1.1/add_dataset.html
