Hugging Face Datasets
Hugging Face Datasets provides a website for browsing publicly available datasets and a Python library, datasets, for downloading and working with them.
Datasets
Publicly available datasets can be found at the following link.
Datasets can be keyword-searched, as well as filtered by task category, data size, license, etc.
datasets API
The datasets API can be installed with the following command.
$ pip install datasets
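To confirm the installation, the installed package version can be printed (a quick sanity check; the version shown depends on your environment).
$ python -c "import datasets; print(datasets.__version__)"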
List of available datasets
A list of available datasets can be viewed using list_datasets.
from datasets import list_datasets, load_dataset, list_metrics, load_metric
from pprint import pprint
pprint(list_datasets())
As of February 3, 2023, there are 20,267 datasets.
>>> len(list_datasets())
20267
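The returned value is a plain list of dataset identifiers, so it can be filtered with ordinary Python. As a small sketch, the following picks out identifiers containing the keyword glue (the exact matches depend on the current contents of the Hub).
from datasets import list_datasets

all_datasets = list_datasets()  # list of dataset identifiers on the Hub
glue_like = [name for name in all_datasets if 'glue' in name]
print(glue_like[:5])            # first few matching identifiers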
The datasets can also be found at the following link.
Load datasets
Datasets can be loaded using load_dataset.
from datasets import load_dataset
dataset = load_dataset(
    'wikitext',         # dataset identifier
    'wikitext-103-v1'   # configuration name within the dataset
)
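Printing the returned object shows the splits and their features, and individual examples can be accessed like dictionaries. The sketch below assumes the standard wikitext-103-v1 splits, whose rows have a single text field.
print(dataset)               # DatasetDict with train/validation/test splits
print(dataset['train'][0])   # one example: a dict with a 'text' field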
Dataset type
The loaded dataset is of type DatasetDict.
from datasets import load_dataset
dataset = load_dataset('glue', 'cola')
type(dataset)
datasets.dataset_dict.DatasetDict
The format in which examples are returned can be switched to various types as follows.
dataset.set_format(type='python')      # (default) Python objects
dataset.set_format(type='torch')       # PyTorch tensors
dataset.set_format(type='tensorflow')  # TensorFlow tensors
dataset.set_format(type='jax')         # JAX arrays
dataset.set_format(type='numpy')       # NumPy arrays
dataset.set_format(type='pandas')      # pandas DataFrame
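set_format only changes how examples are returned when the dataset is indexed; the underlying data is unchanged. As a minimal sketch reusing the glue/cola dataset loaded above, with the pandas format a slice of a split comes back as a DataFrame.
dataset.set_format(type='pandas')   # return slices as pandas DataFrames
df = dataset['train'][:5]
print(type(df))                     # <class 'pandas.core.frame.DataFrame'>
dataset.reset_format()              # back to the default Python objects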
Generate DatasetDict class from CSV file
DatasetDict classes can be created from local files or pandas DataFrames.
First, prepare the CSV files. Load the glue dataset (cola configuration) and save each split as a CSV file.
from datasets import load_dataset
from pprint import pprint
dataset = load_dataset('glue', 'cola')
dataset
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
Save the dataset locally as CSV files with the to_csv method.
dataset.set_format(type="pandas") # convert to pandas
dataset["train"][:].to_csv("train.csv", index=None) # save
dataset["validation"][:].to_csv("validation.csv", index=None) # save
dataset["test"][:].to_csv("test.csv", index=None) # save
dataset.reset_format() # reset format
The resulting train.csv contains the following data.
$ cat train.csv | head -n 5
sentence,label,idx
"Our friends won't buy this analysis, let alone the next one we propose.",1,0
One more pseudo generalization and I'm giving up.,1,1
One more pseudo generalization or I'm giving up.,1,2
"The more we study verbs, the crazier they get.",1,3
CSV file
To create a DatasetDict class from local CSV files, execute the following code.
dataset_files = {
    "train": ["train.csv"],
    "validation": ["validation.csv"],
    "test": ["test.csv"],
}
dataset_from_csv = load_dataset("csv", data_files=dataset_files)
print(dataset_from_csv)
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
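A single CSV file can also be loaded without building the data_files dictionary; the result then contains only a train split. Passing split returns a Dataset directly instead of a DatasetDict. A small sketch reusing train.csv from above:
from datasets import load_dataset

single = load_dataset("csv", data_files="train.csv")                     # DatasetDict with only 'train'
train_only = load_dataset("csv", data_files="train.csv", split="train")  # Dataset
print(type(train_only))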
pandas DataFrame
To create a DatasetDict class from pandas DataFrames, execute the following code.
import pandas as pd
from datasets import Dataset, DatasetDict
train_df = pd.read_csv("train.csv")
validation_df = pd.read_csv("validation.csv")
test_df = pd.read_csv("test.csv")
train_ds = Dataset.from_pandas(train_df)
validation_ds = Dataset.from_pandas(validation_df)
test_ds = Dataset.from_pandas(test_df)
dataset_from_pandas = DatasetDict({
    "train": train_ds,
    "validation": validation_ds,
    "test": test_ds,
})
print(dataset_from_pandas)
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
Cast label column to ClassLabel
The label column of the dataset loaded from the public glue dataset is of type ClassLabel, while the label column of dataset_from_pandas is of type Value.
dataset['train'].features
{'sentence': Value(dtype='string', id=None),
'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
'idx': Value(dtype='int32', id=None)}
dataset_from_pandas['train'].features
{'sentence': Value(dtype='string', id=None),
'label': Value(dtype='int64', id=None),
'idx': Value(dtype='int64', id=None)}
You can cast the column types of a dataset as follows.
from datasets import ClassLabel
# convert "label" to ClassLabel
class_label = ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'])
dataset_from_pandas = dataset_from_pandas.cast_column("label", class_label)
dataset_from_pandas['train'].features
{'sentence': Value(dtype='string', id=None),
'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
'idx': Value(dtype='int64', id=None)}
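A ClassLabel feature stores the mapping between integer ids and label names, which is convenient when inspecting examples or predictions. A short sketch using the cast column:
label_feature = dataset_from_pandas['train'].features['label']
print(label_feature.int2str(1))               # 'acceptable'
print(label_feature.str2int('unacceptable'))  # 0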
Processing a dataset
The entire dataset can be processed with the map method, which applies a function to every example.
from datasets import load_dataset
from transformers import AutoTokenizer
dataset = load_dataset('glue', 'cola')
# add "length" column that is the length of "sentence"
dataset_with_length = dataset.map(lambda x: {"length": len(x["sentence"])})
# tokenize "sentence" column
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
dataset_tokenized = dataset.map(lambda x: tokenizer(x['sentence']), batched=True)
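The new columns are added alongside the existing ones, so the results can be inspected directly. As a rough check (the exact tokenizer columns depend on the model; bert-base-cased adds input_ids, token_type_ids, and attention_mask):
print(dataset_with_length['train'][0]['length'])   # length of the first sentence
print(dataset_tokenized['train'].column_names)     # original columns plus the tokenizer outputs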
References