2022-03-05

Public Datasets - Library Wise

Public Datasets

Public datasets play a critical role in the machine learning landscape. They serve as a foundation for training and testing ML models, enabling researchers and practitioners to evaluate their algorithms' performance, validate hypotheses, and compare results with existing benchmarks.

Public datasets also help democratize machine learning. Because the data is freely available, students, academics, and professionals alike can use high-quality datasets in their research and projects. This availability fosters innovation and collaboration within the ML community.

This article gives an overview of popular public datasets available through five widely used Python libraries: Scikit-learn, Seaborn, PyTorch, TensorFlow, and Hugging Face.

Scikit-learn

python

from sklearn import datasets
from sklearn.datasets import fetch_california_housing

# Iris Dataset
iris = datasets.load_iris()

# California Housing Dataset
california_housing = fetch_california_housing()

# Digits Dataset
digits = datasets.load_digits()

# Diabetes Dataset
diabetes = datasets.load_diabetes()

Iris Dataset

The Iris Dataset, also known as Fisher's Iris Dataset, is a classic dataset in the field of pattern recognition and machine learning. Consisting of 150 samples, it includes three classes of iris flowers (Setosa, Versicolor, and Virginica), each with 50 instances. The dataset contains four features: sepal length, sepal width, petal length, and petal width, all measured in centimeters. The Iris Dataset is widely used for classification and clustering tasks, serving as a beginner-friendly introduction to machine learning using the sklearn library.
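
As a minimal sketch of the classification use case described above, the following trains a k-nearest-neighbors classifier on a held-out split. The choice of classifier, the 75/25 split, and the random seed are assumptions made for illustration; the dataset itself does not prescribe any of them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the feature matrix (150 x 4) and the integer class labels (0, 1, 2).
iris = load_iris()
X, y = iris.data, iris.target

# Hold out a quarter of the samples, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit a simple k-nearest-neighbors classifier (k=5 is an arbitrary choice).
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Classes: {list(iris.target_names)}")
print(f"Test accuracy: {accuracy:.2f}")
```

Stratifying the split keeps all three species represented in the test set, which matters for a dataset this small.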

California Housing Dataset

The California Housing Dataset is a regression dataset derived from the 1990 U.S. census. It contains 20,640 instances, each representing a census block group in California. Each instance has 8 attributes: median income, median house age, average number of rooms per household, average number of bedrooms per household, block group population, average household occupancy, latitude, and longitude. The target variable is the median house value of the block group, expressed in units of $100,000. Unlike the other loaders shown above, fetch_california_housing downloads the data on first use and caches it locally. Sklearn provides the tools for preprocessing, training, and evaluating models on this dataset, which makes it a good way to practice regression and explore the factors that influence housing prices.

Digits Dataset

The Digits Dataset is a collection of 8x8 grayscale images of handwritten digits, ranging from 0 to 9. It consists of 1,797 samples, making it a smaller and more manageable alternative to the popular MNIST dataset. The Digits Dataset is suitable for image classification tasks and can be used to introduce users to image processing and pattern recognition techniques using sklearn's suite of tools.
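
The relationship between the image view and the flat feature matrix can be checked with a short sketch like the one below. It only inspects the object returned by load_digits; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# `images` keeps the 8x8 layout; `data` holds the same pixels flattened to 64 values.
print(digits.images.shape)  # (1797, 8, 8)
print(digits.data.shape)    # (1797, 64)

# Flattening the images by hand reproduces the `data` matrix.
flattened = digits.images.reshape(len(digits.images), -1)
same_pixels = np.array_equal(flattened, digits.data)
print(same_pixels)

# Pixel intensities are stored as floats in the range 0 to 16.
print(digits.data.min(), digits.data.max())
```

Most sklearn estimators expect the flat `data` matrix, while the `images` array is convenient for plotting individual digits.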

Diabetes Dataset

The Diabetes Dataset is another popular choice for regression tasks. It comprises 442 instances, each representing a diabetes patient, and includes ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements. The target variable is a quantitative measure of disease progression one year after baseline. Using sklearn's rich feature set, users can preprocess the data, train regression models, and evaluate their performance on the Diabetes Dataset.
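
A minimal regression sketch for this dataset might look as follows. The linear model, the 80/20 split, and the seed are assumptions for illustration. Note that sklearn ships the ten features already mean-centered and scaled, so no separate standardization step is included here.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# return_X_y=True yields the feature matrix and target vector directly.
X, y = load_diabetes(return_X_y=True)

# Keep 20% of the patients aside for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Ordinary least squares as a simple baseline.
model = LinearRegression().fit(X_train, y_train)

# R^2 measures the share of target variance explained on unseen data.
r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on held-out data: {r2:.2f}")
```

A baseline like this gives a reference score against which regularized or non-linear models can be compared.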

Seaborn

python

import seaborn as sns

# Tips Dataset
tips = sns.load_dataset("tips")

# Titanic Dataset
titanic = sns.load_dataset("titanic")

# Car Crashes Dataset
car_crashes = sns.load_dataset("car_crashes")

# Penguins Dataset
penguins = sns.load_dataset("penguins")

Tips Dataset

The Tips Dataset is one of the example datasets that Seaborn fetches through load_dataset (the data is downloaded on first use and then cached). It comprises 244 instances, each representing a meal at a restaurant, with seven attributes: total bill, tip, sex, smoker, day, time, and party size. The dataset is well suited to data exploration, visualization, and statistical analysis of the factors that influence tipping behavior. With Seaborn, users can draw scatter plots, box plots, and violin plots to investigate patterns and trends in the data.
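
To illustrate the kind of relationship this dataset invites exploring, here is a small sketch that computes the average tip as a fraction of the bill for each day. The helper name `tip_rate_by_day` and the three sample rows are hypothetical; the rows only mimic the column names of the real data so that the sketch runs without a download.

```python
import pandas as pd


def tip_rate_by_day(tips: pd.DataFrame) -> pd.Series:
    """Return the mean tip as a fraction of the bill for each day."""
    rate = tips["tip"] / tips["total_bill"]
    return rate.groupby(tips["day"], observed=True).mean()


# Hypothetical rows that only mimic the column names of the real dataset.
sample = pd.DataFrame(
    {
        "total_bill": [20.0, 10.0, 40.0],
        "tip": [3.0, 2.0, 4.0],
        "day": ["Sat", "Sat", "Sun"],
    }
)
print(tip_rate_by_day(sample))

# With network access, the real data can be passed in instead:
#   import seaborn as sns
#   print(tip_rate_by_day(sns.load_dataset("tips")))
```

The same derived column could then be handed to a Seaborn box plot or violin plot to compare days visually.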

Titanic Dataset

The Titanic Dataset is a well-known dataset in the machine learning community that contains information about passengers aboard the ill-fated Titanic. With 891 instances and 15 attributes, including passenger class, sex, age, fare, and survival status, the dataset offers insights into the factors that contributed to passengers' survival. Seaborn's visualization tools allow users to explore the dataset and identify patterns, correlations, and outliers that may help predict survival outcomes.

https://www.kaggle.com/c/titanic/data

Car Crashes Dataset

The Car Crashes Dataset is another Seaborn example dataset, with state-level data on fatal car crashes in the United States. It contains 51 rows, one for each state plus the District of Columbia, and eight columns: total (drivers involved in fatal collisions per billion miles), speeding, alcohol, not_distracted, no_previous, ins_premium (car insurance premiums), ins_losses (insurer losses per insured driver), and abbrev (the state abbreviation). Seaborn's plots and statistical summaries help identify trends and factors associated with crash rates across states.

Penguins Dataset

The Penguins Dataset is a relatively new dataset that has gained popularity as an alternative to the Iris Dataset. It contains 344 instances, each representing a penguin from one of three species (Adélie, Chinstrap, and Gentoo). The dataset includes seven attributes: species, island, bill length, bill depth, flipper length, body mass, and sex. With its diverse set of attributes, the Penguins Dataset is ideal for data exploration, visualization, and statistical analysis using Seaborn. Users can create a variety of plots, such as scatter plots, pair plots, and distribution plots, to uncover patterns and relationships among the attributes.
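
One practical detail is that the Penguins Dataset contains some missing measurements, so a typical first step is to drop incomplete rows before summarizing or plotting. The sketch below shows that step on a few hypothetical rows that reuse two of the real column names; the helper `species_means` is an illustrative name, not part of Seaborn.

```python
import numpy as np
import pandas as pd

# Two of the numeric measurement columns found in the real dataset.
MEASUREMENTS = ["bill_length_mm", "body_mass_g"]


def species_means(penguins: pd.DataFrame) -> pd.DataFrame:
    """Average the measurements per species, ignoring incomplete rows."""
    complete = penguins.dropna(subset=MEASUREMENTS)
    return complete.groupby("species", observed=True)[MEASUREMENTS].mean()


# Hypothetical rows; the second one has a missing bill length and is dropped.
sample = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "bill_length_mm": [39.0, np.nan, 46.0, 48.0],
        "body_mass_g": [3700.0, 3800.0, 5000.0, 5200.0],
    }
)
print(species_means(sample))

# With network access, the real data can be used instead:
#   import seaborn as sns
#   print(species_means(sns.load_dataset("penguins")))
```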

PyTorch

python

import torch
from torchvision import datasets, transforms

# MNIST Dataset
mnist_train = datasets.MNIST(root="./data", train=True, download=True, transform=transforms.ToTensor())
mnist_test = datasets.MNIST(root="./data", train=False, download=True, transform=transforms.ToTensor())

# Fashion-MNIST Dataset
fashion_mnist_train = datasets.FashionMNIST(root="./data", train=True, download=True, transform=transforms.ToTensor())
fashion_mnist_test = datasets.FashionMNIST(root="./data", train=False, download=True, transform=transforms.ToTensor())

# CIFAR-10 Dataset
cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transforms.ToTensor())
cifar10_test = datasets.CIFAR10(root="./data", train=False, download=True, transform=transforms.ToTensor())

# CIFAR-100 Dataset
cifar100_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=transforms.ToTensor())
cifar100_test = datasets.CIFAR100(root="./data", train=False, download=True, transform=transforms.ToTensor())

MNIST Dataset

The MNIST (Modified National Institute of Standards and Technology) Dataset is a popular dataset for image recognition tasks, specifically handwritten digit classification. It consists of 70,000 grayscale images, each 28x28 pixels, representing digits from 0 to 9, split into 60,000 training images and 10,000 test images. The torchvision package provides a built-in MNIST dataset class that downloads the files and applies the given transforms; wrapping it in PyTorch's DataLoader then handles batching and shuffling for training. MNIST is widely used as a benchmark for image classification algorithms and is a good starting point for those new to deep learning.
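
The step from a dataset object to training batches can be sketched as follows. To keep the example runnable without a download, it uses random tensors with the same layout that ToTensor() produces for MNIST (1 x 28 x 28 floats in [0, 1]); the batch size of 64 is an arbitrary assumption.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data with MNIST's layout after ToTensor(): 1 x 28 x 28 floats in [0, 1].
# Random values are used only so the sketch runs without downloading anything.
images = torch.rand(256, 1, 28, 28)
labels = torch.randint(0, 10, (256,))
dataset = TensorDataset(images, labels)

# The DataLoader handles batching and shuffling; a torchvision MNIST dataset
# object could be passed here in place of `dataset`.
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Pull one batch to inspect its shape.
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape)  # torch.Size([64, 1, 28, 28])
print(batch_labels.shape)  # torch.Size([64])
```

The same pattern applies unchanged to the other torchvision datasets in this section, since they all implement the same dataset interface.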

Fashion-MNIST Dataset

The Fashion-MNIST Dataset is a drop-in alternative to the traditional MNIST dataset, created because MNIST has become too easy and overused as a benchmark. It consists of 70,000 grayscale images, each 28x28 pixels, representing 10 classes of clothing items, such as t-shirts, trousers, and dresses. Like MNIST, it is split into 60,000 training images and 10,000 test images. The torchvision package provides a built-in FashionMNIST dataset class, which can be wrapped in a DataLoader for training and evaluation. The dataset is a good choice for those looking for a somewhat harder image classification task in PyTorch.

CIFAR-10 and CIFAR-100 Datasets

The CIFAR-10 and CIFAR-100 datasets are popular choices for image classification tasks involving more complex and diverse images. CIFAR-10 contains 60,000 color images, each 32x32 pixels, in 10 classes of objects, such as airplanes, automobiles, and birds, divided into 50,000 training images and 10,000 test images. CIFAR-100 has the same size and split but 100 classes, with 600 images per class. The torchvision package provides built-in dataset classes for both, which can be wrapped in a DataLoader for training deep learning models.
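
A common preprocessing step for color images such as these is per-channel normalization. The sketch below computes channel statistics from a tensor shaped like a batch of CIFAR images and applies them by hand; the helper names are illustrative, and random data stands in for the real images so that nothing needs to be downloaded. In a torchvision pipeline, transforms.Normalize applies the same per-channel arithmetic to each image.

```python
import torch


def channel_stats(images: torch.Tensor):
    """Per-channel mean and std of a float tensor shaped (N, C, H, W)."""
    mean = images.mean(dim=(0, 2, 3))
    std = images.std(dim=(0, 2, 3))
    return mean, std


def normalize(images: torch.Tensor, mean: torch.Tensor, std: torch.Tensor):
    """Subtract the channel mean and divide by the channel std."""
    return (images - mean[None, :, None, None]) / std[None, :, None, None]


# Random stand-in for a stack of 32 x 32 RGB images scaled to [0, 1].
images = torch.rand(512, 3, 32, 32)

mean, std = channel_stats(images)
normalized = normalize(images, mean, std)

# After normalization each channel has roughly zero mean and unit std.
print(normalized.mean(dim=(0, 2, 3)))
print(normalized.std(dim=(0, 2, 3)))
```

The statistics should be computed on the training split only and then reused for the test split.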

TensorFlow

python

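import tensorflow as tf

# Each load_data() call below rebinds x_train, y_train, x_test, and y_test,
# so use distinct names if several datasets are needed at once.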

# MNIST Dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Fashion-MNIST Dataset
fashion_mnist = tf.keras.datasets.fashion_mnist
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

# CIFAR-10 Dataset
cifar10 = tf.keras.datasets.cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# CIFAR-100 Dataset
cifar100 = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar100.load_data()

# IMDB Movie Review Dataset
imdb = tf.keras.datasets.imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()

MNIST Dataset

The MNIST Dataset is a popular dataset for handwritten digit classification. In TensorFlow it is available through tf.keras.datasets.mnist, whose load_data() function downloads the files on first use and returns the training and test splits as NumPy arrays. Combined with TensorFlow's tools for building, training, and evaluating models, this makes MNIST a good starting point for those new to TensorFlow and deep learning.
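
Because the images arrive as arrays of 8-bit integers, a typical first preprocessing step is to scale them to floats and add a channel axis. The sketch below shows this with NumPy alone; `prepare_images` is a hypothetical helper, and a small random array stands in for the real training images.

```python
import numpy as np


def prepare_images(x: np.ndarray) -> np.ndarray:
    """Scale uint8 pixels to [0, 1] floats and append a channel axis."""
    x = x.astype("float32") / 255.0
    return x[..., np.newaxis]


# Stand-in for the training images: random uint8 pixels in a 28 x 28 layout.
rng = np.random.default_rng(0)
fake_x_train = rng.integers(0, 256, size=(32, 28, 28), dtype=np.uint8)

prepared = prepare_images(fake_x_train)
print(prepared.shape, prepared.dtype)  # (32, 28, 28, 1) float32
```

The trailing channel axis matches the input layout that Keras convolutional layers expect by default.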

Fashion-MNIST Dataset

The Fashion-MNIST Dataset, an alternative to the traditional MNIST dataset, is another popular choice for image classification tasks using TensorFlow. TensorFlow offers built-in utilities for loading and preprocessing the Fashion-MNIST dataset, allowing users to quickly and easily train and evaluate deep learning models for classifying clothing items.

CIFAR-10 and CIFAR-100 Datasets

The CIFAR-10 and CIFAR-100 datasets are widely used for image classification tasks involving more complex and diverse images. TensorFlow provides built-in support for these datasets, simplifying the process of loading and preprocessing the data. By working with the CIFAR-10 and CIFAR-100 datasets, users can gain experience in training deep learning models on more challenging image classification tasks using TensorFlow.

IMDB Movie Review Dataset

The IMDB Movie Review Dataset is a popular dataset for natural language processing tasks, specifically sentiment analysis. It consists of 50,000 movie reviews, labeled as either positive or negative, with a balanced distribution of classes. The Keras version is split into 25,000 training and 25,000 test reviews, and each review is already encoded as a sequence of integer word indices. TensorFlow's built-in loader makes it easy to train and evaluate deep learning models for sentiment analysis on this data.
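
Since the encoded reviews have different lengths, they are usually padded or truncated to a common length before being fed to a model. Keras ships its own pad_sequences utility for this; the NumPy stand-in below, with the hypothetical name `pad_reviews`, only illustrates the idea of padding and truncating at the front of each sequence.

```python
import numpy as np


def pad_reviews(sequences, maxlen, value=0):
    """Left-pad or left-truncate integer sequences to a common length."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]  # keep the last `maxlen` tokens
        if trimmed:
            padded[row, -len(trimmed):] = trimmed
    return padded


# Hypothetical encoded reviews of different lengths.
reviews = [[1, 14, 22], [1, 5], [1, 9, 8, 7, 6, 3]]
padded = pad_reviews(reviews, maxlen=4)
print(padded)
```

Padding at the front keeps the most recent tokens next to the end of the sequence, which is where recurrent models read their final state.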

COCO Dataset

The COCO (Common Objects in Context) Dataset is a large-scale dataset for object detection, segmentation, and captioning tasks. It contains over 200,000 labeled images, with more than 1.5 million object instances across 80 object categories. TensorFlow provides support for the COCO dataset through the TensorFlow Object Detection API, which includes tools and utilities for loading, preprocessing, and evaluating data. The COCO dataset is an excellent resource for those looking to explore advanced computer vision tasks using TensorFlow.

Hugging Face

https://huggingface.co/datasets

python

from datasets import load_dataset

# GLUE Benchmark (for example, MRPC task)
glue_mrpc = load_dataset("glue", "mrpc")

# SQuAD Dataset
squad = load_dataset("squad")

GLUE Benchmark

The GLUE (General Language Understanding Evaluation) Benchmark is a collection of nine diverse natural language understanding tasks, including sentiment analysis, question answering, and paraphrasing. The benchmark aims to evaluate the performance of NLP models across a wide range of tasks. Hugging Face offers pre-trained models and datasets for the GLUE Benchmark, allowing users to fine-tune models and evaluate their performance on specific tasks, as well as compare their results with those of other models.

https://huggingface.co/datasets/glue

SQuAD (Stanford Question Answering Dataset)

The SQuAD dataset is a popular choice for question answering and reading comprehension tasks. It consists of over 100,000 questions based on more than 500 Wikipedia articles, with each question accompanied by a paragraph containing the answer. Hugging Face provides access to the SQuAD dataset and pre-trained models, simplifying the process of fine-tuning and evaluating models for question answering tasks.

https://huggingface.co/datasets/squad
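
Each SQuAD record stores its answers as text together with the character offset at which the answer starts in the context paragraph. The sketch below recovers the answer by slicing the context; the record is written by hand to mirror that field layout, and `answer_spans` is a hypothetical helper rather than part of the datasets library.

```python
def answer_spans(example):
    """Return the answer strings sliced out of the context by character offset."""
    context = example["context"]
    answers = example["answers"]
    return [
        context[start:start + len(text)]
        for text, start in zip(answers["text"], answers["answer_start"])
    ]


# Hypothetical record written by hand to mirror the field layout.
example = {
    "id": "example-0",
    "title": "Iris_flower_data_set",
    "context": "The Iris dataset contains 150 samples from three species.",
    "question": "How many samples does the Iris dataset contain?",
    "answers": {"text": ["150"], "answer_start": [26]},
}

print(answer_spans(example))
```

Checking that the sliced span equals the stored answer text is a quick sanity test before converting character offsets to token positions for a model.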


Ryusei Kakujo
