2022-11-10

Pandas Overview

What is the Pandas Library

Pandas is a Python library for data analysis and manipulation. It provides data structures for efficiently storing and manipulating data, as well as tools for data cleaning, filtering, and transformation. Pandas is built on top of the NumPy library, which provides efficient numerical computation in Python. Pandas is widely used in data science and machine learning, and is an essential tool for anyone working with data in Python.

Key Features of Pandas

Some of the key features of Pandas include:

  • Data structures for efficient storage and manipulation of tabular data, including dataframes and series.
  • Tools for data cleaning, filtering, and transformation, such as the ability to handle missing data and duplicate values.
  • Integration with other Python libraries for numerical computing, visualization, and machine learning, such as NumPy, Matplotlib, and Scikit-learn.
  • Built-in support for reading and writing data in a variety of formats, including CSV, Excel, and SQL databases.
  • Powerful indexing and selection capabilities, allowing for complex data slicing and filtering.
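As a rough illustration of several of these features together, the sketch below cleans a small table, round-trips it through CSV text, and filters the result. The column names and values are made up for the example, and an in-memory buffer stands in for a real file path.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data with one duplicated row and one missing value.
raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Charlie"],
    "score": [90.0, 85.0, 85.0, np.nan],
})

# Drop exact duplicate rows, then fill the missing score with the column mean.
cleaned = raw.drop_duplicates().reset_index(drop=True)
cleaned["score"] = cleaned["score"].fillna(cleaned["score"].mean())

# Write CSV text to an in-memory buffer and read it back.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Keep only the rows that satisfy a boolean condition.
high_scores = restored[restored["score"] > 86]
print(high_scores)
```

Filling with the mean is only one possible choice here; dropping the incomplete row with dropna() would be just as reasonable depending on the data.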

Installation

To use Pandas, you first need to install it on your computer. Pandas can be installed using the pip package manager.

bash
$ pip install pandas
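One simple way to check that the installation worked is to import the library and print its version. The version string you see depends on your environment.

```python
import pandas as pd

# Prints the installed version string; the exact value varies by environment.
print(pd.__version__)
```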

Data Structures in Pandas

Pandas provides two main data structures for storing and manipulating data: dataframes and series. In this chapter, I will explore both.

Dataframes

A dataframe is a two-dimensional table of data, similar to a spreadsheet. It consists of rows and columns, where each column represents a variable and each row represents an observation. Dataframes are the most commonly used data structure in Pandas, and they provide a powerful way to work with tabular data.

To create a dataframe in Pandas, you can call the DataFrame() constructor and pass in a dictionary or a list of lists. With a dictionary, the keys become the column names and each value becomes that column's data. With a list of lists, each inner list becomes one row, and the column names are supplied separately through the columns argument. For example:

python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Paris', 'London', 'Tokyo']}

df = pd.DataFrame(data)

print(df)
       Name  Age      City
0     Alice   25  New York
1       Bob   30     Paris
2   Charlie   35    London
3     David   40     Tokyo
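When the data arrives as a list of rows instead of a dictionary, the column names can be passed through the columns parameter. The following is a minimal sketch with two rows of the same sample data; the variable names are only illustrative.

```python
import pandas as pd

# Each inner list is one row; column names are given separately.
rows = [
    ["Alice", 25, "New York"],
    ["Bob", 30, "Paris"],
]
df_from_rows = pd.DataFrame(rows, columns=["Name", "Age", "City"])

print(df_from_rows.shape)           # (2, 3)
print(list(df_from_rows.columns))   # ['Name', 'Age', 'City']
```

If columns is omitted, Pandas labels the columns with the integers 0, 1, 2 instead.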

You can access columns in a dataframe using their names, for example:

python
ages = df['Age']
print(ages)
0    25
1    30
2    35
3    40
Name: Age, dtype: int64
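Column access extends naturally to selecting several columns at once and to filtering rows by a condition. The sketch below rebuilds the same sample dataframe so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Paris", "London", "Tokyo"],
})

# A list of column names returns a smaller dataframe rather than a series.
name_and_city = df[["Name", "City"]]

# A boolean condition keeps only the rows where it is True.
over_30 = df[df["Age"] > 30]
print(over_30)
```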

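You can also access rows in a dataframe using the loc[] indexer, which takes a row label, or the iloc[] indexer, which takes an integer position. In the dataframe above the default labels are the integers 0 to 3, so both return the same row. For example: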

python
row = df.loc[1]
print(row)
Name        Bob
Age          30
City      Paris
Name: 1, dtype: object
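The difference between the two becomes visible once the index is not the default range of integers. The sketch below assumes a hypothetical dataframe indexed by lowercase names.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "Paris", "London"]},
    index=["alice", "bob", "charlie"],  # hypothetical string labels
)

by_label = df.loc["bob"]     # row whose index label is "bob"
by_position = df.iloc[1]     # second row, counting from 0

# Both lookups reach the same row by different routes.
print(by_label.equals(by_position))

# loc also accepts a column label to pick out a single value.
print(df.loc["bob", "City"])
```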

Series

A series is a one-dimensional array of data, similar to a column in a spreadsheet. Series are often used to represent a single variable or a single column of data in a dataframe. Series provide a powerful way to work with one-dimensional data in Pandas.

To create a series in Pandas, you can call the Series() constructor and pass in a list or an array. For example:

python
import pandas as pd

ages = pd.Series([25, 30, 35, 40])

print(ages)
0    25
1    30
2    35
3    40
dtype: int64

You can access elements in a series using their index labels, which default to the integers 0, 1, 2, and so on. For example:

python
age = ages[1]
print(age)
30
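A series can also carry custom labels, in which case label-based and position-based access look different. The names used as labels below are an assumption for the example.

```python
import pandas as pd

# Same ages as before, now labeled with hypothetical names.
ages = pd.Series([25, 30, 35, 40], index=["Alice", "Bob", "Charlie", "David"])

print(ages["Bob"])                 # access by label
print(ages.iloc[1])                # access by position
print(ages.loc["Bob":"Charlie"])   # label slices include both endpoints
```

Using loc and iloc explicitly avoids ambiguity about whether a number means a label or a position.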

You can also perform element-wise operations on series, for example:

python
doubled_ages = ages * 2
print(doubled_ages)
0    50
1    60
2    70
3    80
dtype: int64
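Element-wise behavior also covers comparisons, aggregations, and arithmetic between two series. The sketch below shows one detail worth knowing: arithmetic between series aligns values by index label rather than by position. The series a and b are made up for the illustration.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40])

older = ages[ages > 28]     # boolean mask keeps 30, 35, 40
mean_age = ages.mean()      # 32.5

# Arithmetic between two series matches values by index label.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b               # "x" has no partner in b, so its result is NaN
print(total)
```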


Ryusei Kakujo
