2022-11-17

Time Series Data with Pandas

Introduction

Time series data is a sequence of data points indexed in time order, typically recorded at regular intervals. Analyzing it is central to forecasting, financial analysis, and understanding trends in many domains. This article introduces how to handle and analyze time series data with the Python library Pandas.

Working with Dates and Times

In this chapter, we'll work with an example dataset to demonstrate how to create and manipulate DateTime objects using the Pandas library. We'll also learn how to parse dates and times from strings and format them as required.

Here is an example dataset.

python
data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
        'Value': [10, 20, 30, 40, 50]}

Creating DateTime Objects

First, let's import the necessary libraries and load our example dataset into a Pandas DataFrame.

python
import pandas as pd

data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
        'Value': [10, 20, 30, 40, 50]}

df = pd.DataFrame(data)
print(df)
         Date  Value
0  2021-01-01     10
1  2021-01-02     20
2  2021-01-03     30
3  2021-01-04     40
4  2021-01-05     50

Now, let's convert the 'Date' column from strings to DateTime objects using pd.to_datetime().

python
df['Date'] = pd.to_datetime(df['Date'])
print(df)
        Date  Value
0 2021-01-01     10
1 2021-01-02     20
2 2021-01-03     30
3 2021-01-04     40
4 2021-01-05     50
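
Real data is rarely as clean as this example. As a minimal sketch, assuming a hypothetical series named raw that contains one invalid entry, the errors='coerce' option of pd.to_datetime() converts anything it cannot parse into NaT instead of raising an exception.

```python
import pandas as pd

# Hypothetical input: the last entry is not a valid date.
raw = pd.Series(['2021-01-01', '2021-01-02', 'not a date'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed.dtype)         # a datetime64 dtype
print(parsed.isna().sum())  # number of entries that failed to parse
```

Counting the NaT values afterwards is a quick way to see how many rows need attention before any time-based analysis.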

Formatting Dates and Times

We can format the DateTime values in our DataFrame as strings using the .dt.strftime() method. Let's format the 'Date' column as 'Month-Day-Year'. Note that the resulting column holds strings, not DateTime objects.

python
df['Formatted_Date'] = df['Date'].dt.strftime('%m-%d-%Y')
print(df)
        Date  Value Formatted_Date
0 2021-01-01     10     01-01-2021
1 2021-01-02     20     01-02-2021
2 2021-01-03     30     01-03-2021
3 2021-01-04     40     01-04-2021
4 2021-01-05     50     01-05-2021
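
The same .dt accessor exposes more than strftime(). The sketch below, which uses a small hypothetical series named dates, shows that strftime() returns plain text and that individual components such as the year or the weekday name can be read directly.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']))

# strftime returns text, so the result no longer supports date arithmetic.
formatted = dates.dt.strftime('%m-%d-%Y')
print(formatted.tolist())   # ['01-01-2021', '01-02-2021', '01-03-2021']

# The .dt accessor also exposes individual components of each timestamp.
print(dates.dt.year.tolist())        # [2021, 2021, 2021]
print(dates.dt.day_name().tolist())  # ['Friday', 'Saturday', 'Sunday']
```

If you still need to sort or filter by date, keep the original DateTime column and treat the formatted one as display-only.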

Parsing Dates and Times from Strings

Let's assume we have a new DataFrame whose 'Date_Str' column holds dates in the format 'Month-Day-Year', and we want to parse them into DateTime objects.

python
data = {'Date_Str': ['01-01-2021', '01-02-2021', '01-03-2021', '01-04-2021', '01-05-2021'],
        'Value': [10, 20, 30, 40, 50]}

df = pd.DataFrame(data)
print(df)
     Date_Str  Value
0  01-01-2021     10
1  01-02-2021     20
2  01-03-2021     30
3  01-04-2021     40
4  01-05-2021     50

To parse the 'Date_Str' column and convert it into DateTime objects, we can use the pd.to_datetime() function with the format parameter.

python
df['Date'] = pd.to_datetime(df['Date_Str'], format='%m-%d-%Y')
print(df)
     Date_Str  Value       Date
0  01-01-2021     10 2021-01-01
1  01-02-2021     20 2021-01-02
2  01-03-2021     30 2021-01-03
3  01-04-2021     40 2021-01-04
4  01-05-2021     50 2021-01-05

Now, we have successfully parsed the dates from the 'Date_Str' column and created a new 'Date' column with DateTime objects.
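
To see why the format parameter matters, here is a short sketch with a hypothetical series named ambiguous. The same strings are valid under both a month-first and a day-first reading, and the two readings produce different dates.

```python
import pandas as pd

ambiguous = pd.Series(['01-02-2021', '03-04-2021'])

# Read as month-day-year: January 2 and March 4.
month_first = pd.to_datetime(ambiguous, format='%m-%d-%Y')

# Read as day-month-year: February 1 and April 3.
day_first = pd.to_datetime(ambiguous, format='%d-%m-%Y')

print(month_first.dt.strftime('%Y-%m-%d').tolist())  # ['2021-01-02', '2021-03-04']
print(day_first.dt.strftime('%Y-%m-%d').tolist())    # ['2021-02-01', '2021-04-03']
```

Stating the format explicitly removes the guesswork whenever the day and month could be confused.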

Time Series Resampling

In this chapter, I'll explore time series resampling techniques, including downsampling and upsampling, using an example dataset. Resampling is essential when working with time series data to change the frequency of the data points.

Here is an example dataset.

python
import pandas as pd

date_rng = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')
data = {'Date': date_rng, 'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}

df = pd.DataFrame(data)
print(df)
        Date  Value
0 2021-01-01     10
1 2021-01-02     20
2 2021-01-03     30
3 2021-01-04     40
4 2021-01-05     50
5 2021-01-06     60
6 2021-01-07     70
7 2021-01-08     80
8 2021-01-09     90
9 2021-01-10    100

Downsampling

Downsampling is the process of aggregating data at a lower frequency. Let's downsample our dataset to a 3-day frequency, computing the mean of the 'Value' column for each period.

First, we set the 'Date' column as the index of our DataFrame, since resample() operates on a datetime-like index by default.

python
df.set_index('Date', inplace=True)
print(df)
            Value
Date
2021-01-01     10
2021-01-02     20
2021-01-03     30
2021-01-04     40
2021-01-05     50
2021-01-06     60
2021-01-07     70
2021-01-08     80
2021-01-09     90
2021-01-10    100

Now, let's perform the downsampling.

python
downsampled_df = df.resample('3D').mean()
print(downsampled_df)
            Value
Date
2021-01-01   20.0
2021-01-04   50.0
2021-01-07   80.0
2021-01-10  100.0
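
The mean is only one possible aggregation. As a sketch under a few assumptions (the variable names frame and summary are mine, not part of the example above), the code below computes several statistics per 3-day bin at once with agg(). It also uses the on parameter of resample(), which lets you resample by a column without setting it as the index first.

```python
import pandas as pd

date_rng = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')
frame = pd.DataFrame({'Date': date_rng, 'Value': list(range(10, 101, 10))})

# on='Date' resamples by that column, so no set_index() call is needed here.
# The 3-day bins are Jan 1-3, Jan 4-6, Jan 7-9, and Jan 10 on its own.
summary = frame.resample('3D', on='Date')['Value'].agg(['count', 'sum', 'mean'])
print(summary)
```

Including the count column makes it easy to spot bins, such as the last one, that contain fewer observations than the others.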

Upsampling

Upsampling is the process of increasing the frequency of the data. Let's upsample our dataset to an hourly frequency and fill the missing data using forward filling.

python
upsampled_df = df.resample('H').ffill()
print(upsampled_df.head(10))
                     Value
Date
2021-01-01 00:00:00     10
2021-01-01 01:00:00     10
2021-01-01 02:00:00     10
2021-01-01 03:00:00     10
2021-01-01 04:00:00     10
2021-01-01 05:00:00     10
2021-01-01 06:00:00     10
2021-01-01 07:00:00     10
2021-01-01 08:00:00     10
2021-01-01 09:00:00     10

As you can see, the data has been upsampled to an hourly frequency, and the missing values have been filled using forward filling.

Alternatively, we can use interpolation to fill missing data when upsampling. Let's perform linear interpolation on our dataset.

python
upsampled_df_interpolated = df.resample('H').interpolate()
print(upsampled_df_interpolated.head(10))
                         Value
Date
2021-01-01 00:00:00  10.000000
2021-01-01 01:00:00  10.416667
2021-01-01 02:00:00  10.833333
2021-01-01 03:00:00  11.250000
2021-01-01 04:00:00  11.666667
2021-01-01 05:00:00  12.083333
2021-01-01 06:00:00  12.500000
2021-01-01 07:00:00  12.916667
2021-01-01 08:00:00  13.333333
2021-01-01 09:00:00  13.750000
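
To compare the two filling strategies side by side, here is a small sketch on a hypothetical three-day series. It passes pd.offsets.Hour() instead of a frequency string, which should behave the same way and, as far as I know, avoids depending on how the hourly alias is spelled in a given pandas release.

```python
import pandas as pd

date_rng = pd.date_range(start='2021-01-01', end='2021-01-03', freq='D')
frame = pd.DataFrame({'Value': [10, 20, 30]}, index=date_rng)

hourly = pd.offsets.Hour()  # offset object for a one-hour frequency

filled = frame['Value'].resample(hourly).ffill()
interpolated = frame['Value'].resample(hourly).interpolate()

noon = pd.Timestamp('2021-01-01 12:00')
print(filled[noon])        # 10: the last observed value is carried forward
print(interpolated[noon])  # 15.0: halfway between 10 and 20
```

Forward filling suits values that stay constant until the next observation, while interpolation suits quantities that change gradually between observations.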

Rolling Window Functions

In this chapter, I'll explore rolling window functions and their applications using an example dataset. Rolling window functions are useful for smoothing time series data and calculating various statistics within a specified window size.

Here is an example dataset.

python
import pandas as pd

date_rng = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')
data = {'Date': date_rng, 'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}

df = pd.DataFrame(data)
print(df)
        Date  Value
0 2021-01-01     10
1 2021-01-02     20
2 2021-01-03     30
3 2021-01-04     40
4 2021-01-05     50
5 2021-01-06     60
6 2021-01-07     70
7 2021-01-08     80
8 2021-01-09     90
9 2021-01-10    100

Basic Rolling Window Operations

Let's start by setting the 'Date' column as the index of our DataFrame.

python
df.set_index('Date', inplace=True)
print(df)
            Value
Date
2021-01-01     10
2021-01-02     20
2021-01-03     30
2021-01-04     40
2021-01-05     50
2021-01-06     60
2021-01-07     70
2021-01-08     80
2021-01-09     90
2021-01-10    100

Now, let's calculate the rolling mean with a window size of 3.

python
df['Rolling_Mean'] = df['Value'].rolling(window=3).mean()
print(df)
            Value  Rolling_Mean
Date
2021-01-01     10           NaN
2021-01-02     20           NaN
2021-01-03     30     20.000000
2021-01-04     40     30.000000
2021-01-05     50     40.000000
2021-01-06     60     50.000000
2021-01-07     70     60.000000
2021-01-08     80     70.000000
2021-01-09     90     80.000000
2021-01-10    100     90.000000
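
The leading NaN values appear because a full window of 3 observations is required by default. As a minimal sketch on a shorter hypothetical series, the min_periods and center parameters of rolling() change how the edges are handled.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

# min_periods=1 emits a result as soon as one observation is available.
partial = values.rolling(window=3, min_periods=1).mean()
print(partial.tolist())   # [10.0, 15.0, 20.0, 30.0, 40.0]

# center=True labels each window by its middle element instead of its last.
centered = values.rolling(window=3, center=True).mean()
print(centered.tolist())  # [nan, 20.0, 30.0, 40.0, nan]
```

A trailing window only uses past values, which matters for forecasting, whereas a centered window looks one step ahead and is better suited to smoothing historical data.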

Expanding Windows

Expanding windows compute a statistic cumulatively over a growing window size. Let's calculate the cumulative sum of our dataset using expanding windows.

python
df['Expanding_Sum'] = df['Value'].expanding().sum()
print(df)
            Value  Rolling_Mean  Expanding_Sum
Date
2021-01-01     10           NaN           10.0
2021-01-02     20           NaN           30.0
2021-01-03     30     20.000000           60.0
2021-01-04     40     30.000000          100.0
2021-01-05     50     40.000000          150.0
2021-01-06     60     50.000000          210.0
2021-01-07     70     60.000000          280.0
2021-01-08     80     70.000000          360.0
2021-01-09     90     80.000000          450.0
2021-01-10    100     90.000000          550.0
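
For a plain running total, expanding().sum() matches the built-in cumsum(). The sketch below, on a shorter hypothetical series, shows the equivalence and then uses expanding() for a running mean, a statistic that cumsum() alone does not provide.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

expanding_sum = values.expanding().sum()
print(expanding_sum.tolist())    # [10.0, 30.0, 60.0, 100.0, 150.0]

# cumsum() yields the same totals but keeps the integer dtype.
print(values.cumsum().tolist())  # [10, 30, 60, 100, 150]

# Other statistics work the same way, for example a running mean.
running_mean = values.expanding().mean()
print(running_mean.tolist())     # [10.0, 15.0, 20.0, 25.0, 30.0]
```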

Custom Rolling Window Functions

We can also apply custom functions to a rolling window. Let's calculate the difference between the maximum and minimum values within a window size of 3.

python
def max_min_diff(series):
    return series.max() - series.min()

df['Max_Min_Diff'] = df['Value'].rolling(window=3).apply(max_min_diff)
print(df)
            Value  Rolling_Mean  Expanding_Sum  Max_Min_Diff
Date
2021-01-01     10           NaN           10.0           NaN
2021-01-02     20           NaN           30.0           NaN
2021-01-03     30     20.000000           60.0          20.0
2021-01-04     40     30.000000          100.0          20.0
2021-01-05     50     40.000000          150.0          20.0
2021-01-06     60     50.000000          210.0          20.0
2021-01-07     70     60.000000          280.0          20.0
2021-01-08     80     70.000000          360.0          20.0
2021-01-09     90     80.000000          450.0          20.0
2021-01-10    100     90.000000          550.0          20.0
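
Custom functions passed to apply() run once per window, which can be slow on large datasets. As a hedged sketch (the helper value_range and the shorter series are hypothetical), passing raw=True hands each window to the function as a NumPy array, and for this particular statistic the built-in rolling max() and min() give the same result without a custom function.

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

# raw=True passes each window as a NumPy array rather than a Series.
def value_range(window: np.ndarray) -> float:
    return window.max() - window.min()

via_apply = values.rolling(window=3).apply(value_range, raw=True)

# The same statistic built from the built-in rolling max and min.
via_builtin = values.rolling(window=3).max() - values.rolling(window=3).min()

print(via_apply.tolist())    # [nan, nan, 20.0, 20.0, 20.0]
print(via_builtin.tolist())  # [nan, nan, 20.0, 20.0, 20.0]
```

When a statistic can be expressed with built-in rolling methods, that route is usually the simpler choice, and apply() remains available for everything else.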

Ryusei Kakujo