Introduction
Time series data is a sequence of data points collected or recorded at successive points in time, often at regular intervals. Analyzing time series data is crucial for forecasting, financial analysis, and understanding trends across many domains. This article introduces handling and analyzing time series data with the Python library Pandas.
Working with Dates and Times
In this chapter, we'll work with an example dataset to demonstrate how to create and manipulate DateTime objects using the Pandas library. We'll also learn how to parse dates and times from strings and format them as required.
Creating DateTime Objects
First, let's import the necessary libraries and load our example dataset into a Pandas DataFrame.
import pandas as pd
data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
        'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
print(df)
Date Value
0 2021-01-01 10
1 2021-01-02 20
2 2021-01-03 30
3 2021-01-04 40
4 2021-01-05 50
Now, let's convert the 'Date' column from strings to DateTime objects using pd.to_datetime().
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Date Value
0 2021-01-01 10
1 2021-01-02 20
2 2021-01-03 30
3 2021-01-04 40
4 2021-01-05 50
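The printed table looks identical before and after the conversion, so it is worth confirming that the column's dtype actually changed. A quick check:

```python
import pandas as pd

data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
        'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

print(df['Date'].dtype)  # object (plain strings)

# After conversion, the column holds proper datetime64 values,
# which unlocks the .dt accessor and time-based indexing
df['Date'] = pd.to_datetime(df['Date'])
print(df['Date'].dtype)  # datetime64[ns]
```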
Formatting Dates and Times
We can format the DateTime objects in our DataFrame using the strftime() method. Let's format the 'Date' column as 'Month-Day-Year'.
df['Formatted_Date'] = df['Date'].dt.strftime('%m-%d-%Y')
print(df)
Date Value Formatted_Date
0 2021-01-01 10 01-01-2021
1 2021-01-02 20 01-02-2021
2 2021-01-03 30 01-03-2021
3 2021-01-04 40 01-04-2021
4 2021-01-05 50 01-05-2021
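strftime() accepts the standard C format directives, so other layouts are just a matter of changing the format string. A small sketch with a couple of common directives:

```python
import pandas as pd

dates = pd.to_datetime(['2021-01-01', '2021-01-02'])

# %B = full month name, %d = zero-padded day, %Y = four-digit year
print(dates.strftime('%B %d, %Y'))

# %A = weekday name, %b = abbreviated month name
print(dates.strftime('%A, %b %d'))
```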
Parsing Dates and Times from Strings
Let's assume we have a new column 'Date_Str' with dates in the format 'Month-Day-Year', and we want to parse them into DateTime objects.
data = {'Date_Str': ['01-01-2021', '01-02-2021', '01-03-2021', '01-04-2021', '01-05-2021'],
        'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
print(df)
Date_Str Value
0 01-01-2021 10
1 01-02-2021 20
2 01-03-2021 30
3 01-04-2021 40
4 01-05-2021 50
To parse the 'Date_Str' column and convert it into DateTime objects, we can use the pd.to_datetime() function with the format parameter.
df['Date'] = pd.to_datetime(df['Date_Str'], format='%m-%d-%Y')
print(df)
Date_Str Value Date
0 01-01-2021 10 2021-01-01
1 01-02-2021 20 2021-01-02
2 01-03-2021 30 2021-01-03
3 01-04-2021 40 2021-01-04
4 01-05-2021 50 2021-01-05
Now, we have successfully parsed the dates from the 'Date_Str' column and created a new 'Date' column with DateTime objects.
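Real-world date columns are rarely this clean. When some strings don't match the expected format, pd.to_datetime() raises an error by default; passing errors='coerce' turns unparseable entries into NaT (not-a-time) instead, so the rest of the column still parses. A minimal sketch with a deliberately bad value:

```python
import pandas as pd

raw = pd.Series(['01-01-2021', 'not a date', '01-03-2021'])

# errors='coerce' replaces unparseable strings with NaT instead of raising
parsed = pd.to_datetime(raw, format='%m-%d-%Y', errors='coerce')
print(parsed)
```

You can then locate the failures with parsed.isna() and decide whether to drop or repair them.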
Time Series Resampling
In this chapter, we'll explore time series resampling techniques, including downsampling and upsampling, using an example dataset. Resampling is essential when you need to change the frequency of the data points in a time series.
Here is an example dataset.
import pandas as pd
date_rng = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')
data = {'Date': date_rng, 'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(data)
print(df)
Date Value
0 2021-01-01 10
1 2021-01-02 20
2 2021-01-03 30
3 2021-01-04 40
4 2021-01-05 50
5 2021-01-06 60
6 2021-01-07 70
7 2021-01-08 80
8 2021-01-09 90
9 2021-01-10 100
Downsampling
Downsampling is the process of aggregating data at a lower frequency. Let's downsample our dataset to a 3-day frequency, computing the mean of the 'Value' column for each period.
First, we need to set the 'Date' column as the index of our DataFrame.
df.set_index('Date', inplace=True)
print(df)
Value
Date
2021-01-01 10
2021-01-02 20
2021-01-03 30
2021-01-04 40
2021-01-05 50
2021-01-06 60
2021-01-07 70
2021-01-08 80
2021-01-09 90
2021-01-10 100
Now, let's perform the downsampling.
downsampled_df = df.resample('3D').mean()
print(downsampled_df)
Value
Date
2021-01-01 20.0
2021-01-04 50.0
2021-01-07 80.0
2021-01-10 100.0
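The resampler isn't limited to mean(): any aggregation such as sum(), max(), or several at once via agg() works the same way. A quick sketch on the same daily data:

```python
import pandas as pd

date_rng = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')
df = pd.DataFrame({'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]},
                  index=date_rng)

# Total per 3-day bin: 10+20+30=60, 40+50+60=150, 70+80+90=240, 100
print(df.resample('3D').sum())

# Several statistics at once with agg()
print(df.resample('3D').agg(['min', 'max']))
```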
Upsampling
Upsampling is the process of increasing the frequency of the data. Let's upsample our dataset to an hourly frequency and fill the missing data using forward filling.
upsampled_df = df.resample('H').ffill()
print(upsampled_df.head(10))
Value
Date
2021-01-01 00:00:00 10
2021-01-01 01:00:00 10
2021-01-01 02:00:00 10
2021-01-01 03:00:00 10
2021-01-01 04:00:00 10
2021-01-01 05:00:00 10
2021-01-01 06:00:00 10
2021-01-01 07:00:00 10
2021-01-01 08:00:00 10
2021-01-01 09:00:00 10
As you can see, the data has been upsampled to an hourly frequency, and the missing values have been filled using forward filling.
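Forward filling carries the last observed value forward. The opposite strategy, backward filling with bfill(), propagates the next known value backwards instead. A minimal sketch, here upsampling to a 12-hour frequency to keep the output short:

```python
import pandas as pd

date_rng = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')
df = pd.DataFrame({'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]},
                  index=date_rng)

# bfill() fills each new timestamp with the *next* observed value
upsampled = df.resample('12h').bfill()
print(upsampled.head(4))
```

Note the noon rows take the following day's value rather than the preceding one.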
Alternatively, we can use interpolation to fill missing data when upsampling. Let's perform linear interpolation on our dataset.
upsampled_df_interpolated = df.resample('H').interpolate()
print(upsampled_df_interpolated.head(10))
Value
Date
2021-01-01 00:00:00 10.000000
2021-01-01 01:00:00 10.416667
2021-01-01 02:00:00 10.833333
2021-01-01 03:00:00 11.250000
2021-01-01 04:00:00 11.666667
2021-01-01 05:00:00 12.083333
2021-01-01 06:00:00 12.500000
2021-01-01 07:00:00 12.916667
2021-01-01 08:00:00 13.333333
2021-01-01 09:00:00 13.750000
Rolling Window Functions
In this chapter, we'll explore rolling window functions and their applications using an example dataset. Rolling window functions are useful for smoothing time series data and calculating statistics within a specified window size.
Here is an example dataset.
import pandas as pd
date_rng = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')
data = {'Date': date_rng, 'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(data)
print(df)
Date Value
0 2021-01-01 10
1 2021-01-02 20
2 2021-01-03 30
3 2021-01-04 40
4 2021-01-05 50
5 2021-01-06 60
6 2021-01-07 70
7 2021-01-08 80
8 2021-01-09 90
9 2021-01-10 100
Basic Rolling Window Operations
Let's start by setting the 'Date' column as the index of our DataFrame.
df.set_index('Date', inplace=True)
print(df)
Value
Date
2021-01-01 10
2021-01-02 20
2021-01-03 30
2021-01-04 40
2021-01-05 50
2021-01-06 60
2021-01-07 70
2021-01-08 80
2021-01-09 90
2021-01-10 100
Now, let's calculate the rolling mean with a window size of 3.
df['Rolling_Mean'] = df['Value'].rolling(window=3).mean()
print(df)
Value Rolling_Mean
Date
2021-01-01 10 NaN
2021-01-02 20 NaN
2021-01-03 30 20.000000
2021-01-04 40 30.000000
2021-01-05 50 40.000000
2021-01-06 60 50.000000
2021-01-07 70 60.000000
2021-01-08 80 70.000000
2021-01-09 90 80.000000
2021-01-10 100 90.000000
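The NaN values at the start appear because a full window of 3 observations isn't available yet. If partial results are acceptable, the standard min_periods parameter of rolling() lets the calculation start as soon as one observation exists:

```python
import pandas as pd

date_rng = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')
df = pd.DataFrame({'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]},
                  index=date_rng)

# min_periods=1 yields partial means for the first rows instead of NaN:
# mean(10)=10.0, mean(10,20)=15.0, mean(10,20,30)=20.0, ...
df['Rolling_Mean_mp'] = df['Value'].rolling(window=3, min_periods=1).mean()
print(df.head(3))
```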
Expanding Windows
Expanding windows compute a statistic cumulatively over a growing window size. Let's calculate the cumulative sum of our dataset using expanding windows.
df['Expanding_Sum'] = df['Value'].expanding().sum()
print(df)
Value Rolling_Mean Expanding_Sum
Date
2021-01-01 10 NaN 10.0
2021-01-02 20 NaN 30.0
2021-01-03 30 20.000000 60.0
2021-01-04 40 30.000000 100.0
2021-01-05 50 40.000000 150.0
2021-01-06 60 50.000000 210.0
2021-01-07 70 60.000000 280.0
2021-01-08 80 70.000000 360.0
2021-01-09 90 80.000000 450.0
2021-01-10 100 90.000000 550.0
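Like rolling windows, expanding windows accept any aggregation. The expanding mean, for example, is the running average of all values seen so far. A short sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Running average over an ever-growing window:
# mean(10)=10.0, mean(10,20)=15.0, mean(10,20,30)=20.0, ...
print(s.expanding().mean())
```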
Custom Rolling Window Functions
We can also apply custom functions to a rolling window. Let's calculate the difference between the maximum and minimum values within a window size of 3.
def max_min_diff(series):
    return series.max() - series.min()
df['Max_Min_Diff'] = df['Value'].rolling(window=3).apply(max_min_diff)
print(df)
Value Rolling_Mean Expanding_Sum Max_Min_Diff
Date
2021-01-01 10 NaN 10.0 NaN
2021-01-02 20 NaN 30.0 NaN
2021-01-03 30 20.000000 60.0 20.0
2021-01-04 40 30.000000 100.0 20.0
2021-01-05 50 40.000000 150.0 20.0
2021-01-06 60 50.000000 210.0 20.0
2021-01-07 70 60.000000 280.0 20.0
2021-01-08 80 70.000000 360.0 20.0
2021-01-09 90 80.000000 450.0 20.0
2021-01-10 100 90.000000 550.0 20.0
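For short custom functions, a lambda works just as well, and the standard raw=True option of apply() passes a plain NumPy array to the function instead of a Series, which is typically faster for simple numeric operations:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Same max-minus-min calculation as a lambda; raw=True hands the window
# to the function as a NumPy array rather than a pandas Series
diff = s.rolling(window=3).apply(lambda x: x.max() - x.min(), raw=True)
print(diff)
```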