Resampling or changing time series frequency
Changing the time series frequency: resampling
In this post we will see how to change the frequency of a time series. This is a very common operation because you often need to convert two time series to a common frequency to analyze them together.
Pandas DateTimeIndex allows us to set and change its frequency attribute, this also impacts the values of the DataFrame.
Upsampling vs Downsampling:
Upsampling: When you upsample by converting the data to a higher frequency, you create new rows and need to tell pandas how to fill or interpolate the missing values in these rows.
Downsampling: When you downsample, you reduce the number of rows, and need to tell pandas how to aggregate existing data.
In this post, we will focus on upsampling techniques. In the next post, we will look into downsampling.
First we will explore a few basic options that pandas provides to address resampling with
reindex(), before diving deeper into the
.reindex() to convert monthly to weekly data
We can use
.reindex()to conform an existing time series to a DateTimeIndex at a different frequency.
We start by creating monthly data and then convert this data to weekly frequency while applying various fill logic options.
end and frequency alias
import pandas as pd # Set start and end dates start = '2016-1-1' end = '2016-2-29' # Create monthly_dates here monthly_dates = pd.date_range(start=start, end=end, freq='M') monthly_dates
DatetimeIndex(['2016-01-31', '2016-02-29'], dtype='datetime64[ns]', freq='M')
Create and print the
pd.Series monthly, passing the list
[1, 2] as the data argument, and using monthly_dates as index.
# Create and print monthly here monthly = pd.Series([1, 2], index=monthly_dates) monthly
2016-01-31 1 2016-02-29 2 Freq: M, dtype: int64
Now let’s apply
.reindex() to monthly three times: first without additional options, then with
bfill and then with
ffill, print()-ing each result.
# Create weekly_dates here weekly_dates = pd.date_range(start=start, end=end, freq='W') # Print monthly, reindexed using weekly_dates print(monthly.reindex(weekly_dates)) print(monthly.reindex(weekly_dates, method='bfill')) print(monthly.reindex(weekly_dates, method='ffill'))
2016-01-03 NaN 2016-01-10 NaN 2016-01-17 NaN 2016-01-24 NaN 2016-01-31 1.0 2016-02-07 NaN 2016-02-14 NaN 2016-02-21 NaN 2016-02-28 NaN Freq: W-SUN, dtype: float64 2016-01-03 1 2016-01-10 1 2016-01-17 1 2016-01-24 1 2016-01-31 1 2016-02-07 2 2016-02-14 2 2016-02-21 2 2016-02-28 2 Freq: W-SUN, dtype: int64 2016-01-03 NaN 2016-01-10 NaN 2016-01-17 NaN 2016-01-24 NaN 2016-01-31 1.0 2016-02-07 1.0 2016-02-14 1.0 2016-02-21 1.0 2016-02-28 1.0 Freq: W-SUN, dtype: float64
Example: Convert montly unemployment data to weekly data
The civilian US unemployment rate is reported monthly. We may need more frequent data, but that’s no problem because we just saw how to upsample a time series.
We’ll work with the time series data for the last 20 years, and apply a few options to fill in missing values before plotting the weekly series.
data = pd.read_csv('data/stock_data/unrate_2000.csv', index_col='date', parse_dates=True) data
209 rows × 1 columns
# Show first five rows of weekly series print(data.asfreq('W').head()) # Show first five rows of weekly series with bfill option print(data.asfreq('W', method='bfill').head())
UNRATE date 2000-01-02 NaN 2000-01-09 NaN 2000-01-16 NaN 2000-01-23 NaN 2000-01-30 NaN UNRATE date 2000-01-02 4.1 2000-01-09 4.1 2000-01-16 4.1 2000-01-23 4.1 2000-01-30 4.1
Since we used
bfill, feb’s value
4.1 is used to backfill January weeks. Note that
2000-01-02 was also
NaN because we had values starting from
2000-01-01 and the actual start of the week was from
Similarly, when use the
ffill option, jan’s value
4.0 is used to forward-fill January weeks.
# Create weekly series with ffill option and show first five rows weekly_ffill = data.asfreq('W', method='ffill') print(weekly_ffill.head())
UNRATE date 2000-01-02 4.0 2000-01-09 4.0 2000-01-16 4.0 2000-01-23 4.0 2000-01-30 4.0
import matplotlib.pyplot as plt # Plot weekly_fill starting 2015 here weekly_ffill.loc['2015':].plot(figsize=(10,10)) plt.show()
Remember, we only had monthly unrate in our dataset, we used that to upsample by using ffill option to create weekly unrate dataset. In the above figure, we are only plotting urate from start of 2015.
Notice, how the intermediate values are same which creates all these step-like plot. We will see an alternative to this using
Instead of using
.reindex() we can also use the
Upsampling and interpolation using
Next, we will dive deeper into pandas’ capabilities to convert time series frequencies.
The resample method follows a logic similar to groupby: It groups data within a resampling period, and applies a method to this group. It takes the value that results from this method, and assigns a new date within the resampling period. The new date is determined by a so-called offset, and for instance can be at the beginning or end of the period, or a custom location.
The 208 data points imported using read_csv since 2000 have no frequency information. An inspection of the first rows shows that the data are reported for the first of each calendar month. When looking at resampling by month, we have so far focused on month-end frequency. In other words, after resampling, new data will be assigned the last calendar day for each month.
monthly = pd.read_csv('data/stock_data/unrate_2000.csv', index_col='date', parse_dates=True) monthly.head()
There are, however, quite a few alternatives as shown in the table. Depending on your context, you can resample to the beginning or end of either the calendar or business month. The example dates show how business dates may deviate from the calendar month due to weekends and holidays.
Resampling implements the following logic: When up-sampling, there will be more resampling periods than data points. Each resampling period will have a given date offset, for instance month-end frequency. You then need to decide how to create data for the new resampling periods. The new data points will be assigned to the date offsets.
In contrast, when down-sampling, there are more data points than resampling periods. Hence, you need to decide how to aggregate your data to obtain a single value for each date offset.
Resampling using ffill vs interpolate
ffillwill use only 1 value to fill the gaps, so the plot will look step-like. Whereas,
interpolateuses the something like linear model to get the values, so the plot looks smooth. Both are invoked on the
You can use
.resample() to set a frequency for the unemployment rate. Let’s use month start frequency given the reporting dates. When you apply the
resample() method, it returns a new object called Resampler object.
Just apply another method, and this object will again return a DataFrame. You can apply the asfreq method to just assign the data to their offset without modification. The dot-equal() method tells you that both approaches yield the same result.
Let’s now use a quarterly series, real GDP growth. You see that there is again no frequency info, but the first few rows confirm that the data are reported for the first day of each quarter.
gdp = pd.read_csv('data/stock_data/gdp_growth.csv', index_col='date', parse_dates=True) print(gdp.info()) gdp.head(2)
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 41 entries, 2007-01-01 to 2017-01-01 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gdp_growth 41 non-null float64 dtypes: float64(1) memory usage: 656.0 bytes None
You can use resample to convert this series to month start frequency, and then forward fill logic to fill the gaps. We’re using add_suffix to distinguish the column label from the variation that we’ll produce next.
gdp_1 = gdp.resample('MS').ffill().add_suffix('_ffill') gdp_1
121 rows × 1 columns
Resample also let’s you
interpolate the missing values, that is, fill in the values that lie on a straight line between existing quarterly growth rates. A look at the first few rows shows how interpolate averages existing values.
gdp_2 = gdp.resample('MS').interpolate().add_suffix('_interpolate') gdp_2
121 rows × 1 columns
Example showing the difference b/w ffill and interpolate
You have recently used the civilian US unemployment rate, and converted it from monthly to weekly frequency using simple forward or backfill methods.
Compare your previous approach to the new
# Inspect data here print(monthly.info()) # Create weekly dates weekly_dates = pd.date_range(monthly.index.min(), monthly.index.max(), freq='W') # Reindex monthly to weekly data weekly = monthly.reindex(weekly_dates) weekly
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 209 entries, 2000-01-01 to 2017-05-01 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 UNRATE 209 non-null float64 dtypes: float64(1) memory usage: 3.3 KB None
905 rows × 1 columns
# Create ffill and interpolated columns weekly['ffill'] = weekly.UNRATE.ffill() weekly['interpolated'] = weekly.UNRATE.interpolate() # Plot weekly weekly.plot(figsize=(10, 10)) plt.show()
Plot shows how the new data points generated by interpolate lie on the line between the existing points, whereas forward filling creates a step-like pattern.
Compare two time series with different frequencies
When we are given a quarterly timeseries and we want to compare it with a monthly timeseries, we can upsample the quarterly timeseries using
interpolate as we have seen above.
For this, we will use the GDP data (quarterly) and unemployment data (monthly).
data = pd.read_csv('data/stock_data/debt_unemployment.csv', index_col='date', parse_dates=True) print(data.info()) data
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 89 entries, 2010-01-01 to 2017-05-01 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Debt/GDP 29 non-null float64 1 Unemployment 89 non-null float64 dtypes: float64(2) memory usage: 2.1 KB None
89 rows × 2 columns
.interpolate() to data and assign this to interpolated, then inspect the result.
interpolated_data = data.interpolate() print(interpolated_data.info()) interpolated_data
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 89 entries, 2010-01-01 to 2017-05-01 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Debt/GDP 89 non-null float64 1 Unemployment 89 non-null float64 dtypes: float64(2) memory usage: 2.1 KB None
89 rows × 2 columns
# Plot interpolated data here interpolated_data.plot(secondary_y='Unemployment', figsize=(10, 10)); plt.show()
As unemployment goes down, debt-to-GDP-ratio goes up. Remember we started with quarterly debt/GDP data and we upsampled it to compare it with monthly unemployment data.