# Basics of manipulating time series

This post lays the foundations for the powerful time series functionality that pandas makes available through its representation of dates, in particular the **`DatetimeIndex`**. We will first cover the basic time series *methods* and *transformations*, then work through several examples that demonstrate the power of pandas's `DatetimeIndex`.

## Basic methods: parsing dates, slicing, and changing frequency

Basic methods include:

- parsing dates provided as strings
- converting the result to the matching pandas data type, `datetime64`
- selecting subperiods of your time series
- setting or changing the frequency of the `DatetimeIndex`
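As a minimal sketch of the parse-and-index step (using an inline DataFrame instead of a file):

```python
import pandas as pd

# Dates often arrive as plain strings (dtype 'object')
df = pd.DataFrame({'date': ['2017-01-01', '2017-01-02', '2017-01-03'],
                   'value': [1.0, 2.0, 3.0]})

# Parse the strings into datetime64 and promote the column to the index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

print(df.index.dtype)   # datetime64[ns]
```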

You can change the frequency to a higher or lower value:

- **Upsampling** involves increasing the time frequency, which requires generating new data.
- **Downsampling** means decreasing the time frequency, which requires aggregating data.
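A small sketch of both directions on a synthetic daily series (the data here is made up):

```python
import pandas as pd

# One month of synthetic daily data
idx = pd.date_range('2020-01-01', '2020-01-31', freq='D')
s = pd.Series(range(len(idx)), index=idx, dtype=float)

# Upsampling: hourly rows are inserted and filled with NaN
hourly = s.asfreq('h')

# Downsampling with asfreq keeps only the rows at the new frequency;
# to aggregate instead, use resample
weekly = s.asfreq('W')                 # the Sunday values only
weekly_mean = s.resample('W').mean()   # weekly averages
```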

### Example: Air Quality dataset

Our first dataset contains air quality data (ozone, pm25, and carbon monoxide for NYC, 2000-2017). You will often have to deal with dates that are of type object, or string. You'll notice a column called 'date' that is of data type 'object'. However, when you print the first few rows using `.head()`, you see that it contains dates. To convert the strings to the correct data type, use `pd.to_datetime()`.

```
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data/air_quality_data/nyc.csv')
# Inspect data
print(data.info())
# Convert the date column to datetime64
data['date'] = pd.to_datetime(data['date'])
# Set date column as index
data.set_index('date', inplace=True)
# Inspect data
print(data.info())
# Plot data
data.plot(subplots=True, figsize=(10,10))
plt.show()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6317 entries, 0 to 6316
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 6317 non-null object
1 ozone 6317 non-null float64
2 pm25 6317 non-null float64
3 co 6317 non-null float64
dtypes: float64(3), object(1)
memory usage: 197.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6317 entries, 1999-07-01 to 2017-03-31
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ozone 6317 non-null float64
1 pm25 6317 non-null float64
2 co 6317 non-null float64
dtypes: float64(3)
memory usage: 197.4 KB
None
```

### Partial string indexing example: Compare stock price over three years

For this, we will look at the `yahoo` stock price data. We will compare how the same stock price did over three years and plot it.

```
yahoo = pd.read_csv('data/stock_data/yahoo.csv')
yahoo['date'] = pd.to_datetime(yahoo['date'])
yahoo.set_index('date', inplace=True)
yahoo.head()
```

| date | price |
|---|---|
| 2013-01-02 | 20.08 |
| 2013-01-03 | 19.78 |
| 2013-01-04 | 19.86 |
| 2013-01-07 | 19.40 |
| 2013-01-08 | 19.66 |

You can use strings that represent a complete date, or relevant **parts of a date**. For example, if you just pass a string representing a *year*, pandas returns all dates within this year.

If you pass **a slice** that starts with one month and ends at another, you get all dates within that range. Note that the slice is inclusive of the end date, unlike most other slicing in Python.

You can also use `.loc` with a complete date and a column label to select a specific stock price.
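A self-contained sketch with made-up prices illustrates all three selection styles:

```python
import pandas as pd

# Synthetic prices on a DatetimeIndex (the real file is not needed here)
idx = pd.to_datetime(['2013-01-02', '2013-06-15', '2013-12-31', '2014-01-02'])
yahoo = pd.DataFrame({'price': [20.08, 25.00, 27.50, 28.00]}, index=idx)

print(yahoo.loc['2013'])               # every row in 2013
print(yahoo.loc['2013-01':'2013-06'])  # the slice includes the end month
print(yahoo.loc['2013-01-02', 'price'])
```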

```
# Create an empty DataFrame to collect each year's prices
prices = pd.DataFrame()
# Select data for each year and concatenate with prices
for year in ['2013', '2014', '2015']:
    price_per_year = yahoo.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)
# Plot prices
prices.plot(figsize=(10, 10))
plt.show()
```

### Set and change time series frequency using `asfreq()`

You may have noticed that our `DatetimeIndex` did not have frequency information. You can set the frequency using `.asfreq()`. The alias `'D'` stands for calendar day frequency. As a result, the `DatetimeIndex` now contains many dates on which the stock wasn't bought or sold.

These new dates have missing values. This is also called **upsampling**, because the new DataFrame has a higher frequency than the original version.

Now we will see how to assign a frequency to a `DatetimeIndex`, and then change this frequency. For this we use data on the daily carbon monoxide concentration in NYC, LA, and Chicago from 2005 to 2010.

We'll set the frequency to calendar daily, then change it to monthly, and visualize both series to see how the different frequencies affect the data.

```
co = pd.read_csv('data/air_quality_data/co_cities.csv', index_col='date', parse_dates=True)
```

```
co.info()
```

```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1898 entries, 2005-01-01 to 2010-12-31
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Chicago 1898 non-null float64
1 Los Angeles 1898 non-null float64
2 New York 1898 non-null float64
dtypes: float64(3)
memory usage: 59.3 KB
```

```
co.head()
```

| date | Chicago | Los Angeles | New York |
|---|---|---|---|
| 2005-01-01 | 0.317763 | 0.777657 | 0.639830 |
| 2005-01-03 | 0.520833 | 0.349547 | 0.969572 |
| 2005-01-04 | 0.477083 | 0.626630 | 0.905208 |
| 2005-01-05 | 0.348822 | 0.613814 | 0.769176 |
| 2005-01-06 | 0.572917 | 0.792596 | 0.815761 |

```
# Inspect data
print(co.info())
# Set the frequency to calendar daily
co = co.asfreq('D')
# Plot the data
co.plot(subplots=True, figsize=(10, 10))
```

```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2191 entries, 2005-01-01 to 2010-12-31
Freq: D
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Chicago 1898 non-null float64
1 Los Angeles 1898 non-null float64
2 New York 1898 non-null float64
dtypes: float64(3)
memory usage: 68.5 KB
None
array([<AxesSubplot:xlabel='date'>, <AxesSubplot:xlabel='date'>,
<AxesSubplot:xlabel='date'>], dtype=object)
```

```
# Set frequency to monthly; .asfreq('M') selects the value at each month end
# rather than aggregating
co = co.asfreq('M')
# Inspect data
print(co.info())
# Plot the data
co.plot(subplots=True, figsize=(10, 10))
```

```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 72 entries, 2005-01-31 to 2010-12-31
Freq: M
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Chicago 68 non-null float64
1 Los Angeles 68 non-null float64
2 New York 68 non-null float64
dtypes: float64(3)
memory usage: 2.2 KB
None
array([<AxesSubplot:xlabel='date'>, <AxesSubplot:xlabel='date'>,
<AxesSubplot:xlabel='date'>], dtype=object)
```

## Manipulating time series: lags, shifts, changes, and percent changes

Now, we will begin to manipulate time series data. In particular, we will see:

- how to **move our data across time** so that you can **compare values at different points in time**. This involves **shifting** values into the future, or creating **lags** by moving data into the past.
- how to calculate changes between values at different points in time.
- how to calculate the change between values in percentage terms, also called the **rate of growth**.

Pandas has built-in methods for these calculations that leverage the `DatetimeIndex`.

### Shifting and Lagging to compare

The first method for manipulating time series is `.shift()`, which allows you to shift all values in a Series or DataFrame by a number of `periods` to a different time along the `DatetimeIndex`.
```
# Import data here
google = pd.read_csv('data/stock_data/google.csv', parse_dates=['Date'], index_col='Date')
# Set data frequency to business daily
google = google.asfreq('B')
# Create 'lagged' and 'shifted'
google['lagged'] = google.Close.shift(periods=-90)
google['shifted'] = google.Close.shift(periods=90)
# Plot the google price series
google.plot(figsize=(10,10))
```

```
<AxesSubplot:xlabel='Date'>
```

```
google.head()
```

| Date | Close | lagged | shifted |
|---|---|---|---|
| 2014-01-02 | 556.00 | 511.00 | NaN |
| 2014-01-03 | 551.95 | 518.73 | NaN |
| 2014-01-06 | 558.10 | 529.92 | NaN |
| 2014-01-07 | 568.86 | 533.09 | NaN |
| 2014-01-08 | 570.04 | 526.65 | NaN |

```
google.tail()
```

| Date | Close | lagged | shifted |
|---|---|---|---|
| 2016-12-26 | NaN | NaN | 772.15 |
| 2016-12-27 | 791.55 | NaN | 772.08 |
| 2016-12-28 | 785.05 | NaN | 769.64 |
| 2016-12-29 | 782.79 | NaN | 769.41 |
| 2016-12-30 | 771.82 | NaN | 769.54 |

Fantastic! Now you can visually compare the time series to itself at different points in time.

The ‘shifted’ version of the stock price has all prices moved by 90 periods into the future. As a result, the first 90 values in the series are now missing.

In contrast, the lagged version of the stock price is moved 90 periods into the past. In this case, the last 90 values are now missing. To shift data into the past, use negative period numbers.

Shifting data is useful to compare data at different points in time.

### One-period percent change

**Calculate one-period percent change**: You can, for instance, calculate the rate of change from period to period, which in finance is called the return.

The method `.div()` allows you to divide a Series not only by a scalar, but also by another Series, for instance another column in the same DataFrame. Pandas makes sure the dates for both series match up, and divides the aligned values accordingly. As a result, you get the relative change from the last period for every price, that is, the **factor by which you need to multiply the last price to get the current price**.
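A sketch of this on a tiny made-up price series:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])

# Divide each price by the previous one; pandas aligns the two series
relative = prices.div(prices.shift(1))
print(relative.tolist())   # [nan, 1.1, 0.9]
```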

### Relative change in percent terms

**Calculate the relative change in percent terms**: As you have seen before, you can chain all DataFrame methods that return a DataFrame. The returned DataFrame will be used as input for the next calculation. Here, we are subtracting 1 and multiplying the result by 100 to obtain the relative change in percentage terms.
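The chain itself, sketched on a toy series:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])

# current/previous -> subtract 1 -> scale to percent
pct = prices.div(prices.shift(1)).sub(1).mul(100)
print(pct.round(6).tolist())   # [nan, 10.0, -10.0]
```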

### Using the pandas `.diff()` method

**diff method**: Another time series method is `.diff()`, which calculates the change between values at different points in time. By default, `.diff()` takes the difference in value between two adjacent periods.
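A quick sketch of the default one-period difference:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])
diffs = prices.diff()
print(diffs.tolist())   # [nan, 10.0, -11.0]
```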

### Using the pandas `.pct_change()` method

**pct_change**: Finally, since it is such a common operation, pandas has a built-in method to calculate the percent change directly. Just select a column and call `.pct_change()`. Multiply by 100 to get the same result as we got before when we calculated the returns.
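Sketched on a toy series:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])

returns = prices.pct_change().mul(100)
print(returns.round(6).tolist())   # [nan, 10.0, -10.0]
```

This matches the manual `.div(...).sub(1).mul(100)` chain shown earlier.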

**The `periods` keyword**: All these methods have a `periods` keyword, which you have already seen for `.shift()`, and which defaults to 1. If you provide a higher value, you can calculate returns or percent changes for data points several periods apart, for example for prices three trading days apart.
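For example, with `periods=3` on toy numbers:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 105.0, 110.0])

# Percent change between prices three periods apart
three_period = prices.pct_change(periods=3).mul(100)
print(three_period.round(6).tolist())   # [nan, nan, nan, 10.0]
```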

### Example illustrating `.diff()`

Let's load Yahoo stock prices for the years 2013 to 2015, set the frequency to business daily, and assign the result to `yahoo`.

```
yahoo = pd.read_csv('data/stock_data/yahoo.csv')
yahoo['date'] = pd.to_datetime(yahoo['date'])
yahoo.set_index('date', inplace=True)
# Set the frequency to business daily
yahoo = yahoo.asfreq('B')
yahoo
```

| date | price |
|---|---|
| 2013-01-02 | 20.08 |
| 2013-01-03 | 19.78 |
| 2013-01-04 | 19.86 |
| 2013-01-07 | 19.40 |
| 2013-01-08 | 19.66 |
| ... | ... |
| 2015-12-25 | NaN |
| 2015-12-28 | 33.60 |
| 2015-12-29 | 34.04 |
| 2015-12-30 | 33.37 |
| 2015-12-31 | 33.26 |

782 rows × 1 columns

```
# Create 'shifted_30': the price shifted 30 business days into the future
yahoo['shifted_30'] = yahoo.price.shift(periods=30)
# Subtract 'shifted_30' from 'price' to get the 30-day change
yahoo['change_30'] = yahoo.price - yahoo.shifted_30
# Get the same 30-day difference with .diff()
yahoo['diff_30'] = yahoo.price.diff(periods=30)
# Inspect the last five rows
yahoo.tail()
```

| date | price | shifted_30 | change_30 | diff_30 |
|---|---|---|---|---|
| 2015-12-25 | NaN | 32.19 | NaN | NaN |
| 2015-12-28 | 33.60 | 32.94 | 0.66 | 0.66 |
| 2015-12-29 | 34.04 | 32.86 | 1.18 | 1.18 |
| 2015-12-30 | 33.37 | 32.98 | 0.39 | 0.39 |
| 2015-12-31 | 33.26 | 32.62 | 0.64 | 0.64 |

```
# Show the value_counts of the difference between change_30 and diff_30
yahoo['change_30'].sub(yahoo['diff_30']).value_counts()
```

```
0.0 703
dtype: int64
```

There’s usually more than one way to get to the same result when working with data.

### Finding patterns in time series using `.pct_change()`

We can use the `.pct_change()` method to calculate returns over various calendar-day periods, and plot the result to compare the different patterns. We will use the 'GOOG' stock prices for the years 2014-2016, set the frequency to calendar daily, and assign the result to `google`.

```
# Import data here
google = pd.read_csv('data/stock_data/google.csv', parse_dates=['Date'], index_col='Date')
# Set data frequency to calendar daily
google = google.asfreq('D')
```

Let's create the columns `'daily_return'`, `'monthly_return'`, and `'annual_return'` that contain the `pct_change()` of 'Close' for 1, 30, and 360 calendar days, respectively, and multiply each by 100.

```
# Create daily_return
google['daily_return'] = google.Close.pct_change(periods=1).multiply(100)
# Create monthly_return
google['monthly_return'] = google.Close.pct_change(periods=30).multiply(100)
# Create annual_return
google['annual_return'] = google.Close.pct_change(periods=360).multiply(100)
# Plot the result
google.plot(subplots=True, figsize=(10,10))
plt.show()
```

Now we can see how the stock price returns behaved over the three horizons: daily, monthly, and annual.