# Correlation and Regression of two time series

## Correlation and Regression of two time series

### Correlation of two time series

Oftentimes, two time series vary together. For example, Consider the stock prices of `JP Morgan`

and the `S&P500`

. In general, when the market drops, JP Morgan drops as well, and when the market rises, JP Morgan also rises.

We can get this data from Yahoo Finance. I have pulled the stock price data fro `JPM`

and `SP500`

from `2010 to 2020`

.

```
import pandas as pd
# jpm data
jpm = pd.read_csv('data/JPM.csv', index_col='Date', parse_dates=True)
jpm = jpm[['Close']]
jpm.columns = ['JPM']
# sp500 data
sp500 = pd.read_csv('data/sp500_2020.csv', index_col='Date', parse_dates=True)
sp500 = sp500[['Close']]
sp500.columns = ['SP500']
# join both
data = jpm.join(sp500, how='inner')
data.head()
```

JPM | SP500 | |
---|---|---|

Date | ||

2010-01-14 | 44.689999 | 1148.459961 |

2010-01-15 | 43.680000 | 1136.030029 |

2010-01-19 | 43.279999 | 1150.229980 |

2010-01-20 | 43.400002 | 1138.040039 |

2010-01-21 | 40.540001 | 1116.479980 |

### SP500 (market) vs JPM stock in the last decade

```
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
ax1 = data.SP500.plot(color='blue', grid=True, label='SP500')
ax2 = data.JPM.plot(color='red', grid=True, secondary_y=True, label='JPM')
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
plt.legend(h1+h2, l1+l2, loc=2)
plt.show()
```

### Common Mistake: Correlation of Two Trending Series

Consider two time series that are both trending. Even if the two series are totally unrelated, you could still get a very high correlation. That’s why, when you look at the correlation of say, two stocks, you should look at the correlation of their *returns*, not their levels. For instance, the two series, **stock prices and UFO sightings**, both trend up over time. Of course, there is no relationship between those two series, but the correlation is high (0.94). If you compute the correlation of percent changes, the correlation goes down to approximately zero.

### Scatter plot of returns

So, let’s first calculate the returns of these two stocks.

```
daily_returns = data.pct_change().mul(100)
daily_returns.head()
```

JPM | SP500 | |
---|---|---|

Date | ||

2010-01-14 | NaN | NaN |

2010-01-15 | -2.260011 | -1.082313 |

2010-01-19 | -0.915753 | 1.249963 |

2010-01-20 | 0.277271 | -1.059783 |

2010-01-21 | -6.589864 | -1.894490 |

A scatter plot of the returns of JP Morgan and the returns of the market help to visualize the relationship between the two time series.

```
daily_returns.plot.scatter(x='JPM', y='SP500', figsize=(10, 10))
plt.title('JPM vs SP500 daily returns scatter plot')
```

```
Text(0.5, 1.0, 'JPM vs SP500 daily returns scatter plot')
```

### Calculate the correlation b/w JPM and SP500 returns

The correlation coefficient is a measure of how much two series vary together. A correlation of one means that the two series have a perfect linear relationship with no deviations. High correlations mean that the two series strongly vary together. A low correlation means they vary together, but there is a weak association. And a high negative correlation means they vary in opposite directions, but still with a linear relationship.

```
print(daily_returns['JPM'].corr(daily_returns['SP500']))
```

```
0.7790472716260044
```

## Simple Linear Regressions of Time Series

A simple linear regression finds the slope, `beta`

, and intercept, `alpha`

, of **a line** that’s the best fit between a dependent variable, `y`

, and an independent variable, `x`

.

The x’s and y’s can be two time series.

**A word about Regression**: A linear regression is also known as Ordinary Least Squares, or OLS, because it minimizes the sum of the squared distances between the *data points and the regression line*.

Regression techniques are very common, and therefore there are many packages in Python that can be used. In statsmodels, there is OLS. In numpy, there is polyfit, and if you set degree equals 1, it fits the data to a line, which is a linear regression. Pandas has an ols method, and scipy has a linear regression function. Beware that the order of x and y is not consistent across packages.

We will be using the `statsmodel`

package in this post, as it is very similar to other statistical tools like R, SAS and MATLAB.

### Example: Regression of Small Cap Returns on Large Cap

Now let’s assume that we’ll regress the returns of the small cap stocks on the returns of large cap stocks. As a first step, we compute returns from prices using the `pct_change`

method in pandas. We need to add a column of ones as a dependent, right hand side variable. The reason we have to do this is because the regression function assumes that if there is no constant column, then we want to run the regression without an intercept.

By adding a column of ones, `statsmodels`

will compute the *regression coefficient of that column as well*, which can be interpreted as the intercept of the line. The statsmodels method `add_constant()`

is a simple way to add a constant.

### Example using Regression of two time series

R-squared measures how closely the data fit the regression line.

So, the R-squared in a simple regression is related to the correlation between the two variables (time series).

In particular, the magnitude of the correlation is the square root of the R-squared and the sign of the correlation is the sign of the regression coefficient.

In this example, we will start using the statistical package statsmodels, which performs much of the statistical modeling and testing that is found in R and software packages like SAS and MATLAB.

We will take two series, `x`

(`GOOG`

stock) and `y`

(`FB`

stock), compute their correlation, and then regress y on x using the function `OLS(y,x)`

in the `statsmodels.api`

library. Most linear regressions contain a constant term which is the intercept. To include a constant using the function `OLS()`

, we need to add a column of `1's`

to the right hand size of the regression. To add a column of `1's`

use `add_constant()`

as shown below.

```
import pandas as pd
df = pd.read_csv('data/stock_data/goog_fb.csv', index_col='date', parse_dates=True)
print(df.head(2))
print(df.tail(2))
df.plot(figsize=(10, 10))
```

```
FB GOOG
date
2015-01-02 78.45 524.81
2015-01-05 77.19 513.87
FB GOOG
date
2016-12-29 116.35 782.79
2016-12-30 115.05 771.82
<AxesSubplot:xlabel='date'>
```

```
# Import the statsmodels module
import statsmodels.api as sm
# Compute correlation of x and y
x = pd.Series(df['GOOG'].values)
y = pd.Series(df['FB'].values)
correlation = x.corr(y)
print("The correlation between x and y is %4.2f" %(correlation))
# Convert the Series x to a DataFrame and name the column x
dfx = pd.DataFrame(x, columns=['x'])
# Add a constant to the DataFrame dfx
dfx1 = sm.add_constant(dfx)
# Regress y on dfx1
result = sm.OLS(y, dfx1).fit()
# Print out the results and look at the relationship between R-squared and the correlation above
print(result.summary())
```

```
The correlation between x and y is 0.93
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.873
Model: OLS Adj. R-squared: 0.873
Method: Least Squares F-statistic: 3444.
Date: Thu, 18 Feb 2021 Prob (F-statistic): 6.74e-227
Time: 14:10:19 Log-Likelihood: -1625.3
No. Observations: 504 AIC: 3255.
Df Residuals: 502 BIC: 3263.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -11.0498 1.961 -5.636 0.000 -14.902 -7.198
x 0.1693 0.003 58.684 0.000 0.164 0.175
==============================================================================
Omnibus: 18.447 Durbin-Watson: 0.064
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10.496
Skew: -0.179 Prob(JB): 0.00526
Kurtosis: 2.390 Cond. No. 4.91e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```

So, in this case, the slope of the regression is coefficient of x `0.1693`

. We can also pull individual params from the result, like so:

```
result.params
```

```
const -11.049824
x 0.169329
dtype: float64
```

Another statistic to take note of is the R-squared of `0.873`

. There is a relationship b/w R-squared and Correlation. Square root of R-square will give the magnitude of correlation. As shown here

```
import numpy as np
np.sqrt(0.873)
```

```
0.9343446901438462
```

## Summary

Generally from a scatter plot, we can see that the correlation measures how closely the data are clustered along a line. The R-squared also measures how well the linear regression line fits the data. So as you would expect, there is a relationship between correlation and R-squared.

The magnitude of the correlation is the square root of the R-squared. And the sign of the correlation is the sign of the slope of the regression line.

If the regression line is positively sloped, the correlation is positive and if the the regression line is negatively sloped, the correlation is negative. In the example you just analyzed

## Appendix

Pandas helps make many time series tasks quick and efficient. For instance, we routinely encounter a situation where you need to merge two time series. But before you merge

### Merging Time Series With Different Dates

Stock and bond markets in the U.S. are closed on different days. For example, although the bond market is closed on Columbus Day (around Oct 12) and Veterans Day (around Nov 11), the stock market is open on those days. One way to see the dates that the stock market is open and the bond market is closed is to convert both indexes of dates into sets and take the difference in sets.

The pandas `.join()`

method is a convenient tool to merge the stock and bond DataFrames on dates when both markets are open.

- Take the difference of the stock set minus the bond set to get those dates where the stock market has data but the bond market does not.
- Merge the two DataFrames into a new DataFrame, stocks_and_bonds using the
`.join()`

method, which has the syntax`df1.join(df2)`

.- To get the intersection of dates, use the argument
`how='inner'`

- To get the intersection of dates, use the argument

```
import pandas as pd
# Convert the stock index and bond index into sets
set_stock_dates = set(stocks.index)
set_bond_dates = set(bonds.index)
# Take the difference between the sets and print
print(set_stock_dates - set_bond_dates)
# Merge stocks and bonds DataFrames using join()
stocks_and_bonds = stocks.join(bonds, how='inner')
```