Pandas Indexes Demystified
Pandas Indexes
Pandas indexes are the most confusing thing about pandas. So let’s try to understand it. What are the advantages of using indices instead of just storing its’ values in columns ?
import pandas as pd
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()
country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
drinks.index
RangeIndex(start=0, stop=193, step=1)
Note: Every dataframe has an index attribute and a columns attribute
Note: Indexes are sometimes called the row-labels.
drinks.columns
Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
'total_litres_of_pure_alcohol', 'continent'],
dtype='object')
Index is type of special object, it is NOT refered to as “THE INDEX” though. When someone says “THE INDEX” or the row labels, they are talking about the row indexes.
Note: Neither the Index or the Columns are considered part of the dataframe contents. This can be seen when you run the shape command.
drinks.shape
(193, 6)
It turns out that both these Indexes and columns both default to integers. That is, if no index or columns are specified. For example, take a look at the movieusers dataframe
pd.read_table('http://bit.ly/movieusers', header=None, sep='|').head()
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 1 | 24 | M | technician | 85711 |
1 | 2 | 53 | F | other | 94043 |
2 | 3 | 23 | M | writer | 32067 |
3 | 4 | 24 | M | technician | 43537 |
4 | 5 | 33 | F | other | 15213 |
As we did not specify the header, the header and the row indexes are set to default integer values.
Note: Sometimes people just leave the default values for row indexes, but rarely they leave the default values for column names.
Coming back to our main question: Why does the Index exist ? There are 3 main reasons: - One is identification - Second is selection - Third is alignment
For identification, we are going to illustrate this by taking an example. We are going to do filtering the dataframe.
drinks[drinks.continent=='South America']
country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | |
---|---|---|---|---|---|---|
6 | Argentina | 193 | 25 | 221 | 8.3 | South America |
20 | Bolivia | 167 | 41 | 8 | 3.8 | South America |
23 | Brazil | 245 | 145 | 16 | 7.2 | South America |
35 | Chile | 130 | 124 | 172 | 7.6 | South America |
37 | Colombia | 159 | 76 | 3 | 4.2 | South America |
52 | Ecuador | 162 | 74 | 3 | 4.2 | South America |
72 | Guyana | 93 | 302 | 1 | 7.1 | South America |
132 | Paraguay | 213 | 117 | 74 | 7.3 | South America |
133 | Peru | 163 | 160 | 21 | 6.1 | South America |
163 | Suriname | 128 | 178 | 7 | 5.6 | South America |
185 | Uruguay | 115 | 35 | 220 | 6.6 | South America |
188 | Venezuela | 333 | 100 | 3 | 7.7 | South America |
The thing I want you to notice is that the Index (also known as the row labels), STAYED with the rows. It didn’t just re-number them as starting from 0. It kept the original row numbers. This is what we mean by, when we say that the Index is for “Identification”. It is so that we can identify what rows we are working with even if you filter the original data frame.
Next, let’s talk about “Selection”. And what I mean by this, “what if I want to grab a piece of this dataframe ?” . We are going to use our beloved loc
method to do that. The loc
method, allows me say “if I want number 245 in the above dataframe” then I can simply use the below line
drinks.loc[23, 'beer_servings']
245
So what’s the big deal here ? Why bother having an index ? Here’s the answer to it. Let me show you.
# We can set the index using set_index dataframe method
# When you set inplace=True, then you don't have to assign it to a variable. It does it inplace.
drinks.set_index('country', inplace=True)
drinks.head()
beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | |
---|---|---|---|---|---|
country | |||||
Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
Albania | 89 | 132 | 54 | 4.9 | Europe |
Algeria | 25 | 0 | 14 | 0.7 | Africa |
Andorra | 245 | 138 | 312 | 12.4 | Europe |
Angola | 217 | 57 | 45 | 5.9 | Africa |
We can see that the dataframe has changed. We can see above that the Series country has now become an index and the default (integer index) has disappeared.
drinks.index
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
...
'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
dtype='object', name='country', length=193)
Few things to observe here:
- It still has length 193
- It is of type ‘object’, which by the way is the same as default index, since it is an Index object.
- It has a new attribute called ‘name’
drinks.columns
Index(['beer_servings', 'spirit_servings', 'wine_servings',
'total_litres_of_pure_alcohol', 'continent'],
dtype='object')
Also note now that country is no longer listed as a column in the dataframe.
drinks.shape
(193, 5)
Also check the shape it now has 193 and only 5 columns. This is because the Index is now not part of the dataframe. Remember that.
So, because we have set the country as the index, we can now use our beloved loc
method to do something like this.
drinks.loc['Brazil', 'beer_servings']
245
So, by setting that index to something that was meaningful to us, we can now select data from the dataframe more easily!!!
Now, couple of issues to discuss, if you look at the dataframe output, you will notice that “country” is not listed in the columns row, it has a new separate row all for itself. Looks weird right ? That is actually the name of the index.
Now, you don’t need to have an index name. It is helpful, in that, it can serve as an identifier, that is it represents countries. But you can actually clear it out, if you so desire.
drinks.index.name = None
drinks.head()
beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | |
---|---|---|---|---|---|
Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
Albania | 89 | 132 | 54 | 4.9 | Europe |
Algeria | 25 | 0 | 14 | 0.7 | Africa |
Andorra | 245 | 138 | 312 | 12.4 | Europe |
Angola | 217 | 57 | 45 | 5.9 | Africa |
Now, let’s say you decide at some point that you want to use the default index and now you want to move that index (country index) back in as a column, then you do this.
Firstly, we need to give that index its’ name back!
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()
country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
And if you now check this out, we are back to our default integer index and the country which was the index, rejoined the dataframe as one of the columns. It is important to set the index name before we re-index, because pandas need to know what to call the new column.
Let’s do something extra. When we run the describe
method, see below
drinks.describe()
beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | |
---|---|---|---|---|
count | 193.000000 | 193.000000 | 193.000000 | 193.000000 |
mean | 106.160622 | 80.994819 | 49.450777 | 4.717098 |
std | 101.143103 | 88.284312 | 79.697598 | 3.773298 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 20.000000 | 4.000000 | 1.000000 | 1.300000 |
50% | 76.000000 | 56.000000 | 8.000000 | 4.200000 |
75% | 188.000000 | 128.000000 | 59.000000 | 7.200000 |
max | 376.000000 | 438.000000 | 370.000000 | 14.400000 |
It is just a numerical summary of the numerical columns. And, I want you to notice that this is actually a dataframe. And as such, it has an index.
drinks.describe().index
Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
This is the index!!
And these are the columns
drinks.describe().columns
Index(['beer_servings', 'spirit_servings', 'wine_servings',
'total_litres_of_pure_alcohol'],
dtype='object')
drinks.describe().loc[['mean','count'],:]
beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | |
---|---|---|---|---|
mean | 106.160622 | 80.994819 | 49.450777 | 4.717098 |
count | 193.000000 | 193.000000 | 193.000000 | 193.000000 |
So, the point here is not so much that you are going to do something with the describe output! You might, like above. But the main point here is that, a lot of methods in pandas return a dataframe. And when you know that there is something like a index and columns associated with that dataframe, and you recognize that, then you can interact with that resulting dataframe in cool ways.
A SERIES ALSO HAS AN INDEX
The last topic is “Arrangement”. What I want to show you, is that, a Series like a Dataframe also has an Index.
The Series index comes from the dataframe!!
drinks.head()
country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
Let’s say now I pick the continent series.
drinks.continent.head()
0 Asia
1 Europe
2 Africa
3 Europe
4 Africa
Name: continent, dtype: object
What we see here is that there is an Index for this Series and it came from the dataframe.
So the Index is on the left and the values are on the right.
Now, let’s pretend that we didn’t use the default index for the dataframe, and instead, we use the country as the index for the dataframe.
drinks.set_index('country', inplace=True)
drinks.head()
beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | |
---|---|---|---|---|---|
country | |||||
Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
Albania | 89 | 132 | 54 | 4.9 | Europe |
Algeria | 25 | 0 | 14 | 0.7 | Africa |
Andorra | 245 | 138 | 312 | 12.4 | Europe |
Angola | 217 | 57 | 45 | 5.9 | Africa |
SO, what is going to happen, if we are going to select the continent Series ??
drinks.continent.head()
country
Afghanistan Asia
Albania Europe
Algeria Africa
Andorra Europe
Angola Africa
Name: continent, dtype: object
This time, we are seeing the same thing as last time, the INDEX is passed on from the dataframe. The index is attached to each row!
drinks.continent.value_counts()
Africa 53
Europe 45
Asia 44
North America 23
Oceania 16
South America 12
Name: continent, dtype: int64
This is also a Series, and as such, it has an index.
drinks.continent.value_counts().index
Index(['Africa', 'Europe', 'Asia', 'North America', 'Oceania',
'South America'],
dtype='object')
And it has VALUES (to the right remember).
drinks.continent.value_counts().values
array([53, 45, 44, 23, 16, 12])
NOW because it is a SERIES, and not some value_counts object or something, we can use the index to select values from the Series. What I mean is, I can take this values
array and refer an index, such as ‘Africa’.
drinks.continent.value_counts()['Africa']
53
What I am saying is: “From this Series find index Africa and show me the value”
Sorting
drinks.continent.value_counts().sort_values()
South America 12
Oceania 16
North America 23
Asia 44
Europe 45
Africa 53
Name: continent, dtype: int64
Using the sort_values
method of the Series object, we can sort the VALUES of the Series.
WHAT IF YOU WANTED TO SORT ON THE INDEX ??
drinks.continent.value_counts().sort_index()
Africa 53
Asia 44
Europe 45
North America 23
Oceania 16
South America 12
Name: continent, dtype: int64
Remember the 3 reasons why an Index exists:
- Identification
- Selection
- Allignment
Allignment
In order to understand alignment, we first need to create another Series object.
people = pd.Series([300000, 85000], index=['Albania', 'Andorra'], name='population')
people
Albania 300000
Andorra 85000
Name: population, dtype: int64
Let’s say I want to use this tiny dataset and multiply this with drinks.beer_servings
to get the total number of people who were served beer.
drinks.beer_servings * people
Afghanistan NaN
Albania 26700000.0
Algeria NaN
Andorra 20825000.0
Angola NaN
Antigua & Barbuda NaN
Argentina NaN
Armenia NaN
Australia NaN
Austria NaN
Azerbaijan NaN
Bahamas NaN
Bahrain NaN
Bangladesh NaN
Barbados NaN
Belarus NaN
Belgium NaN
Belize NaN
Benin NaN
Bhutan NaN
Bolivia NaN
Bosnia-Herzegovina NaN
Botswana NaN
Brazil NaN
Brunei NaN
Bulgaria NaN
Burkina Faso NaN
Burundi NaN
Cabo Verde NaN
Cambodia NaN
...
Sudan NaN
Suriname NaN
Swaziland NaN
Sweden NaN
Switzerland NaN
Syria NaN
Tajikistan NaN
Tanzania NaN
Thailand NaN
Timor-Leste NaN
Togo NaN
Tonga NaN
Trinidad & Tobago NaN
Tunisia NaN
Turkey NaN
Turkmenistan NaN
Tuvalu NaN
USA NaN
Uganda NaN
Ukraine NaN
United Arab Emirates NaN
United Kingdom NaN
Uruguay NaN
Uzbekistan NaN
Vanuatu NaN
Venezuela NaN
Vietnam NaN
Yemen NaN
Zambia NaN
Zimbabwe NaN
dtype: float64
The countries not represented in the Series: people are ignored. And only the countries represented in the people series are multiplied. So how does pandas do this ? So here’s the thing, it aligned the people
series with the drink series. That is, it found Albania in the drink
series and then multiplied by the population value in the people
series.
So, in summary, alignment allows us to put the data together even if it not the same length. As long as you tell it, which rows correspond to which other rows. Because people
and drinks
series use country as the index and since we specified the index in people
, pandas used that index to match the rows in drinks
series.
One more example to hammer the point home. Suppose lets say you want to add the people Series to the drinks dataframe as a new column. To do this, we use the pd.concat method.
pd.concat([drinks, people], axis=1).head()
beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | population | |
---|---|---|---|---|---|---|
Afghanistan | 0 | 0 | 0 | 0.0 | Asia | NaN |
Albania | 89 | 132 | 54 | 4.9 | Europe | 300000.0 |
Algeria | 25 | 0 | 14 | 0.7 | Africa | NaN |
Andorra | 245 | 138 | 312 | 12.4 | Europe | 85000.0 |
Angola | 217 | 57 | 45 | 5.9 | Africa | NaN |
Notice how pandas automatically knew how to align the values in the merged dataframe. It does so, because of the index. Without the index, pandas wouldn’t know where to put them. So that is the beauty and elegance of having an index. Hope that explains it.