Pandas Indexes Demystified

10 minute read

Pandas Indexes

Pandas indexes are the most confusing thing about pandas. So let’s try to understand it. What are the advantages of using indices instead of just storing its’ values in columns ?

import pandas as pd
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
drinks.index
RangeIndex(start=0, stop=193, step=1)

Note: Every dataframe has an index attribute and a columns attribute

Note: Indexes are sometimes called the row-labels.

drinks.columns
Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

Index is type of special object, it is NOT refered to as “THE INDEX” though. When someone says “THE INDEX” or the row labels, they are talking about the row indexes.

Note: Neither the Index or the Columns are considered part of the dataframe contents. This can be seen when you run the shape command.

drinks.shape
(193, 6)

It turns out that both these Indexes and columns both default to integers. That is, if no index or columns are specified. For example, take a look at the movieusers dataframe

pd.read_table('http://bit.ly/movieusers', header=None, sep='|').head()
0 1 2 3 4
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213

As we did not specify the header, the header and the row indexes are set to default integer values.

Note: Sometimes people just leave the default values for row indexes, but rarely they leave the default values for column names.

Coming back to our main question: Why does the Index exist ? There are 3 main reasons: - One is identification - Second is selection - Third is alignment

For identification, we are going to illustrate this by taking an example. We are going to do filtering the dataframe.

drinks[drinks.continent=='South America']
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
6 Argentina 193 25 221 8.3 South America
20 Bolivia 167 41 8 3.8 South America
23 Brazil 245 145 16 7.2 South America
35 Chile 130 124 172 7.6 South America
37 Colombia 159 76 3 4.2 South America
52 Ecuador 162 74 3 4.2 South America
72 Guyana 93 302 1 7.1 South America
132 Paraguay 213 117 74 7.3 South America
133 Peru 163 160 21 6.1 South America
163 Suriname 128 178 7 5.6 South America
185 Uruguay 115 35 220 6.6 South America
188 Venezuela 333 100 3 7.7 South America

The thing I want you to notice is that the Index (also known as the row labels), STAYED with the rows. It didn’t just re-number them as starting from 0. It kept the original row numbers. This is what we mean by, when we say that the Index is for “Identification”. It is so that we can identify what rows we are working with even if you filter the original data frame.

Next, let’s talk about “Selection”. And what I mean by this, “what if I want to grab a piece of this dataframe ?” . We are going to use our beloved loc method to do that. The loc method, allows me say “if I want number 245 in the above dataframe” then I can simply use the below line

drinks.loc[23, 'beer_servings']
245

So what’s the big deal here ? Why bother having an index ? Here’s the answer to it. Let me show you.

# We can set the index using set_index dataframe method
# When you set inplace=True, then you don't have to assign it to a variable. It does it inplace.
drinks.set_index('country', inplace=True)
drinks.head()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
country
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa

We can see that the dataframe has changed. We can see above that the Series country has now become an index and the default (integer index) has disappeared.

drinks.index
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
       'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=193)

Few things to observe here:

  • It still has length 193
  • It is of type ‘object’, which by the way is the same as default index, since it is an Index object.
  • It has a new attribute called ‘name’
drinks.columns
Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

Also note now that country is no longer listed as a column in the dataframe.

drinks.shape
(193, 5)

Also check the shape it now has 193 and only 5 columns. This is because the Index is now not part of the dataframe. Remember that.

So, because we have set the country as the index, we can now use our beloved loc method to do something like this.

drinks.loc['Brazil', 'beer_servings']
245

So, by setting that index to something that was meaningful to us, we can now select data from the dataframe more easily!!!

Now, couple of issues to discuss, if you look at the dataframe output, you will notice that “country” is not listed in the columns row, it has a new separate row all for itself. Looks weird right ? That is actually the name of the index.

Now, you don’t need to have an index name. It is helpful, in that, it can serve as an identifier, that is it represents countries. But you can actually clear it out, if you so desire.

drinks.index.name = None
drinks.head()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa

Now, let’s say you decide at some point that you want to use the default index and now you want to move that index (country index) back in as a column, then you do this.

Firstly, we need to give that index its’ name back!

drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

And if you now check this out, we are back to our default integer index and the country which was the index, rejoined the dataframe as one of the columns. It is important to set the index name before we re-index, because pandas need to know what to call the new column.

Let’s do something extra. When we run the describe method, see below

drinks.describe()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
count 193.000000 193.000000 193.000000 193.000000
mean 106.160622 80.994819 49.450777 4.717098
std 101.143103 88.284312 79.697598 3.773298
min 0.000000 0.000000 0.000000 0.000000
25% 20.000000 4.000000 1.000000 1.300000
50% 76.000000 56.000000 8.000000 4.200000
75% 188.000000 128.000000 59.000000 7.200000
max 376.000000 438.000000 370.000000 14.400000

It is just a numerical summary of the numerical columns. And, I want you to notice that this is actually a dataframe. And as such, it has an index.

drinks.describe().index
Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')

This is the index!!

And these are the columns

drinks.describe().columns
Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol'],
      dtype='object')
drinks.describe().loc[['mean','count'],:]
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
mean 106.160622 80.994819 49.450777 4.717098
count 193.000000 193.000000 193.000000 193.000000

So, the point here is not so much that you are going to do something with the describe output! You might, like above. But the main point here is that, a lot of methods in pandas return a dataframe. And when you know that there is something like a index and columns associated with that dataframe, and you recognize that, then you can interact with that resulting dataframe in cool ways.

A SERIES ALSO HAS AN INDEX

The last topic is “Arrangement”. What I want to show you, is that, a Series like a Dataframe also has an Index.

The Series index comes from the dataframe!!

drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa

Let’s say now I pick the continent series.

drinks.continent.head()
0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object

What we see here is that there is an Index for this Series and it came from the dataframe.

So the Index is on the left and the values are on the right.

Now, let’s pretend that we didn’t use the default index for the dataframe, and instead, we use the country as the index for the dataframe.

drinks.set_index('country', inplace=True)
drinks.head()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
country
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa

SO, what is going to happen, if we are going to select the continent Series ??

drinks.continent.head()
country
Afghanistan      Asia
Albania        Europe
Algeria        Africa
Andorra        Europe
Angola         Africa
Name: continent, dtype: object

This time, we are seeing the same thing as last time, the INDEX is passed on from the dataframe. The index is attached to each row!

drinks.continent.value_counts()
Africa           53
Europe           45
Asia             44
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

This is also a Series, and as such, it has an index.

drinks.continent.value_counts().index
Index(['Africa', 'Europe', 'Asia', 'North America', 'Oceania',
       'South America'],
      dtype='object')

And it has VALUES (to the right remember).

drinks.continent.value_counts().values
array([53, 45, 44, 23, 16, 12])

NOW because it is a SERIES, and not some value_counts object or something, we can use the index to select values from the Series. What I mean is, I can take this values array and refer an index, such as ‘Africa’.

drinks.continent.value_counts()['Africa']
53

What I am saying is: “From this Series find index Africa and show me the value”

Sorting

drinks.continent.value_counts().sort_values()
South America    12
Oceania          16
North America    23
Asia             44
Europe           45
Africa           53
Name: continent, dtype: int64

Using the sort_values method of the Series object, we can sort the VALUES of the Series.

WHAT IF YOU WANTED TO SORT ON THE INDEX ??

drinks.continent.value_counts().sort_index()
Africa           53
Asia             44
Europe           45
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

Remember the 3 reasons why an Index exists:

  • Identification
  • Selection
  • Allignment

Allignment

In order to understand alignment, we first need to create another Series object.

people = pd.Series([300000, 85000], index=['Albania', 'Andorra'], name='population')
people
Albania    300000
Andorra     85000
Name: population, dtype: int64

Let’s say I want to use this tiny dataset and multiply this with drinks.beer_servings to get the total number of people who were served beer.

drinks.beer_servings * people
Afghanistan                    NaN
Albania                 26700000.0
Algeria                        NaN
Andorra                 20825000.0
Angola                         NaN
Antigua & Barbuda              NaN
Argentina                      NaN
Armenia                        NaN
Australia                      NaN
Austria                        NaN
Azerbaijan                     NaN
Bahamas                        NaN
Bahrain                        NaN
Bangladesh                     NaN
Barbados                       NaN
Belarus                        NaN
Belgium                        NaN
Belize                         NaN
Benin                          NaN
Bhutan                         NaN
Bolivia                        NaN
Bosnia-Herzegovina             NaN
Botswana                       NaN
Brazil                         NaN
Brunei                         NaN
Bulgaria                       NaN
Burkina Faso                   NaN
Burundi                        NaN
Cabo Verde                     NaN
Cambodia                       NaN
                           ...    
Sudan                          NaN
Suriname                       NaN
Swaziland                      NaN
Sweden                         NaN
Switzerland                    NaN
Syria                          NaN
Tajikistan                     NaN
Tanzania                       NaN
Thailand                       NaN
Timor-Leste                    NaN
Togo                           NaN
Tonga                          NaN
Trinidad & Tobago              NaN
Tunisia                        NaN
Turkey                         NaN
Turkmenistan                   NaN
Tuvalu                         NaN
USA                            NaN
Uganda                         NaN
Ukraine                        NaN
United Arab Emirates           NaN
United Kingdom                 NaN
Uruguay                        NaN
Uzbekistan                     NaN
Vanuatu                        NaN
Venezuela                      NaN
Vietnam                        NaN
Yemen                          NaN
Zambia                         NaN
Zimbabwe                       NaN
dtype: float64

The countries not represented in the Series: people are ignored. And only the countries represented in the people series are multiplied. So how does pandas do this ? So here’s the thing, it aligned the people series with the drink series. That is, it found Albania in the drink series and then multiplied by the population value in the people series.

So, in summary, alignment allows us to put the data together even if it not the same length. As long as you tell it, which rows correspond to which other rows. Because people and drinks series use country as the index and since we specified the index in people, pandas used that index to match the rows in drinks series.

One more example to hammer the point home. Suppose lets say you want to add the people Series to the drinks dataframe as a new column. To do this, we use the pd.concat method.

pd.concat([drinks, people], axis=1).head()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent population
Afghanistan 0 0 0 0.0 Asia NaN
Albania 89 132 54 4.9 Europe 300000.0
Algeria 25 0 14 0.7 Africa NaN
Andorra 245 138 312 12.4 Europe 85000.0
Angola 217 57 45 5.9 Africa NaN

Notice how pandas automatically knew how to align the values in the merged dataframe. It does so, because of the index. Without the index, pandas wouldn’t know where to put them. So that is the beauty and elegance of having an index. Hope that explains it.

Updated: