Pandas indexes are the most confusing thing about pandas. So let’s try to understand it. What are the advantages of using indices instead of just storing its’ values in columns ?
import pandas as pd
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
RangeIndex(start=0, stop=193, step=1)
Note: Every dataframe has an index attribute and a columns attribute
Note: Indexes are sometimes called the row-labels.
Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings', 'total_litres_of_pure_alcohol', 'continent'], dtype='object')
Index is type of special object, it is NOT refered to as “THE INDEX” though. When someone says “THE INDEX” or the row labels, they are talking about the row indexes.
Note: Neither the Index or the Columns are considered part of the dataframe contents. This can be seen when you run the shape command.
It turns out that both these Indexes and columns both default to integers. That is, if no index or columns are specified. For example, take a look at the movieusers dataframe
pd.read_table('http://bit.ly/movieusers', header=None, sep='|').head()
As we did not specify the header, the header and the row indexes are set to default integer values.
Note: Sometimes people just leave the default values for row indexes, but rarely they leave the default values for column names.
Coming back to our main question: Why does the Index exist ? There are 3 main reasons: - One is identification - Second is selection - Third is alignment
For identification, we are going to illustrate this by taking an example. We are going to do filtering the dataframe.
The thing I want you to notice is that the Index (also known as the row labels), STAYED with the rows. It didn’t just re-number them as starting from 0. It kept the original row numbers. This is what we mean by, when we say that the Index is for “Identification”. It is so that we can identify what rows we are working with even if you filter the original data frame.
Next, let’s talk about “Selection”. And what I mean by this, “what if I want to grab a piece of this dataframe ?” . We are going to use our beloved
loc method to do that. The
loc method, allows me say “if I want number 245 in the above dataframe” then I can simply use the below line
So what’s the big deal here ? Why bother having an index ? Here’s the answer to it. Let me show you.
# We can set the index using set_index dataframe method # When you set inplace=True, then you don't have to assign it to a variable. It does it inplace. drinks.set_index('country', inplace=True) drinks.head()
We can see that the dataframe has changed. We can see above that the Series country has now become an index and the default (integer index) has disappeared.
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', ... 'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'], dtype='object', name='country', length=193)
Few things to observe here:
- It still has length 193
- It is of type ‘object’, which by the way is the same as default index, since it is an Index object.
- It has a new attribute called ‘name’
Index(['beer_servings', 'spirit_servings', 'wine_servings', 'total_litres_of_pure_alcohol', 'continent'], dtype='object')
Also note now that country is no longer listed as a column in the dataframe.
Also check the shape it now has 193 and only 5 columns. This is because the Index is now not part of the dataframe. Remember that.
So, because we have set the country as the index, we can now use our beloved
loc method to do something like this.
So, by setting that index to something that was meaningful to us, we can now select data from the dataframe more easily!!!
Now, couple of issues to discuss, if you look at the dataframe output, you will notice that “country” is not listed in the columns row, it has a new separate row all for itself. Looks weird right ? That is actually the name of the index.
Now, you don’t need to have an index name. It is helpful, in that, it can serve as an identifier, that is it represents countries. But you can actually clear it out, if you so desire.
drinks.index.name = None drinks.head()
Now, let’s say you decide at some point that you want to use the default index and now you want to move that index (country index) back in as a column, then you do this.
Firstly, we need to give that index its’ name back!
drinks.index.name = 'country' drinks.reset_index(inplace=True) drinks.head()
And if you now check this out, we are back to our default integer index and the country which was the index, rejoined the dataframe as one of the columns. It is important to set the index name before we re-index, because pandas need to know what to call the new column.
Let’s do something extra. When we run the
describe method, see below
It is just a numerical summary of the numerical columns. And, I want you to notice that this is actually a dataframe. And as such, it has an index.
Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
This is the index!!
And these are the columns
Index(['beer_servings', 'spirit_servings', 'wine_servings', 'total_litres_of_pure_alcohol'], dtype='object')
So, the point here is not so much that you are going to do something with the describe output! You might, like above. But the main point here is that, a lot of methods in pandas return a dataframe. And when you know that there is something like a index and columns associated with that dataframe, and you recognize that, then you can interact with that resulting dataframe in cool ways.
A SERIES ALSO HAS AN INDEX
The last topic is “Arrangement”. What I want to show you, is that, a Series like a Dataframe also has an Index.
The Series index comes from the dataframe!!
Let’s say now I pick the continent series.
0 Asia 1 Europe 2 Africa 3 Europe 4 Africa Name: continent, dtype: object
What we see here is that there is an Index for this Series and it came from the dataframe.
So the Index is on the left and the values are on the right.
Now, let’s pretend that we didn’t use the default index for the dataframe, and instead, we use the country as the index for the dataframe.
drinks.set_index('country', inplace=True) drinks.head()
SO, what is going to happen, if we are going to select the continent Series ??
country Afghanistan Asia Albania Europe Algeria Africa Andorra Europe Angola Africa Name: continent, dtype: object
This time, we are seeing the same thing as last time, the INDEX is passed on from the dataframe. The index is attached to each row!
Africa 53 Europe 45 Asia 44 North America 23 Oceania 16 South America 12 Name: continent, dtype: int64
This is also a Series, and as such, it has an index.
Index(['Africa', 'Europe', 'Asia', 'North America', 'Oceania', 'South America'], dtype='object')
And it has VALUES (to the right remember).
array([53, 45, 44, 23, 16, 12])
NOW because it is a SERIES, and not some value_counts object or something, we can use the index to select values from the Series. What I mean is, I can take this
values array and refer an index, such as ‘Africa’.
What I am saying is: “From this Series find index Africa and show me the value”
South America 12 Oceania 16 North America 23 Asia 44 Europe 45 Africa 53 Name: continent, dtype: int64
sort_values method of the Series object, we can sort the VALUES of the Series.
WHAT IF YOU WANTED TO SORT ON THE INDEX ??
Africa 53 Asia 44 Europe 45 North America 23 Oceania 16 South America 12 Name: continent, dtype: int64
Remember the 3 reasons why an Index exists:
In order to understand alignment, we first need to create another Series object.
people = pd.Series([300000, 85000], index=['Albania', 'Andorra'], name='population') people
Albania 300000 Andorra 85000 Name: population, dtype: int64
Let’s say I want to use this tiny dataset and multiply this with
drinks.beer_servings to get the total number of people who were served beer.
drinks.beer_servings * people
Afghanistan NaN Albania 26700000.0 Algeria NaN Andorra 20825000.0 Angola NaN Antigua & Barbuda NaN Argentina NaN Armenia NaN Australia NaN Austria NaN Azerbaijan NaN Bahamas NaN Bahrain NaN Bangladesh NaN Barbados NaN Belarus NaN Belgium NaN Belize NaN Benin NaN Bhutan NaN Bolivia NaN Bosnia-Herzegovina NaN Botswana NaN Brazil NaN Brunei NaN Bulgaria NaN Burkina Faso NaN Burundi NaN Cabo Verde NaN Cambodia NaN ... Sudan NaN Suriname NaN Swaziland NaN Sweden NaN Switzerland NaN Syria NaN Tajikistan NaN Tanzania NaN Thailand NaN Timor-Leste NaN Togo NaN Tonga NaN Trinidad & Tobago NaN Tunisia NaN Turkey NaN Turkmenistan NaN Tuvalu NaN USA NaN Uganda NaN Ukraine NaN United Arab Emirates NaN United Kingdom NaN Uruguay NaN Uzbekistan NaN Vanuatu NaN Venezuela NaN Vietnam NaN Yemen NaN Zambia NaN Zimbabwe NaN dtype: float64
The countries not represented in the Series: people are ignored. And only the countries represented in the people series are multiplied. So how does pandas do this ? So here’s the thing, it aligned the
people series with the drink series. That is, it found Albania in the
drink series and then multiplied by the population value in the
So, in summary, alignment allows us to put the data together even if it not the same length. As long as you tell it, which rows correspond to which other rows. Because
drinks series use country as the index and since we specified the index in
people, pandas used that index to match the rows in
One more example to hammer the point home. Suppose lets say you want to add the people Series to the drinks dataframe as a new column. To do this, we use the pd.concat method.
pd.concat([drinks, people], axis=1).head()
Notice how pandas automatically knew how to align the values in the merged dataframe. It does so, because of the index. Without the index, pandas wouldn’t know where to put them. So that is the beauty and elegance of having an index. Hope that explains it.