Understanding Pandas melt and when to use it
Understanding pandas melt
Melting turns columns into rows. Whereas, pivot will take unique values from a column and creates a new columns.
This is how you can visualize what a melt operation looks like:
Imagine melting the columns and making the column-names as row values. And secondly, the value that is represented by the two variables will be its own separate column. In the above example, temparature of each city is tabulated against two variables: days and cities.
Pandas.melt()
unpivots a DataFrame from wide format to long format.
melt() function is useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are unpivoted to the row axis, leaving just two non-identifier columns, variable and value.
import pandas as pd
gapminder = pd.read_csv('gapminder.csv')
gapminder.head()
Unnamed: 0 | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | Life expectancy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Abkhazia |
1 | 1 | 28.21 | 28.20 | 28.19 | 28.18 | 28.17 | 28.16 | 28.15 | 28.14 | 28.13 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Afghanistan |
2 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Akrotiri and Dhekelia |
3 | 3 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Albania |
4 | 4 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Algeria |
5 rows × 219 columns
# remove the unnamed column
gapminder.drop('Unnamed: 0', axis='columns', inplace=True)
gapminder.head()
1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | ... | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | Life expectancy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Abkhazia |
1 | 28.21 | 28.20 | 28.19 | 28.18 | 28.17 | 28.16 | 28.15 | 28.14 | 28.13 | 28.12 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Afghanistan |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Akrotiri and Dhekelia |
3 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | 35.40 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Albania |
4 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | 28.82 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Algeria |
5 rows × 218 columns
What are the parameters for melt
?
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None,
value_name='value', col_level=None)
- frame : DataFrame
- id_vars[tuple, list, or ndarray, optional] : Column(s) to use as identifier variables.
- value_vars[tuple, list, or ndarray, optional]: Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
- var_name[scalar]: Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
- value_name[scalar, default ‘value’]: Name to use for the ‘value’ column.
- col_level[int or string, optional]: If columns are a MultiIndex then use this level to melt.
OK, so, let’s get to it. I am going to make Life expectancy
as the identifier variable, i.e, the one that should NOT be used in the un-pivoting operation. (Also note that the name Life expectancy is really the name of the country, but we will fix that later after melting).
# just give id_vars, since value_vars will be the rest of the columns.
gapminder_melt = gapminder.melt(id_vars='Life expectancy')
gapminder_melt.head(20)
Life expectancy | variable | value | |
---|---|---|---|
0 | Abkhazia | 1800 | NaN |
1 | Afghanistan | 1800 | 28.21 |
2 | Akrotiri and Dhekelia | 1800 | NaN |
3 | Albania | 1800 | 35.40 |
4 | Algeria | 1800 | 28.82 |
5 | American Samoa | 1800 | NaN |
6 | Andorra | 1800 | NaN |
7 | Angola | 1800 | 26.98 |
8 | Anguilla | 1800 | NaN |
9 | Antigua and Barbuda | 1800 | 33.54 |
10 | Argentina | 1800 | 33.20 |
11 | Armenia | 1800 | 34.00 |
12 | Aruba | 1800 | 34.42 |
13 | Australia | 1800 | 34.05 |
14 | Austria | 1800 | 34.40 |
15 | Azerbaijan | 1800 | 29.17 |
16 | Bahamas | 1800 | 35.18 |
17 | Bahrain | 1800 | 30.30 |
18 | Bangladesh | 1800 | 25.50 |
19 | Barbados | 1800 | 32.12 |
As noted above, since we did not give Variable Name and Value Name, default values are used.
# now change the column names
gapminder_melt.columns = ['country', 'year', 'life_expectancy']
gapminder_melt.head()
country | year | life_expectancy | |
---|---|---|---|
0 | Abkhazia | 1800 | NaN |
1 | Afghanistan | 1800 | 28.21 |
2 | Akrotiri and Dhekelia | 1800 | NaN |
3 | Albania | 1800 | 35.40 |
4 | Algeria | 1800 | 28.82 |
gapminder_melt.shape
(169260, 3)
gapminder.shape
(780, 218)
Conclusion
We have taken a wide dataframe and converted it into a long dataframe. Here’s a reminder of the tidy data principles!
According to the principles of tidy data:
- rows form observations
- columns form variables
- tidying data will make data cleaning easier
- melting turns columns into rows
- pivot will take unique values from column and create new columns