Understanding Pandas melt and when to use it

3 minute read

Understanding pandas melt

Melting turns columns into rows. Whereas, pivot will take unique values from a column and creates a new columns.

This is how you can visualize what a melt operation looks like:

melt.png

Imagine melting the columns and making the column-names as row values. And secondly, the value that is represented by the two variables will be its own separate column. In the above example, temparature of each city is tabulated against two variables: days and cities.

Pandas.melt() unpivots a DataFrame from wide format to long format. melt() function is useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are unpivoted to the row axis, leaving just two non-identifier columns, variable and value.

import pandas as pd
gapminder = pd.read_csv('gapminder.csv')
gapminder.head()
Unnamed: 0 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 Life expectancy
0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Abkhazia
1 1 28.21 28.20 28.19 28.18 28.17 28.16 28.15 28.14 28.13 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Afghanistan
2 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Akrotiri and Dhekelia
3 3 35.40 35.40 35.40 35.40 35.40 35.40 35.40 35.40 35.40 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Albania
4 4 28.82 28.82 28.82 28.82 28.82 28.82 28.82 28.82 28.82 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Algeria

5 rows × 219 columns

# remove the unnamed column
gapminder.drop('Unnamed: 0', axis='columns', inplace=True)
gapminder.head()
1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 Life expectancy
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Abkhazia
1 28.21 28.20 28.19 28.18 28.17 28.16 28.15 28.14 28.13 28.12 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Afghanistan
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Akrotiri and Dhekelia
3 35.40 35.40 35.40 35.40 35.40 35.40 35.40 35.40 35.40 35.40 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Albania
4 28.82 28.82 28.82 28.82 28.82 28.82 28.82 28.82 28.82 28.82 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Algeria

5 rows × 218 columns

What are the parameters for melt?

pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, 
            value_name='value', col_level=None)
  • frame : DataFrame
  • id_vars[tuple, list, or ndarray, optional] : Column(s) to use as identifier variables.
  • value_vars[tuple, list, or ndarray, optional]: Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
  • var_name[scalar]: Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
  • value_name[scalar, default ‘value’]: Name to use for the ‘value’ column.
  • col_level[int or string, optional]: If columns are a MultiIndex then use this level to melt.

OK, so, let’s get to it. I am going to make Life expectancy as the identifier variable, i.e, the one that should NOT be used in the un-pivoting operation. (Also note that the name Life expectancy is really the name of the country, but we will fix that later after melting).

# just give id_vars, since value_vars will be the rest of the columns.
gapminder_melt = gapminder.melt(id_vars='Life expectancy')
gapminder_melt.head(20)
Life expectancy variable value
0 Abkhazia 1800 NaN
1 Afghanistan 1800 28.21
2 Akrotiri and Dhekelia 1800 NaN
3 Albania 1800 35.40
4 Algeria 1800 28.82
5 American Samoa 1800 NaN
6 Andorra 1800 NaN
7 Angola 1800 26.98
8 Anguilla 1800 NaN
9 Antigua and Barbuda 1800 33.54
10 Argentina 1800 33.20
11 Armenia 1800 34.00
12 Aruba 1800 34.42
13 Australia 1800 34.05
14 Austria 1800 34.40
15 Azerbaijan 1800 29.17
16 Bahamas 1800 35.18
17 Bahrain 1800 30.30
18 Bangladesh 1800 25.50
19 Barbados 1800 32.12

As noted above, since we did not give Variable Name and Value Name, default values are used.

# now change the column names
gapminder_melt.columns = ['country', 'year', 'life_expectancy']
gapminder_melt.head()
country year life_expectancy
0 Abkhazia 1800 NaN
1 Afghanistan 1800 28.21
2 Akrotiri and Dhekelia 1800 NaN
3 Albania 1800 35.40
4 Algeria 1800 28.82
gapminder_melt.shape
(169260, 3)
gapminder.shape
(780, 218)

Conclusion

We have taken a wide dataframe and converted it into a long dataframe. Here’s a reminder of the tidy data principles!

According to the principles of tidy data:

  • rows form observations
  • columns form variables
  • tidying data will make data cleaning easier
  • melting turns columns into rows
  • pivot will take unique values from column and create new columns

Tags:

Updated: