What the f*ck is Hierarchical Indexing in Pandas?

Hierarchical Indexing provides a way for you to work with higher-dimensional data in a lower-dimensional form.


Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher-dimensional data in a lower-dimensional form. Let’s start with a simple example; create a Series with a list of lists (or arrays) as the index:

import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),index=[['a','a','a','b','b','b','c','c','d'],[1,1,3,1,2,3,1,2,3]])
print(data)

The output of this would be:

a  1    0.859885
   1   -0.388179
   3   -0.482289
b  1   -0.118181
   2    0.756211
   3   -0.084258
c  1   -2.395902
   2    0.765605
d  3    0.055030
dtype: float64

What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”. For example, the first row shows "a" but the second row leaves the outer label blank, which means that row also belongs to "a".
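To see that the blanks are only a display convention, you can inspect the index itself. The sketch below is a minimal illustration that uses fixed values (np.arange) instead of random ones, which is my own substitution so the result is reproducible:

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones, so the example is reproducible.
data = pd.Series(
    np.arange(9.0),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
           [1, 1, 3, 1, 2, 3, 1, 2, 3]],
)

# The index is a MultiIndex with two levels.
print(type(data.index).__name__)
print(data.index.nlevels)

# Each level stores its distinct labels only once.
outer_labels = data.index.levels[0].tolist()
inner_labels = data.index.levels[1].tolist()
print(outer_labels)
print(inner_labels)

# Every row still carries its full (outer, inner) label.
first_two_labels = data.index.tolist()[:2]
print(first_two_labels)
```

Each row keeps its complete label pair internally; only the printed view hides the repeats.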

With a hierarchically indexed object, partial indexing is possible, enabling you to concisely select subsets of the data:

import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),index=[['a','a','a','b','b','b','c','c','d'],[1,1,3,1,2,3,1,2,3]])
print(data['b'])

The output of this would be:

1   -1.072525
2   -0.171218
3    1.397263
dtype: float64
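Partial indexing is not limited to a single outer label. As a small sketch (again with fixed values rather than random ones, which is my own choice for reproducibility), .loc also accepts a label slice or a list of outer labels:

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9.0),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
           [1, 1, 3, 1, 2, 3, 1, 2, 3]],
)

# A label slice on the outer level keeps every row from 'b' through 'c'
# (label slices include both endpoints).
b_to_c = data.loc['b':'c']
print(b_to_c)

# A list of outer labels picks just those groups.
b_and_d = data.loc[['b', 'd']]
print(b_and_d)
```

Note that slicing by label relies on the index being sorted, which it is here.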

Selection is even possible from an “inner” level:

import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),index=[['a','a','a','b','b','b','c','c','d'],[1,1,3,1,2,3,1,2,3]])
print(data[:,1])

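This returns every value whose inner label is 1, indexed by the outer labels (a, a, b and c here, because the pair (a, 1) appears twice). The values are random, so they change on every run.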
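The same inner-level selection can be written more explicitly with .loc or with xs. This is a sketch with fixed values (my substitution for the random data) so that you can check the result by eye:

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9.0),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
           [1, 1, 3, 1, 2, 3, 1, 2, 3]],
)

# Take everything on the outer level, and only label 1 on the inner level.
inner_one = data.loc[:, 1]
print(inner_one)

# xs ("cross-section") does the same by naming the level explicitly.
inner_one_xs = data.xs(1, level=1)
print(inner_one_xs)
```

Both spellings drop the inner level from the result, since it would hold the same label in every row.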

Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table. For example, you could rearrange the data into a DataFrame using its unstack method:

import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),index=[['a','a','a','b','b','b','c','c','d'],[1,2,3,1,2,3,1,2,3]])
print(data.unstack())

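Note that unstack only works when every (outer, inner) label pair is unique; if a pair is duplicated, pandas raises a ValueError, which is why the inner labels here differ from the earlier examples. In layman's terms, unstacking converts a multi-indexed Series into a DataFrame, with the inner level becoming the columns.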

The output of the code above would be:

          1         2         3
a  1.606511 -0.571539 -0.378939
b  0.857553  0.896638 -1.325449
c  0.383329 -1.239296       NaN
d       NaN       NaN  0.493805
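To make the uniqueness requirement concrete, here is a small sketch with a deliberately duplicated (a, 1) pair. The tiny Series and the choice to sum the duplicates are my own assumptions for illustration; another aggregation may suit your data better:

```python
import numpy as np
import pandas as pd

# The pair ('a', 1) appears twice, so there is no single cell to put it in.
dup = pd.Series(np.arange(3.0), index=[['a', 'a', 'b'], [1, 1, 2]])

error_message = None
try:
    dup.unstack()
except ValueError as exc:
    error_message = str(exc)
print("unstack failed:", error_message)

# One way out: collapse duplicates first (here by summing), then unstack.
wide = dup.groupby(level=[0, 1]).sum().unstack()
print(wide)
```

After aggregation each label pair is unique again, so the reshape goes through and missing combinations show up as NaN.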

The opposite of unstack is stack, which converts a DataFrame back into a multi-indexed Series:

import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),index=[['a','a','a','b','b','b','c','c','d'],[1,2,3,1,2,3,1,2,3]])
print(data.unstack().stack())

The output of this code would be:

a  1    1.662083
   2   -0.582696
   3    0.196726
b  1    0.105105
   2    0.084335
   3    0.476131
c  1    0.698625
   2   -0.342482
d  3   -0.244857
dtype: float64

Columns can have multiple index levels too:

import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
print(frame)

The output of this would be:

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
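Partial indexing works on hierarchical columns just as it does on rows. A minimal sketch using the same frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)

# An outer column label selects the whole group of columns under it.
ohio = frame['Ohio']
print(ohio)

# A full (outer, inner) tuple selects a single column as a Series.
ohio_red = frame[('Ohio', 'Red')]
print(ohio_red)

# Row and column tuples together pick out one cell.
cell = frame.loc[('a', 2), ('Colorado', 'Green')]
print(cell)
```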

01 . Swapping MultiLevel Indices 🔀

At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The swaplevel() method takes two level numbers or names and returns a new object with the levels interchanged (the data itself is unaltered):

import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
print(frame.swaplevel())

The output of this is:

     Ohio     Colorado
    Green Red    Green
1 a     0   1        2
2 a     3   4        5
1 b     6   7        8
2 b     9  10       11

sort_index, on the other hand, sorts the data by the values in the level you specify (by default, the remaining levels are then used to break ties). When swapping levels, it’s not uncommon to also use sort_index so that the result is lexicographically sorted by the indicated level:

import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
print(frame.swaplevel().sort_index(level=0))

This will give us the expected output.

     Ohio     Colorado
    Green Red    Green
1 a     0   1        2
  b     6   7        8
2 a     3   4        5
  b     9  10       11
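Both methods also accept level names, which tends to read better than positions. The names key1 and key2 in this sketch are assumptions of mine, added only for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
# Hypothetical level names, so the levels can be referred to by name.
frame.index.names = ['key1', 'key2']

# Swapping only relabels the rows; their order stays the same,
# so the new index is no longer sorted.
swapped = frame.swaplevel('key1', 'key2')
print(swapped.index.is_monotonic_increasing)

# Sorting on key2 (now the outer level) restores a sorted index.
ordered = swapped.sort_index(level='key2')
print(ordered.index.is_monotonic_increasing)
print(ordered)
```

A sorted index matters in practice because label slicing on a MultiIndex depends on it.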

We can even turn one or more columns into a multi-level index. Let's look at an example DataFrame first.

import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],'d': [0, 1, 2, 0, 1, 2, 3]})
print(frame)

The output of this is:

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3

Now I want c to be my first hierarchical index level and d to be my second. For that, I will use a method called set_index.

import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],'d': [0, 1, 2, 0, 1, 2, 3]})
print(frame.set_index(['c','d']))

This will print the following:

       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
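By default set_index removes the columns it turns into index levels. If you want to keep them as regular columns as well, the drop parameter controls that. A short sketch:

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3],
})

# drop=False keeps 'c' and 'd' as columns while also using them as the index.
kept = frame.set_index(['c', 'd'], drop=False)
print(kept)
```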

The opposite of this is reset_index(), which moves the index levels back into columns.

import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],'d': [0, 1, 2, 0, 1, 2, 3]})
print(frame.set_index(['c','d']).reset_index())

The output of this is:

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
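reset_index does not have to undo every level at once. As a sketch, passing level moves only the named level back into the columns and leaves the rest of the index in place:

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3],
})
indexed = frame.set_index(['c', 'd'])

# Only 'd' becomes a column again; 'c' stays as the (now single-level) index.
partial = indexed.reset_index(level='d')
print(partial)
```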

02 . Level Statistics 🏓

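Many descriptive and summary statistics on DataFrame and Series used to accept a level option for specifying the level to aggregate by on a particular axis. That option was removed in pandas 2.0, so the portable approach is to group by the level first. Consider the Ohio/Colorado DataFrame from the previous section, with its row levels named key1 and key2 and its column levels named state and color; we can aggregate by level on either the rows or the columns like so:

frame.groupby(level='key2').sum()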

The output of this would be:

state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
Similarly, to aggregate across the columns by the color level, transpose the frame, group by the level, and transpose back:

frame.T.groupby(level='color').sum().T

The output of this would be:

color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
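For a version that runs end to end, the sketch below rebuilds the Ohio/Colorado frame and names its levels key1, key2, state and color to match the outputs shown above. It uses groupby(level=...) rather than sum(level=...), since the level argument on aggregations is not available in pandas 2.0 and later; the transpose trick for the column axis is one of several possible spellings:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
# Name the levels so they can be referred to by name.
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same key2 label.
by_key2 = frame.groupby(level='key2').sum()
print(by_key2)

# Sum the columns that share the same color label:
# transpose, group on the (now row) level, transpose back.
by_color = frame.T.groupby(level='color').sum().T
print(by_color)
```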

03 . Reshaping with Hierarchical Indexing 🏌️

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

  1. stack: This “rotates” or pivots from the columns in the data to the rows

  2. unstack: This pivots from the rows into the columns

Let's create a DataFrame with a hierarchical index:

import pandas as pd

# Creating a sample DataFrame with a hierarchical index
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['first', 'second'])
columns = ['X', 'Y']
data = [[1, 2], [3, 4], [5, 6], [7, 8]]

df = pd.DataFrame(data, index=index, columns=columns)

The DataFrame looks like this:

              X  Y
first second
A     1       1  2
      2       3  4
B     1       5  6
      2       7  8

The stack() function pivots the columns into rows, producing a Series with a hierarchical index:

df.stack()

The output of this will be:

first  second   
A      1       X    1
               Y    2
       2       X    3
               Y    4
B      1       X    5
               Y    6
       2       X    7
               Y    8
dtype: int64

Conversely, unstack() does the opposite; it pivots a level of the index labels to the columns:

df.stack().unstack()

The output of this is:

              X  Y
first second
A     1       1  2
      2       3  4
B     1       5  6
      2       7  8
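By default unstack pivots the innermost index level, but you can pick a different one by name or position. A small sketch with the same df, unstacking the outer level first instead:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['first', 'second']
)
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  index=index, columns=['X', 'Y'])

# Move the outer level 'first' into the columns; 'second' stays on the rows.
wide = df.unstack('first')
print(wide)

# Each original column now has one sub-column per label of 'first'.
print(wide.loc[1, ('X', 'B')])
```

The result has hierarchical columns, so the tuple-based column selection shown earlier applies to it as well.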

This brings us back to the original DataFrame df. In summary, stack() and unstack() are useful for reshaping data between hierarchical indices and columns, allowing you to pivot data in different ways to suit your analysis or presentation needs.

This was all about Hierarchical Indexing in Pandas. See ya!