Home > Enterprise >  Bin DataFrame in 2D array for seaborn heatmap
Bin DataFrame in 2D array for seaborn heatmap

Time:01-25

I have a dataframe that looks like.

 ----------- ------- 
|     A     |   B   |
 ----------- ------- 
|     1     |   1   |
|     2     |   2   |
|     5     |   3   |
|    20     |   4   |
|    25     |   3   |
|    123    |   5   |
|    125    |   6   |
 ----------- ------- 

I want to bin the column A based on the ranges defined ranges with sum of the values in column B. This will then be feeded to seaborn to generate a heatmap.

 --------- ------ ------- ------- ------- ------- ------- ------- ------- ------- -------- 
|         | 0-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 61-70 | 71-80 | 81-90 | 91-100 |
 --------- ------ ------- ------- ------- ------- ------- ------- ------- ------- -------- 
|  0-100  |  6   |   4   |   3   |   0   |   0   |   0   |   0   |   0   |   0   |   0    |
| 101-200 |  0   |   0   |   0   |   0   |   0   |   0   |   0   |   0   |   0   |   0    |
| 201-300 |  0   |   0   |   0   |   0   |   0   |   0   |   0   |   0   |   0   |   0    |
| 301-400 |  0   |   0   |  11   |   0   |   0   |   0   |   0   |   0   |   0   |   0    |
| 401-500 |  0   |   0   |   0   |   0   |   0   |   0   |   0   |   0   |   0   |   0    |
 --------- ------ ------- ------- ------- ------- ------- ------- ------- ------- -------- 

One way to solve it by looping through the data and generate the array. I am looking at a pandas way if there is any.

I tried solving using seaborn.heatmap like so:

df.groupby([pd.cut(df.A, bins=[x for x in range(0,1001,100)], include_lowest=True, right=False),
            pd.cut(df.A, bins=[x for x in range(0,101,10)], include_lowest=True, right=False)])
           .B.sum().unstack()

But this only group's by the first 0-100 B values. Ignores the remaining.

CodePudding user response:

In your solution is used maximal range for range(0,101,10) like 101, so not matched values in A column greater like 100 - output are NaNs, so after aggregate sum get 0.

EDIT:

#create helper column with integer and modulo division
df['A1'] = df.A % 100

bins1= range(0,df.A.max() // 100 * 100   101, 100)
bins2= range(0,df.A1.max() // 10 * 10   11, 10)

labels1 = [f'{i}-{j}' if i == 0 else f'{i   1}-{j}' for i, j in zip(bins1[:-1], bins1[1:])] 
labels2 = [f'{i}-{j}' if i == 0 else f'{i   1}-{j}' for i, j in zip(bins2[:-1], bins2[1:])] 


df['a'] = pd.cut(df.A, bins=bins1,labels=labels1, include_lowest=True, right=True)
df['b'] = pd.cut(df.A1, bins=bins2,labels=labels2, include_lowest=True, right=True)
print (df)
     A  B  A1        a      b
0    1  1   1    0-100   0-10
1    2  2   2    0-100   0-10
2    5  3   5    0-100   0-10
3   20  4  20    0-100  11-20
4   25  3  25    0-100  21-30
5  123  5  23  101-200  21-30
6  125  6  25  101-200  21-30

df1 = df.pivot_table(index='a', columns='b', values='B', aggfunc='sum')
print (df1)
b        0-10  11-20  21-30
a                          
0-100       6      4      3
101-200     0      0     11
  •  Tags:  
  • Related