I have a dataframe that looks like.
----------- -------
| A | B |
----------- -------
| 1 | 1 |
| 2 | 2 |
| 5 | 3 |
| 20 | 4 |
| 25 | 3 |
| 123 | 5 |
| 125 | 6 |
----------- -------
I want to bin the column A based on the ranges defined ranges with sum of the values in column B. This will then be feeded to seaborn to generate a heatmap.
--------- ------ ------- ------- ------- ------- ------- ------- ------- ------- --------
| | 0-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 61-70 | 71-80 | 81-90 | 91-100 |
--------- ------ ------- ------- ------- ------- ------- ------- ------- ------- --------
| 0-100 | 6 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 101-200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 201-300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 301-400 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 401-500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
--------- ------ ------- ------- ------- ------- ------- ------- ------- ------- --------
One way to solve it by looping through the data and generate the array. I am looking at a pandas way if there is any.
I tried solving using seaborn.heatmap like so:
df.groupby([pd.cut(df.A, bins=[x for x in range(0,1001,100)], include_lowest=True, right=False),
pd.cut(df.A, bins=[x for x in range(0,101,10)], include_lowest=True, right=False)])
.B.sum().unstack()
But this only group's by the first 0-100 B values. Ignores the remaining.
CodePudding user response:
In your solution is used maximal range for range(0,101,10) like 101, so not matched values in A column greater like 100 - output are NaNs, so after aggregate sum get 0.
EDIT:
#create helper column with integer and modulo division
df['A1'] = df.A % 100
bins1= range(0,df.A.max() // 100 * 100 101, 100)
bins2= range(0,df.A1.max() // 10 * 10 11, 10)
labels1 = [f'{i}-{j}' if i == 0 else f'{i 1}-{j}' for i, j in zip(bins1[:-1], bins1[1:])]
labels2 = [f'{i}-{j}' if i == 0 else f'{i 1}-{j}' for i, j in zip(bins2[:-1], bins2[1:])]
df['a'] = pd.cut(df.A, bins=bins1,labels=labels1, include_lowest=True, right=True)
df['b'] = pd.cut(df.A1, bins=bins2,labels=labels2, include_lowest=True, right=True)
print (df)
A B A1 a b
0 1 1 1 0-100 0-10
1 2 2 2 0-100 0-10
2 5 3 5 0-100 0-10
3 20 4 20 0-100 11-20
4 25 3 25 0-100 21-30
5 123 5 23 101-200 21-30
6 125 6 25 101-200 21-30
df1 = df.pivot_table(index='a', columns='b', values='B', aggfunc='sum')
print (df1)
b 0-10 11-20 21-30
a
0-100 6 4 3
101-200 0 0 11
