I have the following dataframe (sample):
import pandas as pd
min_id = 1
max_id = 10
data = [['A', 2], ['A', 3], ['A', 1], ['A', 4], ['A', 4], ['A', 2],
['B', 4], ['B', 5], ['B', 7], ['B', 4], ['B', 2],
['C', 1], ['C', 3], ['C', 2], ['C', 1], ['C', 5], ['C', 2] ,['C', 1],
['D', 1], ['D', 1], ['D', 1], ['D', 1]]
df = pd.DataFrame(data = data, columns = ['group', 'val'])
group val
0 A 2
1 A 3
2 A 1
3 A 4
4 A 4
5 A 2
6 B 4
7 B 5
8 B 7
9 B 4
10 B 2
11 C 1
12 C 3
13 C 2
14 C 1
15 C 5
16 C 2
17 C 1
18 D 1
19 D 1
20 D 1
21 D 1
I would like to create a column called "id" which shows the id with a min value of 1 (min_id) and a max value of 10 (max_id) per group. So the values between min and max depend on the number of rows per group. Here you can see the desired output:
data = [['A', 2, 1], ['A', 3, 2.8], ['A', 1, 4.6], ['A', 4, 6.4], ['A', 4, 8.2], ['A', 2, 10],
['B', 4, 1], ['B', 5, 3.25], ['B', 7, 5.5], ['B', 4, 7.75], ['B', 2, 10],
['C', 1, 1], ['C', 3, 2.5], ['C', 2, 4], ['C', 1, 5.5], ['C', 5, 7], ['C', 2, 8.5] ,['C', 1, 10],
['D', 1, 1], ['D', 1, 4], ['D', 1, 7], ['D', 1, 10]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'val', 'id'])
group val id
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00
So I was wondering if anyone knows how to automatically create the column "id" using pandas? Please note that the number of rows could be way more then in the sample dataframe.
CodePudding user response:
One solution could be as follows:
import pandas as pd
import numpy as np
df['id'] = df.groupby('group', sort=False).size()\
.apply(lambda x: np.linspace(1,10,x)).explode().reset_index(drop=True)
print(df)
group val id
0 A 2 1.0
1 A 3 2.8
2 A 1 4.6
3 A 4 6.4
4 A 4 8.2
5 A 2 10.0
6 B 4 1.0
7 B 5 3.25
8 B 7 5.5
9 B 4 7.75
10 B 2 10.0
11 C 1 1.0
12 C 3 2.5
13 C 2 4.0
14 C 1 5.5
15 C 5 7.0
16 C 2 8.5
17 C 1 10.0
18 D 1 1.0
19 D 1 4.0
20 D 1 7.0
21 D 1 10.0
CodePudding user response:
Use:
def f(x):
x = pd.Series(np.nan, index=x.index)
x.iat[0] = min_id
x.iat[-1] = max_id
return x.interpolate()
df['ID'] = df.groupby('group')['group'].transform(f)
print (df)
group val ID
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00
Or:
f = lambda x: np.linspace(min_id,max_id,len(x))
df['ID'] = df.groupby('group')['group'].transform(f)
print (df)
group val ID
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00
