I have the following dataframe with two columns c1 and c2, I want to add a new column c3 based on the following logic, what I have works but is slow, can anyone suggest a way to vectorize this?
- Must be grouped based on
c1andc2, then for each group, the new columnc3must be populated sequentially fromvalueswhere the key is the value ofc1and each "sub group" will have subsequent values, IOWvalues[value_of_c1][idx], whereidxis the "sub group", example below - The first group
(1, 'a'), herec1is1, the "sub group""a"index is0(first sub group of 1) soc3for all rows in this group isvalues[1][0] - The second group
(1, 'b')herec1is still1but "sub group" is"b"so index1(second sub group of 1) so for all rows in this groupc3isvalues[1][1] - The third group
(2, 'y')herec1is now2, "sub group" is"a"and the index is0(first sub group of 2), so for all rows in this groupc3isvalues[2][0] - And so on
valueswill have the necessary elements to satisfy this logic.
Code
import pandas as pd
df = pd.DataFrame(
{
"c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
"c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
}
)
new_df = pd.DataFrame()
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}
for i, j in df.groupby("c1"):
for idx, (k, l) in enumerate(j.groupby("c2")):
l["c3"] = values[i][idx]
new_df = new_df.append(l)
Output (works but my code is slow)
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
CodePudding user response:
If you don't mind using another library, you basically need to label encode within your groups:
from sklearn.preprocessing import LabelEncoder
def le(x):
return pd.DataFrame(LabelEncoder().fit_transform(x),index=x.index)
df['idx'] = df.groupby('c1')['c2'].apply(le)
df['c3'] = df.apply(lambda x:values[x['c1']][x['idx']],axis=1)
c1 c2 idx c3
0 1 a 0 a1
1 1 a 0 a1
2 1 a 0 a1
3 1 b 1 a2
4 1 b 1 a2
5 1 b 1 a2
6 2 y 0 b1
7 2 y 0 b1
8 2 y 0 b1
9 2 z 1 b2
10 2 z 1 b2
11 2 z 1 b2
Otherwise it's a matter of using pd.Categorical , same concept as above, just that you convert within each group, the column into a category and just pull out the code:
def le(x):
return pd.DataFrame(pd.Categorical(x).codes,index=x.index)
CodePudding user response:
In [203]: a = pd.DataFrame([[k, value, idx] for k,v in values.items() for idx,value in enumerate(v)], columns=['c1', 'c3', 'gr'])
...: b = df.assign(gr=df.groupby(['c1']).transform(lambda x: (x.ne(x.shift()).cumsum())- 1))
...: print(b)
...: b.merge(a).drop(columns='gr')
...:
# b
c1 c2 gr
0 1 a 0
1 1 a 0
2 1 a 0
3 1 b 1
4 1 b 1
5 1 b 1
6 2 y 0
7 2 y 0
8 2 y 0
9 2 z 1
10 2 z 1
11 2 z 1
Out[203]:
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
