I have a dataframe:
lft rel rgt num
0 t3 r3 z2 3
1 t1 r3 x1 9
2 x2 r3 t2 8
3 x4 r1 t2 4
4 t1 r1 z3 1
5 x1 r1 t2 2
6 x2 r2 t4 4
7 z3 r2 t4 5
8 t4 r3 x3 4
9 z1 r2 t3 4
And a reference dictionary:
replacement_dict = {
'X1' : ['x1', 'x2', 'x3', 'x4'],
'Y1' : ['y1', 'y2'],
'Z1' : ['z1', 'z2', 'z3']
}
My goal is to replace all occurrences of replacement_dict['X1'] with 'X1', and then merge the rows together. For example, any instance of 'x1', 'x2', 'x3' or 'x4' will be replaced by 'X1', etc.
I can do this by selecting the rows that contain any of these strings and replacing them with 'X1':
keys = replacement_dict.keys()
for key in keys:
DF.loc[DF['lft'].isin(replacement_dict[key]), 'lft'] = key
DF.loc[DF['rgt'].isin(replacement_dict[key]), 'rgt'] = key
giving:
lft rel rgt num
0 t3 r3 Z1 3
1 t1 r3 X1 9
2 X1 r3 t2 8
3 X1 r1 t2 4
4 t1 r1 Z1 1
5 X1 r1 t2 2
6 X1 r2 t4 4
7 Z1 r2 t4 5
8 t4 r3 X1 4
9 Z1 r2 t3 4
Now, if I select all the rows containing 'X1' and merge them, I should end up with:
lft rel rgt num
0 X1 r3 t2 8
1 X1 r1 t2 6
2 X1 r2 t4 4
3 t1 r3 X1 9
4 t4 r3 X1 4
So the three columns ['lft', 'rel', 'rgt'] are unique while the 'num' column is added up for each of these rows. The row 1 above : ['X1' 'r1' 't2' 6] is the sum of two rows ['X1' 'r1' 't2' 4] and ['X1' 'r1' 't2' 2].
I can do this easily for a small number of rows, but I am working with a dataframe with 6 million rows and a replacement dictionary with 60,000 keys. This is taking forever using a simple row wise extraction and replacement.
How can this (specifically the last part) be scaled efficiently? Is there a pandas trick that someone can recommend?
CodePudding user response:
You can reverse replacement_dict mapping and sum num values after grouping by lft, rel and rgt columns.
# reverse replacement map
reverse_map = {v : k for k, li in replacement_dict.items() for v in li}
# change values in lft column using reverse_map
df['lft'] = df['lft'].map(reverse_map).fillna(df['lft'])
# change values in rgt column using reverse_map
df['rgt'] = df['rgt'].map(reverse_map).fillna(df['rgt'])
# sum values in num column by groups
df.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
CodePudding user response:
Pandas has built in function replace that is faster than going through the whole dataframe with .loc
You can also pass a list in it making our dictionary good fit for it
keys = replacement_dict.keys()
# Loop through every value in our dictionary and get the replacements
for key in keys:
DF = DF.replace(to_replace=replacement_dict[key], value=key)
CodePudding user response:
If you flip the keys and values of your replacement_dict, things become a lot easier:
new_replacement_dict = {
v: key
for key, values in replacement_dict.items()
for v in values
}
cols = ["lft", "rel", "rgt"]
df[cols] = df[cols].replace(new_replacement_dict)
df.groupby(cols).sum()
CodePudding user response:
Try this, I commented the steps
#reverse dict to dissolve the lists as values
reversed_dict = {v:k for k,val in replacement_dict.items() for v in val}
# replace the values
cols = ['lft', 'rel', 'rgt']
df[cols] = df[cols].replace(reversed_dict)
# filter rows where X1 is anywhere in the columns
df = df[df.eq('X1').any(axis=1)]
# sum the duplicate rows
out = df_filtered.groupby(cols).sum().reset_index()
print(out)
Output:
lft rel rgt num
0 X1 r1 t2 6
1 X1 r2 t4 4
2 X1 r3 t2 8
3 t1 r3 X1 9
4 t4 r3 X1 4
CodePudding user response:
Here's a way to do what your question asks:
df[['lft','rgt']] = ( df[['lft','rgt']]
.replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df[(df.lft == 'X1') | (df.rgt == 'X1')]
.groupby(['lft','rel','rgt']).sum().reset_index() )
Output:
lft rel rgt num
0 X1 r1 t2 6
1 X1 r2 t4 4
2 X1 r3 t2 8
3 t1 r3 X1 9
4 t4 r3 X1 4
Explanation:
replace()uses a reversed version of the dictionary to replace items from lists in the original dict with the corresponding keys in the relevant df columnslftandrgt- after filtering for rows with
'X1'found in eitherlftorrgt, usegroupby(),sum()andreset_index()to sum thenumcolumn for uniquelft, rel, rgtgroup keys and restore the group components from index levels to columns.
As an alternative, we can use query() to select only rows containing 'X1':
df[['lft','rgt']] = ( df[['lft','rgt']]
.replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df.query("lft=='X1' or rgt=='X1'")
.groupby(['lft','rel','rgt']).sum().reset_index() )
