I have an example
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 0 | TXT1 | | None |
| 0 | TXT2 | | None |
| 0 | 5 | | None |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
| 1 | TXT11 | | None |
| 1 | TXT12 | | None |
| 1 | 0.50 | | None |
| 1 | 0.45 | 0.45 | 0 |
| 1 | 0.31 | 0.31 | 1 |
| 1 | 0.35 | 0.35 | 1 |
| 1 | 0.73 | 0.73 | 1 |
| 2 | 0.5 | | None |
| 2 | 4.15 | 4.15 | 0 |
| 2 | 2.98 | 2.98 | 0 |
| 2 | 1.53 | 1.53 | 0 |
| 3 | 4.46 | | None |
| 3 | 4.00 | 4.00 | 0 |
| 3 | 0.95 | 0.95 | 1 |
| 3 | 1.35 | 1.35 | 1 |
| 3 | 1.79 | 1.79 | 1 |
───────────── ──────── ──────── ────────
I would like to move the value from COL_A for the last occurrence of a row with None in column COL_C per main_group.
This value should be moved to the first element in group for COL_B column and then the previous contents should be removed.
Here is how it should look like for main_group == 0:
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 0 | TXT1 | 5 | None | <--- value "5" from the last row with "None" in `COL_C` in `main_group` == 0 was moved to the first row in the same group
| 0 | TXT2 | | None |
| 0 | 5 | | None | <--- After that this row should be deleted
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
───────────── ──────── ──────── ────────
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 2 | 0.5 | 0.5 | None | <--- value in column `COL_B` should be same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 2 | 4.15 | 4.15 | 0 |
| 2 | 2.98 | 2.98 | 0 |
| 2 | 1.53 | 1.53 | 0 |
| 3 | 4.46 | 4.46 | None | <--- value in column `COL_B` should be same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 3 | 4.00 | 4.00 | 0 |
| 3 | 0.95 | 0.95 | 1 |
| 3 | 1.35 | 1.35 | 1 |
| 3 | 1.79 | 1.79 | 1 |
───────────── ──────── ──────── ────────
After this operation, df should looks like:
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 0 | TXT1 | 5 | None |
| 0 | TXT2 | | None |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
| 1 | TXT11 | 0.50 | None |
| 1 | TXT12 | | None |
| 1 | 0.45 | 0.45 | 0 |
| 1 | 0.31 | 0.31 | 1 |
| 1 | 0.35 | 0.35 | 1 |
| 1 | 0.73 | 0.73 | 1 |
| 2 | 0.5 | 0.5 | None |
| 2 | 4.15 | 4.15 | 0 |
| 2 | 2.98 | 2.98 | 0 |
| 2 | 1.53 | 1.53 | 0 |
| 3 | 4.46 | 4.46 | None |
| 3 | 4.00 | 4.00 | 0 |
| 3 | 0.95 | 0.95 | 1 |
| 3 | 1.35 | 1.35 | 1 |
| 3 | 1.79 | 1.79 | 1 |
───────────── ──────── ──────── ────────
In the last step, I would like to concatenating selected COL_A column in each main_group where value in COL_C is None.
Example:
───────────── ──────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────── ──────── ────────
| 0 | TXT1 | 5 | None |
| 0 | TXT2 | | None |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
↓↓↓↓↓↓↓↓↓↓
───────────── ──────────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────────── ──────── ────────
| 0 | TXT1 TXT2 | 5 | None | <--- If there are more than 1 row with "None" in column `COL_C` in each group, then values in column `COL_A` should be "merged" into one row, and all others should be deleted
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
The first solution I have is to .loc those rows in the group that have a value "None" in COL_C column and then assign to the first element (.iloc) value from the last row.
However, this solution is not quite correct, and I'm additionally convinced that this can be done using .groupby instead of iterating and searching for elements after each group but I can't do it correctly.
The solution I was able to get this way:
───────────── ──────────── ──────── ────────
| main_group | COL_A | COL_B | COL_C |
───────────── ──────────── ──────── ────────
| 0 | TXT1 TXT2 | 5 | None |
| NaN | NaN | NaN | NaN |
| NaN | NaN | NaN | NaN |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
This is partially transferred correctly, but there are still NaN values in the rows that should not be there anymore.
Of course I can delete those rows and re-index df but this solution relies on loops, which will certainly not be efficient for large df.
How can I do it with a workaround of these loops over individual groups and swapping values using .loc and .iloc?
CodePudding user response:
IIUC, try the following (explanation in comments):
#create indicator column for where COL_C is None
df["indicator"] = df["COL_C"].isnull()
#get the index of the last None value for each main_group
max_null = df.groupby("main_group")["indicator"].transform(lambda x: x.cumsum().idxmax())
#move the COL_A to COL_B for the first index of each group
df["COL_B"] = df["COL_B"].where(df.groupby("main_group").cumcount().ne(0), max_null.map(df["COL_A"]))
# #remove the last rows with None value for each main_group
df = df.drop(max_null.unique()).reset_index(drop=True)
# #concatenate COL_A per main_group
strings = df.groupby("main_group").apply(lambda x: x[x["indicator"]]["COL_A"].str.cat(sep=","))
#assign concatenated strings to COL_A
df["COL_A"] = df["COL_A"].where(~df["indicator"], df["main_group"].map(strings))
#drop duplicates from COL_A per group and drop the indicator column
df = df.drop_duplicates(["main_group","COL_A"]).drop("indicator", axis=1).reset_index(drop=True)
>>> df
main_group COL_A COL_B COL_C
0 0 TXT1,TXT2 5 NaN
1 0 1.93 1.93 0.0
2 0 7.6 7.6 1.0
3 0 2.46 2.46 1.0
4 1 TXT1,TXT12 0.5 NaN
5 1 0.45 0.45 0.0
6 1 0.31 0.31 1.0
7 1 0.35 0.35 1.0
8 1 0.73 0.73 1.0
9 2 4.15 4.15 0.0
10 2 2.98 2.98 0.0
11 2 1.53 1.53 0.0
12 3 4 4.0 0.0
13 3 0.95 0.95 1.0
14 3 1.35 1.35 1.0
15 3 1.79 1.79 1.0
CodePudding user response:
Here's a way to get the final result you've asked for:
print('\nInput df:'); print(df)
df = df.assign(range_index=df.index)
gb = df[df['COL_C'].isna()].groupby(['main_group'])
df2 = pd.concat([
gb.nth(0)['range_index'],
gb.last()['COL_A'].copy().rename('COL_B_update'),
gb['COL_A'].apply(list).str.slice(stop=-1).str.join(' ')
], axis=1).set_index('range_index')
emptyColA = df2['COL_A'].str.len() == 0
df2.loc[emptyColA, 'COL_A'] = df2.loc[emptyColA, 'COL_B_update']
print('\ndf2:'); print(df2)
df = df.join(df2, on='range_index', rsuffix='_list')
print('\ndf just after join():'); print(df)
df.loc[~df.COL_A_list.isna(), 'COL_A_update'] = df.COL_A_list
df.loc[~df['COL_C'].isna(), 'COL_A_update'] = df.COL_A
df = df.loc[~df.COL_A_update.isna()].drop(columns=['range_index', 'COL_A_list'])
print('\ndf after creating COL_A_update, deleting unwanted rows, and dropping intermediate columns range_index and COL_A_list:'); print(df)
df.loc[df['COL_C'].isna(), 'COL_B'] = df.loc[df['COL_C'].isna(), 'COL_B_update']
df.loc[df['COL_C'].isna(), 'COL_A'] = df.loc[df['COL_C'].isna(), 'COL_A_update']
df = df.drop(columns=['COL_B_update', 'COL_A_update']).rename(columns={'COL_A_update':'COL_A'}).reset_index(drop=True)
print('\nOutput df after updating COL_A and COL_B, and dropping intermediate columns COL_A_update and COL_B_update:'); print(df)
Explanation:
- duplicate the index in a new column
range_index - create a
groupbyobject onmain_groupfor rows withCOL_Cof None - with the
groupbyobject:- use
nth(0)to get therange_indexvalue for the first row in each group of contiguousNonevalues inCOL_C - use
last()andSeries.rename()to create columnCOL_B_updatewith needed values copied fromCOL_A - use
apply(list), as well asslice()andjoin()onSeries.str(a sequence accessor which, confusingly, acts here on alistrather than astr) to convert all but one of each group'sCOL_Avalues into a space separated string of such values concatthese three Series intodf2withrange_indexas the index
- use
- use
jointo add to df the new columnsCOL_B_updateand aCOL_A_list(a renaming ofCOL_Afrom df2) - create a new column
COL_A_updatecontainingCOL_Alist strings for rows that have them, and containingCOL_Avalues for rows whereCOL_Cis not None - remove all other rows (i.e., remove all rows but the first in each contiguous block of rows where
COL_Cis None), and drop intermediate columnsrange_indexandCOL_A_list - update rows where
COL_Cis None inCOL_BandCOL_AusingCOL_B_updateandCOL_A_update, and drop the intermediate columnsCOL_A_updateandCOL_B_update.
Input:
main_group COL_A COL_B COL_C
0 0 TXT1 None
1 0 TXT2 None
2 0 5 None
3 0 1.93 1.93 0
4 0 7.60 7.60 1
5 0 2.46 2.46 1
6 1 TXT11 None
7 1 TXT12 None
8 1 0.50 None
9 1 0.45 0.45 0
10 1 0.31 0.31 1
11 1 0.35 0.35 1
12 1 0.73 0.73 1
13 2 0.5 None
14 2 4.15 4.15 0
15 2 2.98 2.98 0
16 2 1.53 1.53 0
17 3 4.46 None
18 3 4.00 4.00 0
19 3 0.95 0.95 1
20 3 1.35 1.35 1
21 3 1.79 1.79 1
Here is df2 right before the join:
COL_B_update COL_A
range_index
0 5 TXT1 TXT2
6 0.50 TXT11 TXT12
13 0.5 0.5
17 4.46 4.46
Here is df right after the join, with new columns range_index, COL_B_update and COL_A_list:
main_group COL_A COL_B COL_C range_index COL_B_update COL_A_list
0 0 TXT1 None 0 5 TXT1 TXT2
1 0 TXT2 None 1 NaN NaN
2 0 5 None 2 NaN NaN
3 0 1.93 1.93 0 3 NaN NaN
4 0 7.60 7.60 1 4 NaN NaN
5 0 2.46 2.46 1 5 NaN NaN
6 1 TXT11 None 6 0.50 TXT11 TXT12
7 1 TXT12 None 7 NaN NaN
8 1 0.50 None 8 NaN NaN
9 1 0.45 0.45 0 9 NaN NaN
10 1 0.31 0.31 1 10 NaN NaN
11 1 0.35 0.35 1 11 NaN NaN
12 1 0.73 0.73 1 12 NaN NaN
13 2 0.5 None 13 0.5 0.5
14 2 4.15 4.15 0 14 NaN NaN
15 2 2.98 2.98 0 15 NaN NaN
16 2 1.53 1.53 0 16 NaN NaN
17 3 4.46 None 17 4.46 4.46
18 3 4.00 4.00 0 18 NaN NaN
19 3 0.95 0.95 1 19 NaN NaN
20 3 1.35 1.35 1 20 NaN NaN
21 3 1.79 1.79 1 21 NaN NaN
Here is df after creating COL_A_update, deleting unwanted rows, and dropping intermediate columns range_index and COL_A_list:
main_group COL_A COL_B COL_C COL_B_update COL_A_update
0 0 TXT1 None 5 TXT1 TXT2
3 0 1.93 1.93 0 NaN 1.93
4 0 7.60 7.60 1 NaN 7.60
5 0 2.46 2.46 1 NaN 2.46
6 1 TXT11 None 0.50 TXT11 TXT12
9 1 0.45 0.45 0 NaN 0.45
10 1 0.31 0.31 1 NaN 0.31
11 1 0.35 0.35 1 NaN 0.35
12 1 0.73 0.73 1 NaN 0.73
13 2 0.5 None 0.5 0.5
14 2 4.15 4.15 0 NaN 4.15
15 2 2.98 2.98 0 NaN 2.98
16 2 1.53 1.53 0 NaN 1.53
17 3 4.46 None 4.46 4.46
18 3 4.00 4.00 0 NaN 4.00
19 3 0.95 0.95 1 NaN 0.95
20 3 1.35 1.35 1 NaN 1.35
21 3 1.79 1.79 1 NaN 1.79
Output after updating COL_A and COL_B, and dropping intermediate columns COL_A_update and COL_B_update:
main_group COL_A COL_B COL_C
0 0 TXT1 TXT2 5 None
1 0 1.93 1.93 0
2 0 7.60 7.60 1
3 0 2.46 2.46 1
4 1 TXT11 TXT12 0.50 None
5 1 0.45 0.45 0
6 1 0.31 0.31 1
7 1 0.35 0.35 1
8 1 0.73 0.73 1
9 2 0.5 0.5 None
10 2 4.15 4.15 0
11 2 2.98 2.98 0
12 2 1.53 1.53 0
13 3 4.46 4.46 None
14 3 4.00 4.00 0
15 3 0.95 0.95 1
16 3 1.35 1.35 1
17 3 1.79 1.79 1
