Keep only the final or the latest rev of a file name

I have a dataframe with columns as below:

                      Name  Measurement
0   Blue_Water_Final_Rev_0            3
1   Blue_Water_Final_Rev_1            4
2   Blue_Water_Final_Rev_2            5
3    Red_Water_Final_Rev_0            7
4  Red_Water_Initial_Rev_0            6

For each group of names, I want to keep only the row with the latest rev, preferring the "Final" row when the other one is "Initial". In the case above, my output would be as below:

                     Name  Measurement
2  Blue_Water_Final_Rev_2            5
3   Red_Water_Final_Rev_0            7

How can I do this in Python with my pandas dataframe? Thanks.

CodePudding user response：

If a group may contain only Initial rows and no Final row, and you need to keep it, use Series.str.extract to split the name into three columns: the group, Final or Initial, and the revision number. Convert the last column to integers, sort by all three columns with DataFrame.sort_values, and keep the last row of each group by filtering with DataFrame.duplicated:

print (df)
                        Name  Measurement
0     Blue_Water_Final_Rev_0            3
1     Blue_Water_Final_Rev_1            4
2     Blue_Water_Final_Rev_2            5
3      Red_Water_Final_Rev_0            7
4    Red_Water_Initial_Rev_0            6
5  Green_Water_Initial_Rev_0            6

df1 = (df['Name'].str.extract(r'(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$')
                 .assign(c=lambda x: x.c.astype(int)))

df = df[~df1.sort_values(['a','c','b'], ascending=[True, True, False])
            .duplicated('a', keep='last')]
print (df)
                        Name  Measurement
2     Blue_Water_Final_Rev_2            5
3      Red_Water_Final_Rev_0            7
5  Green_Water_Initial_Rev_0            6
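
For reference, here is a self-contained sketch of the same idea that can be pasted and run as-is. The sample frame is rebuilt from the printed data above; the names parts, ordered, keep_idx and result are only illustrative. One thing worth noting: because c is sorted before b, a higher-numbered Initial revision would win over a lower-numbered Final one, so the sort order may need swapping if Final should always take priority.

```python
import pandas as pd

# Sample data copied from the printed frame above.
df = pd.DataFrame({
    "Name": [
        "Blue_Water_Final_Rev_0",
        "Blue_Water_Final_Rev_1",
        "Blue_Water_Final_Rev_2",
        "Red_Water_Final_Rev_0",
        "Red_Water_Initial_Rev_0",
        "Green_Water_Initial_Rev_0",
    ],
    "Measurement": [3, 4, 5, 7, 6, 6],
})

# Split each name into group (a), status (b) and revision number (c).
parts = df["Name"].str.extract(
    r"(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$"
)
parts["c"] = parts["c"].astype(int)

# Within each group: ascending revision; for equal revisions, "Initial"
# sorts before "Final" (descending on b), so the last row is the winner.
ordered = parts.sort_values(["a", "c", "b"], ascending=[True, True, False])
keep_idx = ordered[~ordered.duplicated("a", keep="last")].index

# Select by label and restore the original row order.
result = df.loc[keep_idx].sort_index()
print(result)
```

Selecting with df.loc on the surviving labels, rather than passing the re-ordered boolean mask straight to df[...], avoids relying on pandas re-aligning the mask to the frame's index.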

But if you need to remove all Initial rows and process only the Final rows, build df1 the same way as above, then filter out the rows with Initial and select the latest revision per group using DataFrame.loc with SeriesGroupBy.idxmax:

df1 = (df['Name'].str.extract(r'(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$')
                 .assign(c=lambda x: x.c.astype(int)))

df = df.loc[df1[df1.b.ne('Initial')].groupby('a')['c'].idxmax()]

print (df)
                     Name  Measurement
2  Blue_Water_Final_Rev_2            5
3   Red_Water_Final_Rev_0            7
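
A runnable sketch of this Final-only variant, under the same assumptions as the sample data above (the names parts, final_only, idx and result are illustrative only). Groups that have no Final row at all, such as Green_Water here, drop out of the result entirely.

```python
import pandas as pd

# Sample data copied from the printed frame earlier in this answer.
df = pd.DataFrame({
    "Name": [
        "Blue_Water_Final_Rev_0",
        "Blue_Water_Final_Rev_1",
        "Blue_Water_Final_Rev_2",
        "Red_Water_Final_Rev_0",
        "Red_Water_Initial_Rev_0",
        "Green_Water_Initial_Rev_0",
    ],
    "Measurement": [3, 4, 5, 7, 6, 6],
})

# Split each name into group (a), status (b) and revision number (c).
parts = df["Name"].str.extract(
    r"(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$"
)
parts["c"] = parts["c"].astype(int)

# Discard the Initial rows, then find the row label of the highest
# revision within each remaining group.
final_only = parts[parts["b"].ne("Initial")]
idx = final_only.groupby("a")["c"].idxmax()

# idx holds original row labels, so it can be passed directly to .loc.
result = df.loc[idx]
print(result)
```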

CodePudding user response：

You can extract the name before "Final" and drop_duplicates with keep='last':


keep = (df['Name']
        .str.extract('^(.*)_Final', expand=False)
        .drop_duplicates(keep='last')
        .dropna()
        )

out = df.loc[keep.index]

NB: this assumes the data is already sorted by revision within each name.

Output:

                     Name  Measurement
2  Blue_Water_Final_Rev_2            5
3   Red_Water_Final_Rev_0            7
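
If the rows are not guaranteed to be sorted by revision, one possible workaround is to order them by the numeric revision first and then apply the same drop_duplicates step. This is only a sketch: the shuffled sample frame and the names rev, order, names, keep and out are made up for illustration, and the `_Rev_<number>` suffix is assumed to be present on every name.

```python
import pandas as pd

# Hypothetical sample with the rows deliberately out of revision order.
df = pd.DataFrame({
    "Name": [
        "Blue_Water_Final_Rev_2",
        "Blue_Water_Final_Rev_0",
        "Red_Water_Initial_Rev_0",
        "Blue_Water_Final_Rev_1",
        "Red_Water_Final_Rev_0",
    ],
    "Measurement": [5, 3, 6, 4, 7],
})

# Pull out the trailing revision number and get the row labels in
# ascending revision order (stable, so ties keep their original order).
rev = df["Name"].str.extract(r"_Rev_(\d+)$", expand=False).astype(int)
order = rev.sort_values(kind="stable").index
names = df.loc[order, "Name"]

# Same step as in the answer: take the prefix before "_Final", keep the
# last occurrence per prefix, and drop the non-Final rows (NaN).
keep = (names
        .str.extract(r"^(.*)_Final", expand=False)
        .drop_duplicates(keep="last")
        .dropna())

out = df.loc[keep.index].sort_index()
print(out)
```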

CodePudding user response：

You can use df.iloc[2:4, :] for this, although that hard-codes the row positions and so only works for this exact data.