I have a dataframe with columns as below:
Name Measurement
0 Blue_Water_Final_Rev_0 3
1 Blue_Water_Final_Rev_1 4
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
4 Red_Water_Initial_Rev_0 6
I want to keep only the rows with the latest rev or rows with "Final" if the other is "Initial". In the case above, my output will be as below:
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
How can I do this in Python with my pandas dataframe? Thanks.
CodePudding user response:
If a group may contain only Initial rows and no Final row, and you need to keep it, use Series.str.extract to get three columns: the group name, Final or Initial, and the revision number. Convert the last column to integers, sort by all three columns with DataFrame.sort_values, and keep only the last row per group with DataFrame.duplicated:
print (df)
Name Measurement
0 Blue_Water_Final_Rev_0 3
1 Blue_Water_Final_Rev_1 4
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
4 Red_Water_Initial_Rev_0 6
5 Green_Water_Initial_Rev_0 6
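# Split Name into group (a), Final/Initial (b) and revision number (c)
df1 = (df['Name'].str.extract(r'(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$')
                 .assign(c=lambda x: x.c.astype(int)))
# Sort so the row to keep comes last in each group, then drop the others
df = df[~df1.sort_values(['a','c','b'], ascending=[True, True, False])
            .duplicated('a', keep='last')]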
print (df)
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
5 Green_Water_Initial_Rev_0 6
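To make the intermediate step concrete, here is a minimal self-contained sketch that rebuilds the sample frame from the printout above and prints what the extraction produces before sorting. The names `parts`, `mask`, `result` and the group names `group`, `stage`, `rev` are illustrative only, and the `.sort_index()` call is an addition of this sketch so that the boolean mask lines up with the original row order.

```python
import pandas as pd

# Sample data rebuilt from the printed frame above
df = pd.DataFrame({
    "Name": [
        "Blue_Water_Final_Rev_0",
        "Blue_Water_Final_Rev_1",
        "Blue_Water_Final_Rev_2",
        "Red_Water_Final_Rev_0",
        "Red_Water_Initial_Rev_0",
        "Green_Water_Initial_Rev_0",
    ],
    "Measurement": [3, 4, 5, 7, 6, 6],
})

# One column per named group; rev is converted so it sorts numerically
parts = df["Name"].str.extract(
    r"(?P<group>\w+)_(?P<stage>Final|Initial)_Rev_(?P<rev>\d+)$"
).assign(rev=lambda x: x["rev"].astype(int))

# Within each group: lowest rev first, and Initial before Final on ties,
# so the row to keep is the last one of its group
mask = ~(
    parts.sort_values(["group", "rev", "stage"], ascending=[True, True, False])
    .duplicated("group", keep="last")
    .sort_index()
)
result = df[mask]

print(parts)
print(result)
```

With this sample data, `result` keeps the rows labelled 2, 3 and 5, matching the output shown above.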
If you instead need to remove all Initial rows and process only the Final rows, start with the same extraction, then filter out the rows with Initial and select the latest revision per group using DataFrame.loc with SeriesGroupBy.idxmax:
df1 = (df['Name'].str.extract(r'(?P<a>\w+)_(?P<b>Final|Initial)_Rev_(?P<c>\d+)$')
                 .assign(c=lambda x: x.c.astype(int)))
# Drop Initial rows, then take the index label of the highest revision per group
df = df.loc[df1[df1.b.ne('Initial')].groupby('a')['c'].idxmax()]
print (df)
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
CodePudding user response:
You can extract the name before "Final" and drop_duplicates with keep='last':
keep = (df['Name']
        .str.extract('^(.*)_Final', expand=False)
        .drop_duplicates(keep='last')
        .dropna()
       )
out = df.loc[keep.index]
NB: this assumes the rows are already sorted by revision within each name.
Output:
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
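If the rows are not already in revision order, one possible variation is to order them by the numeric revision first and then apply the same drop_duplicates idea. This is only a sketch: the shuffled sample frame and the names `rev` and `ordered` are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

# Same rows as in the question (without the extra Green row), deliberately shuffled
df = pd.DataFrame({
    "Name": [
        "Blue_Water_Final_Rev_2",
        "Red_Water_Initial_Rev_0",
        "Blue_Water_Final_Rev_0",
        "Red_Water_Final_Rev_0",
        "Blue_Water_Final_Rev_1",
    ],
    "Measurement": [5, 6, 3, 7, 4],
})

# Extract the revision as an integer so that Rev_10 would sort after Rev_9
rev = df["Name"].str.extract(r"_Rev_(\d+)$", expand=False).astype(int)

# Reorder the rows by revision (mergesort is a stable sort)
ordered = df.loc[rev.sort_values(kind="mergesort").index]

# Same idea as above: name before "_Final", keep the last occurrence per name
keep = (ordered["Name"]
        .str.extract(r"^(.*)_Final", expand=False)
        .drop_duplicates(keep="last")
        .dropna())

# Restore the original row order for the final result
out = df.loc[keep.index].sort_index()
print(out)
```

Rows whose name contains no "_Final" yield NaN in the extraction and are removed by dropna, so Initial-only names are dropped here, as in the answer above.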
CodePudding user response:
You can use df.iloc[2:4, :] for this.
