given a big dataframe, how can I remove all the things between two specific strings, here "start" and "end". Yesteday, I asked a similar question but removing the two strings (I post my solution here below). Now I would like to see how I can keep also the strings. I have been trying to get the next row using itertuples, but I still haven't figured out how to do it.
df1:
c1 c2
0 1 1
1 2 start
2 3 3
3 4 end
4 5 5
5 6 start
6 7 end
7 8 0
code to remove strings and things in between:
df = pd.DataFrame(
{'c1': [1, 2, 3, 4, 5, 6, 7, 8],
'c2': ['1', 'start', '3', 'end',
'5', 'start', 'end', 0]})
df2 = copy.copy(df)
flag = False
list_a = []
for j in df.itertuples():
if j.c2 == 'start':
flag = True
list_a.append((j))
elif j.c2 =='end':
flag = False
list_a.append((j))
elif flag:
list_a.append((j))
list_a = tuple(list_a)
to_remove_df = pd.DataFrame(list_a, columns=['index','c1','c2'])
to_remove_df = to_remove_df["c2"]
removed_df = pd.merge(df, to_remove_df, on=["c2"], how="outer", indicator=True).query('_merge != "both"').drop('_merge', 1)
output df2:
c1 c2
0 1 1
4 5 5
7 8 0
Wanted output:
df1:
c1 c2
0 1 1
1 2 start
3 4 end
4 5 5
5 6 start
6 7 end
7 8 0
CodePudding user response:
Assuming there is always a logical succession of "start" and "end" (i.e. start first, end after, no duplicate in between). You can simply take the odd chunks delimited by start/end:
m1 = df['c2'].isin(['start', 'end']).cumsum().mod(2).astype(bool)
df[~(m1|m1.shift())]
output:
c1 c2
0 1 1
4 5 5
7 8 0
CodePudding user response:
You just need to not add the indeces from the start and stop location to a.
Change this:
for j in df.itertuples():
if j.c2 == 'start':
flag = True
list_a.append((j))
elif j.c2 =='end':
flag = False
list_a.append((j))
elif flag:
list_a.append((j))
to this:
for j in df.itertuples():
if j.c2 == 'start':
flag = True
elif j.c2 =='end':
flag = False
elif flag:
list_a.append((j))
CodePudding user response:
Here a possible vectorial solution which should be easy to understand.
import pandas as pd
import numpy as np
# data
df = pd.DataFrame(
{'c1': [1, 2, 3, 4, 5, 6, 7, 8],
'c2': ['1', 'start', '3', 'end',
'5', 'start', 'end', 0]})
Now I want to mark the sequences containing start and end. In order to do so I first assign 1 where there is a start and 0 where there is a end. Then I forward fill the nan
diz = {"start": 1, "end": 0}
df["mask"] = df["c2"].map(diz).ffill()
Where the output is
c1 c2 mask
0 1 1 NaN
1 2 start 1.0
2 3 3 1.0
3 4 end 0.0
4 5 5 0.0
5 6 start 1.0
6 7 end 0.0
7 8 0 0.0
After that I set 1 where there is a end and I do not modify otherwise and fill nan with 0s.
df["mask"] = np.where(df["c2"].eq("end"), 1, df["mask"])
df["mask"] = df["mask"].fillna(0)
Here the output is
c1 c2 mask
0 1 1 0.0
1 2 start 1.0
2 3 3 1.0
3 4 end 1.0
4 5 5 0.0
5 6 start 1.0
6 7 end 1.0
7 8 0 0.0
Next I want to remove all elements where mask is 1 and c2 is not start or end.
df["to_remove"] = df["mask"].eq(1) & ~df["c2"].isin(['start', 'end'])
Here the output is
c1 c2 mask to_remove
0 1 1 0.0 False
1 2 start 1.0 False
2 3 3 1.0 True
3 4 end 1.0 False
4 5 5 0.0 False
5 6 start 1.0 False
6 7 end 1.0 False
7 8 0 0.0 False
Finally I remove all to_remove True and remove the columns we added on the way
df = df[df["to_remove"].ne(True)]\
.drop(columns=["mask", "to_remove"])
And you have your expected output
c1 c2
0 1 1
1 2 start
3 4 end
4 5 5
5 6 start
6 7 end
7 8 0
