Remove all rows between two sentinel strings in a column using pandas (but not the sentinel strings)-CodePudding

given a big dataframe, how can I remove all the things between two specific strings, here "start" and "end". Yesteday, I asked a similar question but removing the two strings (I post my solution here below). Now I would like to see how I can keep also the strings. I have been trying to get the next row using itertuples, but I still haven't figured out how to do it.

df1:

   c1     c2
0   1      1
1   2  start
2   3      3
3   4    end
4   5      5
5   6  start
6   7    end
7   8      0

code to remove strings and things in between:

df = pd.DataFrame(
    {'c1': [1, 2, 3, 4, 5, 6, 7, 8], 
     'c2': ['1', 'start', '3', 'end',
            '5', 'start', 'end', 0]})

df2 = copy.copy(df)
flag = False
list_a = []



for j in df.itertuples():
    if j.c2 == 'start':
        flag = True
        list_a.append((j))
    elif j.c2 =='end':
        flag = False
        list_a.append((j))
    elif flag:
        list_a.append((j))

list_a = tuple(list_a) 
to_remove_df = pd.DataFrame(list_a, columns=['index','c1','c2'])
to_remove_df = to_remove_df["c2"]
removed_df = pd.merge(df, to_remove_df, on=["c2"], how="outer", indicator=True).query('_merge != "both"').drop('_merge', 1)

output df2:

Wanted output: df1:

   c1     c2
0   1      1
1   2  start
3   4    end
4   5      5
5   6  start
6   7    end
7   8      0

CodePudding user response：

Assuming there is always a logical succession of "start" and "end" (i.e. start first, end after, no duplicate in between). You can simply take the odd chunks delimited by start/end:

m1 = df['c2'].isin(['start', 'end']).cumsum().mod(2).astype(bool)
df[~(m1|m1.shift())]

output:

CodePudding user response：

You just need to not add the indeces from the start and stop location to a.

Change this:

for j in df.itertuples():
    if j.c2 == 'start':
        flag = True
        list_a.append((j))
    elif j.c2 =='end':
        flag = False
        list_a.append((j))
    elif flag:
        list_a.append((j))

to this:

for j in df.itertuples():
    if j.c2 == 'start':
        flag = True
    elif j.c2 =='end':
        flag = False
    elif flag:
        list_a.append((j))

CodePudding user response：

Here a possible vectorial solution which should be easy to understand.

import pandas as pd
import numpy as np
# data
df = pd.DataFrame(
    {'c1': [1, 2, 3, 4, 5, 6, 7, 8], 
     'c2': ['1', 'start', '3', 'end',
            '5', 'start', 'end', 0]})

Now I want to mark the sequences containing start and end. In order to do so I first assign 1 where there is a start and 0 where there is a end. Then I forward fill the nan

diz = {"start": 1, "end": 0}
df["mask"] = df["c2"].map(diz).ffill()

Where the output is

   c1     c2  mask
0   1      1   NaN
1   2  start   1.0
2   3      3   1.0
3   4    end   0.0
4   5      5   0.0
5   6  start   1.0
6   7    end   0.0
7   8      0   0.0

After that I set 1 where there is a end and I do not modify otherwise and fill nan with 0s.

df["mask"] = np.where(df["c2"].eq("end"), 1, df["mask"])
df["mask"] = df["mask"].fillna(0)

Here the output is

   c1     c2  mask
0   1      1   0.0
1   2  start   1.0
2   3      3   1.0
3   4    end   1.0
4   5      5   0.0
5   6  start   1.0
6   7    end   1.0
7   8      0   0.0

Next I want to remove all elements where mask is 1 and c2 is not start or end.

df["to_remove"] = df["mask"].eq(1) &  ~df["c2"].isin(['start', 'end'])

Here the output is

   c1     c2  mask  to_remove
0   1      1   0.0      False
1   2  start   1.0      False
2   3      3   1.0       True
3   4    end   1.0      False
4   5      5   0.0      False
5   6  start   1.0      False
6   7    end   1.0      False
7   8      0   0.0      False

Finally I remove all to_remove True and remove the columns we added on the way

df = df[df["to_remove"].ne(True)]\
    .drop(columns=["mask", "to_remove"])

And you have your expected output

   c1     c2
0   1      1
1   2  start
3   4    end
4   5      5
5   6  start
6   7    end
7   8      0