Identify and count segments between a start and an end marker-CodePudding

The goal is to fill values only between two values (start and end) with unique numbers (will be used in a groupby later on), notice how the values between end and start are still None in the desired output:

Code:

>>> df = pd.DataFrame(
       dict(
           flag=[None, 'start', None, None, 'end', 'start', 'end', None, 'start', None,'end',None],
       )
    )

>>> df 
     flag
0    None
1   start
2    None
3    None
4     end
5   start
6     end
7    None
8   start
9    None
10    end
11   None

CodePudding user response：

One option is to finde the index of all 'start' and 'end' strings and loop over a list with None values and replace the values at the correct position.

df = pd.DataFrame(
       dict(
           flag=[None, 'start', None, None, 'end', 'start', 'end', None, 'start', None,'end',None],
       )
    )

start_index = df[df['flag']=='start'].index
end_index = df[df['flag']=='end'].index

values = [None]*df.shape[0]
for i, (s, e) in enumerate(zip(start_index, end_index),1):
    for j in range(s,e 1):
        values[j]=i
df["flag_persiods"]=values
>>> df
     flag  flag_persiods
0    None            NaN
1   start            1.0
2    None            1.0
3    None            1.0
4     end            1.0
5   start            2.0
6     end            2.0
7    None            NaN
8   start            3.0
9    None            3.0
10    end            3.0
11   None            NaN

CodePudding user response：

This seem to work for me

# start by creating a col containing only None values 
df['flag_periods'] = None
#set the count to 0 
count = 0
#use end as False or True to know when a flag periods ends or starts 
end = True 
#loop over the df using indexes to fill the 'flag_periods' column wit the period
for index in df.index :     
    if df.at[index, 'flag'] == 'start':
        end = False 
        count  = 1 
        df.at[index, 'flag_periods'] = count
    elif df.at[index, 'flag'] == 'end':
        df.at[index, 'flag_periods'] = count
        end = True 
    else : 
        if end == False :
            df.at[index, 'flag_periods'] = count
        else : 
            df.at[index, 'flag_periods'] = None
            
df

CodePudding user response：

Usually problems like these are solved by fiddling with cumsum and shift.

The main idea for this solution is to identify rows where the number of "starts" seen is ahead of the number of "ends" seen by one.

The only assumption I made is that 'start' and 'end' alternate, beginning with a 'start'.

>>> values = df['flag'].eq('start').cumsum()
>>> where = values.sub(1).eq(df['flag'].eq('end').cumsum().shift(1).fillna(0))
>>> df['flag_periods'] = df['flag'].mask(where, values)
>>> df 
     flag flag_periods
0    None         None
1   start            1
2    None            1
3    None            1
4     end            1
5   start            2
6     end            2
7    None         None
8   start            3
9    None            3
10    end            3
11   None         None

Visualization:

>>> df['values'] = df.eq('start').cumsum()
>>> df['end_cumsum'] = df['flag'].eq('end').cumsum()
>>> df['end_cumsum_s1'] = df['end_cumsum'].shift(1).fillna(0)
>>> df['values-1'] = df['values'].sub(1)
>>> df['where'] = df['values-1'].eq(df['end_cumsum_s1'])
>>> df 
     flag  values  end_cumsum  end_cumsum_s1  values-1  where
0    None       0           0            0.0        -1  False
1   start       1           0            0.0         0   True
2    None       1           0            0.0         0   True
3    None       1           0            0.0         0   True
4     end       1           1            0.0         0   True
5   start       2           1            1.0         1   True
6     end       2           2            1.0         1   True
7    None       2           2            2.0         1  False
8   start       3           2            2.0         2   True
9    None       3           2            2.0         2   True
10    end       3           3            2.0         2   True
11   None       3           3            3.0         2  False

edit: added .fillna(0) to account for dataframes where the first value in the 'flag' column is 'start'.