I'd like to find the start and stop points of a column and flag them as shown below:
| value | flag |
|---|---|
| NaN | NaN |
| NaN | NaN |
| 1 | start |
| 2 | NaN |
| 1 | NaN |
| 3 | NaN |
| 2 | stop |
| NaN | NaN |
| 1 | start |
| 2 | stop |
CodePudding user response:
startoccurs when the current value isnotnulland the previous valueisnullstopoccurs when the current value isnotnulland the next valueisnull
Generate these conditions using shift and assign using loc:
start = df.value.notnull() & df.value.shift().isnull()
stop = df.value.notnull() & df.value.shift(-1).isnull()
df.loc[start, 'flag'] = 'start'
df.loc[stop, 'flag'] = 'stop'
# value flag
# 0 NaN NaN
# 1 NaN NaN
# 2 1.0 start
# 3 2.0 NaN
# 4 1.0 NaN
# 5 3.0 NaN
# 6 2.0 stop
# 7 NaN NaN
# 8 1.0 start
# 9 2.0 stop
Alternatively assign using mask:
df['flag'] = df['flag'].mask(start, 'start')
df['flag'] = df['flag'].mask(stop, 'stop')
CodePudding user response:
Here I iterated through the rows and used a flag to indicate if we were starting or not.
start_flag = 0
for index, row in df.iterrows():
if row['val'].isnull():
df.loc[index, 'flag'] = "NaN"
start_flag = 0
else:
if start_flag == 0:
df.loc[index, 'flag'] = "start"
start_flag = 1
else:
if (index < df.shape[0]-1 and df.loc[index 1, 'val'].isnull()) or index == df.shape[0]-1:
df.loc[index, 'flag'] = "stop"
CodePudding user response:
Here's what you need:
# Auxiliar columns to detect start and end
df['Past'] = df['Value'].shift(-1)
df['Future'] = df['Value'].shift(1)
# Auxiliar function to complete new column
def Find_start_stop_Null(row):
flag = np.nan
if ((not pd.isnull(row['Value'])) and (pd.isnull(row['Future']))):
flag = 'start'
elif ((not pd.isnull(row['Value'])) and (pd.isnull(row['Past']))):
flag = 'stop'
return flag
df['flag'] = df.apply(lambda row: Find_start_stop_Null(row), axis=1)
# Drop unnecessary columns
df = df.drop('Past', axis=1)
df = df.drop('Future', axis=1)
CodePudding user response:
df = pd.DataFrame({'value': [np.nan, np.nan, 1, 2, 1, 3, 2, np.nan, 1, 2]}, dtype='object')
df['flag'] = df.value.groupby(df.value.notna().diff().ne(0).cumsum()). \
apply(lambda se: pd.Series(['start'] [np.nan] * (se.size - 2) ['stop'], index=se.index) if se.notna().sum() else se)
print(df)
value flag
0 NaN NaN
1 NaN NaN
2 1 start
3 2 NaN
4 1 NaN
5 3 NaN
6 2 stop
7 NaN NaN
8 1 start
9 2 stop
