Iterating over lists in pandas dataframe to remove everything after certain value (if the value exis

I want to filter the values in my dataframe based on the occurrence of a 1 in the events column. When a 1 occurs, everything after it should be removed.

I want to do this for my whole dataframe, which looks like this:

import pandas as pd

df = pd.DataFrame([['00000000000 ', [4, 5, 5, 3, 2, 1, 5]],
                   ['00000000001', [4, 5, 5, 1, 2, 1, 5, 5, 5]],
                   ['00000000002 ', [4, 5, 1, 3, 2, 1, 5, 5, 5, 1]]],
                  columns=['session_id', 'events'])

The following solution, as answered in this question, works:

df['events_short'] = ""
for i, row in df.iterrows():
    df.at[i, 'events_short'] = row['events'][:row['events'].index(1)]

This only works if a 1 occurs. When it doesn't, I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-175-e4d3f228e32f> in <module>()
      1 df['events_short'] = ""
      2 for i, row in df.iterrows():
----> 3     df.at[i, 'events_short'] = row['events'][:row['events'].index(1)]

ValueError: 1 is not in list

Therefore, I need to handle the case where the 1 does not occur in the list. Can someone help me set this up? Thanks!

CodePudding user response：

You can use apply to find the first 1 in each list and truncate the list there. Lists without a 1 become None:

df['events_short'] = df['events'].apply(lambda x: x[:x.index(1)] if 1 in x else None)

If you want to include the 1:

df['events_short'] = df['events'].apply(lambda x: x[:x.index(1) + 1] if 1 in x else None)

Note that apply is preferred over (and is generally faster than) iterrows.
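
As a rough sketch of how the two lambdas above behave, the snippet below runs them on a small frame that also contains a list without a 1. The fourth row and the column names events_excl and events_incl are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row without a 1.
df = pd.DataFrame({
    'session_id': ['00000000000', '00000000001', '00000000002', '00000000003'],
    'events': [[4, 5, 5, 3, 2, 1, 5],
               [4, 5, 5, 1, 2, 1, 5, 5, 5],
               [4, 5, 1, 3, 2, 1, 5, 5, 5, 1],
               [0, 2, 3]],
})

# Keep everything before the first 1 (None when there is no 1).
df['events_excl'] = df['events'].apply(
    lambda x: x[:x.index(1)] if 1 in x else None)

# Keep everything up to and including the first 1 (None when there is no 1).
df['events_incl'] = df['events'].apply(
    lambda x: x[:x.index(1) + 1] if 1 in x else None)

print(df[['events_excl', 'events_incl']])
```

If the untouched list is wanted instead of None for rows without a 1, replacing `else None` with `else x` should do it.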

CodePudding user response：

While @OnY's answer is nice, it has to scan each list twice (once to check whether a 1 is present, once to find its index).

A more efficient approach might be to use a helper function with try/except:

def upto1(l):
    try:
        return l[:l.index(1)]
    except ValueError:
        return l
    
df['events2'] = df['events'].apply(upto1)

Example (the last row, which contains no 1, is left unchanged):

    session_id                          events          events2
0  00000000000           [4, 5, 5, 3, 2, 1, 5]  [4, 5, 5, 3, 2]
1  00000000001     [4, 5, 5, 1, 2, 1, 5, 5, 5]        [4, 5, 5]
2  00000000002  [4, 5, 1, 3, 2, 1, 5, 5, 5, 1]           [4, 5]
3  00000000003                       [0, 2, 3]        [0, 2, 3]
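
The same try/except idea can be sketched in a slightly more general form. The name truncate_at and its value and include parameters are hypothetical additions, not part of the answer above. One deliberate difference: when the value is missing, this sketch returns a copy of the list, whereas upto1 returns the original list object itself, so both columns would share the same list.

```python
def truncate_at(items, value=1, include=False):
    """Return items up to the first occurrence of value.

    Hypothetical helper: with include=True the matching element is kept.
    If value is absent, a shallow copy of the whole list is returned.
    """
    try:
        end = items.index(value)  # single scan; raises ValueError if absent
    except ValueError:
        return items[:]           # copy, so the result is independent of the input
    return items[:end + 1] if include else items[:end]


print(truncate_at([4, 5, 5, 1, 2, 1, 5, 5, 5]))                # [4, 5, 5]
print(truncate_at([4, 5, 5, 1, 2, 1, 5, 5, 5], include=True))  # [4, 5, 5, 1]
print(truncate_at([0, 2, 3]))                                  # [0, 2, 3]
```

With pandas, extra keyword arguments can be forwarded through apply, for example df['events'].apply(truncate_at, include=True).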

CodePudding user response：

Building on @mozway's answer: it is (generally) good practice to avoid intentionally raising and catching an exception, since the try/except path can be slower than logic that never fails:

def upto1(l):
    return l[:l.index(1)] if 1 in l else l

df['events2'] = df['events'].apply(upto1)
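
Which variant is actually faster probably depends on how often the 1 is missing and on how long the lists are, so it may be worth measuring on your own data. A minimal timing sketch, assuming the two functions are renamed upto1_try and upto1_check and using made-up sample lists:

```python
import timeit


def upto1_try(l):
    # try/except variant: one scan, but raises when 1 is absent
    try:
        return l[:l.index(1)]
    except ValueError:
        return l


def upto1_check(l):
    # membership-check variant: no exception, but two scans when 1 is present
    return l[:l.index(1)] if 1 in l else l


# Illustrative inputs: one list with a 1, one longer list without.
samples = {
    'with 1': [4, 5, 5, 3, 2, 1, 5],
    'without 1': [0, 2, 3] * 100,
}

for label, data in samples.items():
    for func in (upto1_try, upto1_check):
        seconds = timeit.timeit(lambda: func(data), number=20_000)
        print(f"{label:10s} {func.__name__:12s} {seconds:.4f}s")
```

Both functions return the same result for every input; only the running time differs.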