Home > OS >  getting next elements by time with shift() in pandas dataframe
getting next elements by time with shift() in pandas dataframe

Time:01-26

Following up on this question, I have a dataframe like so, I want to create a new column next_domain and next_next_domain.

It's calculated by looking for the next domain (and next next domain) for an IP by the timestamp. How can i do this in pandas?

Input:

    domain      ip      timestamp
0   Google      101     2020-04-01 23:01:41
1   Google      101     2020-04-01 23:01:59
2   Google      101     2020-04-02 12:01:41
3   Facebook    101     2020-04-02 13:11:33
4   Facebook    101     2020-04-02 13:11:35
5   Youtube     103     2020-04-21 13:01:41
6   Youtube     103     2020-04-21 13:11:46
7   Youtube     103     2020-04-22 01:01:01
8   Google      103     2020-04-22 02:11:23
9   Facebook    103     2020-04-23 14:11:13
10  Youtube     103     2020-04-23 14:11:55

Expected output:

    domain      ip      timestamp              next_domain    next_next_domain
0   Google      101     2020-04-01 23:01:41    Facebook       N/A
1   Google      101     2020-04-01 23:01:59    Facebook       N/A
2   Google      101     2020-04-02 12:01:41    Facebook       N/A
3   Facebook    101     2020-04-02 13:11:33    N/A            N/A
4   Facebook    101     2020-04-02 13:11:35    N/A            N/A
5   Youtube     103     2020-04-21 13:01:41    Google         Facebook
6   Youtube     103     2020-04-21 13:11:46    Google         Facebook
7   Youtube     103     2020-04-22 01:01:01    Google         Facebook
8   Google      103     2020-04-22 02:11:23    Facebook       Youtube
9   Facebook    103     2020-04-23 14:11:13    Youtube        N/A
10  Youtube     103     2020-04-23 14:11:55    N/A            N/A

CodePudding user response:

Building on my answer to you previous question, you can craft a function and iterate the process:

def next_domain(df, col='domain', group='ip'):
    s = df[col]
    return (s.where(s.ne(s.shift())) # keep only first domain of each stretch
                      .groupby(df[group])                   # per group
                      .apply(lambda s: s.bfill().shift(-1)) # bfill and shift up
                    )

df['next_domain'] = next_domain(df, 'domain')
df['next_next_domain'] = next_domain(df, 'next_domain')

output:

      domain   ip            timestamp next_domain next_next_domain
0     Google  101  2020-04-01 23:01:41    Facebook              NaN
1     Google  101  2020-04-01 23:01:59    Facebook              NaN
2     Google  101  2020-04-02 12:01:41    Facebook              NaN
3   Facebook  101  2020-04-02 13:11:33         NaN              NaN
4   Facebook  101  2020-04-02 13:11:35         NaN              NaN
5    Youtube  103  2020-04-21 13:01:41      Google         Facebook
6    Youtube  103  2020-04-21 13:11:46      Google         Facebook
7    Youtube  103  2020-04-22 01:01:01      Google         Facebook
8     Google  103  2020-04-22 02:11:23    Facebook          Youtube
9   Facebook  103  2020-04-23 14:11:13     Youtube              NaN
10   Youtube  103  2020-04-23 14:11:55         NaN              NaN
  •  Tags:  
  • Related