Home > OS >  Creating flags for time series data before switch
Creating flags for time series data before switch

Time:01-20

I have a dataframe like so, I want to create a new column switch.

Switch is set to 1 if it is the last timestamp before the domain switches for a certain IP and 0 otherwise. If the timestamp is for a domain that does not switch yet (the last one for an IP), it will be 0 for switch. How can i do this in pandas?

Input:

    domain      ip      timestamp
0   Google      101     2020-04-01 23:01:41
1   Google      101     2020-04-01 23:01:59
2   Google      101     2020-04-02 12:01:41
3   Facebook    101     2020-04-02 13:11:33
4   Facebook    101     2020-04-02 13:11:35
5   Youtube     103     2020-04-21 13:01:41
6   Youtube     103     2020-04-21 13:11:46
7   Youtube     103     2020-04-22 01:01:01
8   Google      103     2020-04-22 02:11:23
9   Facebook    103     2020-04-23 14:11:13
10  Youtube     103     2020-04-23 14:11:55

Expected output:

In the following table , row 2 has switch = 1 because it switches to Facebook right after (as seen in the timestamp) for the same IP.

Row 7 is a switch because Youtube changes to Google for IP 103, Row 8 is a switch because Google changes to Facebook for IP 103, and Row 10 is not a switch because there is no domain after Youtube.

    domain      ip      timestamp              switch
0   Google      101     2020-04-01 23:01:41    0
1   Google      101     2020-04-01 23:01:59    0
2   Google      101     2020-04-02 12:01:41    1
3   Facebook    101     2020-04-02 13:11:33    0 
4   Facebook    101     2020-04-02 13:11:35    0
5   Youtube     103     2020-04-21 13:01:41    0
6   Youtube     103     2020-04-21 13:11:46    0
7   Youtube     103     2020-04-22 01:01:01    1
8   Google      103     2020-04-22 02:11:23    1
9   Facebook    103     2020-04-23 14:11:13    1
10  Youtube     103     2020-04-23 14:11:55    0

CodePudding user response:

here is one way :

df['switch'] = (df.groupby('ip')['domain'].shift(-1).fillna(df['domain'])!= df['domain']).astype(int)

output:

      domain   ip            timestamp  switch
0     Google  101  2020-04-01 23:01:41       0
1     Google  101  2020-04-01 23:01:59       0
2     Google  101  2020-04-02 12:01:41       1
3   Facebook  101  2020-04-02 13:11:33       0
4   Facebook  101  2020-04-02 13:11:35       0
5    Youtube  103  2020-04-21 13:01:41       0
6    Youtube  103  2020-04-21 13:11:46       0
7    Youtube  103  2020-04-22 01:01:01       1
8     Google  103  2020-04-22 02:11:23       1
9   Facebook  103  2020-04-23 14:11:13       1
10   Youtube  103  2020-04-23 14:11:55       0

CodePudding user response:

I am not receiving the expected output with the other solution. So here is an alternative solution:

import pandas as pd
from io import StringIO

# Create dataframe
data = StringIO("""index    domain    ip    timestamp
0    Google    101    2020-04-01 23:01:41
1    Google    101    2020-04-01 23:01:59
2    Google    101    2020-04-02 12:01:41
3    Facebook    101    2020-04-02 13:11:33
4    Facebook    101    2020-04-02 13:11:35
5    Youtube    103    2020-04-21 13:01:41
6    Youtube    103    2020-04-21 13:11:46
7    Youtube    103    2020-04-22 01:01:01
8    Google    103    2020-04-22 02:11:23
9    Facebook    103    2020-04-23 14:11:13
10    Youtube    103    2020-04-23 14:11:55""")
df = pd.read_csv(data, sep='    ', engine='python', index_col='index')

# Create switch column
df['switch'] = ((df['domain'].shift(-1) != df['domain']) & (df['ip'].shift(-1) == df['ip'])).astype(int)
print(df)

Output:

         domain   ip            timestamp  switch
index                                            
0        Google  101  2020-04-01 23:01:41       0
1        Google  101  2020-04-01 23:01:59       0
2        Google  101  2020-04-02 12:01:41       1
3      Facebook  101  2020-04-02 13:11:33       0
4      Facebook  101  2020-04-02 13:11:35       0
5       Youtube  103  2020-04-21 13:01:41       0
6       Youtube  103  2020-04-21 13:11:46       0
7       Youtube  103  2020-04-22 01:01:01       1
8        Google  103  2020-04-22 02:11:23       1
9      Facebook  103  2020-04-23 14:11:13       1
10      Youtube  103  2020-04-23 14:11:55       0
  •  Tags:  
  • Related