I have a dataframe like so, I want to create a new column switch.
Switch is set to 1 if it is the last timestamp before the domain switches for a certain IP and 0 otherwise. If the timestamp is for a domain that does not switch yet (the last one for an IP), it will be 0 for switch. How can i do this in pandas?
Input:
domain ip timestamp
0 Google 101 2020-04-01 23:01:41
1 Google 101 2020-04-01 23:01:59
2 Google 101 2020-04-02 12:01:41
3 Facebook 101 2020-04-02 13:11:33
4 Facebook 101 2020-04-02 13:11:35
5 Youtube 103 2020-04-21 13:01:41
6 Youtube 103 2020-04-21 13:11:46
7 Youtube 103 2020-04-22 01:01:01
8 Google 103 2020-04-22 02:11:23
9 Facebook 103 2020-04-23 14:11:13
10 Youtube 103 2020-04-23 14:11:55
Expected output:
In the following table , row 2 has switch = 1 because it switches to Facebook right after (as seen in the timestamp) for the same IP.
Row 7 is a switch because Youtube changes to Google for IP 103, Row 8 is a switch because Google changes to Facebook for IP 103, and Row 10 is not a switch because there is no domain after Youtube.
domain ip timestamp switch
0 Google 101 2020-04-01 23:01:41 0
1 Google 101 2020-04-01 23:01:59 0
2 Google 101 2020-04-02 12:01:41 1
3 Facebook 101 2020-04-02 13:11:33 0
4 Facebook 101 2020-04-02 13:11:35 0
5 Youtube 103 2020-04-21 13:01:41 0
6 Youtube 103 2020-04-21 13:11:46 0
7 Youtube 103 2020-04-22 01:01:01 1
8 Google 103 2020-04-22 02:11:23 1
9 Facebook 103 2020-04-23 14:11:13 1
10 Youtube 103 2020-04-23 14:11:55 0
CodePudding user response:
here is one way :
df['switch'] = (df.groupby('ip')['domain'].shift(-1).fillna(df['domain'])!= df['domain']).astype(int)
output:
domain ip timestamp switch
0 Google 101 2020-04-01 23:01:41 0
1 Google 101 2020-04-01 23:01:59 0
2 Google 101 2020-04-02 12:01:41 1
3 Facebook 101 2020-04-02 13:11:33 0
4 Facebook 101 2020-04-02 13:11:35 0
5 Youtube 103 2020-04-21 13:01:41 0
6 Youtube 103 2020-04-21 13:11:46 0
7 Youtube 103 2020-04-22 01:01:01 1
8 Google 103 2020-04-22 02:11:23 1
9 Facebook 103 2020-04-23 14:11:13 1
10 Youtube 103 2020-04-23 14:11:55 0
CodePudding user response:
I am not receiving the expected output with the other solution. So here is an alternative solution:
import pandas as pd
from io import StringIO
# Create dataframe
data = StringIO("""index domain ip timestamp
0 Google 101 2020-04-01 23:01:41
1 Google 101 2020-04-01 23:01:59
2 Google 101 2020-04-02 12:01:41
3 Facebook 101 2020-04-02 13:11:33
4 Facebook 101 2020-04-02 13:11:35
5 Youtube 103 2020-04-21 13:01:41
6 Youtube 103 2020-04-21 13:11:46
7 Youtube 103 2020-04-22 01:01:01
8 Google 103 2020-04-22 02:11:23
9 Facebook 103 2020-04-23 14:11:13
10 Youtube 103 2020-04-23 14:11:55""")
df = pd.read_csv(data, sep=' ', engine='python', index_col='index')
# Create switch column
df['switch'] = ((df['domain'].shift(-1) != df['domain']) & (df['ip'].shift(-1) == df['ip'])).astype(int)
print(df)
Output:
domain ip timestamp switch
index
0 Google 101 2020-04-01 23:01:41 0
1 Google 101 2020-04-01 23:01:59 0
2 Google 101 2020-04-02 12:01:41 1
3 Facebook 101 2020-04-02 13:11:33 0
4 Facebook 101 2020-04-02 13:11:35 0
5 Youtube 103 2020-04-21 13:01:41 0
6 Youtube 103 2020-04-21 13:11:46 0
7 Youtube 103 2020-04-22 01:01:01 1
8 Google 103 2020-04-22 02:11:23 1
9 Facebook 103 2020-04-23 14:11:13 1
10 Youtube 103 2020-04-23 14:11:55 0
