I need to add a column with changes ow worker coordinates through different stages. We have a DataFrame:
import pandas as pd
from geopy.distance import geodesic as GD
d = {'user_id': [26, 26, 26, 26, 26, 26, 9, 9, 9, 9],
'worker_latitude': [55.114410, 55.114459, 55.114379,
55.114462, 55.114372, 55.114389, 65.774064, 65.731034, 65.731034, 65.774057],
'worker_longitude': [38.927155, 38.927114, 38.927101, 38.927156,
38.927258, 38.927120, 37.532380, 37.611746, 37.611746, 37.532346],
'change':[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data=d)
which looks like:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
Then I need to count difference between person previous and current stage. So I use a function:
for group in df.groupby(by='user_id'):
group[1].reset_index(inplace=True,drop=True)
for i in range(1,len(group[1])):
first_xy=(group[1]['worker_latitude'][i-1],group[1]['worker_longitude'][i-1])
second_xy=(group[1]['worker_latitude'][i],group[1]['worker_longitude'][i])
print((round((GD(first_xy, second_xy).km),6)))
group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)
And then I get:
6.021576
0.0
6.021896
0.00605
0.008945
0.009884
0.011948
0.009007
display(df)
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 0
1 26 55.114459 38.927114 0
2 26 55.114379 38.927101 0
3 26 55.114462 38.927156 0
4 26 55.114372 38.927258 0
5 26 55.114389 38.927120 0
6 9 65.774064 37.532380 0
7 9 65.731034 37.611746 0
8 9 65.731034 37.611746 0
9 9 65.774057 37.532346 0
Which means that values are counted correctly, but for some reason they don't fit into 'change' column. What can be done?
CodePudding user response:
It doesn't works because you're accessing a copy of your DataFrame and trying to assign value to it.
However, it seems instead of iterating over the DataFrame inside groupby, it seems more intuitive to use groupby shift to get the first_xys first; then apply a custom function that applies GD between first_xy and second_xy to each row:
def func(x):
if x.notna().all():
first_xy = (x['prev_lat'], x['prev_long'])
second_xy = (x['worker_latitude'], x['worker_longitude'])
return round((GD(first_xy, second_xy).km), 6)
else:
return float('nan')
g = df.groupby('user_id')
df['prev_lat'] = g['worker_latitude'].shift()
df['prev_long'] = g['worker_longitude'].shift()
df['change'] = df.apply(func, axis=1)
df = df.drop(columns=['prev_lat','prev_long'])
Output:
user_id worker_latitude worker_longitude change
0 26 55.114410 38.927155 NaN
1 26 55.114459 38.927114 0.006050
2 26 55.114379 38.927101 0.008945
3 26 55.114462 38.927156 0.009884
4 26 55.114372 38.927258 0.011948
5 26 55.114389 38.927120 0.009007
6 9 65.774064 37.532380 NaN
7 9 65.731034 37.611746 6.021576
8 9 65.731034 37.611746 0.000000
9 9 65.774057 37.532346 6.021896
CodePudding user response:
I think that the problem could be the following line:
group[1]['change'][i]=round((GD(first_xy, second_xy).km),6)
You are updating the group variable, and you should update df variable instead. My suggestion for you fixing this property is:
df.loc[i, "change"] = round((GD(first_xy, second_xy).km),6)
Considering that i is the row number that you want update, and "change" is the column name.
