I have a dataframe that shows lineage (in the sample data provided it is a person's route). I have another dataframe that gives me preceding information about the routes, which should slot in the same format.
First sample looks like this:
'Lineage Step': {0: 1, 1: 2, 2: 3, 3: 4, 4:1, 5:2},
'From Country': {0: 'Spain', 1: 'Scotland', 2: 'England', 3: 'England', 4: 'Scotland', 5:'England'},
'From Town': {0: 'Madrid', 1: 'Edinburgh', 2: 'London', 3: 'London', 4: 'Edinburgh', 5: 'Manchester'},
'FromStreet': {0: 'Spanish St', 1: 'Main St', 2: 'Lower St', 3: 'Middle St', 4: 'London St', 5: 'Scotland St'},
'ToCountry': {0: 'Scotland', 1: 'England', 2: 'England', 3: 'England', 4: 'England', 5: 'England'},
'ToTown': {0: 'Edinburgh', 1: 'London', 2: 'London', 3: 'London', 4: 'Liverpool', 5: 'London'},
'ToStreet': {0: 'Lower St', 1: 'Middle St', 2: 'Upper St', 3: 'Upper St', 4: 'new St', 5: 'Old St'}}
sample_data = pd.DataFrame.from_dict(current_route)
The data I'd like to incorporate into this format looks like this:
upstream_route = {'From Country': {0: 'France', 1: 'Germany'},
'From Town': {0: 'Paris', 1: 'Berlin'},
'FromStreet': {0: 'French St', 1: 'Gerrman St'},
'ToCountry': {0: 'Spain', 1: 'Scotland'},
'ToTown': {0: 'Madrid', 1: 'Edinburgh'},
'ToStreet': {0: 'Spanish St', 1: 'London St'}}
upstream_data = pd.DataFrame.from_dict(upstream_route)
The data here is non-person specific. I.e. anyone who starts in Spain should now have the prior France leg of the journey added in.
The final dataset would look like this,
joined_route = {'AttributeName': {0: 'John', 1: 'John', 2: 'John', 3: 'John', 4: 'John', 5: 'Sally', 6: 'Sally', 7:'Sally'},
'Lineage Step': {0: 1, 1: 2, 2: 3, 3: 4, 4:5, 5:1, 6:2, 7:3},
'From Country': {0: 'France', 1:'Spain', 2: 'Scotland', 3: 'England', 4: 'England', 5: 'Germany', 6:'Scotland', 7: 'England'},
'From Town': {0: 'Paris', 1: 'Madrid', 2: 'Edinburgh', 3: 'London', 4: 'London', 5: 'Berlin', 6: 'Edinburgh', 7: 'London'},
'FromStreet': {0: 'French St', 1: 'Spanish St', 2: 'Main St', 3: 'Lower St', 4: 'Middle St', 5: 'German St', 6: 'London St', 7: 'Scotland St'},
'ToCountry': {0: 'Spain', 1: 'Scotland', 2: 'England', 3: 'England', 4: 'England', 5: 'Scotland', 6: 'England', 7: 'England'},
'ToTown': {0: 'Madrid', 1: 'Edinburgh', 2: 'London', 3: 'London', 4: 'Edinburgh', 5: 'London', 6: 'Liverpool', 7: 'London'},
'ToStreet': {0: 'Spanish St', 1: 'Lower St', 2: 'Middle St', 3: 'Upper St', 4: 'Upper St', 5: 'London St', 6: 'new St', 7: 'Old St'}}
joined_data = pd.DataFrame.from_dict(joined_route)
Where we see John and Sally have respectively had France and Germany added into their journeys, even though the added data didn't use the name variable.
One way I've thought about doing this is to merge the data like so:
sample_data.merge(upstream_data, left_on = ['From Country', 'From Town','FromStreet'], right_on = ['ToCountry', 'ToTown',
'ToStreet'], how = 'left', suffixes= ('_current', '_new'))
I'd then duplicate the lines and then delete alternating lines on the current / new columns, and merge the two (that's theoretical though,I'm not sure how the code would work in practice.
Alternatively, a way of joining the data vertically, so the new lines slot in automatically above based on a match, would be the quickest and easiest way of joining the data. Is this possible?
CodePudding user response:
Your first idea is doable (and seems to be the best approach).
First we merge and select the 'new' columns that we want. We also set 'Lineage Step to 0 as we will sort later:
df_up = pd.merge(sample_data, upstream_data, left_on = ['From Country', 'From Town','FromStreet'], right_on = ['ToCountry', 'ToTown',
'ToStreet'], how = 'left', suffixes= ('_current', ''))[['AttributeName','Lineage Step','From Country', 'From Town','FromStreet','ToCountry', 'ToTown',
'ToStreet']].dropna()
df_up['Lineage Step'] = 0
df_up
df_up looks like this:
AttributeName Lineage Step From Country From Town FromStreet ToCountry ToTown ToStreet
-- --------------- -------------- -------------- ----------- ------------ ----------- --------- ----------
0 John 0 France Paris French St Spain Madrid Spanish St
4 Sally 0 Germany Berlin Gerrman St Scotland Edinburgh London St
now we append this dataframe to sample_data and sort
df_jn = sample_data.append(df_up, ignore_index = True).sort_values(['AttributeName','Lineage Step'])
df_jn['Lineage Step'] =1
df_jn looks like this:
AttributeName Lineage Step From Country From Town FromStreet ToCountry ToTown ToStreet
-- --------------- -------------- -------------- ----------- ------------ ----------- --------- ----------
6 John 1 France Paris French St Spain Madrid Spanish St
0 John 2 Spain Madrid Spanish St Scotland Edinburgh Lower St
1 John 3 Scotland Edinburgh Main St England London Middle St
2 John 4 England London Lower St England London Upper St
3 John 5 England London Middle St England London Upper St
7 Sally 1 Germany Berlin Gerrman St Scotland Edinburgh London St
4 Sally 2 Scotland Edinburgh London St England Liverpool new St
5 Sally 3 England Manchester Scotland St England London Old St
