I have a file with columns A, B, C.
Next week I will get a new file with Columns A, B, C.
A and B are a unique combination. I want to generate a unique id based on these 2 fields.
When I get the new file, I want to apply the same thing to it, so if combination A, B is also in next week’s file, it will have the same ID as the previous week.
CodePudding user response:
One way is to maintain a dataset with unique ids which you have already created, e.g., in a pickle file. Let's call this tracker. Whenever you find new unique combinations for columns A and B, you add them to the dataset. To create ids, you can use pd.factorize. Be aware to ensure that new ids have to be different from the ones that are already created and included in the tracker.
This is a minimum working example:
import pandas as pd
import numpy as np
def factorize_data(df, tracker):
# merge already existing identifers
df_factorized = df.merge(
tracker,
on=['A','B'],
how='left'
)
# get rows which do not yet have a unique id
untracked = df_factorized.loc[df_factorized['id'].isnull(), ['A','B']]
if not untracked.empty:
# create new id
untracked['id'] = pd.factorize(pd._libs.lib.fast_zip([untracked.A.values, untracked.B.values]))[0].astype(int)
# ensure that new ids are not confused with old ids
max_id = 0 if tracker.empty else tracker['id'].max()
untracked['id'] = max_id 1
# update tracker
tracker_updated = pd.concat([tracker, untracked])
# update dataframe with new ids
df_factorized = df_factorized.merge(
untracked,
on=['A','B'],
how='left',
suffixes=('_old', '_new')
)
# combine old and new ids
df_factorized['id'] = np.where(df_factorized['id_old'].isnull(), df_factorized['id_new'], df_factorized['id_old'])
df_factorized['id'] = df_factorized['id'].astype(int)
df_factorized.drop(columns=['id_new','id_old'], inplace=True)
return df_factorized, tracker_updated
return df_factorized, tracker
# initialize tracking dataframe
tracker = pd.DataFrame(columns=['A','B','id'])
# data today
df_today = pd.DataFrame({
'A': [1,2,3],
'B': [4,5,6],
'C': [7,8,9]
})
# data tomorrow
df_tomorrow = pd.DataFrame({
'A': [1,4,5],
'B': [4,10,11],
'C': [7,12,13]
})
# create new column with unique ids and update tracker if necessary
df_today_factorized, tracker_updated = factorize_data(df_today, tracker)
# overwrite tracker
tracker = tracker_updated
# create new column with unique ids and update tracker if necessary
df_tomorrow_factorized, tracker_updated = factorize_data(df_tomorrow, tracker)
It transform df_today from
A B C
0 1 4 7
1 2 5 8
2 3 6 9
to
A B C id
0 1 4 7 1
1 2 5 8 2
2 3 6 9 3
and updates the initialized, empty tracker to the following dataset
A B id
0 1 4 1
1 2 5 2
2 3 6 3
Applying the procedure to tomorrow's dataframe, df_tomorrow,
A B C
0 1 4 7
1 4 10 12
2 5 11 13
transforms it to
A B C id
0 1 4 7 1
1 4 10 12 4
2 5 11 13 5
and update the tracker to
A B id
0 1 4 1
1 2 5 2
2 3 6 3
1 4 10 4
2 5 11 5
Note that the id of the first row in df_tomorrow relies on the id found in df_today while the other ones are created and added to the tracker.
