How to generate a unique id based on a combination of columns, but keep it consistent across differe-CodePudding

I have a file with columns A, B, C.

Next week I will get a new file with Columns A, B, C.

A and B are a unique combination. I want to generate a unique id based on these 2 fields.

When I get the new file, I want to apply the same thing to it, so if combination A, B is also in next week’s file, it will have the same ID as the previous week.

CodePudding user response：

One way is to maintain a dataset with unique ids which you have already created, e.g., in a pickle file. Let's call this tracker. Whenever you find new unique combinations for columns A and B, you add them to the dataset. To create ids, you can use pd.factorize. Be aware to ensure that new ids have to be different from the ones that are already created and included in the tracker.

This is a minimum working example:

import pandas as pd
import numpy as np

def factorize_data(df, tracker):
    # merge already existing identifers
    df_factorized = df.merge(
        tracker,
        on=['A','B'],
        how='left'
    )

    # get rows which do not yet have a unique id
    untracked = df_factorized.loc[df_factorized['id'].isnull(), ['A','B']]
    
    if not untracked.empty:
        # create new id
        untracked['id'] = pd.factorize(pd._libs.lib.fast_zip([untracked.A.values, untracked.B.values]))[0].astype(int)

        # ensure that new ids are not confused with old ids
        max_id = 0 if tracker.empty else tracker['id'].max()
        untracked['id']  = max_id   1

        # update tracker
        tracker_updated = pd.concat([tracker, untracked])

        # update dataframe with new ids
        df_factorized = df_factorized.merge(
            untracked,
            on=['A','B'],
            how='left',
            suffixes=('_old', '_new')
        )
        
        # combine old and new ids
        df_factorized['id'] = np.where(df_factorized['id_old'].isnull(), df_factorized['id_new'], df_factorized['id_old'])
        df_factorized['id'] = df_factorized['id'].astype(int)
        df_factorized.drop(columns=['id_new','id_old'], inplace=True)

        return df_factorized, tracker_updated

    return df_factorized, tracker

# initialize tracking dataframe
tracker = pd.DataFrame(columns=['A','B','id'])

# data today
df_today = pd.DataFrame({
    'A': [1,2,3],
    'B': [4,5,6],
    'C': [7,8,9]
})

# data tomorrow
df_tomorrow = pd.DataFrame({
    'A': [1,4,5],
    'B': [4,10,11],
    'C': [7,12,13]
})

# create new column with unique ids and update tracker if necessary
df_today_factorized, tracker_updated = factorize_data(df_today, tracker)

# overwrite tracker
tracker = tracker_updated

# create new column with unique ids and update tracker if necessary
df_tomorrow_factorized, tracker_updated = factorize_data(df_tomorrow, tracker)

It transform df_today from

   A  B  C  id
0  1  4  7   1
1  2  5  8   2
2  3  6  9   3

and updates the initialized, empty tracker to the following dataset

Applying the procedure to tomorrow's dataframe, df_tomorrow,

   A   B   C
0  1   4   7
1  4  10  12
2  5  11  13

transforms it to

   A   B   C  id
0  1   4   7   1
1  4  10  12   4
2  5  11  13   5

and update the tracker to

   A   B id
0  1   4  1
1  2   5  2
2  3   6  3
1  4  10  4
2  5  11  5

Note that the id of the first row in df_tomorrow relies on the id found in df_today while the other ones are created and added to the tracker.