Home > OS >  Pandas - Generate ID based on row values AND proximity of rows
Pandas - Generate ID based on row values AND proximity of rows

Time:01-20

id dob status sex work_code work_cat gross_income dependent_cat dependent_dob preparer city new_id
1 1981 Married MALE 1111 Government 15000 U 2001 Taxpayer BOSTON 1
2 1981 Married MALE 1111 Government 15000 U 2002 Taxpayer BOSTON 1
3 1983 SINGLE MALE 2222 Federal 20000 N 2009 Taxpayer MIAMI 2
4 1983 SINGLE MALE 2222 Federal 20000 N 2010 Taxpayer MIAMI 2
5 1975 SINGLE MALE 2222 Federal 12000 N/A - Taxpayer NYC 3
6 1983 Married MALE 2222 Federal 18000 U 2013 Taxpayer NYC 4
7 1981 Married MALE 1111 Government 12000 N 2005 Taxpayer PORLAND 5
8 1981 Married MALE 1111 Government 15000 N/A - Taxpayer BOSTON 6

About the data

I've a .txt dataset with 1.6m rows that looks like this, sans the new_id column which I want to generate. Each row is a taxfiller and their dependent (unless dependent_cat is N/A). When a row is followed by an identical one (sans the two dependent_X columns), it is assumed that this is in fact the same household, but a different dependent within the household (and thus, I wish to generate new_id = 1 for both, so I can later collapse the dataset such that each row is a household, and there is a column that counts the dependents and whatnot).

What I've done

I have in fact generated household ID's in the real dataset, but I did not manage to incorporate the 'row proximity'. Turns out its quite important - got too many cases like the example of #1,2,8 being bunched up into the same household when they shouldn't be. Got too many 0 income households with a count of >10 dependents due to this.

What I propose

I've tried to think this through, and I came up with:

looping through each row, assigning a new ID if dependent_cat is N/A, otherwise, verify if the previous row is identical in all columns (except dependent_cat and dependent_dob) if so, copy its new_id otherwise, new_id = last assigned id 1

Why my proposal sucks - looking for alternatives

However I know that looping through 1.6m dataframe rows is not very pythonic. I might try it, but relying on it would be continuing my lazy approach to Python.

Is there any way that hashing could help me here? I've only found hashing which would make id #1 and id #8 end up with the same new_id - leaving me at my current status. I'd need hashing to 'reset' when it passes id #2, such that it doesn't assign id #8 their same new_id. I would also prefer, but not require, new_id to be cardinal numbers as I've shown (such that max new_id, in the end, is the number of households in my dataset).

Related question: Pandas - Generate Unique ID based on row values

CodePudding user response:

(i) Create a list from the column names that are relevant to identify households (basically, dropped "id", "dependent_dob" since one is unique for every entry and the other is the personal info of the dependent).

(ii) use shift to shift index by 1 and see if consecutive rows match for the columns identified in (i). The idea is, if there is any difference, then it's a different household; otherwise the same household. This will create a boolean Series.

(iii) Use cumsum on the outcome of (ii) and assign the outcome to 'new_id' column.

dep_id_cols = ['dob', 'status', 'sex', 'work_code', 'work_cat', 
               'gross_income', 'dependent_cat', 'preparer', 'city']
df['new_id'] = df[dep_id_cols].ne(df[dep_id_cols].shift()).any(axis=1).cumsum()

Output:

   id   dob   status   sex  work_code    work_cat  gross_income dependent_cat  \
0   1  1981  Married  MALE       1111  Government         15000             U   
1   2  1981  Married  MALE       1111  Government         15000             U   
2   3  1983   SINGLE  MALE       2222     Federal         20000             N   
3   4  1983   SINGLE  MALE       2222     Federal         20000             N   
4   5  1975   SINGLE  MALE       2222     Federal         12000           NaN   
5   6  1983  Married  MALE       2222     Federal         18000             U   
6   7  1981  Married  MALE       1111  Government         12000             N   
7   8  1981  Married  MALE       1111  Government         15000           NaN   

  dependent_dob  preparer     city  new_id  
0          2001  Taxpayer   BOSTON       1  
1          2002  Taxpayer   BOSTON       1  
2          2009  Taxpayer    MIAMI       2  
3          2010  Taxpayer    MIAMI       2  
4             -  Taxpayer      NYC       3  
5          2013  Taxpayer      NYC       4  
6          2005  Taxpayer  PORLAND       5  
7             -  Taxpayer   BOSTON       6  
  •  Tags:  
  • Related