| id | dob | status | sex | work_code | work_cat | gross_income | dependent_cat | dependent_dob | preparer | city | new_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1981 | Married | MALE | 1111 | Government | 15000 | U | 2001 | Taxpayer | BOSTON | 1 |
| 2 | 1981 | Married | MALE | 1111 | Government | 15000 | U | 2002 | Taxpayer | BOSTON | 1 |
| 3 | 1983 | SINGLE | MALE | 2222 | Federal | 20000 | N | 2009 | Taxpayer | MIAMI | 2 |
| 4 | 1983 | SINGLE | MALE | 2222 | Federal | 20000 | N | 2010 | Taxpayer | MIAMI | 2 |
| 5 | 1975 | SINGLE | MALE | 2222 | Federal | 12000 | N/A | - | Taxpayer | NYC | 3 |
| 6 | 1983 | Married | MALE | 2222 | Federal | 18000 | U | 2013 | Taxpayer | NYC | 4 |
| 7 | 1981 | Married | MALE | 1111 | Government | 12000 | N | 2005 | Taxpayer | PORLAND | 5 |
| 8 | 1981 | Married | MALE | 1111 | Government | 15000 | N/A | - | Taxpayer | BOSTON | 6 |
About the data
I've a .txt dataset with 1.6m rows that looks like this, sans the new_id column which I want to generate. Each row is a taxfiller and their dependent (unless dependent_cat is N/A). When a row is followed by an identical one (sans the two dependent_X columns), it is assumed that this is in fact the same household, but a different dependent within the household (and thus, I wish to generate new_id = 1 for both, so I can later collapse the dataset such that each row is a household, and there is a column that counts the dependents and whatnot).
What I've done
I have in fact generated household ID's in the real dataset, but I did not manage to incorporate the 'row proximity'. Turns out its quite important - got too many cases like the example of #1,2,8 being bunched up into the same household when they shouldn't be. Got too many 0 income households with a count of >10 dependents due to this.
What I propose
I've tried to think this through, and I came up with:
looping through each row, assigning a new ID if dependent_cat is N/A, otherwise, verify if the previous row is identical in all columns (except dependent_cat and dependent_dob) if so, copy its new_id otherwise, new_id = last assigned id 1
Why my proposal sucks - looking for alternatives
However I know that looping through 1.6m dataframe rows is not very pythonic. I might try it, but relying on it would be continuing my lazy approach to Python.
Is there any way that hashing could help me here? I've only found hashing which would make id #1 and id #8 end up with the same new_id - leaving me at my current status. I'd need hashing to 'reset' when it passes id #2, such that it doesn't assign id #8 their same new_id. I would also prefer, but not require, new_id to be cardinal numbers as I've shown (such that max new_id, in the end, is the number of households in my dataset).
Related question: Pandas - Generate Unique ID based on row values
CodePudding user response:
(i) Create a list from the column names that are relevant to identify households (basically, dropped "id", "dependent_dob" since one is unique for every entry and the other is the personal info of the dependent).
(ii) use shift to shift index by 1 and see if consecutive rows match for the columns identified in (i). The idea is, if there is any difference, then it's a different household; otherwise the same household. This will create a boolean Series.
(iii) Use cumsum on the outcome of (ii) and assign the outcome to 'new_id' column.
dep_id_cols = ['dob', 'status', 'sex', 'work_code', 'work_cat',
'gross_income', 'dependent_cat', 'preparer', 'city']
df['new_id'] = df[dep_id_cols].ne(df[dep_id_cols].shift()).any(axis=1).cumsum()
Output:
id dob status sex work_code work_cat gross_income dependent_cat \
0 1 1981 Married MALE 1111 Government 15000 U
1 2 1981 Married MALE 1111 Government 15000 U
2 3 1983 SINGLE MALE 2222 Federal 20000 N
3 4 1983 SINGLE MALE 2222 Federal 20000 N
4 5 1975 SINGLE MALE 2222 Federal 12000 NaN
5 6 1983 Married MALE 2222 Federal 18000 U
6 7 1981 Married MALE 1111 Government 12000 N
7 8 1981 Married MALE 1111 Government 15000 NaN
dependent_dob preparer city new_id
0 2001 Taxpayer BOSTON 1
1 2002 Taxpayer BOSTON 1
2 2009 Taxpayer MIAMI 2
3 2010 Taxpayer MIAMI 2
4 - Taxpayer NYC 3
5 2013 Taxpayer NYC 4
6 2005 Taxpayer PORLAND 5
7 - Taxpayer BOSTON 6
