Pandas - Generate ID based on row values AND proximity of rows-CodePudding

id	dob	status	sex	work_code	work_cat	gross_income	dependent_cat	dependent_dob	preparer	city	new_id
1	1981	Married	MALE	1111	Government	15000	U	2001	Taxpayer	BOSTON	1
2	1981	Married	MALE	1111	Government	15000	U	2002	Taxpayer	BOSTON	1
3	1983	SINGLE	MALE	2222	Federal	20000	N	2009	Taxpayer	MIAMI	2
4	1983	SINGLE	MALE	2222	Federal	20000	N	2010	Taxpayer	MIAMI	2
5	1975	SINGLE	MALE	2222	Federal	12000	N/A	-	Taxpayer	NYC	3
6	1983	Married	MALE	2222	Federal	18000	U	2013	Taxpayer	NYC	4
7	1981	Married	MALE	1111	Government	12000	N	2005	Taxpayer	PORLAND	5
8	1981	Married	MALE	1111	Government	15000	N/A	-	Taxpayer	BOSTON	6

About the data

I've a .txt dataset with 1.6m rows that looks like this, sans the new_id column which I want to generate. Each row is a taxfiller and their dependent (unless dependent_cat is N/A). When a row is followed by an identical one (sans the two dependent_X columns), it is assumed that this is in fact the same household, but a different dependent within the household (and thus, I wish to generate new_id = 1 for both, so I can later collapse the dataset such that each row is a household, and there is a column that counts the dependents and whatnot).

What I've done

I have in fact generated household ID's in the real dataset, but I did not manage to incorporate the 'row proximity'. Turns out its quite important - got too many cases like the example of #1,2,8 being bunched up into the same household when they shouldn't be. Got too many 0 income households with a count of >10 dependents due to this.

What I propose

I've tried to think this through, and I came up with:

looping through each row, assigning a new ID if dependent_cat is N/A, otherwise, verify if the previous row is identical in all columns (except dependent_cat and dependent_dob) if so, copy its new_id otherwise, new_id = last assigned id 1

Why my proposal sucks - looking for alternatives

However I know that looping through 1.6m dataframe rows is not very pythonic. I might try it, but relying on it would be continuing my lazy approach to Python.

Is there any way that hashing could help me here? I've only found hashing which would make id #1 and id #8 end up with the same new_id - leaving me at my current status. I'd need hashing to 'reset' when it passes id #2, such that it doesn't assign id #8 their same new_id. I would also prefer, but not require, new_id to be cardinal numbers as I've shown (such that max new_id, in the end, is the number of households in my dataset).

CodePudding user response：

(i) Create a list from the column names that are relevant to identify households (basically, dropped "id", "dependent_dob" since one is unique for every entry and the other is the personal info of the dependent).

(ii) use shift to shift index by 1 and see if consecutive rows match for the columns identified in (i). The idea is, if there is any difference, then it's a different household; otherwise the same household. This will create a boolean Series.

(iii) Use cumsum on the outcome of (ii) and assign the outcome to 'new_id' column.

dep_id_cols = ['dob', 'status', 'sex', 'work_code', 'work_cat', 
               'gross_income', 'dependent_cat', 'preparer', 'city']
df['new_id'] = df[dep_id_cols].ne(df[dep_id_cols].shift()).any(axis=1).cumsum()

Output:

   id   dob   status   sex  work_code    work_cat  gross_income dependent_cat  \
0   1  1981  Married  MALE       1111  Government         15000             U   
1   2  1981  Married  MALE       1111  Government         15000             U   
2   3  1983   SINGLE  MALE       2222     Federal         20000             N   
3   4  1983   SINGLE  MALE       2222     Federal         20000             N   
4   5  1975   SINGLE  MALE       2222     Federal         12000           NaN   
5   6  1983  Married  MALE       2222     Federal         18000             U   
6   7  1981  Married  MALE       1111  Government         12000             N   
7   8  1981  Married  MALE       1111  Government         15000           NaN   

  dependent_dob  preparer     city  new_id  
0          2001  Taxpayer   BOSTON       1  
1          2002  Taxpayer   BOSTON       1  
2          2009  Taxpayer    MIAMI       2  
3          2010  Taxpayer    MIAMI       2  
4             -  Taxpayer      NYC       3  
5          2013  Taxpayer      NYC       4  
6          2005  Taxpayer  PORLAND       5  
7             -  Taxpayer   BOSTON       6