I'm merging 2 pretty large data frames, the shape of RD_4ML is (97058, 24) while the shape of NewDF is (104047, 3). They share a common column called 'personUID', below is the merge code I used.
Final_DF = RD_4ML.merge(NewDF, how='left', on='personUID')
Final_DF.fillna('none', inplace=True)
Final_DF.sample()
DF sample output:
|personUID| |code| |Death| |diagnosis_code_type| |lr|
|abc123| |ICD10| |1| |none| |none|
Essentially the columns from RD_4ML populate while the 2 columns from NewDF return "none" values. Does anyone know how to solve an error like this?
CodePudding user response:
I think the 'personUID' column does not match in the two dataframe. Ensure that they have the same data type.
CodePudding user response:
Merge with how='left' takes every entry from the left dataframe and tries to find a corresponding matching id in the right dataframe. For all nonmatching ones, it will fill in nans for the columns coming from the right frame. In SQL that is called a left join. As an example you can have a look at this here
df1 = pd.DataFrame({"uid":range(4), "values": range(4)})
df2 = pd.DataFrame({"uid":range(5, 9), "values2": range(4)})
df1.merge(df2, how="left", on='uid')
# OUTPUT
uid values values2
0 0 0 NaN
1 1 1 NaN
2 2 2 NaN
3 3 3 NaN
here yous see all uids from the left dataframe end up in the merged dataframe and as no matching entry was found, the column from the right dataframe is set to NaN.
If your goal is, to end up with only those that have a match, you can change from "left" to "inner". For more information about that, just have a look at the great pandas docs.
