I can count how many ids are in L_df but not in A_df (comparing their 'id' columns) with numpy:
missing_data = np.isin(L_df['id'], A_df['id'], invert=True).sum()
What is the equivalent code in PySpark to count the number of missing entries?
CodePudding user response:
You can use an anti join. Quoting the Spark SQL documentation on joins:
Anti Join: An anti join returns values from the left relation that has no match with the right. It is also referred to as a left anti join.
Assuming L_df and A_df are loaded as Spark DataFrames, you can use DataFrame.join with how='anti' as follows:
L_df.join(A_df, on='id', how='anti').count()
