I have a dataframe, nsdf, which I would like to sample 5% of. nsdf looks something like this:

col1
8
7
7
8
7
8
8
7
(... and so on)

I sample nsdf like so:

sdf = nsdf.sample(0.05)

I would then like to remove the rows in sdf from nsdf. Now, here I would think I could use nsdf.subtract(sdf), but that would remove all rows in nsdf that match any row from sdf. For example, if sdf contained

col1
7
8

Then every row in nsdf would be removed, as they are all either a 7 or an 8. Is there a way to remove only the number of 7's/8's (or whatever else) that appears in sdf? More specifically, in this example I would like to end up with an nsdf that contains the same data but one 7 fewer and one 8 fewer.

Answer:

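subtract removes every row of the left dataframe that also appears in the right dataframe, no matter how many times it occurs on either side. It behaves like EXCEPT DISTINCT in SQL, so the result is also de-duplicated. What you are looking for is exceptAll (available since Spark 2.4), which behaves like EXCEPT ALL: each row in the right dataframe removes only one matching row from the left dataframe, and remaining duplicates are kept.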
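To make the difference concrete without needing a Spark session, here is a minimal plain-Python sketch of the two semantics. The helper names subtract_like and except_all_like are made up for illustration and are not part of any Spark API; they only mimic the row-counting behaviour on ordinary lists.

```python
from collections import Counter


def subtract_like(left, right):
    """Mimic subtract / EXCEPT DISTINCT: set difference, duplicates dropped."""
    return sorted(set(left) - set(right))


def except_all_like(left, right):
    """Mimic exceptAll / EXCEPT ALL: multiset difference, duplicates kept."""
    # Counter subtraction keeps only the positive counts.
    return sorted((Counter(left) - Counter(right)).elements())


left = [7, 8, 7, 8]

only_distinct = subtract_like(left, [7, 8])
one_of_each_removed = except_all_like(left, [7, 8])
two_sevens_removed = except_all_like(left, [7, 7, 8])

print(only_distinct, one_of_each_removed, two_sevens_removed)
```

The set-based version loses every 7 and 8 as soon as one of each appears on the right, which is the behaviour described in the question. The Counter-based version removes only as many copies as the right-hand side contains.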

Example:

Data Setup

df = spark.createDataFrame([(7,), (8,), (7,), (8,)], ("col1", ))

Scenario 1:


df1 = spark.createDataFrame([(7,), (8,)], ("col1", ))

df.exceptAll(df1).show()

Output

+----+
|col1|
+----+
|   7|
|   8|
+----+

Scenario 2:

df2 = spark.createDataFrame([(7,), (7,), (8,)], ("col1", ))

df.exceptAll(df2).show()

Output

+----+
|col1|
+----+
|   8|
+----+