I have a dataframe, nsdf, from which I would like to take a 5% sample. nsdf looks something like this:
col1 8 7 7 8 7 8 8 7 (... and so on)
I sample nsdf like so:
sdf = nsdf.sample(0.05)
I would then like to remove the rows in sdf from nsdf. Now, here I would think I could use nsdf.subtract(sdf), but that would remove all rows in nsdf that match any row from sdf. For example, if sdf contained
col1 7 8
Then every row in nsdf would be removed, as they are all either a 7 or an 8. Is there a way to remove only the number of 7's/8's (or whatever else) that appears in sdf? More specifically, in this example I would like to end up with an nsdf that contains the same data but one 7 fewer and one 8 fewer.
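Answer:
subtract removes every instance of a row from the left dataframe if that row appears anywhere in the right dataframe, and it also de-duplicates the result (it is equivalent to EXCEPT DISTINCT in SQL). What you are looking for is exceptAll (available since Spark 2.4), which is equivalent to EXCEPT ALL: it removes only as many copies of each row as appear in the right dataframe and keeps the remaining duplicates.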
Example:
Data Setup
df = spark.createDataFrame([(7,), (8,), (7,), (8,)], ("col1", ))
Scenario 1:
df1 = spark.createDataFrame([(7,), (8,)], ("col1", ))
df.exceptAll(df1).show()
Output
+----+
|col1|
+----+
|   7|
|   8|
+----+
Scenario 2:
df2 = spark.createDataFrame([(7,), (7,), (8,)], ("col1", ))
df.exceptAll(df2).show()
Output
+----+
|col1|
+----+
|   8|
+----+
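To see why the two scenarios above come out the way they do, here is a minimal sketch of the same semantics in plain Python, with no Spark session required. It is only a model of the behavior, not how Spark implements it. The helper names except_all and except_distinct and the list nsdf_rows are made up for illustration; rows are represented as bare integers rather than Row objects.

```python
from collections import Counter

# Hypothetical helpers that model the semantics only; not part of PySpark.

def except_all(left, right):
    """Multiset difference: remove one left row per matching right row (like exceptAll)."""
    remaining = Counter(left) - Counter(right)  # Counter subtraction drops counts <= 0
    return sorted(remaining.elements())

def except_distinct(left, right):
    """Set difference: remove every left row that appears in right (like subtract)."""
    return sorted(set(left) - set(right))

nsdf_rows = [7, 8, 7, 8]  # same data as df in the example

print(except_all(nsdf_rows, [7, 8]))       # Scenario 1 -> [7, 8]
print(except_all(nsdf_rows, [7, 7, 8]))    # Scenario 2 -> [8]
print(except_distinct(nsdf_rows, [7, 8]))  # what subtract would give -> []
```

Applied to the question, this suggests that nsdf.exceptAll(sdf) should leave nsdf with exactly one 7 fewer and one 8 fewer when sdf contains one 7 and one 8.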
