I want to create a DataFrame that breaks down the genres of movies into separate columns, with each individual genre column having a value of 1 for movies that are of that genre.
The starting point is this movie DataFrame,
and the target is this DataFrame, with a distinct column created for each genre: 1 for true and 0 for false.
I'm using Databricks PySpark. Many thanks!
CodePudding user response:
I would first collect the unique values of the DataFrame column into a list, and then iterate over that list. The DataFrame is assumed to be named df here.
unique_vals = df.select('genres').distinct().rdd.flatMap(lambda x: x).collect()
Now let's iterate over the list, adding one 0/1 column per genre:
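from pyspark.sql import functions as F

df1 = df
for i in unique_vals:
    # Add a column named after the genre: 1 where the row's genre matches, else 0
    df1 = df1.withColumn(i, F.when(F.col('genres') == i, 1).otherwise(0))
df1.show()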
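To make the target layout concrete without a Spark session, here is a minimal plain-Python sketch of what the loop above computes. The sample rows, the title column, and the helper name one_hot_genres are all made up for illustration; it also assumes each row holds exactly one genre, as the answer does.

```python
# Hypothetical sample rows as (title, genre); not taken from the original data.
rows = [
    ("Toy Story", "Animation"),
    ("Heat", "Action"),
    ("Jumanji", "Adventure"),
]


def one_hot_genres(rows):
    """Return one dict per row with a 0/1 entry for every distinct genre."""
    # Same role as the distinct() call: collect every genre once, sorted.
    unique_vals = sorted({genre for _, genre in rows})
    encoded = []
    for title, genre in rows:
        record = {"title": title, "genres": genre}
        # Same role as when(col == i, 1).otherwise(0) for each genre i.
        for g in unique_vals:
            record[g] = 1 if genre == g else 0
        encoded.append(record)
    return encoded


for record in one_hot_genres(rows):
    print(record)
```

Each record ends up with exactly one genre key set to 1, which is the shape the withColumn loop should produce in Spark.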
CodePudding user response:
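I think this would work, as long as you group by the column that identifies each movie (assumed here to be called 'title'; with an empty groupby() all rows collapse into a single row):
from pyspark.sql.functions import lit
df.groupBy('title').pivot('genres').agg(lit(1)).fillna(0)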
