Split distinct values in a column into multiple columns

I want to create a DataFrame that breaks down the genres of movies into separate columns, with each individual genre column having a value of 1 for movies that are of that genre.

from this movie dataframe

to this dataframe with distinct genre column created, 1 for true and 0 for false

I'm using Databricks PySpark. Many thanks!

CodePudding user response：

I would first collect the unique values of the DataFrame column into a list, and then iterate over that list. The DataFrame is called df here.

unique_vals = df.select('genres').distinct().rdd.flatMap(lambda x: x).collect()

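Now let's iterate over the list, adding one indicator column per value:

from pyspark.sql import functions as F

df1 = df
for i in unique_vals:
    # 1 where the row's genre equals this value, 0 otherwise
    df1 = df1.withColumn(i, F.when(F.col('genres') == i, 1).otherwise(0))
df1.show()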

CodePudding user response：

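I think this would work. Group by the column that identifies each movie ('title' below is a placeholder for your own key column); calling groupby() with no arguments would collapse the whole DataFrame into a single row.

from pyspark.sql.functions import lit

df.groupby('title').pivot('genres').agg(lit(1)).fillna(0)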