I'm new to PySpark and not sure if there's an easy way to do this.
I have a DataFrame of people's interests, for example:
| name | interest |
|---|---|
| A | gym |
| A | food |
| A | games |
| B | games |
From this DataFrame, I would like to create a new one like the following:
| name | interests |
|---|---|
| A | gym;food;games |
| B | games |
Can someone help with this? Sorry in advance if I didn't explain the question clearly enough.
CodePudding user response:
You can use concat_ws and collect_list from pyspark.sql.functions:
from pyspark.sql import functions as F
df.groupBy("name").agg(
    F.concat_ws(";", F.collect_list("interest")).alias("interest")
).show(truncate=False)
prints:
+----+--------------+
|name|interest      |
+----+--------------+
|A   |gym;food;games|
|B   |games         |
+----+--------------+
Remember to assign the result to a new DataFrame. Do the assignment before calling .show(), since show() only prints the rows and returns None.
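concat_ws: Concatenates multiple input string columns together into a single string column, using the given separator. It also accepts an array-of-strings column, which is how it is used here.
collect_list: Aggregate function that returns a list of objects with duplicates.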
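To make the two functions more concrete, here is a plain-Python sketch of what the aggregation computes on the sample data. It does not use Spark at all; `join_interests` and `rows` are hypothetical names made up for illustration, and it only mirrors the logic of groupBy + collect_list + concat_ws.

```python
from collections import defaultdict

# Sample rows from the question, as (name, interest) pairs.
rows = [("A", "gym"), ("A", "food"), ("A", "games"), ("B", "games")]


def join_interests(rows, sep=";"):
    """Plain-Python analogue of groupBy("name") + collect_list + concat_ws."""
    grouped = defaultdict(list)
    for name, interest in rows:
        # Like collect_list: every value is kept, duplicates included.
        grouped[name].append(interest)
    # Like concat_ws: join each group's values with the separator.
    return {name: sep.join(values) for name, values in grouped.items()}


result = join_interests(rows)
print(result)  # {'A': 'gym;food;games', 'B': 'games'}
```

One difference worth keeping in mind: this sketch keeps the input order, while in Spark the order of the elements returned by collect_list is not guaranteed once the data has been shuffled, so the joined string may come out in a different order on a real cluster.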
CodePudding user response:
schema = X.schema
X_pd = X.toPandas()
_X = spark.createDataFrame(X_pd,schema=schema)
del X_pd
