I have a dataframe having dictionary with multiple lists and I would like to create a dataframe by extracting on a certain element 'Student'.
| 1 | 2 | 3 | 4 |
|---|---|---|---|
| {"Student":["Grad","School"], "Comments": "Finished Education"} | {"Student":["New"], "Comments": "Started Education", Location : ["USA", "China", "Australia"]} | {"Student": ["Middle", "School"], "ID" : ["1000", "2000"]}} | {"Student": ["Med","School"]} |
Expected output:
| Student |
|---|
| Grad, School |
| New |
| Middle, School |
| Med, School |
I have tried to read the dataframe into a dictionary, but was unable to retrieve only the 'Student' element from the dictionary.
data_dict = {}
df = df.toPandas()
for column in df.columns:
data_dict[column] = df[column].values.tolist()
Student = [data for data in data_dict.values()]
CodePudding user response:
First of all, what you have is not dictionaries and not lists. When you have a Spark dataframe, what you call a dictionary is a struct. And what you call a list is an array. Data types can be inspected with df.printSchema().
You can extract the "Student" fields from the structs, then add everything to an array in order to finally explode.
arr = F.array([F.col(f'{c}.Student') for c in df.columns])
df = df.select(F.explode(arr).alias('Student'))
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[((["Grad","School"], "Finished Education"),(['New'], "Started Education", ["USA", "China", "Australia"]),(["Middle", "School"], ["1000", "2000"]),(["Med","School"],))],
'`1` struct<Student:array<string>,Comments:string>, `2` struct<Student:array<string>,Comments:string,Location:array<string>>, `3` struct<Student:array<string>,ID:array<string>>, `4` struct<Student:array<string>>')
arr = F.array([F.col(f'{c}.Student') for c in df.columns])
df = df.select(F.explode(arr).alias('Student'))
df.show()
# ----------------
# | Student|
# ----------------
# | [Grad, School]|
# | [New]|
# |[Middle, School]|
# | [Med, School]|
# ----------------
Another working option:
to_melt = [f"`{c}`.Student" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) Student")
