Is there an easy way to convert the below sample json into Pyspark dataframe ?
Input :
{
"user": {
"1": {
"name": "Joe",
"age": 28
},
"2" :{
"name": "Chris",
"age": 14
},
"3" :{
"name": "Hally",
"age": 21
}
}
}
Expected output (dataframe):
root
|-- age: long (nullable = true)
|-- id: long (nullable = true)
|-- name: string (nullable = true)
--- --- -----
|age|id |name |
--- --- -----
|28 |1 |Joe |
|14 |2 |Chris|
|21 |3 |Hally|
--- --- -----
CodePudding user response:
You can just chain with DataFrame and apply Series
out = pd.DataFrame(input).user.apply(pd.Series).rename_axis('id').reset_index()
Out[110]:
id name age
0 1 Joe 28
1 2 Chris 14
2 3 Hally 21
CodePudding user response:
Read the JSON into dataframe, get the list of user IDs then create an array and explode it into multiple rows like this:
from pyspark.sql import functions as F
json_example = '{"user":{"1":{"name":"Joe","age":28},"2":{"name":"Chris","age":14},"3":{"name":"Hally","age":21}}}'
df = spark.read.json(spark.sparkContext.parallelize([json_example]))
result = df.select(F.array(*[
F.struct(F.lit(i).alias("id"), F.col(f"user.{i}.name"), F.col(f"user.{i}.age"))
for i in df.select("user.*").columns
]).alias("users")).selectExpr("inline(users)")
result.show()
# --- ----- ---
#| id| name|age|
# --- ----- ---
#| 1| Joe| 28|
#| 2|Chris| 14|
#| 3|Hally| 21|
# --- ----- ---
Or simply using stack expression:
from itertools import chain
df1 = df.select("user.*")
result = df1.selectExpr(
f"stack({len(df1.columns)}," ','.join(chain(*[(f"`{c}`", c) for c in df1.columns])) ")"
).selectExpr("col1 as id", "col0.*")
