Home > Software design >  Pyspark : Convert nested JSON struct to pyspark dataframe
Pyspark : Convert nested JSON struct to pyspark dataframe

Time:02-08

Is there an easy way to convert the below sample json into Pyspark dataframe ?

Input :

{
"user": {
  "1": {
    "name": "Joe",
    "age": 28
  },
  "2" :{
    "name": "Chris",
    "age": 14
  },
  "3" :{
    "name": "Hally",
    "age": 21
  }
}
}

Expected output (dataframe):

root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

 --- --- ----- 
|age|id |name |
 --- --- ----- 
|28 |1  |Joe  |
|14 |2  |Chris|
|21 |3  |Hally|
 --- --- ----- 

CodePudding user response:

You can just chain with DataFrame and apply Series

out = pd.DataFrame(input).user.apply(pd.Series).rename_axis('id').reset_index()
Out[110]: 
  id   name  age
0  1    Joe   28
1  2  Chris   14
2  3  Hally   21

CodePudding user response:

Read the JSON into dataframe, get the list of user IDs then create an array and explode it into multiple rows like this:

from pyspark.sql import functions as F

json_example = '{"user":{"1":{"name":"Joe","age":28},"2":{"name":"Chris","age":14},"3":{"name":"Hally","age":21}}}'
df = spark.read.json(spark.sparkContext.parallelize([json_example]))

result = df.select(F.array(*[
    F.struct(F.lit(i).alias("id"), F.col(f"user.{i}.name"), F.col(f"user.{i}.age"))
    for i in df.select("user.*").columns
]).alias("users")).selectExpr("inline(users)")

result.show()
# --- ----- --- 
#| id| name|age|
# --- ----- --- 
#|  1|  Joe| 28|
#|  2|Chris| 14|
#|  3|Hally| 21|
# --- ----- --- 

Or simply using stack expression:

from itertools import chain

df1 = df.select("user.*")

result = df1.selectExpr(
    f"stack({len(df1.columns)},"   ','.join(chain(*[(f"`{c}`", c) for c in df1.columns]))   ")"
).selectExpr("col1 as id", "col0.*")
  •  Tags:  
  • Related