Home > Mobile >  Load spark data frame from a pretty-printed text file
Load spark data frame from a pretty-printed text file

Time:02-02

I have a thousands of json files, and the content within each file is similar to the following:

{
"1" : { "key":"key1", "val":"val1" },
"2" : { "key":"key2", "val":"val2" },
"3" : { "key":"key3", "val":"val3" }
.
.
.
}

What is the proper way to load those files into a spark dataframe so as a result i will have

 ------ ---------------------------------- 
|id    | val.                             |
 ------ ---------------------------------- 
|1     | { "key":"key1", "val":"val1" }   |
|2     | { "key":"key2", "val":"val2" }   |
|3     | { "key":"key2", "val":"val2" }   |
 ------ ---------------------------------- 

I 'v tried to load the json as multiline

val df= spark.read.option("multiline","true").json(small_file)

but the result was one row and three columns

 ------------------------ ------------------------ ---------------- 
|1                       |2                       |3               |
 ------------------------ ------------------------ ---------------- 
|{ "key":"key1", "val..} ||{"key":"key2", "val..} |{"key":"key3"...|
 ------------------------ ------------------------ ---------------- 

What i did also was loading the files into a Map

 val keys = df.columns
 val values = df.collect().last.toSeq
 val myMap = keys.zip(values).toMap
 
 println(myMap)
 // output
 // Map(1-> [key1, val1], 2-> [key2, val2], 3-> [key3, val3])

But i did not figure how to create a dataframe from this Map

CodePudding user response:

That is a multiline JSON file, you can read such file specifying the multiline option like this:

val spark = SparkSession
  .builder()
  .appName("JSONReader")
  .master("local")
  .getOrCreate()

val multiline_df = spark.read.option("multiline","true")
  .json("multiline-file.json")

multiline_df.show(false)

The result will be something like this:

[info]  ------------ ------------ ------------ 
[info] |1           |2           |3           |
[info]  ------------ ------------ ------------ 
[info] |[key1, val1]|[key2, val2]|[key3, val3]|
[info]  ------------ ------------ ------------ 
[info] 

CodePudding user response:

I was able to achieve the result using the following steps:

As mentioned in the question , the result df after loading will look like

 ------------------------ ------------------------ ---------------- 
|1                       |2                       |3               |
 ------------------------ ------------------------ ---------------- 
|{ "key":"key1", "val..} ||{"key":"key2", "val..} |{"key":"key3"...|
 ------------------------ ------------------------ ---------------- 

1- Cast the columns to string

val cols=df.columns.map(x => col(s"${x}").cast("string").alias(s"${x}"))

2- Create a columns string

val str_cols=df.columns.mkString(",")

3- Create a new df using the casted values in step1

val df1 = df.withColumn("temp",
                        explode(arrays_zip(array(cols:_*),
                        split(lit(str_cols),","))))
                        .select("temp.*")
                        .toDF("vals","index")
  •  Tags:  
  • Related