I have a thousands of json files, and the content within each file is similar to the following:
{
"1" : { "key":"key1", "val":"val1" },
"2" : { "key":"key2", "val":"val2" },
"3" : { "key":"key3", "val":"val3" }
.
.
.
}
What is the proper way to load those files into a spark dataframe so as a result i will have
------ ----------------------------------
|id | val. |
------ ----------------------------------
|1 | { "key":"key1", "val":"val1" } |
|2 | { "key":"key2", "val":"val2" } |
|3 | { "key":"key2", "val":"val2" } |
------ ----------------------------------
I 'v tried to load the json as multiline
val df= spark.read.option("multiline","true").json(small_file)
but the result was one row and three columns
------------------------ ------------------------ ----------------
|1 |2 |3 |
------------------------ ------------------------ ----------------
|{ "key":"key1", "val..} ||{"key":"key2", "val..} |{"key":"key3"...|
------------------------ ------------------------ ----------------
What i did also was loading the files into a Map
val keys = df.columns
val values = df.collect().last.toSeq
val myMap = keys.zip(values).toMap
println(myMap)
// output
// Map(1-> [key1, val1], 2-> [key2, val2], 3-> [key3, val3])
But i did not figure how to create a dataframe from this Map
CodePudding user response:
That is a multiline JSON file, you can read such file specifying the multiline option like this:
val spark = SparkSession
.builder()
.appName("JSONReader")
.master("local")
.getOrCreate()
val multiline_df = spark.read.option("multiline","true")
.json("multiline-file.json")
multiline_df.show(false)
The result will be something like this:
[info] ------------ ------------ ------------
[info] |1 |2 |3 |
[info] ------------ ------------ ------------
[info] |[key1, val1]|[key2, val2]|[key3, val3]|
[info] ------------ ------------ ------------
[info]
CodePudding user response:
I was able to achieve the result using the following steps:
As mentioned in the question , the result df after loading will look like
------------------------ ------------------------ ----------------
|1 |2 |3 |
------------------------ ------------------------ ----------------
|{ "key":"key1", "val..} ||{"key":"key2", "val..} |{"key":"key3"...|
------------------------ ------------------------ ----------------
1- Cast the columns to string
val cols=df.columns.map(x => col(s"${x}").cast("string").alias(s"${x}"))
2- Create a columns string
val str_cols=df.columns.mkString(",")
3- Create a new df using the casted values in step1
val df1 = df.withColumn("temp",
explode(arrays_zip(array(cols:_*),
split(lit(str_cols),","))))
.select("temp.*")
.toDF("vals","index")
