How to read a csv file in Spark which has a structure like:
id,name,address
1,"ashu","building","street","area","city","state","pin"
When using a reader:
val df = spark.read
.option("header",true)
.csv("input/input1.csv")
I am getting record till the third value in CSV.
--- ---- --------
| id|name| address|
--- ---- --------
| 1|ashu|building|
--- ---- --------
How to ask Spark to read all the values starting from third value till the last one in single dataframe column "address" like:
--- ---- -----------------------------------------------
| id|name| address |
--- ---- -----------------------------------------------
| 1|ashu|"building","street","area","city","state","pin"|
--- ---- -----------------------------------------------
CodePudding user response:
I'm making my answer fit your requirements to use CSV. This is the least painful way to do what you want to do.
Modify your CSV file so that it use "|" to split fields instead of ",". This will allow you to have ',' inside your columns.
id,name,address
1|"ashu"|"building","street","area","city","state","pin"
Modify you code:
val df = spark.read
.option("header",true)
.option("delimiter", '|')
.csv("input/input1.csv")
CodePudding user response:
If you can fix your input files to use another delimiter character than you should do that.
However, if you don't have that possibility, you can still read the file without header and specify a custom schema. Then, concatenate the 6 address columns to get the desired dataframe:
import org.apache.spark.sql.types._
val schema = StructType(
Array(
StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("address1", StringType, true),
StructField("address2", StringType, true),
StructField("address3", StringType, true),
StructField("address4", StringType, true),
StructField("address5", StringType, true),
StructField("address6", StringType, true)
)
)
val input = spark.read.schema(schema).csv("input/input1.csv")
val df = input.filter("name != 'name'").withColumn(
"address",
concat_ws(", ", (1 to 6).map(n => col(s"address$n")):_*)
).select("id", "name", "address")
df.show(false)
// --- ---- ----------------------------------------
//|id |name|address |
// --- ---- ----------------------------------------
//|1 |ashu|building, street, area, city, state, pin|
// --- ---- ----------------------------------------
