I have a DataFrame in which some columns are of StructType or ArrayType. I want to cast every IntegerType column to DoubleType. I found some solutions to this problem; for example, this answer does something similar to what I want. The problem is that it does not change the data types of fields nested inside a StructType or ArrayType column.
For example, I have a DataFrame with the following schema:
 |-- carCategories: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- payerId: integer (nullable = true)
 |-- percentage: integer (nullable = true)
 |-- plateNumberStatus: string (nullable = true)
 |-- ratio: struct (nullable = true)
 |    |-- max: integer (nullable = true)
 |    |-- min: integer (nullable = true)
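After running the script below:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField}

// Only top-level fields are matched here; nested fields are never inspected.
val doubleSchema = df.schema.fields.map {
  case StructField(name, _: IntegerType, _, _) => col(name).cast(DoubleType)
  case f => col(f.name)
}
df.select(doubleSchema: _*).printSchema
The resulting schema is: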
 |-- carCategories: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- payerId: double (nullable = true)
 |-- percentage: double (nullable = true)
 |-- plateNumberStatus: string (nullable = true)
 |-- ratio: struct (nullable = true)
 |    |-- max: integer (nullable = true)
 |    |-- min: integer (nullable = true)
As you can see, the top-level integer columns are converted to DoubleType, but the integer fields inside the ArrayType and StructType columns are not.
I want the final schema to look like this:
 |-- carCategories: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- payerId: double (nullable = true)
 |-- percentage: double (nullable = true)
 |-- plateNumberStatus: string (nullable = true)
 |-- ratio: struct (nullable = true)
 |    |-- max: double (nullable = true)
 |    |-- min: double (nullable = true)
How can I achieve such a thing?
Thank you in advance
CodePudding user response:
You can add case clauses to handle ArrayType and StructType like this:
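import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

def castIntToDouble(schema: StructType): Seq[Column] = {
  schema.fields.map { f =>
    f.dataType match {
      // Top-level integer column: cast directly.
      case IntegerType => col(f.name).cast(DoubleType)
      // Struct column: rewrite ":int" to ":double" in the type string and cast to it.
      case StructType(_) =>
        col(f.name).cast(
          f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
        )
      case dt: ArrayType =>
        dt.elementType match {
          // Array of integers: cast the element type.
          case IntegerType => col(f.name).cast(ArrayType(DoubleType))
          // Array of structs: same type-string rewrite as for a struct column.
          case StructType(_) =>
            col(f.name).cast(
              f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
            )
          case _ => col(f.name)
        }
      // Any other type is left unchanged.
      case _ => col(f.name)
    }
  }
}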
The function casts using a DDL-formatted type string when the column is a StructType or an array of structs. For example, to cast the struct column ratio, which has type struct<max:int,min:int>, without recreating the whole struct, you would do:
df.withColumn("ratio", col("ratio").cast("struct<max:double,min:double>"))
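To make the string rewrite concrete, here is a small sketch of what the type string looks like before and after the replacement. The names ratioType and nestedType are hypothetical, introduced only for illustration. It also shows one case the replacement does not reach: an integer array inside a struct, because there the element type follows `<` rather than `:`.

```scala
import org.apache.spark.sql.types._

// Hypothetical stand-in for the type of the `ratio` column.
val ratioType = StructType(Seq(
  StructField("max", IntegerType),
  StructField("min", IntegerType)
))

val before = ratioType.simpleString
// before == "struct<max:int,min:int>"

val after = before.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
// after == "struct<max:double,min:double>"

// Hypothetical struct holding an array of ints: the element type is
// written as "array<int>", so there is no ":int" for the replacement to match.
val nestedType = StructType(Seq(StructField("ids", ArrayType(IntegerType))))
val nestedAfter = nestedType.simpleString
  .replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
// nestedAfter == "struct<ids:array<int>>" (unchanged)
```

So the approach fits schemas like the one in the question, where integers appear directly as struct fields or as elements of a top-level array.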
Now applying this to your input example:
import spark.implicits._ // for toDF; assumes a SparkSession named spark, as in spark-shell

val df = (
  Seq((Seq(1, 2, 3), 34, 87, "pending", (65, 22)))
    .toDF("carCategories", "payerId", "percentage", "plateNumberStatus", "ratio")
    // Rename the tuple fields _1 and _2 to max and min.
    .withColumn("ratio", col("ratio").cast("struct<max:int,min:int>"))
)
df.select(castIntToDouble(df.schema): _*).printSchema
//root
// |-- carCategories: array (nullable = true)
// |    |-- element: double (containsNull = true)
// |-- payerId: double (nullable = false)
// |-- percentage: double (nullable = false)
// |-- plateNumberStatus: string (nullable = true)
// |-- ratio: struct (nullable = true)
// |    |-- max: double (nullable = true)
// |    |-- min: double (nullable = true)
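If the schema nests more deeply (for example an integer array inside a struct), one alternative is to rebuild the target DataType recursively and cast each column to it, instead of editing the type string. This is only a sketch under stated assumptions, not part of the answer above: the names intToDouble and castIntToDoubleDeep are hypothetical, only struct and array nesting is handled (map types are passed through untouched), and column names are assumed not to contain dots.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical helper: walk a DataType and replace every IntegerType
// with DoubleType, keeping names and nullability flags as they are.
def intToDouble(dt: DataType): DataType = dt match {
  case IntegerType => DoubleType
  case ArrayType(elementType, containsNull) =>
    ArrayType(intToDouble(elementType), containsNull)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = intToDouble(f.dataType))))
  case other => other // strings, maps, etc. are left unchanged
}

// Hypothetical helper: cast every top-level column to its rewritten type.
// Columns whose type contains no integers are cast to their own type.
def castIntToDoubleDeep(df: DataFrame): DataFrame =
  df.select(df.schema.fields.map { f =>
    col(f.name).cast(intToDouble(f.dataType)).as(f.name)
  }: _*)
```

Usage would be castIntToDoubleDeep(df).printSchema. Because the target type is built from the schema itself rather than from its printed form, the result does not depend on how Spark renders type names.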
