How does schema inference work in spark.read.parquet?

Time:01-25

I'm trying to read a Parquet file in Spark and I have a question.

How is the type inferred when loading a parquet file with spark.read.parquet?

  • 1. From the Parquet schema: Parquet type INT32 -> Spark IntegerType
  • 2. From the data: type inferred from the actual stored values -> Spark IntegerType

Is there a fixed mapping (a dictionary) as in 1, or is the type inferred from the actual stored values as in 2?

CodePudding user response:

Spark reads the schema stored in the Parquet file and converts it to its internal representation (i.e., a StructType), so the types come from a fixed mapping as in your option 1, not from the stored values. This is a bit hard to find in the Spark docs, so I went through the code to find the mapping you are looking for:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L197-L281
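
To make the idea concrete, here is a minimal sketch of what such a mapping looks like. It is plain Python, not Spark's actual code: the names PARQUET_TO_SPARK and spark_type_for are made up for illustration, and the table covers only a small subset of the cases the converter handles (no decimals, timestamps, or nested types).

```python
# Simplified, illustrative subset of the Parquet -> Spark SQL type mapping.
# Keys are (physical type, optional logical annotation); values are Spark type names.
# PARQUET_TO_SPARK and spark_type_for are hypothetical names, not Spark APIs.
PARQUET_TO_SPARK = {
    ("BOOLEAN", None): "BooleanType",
    ("INT32", None): "IntegerType",
    ("INT32", "DATE"): "DateType",
    ("INT64", None): "LongType",
    ("FLOAT", None): "FloatType",
    ("DOUBLE", None): "DoubleType",
    ("BINARY", None): "BinaryType",
    ("BINARY", "UTF8"): "StringType",
}


def spark_type_for(physical, annotation=None):
    """Look up the Spark type from schema metadata alone; no stored values are read."""
    try:
        return PARQUET_TO_SPARK[(physical, annotation)]
    except KeyError:
        raise ValueError(
            f"type not covered by this sketch: {physical} ({annotation})"
        )


print(spark_type_for("INT32"))           # IntegerType
print(spark_type_for("BINARY", "UTF8"))  # StringType
```

The point is that the lookup depends only on the type declared in the file's schema. On a real file you can check the result with spark.read.parquet(path).printSchema(), where path is whatever file you are reading.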
