Load spark bucketed table from disk previously written via saveAsTable-CodePudding

Version: DBR 8.4 | Spark 3.1.2

Spark allows me to create a bucketed hive table and save it to a location of my choosing.

df_data_bucketed = (df_data.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
    .saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)

I have verified that this saves the table data to my specified path (in my case, blob storage).

In the future, the table 'data_bucketed' might wiped from my spark catalog, or mapped to something else, and I'll want to "recreate it" using the data that's been previously written to blob, but I can find no way to load a pre-existing, already bucketed spark table.

The only thing that appears to work is

df_data_bucketed = (spark.read.format("parquet").load(bucketed_path)
    .write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
    .saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)

Which seems non-sensical, because it's essentially loading the data from disk and unnecessarily overwriting it with the exact same data just to take advantage of the buckets. (It's also very slow due to the size of this data)

CodePudding user response：

You can use spark SQL to create that table in your catalog spark.sql("""CREATE TABLE IF NOT EXISTS tbl...""") following this you can tell spark to rediscover data by running spark.sql("MSCK REPAIR TABLE tbl")

CodePudding user response：

I found the answer at https://www.programmerall.com/article/3196638561/

Read from the saved Parquet file If you want to use historically saved data, you can't use the above method, nor can you use spark.read.parquet() like reading regular files. The data read in this way does not carry bucket information. The correct way is to use the CREATE TABLE statement. For details, refer to https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING data_source
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

Examples are as follows:

spark.sql(
  """
    |CREATE TABLE bucketed
    | (name string)
    |  USING PARQUET
    |  CLUSTERED BY (name) INTO 10 BUCKETS
    |  LOCATION '/path/to'
    |""".stripMargin)