Retention and archival policy on Hive data

We have an AWS EMR cluster running Hive, with the metastore backed by Aurora and the data stored in S3. Several programs create databases and tables in Hive and populate them with data.

After a while (say, one year), these databases are no longer needed, and we want to delete them automatically after a set period. The usual approach is a cron job that runs every month or so, looks up databases older than one year in an internal metadata table, and programmatically runs the Hive queries that drop them. This has drawbacks, though: for example, manually created tables are not covered.

Is there a built-in Hive feature that does this?

CodePudding user response：

Hive is essentially a metadata layer that defines how data should be interpreted. For external tables it does not manage the underlying data at all; only managed tables have their files deleted when they are dropped. (This is a major difference between Hive and a conventional database, and it is why a single Hive instance can use multiple storage backends, such as HDFS and S3.)

I'm going to guess you are using an S3 bucket for your data, so you likely want to look into S3 lifecycle rules that expire objects. That does exactly what you want: it deletes data after a set period of time, and it will not disrupt Hive.
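To make the expiring-objects idea concrete, here is a minimal Python sketch that builds an S3 lifecycle configuration with a single expiration rule. The prefix warehouse/bills/, the rule ID, the bucket name in the comment, and the 365-day period are all assumptions for illustration; the dict follows the shape that boto3's put_bucket_lifecycle_configuration accepts, but the sketch itself only builds and prints it.

```python
import json


def expiration_rule(prefix: str, days: int, rule_id: str = "expire-hive-data") -> dict:
    """Build an S3 lifecycle configuration that expires objects under `prefix`.

    `prefix` and `rule_id` are hypothetical values chosen by the caller.
    """
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},    # only objects under this key prefix
                "Status": "Enabled",
                "Expiration": {"Days": days},    # delete objects this many days after creation
            }
        ]
    }


if __name__ == "__main__":
    config = expiration_rule("warehouse/bills/", 365)
    print(json.dumps(config, indent=2))
    # Assuming boto3 is installed and credentials are configured, the dict
    # could be applied to a (hypothetical) bucket like this:
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="my-hive-bucket", LifecycleConfiguration=config)
```

Scoping the rule to one table's prefix rather than the whole bucket keeps the expiration from touching data that other tables still need.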

If you are using partitions, you may want to do some additional cleanup, because the metastore will still list partitions whose files have expired.

MSCK REPAIR TABLE will help maintain the partitions in Hive, but it is really slow on S3 and can periodically time out. YMMV.

It's better to drop the partitions explicitly:

ALTER TABLE bills DROP IF EXISTS PARTITION (mydate='2022-02') PURGE;
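If you want to script this, a scheduled job could generate the DROP statements for every partition older than a cutoff. The following is a rough Python sketch, assuming partition values are strings in YYYY-MM form as in the example above; the function name, the keep_months parameter, and the way you obtain the partition list (for instance from SHOW PARTITIONS) are all assumptions for illustration.

```python
from datetime import date


def drop_statements(table: str, column: str, partitions: list[str],
                    today: date, keep_months: int) -> list[str]:
    """Return ALTER TABLE ... DROP PARTITION statements for old partitions.

    Partition values must look like 'YYYY-MM'. A partition is dropped when
    its month is more than `keep_months` months before `today`'s month.
    """
    # Represent each month as a single integer so months compare easily.
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    statements = []
    for value in sorted(partitions):
        year_s, month_s = value.split("-")
        year, month = int(year_s), int(month_s)
        if not 1 <= month <= 12:
            raise ValueError(f"invalid partition value: {value!r}")
        if year * 12 + (month - 1) < cutoff:
            statements.append(
                f"ALTER TABLE {table} DROP IF EXISTS "
                f"PARTITION ({column}='{value}') PURGE;"
            )
    return statements


if __name__ == "__main__":
    existing = ["2022-01", "2022-02", "2022-03", "2023-01"]
    for stmt in drop_statements("bills", "mydate", existing,
                                today=date(2023, 3, 15), keep_months=12):
        print(stmt)
```

The sketch only produces the statements; running them through beeline or another Hive client is left to the job that calls it.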

CodePudding user response：

In Hive you can implement partition retention (since Hive 3.1.0).

For example, to drop partitions and their data after 7 days:

ALTER TABLE employees SET TBLPROPERTIES ('partition.retention.period'='7d');