How to get stats from database tables in PySpark?

Time:01-20

spark.sql("""SHOW DATABASES""").show()
spark.sql("""SHOW TABLES IN Nbadb""").show()
+---------+
|namespace|
+---------+
|  default|
|    nbadb|
+---------+

+---------+------------+-----------+
|namespace|   tableName|isTemporary|
+---------+------------+-----------+
|    nbadb|       games|      false|
|    nbadb|games_detail|      false|
|    nbadb|     players|      false|
|    nbadb|     ranking|      false|
|    nbadb|       teams|      false|
+---------+------------+-----------+

def show_catalog_stats():
    # Current attempt: only lists the databases and the tables in Nbadb
    spark.sql("""SHOW DATABASES""").show()
    spark.sql("""SHOW TABLES IN Nbadb""").show()

Here is my code. I am trying to write a function that returns the table names and their row counts (using spark.sql commands). How can I do that?

CodePudding user response:

You can use the DataFrame summary methods in PySpark for your use case. See the PySpark 3.x DataFrame summary methods or the PySpark 2.x summary methods documentation.

df = spark.table("database_name.table_name")

df.summary("count").show(truncate=False)

Update:

Getting count summaries for all tables in a Database

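def get_table_counts(dbname):
    # List every table registered in the given database
    table_names = spark.catalog.listTables(dbname)
    # Build fully qualified names such as "nbadb.games"
    table_values_extracted = [f"{table.database}.{table.name}" for table in table_names]

    # Print each table name followed by its per-column count summary
    for table_name in table_values_extracted:
        print(table_name)
        spark.table(table_name).summary("count").show(truncate=False)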
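
Note that summary("count") reports the number of non-null values per column, not a single row count. If you want the function to return the table names together with their row counts using only spark.sql commands, as the question asks, something like the sketch below might work. The function name get_table_row_counts and the choice to return a list of tuples are my own assumptions, not part of any PySpark API. It takes the SparkSession as an argument and skips temporary views.

```python
def get_table_row_counts(spark, dbname):
    """Return a list of (qualified_table_name, row_count) tuples.

    `spark` is assumed to be an active SparkSession; `dbname` is assumed
    to be a trusted identifier, since it is interpolated into the SQL text.
    """
    results = []
    # SHOW TABLES returns one row per table, with tableName and isTemporary columns
    for row in spark.sql(f"SHOW TABLES IN {dbname}").collect():
        if row["isTemporary"]:
            continue  # temporary views do not belong to the database
        full_name = f"{dbname}.{row['tableName']}"
        # COUNT(*) counts rows, unlike summary("count"), which counts non-null values per column
        count_row = spark.sql(f"SELECT COUNT(*) AS cnt FROM {full_name}").collect()[0]
        results.append((full_name, count_row["cnt"]))
    return results
```

You would call it as get_table_row_counts(spark, "nbadb"). If you prefer a DataFrame over a Python list, you can pass the result to spark.createDataFrame(counts, ["tableName", "rowCount"]) and call .show() on it. Keep in mind that COUNT(*) scans each table, so this can be slow on large databases.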