spark.sql("""SHOW DATABASES""").show()
spark.sql("""SHOW TABLES IN Nbadb""").show()
+---------+
|namespace|
+---------+
|  default|
|    nbadb|
+---------+
+---------+------------+-----------+
|namespace|   tableName|isTemporary|
+---------+------------+-----------+
|    nbadb|       games|      false|
|    nbadb|games_detail|      false|
|    nbadb|     players|      false|
|    nbadb|     ranking|      false|
|    nbadb|       teams|      false|
+---------+------------+-----------+
def show_catalog_stats():
    spark.sql("""SHOW DATABASES""").show()
    spark.sql("""SHOW TABLES IN Nbadb""").show()
Here is my code. I am trying to write a function that returns the table names and the row count of each table (using spark.sql commands). How can I do that?
CodePudding user response:
You can use the DataFrame summary methods in PySpark for your use case. Follow the PySpark 3.x DataFrame summary methods or PySpark 2.x summary methods documentation.
df = spark.table("database_name.table_name")
df.summary("count").show(truncate=False)
Update:
Getting count summaries for all tables in a Database
def get_table_counts(dbname):
    table_names = spark.catalog.listTables(dbname)
    table_values_extracted = [f"{table.database}.{table.name}" for table in table_names]
    for table_name in table_values_extracted:
        print(table_name)
        spark.table(table_name).summary("count").show(truncate=False)
    return 0
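The function above prints a per-column count for each table but returns 0, so the caller gets nothing back. If the goal is to return the table names together with their row counts using only spark.sql commands, as the question asks, something along these lines might work. This is a minimal sketch: the name get_table_row_counts and the choice to pass the session in as a parameter are assumptions made for illustration, and it relies on SHOW TABLES exposing the tableName and isTemporary columns seen in the output above.

```python
def get_table_row_counts(spark, dbname):
    """Return a dict mapping each table name in dbname to its row count.

    Hypothetical helper: `spark` is an existing SparkSession passed in
    by the caller, `dbname` is the database to inspect.
    """
    counts = {}
    # SHOW TABLES returns one row per table; the tableName and isTemporary
    # columns are the ones shown in the question's output.
    for row in spark.sql(f"SHOW TABLES IN {dbname}").collect():
        if row["isTemporary"]:
            # Temporary views do not belong to the database, so skip them.
            continue
        name = row["tableName"]
        # COUNT(*) yields a single row with a single value.
        result = spark.sql(f"SELECT COUNT(*) AS n FROM {dbname}.{name}").collect()
        counts[name] = result[0]["n"]
    return counts
```

It could then be called as get_table_row_counts(spark, "nbadb") and the returned dict printed or turned into a DataFrame. Each count triggers a full scan of its table, so this may be slow on large databases.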
