I pulled an SQL Query result from BigQyery with Python, and stored the result is in a dataframe, one of the fields event_params has been saved as a list of sets as follows
[{'key': 'update_with_analytics', 'value': {'string_value': None, 'int_value': 0.0, 'float_value': None, 'double_value': None}}
{'key': 'firebase_event_origin', 'value': {'string_value': 'auto', 'int_value': None, 'float_value': None, 'double_value': None}}
{'key': 'firebase_conversion', 'value': {'string_value': None, 'int_value': 1.0, 'float_value': None, 'double_value': None}}
{'key': 'previous_first_open_count', 'value': {'string_value': None, 'int_value': 0.0, 'float_value': None, 'double_value': None}}]
I would like to store the values of the above set in several columns like so:
event_params.key | event_params.value.string_value | event_params.value.int_value | event_params.value.float_value | event_params.value.double_value
This is my python code
import google.oauth2.service_account as service_account
from google.cloud import bigquery
import datetime
from time import perf_counter
import sqlalchemy
import pandas as pd
today = datetime.datetime.today()
yesterday = today - datetime.timedelta(days=2)
numdays = 2
dateList = []
gcp_project = '***'
bqtable = 'analytics_###'
client = bigquery.Client(project=gcp_project,credentials=credentials)
table_ref = client.dataset(bqtable)
for x in range (0, numdays):
date = yesterday - datetime.timedelta(days = x)
date_str = date.strftime('%Y%m%d')
dateList.append(date_str)
df = pd.DataFrame()
for i in dateList:
query = f"""SELECT * FROM `***.analytics_####.events_{i}`"""
def bq2bi(sql):
query = client.query(sql)
query_results = query.result()
return query_results.to_dataframe()
df = pd.concat([df,bq2bi(query)])
main_df_2 = df.applymap(str)
print(main_df_2)
try:
main_df_2.to_sql(con=con, name='Firebase', if_exists='replace')
except:
print('database failed to update')
else:
print("database updated")
Stop = perf_counter()
print("Script ran in",round(Stop-Start,2),"seconds")
if there is a better way to pull the data from bigquery without it being converted to a list of sets please do advise!
CodePudding user response:
you can use the unnest() function from bigquery in your query to flatten out the data set.
you can also pass the date handing to bigquery using the _table_suffix
The query would look something like:
select
events.*,
params.*
from
`project_id.analytics.events_*` as events,
unnest(event_params) as params
where
_table_suffix >= format_date("%Y%m%d", date_add(current_date('Australia/Sydney'), INTERVAL -7 day))
Explanation:
The unnest() function will change pull out the event_params into individual columns. From something looking like:
event_date | event_params
2022-01-01 | [{ key: 'update_with_analytics', ...},{key: 'firebase_conversion', ...}]
to
event_date | key | ...
2022-01-01 | update_with_analytics' | ...
2022-01-01 | firebase_conversion | ...
the _table_suffix allows you to query the multiple tables in one go. You will have table names like events_20220101 events_20220102 etc (one for each date). You can use a wildcard in the tablename to address multiple dates. So events_202201* would find all days in January 2022 and events_* would match all tables.
When you use a wildcard in the table name you get access to a virtual column _table_suffix which contains the matched part of the table name
The example above then uses some date functions to pull out the last seven days data
On a final note, you may want to edit your queries to just return the columns that you need as bigquery charges based on the amount of data queried.
