Home > Software engineering >  Converting List of Sets to Columns
Converting List of Sets to Columns

Time:01-14

I pulled an SQL Query result from BigQyery with Python, and stored the result is in a dataframe, one of the fields event_params has been saved as a list of sets as follows

[{'key': 'update_with_analytics', 'value': {'string_value': None, 'int_value': 0.0, 'float_value': None, 'double_value': None}}
 {'key': 'firebase_event_origin', 'value': {'string_value': 'auto', 'int_value': None, 'float_value': None, 'double_value': None}}
 {'key': 'firebase_conversion', 'value': {'string_value': None, 'int_value': 1.0, 'float_value': None, 'double_value': None}}
 {'key': 'previous_first_open_count', 'value': {'string_value': None, 'int_value': 0.0, 'float_value': None, 'double_value': None}}]

I would like to store the values of the above set in several columns like so:

event_params.key | event_params.value.string_value | event_params.value.int_value | event_params.value.float_value | event_params.value.double_value

This is my python code

import google.oauth2.service_account as service_account
from google.cloud import bigquery
import datetime
from time import perf_counter
import sqlalchemy
import pandas as pd

today = datetime.datetime.today()
yesterday = today - datetime.timedelta(days=2)
numdays = 2
dateList = []

gcp_project = '***'
bqtable = 'analytics_###'

client = bigquery.Client(project=gcp_project,credentials=credentials)
table_ref = client.dataset(bqtable)


for x in range (0, numdays):
    date = yesterday - datetime.timedelta(days = x)
    date_str = date.strftime('%Y%m%d') 
    dateList.append(date_str)

df = pd.DataFrame()

for i in dateList:
    query = f"""SELECT * FROM `***.analytics_####.events_{i}`"""
    def bq2bi(sql):
        query = client.query(sql)    
        query_results = query.result()
        return query_results.to_dataframe()
    df = pd.concat([df,bq2bi(query)])
main_df_2 = df.applymap(str)

print(main_df_2)    

try:
    main_df_2.to_sql(con=con, name='Firebase', if_exists='replace')
except: 
    print('database failed to update')
else:
    print("database updated")

Stop = perf_counter()
print("Script ran in",round(Stop-Start,2),"seconds")

if there is a better way to pull the data from bigquery without it being converted to a list of sets please do advise!

CodePudding user response:

you can use the unnest() function from bigquery in your query to flatten out the data set.

you can also pass the date handing to bigquery using the _table_suffix

The query would look something like:

select 
   events.*, 
   params.*
from 
   `project_id.analytics.events_*` as events,
   unnest(event_params) as params
where
   _table_suffix >= format_date("%Y%m%d", date_add(current_date('Australia/Sydney'), INTERVAL -7 day))

Explanation:

The unnest() function will change pull out the event_params into individual columns. From something looking like:

event_date | event_params
2022-01-01 | [{ key: 'update_with_analytics', ...},{key: 'firebase_conversion', ...}]

to

event_date | key                    | ...
2022-01-01 | update_with_analytics' | ...
2022-01-01 | firebase_conversion    | ...

the _table_suffix allows you to query the multiple tables in one go. You will have table names like events_20220101 events_20220102 etc (one for each date). You can use a wildcard in the tablename to address multiple dates. So events_202201* would find all days in January 2022 and events_* would match all tables.

When you use a wildcard in the table name you get access to a virtual column _table_suffix which contains the matched part of the table name

The example above then uses some date functions to pull out the last seven days data

On a final note, you may want to edit your queries to just return the columns that you need as bigquery charges based on the amount of data queried.

  •  Tags:  
  • Related