I have created a data frame from a sql query that generates bin data for a histogram: min, max, and bin size. I want to use this data to create a list of bins that I can then merge back into the original data frame
Here is a sample of my query output
My query produces one row per model. I want to then create a list that starts at the initial_bin (0) and then increments by bin_size (7) until it reaches the maximum (435). The resulting column will be called bin_seq.
My code below shows where I'm at so far. I can generate a list with the desired sequence using a for loop and the range() function, but am unsure on how to merge this back into the data frame.
'''
import pyodbc
import pandas as pd
bin_sql = """
select 'test_model_1' model, 0 initial_bin , 435 maximum , 7 bin_size
from analytics.model_data
"""
bin_data = pd.read_sql(bin_sql,conn)
bin_max = int(bin_data["maximum"] bin_data["bin_size"])
bin_size = int(bin_data["bin_size"])
bin_start = int(bin_data["initial_bin"])
for i in range(bin_start, bin_max, bin_size):
print(i, end=', ')
'''
Here is the image of the desired output
One additional wrinkle on this is that my final output will have multiple models, each with its own maximum and bin size. So I will need to be able to generate a custom sequence for each model and then join the sequence back to the original model name.
CodePudding user response:
This code will generate the desire output you want. You didn't mentioned how to calculate bin_seq.
import pyodbc import pandas as pd bin_sql = """ select 'test_model_1' model, 0 initial_bin , 435 maximum , 7 bin_size from analytics.model_data """ bin_data = pd.read_sql(bin_sql,conn) bin_max = int(bin_data["maximum"] bin_data["bin_size"]) bin_size = int(bin_data["bin_size"]) bin_start = int(bin_data["initial_bin"]) bin_sq=[] ## For creating bin Sequence temp=0 ## Generating bin_seq for i in range(bin_start, bin_max, bin_size): bin_sq.append(temp) temp=temp bin_size ## Adding bin_seq in dataframe bin_data['bin_seq']=bin_sq
CodePudding user response:
Try:
Setup a MRE:
df = pd.DataFrame([['test_model', 0, 435, 7]],
columns=['model', 'initial_bin', 'maximum', 'bin_size'])
print(df)
# Output
model initial_bin maximum bin_size
0 test_model 0 435 7
bin_seq = lambda x: range(x['initial_bin'], x['maximum'], x['bin_size'])
df = df.assign(bin_seq=df.apply(bin_seq, axis=1)) \
.explode('bin_seq').reset_index(drop=True)
print(df)
# Output
model initial_bin maximum bin_size bin_seq
0 test_model 0 435 7 0
1 test_model 0 435 7 7
2 test_model 0 435 7 14
3 test_model 0 435 7 21
4 test_model 0 435 7 28
.. ... ... ... ... ...
58 test_model 0 435 7 406
59 test_model 0 435 7 413
60 test_model 0 435 7 420
61 test_model 0 435 7 427
62 test_model 0 435 7 434
[63 rows x 5 columns]
