I have a json file which looks like this:
{
"data": {
"success": true,
"timeseries": true,
"start_date": "2022-10-01",
"end_date": "2022-10-04",
"base": "EUR",
"rates": {
"2022-10-01": {
"NG": 0.1448939471560284
},
"2022-10-02": {
"NG": 0.14487923291390148
},
"2022-10-03": {
"NG": 0.1454857922753868
},
"2022-10-04": {
"NG": 0.1507352356663182
}
},
"unit": "per MMBtu"
}
}
I want to create a dataframe which looks like this:
Date NG base
2022-10-01 0.144894 EUR
2022-10-02 0.144879 EUR
2022-10-03 0.145486 EUR
2022-10-04 0.150735 EUR
This is what I tried:
with open(r'C:\Users\EH\Desktop\tools\json_files\blue_file.json','r') as f:
data = json.loads(f.read())
df1 = pd.DataFrame(data['data']['rates'])
df1 = df1.T
df2 = pd.DataFrame(data['data'])
df2 = df2.base
merge = [df1, df2]
df3 = pd.concat(merge)
print(df3)
My current output:
NG 0
2022-10-01 0.144894 NaN
2022-10-02 0.144879 NaN
2022-10-03 0.145486 NaN
2022-10-04 0.150735 NaN
2022-10-01 NaN EUR
2022-10-02 NaN EUR
2022-10-03 NaN EUR
2022-10-04 NaN EUR
As you can see something is not going okay, I dont understand where the NaNs come from.
CodePudding user response:
JSON is a very flexible format and no function in pandas can parse all types of JSON. You need to preprocess the JSON before constructing the dataframe:
with open("data.json") as fp:
data = json.load(fp)
df = pd.DataFrame(
[(date, rate["NG"]) for date, rate in data["data"]["rates"].items()],
columns=["Date", "NG"],
).assign(base=data["data"]["base"])
CodePudding user response:
Using what you had done, you just needed to specify the axis on which you want to concatenate the 2 dataframes by default pandas concatenated one below the other.
pd.concat(merge, axis=1) # default 0
CodePudding user response:
simply:
import json
with open(r'path.json','r') as f:
data = json.loads(f.read())
df=pd.DataFrame(data['data']).reset_index()
df=df.join(pd.json_normalize(df.pop('rates')))
print(df)
'''
index success timeseries start_date end_date base unit NG
0 2022-10-01 true true 2022-10-01 2022-10-04 EUR per MMBtu 0.1448939471560284
1 2022-10-02 true true 2022-10-01 2022-10-04 EUR per MMBtu 0.14487923291390148
2 2022-10-03 true true 2022-10-01 2022-10-04 EUR per MMBtu 0.1454857922753868
3 2022-10-04 true true 2022-10-01 2022-10-04 EUR per MMBtu 0.1507352356663182
'''
CodePudding user response:
You were almost there. The reason you are getting NaNs is that you need to specify different axis when concatenating the dataframes: pd.concat(merge, axis=1).
Here is my version:
import json
import pandas as pd
with open("blue_file.json", "r") as f:
data = json.loads(f.read())
df = pd.DataFrame.from_dict(data["data"]["rates"], orient="index")
df["base"] = data["data"]["base"]
df
Output:
NG base
2022-10-01 0.144894 EUR
2022-10-02 0.144879 EUR
2022-10-03 0.145486 EUR
2022-10-04 0.150735 EUR
CodePudding user response:
Assuming that the file is stored in a variable called data
data = { "data": { "success": True, "timeseries": True, "start_date": "2022-10-01", "end_date": "2022-10-04", "base": "EUR", "rates": { "2022-10-01": { "NG": 0.1448939471560284 }, "2022-10-02": { "NG": 0.14487923291390148 }, "2022-10-03": { "NG": 0.1454857922753868 }, "2022-10-04": { "NG": 0.1507352356663182 } }, "unit": "per MMBtu" } }
First, one will read the data to a dataframe with pandas.DataFrame.from_dict as follows
df = pd.DataFrame.from_dict(data['data']['rates'], orient='index')
[Out]:
NG
2022-10-01 0.144894
2022-10-02 0.144879
2022-10-03 0.145486
2022-10-04 0.150735
But, as one wants doesn't want the columns with date to be the index, and one wants that column to have the name Date, one will start by resetting the index using pandas.DataFrame.reset_index
df = df.reset_index()
[Out]:
index NG
0 2022-10-01 0.144894
1 2022-10-02 0.144879
2 2022-10-03 0.145486
3 2022-10-04 0.150735
And then one will be able to rename the columns to the desired output format pandas.DataFrame.rename
df = df.rename(columns={'index':'Date'})
[Out]:
Date NG
0 2022-10-01 0.144894
1 2022-10-02 0.144879
2 2022-10-03 0.145486
3 2022-10-04 0.150735
As it is still missing the base column, one can pick it up from the data as follows
df['base'] = data['data']['base']
[Out]:
Date NG base
0 2022-10-01 0.144894 EUR
1 2022-10-02 0.144879 EUR
2 2022-10-03 0.145486 EUR
3 2022-10-04 0.150735 EUR
Even though one already has the desired output, one can wrap everything into a one-liner as follows
df = pd.DataFrame.from_dict(data['data']['rates'], orient='index').reset_index().rename(columns={'index':'Date'}).assign(base=data['data']['base'])
[Out]:
Date NG base
0 2022-10-01 0.144894 EUR
1 2022-10-02 0.144879 EUR
2 2022-10-03 0.145486 EUR
3 2022-10-04 0.150735 EUR
