Here's an example of the data I'm working with:
values variable.variableName timeZone
0 [{'value': [], turbidity PST
'qualifier': [],
'qualityControlLevel': [],
'method': [{
'methodDescription': '[TS087: YSI 6136]',
'methodID': 15009}],
'source': [],
'offset': [],
'sample': [],
'censorCode': []},
{'value': [{
'value': '17.2',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:30:00.000-08:00'},
{'value': '17.5',
'qualifiers': ['P'],
'dateTime': '2022-01-05T14:00:00.000-08:00'}
}]
1 [{'value': degC PST
[{'value': '9.3',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:30:00.000-08:00'},
{'value': '9.4',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:45:00.000-08:00'},
}]
I'm trying to break out each of the variables in the data into their own dataframes, what I have so far works, however, if there are multiple sets of the values (like in turbidity); it only pulls in the first set, which is sometimes empty. How do I pull in all the value sets? Here's what I have so far:
import requests
import pandas as pd
url = ('https://waterservices.usgs.gov/nwis/iv?sites=11273400&period=P1D&format=json')
response = requests.get(url)
result = response.json()
json_list = result['value']['timeSeries']
df = pd.json_normalize(json_list)
new_df = df['values'].apply(lambda x: pd.DataFrame(x[0]['value']))
new_df.index = df['variable.variableName']
# print turbidity
print(new_df.loc['Turbidity, water, unfiltered, monochrome near infra-red LED light,
780-900 nm, detection angle 90 ±2.5°, formazin nephelometric units (FNU)'])
This outputs:
turbidity df
Empty DataFrame
Columns: []
Index: []
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Whereas I want my output to be something like:
turbidity df
value qualifiers dateTime
0 17.2 P 2022-01-05T12:30:00.000-08:00
1 17.5 P 2022-01-05T14:00:00.000-08:00
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Unfortunately, it only grabs the first value set, which in the case of turbidity is empty. How can I grab them all or check to see if the data frame is empty and grab the next one?
CodePudding user response:
I believe the missing link here is DataFrame.explode() -- it allows you to split a single row that contains a list of values (your "values" column) into multiple rows.
You can then use
new_df = df.explode("values")
which will split the "turbidity" row into two.
You can then filter rows with empty "value" dictionaries and apply .explode() once again.
You can then also use pd.json_normalize again to expand a dictionary of values into multiple columns, or also look into Series.str.get() to extract a single element from a dict or list.
CodePudding user response:
This JSON is nested deep so I think it requires a few steps to transform into what you want.
# First, use json_normalize on top level to extract values and variableName.
df = pd.json_normalize(result, record_path=['values'], meta=[['variable', 'variableName']])
# Then explode the value to flatten the array and filter out any empty array
df = df.explode('value').dropna(subset=['value'])
# Another json_normalize on the exploded value to extract the value and qualifier and dateTime, concat with variableName.
# explode('qualifiers') is to take out wrapping array.
df = pd.concat([df[['variable.variableName']].reset_index(drop=True),
pd.json_normalize(df.value).explode('qualifiers')], axis=1)
Resulted dataframe should look like this.
variable.variableName value qualifiers dateTime
0 Temperature, water, °C 10.7 P 2022-01-06T12:15:00.000-08:00
1 Temperature, water, °C 10.7 P 2022-01-06T12:30:00.000-08:00
2 Temperature, water, °C 10.7 P 2022-01-06T12:45:00.000-08:00
3 Temperature, water, °C 10.8 P 2022-01-06T13:00:00.000-08:00
If you will do further data processing, it is probably better to keep everything in 1 dataframe but if you really need to have separate dataframes, take it out with the filtering.
df_turbidity = df[df['variable.variableName'].str.startswith('Turbidity')]
