How to access nested data in a pandas dataframe?-CodePudding

Here's an example of the data I'm working with:

  values                                            variable.variableName    timeZone
0 [{'value': [],                                    turbidity                PST
  'qualifier': [], 
  'qualityControlLevel': [], 
  'method': [{
      'methodDescription': '[TS087: YSI 6136]', 
      'methodID': 15009}], 
  'source': [], 
  'offset': [], 
  'sample': [], 
  'censorCode': []}, 
 {'value': [{
      'value': '17.2', 
      'qualifiers': ['P'], 
      'dateTime': '2022-01-05T12:30:00.000-08:00'},
     {'value': '17.5', 
      'qualifiers': ['P'], 
      'dateTime': '2022-01-05T14:00:00.000-08:00'}
  }]
1 [{'value':                                        degC                     PST
     [{'value': '9.3', 
       'qualifiers': ['P'], 
       'dateTime': '2022-01-05T12:30:00.000-08:00'}, 
      {'value': '9.4', 
       'qualifiers': ['P'], 
       'dateTime': '2022-01-05T12:45:00.000-08:00'},
  }]

I'm trying to break out each of the variables in the data into their own dataframes, what I have so far works, however, if there are multiple sets of the values (like in turbidity); it only pulls in the first set, which is sometimes empty. How do I pull in all the value sets? Here's what I have so far:

import requests
import pandas as pd

url = ('https://waterservices.usgs.gov/nwis/iv?sites=11273400&period=P1D&format=json')
response = requests.get(url)
result = response.json()

json_list = result['value']['timeSeries']
df = pd.json_normalize(json_list)

new_df = df['values'].apply(lambda x: pd.DataFrame(x[0]['value']))
new_df.index = df['variable.variableName']

# print turbidity
print(new_df.loc['Turbidity, water, unfiltered, monochrome near infra-red LED light, 
780-900 nm, detection angle 90 &#177;2.5&#176;, formazin nephelometric units (FNU)'])

This outputs:

turbidity df
Empty DataFrame
Columns: []
Index: []

degC df
     value        qualifiers       dateTime
0    9.3          P                2022-01-05T12:30:00.000-08:00    
1    9.4          P                2022-01-05T12:45:00.000-08:00

Whereas I want my output to be something like:

turbidity df
     value        qualifiers       dateTime
0    17.2         P                2022-01-05T12:30:00.000-08:00    
1    17.5         P                2022-01-05T14:00:00.000-08:00


degC df
     value        qualifiers       dateTime
0    9.3          P                2022-01-05T12:30:00.000-08:00    
1    9.4          P                2022-01-05T12:45:00.000-08:00

Unfortunately, it only grabs the first value set, which in the case of turbidity is empty. How can I grab them all or check to see if the data frame is empty and grab the next one?

CodePudding user response：

I believe the missing link here is DataFrame.explode() -- it allows you to split a single row that contains a list of values (your "values" column) into multiple rows.

You can then use

new_df = df.explode("values")

which will split the "turbidity" row into two.

You can then filter rows with empty "value" dictionaries and apply .explode() once again.

You can then also use pd.json_normalize again to expand a dictionary of values into multiple columns, or also look into Series.str.get() to extract a single element from a dict or list.

CodePudding user response：

This JSON is nested deep so I think it requires a few steps to transform into what you want.

# First, use json_normalize on top level to extract values and variableName.
df = pd.json_normalize(result, record_path=['values'], meta=[['variable', 'variableName']])

# Then explode the value to flatten the array and filter out any empty array
df = df.explode('value').dropna(subset=['value'])

# Another json_normalize on the exploded value to extract the value and qualifier and dateTime, concat with variableName.
# explode('qualifiers') is to take out wrapping array.
df = pd.concat([df[['variable.variableName']].reset_index(drop=True), 
                pd.json_normalize(df.value).explode('qualifiers')], axis=1)

Resulted dataframe should look like this.

    variable.variableName      value qualifiers               dateTime
0   Temperature, water, &#176;C 10.7          P 2022-01-06T12:15:00.000-08:00
1   Temperature, water, &#176;C 10.7          P 2022-01-06T12:30:00.000-08:00
2   Temperature, water, &#176;C 10.7          P 2022-01-06T12:45:00.000-08:00
3   Temperature, water, &#176;C 10.8          P 2022-01-06T13:00:00.000-08:00

If you will do further data processing, it is probably better to keep everything in 1 dataframe but if you really need to have separate dataframes, take it out with the filtering.

df_turbidity = df[df['variable.variableName'].str.startswith('Turbidity')]