Home > Enterprise >  generate a new Dataframe from mongodb collection nested array
generate a new Dataframe from mongodb collection nested array

Time:01-04

I'm trying to generate a new dataframe from an mongodb collection, the goal is to make a new df representing only 'events' column:

e.g:

{
    "_id" : 1641008579,
    "status" : "init",
    "description" : "Test",
    "attachment" : null,
    "start" : "08:00",
    "user" : "Jenny",
    "timestamp" : ISODate("2022-01-01T04:43:11.380Z"),
    "events" : [ 
        {
            "id" : 1641008580,
            "status" : "start",
            "description" : "First Event",
            "user" : "Jenny",
            "timestamp" : ISODate("2022-01-01T04:43:11.380Z")
        }, 
        {
            "id" : 1641008581,
            "status" : "progress",
            "description" : "Middle of the Event",
            "user" : "Joe",
            "timestamp" : ISODate("2022-01-01T05:43:11.380Z")
        }, 
        {
            "id" : 1641008582,
            "status" : "end",
            "description" : "Last Event",
            "user" : "Alain",
            "timestamp" : ISODate("2022-01-01T06:43:11.380Z")
        }
    ]
}

Any idea how to start an approach ìn order to get the following ?

event_df should be like the following :

    id          status      description             user    timestamp
0   1641008580  start       First Event             Jenny   "2022-01-01T04:43:11.380Z"
1   1641008581  progress    Middle of the Event     Joe     "2022-01-01T05:43:11.380Z"
2   1641008582  end         Last Event              Alain   "2022-01-01T06:43:11.380Z"

/K

CodePudding user response:

Pandas' pandas.json_normalize method works perfectly here, which will "Normalize semi-structured JSON data into a flat table." returning a DataFrame.

API Reference -> pandas.json_normalize

import json
import pandas as pd

with open('mongo.json') as json_file: # retrieve the json file
    data = json.load(json_file) # deserialize the json file to a dict 
    events_df = pd.json_normalize(data['events']) # normalize and create a dataframe 
    print(events_df)

CodePudding user response:

Here is a function just after loading the collection:

def set_event_2_df(last_situation):
    for doc in last_situation:
        for k, v in doc.items():
            try:
                if k == 'events':
                    for i, e in enumerate(doc['events']):
                        new_row = {
                            'id': str(doc['events'][i]['id']),
                            'status': doc['events'][i]['status'],
                            'description': doc['events'][i]['description'],
                            'user': doc['events'][i]['user'],
                            'timestamp': doc['events'][i]['timestamp']
                        }
                        df_event = df_event.append(new_row, ignore_index=True)
            except Exception as e:
                print('EXP - {}'.format(e))
  •  Tags:  
  • Related