I'm trying to generate a new dataframe from an mongodb collection, the goal is to make a new df representing only 'events' column:
e.g:
{
"_id" : 1641008579,
"status" : "init",
"description" : "Test",
"attachment" : null,
"start" : "08:00",
"user" : "Jenny",
"timestamp" : ISODate("2022-01-01T04:43:11.380Z"),
"events" : [
{
"id" : 1641008580,
"status" : "start",
"description" : "First Event",
"user" : "Jenny",
"timestamp" : ISODate("2022-01-01T04:43:11.380Z")
},
{
"id" : 1641008581,
"status" : "progress",
"description" : "Middle of the Event",
"user" : "Joe",
"timestamp" : ISODate("2022-01-01T05:43:11.380Z")
},
{
"id" : 1641008582,
"status" : "end",
"description" : "Last Event",
"user" : "Alain",
"timestamp" : ISODate("2022-01-01T06:43:11.380Z")
}
]
}
Any idea how to start an approach ìn order to get the following ?
event_df should be like the following :
id status description user timestamp
0 1641008580 start First Event Jenny "2022-01-01T04:43:11.380Z"
1 1641008581 progress Middle of the Event Joe "2022-01-01T05:43:11.380Z"
2 1641008582 end Last Event Alain "2022-01-01T06:43:11.380Z"
/K
CodePudding user response:
Pandas' pandas.json_normalize method works perfectly here, which will "Normalize semi-structured JSON data into a flat table." returning a DataFrame.
API Reference -> pandas.json_normalize
import json
import pandas as pd
with open('mongo.json') as json_file: # retrieve the json file
data = json.load(json_file) # deserialize the json file to a dict
events_df = pd.json_normalize(data['events']) # normalize and create a dataframe
print(events_df)
CodePudding user response:
Here is a function just after loading the collection:
def set_event_2_df(last_situation):
for doc in last_situation:
for k, v in doc.items():
try:
if k == 'events':
for i, e in enumerate(doc['events']):
new_row = {
'id': str(doc['events'][i]['id']),
'status': doc['events'][i]['status'],
'description': doc['events'][i]['description'],
'user': doc['events'][i]['user'],
'timestamp': doc['events'][i]['timestamp']
}
df_event = df_event.append(new_row, ignore_index=True)
except Exception as e:
print('EXP - {}'.format(e))
