Home > database >  Looping over Pandas dataframe cells which have JSON type lists of dictionaries for counting values
Looping over Pandas dataframe cells which have JSON type lists of dictionaries for counting values

Time:01-06

I have a 3X3 data frame (df) whose columns are "observation" and "features". Each cell of "features" columns has many lists of dictionaries in them.

I would like to write a loop which will go over each cells of "features" columns and will count the number of unique two-digit categories, simultaneously these counted numbers will be put to another data frame which will have two columns: "Categories" (such as "02 Pppppppp") which will give the name of two-digit categories and their counted numbers in "Count" column.

observation = [1, 2, 3]
features = [
    """[[{'id': '2211', 'name': '11 Mmmmm'},
        {'id': '3142', 'name': '1112 Mmmmm Ooooo'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2401', 'name': '0204 Pppppppp Mmmmm'},
        {'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'},
        {'id': '2389', 'name': '0202 Pppppppp Atata'}],
       [{'id': '2211', 'name': '11 Mmmmm'},
        {'id': '3053', 'name': '1103 Mmmmm Sssssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2211', 'name': '11 Mmmmm'},
        {'id': '2206', 'name': '06 Bbbbbb'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}]]""",
    """[[{'id': '2581', 'name': '0601 Bbbbbb Cbcbcb'},
            {'id': '2206', 'name': '06 Bbbbbb'}],
           [{'id': '2666', 'name': '0699 Other Bbbbbb'},
            {'id': '2486', 'name': '0399 Other Kkkkkk '},
            {'id': '2203', 'name': '03 Kkkkkk '},
            {'id': '2620', 'name': '0604 Genetics'},
            {'id': '2206', 'name': '06 Bbbbbb'}],
           [{'id': '2581', 'name': '0601 Bbbbbb Cbcbcb'},
            {'id': '2206', 'name': '06 Bbbbbb'}],
           [{'id': '2211', 'name': '11 Mmmmm'}],
           [{'id': '2581', 'name': '0601 Bbbbbb Cbcbcb'},
            {'id': '2206', 'name': '06 Bbbbbb'}],
           [{'id': '2921', 'name': '0912 Wwwww Apapa'},
            {'id': '2209', 'name': '09 Wwwww '},
            {'id': '2844', 'name': '0904 Wwwww Enenen'}]]""",
    """[[{'id': '2921', 'name': '0912 Wwwww Apapa'},
            {'id': '2209', 'name': '09 Wwwww '},
            {'id': '2203', 'name': '03 Kkkkkk '},
            {'id': '2471', 'name': '0306 Kkkkkk Chch'}],
           [{'id': '2203', 'name': '03 Kkkkkk '},
            {'id': '2471', 'name': '0306 Kkkkkk Chch'}]]""",
]

d = {"Observations": observation, "Features": features}
df = pd.DataFrame(data=d)

CodePudding user response:

You could try to :

  • find and parse all the string dictionaries in features with the help of Python standard library re and ast modules,
  • count unique values and make a dataframe of them,
  • finally, exclude categories which names do not begin with two digits using Pandas str.match method.

Like this:

import ast
import re

import pandas as pd

records = [
    ast.literal_eval(record)
    for string in features
    for record in re.findall(r"{. \W}", string)
]
df = pd.DataFrame(records)

new_df = df["name"].value_counts().to_frame().reset_index(drop=False)
new_df.columns = ["Categories", "Count"]

new_df = new_df.loc[new_df["Categories"].str.match(r"\d{2}\s"), :].reset_index(
    drop=True
)

print(new_df)
# Ouputs
     Categories  Count
0  02 Pppppppp      10
1     06 Bbbbbb      5
2      11 Mmmmm      4
3    03 Kkkkkk       3
4     09 Wwwww       2
  •  Tags:  
  • Related