Looping over Pandas dataframe cells which have JSON type lists of dictionaries for counting values-CodePudding

I have a 3X3 data frame (df) whose columns are "observation" and "features". Each cell of "features" columns has many lists of dictionaries in them.

I would like to write a loop which will go over each cells of "features" columns and will count the number of unique two-digit categories, simultaneously these counted numbers will be put to another data frame which will have two columns: "Categories" (such as "02 Pppppppp") which will give the name of two-digit categories and their counted numbers in "Count" column.

observation = [1, 2, 3]
features = [
    """[[{'id': '2211', 'name': '11 Mmmmm'},
        {'id': '3142', 'name': '1112 Mmmmm Ooooo'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2401', 'name': '0204 Pppppppp Mmmmm'},
        {'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'},
        {'id': '2389', 'name': '0202 Pppppppp Atata'}],
       [{'id': '2211', 'name': '11 Mmmmm'},
        {'id': '3053', 'name': '1103 Mmmmm Sssssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2211', 'name': '11 Mmmmm'},
        {'id': '2206', 'name': '06 Bbbbbb'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}],
       [{'id': '2202', 'name': '02 Pppppppp '},
        {'id': '2421', 'name': '0299 Pppppppp Sssss'}]]""",
    """[[{'id': '2581', 'name': '0601 Bbbbbb Cbcbcb'},
            {'id': '2206', 'name': '06 Bbbbbb'}],
           [{'id': '2666', 'name': '0699 Other Bbbbbb'},
            {'id': '2486', 'name': '0399 Other Kkkkkk '},
            {'id': '2203', 'name': '03 Kkkkkk '},
            {'id': '2620', 'name': '0604 Genetics'},
            {'id': '2206', 'name': '06 Bbbbbb'}],
           [{'id': '2581', 'name': '0601 Bbbbbb Cbcbcb'},
            {'id': '2206', 'name': '06 Bbbbbb'}],
           [{'id': '2211', 'name': '11 Mmmmm'}],
           [{'id': '2581', 'name': '0601 Bbbbbb Cbcbcb'},
            {'id': '2206', 'name': '06 Bbbbbb'}],
           [{'id': '2921', 'name': '0912 Wwwww Apapa'},
            {'id': '2209', 'name': '09 Wwwww '},
            {'id': '2844', 'name': '0904 Wwwww Enenen'}]]""",
    """[[{'id': '2921', 'name': '0912 Wwwww Apapa'},
            {'id': '2209', 'name': '09 Wwwww '},
            {'id': '2203', 'name': '03 Kkkkkk '},
            {'id': '2471', 'name': '0306 Kkkkkk Chch'}],
           [{'id': '2203', 'name': '03 Kkkkkk '},
            {'id': '2471', 'name': '0306 Kkkkkk Chch'}]]""",
]

d = {"Observations": observation, "Features": features}
df = pd.DataFrame(data=d)

CodePudding user response：

You could try to :

find and parse all the string dictionaries in features with the help of Python standard library re and ast modules,
count unique values and make a dataframe of them,
finally, exclude categories which names do not begin with two digits using Pandas str.match method.

Like this:

import ast
import re

import pandas as pd

records = [
    ast.literal_eval(record)
    for string in features
    for record in re.findall(r"{. \W}", string)
]
df = pd.DataFrame(records)

new_df = df["name"].value_counts().to_frame().reset_index(drop=False)
new_df.columns = ["Categories", "Count"]

new_df = new_df.loc[new_df["Categories"].str.match(r"\d{2}\s"), :].reset_index(
    drop=True
)

print(new_df)
# Ouputs
     Categories  Count
0  02 Pppppppp      10
1     06 Bbbbbb      5
2      11 Mmmmm      4
3    03 Kkkkkk       3
4     09 Wwwww       2