I have a 3X3 data frame (df) whose columns are "observation" and "features". Each cell of "features" columns has many lists of dictionaries in them.
I would like to write a loop which will go over each cells of "features" columns and will count the number of unique two-digit categories, simultaneously these counted numbers will be put to another data frame which will have two columns: "Categories" (such as "02 Pppppppp") which will give the name of two-digit categories and their counted numbers in "Count" column.
observation = [1, 2, 3]
features = [
"""[[{'id': '2211', 'name': '11 Mmmmm'},
{'id': '3142', 'name': '1112 Mmmmm Ooooo'}],
[{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2421', 'name': '0299 Pppppppp Sssss'}],
[{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2421', 'name': '0299 Pppppppp Sssss'}],
[{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2421', 'name': '0299 Pppppppp Sssss'}],
[{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2421', 'name': '0299 Pppppppp Sssss'}],
[{'id': '2401', 'name': '0204 Pppppppp Mmmmm'},
{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2421', 'name': '0299 Pppppppp Sssss'},
{'id': '2389', 'name': '0202 Pppppppp Atata'}],
[{'id': '2211', 'name': '11 Mmmmm'},
{'id': '3053', 'name': '1103 Mmmmm Sssssss'}],
[{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2421', 'name': '0299 Pppppppp Sssss'}],
[{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2211', 'name': '11 Mmmmm'},
{'id': '2206', 'name': '06 Bbbbbb'}],
[{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2421', 'name': '0299 Pppppppp Sssss'}],
[{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2421', 'name': '0299 Pppppppp Sssss'}],
[{'id': '2202', 'name': '02 Pppppppp '},
{'id': '2421', 'name': '0299 Pppppppp Sssss'}]]""",
"""[[{'id': '2581', 'name': '0601 Bbbbbb Cbcbcb'},
{'id': '2206', 'name': '06 Bbbbbb'}],
[{'id': '2666', 'name': '0699 Other Bbbbbb'},
{'id': '2486', 'name': '0399 Other Kkkkkk '},
{'id': '2203', 'name': '03 Kkkkkk '},
{'id': '2620', 'name': '0604 Genetics'},
{'id': '2206', 'name': '06 Bbbbbb'}],
[{'id': '2581', 'name': '0601 Bbbbbb Cbcbcb'},
{'id': '2206', 'name': '06 Bbbbbb'}],
[{'id': '2211', 'name': '11 Mmmmm'}],
[{'id': '2581', 'name': '0601 Bbbbbb Cbcbcb'},
{'id': '2206', 'name': '06 Bbbbbb'}],
[{'id': '2921', 'name': '0912 Wwwww Apapa'},
{'id': '2209', 'name': '09 Wwwww '},
{'id': '2844', 'name': '0904 Wwwww Enenen'}]]""",
"""[[{'id': '2921', 'name': '0912 Wwwww Apapa'},
{'id': '2209', 'name': '09 Wwwww '},
{'id': '2203', 'name': '03 Kkkkkk '},
{'id': '2471', 'name': '0306 Kkkkkk Chch'}],
[{'id': '2203', 'name': '03 Kkkkkk '},
{'id': '2471', 'name': '0306 Kkkkkk Chch'}]]""",
]
d = {"Observations": observation, "Features": features}
df = pd.DataFrame(data=d)
CodePudding user response:
You could try to :
- find and parse all the string dictionaries in
featureswith the help of Python standard library re and ast modules, - count unique values and make a dataframe of them,
- finally, exclude categories which names do not begin with two digits using Pandas str.match method.
Like this:
import ast
import re
import pandas as pd
records = [
ast.literal_eval(record)
for string in features
for record in re.findall(r"{. \W}", string)
]
df = pd.DataFrame(records)
new_df = df["name"].value_counts().to_frame().reset_index(drop=False)
new_df.columns = ["Categories", "Count"]
new_df = new_df.loc[new_df["Categories"].str.match(r"\d{2}\s"), :].reset_index(
drop=True
)
print(new_df)
# Ouputs
Categories Count
0 02 Pppppppp 10
1 06 Bbbbbb 5
2 11 Mmmmm 4
3 03 Kkkkkk 3
4 09 Wwwww 2
