Count elements in defined groups in pandas dataframe

Say I have a dataframe and I want to count how many times each element of a list, e.g. [1,5,2], occurs in a column (or in each column).

I could do something like

elem_list = [1,5,2]

for e in elem_list:
    print(e, (df["col1"]==e).sum())

but isn't there a better way, something like

elem_list = [1,5,2]
df["col1"].count_elements(elem_list)

#1 5    # 1 occurs 5 times
#5 3    # 5 occurs 3 times
#2 0    # 2 occurs 0 times

Note that it should count every element in the list and return 0 for any element that does not occur in the column.

CodePudding user response：

Pass the column to pd.Categorical with elem_list as the categories; value_counts then returns 0 for any category that is missing from the column:

pd.Categorical(df['col1'],elem_list).value_counts()
Out[62]: 
1    3
5    0
2    1
dtype: int64
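
For reference, here is a self-contained sketch of the Categorical approach. The sample column is made up for illustration and is not the data that produced the output above. Values outside elem_list (here 4 and 3) do not belong to any category and are simply not counted.

```python
import pandas as pd

# Illustrative data (an assumption, not the asker's dataframe)
df = pd.DataFrame({"col1": [1, 1, 5, 1, 5, 1, 1, 4, 3]})
elem_list = [1, 5, 2]

# Values not listed in `categories` become NaN and are dropped from the counts;
# categories that never occur are still reported, with a count of 0.
counts = pd.Categorical(df["col1"], categories=elem_list).value_counts()
print(counts)
```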

CodePudding user response：

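You could do something like this (note that it only lists values that actually occur, so a missing element is not reported as 0):

import numpy as np
import pandas as pd
df = pd.DataFrame({"col1":np.random.randint(0,10, 100)})
df[df["col1"].isin([0,1])].value_counts()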

# col1
# 1       17
# 0       10
# dtype: int64

CodePudding user response：

First filter with Series.isin and DataFrame.loc, then use Series.value_counts; finally, if the order matters or missing elements should show up as 0, add Series.reindex:

df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
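
A minimal runnable version of the filter-then-count chain, on made-up data. The isin filter only drops values outside elem_list before counting; it is the reindex step that restores the list order and fills in 0 for elements that never occur.

```python
import pandas as pd

# Illustrative data (an assumption, not the asker's dataframe)
df = pd.DataFrame({"col1": [1, 1, 5, 1, 5, 1, 1, 4, 3]})
elem_list = [1, 5, 2]

# Keep only the rows whose value is in elem_list (here the 1s and 5s)
filtered = df.loc[df["col1"].isin(elem_list), "col1"]

# Count what is left, then reorder to elem_list and fill absent elements with 0
counts = filtered.value_counts().reindex(elem_list, fill_value=0)
print(counts)
```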

CodePudding user response：

You can use value_counts and reindex:

df = pd.DataFrame({'col1': [1,1,5,1,5,1,1,4,3]})

elem_list = [1,5,2]
df['col1'].value_counts().reindex(elem_list, fill_value=0)

output:

1    5
5    2
2    0
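
Since the question also mentions counting in each column, the same idea can be applied column by column with DataFrame.apply. This is only a sketch: count_elements is a hypothetical helper name (pandas has no such method), and col2 is made-up data added for illustration.

```python
import pandas as pd

def count_elements(df, elem_list):
    """Count occurrences of each element of elem_list in every column of df.

    Hypothetical helper, not part of pandas. Returns a DataFrame indexed by
    elem_list, with 0 where an element does not occur in a column.
    """
    return df.apply(lambda s: s.value_counts().reindex(elem_list, fill_value=0))

df = pd.DataFrame({
    "col1": [1, 1, 5, 1, 5, 1, 1, 4, 3],
    "col2": [2, 2, 2, 5, 0, 0, 7, 7, 7],  # made-up second column
})
result = count_elements(df, [1, 5, 2])
print(result)
```

Each column of the result holds the counts for the matching column of df, so result["col1"] is the same Series as in the single-column example.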

benchmark (100k values):

# setup
df = pd.DataFrame({'col1': np.random.randint(0,10, size=100000)})

df['col1'].value_counts().reindex(elem_list, fill_value=0)
# 774 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

pd.Categorical(df['col1'],elem_list).value_counts()
# 2.72 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
# 2.98 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
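
To rerun the comparison on your own machine, something like the following sketch with the standard timeit module should work. Absolute timings depend on hardware and pandas version, so none are quoted here; the function names are just labels for this sketch.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so runs are repeatable
df = pd.DataFrame({"col1": rng.integers(0, 10, size=100_000)})
elem_list = [1, 5, 2]

def via_reindex():
    return df["col1"].value_counts().reindex(elem_list, fill_value=0)

def via_categorical():
    return pd.Categorical(df["col1"], categories=elem_list).value_counts()

def via_isin():
    filtered = df.loc[df["col1"].isin(elem_list), "col1"]
    return filtered.value_counts().reindex(elem_list, fill_value=0)

# Average over a small number of calls; increase `number` for steadier figures
for fn in (via_reindex, via_categorical, via_isin):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e6:.0f} us per call")
```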