Count values in a pandas dataframe and use those some values for creating a subdataframe-CodePudding

I have a pandas dataframe. I want to count all the values in a column, to understand which of them is repeated. Then, I want to extract just the repeated values, and I want to use them to create a subdataframe.

Let’s make an example. Say that this is my dataframe:

df =

    type        color       name
0   fruit       red         apple
1   fruit       yellow      banana
2   meat        brown       steak
3   fruit       green       apple
4   fruit       orange      orange
5   veg         orange      carrot
6   fruit       yellow      apple
7   meat        brown       steak
8   veg         orange      carrot

I want to know if in the column ‘name’ is there any repeated value. To do it, I use this line of code:

df['name'].value_counts().loc[lambda x : x>1]

And this is what I get:

apple   3
steak   2
carrot  2

Then, I want to create a subdataframe, filtering the ‘name’ column with ‘apple’, ‘steak’, ‘carrot’, to find the values related to the other column. This can be done with a proper function, of course.

The desired output is:

sub_df =

    type        color       name
0   fruit       red         apple
1   fruit       green       apple
2   fruit       yellow      apple
3   meat        steak       brown
4   meat        steak       brown
5   veg         orange      carrot
6   veg         orange      carrot

I’ve tried different kind of code, with no success. I think the problem is in the use of df.count_values() because it gives me a Pandas series with the number of occurrencies, with no way to access the values that this method counts.

Any suggestion on how to do it?

CodePudding user response：

You do not need to do this in two steps, here is how to use groupby and filter to achieve the final result:

df.groupby('name').filter(lambda g: g['type'].count() > 1).sort_values('name')

output:


    type    color   name
0   fruit   red     apple
3   fruit   green   apple
6   fruit   yellow  apple
5   veg     orange  carrot
8   veg     orange  carrot
2   meat    brown   steak
7   meat    brown   steak

CodePudding user response：

For the next time, please provide better test data (data to copy and paste).

I think your desired output is wrong, because there is a steakvalue in color column.

I've tried the following that should do what you want. I think you understand the code, i only added the following line:

df[df["name"].isin(y.index.tolist())]

it searches in the name column of the dataframe for all values (isin) of the index values of the Series. If you want to have a complete new dataframe with own index you can add .reset_index() to the above line.

import pandas as pd

df = pd.DataFrame([
    ["fruit", "red", "apple"],
    ["fruit", "yellow", "banana"],
    ["meat", "brown", "steak"],
    ["fruit", "green", "apple"],
    ["fruit", "orange", "orange"],
    ["veg", "orange", "carrot"],
    ["fruit", "yellow", "apple"],
    ["meat", "brown", "steak"],
    ["veg", "orange", "carrot"]
],
    columns=["type", "color", "name"])

print(df)

y = df['name'].value_counts().loc[lambda x: x > 1]

print(y)

df_2 = df[df["name"].isin(y.index.tolist())]

print(df_2)

Output:

    type   color    name
0  fruit     red   apple
1  fruit  yellow  banana
2   meat   brown   steak
3  fruit   green   apple
4  fruit  orange  orange
5    veg  orange  carrot
6  fruit  yellow   apple
7   meat   brown   steak
8    veg  orange  carrot
apple     3
steak     2
carrot    2
Name: name, dtype: int64
    type   color    name
0  fruit     red   apple
2   meat   brown   steak
3  fruit   green   apple
5    veg  orange  carrot
6  fruit  yellow   apple
7   meat   brown   steak
8    veg  orange  carrot