Home > Enterprise >  Count values in a pandas dataframe and use those some values for creating a subdataframe
Count values in a pandas dataframe and use those some values for creating a subdataframe

Time:01-21

I have a pandas dataframe. I want to count all the values in a column, to understand which of them is repeated. Then, I want to extract just the repeated values, and I want to use them to create a subdataframe.

Let’s make an example. Say that this is my dataframe:

df =

    type        color       name
0   fruit       red         apple
1   fruit       yellow      banana
2   meat        brown       steak
3   fruit       green       apple
4   fruit       orange      orange
5   veg         orange      carrot
6   fruit       yellow      apple
7   meat        brown       steak
8   veg         orange      carrot

I want to know if in the column ‘name’ is there any repeated value. To do it, I use this line of code:

df['name'].value_counts().loc[lambda x : x>1] 

And this is what I get:

apple   3
steak   2
carrot  2

Then, I want to create a subdataframe, filtering the ‘name’ column with ‘apple’, ‘steak’, ‘carrot’, to find the values related to the other column. This can be done with a proper function, of course.

The desired output is:

sub_df =

    type        color       name
0   fruit       red         apple
1   fruit       green       apple
2   fruit       yellow      apple
3   meat        steak       brown
4   meat        steak       brown
5   veg         orange      carrot
6   veg         orange      carrot

I’ve tried different kind of code, with no success. I think the problem is in the use of df.count_values() because it gives me a Pandas series with the number of occurrencies, with no way to access the values that this method counts.

Any suggestion on how to do it?

CodePudding user response:

You do not need to do this in two steps, here is how to use groupby and filter to achieve the final result:

df.groupby('name').filter(lambda g: g['type'].count() > 1).sort_values('name')

output:


    type    color   name
0   fruit   red     apple
3   fruit   green   apple
6   fruit   yellow  apple
5   veg     orange  carrot
8   veg     orange  carrot
2   meat    brown   steak
7   meat    brown   steak

CodePudding user response:

For the next time, please provide better test data (data to copy and paste).

I think your desired output is wrong, because there is a steakvalue in color column.

I've tried the following that should do what you want. I think you understand the code, i only added the following line:

df[df["name"].isin(y.index.tolist())]

it searches in the name column of the dataframe for all values (isin) of the index values of the Series. If you want to have a complete new dataframe with own index you can add .reset_index() to the above line.

import pandas as pd

df = pd.DataFrame([
    ["fruit", "red", "apple"],
    ["fruit", "yellow", "banana"],
    ["meat", "brown", "steak"],
    ["fruit", "green", "apple"],
    ["fruit", "orange", "orange"],
    ["veg", "orange", "carrot"],
    ["fruit", "yellow", "apple"],
    ["meat", "brown", "steak"],
    ["veg", "orange", "carrot"]
],
    columns=["type", "color", "name"])

print(df)

y = df['name'].value_counts().loc[lambda x: x > 1]

print(y)

df_2 = df[df["name"].isin(y.index.tolist())]

print(df_2)

Output:

    type   color    name
0  fruit     red   apple
1  fruit  yellow  banana
2   meat   brown   steak
3  fruit   green   apple
4  fruit  orange  orange
5    veg  orange  carrot
6  fruit  yellow   apple
7   meat   brown   steak
8    veg  orange  carrot
apple     3
steak     2
carrot    2
Name: name, dtype: int64
    type   color    name
0  fruit     red   apple
2   meat   brown   steak
3  fruit   green   apple
5    veg  orange  carrot
6  fruit  yellow   apple
7   meat   brown   steak
8    veg  orange  carrot
  •  Tags:  
  • Related