I have a pandas dataframe. I want to count all the values in a column, to understand which of them is repeated. Then, I want to extract just the repeated values, and I want to use them to create a subdataframe.
Let’s make an example. Say that this is my dataframe:
df =
type color name
0 fruit red apple
1 fruit yellow banana
2 meat brown steak
3 fruit green apple
4 fruit orange orange
5 veg orange carrot
6 fruit yellow apple
7 meat brown steak
8 veg orange carrot
I want to know if in the column ‘name’ is there any repeated value. To do it, I use this line of code:
df['name'].value_counts().loc[lambda x : x>1]
And this is what I get:
apple 3
steak 2
carrot 2
Then, I want to create a subdataframe, filtering the ‘name’ column with ‘apple’, ‘steak’, ‘carrot’, to find the values related to the other column. This can be done with a proper function, of course.
The desired output is:
sub_df =
type color name
0 fruit red apple
1 fruit green apple
2 fruit yellow apple
3 meat steak brown
4 meat steak brown
5 veg orange carrot
6 veg orange carrot
I’ve tried different kind of code, with no success. I think the problem is in the use of df.count_values() because it gives me a Pandas series with the number of occurrencies, with no way to access the values that this method counts.
Any suggestion on how to do it?
CodePudding user response:
You do not need to do this in two steps, here is how to use groupby and filter to achieve the final result:
df.groupby('name').filter(lambda g: g['type'].count() > 1).sort_values('name')
output:
type color name
0 fruit red apple
3 fruit green apple
6 fruit yellow apple
5 veg orange carrot
8 veg orange carrot
2 meat brown steak
7 meat brown steak
CodePudding user response:
For the next time, please provide better test data (data to copy and paste).
I think your desired output is wrong, because there is a steakvalue in color column.
I've tried the following that should do what you want. I think you understand the code, i only added the following line:
df[df["name"].isin(y.index.tolist())]
it searches in the name column of the dataframe for all values (isin) of the index values of the Series. If you want to have a complete new dataframe with own index you can add .reset_index() to the above line.
import pandas as pd
df = pd.DataFrame([
["fruit", "red", "apple"],
["fruit", "yellow", "banana"],
["meat", "brown", "steak"],
["fruit", "green", "apple"],
["fruit", "orange", "orange"],
["veg", "orange", "carrot"],
["fruit", "yellow", "apple"],
["meat", "brown", "steak"],
["veg", "orange", "carrot"]
],
columns=["type", "color", "name"])
print(df)
y = df['name'].value_counts().loc[lambda x: x > 1]
print(y)
df_2 = df[df["name"].isin(y.index.tolist())]
print(df_2)
Output:
type color name
0 fruit red apple
1 fruit yellow banana
2 meat brown steak
3 fruit green apple
4 fruit orange orange
5 veg orange carrot
6 fruit yellow apple
7 meat brown steak
8 veg orange carrot
apple 3
steak 2
carrot 2
Name: name, dtype: int64
type color name
0 fruit red apple
2 meat brown steak
3 fruit green apple
5 veg orange carrot
6 fruit yellow apple
7 meat brown steak
8 veg orange carrot
