Goal: extract a list of the first N distinct values of a column.
Distinct here means different from each other; a value does not have to occur only once in the DataFrame.
For example, the first 5 distinct values of column A.
DataFrame:
             A    B    C
0         BERT  foo  bar
1         BERT  foo  bar
2          MLP  foo  bar
3       Albert  foo  bar
4       Albert  foo  bar
5       Albert  foo  bar
6      Roberta  foo  bar
7   Roberta v2  foo  bar
8   Roberta v2  foo  bar
9      BigBird  foo  bar
10      Muppet  foo  bar
Desired Output:
top_5 = ['BERT', 'MLP', 'Albert', 'Roberta', 'Roberta v2']
Effectively, this ignores duplicates and any distinct values beyond the first N.
Please let me know if there's anything else I should clarify in this post.
CodePudding user response:
Use Series.unique, slice the first 5 values, and convert the result to a list:
first_5_unique = df.A.unique()[:5].tolist()
Or use Series.drop_duplicates with Series.head (head() returns 5 rows by default; pass N explicitly for any other count):
first_5_unique = df.A.drop_duplicates().head(5).tolist()
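To check both approaches against the example in the question, here is a minimal sketch. The DataFrame is rebuilt by hand from the question's table, and N, via_unique and via_drop_duplicates are names introduced here for illustration only.

```python
import pandas as pd

# Example data rebuilt from the question; B and C are constant columns.
df = pd.DataFrame({
    "A": ["BERT", "BERT", "MLP", "Albert", "Albert", "Albert",
          "Roberta", "Roberta v2", "Roberta v2", "BigBird", "Muppet"],
    "B": "foo",
    "C": "bar",
})

N = 5  # hypothetical name: how many distinct values to keep

# unique() returns values in order of first appearance, so slicing is enough.
via_unique = df["A"].unique()[:N].tolist()

# drop_duplicates() keeps the first occurrence of each value by default.
via_drop_duplicates = df["A"].drop_duplicates().head(N).tolist()

print(via_unique)
print(via_drop_duplicates)
```

Both lists should match the desired output from the question, since both methods preserve the order in which values first appear.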
CodePudding user response:
If you have a large DataFrame, an efficient solution is to iterate lazily, combining more_itertools.unique_everseen with itertools.islice:
# pip install more-itertools
from itertools import islice
from more_itertools import unique_everseen
list(islice(unique_everseen(df['A']), 5))
This can be much faster on a large column when the first N distinct values appear early, because iteration stops as soon as enough elements are collected, whereas pandas' unique reads the whole column.
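If installing more-itertools is not an option, the same early-stopping idea can be sketched with the standard library alone. The helper names iter_distinct and first_n_distinct are hypothetical, and the sketch assumes the column values are hashable.

```python
from itertools import islice


def iter_distinct(values):
    """Lazily yield each value the first time it appears, in input order."""
    seen = set()
    for value in values:
        if value not in seen:
            seen.add(value)
            yield value


def first_n_distinct(values, n):
    """Return the first n distinct values, stopping as soon as n are found."""
    return list(islice(iter_distinct(values), n))


# Column A from the question, written as a plain list for illustration.
column = ["BERT", "BERT", "MLP", "Albert", "Albert", "Albert",
          "Roberta", "Roberta v2", "Roberta v2", "BigBird", "Muppet"]

top_5 = first_n_distinct(column, 5)
print(top_5)
```

A pandas Series is iterable, so first_n_distinct(df['A'], 5) should work the same way. With this data, the scan stops at the first 'Roberta v2' and never reads the rows after it.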
