First N distinct column values

Goal: extract a list of the first N distinct values of a column.

Distinct here means different from one another, not values that occur only once in the DataFrame.

For example, the first 5 distinct values of col A.

DataFrame:

             A    B    C
0         BERT  foo  bar
1         BERT  foo  bar
2          MLP  foo  bar
3       Albert  foo  bar
4       Albert  foo  bar
5       Albert  foo  bar
6      Roberta  foo  bar
7   Roberta v2  foo  bar
8   Roberta v2  foo  bar
9      BigBird  foo  bar
10      Muppet  foo  bar

Desired Output:

top_5 = ['BERT', 'MLP', 'Albert', 'Roberta', 'Roberta v2']

In effect, skip repeated values and ignore any distinct values that come after the first N.

Please let me know if there's anything else I should clarify in this post.

CodePudding user response：

Use Series.unique, select the first 5 values, and convert the result to a list:

first_5_unique = df.A.unique()[:5].tolist()

Or use Series.drop_duplicates with Series.head:

first_5_unique = df.A.drop_duplicates().head(5).tolist()  # head() defaults to 5; passing it makes N explicit
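
For reference, here is a minimal runnable sketch of both approaches on the sample data from the question. The helper name first_n_distinct is hypothetical (it is not a pandas function), and columns B and C are left out because they do not affect the result.

```python
import pandas as pd

# Rebuild column A from the question (B and C are omitted for brevity).
df = pd.DataFrame({
    "A": ["BERT", "BERT", "MLP", "Albert", "Albert", "Albert",
          "Roberta", "Roberta v2", "Roberta v2", "BigBird", "Muppet"],
})


def first_n_distinct(series, n):
    """Hypothetical helper: first n distinct values, in order of appearance."""
    # drop_duplicates keeps the first occurrence of each value by default.
    return series.drop_duplicates().head(n).tolist()


top_5 = first_n_distinct(df["A"], 5)

# Same result via Series.unique, which also preserves order of appearance.
via_unique = df["A"].unique()[:5].tolist()

print(top_5)  # ['BERT', 'MLP', 'Albert', 'Roberta', 'Roberta v2']
```

Both variants scan the whole column before slicing, so the choice between them is mostly a matter of taste.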

CodePudding user response：

If you have a large dataframe, an efficient alternative is a lazy pipeline built from itertools.islice and more_itertools.unique_everseen:

# pip install more-itertools
from itertools import islice
from more_itertools import unique_everseen

list(islice(unique_everseen(df['A']), 5))

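This can be much faster on a large column when the first N distinct values appear early, because iteration stops as soon as enough elements are collected, whereas pandas' unique reads the whole column. The gain depends on the data: if the distinct values are spread far apart, most of the column still has to be scanned.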
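
If installing more-itertools is not an option, the same early-stopping idea can be sketched with the standard library alone. The names unique_in_order and first_n_distinct_lazy are hypothetical, and the sketch assumes the column values are hashable.

```python
from itertools import islice


def unique_in_order(iterable):
    """Yield each value the first time it appears (values must be hashable)."""
    seen = set()
    for value in iterable:
        if value not in seen:
            seen.add(value)
            yield value


def first_n_distinct_lazy(iterable, n):
    """Collect the first n distinct values, stopping as soon as n are found."""
    return list(islice(unique_in_order(iterable), n))


# Plain list standing in for df['A']; a pandas Series can be passed the same way.
column = ["BERT", "BERT", "MLP", "Albert", "Albert", "Albert",
          "Roberta", "Roberta v2", "Roberta v2", "BigBird", "Muppet"]

result = first_n_distinct_lazy(column, 5)
print(result)  # ['BERT', 'MLP', 'Albert', 'Roberta', 'Roberta v2']
```

On the sample data only the first 8 rows are read, since the fifth distinct value sits at index 7.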