Return the unique values of a specific column line by line

I have the following dataframe:

  column1 column2  
0 Paul       xx       
1 John       aa      
2 Paul       gg      
3 John       xx       
4 John       bb       
5 George     gg       
6 Paul       gg
7 John       xx
.
n Jonathan   ff

I want the information for each person on one row, with the original row indices of that person collected in a separate column on the same row. So I want a dataframe like this:

  column1     column2   column3  
 0 Paul        0,2,6     xx, gg, gg       
 1 John        1,3,4,7   aa, xx, bb, xx             
 5 George      5         gg       
 .
 .
 .
 n Jonathan    n         ff

To build the above dataframe, I execute:

df2 = df.reset_index().groupby('column1').agg(list).reset_index()
ix = pd.Index(df2['index'].str.get(0)).rename(None)
df3 = df2.set_index(ix).sort_index()
df3

Which returns:

      column1      index          column2
0        Paul     [0, 2, 6]      [xx, gg, gg]
1        John     [1, 3, 4, 7]   [aa, xx, bb,xx]
5      George        [5]             [gg]

After that, I delete column1 and index. To turn the values of column2 into plain strings rather than lists, I execute:

def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(x))
    return df3

dfb=transform_list(df3)

df3.head()

which returns:

    column2
0   xx, gg, gg
1   aa, xx, bb,xx
5   gg

Now I want to keep only the unique values in each row,

so that my final dataframe will be:

  column2
0   xx, gg
1   aa, xx, bb
5   gg

Any ideas?

CodePudding user response:

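You could convert each list to a set and then back to a list to eliminate duplicate entries. Note that a set does not preserve the original order of the elements, so the order in the output may differ from the input. This can be done within your lambda function: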

import pandas as pd

df = pd.DataFrame({'column2': [['xx', 'gg', 'gg'],
                               ['aa', 'xx', 'bb', 'xx'],
                               ['gg']]},
                  index=[0, 1, 5])

df['column2'] = df.column2.apply(lambda x: ', '.join(list(set(x))))
df

    column2
0   gg, xx
1   bb, xx, aa
5   gg
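
Because the iteration order of a set of strings is not guaranteed, the exact output above can vary between runs. If a reproducible result is wanted and alphabetical order is acceptable, one option (a variation on the above, not part of the original answer) is to sort the set before joining. A minimal sketch, using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'column2': [['xx', 'gg', 'gg'],
                               ['aa', 'xx', 'bb', 'xx'],
                               ['gg']]},
                  index=[0, 1, 5])

# set() removes the duplicates; sorted() then returns a list in
# alphabetical order, so the joined string is the same on every run.
df['column2'] = df['column2'].apply(lambda x: ', '.join(sorted(set(x))))
print(df)
```

This gives "gg, xx", "aa, bb, xx" and "gg" for rows 0, 1 and 5. The original order of appearance is still lost.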

CodePudding user response:

As long as the order of the elements in your output doesn't matter, you can redefine your function as follows:

def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(set(x)))
    return df3

A set inherently contains only unique elements, so converting the list x to a set discards any duplicates. Sets are unordered, however, so you may get unintended results if order matters.

If order does matter, you can use this version instead:

def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(list(dict.fromkeys(x))))
    return df3

This creates a dictionary (which preserves insertion order as of Python 3.7) with keys taken from your initial list x. Since a dictionary cannot contain duplicate keys, we end up with only the unique elements, in order of first appearance. Converting back to a list takes the keys from the dictionary, and the rest of the workflow can continue without alteration.
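
As a rough end-to-end illustration, the same dict.fromkeys idea can also be applied directly inside the groupby, skipping the intermediate lists. This is only a sketch under some assumptions: the sample frame is rebuilt from the first eight rows of the question, join_unique is a hypothetical helper name, and the result is indexed by the person's name rather than by the first row index as in the question.

```python
import pandas as pd

# Sample data rebuilt from the question (rows 0 to 7 only).
df = pd.DataFrame({
    'column1': ['Paul', 'John', 'Paul', 'John', 'John', 'George', 'Paul', 'John'],
    'column2': ['xx', 'aa', 'gg', 'xx', 'bb', 'gg', 'gg', 'xx'],
})

def join_unique(values):
    """Join values with ', ', keeping only the first occurrence of each."""
    # dict.fromkeys drops duplicates while keeping insertion order.
    return ', '.join(dict.fromkeys(values))

# sort=False keeps the groups in order of first appearance.
result = df.groupby('column1', sort=False)['column2'].agg(join_unique)
print(result)
```

Here result is a Series with "xx, gg" for Paul, "aa, xx, bb" for John and "gg" for George.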