I have the following dataframe:
   column1   column2
0  Paul      xx
1  John      aa
2  Paul      gg
3  John      xx
4  John      bb
5  George    gg
6  Paul      gg
7  John      xx
.
n  Jonathan  ff
I want to have the information of each person in one row. On the same row I want to have the index but in another column. So I want a dataframe like this:
   column1   column2  column3
0  Paul      0,2,6    xx, gg, gg
1  John      1,3,4,7  aa, xx, bb, xx
5  George    5        gg
.
.
.
n  Jonathan  n        ff
To build the dataframe above, I run:
df2 = df.reset_index().groupby('column1').agg(list).reset_index()
ix = pd.Index(df2['index'].str.get(0)).rename(None)
df3 = df2.set_index(ix).sort_index()
df3
Which returns:
   column1  index         column2
0  Paul     [0, 2, 6]     [xx, gg, gg]
1  John     [1, 3, 4, 7]  [aa, xx, bb, xx]
5  George   [5]           [gg]
After that, I drop the column1 and index columns.
To show the values of column2 as plain strings rather than lists, I run:
def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(x))
    return df3

dfb = transform_list(df3)
df3.head()
which returns:
   column2
0  xx, gg, gg
1  aa, xx, bb, xx
5  gg
Now I want to keep only the unique values in each row, so that my final dataframe is:
   column2
0  xx, gg
1  aa, xx, bb
5  gg
Any ideas?
CodePudding user response:
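You could convert each list to a set and then back to a list to eliminate duplicate entries. This can be done inside your lambda function. Note that a set does not preserve the original order of the elements, so the order in the output may differ from the input: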
import pandas as pd

df = pd.DataFrame({'column2': [['xx', 'gg', 'gg'],
                               ['aa', 'xx', 'bb', 'xx'],
                               ['gg']]},
                  index=[0, 1, 5])
df['column2'] = df.column2.apply(lambda x: ', '.join(list(set(x))))
df
   column2
0  gg, xx
1  bb, xx, aa
5  gg
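If a repeatable result is preferable, one option is to sort the set before joining. This is only a sketch: sorting alphabetically is an assumption on my part, not something the question asks for, and it does not restore the original order of each list.

```python
import pandas as pd

# Same sample data as in the snippet above.
df = pd.DataFrame({'column2': [['xx', 'gg', 'gg'],
                               ['aa', 'xx', 'bb', 'xx'],
                               ['gg']]},
                  index=[0, 1, 5])

# set() drops the duplicates; sorted() turns the set back into a list
# with a fixed (alphabetical) order, so the string is the same on every run.
df['column2'] = df['column2'].apply(lambda x: ', '.join(sorted(set(x))))
print(df)
```

With this variant the rows come out as "gg, xx", "aa, bb, xx" and "gg".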
CodePudding user response:
As long as the order of the elements in your output doesn't matter, you can redefine your function as follows:
def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(set(x)))
    return df3
A set can only contain unique elements, so converting the list x to a set discards any duplicates. However, sets are unordered, so the elements may come out in a different order than they went in.
If order does matter, you can use this version instead:
def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(list(dict.fromkeys(x))))
    return df3
This creates a dictionary (insertion-ordered since Python 3.7) whose keys are the elements of your initial list x. Because a dictionary cannot hold the same key twice, only the unique elements remain, in the order in which they first appeared. Converting back to a list takes the keys from the dictionary, and the rest of the workflow can continue unchanged.
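To see the dict.fromkeys behaviour on its own, here is a small sketch in plain Python, without pandas. The helper name unique_in_order and the rows dictionary are made up for illustration; the values mirror the three rows from the question.

```python
def unique_in_order(values):
    """Return the distinct items of `values`, keeping first-seen order."""
    # Dictionary keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(values))


# Hypothetical stand-in for the column2 lists, keyed by the dataframe index.
rows = {0: ['xx', 'gg', 'gg'], 1: ['aa', 'xx', 'bb', 'xx'], 5: ['gg']}

joined = {idx: ', '.join(unique_in_order(vals)) for idx, vals in rows.items()}
print(joined)
```

This gives 'xx, gg' for row 0, 'aa, xx, bb' for row 1 and 'gg' for row 5, which matches the output requested in the question. The same helper could be used in the dataframe via df3['column2'].apply(lambda x: ', '.join(unique_in_order(x))).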
