Home > Net >  Join 2 columns with list values in pandas(Avoiding duplicates and NaN)
Join 2 columns with list values in pandas(Avoiding duplicates and NaN)

Time:01-07

I have pandas dataframe as follows,

loc_1                               loc_2                             

[mumbai, gujarat, sri lanka]        [chennai, UP]
[Goa, telangana]                    [Kashmir, Goa, Rajkot]
NaN                                 [Bihar, Orissa]

I want to create a new column that is a combination of both the above columns, I did search other similar questions but the issue I am facing is that,

When I do,

data['locations'] = data['loc_1']   data['loc_2']

Output
--------
loc_1                               loc_2                       locations

[mumbai, gujarat, sri lanka]        [chennai, UP]                [mumbai, gujarat, sri lanka,chennai, UP]   
[Goa, telangana]                    [Kashmir, Goa, Rajkot]       [Goa, telangana,Kashmir, Goa, Rajkot]
NaN                                 [Bihar, Orissa]              NaN

Issue

As you can see above, there are duplicate values as well as NaN values formed. How to avoid them?

Remember

The original dataset contains values in list, str and NaN format.

Dataset:


loc = pd.DataFrame({
'loc_1': [['mumbai', 'gujarat', 'sri lanka'],['Goa', 'telangana'],np.nan],
'loc_2':[['chennai','UP'],['kashmir','goa','rajkot'],['bihar','orissa']],
'loc_3':['Chennai','Bangalore','Vizag']

})

CodePudding user response:

First join values with replace NaNs (floats) to empty lists:

data['locations'] = data['loc_1'].apply(lambda x: [] if isinstance(x, float) else x)   data['loc_2']

And then remove duplicates with same order like original by converting to dictionaries by dict.fromkeys:

data['locations'] = data['locations'].apply(lambda x: list(dict.fromkeys(x)))

If the order is not important you can use a set:

data['locations'] = data['locations'].apply(lambda x: list(set(x)))

CodePudding user response:

If you use loc.fillna("", inplace=True), then additions containing empty values should not result in NaNs anymore.

To filter duplicates from a column containing lists, use:

loc['locations'] = loc['locations'].apply(lambda locs: list(set(locs)))
  •  Tags:  
  • Related