I have this dataframe:
df = pd.DataFrame({"c1":["[\"text\",\"text2\"]","[\"bla\",\"bla\",\"bla\"]"]})
and I'm removing the [] and " characters:
df["c2"] = df["c1"].apply(lambda x: re.sub(r'[\[\]"]', "", x))
Then I want to turn df['c2'] into a list:
list = df['c2'].to_list()
Then I get this: ['text,text2', 'bla,bla,bla']
So far so good. But then I want a list with only unique values, which I could get using set(list).
The problem is that instead of ['text,text2', 'bla,bla,bla'] I need ['text','text2', 'bla','bla','bla'], so that when I apply set(list) I get what I expect:
['text','text2','bla']
CodePudding user response:
First, don't use list as a variable name, because it shadows the built-in list. Second, once you get ['text,text2',...] you can use str.split. So your set would be:
{y for x in df['c2'].str.split(',') for y in x}
Output:
{'bla', 'text', 'text2'}
Note: You can also use a regex directly to extract everything between the double quotes:
set(df['c1'].str.extractall(r'"([^"]+)"')[0])
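For reference, here is a minimal end-to-end sketch of both approaches from this answer, run on the dataframe from the question. The names unique_split and unique_extract are only illustrative, and the extraction pattern assumes each value is a run of one or more non-quote characters between double quotes:

```python
import re
import pandas as pd

df = pd.DataFrame({"c1": ['["text","text2"]', '["bla","bla","bla"]']})

# Approach 1: strip brackets and quotes, then split each string on commas.
df["c2"] = df["c1"].apply(lambda x: re.sub(r'[\[\]"]', "", x))
unique_split = {y for x in df["c2"].str.split(",") for y in x}

# Approach 2: pull every quoted token straight out of the raw column.
# extractall returns a DataFrame; column 0 holds the single capture group.
unique_extract = set(df["c1"].str.extractall(r'"([^"]+)"')[0])

print(sorted(unique_split))
print(sorted(unique_extract))
```

Both sets should contain the same three values. Note that the split approach would break a value apart if it contained a comma itself, while the extraction approach would not.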
CodePudding user response:
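Try this (store the list from the question under a name other than list, for example values, so that the built-in list is still callable):
values = df['c2'].to_list()
new = []
for item in values:
    new.extend(item.split(','))
new = list(set(new))
which results in new being (a set is unordered, so the order may differ):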
['text2', 'text', 'bla']
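If the order of first appearance matters, one possible variation is to deduplicate with dict.fromkeys instead of set. This is a sketch, not part of the answer above; values, flat and unique_in_order are illustrative names, and values is assumed to hold the list produced in the question:

```python
values = ["text,text2", "bla,bla,bla"]  # assumed: the list from the question

# Split each entry on commas and flatten the pieces into one list.
flat = [token for entry in values for token in entry.split(",")]

# dict.fromkeys keeps only the first occurrence of each key, in insertion order.
unique_in_order = list(dict.fromkeys(flat))
print(unique_in_order)  # ['text', 'text2', 'bla']
```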
