Getting a set of a list from a list of strings

I have this dataframe:

import re
import pandas as pd
df = pd.DataFrame({"c1":["[\"text\",\"text2\"]","[\"bla\",\"bla\",\"bla\"]"]})

and I'm removing the [] and " characters:

df["c2"] = df["c1"].apply(lambda x: re.sub(r'[\["\]]', "", x))

then I want to add df['c2'] to a list:

list = df['c2'].to_list()

Then I get this: ['text,text2', 'bla,bla,bla']

So far so good. But then I want a list with only unique values, which I could do using set(list).

The problem is that instead of ['text,text2', 'bla,bla,bla'] I need ['text','text2', 'bla','bla','bla'], so that when I apply set(list) I get what I am expecting:

['text','text2','bla']

CodePudding user response:

First, don't use list as a variable name, because it shadows the built-in list type. Second, once you have ['text,text2',...] you can use str.split to break each string on the commas. So your set would be

{y for x in df['c2'].str.split(',') for y in x}

Output:

{'bla', 'text', 'text2'}
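
If you would rather stay inside pandas, the same flattening can probably be done with explode. This is only a sketch, assuming c2 already holds the comma-separated strings from the question; the names pieces and unique_values are made up for illustration.

```python
import pandas as pd

# Assumed input: the cleaned column from the question.
df = pd.DataFrame({"c2": ["text,text2", "bla,bla,bla"]})

# Split each string on commas, then give every piece its own row.
pieces = df["c2"].str.split(",").explode()

# Collapse the flattened Series into a set of unique values.
unique_values = set(pieces)
print(unique_values)
```

As with the comprehension, the result is a set, so the order of the elements is not guaranteed.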

Note: You can also use a regex directly on c1 to extract everything between the double quotes:

set(df['c1'].str.extractall(r'"([^"]+)"')[0])
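
To check what such a pattern captures without pandas, here is a small sketch using re.findall on the raw strings. It assumes the intended pattern is "one or more non-quote characters between a pair of double quotes"; the names raw, pattern and unique_values are only for illustration.

```python
import re

# The original strings from column c1.
raw = ['["text","text2"]', '["bla","bla","bla"]']

# One or more non-quote characters between a pair of double quotes.
pattern = re.compile(r'"([^"]+)"')

# findall returns the captured group for every match in a string.
unique_values = {match for s in raw for match in pattern.findall(s)}
print(unique_values)
```

Because the pattern works on the quoted values themselves, the intermediate step of stripping the brackets and quotes is not needed here.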

CodePudding user response:

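Try this (store the list under a name other than list, otherwise the list(...) call at the end fails because the built-in is shadowed):

values = df['c2'].to_list()  # e.g. ['text,text2', 'bla,bla,bla']
new = []
for item in values:
    # split each comma-separated string and collect the pieces
    new.extend(item.split(','))
new = list(set(new))  # remove duplicates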

which results in new being something like this (the order may vary, because sets are unordered):

['text2', 'text', 'bla']
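
If the order of first appearance matters, as in the expected output in the question, one possible variation is to deduplicate with dict.fromkeys instead of set. This is a sketch rather than part of the answer above; it relies on dicts preserving insertion order (Python 3.7+), and the names values, flat and ordered_unique are made up.

```python
# Assumed input: the list obtained from df['c2'].to_list().
values = ["text,text2", "bla,bla,bla"]

# Flatten: split every string on commas and collect all the pieces.
flat = [piece for item in values for piece in item.split(",")]

# dict keys are unique and keep insertion order, so this drops
# duplicates while preserving the order of first appearance.
ordered_unique = list(dict.fromkeys(flat))
print(ordered_unique)  # ['text', 'text2', 'bla']
```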