How do I separate these data from each other

I have a data set in which the value in every cell is mixed up with the column name, as illustrated below:

Gender
“Gender”:”male”
“Gender”:”female”
“Gender”:”male”
“Gender”:”female”

I am cleaning it in Anaconda and have tried everything I can think of, to no avail. I want it to look like this:

Gender
Male
Female
Male
Female

CodePudding user response：

You can use the pandas apply function like this:

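import pandas as pd

df = pd.DataFrame({"Gender": ['“Gender”:”male”', '“Gender”:”female”', '“Gender”:”male”', '“Gender”:”female”']})

def cln(st):
    # Keep the part after the colon, strip the curly quotes, and capitalize it
    me = st.split(":")
    return me[1].strip("“”").capitalize()

# Assign the result back so the column is actually updated
df["Gender"] = df["Gender"].apply(cln)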

CodePudding user response：

Based on your question, I have recreated the DataFrame as follows:

import pandas as pd
df = pd.DataFrame({"Gender":['“Gender”:”male”',
 '“Gender”:”female”',
 '“Gender”:”male”',
 '“Gender”:”female”']})

The DataFrame looks like this:

              Gender
0    “Gender”:”male”
1  “Gender”:”female”
2    “Gender”:”male”
3  “Gender”:”female”

Here is code that solves the issue:

# For each column: drop the closing curly quotes, keep the text after the colon, capitalize it
for i in df.columns:
    df[i] = [j.replace("”",'').split(":")[-1].capitalize() for j in df[i]]

Output df:

   Gender
0    Male
1  Female
2    Male
3  Female
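
Since the question says every cell carries its column name, the same idea can be sketched for more than one column. This is only an illustration under assumptions: the "Age" column and the helper name strip_label are hypothetical and do not come from the original data.

```python
import pandas as pd

# Hypothetical frame: "Age" is an invented column used only for illustration.
df = pd.DataFrame({
    "Gender": ['“Gender”:”male”', '“Gender”:”female”'],
    "Age": ['“Age”:”31”', '“Age”:”27”'],
})

def strip_label(cell):
    # Keep the text after the last colon, then drop curly or straight quotes.
    value = str(cell).split(":")[-1]
    return value.strip('“”"')

# Clean every column, cell by cell.
cleaned = df.apply(lambda col: col.map(strip_label))

# Post-process per column: capitalize text, convert numbers.
cleaned["Gender"] = cleaned["Gender"].str.capitalize()
cleaned["Age"] = cleaned["Age"].astype(int)
```

Splitting on the last colon assumes the values themselves never contain a colon; if they might, splitting once from the left with split(":", 1) would be the safer choice.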

CodePudding user response：

The data contains curly (typographic) quote characters, so it needs some cleanup first. You can simply use the .str accessor on the Series object to work directly with the string values.

df.Gender.str.replace(r'”|“', '', regex=True)\
         .str.split(":", expand=True)[1]\
         .str.capitalize()

0      Male
1    Female
2      Male
3    Female
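
If some rows might use straight quotes instead of curly ones, a single regular expression can cover both cases. The following is a minimal sketch, assuming every cell has the label:value shape shown in the question; the second sample row with straight quotes is made up to show that case, and the result is assigned back so the column is updated.

```python
import pandas as pd

# Second row uses straight quotes; it is an invented example for this sketch.
df = pd.DataFrame({"Gender": ['“Gender”:”male”', '"Gender":"female"']})

# Capture the text after the colon, ignoring any straight or curly quotes.
pattern = r':\s*[“”"]?([^“”"]+)[“”"]?\s*$'

# expand=False returns a Series because the pattern has a single group.
df["Gender"] = df["Gender"].str.extract(pattern, expand=False).str.capitalize()
```

Rows that do not match the pattern come back as missing values, which makes malformed cells easy to spot afterwards.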