I have the following df:
import pandas as pd
data = {'Org': ['<a href="/00xO" target="_blank">Chocolate</a>'],
        'Owner': ['Charlie']
        }
df = pd.DataFrame(data)
print(df)
and when I apply the lambda function below, instead of giving me 'Chocolate' it's returning 0.
df['Correct Org']=df['Org'].apply(lambda st: st[st.find(">"):st.find("<")])
I've tried adding 'str' as follows:
df['Correct Org']=df['Org'].str.apply(lambda st: st[st.find(">") + 1:st.find("<")])
& get the following error:
AttributeError: 'StringMethods' object has no attribute 'apply'
CodePudding user response:
You're getting an empty string because df['Org'][0].find(">") returns 31 but df['Org'][0].find("<") returns 0. The slice st[st.find(">"):st.find("<")] is therefore st[31:0], and a slice whose stop comes before its start is empty. You can use bs4.BeautifulSoup to create a soup object and get the text inside the <a> tag directly:
from bs4 import BeautifulSoup
df['Org'] = df['Org'].apply(lambda x: BeautifulSoup(x, 'html.parser').text)
Output:
         Org    Owner
0  Chocolate  Charlie
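To see why the original slice comes back empty, the two indices can be checked on the plain string, with no pandas involved. The sketch below also shows one way to repair the slice by searching for "<" only after the first ">". The helper name text_between_tags is made up for this illustration, and it assumes a single simple tag pair around the text.

```python
st = '<a href="/00xO" target="_blank">Chocolate</a>'

start = st.find(">")   # index of the first '>' (end of the opening tag)
end = st.find("<")     # index of the first '<', which is the very first character

broken = st[start:end]  # stop index precedes start index, so the slice is empty


def text_between_tags(s):
    """Return the text between the first '>' and the next '<' after it.

    Hypothetical helper: assumes one simple tag pair around the text.
    """
    open_end = s.find(">")
    if open_end == -1:
        return s  # no tag found, return the input unchanged
    close_start = s.find("<", open_end + 1)  # search only after the opening tag
    if close_start == -1:
        return s[open_end + 1:]
    return s[open_end + 1:close_start]


fixed = text_between_tags(st)
```

On a DataFrame the same helper can be passed to the column directly, e.g. df['Org'].apply(text_between_tags). Note that apply belongs to the Series itself, not to the .str accessor, which is why df['Org'].str.apply raises the AttributeError from the question.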
CodePudding user response:
Use BeautifulSoup for parsing HTML tags:
from bs4 import BeautifulSoup
df['Correct Org']=df['Org'].apply(lambda st: ','.join(BeautifulSoup(st, features="lxml").find_all(string=True)))
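Joining all text nodes with ',' matters once the markup is nested, because each fragment becomes a separate item. The sketch below imitates that behaviour with the standard library's html.parser instead of BeautifulSoup, so it is only a rough analogue of the call above. TextCollector and text_nodes are hypothetical names, and the nested example string is made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every run of text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for each piece of text outside of tags
        self.chunks.append(data)


def text_nodes(html):
    """Return the list of text fragments in an HTML snippet."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


simple = text_nodes('<a href="/00xO" target="_blank">Chocolate</a>')

# Made-up nested markup: the text is split across two nodes
nested = text_nodes('<a href="/00xO"><b>Dark</b> Chocolate</a>')
joined = ",".join(nested)
```

For the single-tag value in the question there is only one fragment, so the join adds nothing. With nested tags a comma appears between fragments, which may or may not be what you want in the new column.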
CodePudding user response:
If you don't want to use BeautifulSoup, I wrote a simple function for you.
A FUNCTION FOR GETTING THE LINK TEXT
def getOrg(link):
    # Return the text between the opening tag's '>' and the closing '</'
    link = str(link)
    start = link.find('>') + 1
    return link[start:link.find('</', start)]
FOR EXAMPLE
import pandas as pd
data = {'Org': ['<a href="/00xO" target="_blank">Chocolate</a>'],
        'Owner': ['Charlie']
        }
df = pd.DataFrame(data)
# Function Call (on a single cell; use df['Org'].apply(getOrg) for the whole column)
print(getOrg(df['Org'][0]))
OUTPUT
Chocolate
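Another option without BeautifulSoup is a regular expression that deletes anything shaped like a tag. This is a minimal sketch, not a general HTML parser: it assumes simple markup with no '>' inside attribute values, and the names TAG_PATTERN and strip_tags are made up for the example.

```python
import re

# Matches '<', then one or more characters that are not '>', then '>'
TAG_PATTERN = r"<[^>]+>"


def strip_tags(s):
    """Remove every tag-like substring and keep the remaining text."""
    return re.sub(TAG_PATTERN, "", s)


result = strip_tags('<a href="/00xO" target="_blank">Chocolate</a>')
```

The same pattern can be given to the vectorized string accessor, df['Org'].str.replace(TAG_PATTERN, '', regex=True), which processes the whole column without apply.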
