So I've actually solved this, but the way I'm doing it may not be the most efficient.
For a column in my database - Industry - I want to replace values. If a value contains the word "tech", "technology" or something similar, I want to replace that value with just the word "technology".
I've followed a basic algorithm below using apply which basically loops through a predefined list (e.g. science) and checks whether any of the values are present in the current Industry cell, and replaces if they are.
It then does the same for the next list. I only have two lists so far, but I'll likely have over a dozen once I'm finished.
def industry_convert(row):
science = ["research", "science", "scientific", "scientist", "academia", "education", "academic"]
tech = ["technology", "tech", "software"]
for v in science:
if v.lower() in row.Industry.lower():
row.Industry = "Research, Science, & Education"
for v in tech:
if v.lower() in row.Industry.lower():
row.Industry = "Technology"
return row
df = df.apply(industry_convert, axis = 1)
I'm just wondering if this is the best way to do this, or if there is a more pythonic or pandas way of doing it?
EDIT:
This is what some of the Industry column looks like:
Industry
Research Scientist
Science: Education
Tech
Technical Assistance
Technology
Medical
Hospitality
This what it would look like after applying the code:
Industry
Research, Science, & Education
Research, Science, & Education
Technology
Technology
Technology
Medical
Hospitality
CodePudding user response:
Personally, I would use str.contains and .loc to assign new values.
this will work a number of times faster than looping over each row individually to check. (Which is an anti pattern with regards to the pandas API)
science = ["research", "science", "scientific", "scientist", "academia", "education", "academic"]
tech = ["technology", "tech", "software"]
df.loc[df['Industry'].str.contains(f"{'|'.join(science)}",regex=True,case=False),
'industry_new'] = "Research, Science, & Education"
df.loc[df['Industry'].str.contains(f"{'|'.join(tech)}",regex=True,case=False),
'industry_new'] = "Technology"
df['industry_new'] = df['industry_new'].fillna(df['Industry'])
print(df)
Industry industry_new
0 Research Scientist Research, Science, & Education
1 Science: Education Research, Science, & Education
2 Tech Technology
3 Technical Assistance Technology
4 Technology Technology
5 Medical Medical
6 Hospitality Hospitality
CodePudding user response:
I updated the for loop, in your function:
science = list(map(lambda x:x.lower(),["research", "science", "scientific", "scientist", "academia", "education", "academic"]))
tech = list(map(lambda x:x.lower(),["technology", "tech", "software"]))
def industry_convert(row):
global science,tech
if row.Industry.lower() in science:
row.Industry = "Research, Science, & Education"
if row.Industry.lower() in science:
row.Industry = "Technology"
return row
df = df.apply(industry_convert, axis = 1)
I computed the lists to lower only once, so that it is not recomputed and the computation of the for loop is saved.
