I have a dataframe which consist a column named Url. There are around 500K Urls. I want to delete all the rows where the Urls consist of some domain name like amazon.com, ebay.com, bestbuy.com and so on.
Those Urls can be like the below:
https://www.homedepot.com/b/Plumbing-Water-Pumps-Sump-Pumps-Submersible-Sump-Pumps/N-5yc1vZbqj6
https://images.homedepot-static.com/catalog/pdfImages/ba/ba1bd2c2-82ea-4510-85c8-333392e70a23.pdf
https://us.amazon.com/Simer-A5000-Hole-Battery-System/dp/B000DZKXC2
https://www.amazon.com/Hydromatic-DPS41-Diaphragm-Switch-Range/dp/B009S0NS0C
So the domain can be present as subdomain too. It may or may not consist with http, https, www, top level country domain name like co.uk, .co.nz and so on.
So I need an universal solution to delete any URL when the domain name is present in the exclude-sites list.
I already created a function for it which is working fine for smaller data set. But it couldn't clean the data for the 500K rows even after running it for straight 5 hours.
Here is the function I am using:
exclude_sites=['amazon.com','amzn.to','ebay.com','walmart.com','sears.com','costco.com','youtube.com', 'lowes.com', 'homedepot.com', 'bestbuy.com']
def exclude_urls(exclude_sites, df):
i=[]
for row in df['Url']:
if any(url in row for url in exclude_sites):
i.extend(df.index[df['Url']==row])
#reset index
return i
df = df.drop(list( exclude_urls(exclude_sites, df)))
What might be the fastest solution to delete the domain saved on the exclude_sites list? Thank you in advance.
CodePudding user response:
exclude_sites_escaped = [x.replace('.', '\.') for x in exclude_sites]
df[~df['Url'].str.contains('|'.join(exclude_sites_escaped), regex = True)]
