Home > Software engineering >  Delete rows based on the multiple values as fast as possible
Delete rows based on the multiple values as fast as possible

Time:01-15

I have a dataframe which consist a column named Url. There are around 500K Urls. I want to delete all the rows where the Urls consist of some domain name like amazon.com, ebay.com, bestbuy.com and so on.

Those Urls can be like the below:

https://www.homedepot.com/b/Plumbing-Water-Pumps-Sump-Pumps-Submersible-Sump-Pumps/N-5yc1vZbqj6

https://images.homedepot-static.com/catalog/pdfImages/ba/ba1bd2c2-82ea-4510-85c8-333392e70a23.pdf

https://us.amazon.com/Simer-A5000-Hole-Battery-System/dp/B000DZKXC2

https://www.amazon.com/Hydromatic-DPS41-Diaphragm-Switch-Range/dp/B009S0NS0C

So the domain can be present as subdomain too. It may or may not consist with http, https, www, top level country domain name like co.uk, .co.nz and so on.

So I need an universal solution to delete any URL when the domain name is present in the exclude-sites list.

I already created a function for it which is working fine for smaller data set. But it couldn't clean the data for the 500K rows even after running it for straight 5 hours.

Here is the function I am using:

exclude_sites=['amazon.com','amzn.to','ebay.com','walmart.com','sears.com','costco.com','youtube.com', 'lowes.com', 'homedepot.com', 'bestbuy.com']

def exclude_urls(exclude_sites, df):
    i=[]
    
    for row in df['Url']:
        if any(url in row for url in exclude_sites):
             i.extend(df.index[df['Url']==row])
            #reset index  
        
    return i

df = df.drop(list( exclude_urls(exclude_sites, df)))

What might be the fastest solution to delete the domain saved on the exclude_sites list? Thank you in advance.

CodePudding user response:

exclude_sites_escaped = [x.replace('.', '\.') for x in exclude_sites]
df[~df['Url'].str.contains('|'.join(exclude_sites_escaped), regex = True)]
  •  Tags:  
  • Related