In Python, I have a dataframe in which one column contains two comma-separated URLs (https://pippo.it, https://pluto.it), and another column stores the URLs I want to remove from the whole dataframe. How do I accomplish this?
Example code
import pandas as pd
df = pd.DataFrame({'urls':['https://pippo.it, https://pluto.it', 'http://blah.com'], 'urls2':['http://blah.net', 'https://pippo.it, https://pluto.it']})
df2 = df
for column in df:
    URLVal = df["urls2"].values
    # attempt: the stringified array never equals a whole cell, so nothing is replaced
    df2 = df2.replace(str(URLVal), "")
CodePudding user response:
You may try replacing using a regex:
df["urls"] = df["urls"].str.replace(r'(?:,\s*)?https?://pluto\.it(?:,\s*)?', '', regex=True)
Here is a regex demo showing that the replacement logic is working.
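One detail worth checking in a pattern like this is whether the dot is escaped, since an unescaped dot matches any character. The following is only a small sketch of the difference; the second sample value (https://plutoxit.example) is a made-up string used for illustration and is not part of the question's data.

```python
import pandas as pd

# First value comes from the question; the second is a hypothetical
# look-alike used only to show what an unescaped dot lets through.
s = pd.Series(["https://pippo.it, https://pluto.it", "https://plutoxit.example"])

unescaped = r'(?:,\s*)?https?://pluto.it(?:,\s*)?'   # "." matches any character
escaped = r'(?:,\s*)?https?://pluto\.it(?:,\s*)?'    # "\." matches a literal dot

loose = s.str.replace(unescaped, '', regex=True)
strict = s.str.replace(escaped, '', regex=True)

print(loose.tolist())   # the look-alike is partly eaten
print(strict.tolist())  # the look-alike is left untouched
```

With the escaped version, only the literal host pluto.it is removed.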
CodePudding user response:
I want to explain why the attempt
df.replace("https://pluto.it", "")
failed. pandas has two distinct methods: pandas.DataFrame.replace, which by default replaces whole cell values, and pandas.Series.str.replace, which replaces substrings like the replace method of str (note that it applies to a Series, i.e. a single column of a DataFrame). Consider the following simple example:
import pandas as pd
df = pd.DataFrame({"A":["ABC","ABCDEF","DEF"]})
print(df.replace("DEF","X")) # holds ABC, ABCDEF, X
print(df.A.str.replace("DEF","X")) # holds ABC, ABCX, X
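As a complement to the example above, DataFrame.replace can also work on substrings when regex=True is passed, which is what lets it act on every column at once. A minimal sketch using the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": ["ABC", "ABCDEF", "DEF"]})

# Default behaviour: only cells equal to "DEF" are replaced.
whole = df.replace("DEF", "X")

# With regex=True the pattern is searched inside each string cell.
partial = df.replace("DEF", "X", regex=True)

print(whole["A"].tolist())
print(partial["A"].tolist())
```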
CodePudding user response:
You can use either of the following:
df = df.replace(r"\s*(?:,\s*)?https://pluto\.it\b", "", regex=True)
df.replace(r"\s*(?:,\s*)?https://pluto\.it\b", "", regex=True, inplace=True)
See a Pandas test:
import pandas as pd
df = pd.DataFrame({'urls':['https://pippo.it, https://pluto.it', 'http://blah.com'], 'urls2':['http://blah.net', 'https://pippo.it, https://pluto.it']})
df.replace(r"\s*(?:,\s*)?https://pluto\.it\b", "", regex=True, inplace=True)
So, if the initial dataframe looks like
                                 urls                               urls2
0  https://pippo.it, https://pluto.it                     http://blah.net
1                     http://blah.com  https://pippo.it, https://pluto.it
The output will be:
               urls             urls2
0  https://pippo.it   http://blah.net
1   http://blah.com  https://pippo.it
inplace=True applies the changes directly to the dataframe, so there is no need to reassign the variable.
The \s*(?:,\s*)?https://pluto\.it\b regex needs more attention:
- \s* - zero or more whitespaces
- (?:,\s*)? - an optional sequence of a comma and then zero or more whitespaces
- https://pluto\.it - a literal https://pluto.it string
- \b - a word boundary (used to match it but not ita, it0, it_ etc.)
Note: If you want to make sure there is end of string, use $. If you want to make sure there can be a whitespace or end of string only after the URL, use (?!\S) instead of \b.
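To make the difference between the three endings concrete, here is a small sketch using Python's re module directly. The sample strings other than the first are made up for illustration and do not come from the question.

```python
import re

base = r"\s*(?:,\s*)?https://pluto\.it"

# Only the first sample is from the question; the rest are hypothetical.
samples = [
    "https://pippo.it, https://pluto.it",       # URL at end of string
    "https://pippo.it, https://pluto.ita",      # longer TLD-like suffix
    "https://pippo.it, https://pluto.it/page",  # URL followed by a path
    "https://pippo.it, https://pluto.it ",      # URL followed by a space
]

results = {}
for suffix in (r"\b", r"$", r"(?!\S)"):
    results[suffix] = [re.sub(base + suffix, "", s) for s in samples]
    print(suffix, results[suffix])
```

In this sketch \b also strips the host from the /page sample (leaving https://pippo.it/page), while $ and (?!\S) leave that string alone; $ additionally refuses to match when a trailing space follows the URL.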
CodePudding user response:
Assuming the string to be replaced always has the same structure, you could also do it very easily without regex as follows:
df = df.replace("https://pippo.it, https://pluto.it", "https://pippo.it")
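If the cell contents are not always identical, another regex-free option is to split each cell on commas and keep only the URLs that are not in a removal set. This is just a sketch: the to_remove set and the drop_urls helper are hypothetical names introduced here, and the code assumes every cell is a string.

```python
import pandas as pd

df = pd.DataFrame({
    'urls': ['https://pippo.it, https://pluto.it', 'http://blah.com'],
    'urls2': ['http://blah.net', 'https://pippo.it, https://pluto.it'],
})

# Hypothetical set of URLs to drop; extend it as needed.
to_remove = {"https://pluto.it"}

def drop_urls(cell):
    """Split a comma-separated cell and rejoin the URLs that should stay."""
    parts = [p.strip() for p in cell.split(",")]
    return ", ".join(p for p in parts if p and p not in to_remove)

# Apply the helper to every cell, column by column.
cleaned = df.apply(lambda col: col.map(drop_urls))
print(cleaned)
```

Because the comparison is on whole URLs, the position of the URL within the cell does not matter.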
