I am trying to capture instances in my dataframe where a string has the following format:
'/random a/random b/random c/capture this/random again/random/random'
When a substring is preceded by four '/' characters and more than two '/' appear after it, I would like it captured and returned in a separate column. If a row does not fit this pattern, return None.
In this instance 'capture this' should be captured and placed into a new column.
This is what I tried:
def extract_special_string(df, column):
    df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x).group(0) if re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x) else None)
extract_special_string(df, 'column')
However nothing is being captured. Can anybody help with this regex? Thanks.
CodePudding user response:
You can use
df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)
Details:
^ - start of string
(?:[^/]*/){4} - four occurrences of any zero or more chars other than / and then a / char
([^/]+) - Capturing group 1: one or more chars other than a / char
(?:/[^/]*){2} - two occurrences of a / char and then any zero or more chars other than /.
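The pattern can be sanity-checked outside pandas with the standard re module. The sketch below is only an illustration: the helper name capture_after_fourth_slash and the two short sample strings are made up for this example and do not come from the question.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
PATTERN = re.compile(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}')

def capture_after_fourth_slash(text):
    """Return the text after the fourth '/', or None if the string does not fit."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

samples = [
    '/random a/random b/random c/capture this/random again/random/random',
    '/too/short',   # hypothetical: fewer than four '/' in total
    '/a/b/c/d/e',   # hypothetical: only one '/' after the candidate segment
]
results = [capture_after_fourth_slash(s) for s in samples]
print(results)  # ['capture this', None, None]
```

Rows where this helper returns None are the rows where str.extract leaves a missing value (NaN) in the new column.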
CodePudding user response:
An alternative regex approach would be to use non-greedy quantifiers.
import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1)) # 'capture this'
/(?:.*?/){3} - match the part before 'capture this', matching any characters, but as few as possible, between each pair of /s (a non-capturing group is used to ignore the contents)
(.*?) - capture 'capture this' (since this is a capturing group, its contents can be fetched with <match_object>.group(1))
(?:/.*?){2,} - same as the first part: match as few characters as possible between each pair of /s
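Note that re.match returns None when a string does not fit the pattern, so calling m.group(1) directly would raise an AttributeError on such rows. A minimal sketch of a guarded version follows; the helper name extract_or_none and the two extra sample strings are hypothetical.

```python
import re

# Same non-greedy pattern as above, compiled once.
LAZY_PATTERN = re.compile(r'/(?:.*?/){3}(.*?)(?:/.*?){2,}')

def extract_or_none(text):
    """Return capture group 1, or None when the string does not match."""
    m = LAZY_PATTERN.match(text)
    return m.group(1) if m else None

full = extract_or_none('/random a/random b/random c/capture this/random again/random/random')
short = extract_or_none('/only/three/slashes')  # not enough '/' to match
empty = extract_or_none('/a/b/c//e/f')          # empty segment after the fourth '/'

print(repr(full))   # 'capture this'
print(repr(short))  # None
print(repr(empty))  # ''
```

Assuming the dataframe from the question, this could be applied with df['special_string_a'] = df['column'].apply(extract_or_none). Because .*? can also match nothing, an empty segment comes back as '' rather than None, which differs from the [^/]+ pattern in the other answer.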
