I'm trying to extract a portion of a longer text using regex.
Original text: If you have any questions or concerns, you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb
Portion I'm interested in: kaieldentsome [!at] gmail.com.
The phrase "contact us at" will not always be present in the text.
I've tried with:
import re
item_str = 'If you have any questions or concerns, you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb'
output = re.findall(r"(?<=\s).*?\s\[!at\].*?\s.*?\s",item_str)[0]
print(output)
Output I wish to get:
kaieldentsome [!at] gmail.com.
CodePudding user response:
You could use
(?<=\s)\S+\s\[!at\]\s\S+\.\S+
(?<=\s) Positive lookbehind, assert a whitespace char to the left
\S+ Match 1+ non-whitespace chars
\s\[!at\]\s Match [!at] between whitespace chars
\S+\.\S+ Match 1+ non-whitespace chars with at least one dot
Note that this requires a whitespace character immediately to the left of the match. If that is not mandatory, you can omit (?<=\s).
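A minimal sketch of how this pattern could be applied in Python, assuming the same item_str as in the question. re.search returns None when nothing matches, so the result is checked before use; the names pattern, match and result are only illustrative.

```python
import re

item_str = ('If you have any questions or concerns, you may contact us at '
            'kaieldentsome [!at] gmail.com. You can also follow us on fb')

# Whitespace to the left, one word, [!at] between whitespace, then a token containing a dot
pattern = re.compile(r"(?<=\s)\S+\s\[!at\]\s\S+\.\S+")

match = pattern.search(item_str)
result = match.group() if match else None  # avoid an exception when there is no match
print(result)  # kaieldentsome [!at] gmail.com.
```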
CodePudding user response:
\S+\s*\[!at\]\s*\S+
This also works if there is no whitespace before and/or after the [!at].
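If you want to exclude the trailing ., note that a greedy \S+ inside the group would consume the dot itself. Make it lazy and anchor the end with a lookahead:
(\S+\s*\[!at\]\s*\S+?)\.?(?=\s|$)
Then take the first group.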
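A minimal sketch of taking the first group in Python, assuming the question's item_str. It uses a lazy \S+? with a (?=\s|$) lookahead so that an optional trailing dot falls outside group 1; the names pattern and address are only illustrative.

```python
import re

item_str = ('If you have any questions or concerns, you may contact us at '
            'kaieldentsome [!at] gmail.com. You can also follow us on fb')

# Group 1 holds the address; an optional final dot is matched outside the group
pattern = re.compile(r"(\S+\s*\[!at\]\s*\S+?)\.?(?=\s|$)")

match = pattern.search(item_str)
address = match.group(1) if match else None
print(address)  # kaieldentsome [!at] gmail.com

# Whitespace around [!at] is optional with \s*
compact = pattern.search('write to foo[!at]bar.org')
print(compact.group(1))  # foo[!at]bar.org
```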
CodePudding user response:
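Quantifiers are greedy by default, meaning they match as much as possible, but your pattern already uses the lazy .*?. The actual problem is that . matches any character, including whitespace, and the regex engine returns the leftmost possible match. So .*? starts right after the first whitespace in the string and extends across words until it reaches [!at].
If you use \S*? instead, which matches any character except whitespace, the match can no longer cross a space and you get the desired result.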
Updated code:
import re
item_str = 'If you have any questions or concerns, you may contact us at kaieldentsome [!at] gmail.com. You can also follow us on fb'
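# \S*? cannot cross whitespace, so the match starts at the word right before [!at].
# The pattern ends in \s, so strip() removes the trailing space from the match.
output = re.findall(r"(?<=\s)\S*?\s\[!at\]\S*?\s\S*?\s", item_str)[0].strip()
print(output)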
Try it here: https://regex101.com/r/ZMw139/1
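To see the difference between . and \S in isolation, here is a small sketch that compares the two on just the part of the pattern up to [!at], assuming the question's item_str. The variable names with_dot and with_non_space are only illustrative.

```python
import re

item_str = ('If you have any questions or concerns, you may contact us at '
            'kaieldentsome [!at] gmail.com. You can also follow us on fb')

# '.' also matches spaces, so even the lazy .*? runs from the first
# whitespace in the string all the way up to [!at]
with_dot = re.search(r"(?<=\s).*?\s\[!at\]", item_str)

# \S cannot cross a space, so the match can only begin at the word
# directly before [!at]
with_non_space = re.search(r"(?<=\s)\S*?\s\[!at\]", item_str)

print(with_dot.group())
print(with_non_space.group())  # kaieldentsome [!at]
```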
