Given a block of arbitrary text, I need a regex pattern that extracts bare domains only: it should ignore the scheme and the subdomain component, and skip a string entirely if it has a path (those are extracted separately as URLs).
Example Text:
www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
Matches:
reddit.com
stackoverflow.com
I have tried the following:
\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
Which of course will return:
www.google.com
www.stackoverflow.com
reddit.com
www.facebook.com
CodePudding user response:
You can use
\b(?!www\.)(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,63}\b(?![/.])
Details:
\b - a word boundary
(?!www\.) - a negative lookahead: no www. is allowed immediately to the right
(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+ - one or more occurrences of:
    (?=[a-z0-9-]{1,63}\.) - a positive lookahead that requires 1 to 63 ASCII lowercase letters, digits or hyphens and then a . immediately to the right of the current location
    (?:xn--)? - an optional xn-- char sequence
    [a-z0-9]+ - one or more lowercase ASCII letters or digits
    (?:-[a-z0-9]+)* - zero or more sequences of - and one or more lowercase ASCII letters or digits
    \. - a . char
[a-z]{2,63} - 2 to 63 lowercase ASCII letters
\b - a word boundary
(?![/.]) - a negative lookahead that fails the match if there is a / or . immediately to the right of the current location.
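For anyone who wants to try the pattern outside a regex tester, here is a minimal Python sketch that runs it against the sample text from the question. The names DOMAIN_RE and extract_domains are made up for this illustration and are not part of the answer above.

```python
import re

# The pattern from the answer, split across lines for readability.
DOMAIN_RE = re.compile(
    r'\b(?!www\.)'                                                    # do not start at "www."
    r'(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+'  # one or more labels, each ending in "."
    r'[a-z]{2,63}\b'                                                  # top-level domain
    r'(?![/.])'                                                       # reject if a path or another label follows
)

# Sample text from the question.
text = """www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
"""


def extract_domains(s):
    """Hypothetical helper: return every match of DOMAIN_RE in s."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return DOMAIN_RE.findall(s)


print(extract_domains(text))  # ['stackoverflow.com', 'reddit.com']
```

Note that only a leading www. is skipped. Any other subdomain is kept, so extract_domains("mail.google.com") returns ['mail.google.com']. The pattern is also lowercase-only as written; passing re.IGNORECASE to re.compile would be one way to relax that, if mixed-case input is expected.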
CodePudding user response:
import re
text = '''www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
'''
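# The lookbehind requires "www." or the tail of "https:" just before the match;
# the lookahead rejects any .com domain that is followed by a path.
print(re.findall(r'(?<=(?:www\.|tps:))/*([a-z]+\.com)(?!/)', text))
# ['stackoverflow.com', 'reddit.com']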
