I am having difficulty building a regex that extracts a value from a URL. The condition is to get the value between the last "/" and ".html". Please help.
Sample URL1 - https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html - The value I want to extract is bingo
Sample URL2 - www.example.com/we/b345g.html - The value I want to extract is b345g
I tried to build a regex and was able to get "bingo.html" and "b345g.html" using [^\/]+$ but was not able to remove or skip ".html".
CodePudding user response:
Here you are:
\/([^\/]+?)(?>\..+)?$
Explanation:
\/ - literal character '/'
([^\/]+?) - first group: at least one character that is not a '/', matched lazily (as few characters as possible)
  [^\/] - any character that is not a '/'
  + - at least one occurrence
  ? - laziness modifier (expand only as far as needed)
(?>\..+)? - second, optional group: a '.' followed by at least one character (like '.html', '.exe' or '.png')
  ?> - atomic group; it is not captured, so its content is excluded from group 1
  \. - literal character '.'
  . - any character (except line terminators)
  + - at least one occurrence
  ? - optionality (note that this one is outside the parentheses)
$ - end of the string
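As a quick check, the pattern can be tried in Python. This is only a sketch under a few assumptions: the helper name last_segment is made up for illustration, and the atomic group (?>...) is swapped for a plain non-capturing group (?:...), because Python's re module accepts atomic groups only from version 3.11. For the inputs below the swap should not change the result.

```python
import re

# Assumption: (?:...) stands in for the answer's atomic group (?>...),
# which Python's re module only supports from version 3.11 onwards.
BASIC = re.compile(r"/([^/]+?)(?:\..+)?$")

def last_segment(url):
    """Return group 1 of the basic pattern, or None if nothing matches."""
    m = BASIC.search(url)
    return m.group(1) if m else None

# URL without a protocol: the last path segment is captured.
print(last_segment("www.example.com/we/b345g.html"))  # b345g

# URL with a protocol: the search succeeds at the second '/' of '//',
# so the host prefix is captured instead of the page name.
print(last_segment("https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html"))  # www
```

The second call shows why the protocol needs separate handling: after "//" the pattern finds "www", a dot, and then the rest of the string up to the end.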
If you also want to exclude query strings, you can extend it like this:
\/([^\/]+?)(?>\..+)?(?>\?.*)?$
If you also need to skip the protocol part of the URL, you can use this:
(?<!\/)\/([^\/]+?)(?>\..+)?(?>\?.*)?$
Here (?<!\/) is a negative lookbehind: it requires that the matched '/' is not preceded by another '/', so the match cannot start inside the '//' that follows the protocol.
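The final pattern can be exercised with a short Python sketch. The function name page_name and the query-string sample URL are assumptions added for illustration, and the atomic groups (?>...) are again replaced by non-capturing groups (?:...) so the code also runs on Python versions older than 3.11.

```python
import re

# Assumption: (?:...) replaces the answer's atomic groups (?>...).
# (?<!/) rejects a '/' that directly follows another '/'.
FULL = re.compile(r"(?<!/)/([^/]+?)(?:\..+)?(?:\?.*)?$")

def page_name(url):
    """Return the last path segment without extension or query string."""
    m = FULL.search(url)
    return m.group(1) if m else None

samples = (
    "https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html",
    "www.example.com/we/b345g.html",
    "www.example.com/we/b345g.html?x=1",  # hypothetical URL with a query string
)
for url in samples:
    print(page_name(url))
```

For these three inputs the sketch prints bingo, b345g and b345g. URLs whose query string itself contains a '/' are not covered here.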
CodePudding user response:
You are only matching with [^\/]+$, which does not differentiate between the part before and the part after the dot.
To tell them apart, you could use, for example, a capture group to get the part after the last slash and before the first dot.
\S*\/([^\/\s.]+)\.[^\/\s]+$
\S*\/ - Match optional non-whitespace chars up to the last occurrence of /
([^\/\s.]+) - Capture group 1: match 1+ times any char except a /, a whitespace char or a .
\. - Match a dot
[^\/\s]+ - Match 1+ times any char except a / or a whitespace char
$ - End of string
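A small Python sketch of this pattern, with the helper name name_before_first_dot and the last two sample URLs being assumptions chosen for illustration:

```python
import re

# Same pattern as above; '/' does not need escaping in Python.
PATTERN = re.compile(r"\S*/([^/\s.]+)\.[^/\s]+$")

def name_before_first_dot(url):
    """Return capture group 1, or None when the last segment has no dot."""
    m = PATTERN.search(url)
    return m.group(1) if m else None

print(name_before_first_dot("https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html"))  # bingo
print(name_before_first_dot("www.example.com/we/b345g.html"))  # b345g
print(name_before_first_dot("www.example.com/we/archive.tar.gz"))  # archive
print(name_before_first_dot("www.example.com/we/page"))  # None
```

Because group 1 excludes dots, a name such as archive.tar.gz yields only the part before the first dot, and a last segment without any dot does not match at all.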
