I am looking for certain entries that contain special words in a string. The string looks like this:
entry 1: hello
entry 2: world
entry 3: this
is a multiline
that makes it hard
entry 4: here we have a special entry
entry 5: here
we
have
another special entry
in a multiline
entry 6: end
Because it is a multiline problem, I use Java's DOTALL so that the . also matches newline characters.
I am looking for entries that contain the word special.
First I tried to find a regex that captures a full entry: entry \d+: .*?(?=\s*(entry \d: )|\Z). That is like a simplified version of this
Then I thought, OK, I just have to exchange the .*? for the regex I need to find. But entry \d+: .*?special.*?(?=\s*(entry \d: )|\Z) does not work, probably because requiring special lets the lazy .*? run across entry boundaries until it finds the word, so a match can start at an earlier entry that does not contain special at all.
Does anyone know a better solution?
CodePudding user response:
You can use a tempered greedy token:
(?s)entry \d+: (?:(?!entry \d+: ).)*special.*?(?=\s*entry \d+: |$)
See the regex demo. Details:
entry \d+: - entry, space, one or more digits, :, space
(?:(?!entry \d+: ).)* - any char, repeated zero or more times, that does not start the entry, space, one or more digits, :, space sequence
special - a fixed string
.*? - any zero or more chars, as few as possible
(?=\s*entry \d+: |$) - a positive lookahead that matches a location in the string that is immediately followed by zero or more whitespaces, entry, space, one or more digits, : and space, or by the end of the string.
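NOTE: Do not use Pattern.MULTILINE with this regex, because $ would then also match at the end of every line and cut multiline entries short. Alternatively, keep using \Z (end of the string, or the position right before a trailing newline, LF char).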
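For reference, here is a minimal Java sketch of the tempered greedy token approach, assuming entry headers of the form "entry N: " as in the sample string. The class name SpecialEntries, the SAMPLE constant and the find helper are made up for illustration, and \z (absolute end of input) is used in place of $ so the pattern does not depend on the MULTILINE setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialEntries {
    // Sample input copied from the question (hypothetical constant name).
    static final String SAMPLE = String.join("\n",
        "entry 1: hello",
        "entry 2: world",
        "entry 3: this",
        "is a multiline",
        "that makes it hard",
        "entry 4: here we have a special entry",
        "entry 5: here",
        "we",
        "have",
        "another special entry",
        "in a multiline",
        "entry 6: end");

    // (?s) enables DOTALL. The tempered token (?:(?!entry \d+: ).)* consumes
    // one char at a time, but only while no new entry header starts there,
    // so a match can never spill over into the following entry.
    static final Pattern SPECIAL_ENTRY = Pattern.compile(
        "(?s)entry \\d+: (?:(?!entry \\d+: ).)*special.*?(?=\\s*entry \\d+: |\\z)");

    // Returns every entry whose text contains the word "special".
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = SPECIAL_ENTRY.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // For the sample this prints entry 4 and the full multiline entry 5.
        find(SAMPLE).forEach(e -> System.out.println("[" + e + "]"));
    }
}
```

Note that the lookahead only treats "entry" followed by digits, a colon and a space as a header, so the plain word entry inside an entry's text does not end the match early.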
CodePudding user response:
If you use word and whitespace classes instead of dots, then it seems to work:
/entry \d+: [\w\s]*special[\w\s]*?(?=\s*(?:entry \d+:)|$)/gm
It seems that if you allow the colon : in your text, it breaks the expression.
And also you have \Z in your expression but it seems to me that end of line $ is more suited here
CodePudding user response:
[Edit:] I unfortunately missed the multiline nature of the entries, so this answer is valid for single-line entries but will return only the matching line for multiline entries. I think one could overcome this by using a suitable regex as the delimiter, though.
I'd suggest you use a Scanner to deal with the multi line aspect. This will give you a stream of tokens (the lines). You can use a String.contains(...) or a String.matches(...) to filter tokens then.
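// needs java.util.Scanner and java.util.stream.Collectors; tokens() requires Java 9+
var result = new Scanner(myMultiLineString)
    // the delimiter must be set on the Scanner itself, before tokens() is called
    .useDelimiter("\\n")
    .tokens()
    // alternatively use String.contains(...)
    // if you're looking for a constant
    // rather than a complex rule.
    .filter(s -> s.matches(regex))
    .collect(Collectors.toList());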
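The delimiter idea from the edit note could be sketched as follows. This is only one possible reading of it, assuming every entry starts on a new line with a header of the form "entry N: ". The class name ScannerEntries, the SAMPLE constant and the find helper are hypothetical. The delimiter \R(?=entry \d+: ) consumes only the line break that directly precedes a header, so line breaks inside an entry stay part of the token.

```java
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

public class ScannerEntries {
    // Sample input copied from the question (hypothetical constant name).
    static final String SAMPLE = String.join("\n",
        "entry 1: hello",
        "entry 2: world",
        "entry 3: this",
        "is a multiline",
        "that makes it hard",
        "entry 4: here we have a special entry",
        "entry 5: here",
        "we",
        "have",
        "another special entry",
        "in a multiline",
        "entry 6: end");

    // Splits the text into whole entries and keeps those containing the word.
    static List<String> find(String text, String word) {
        try (Scanner scanner = new Scanner(text)) {
            return scanner
                // split only at a line break that is followed by an entry header
                .useDelimiter("\\R(?=entry \\d+: )")
                .tokens() // Java 9+
                .filter(entry -> entry.contains(word))
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        // For the sample this prints entry 4 and the full multiline entry 5.
        find(SAMPLE, "special").forEach(e -> System.out.println("[" + e + "]"));
    }
}
```

Because each token is now a complete entry, a plain String.contains(...) check is enough here; String.matches(...) would need a pattern that covers the entire entry, including its line breaks.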
