At the moment I have: text.split("[^\\w ]")
But I also need to include words like: Can't but not something like: 'HEART'
I can't find a solution that splits a text into words, including letters, numbers and the apostrophe when it's between other letters. Thanks
CodePudding user response:
If you want to match words using \w, then instead of using split you can match between word boundaries and assert that there is no ' directly to the left or to the right of the word.
\b(?<!')\w+(?:'\w+)*\b(?!')
In Java
String regex = "\\b(?<!')\\w+(?:'\\w+)*\\b(?!')";
String string = "Can't but not something like: 'HEART'";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
Output
Can't
but
not
something
like
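For reuse, the same pattern can be wrapped in a small helper. This is only a sketch: the class name WordExtractor and the method extractWords are made up for illustration and are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class WordExtractor {
    // A word is a run of word characters, optionally joined by inner apostrophes,
    // that has no apostrophe directly before or after it.
    private static final Pattern WORD =
            Pattern.compile("\\b(?<!')\\w+(?:'\\w+)*\\b(?!')");

    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Can't, but, not, something, like]
        System.out.println(extractWords("Can't but not something like: 'HEART'"));
    }
}
```

Note that a quoted word such as 'HEART' is skipped entirely rather than returned without its quotes. If you want HEART in the result, this pattern is probably not the right fit.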
CodePudding user response:
It may be simpler to get rid of the single quotes/apostrophes when they occur directly before or after a word, and then split using the initial delimiter pattern with the apostrophe excluded:
String text = "Modern Talking's Hit: 'You're my heart, you're my soul', 1985";
String[] words = text.replaceAll("(?:^|\\W)'|'(?:\\W|$)", "").split("[^\\w^']+");
System.out.println(Arrays.toString(words));
Output:
[Modern, Talking's, Hit, You're, my, heart, you're, my, soul, 1985]
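One thing to watch: the replaceAll pattern above removes the neighbouring non-word character together with the apostrophe, so an input like say 'hello' would come out as sayhello. A possible variant, sketched below, uses lookarounds so that only the apostrophe itself is removed. The class name QuoteStrippingSplitter, the method splitWords, and the lookaround pattern are my own assumptions, not part of the answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, named here only for illustration.
class QuoteStrippingSplitter {
    static List<String> splitWords(String text) {
        // Remove an apostrophe unless it has a word character on both sides.
        // Lookarounds consume nothing, so adjacent spaces and punctuation stay in place.
        String cleaned = text.replaceAll("(?<!\\w)'|'(?!\\w)", "");
        // Split on runs of characters that are neither word characters nor apostrophes,
        // and drop the empty token split() produces for a leading delimiter.
        return Arrays.stream(cleaned.split("[^\\w']+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [She, said, hello, to, Tom's, dog]
        System.out.println(splitWords("She said 'hello' to Tom's dog"));
    }
}
```

On the sample text from the answer this should give the same ten words as the output shown above.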
CodePudding user response:
Instead of splitting, you could use the Pattern and MatchResult classes to list the words you want with the regex \w+('\w+)? (note that Matcher.results() requires Java 9 or later)
import java.util.regex.Pattern;
import java.util.regex.MatchResult;
String regex = "\\w+('\\w+)?";
String text = "sampl'e 'text'";
String[] words = Pattern.compile(regex)
        .matcher(text)
        .results()
        .map(MatchResult::group)
        .toArray(String[]::new);
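Put together as a runnable class, the stream-based version might look like the sketch below. The names StreamWordMatcher and words are placeholders of mine, and Java 9+ is assumed for Matcher.results().

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, named here only for illustration.
class StreamWordMatcher {
    // Word characters, optionally followed by one apostrophe plus more word characters.
    private static final Pattern WORD = Pattern.compile("\\w+('\\w+)?");

    static List<String> words(String text) {
        return WORD.matcher(text)
                .results()                 // Stream<MatchResult>, available since Java 9
                .map(MatchResult::group)   // the full text of each match
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [sampl'e, text]
        System.out.println(words("sampl'e 'text'"));
    }
}
```

Because the optional group allows only one inner apostrophe, a word like rock'n'roll is returned as rock'n and roll. Changing the group to (?:'\\w+)* should keep it in one piece.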
You could also split on whitespace that is optionally preceded or followed by an apostrophe (the backslash has to be doubled inside a Java string literal)
text.split("'?\\s'?");
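A caveat with the whitespace split: it only sees apostrophes that sit next to whitespace, so a quote at the very start or end of the text stays attached, and so does punctuation such as commas. The sketch below strips an edge apostrophe first; that extra step, the use of \\s+ for runs of whitespace, and the names WhitespaceSplitDemo and splitOnWhitespace are assumptions of mine rather than part of the answer.

```java
import java.util.Arrays;

// Hypothetical helper class, named here only for illustration.
class WhitespaceSplitDemo {
    static String[] splitOnWhitespace(String text) {
        // Drop an apostrophe at the very start or end of the text.
        String trimmed = text.replaceAll("^'|'$", "");
        // Split on whitespace runs, swallowing an apostrophe on either side of the gap.
        return trimmed.split("'?\\s+'?");
    }

    public static void main(String[] args) {
        // prints [sampl'e, text]
        System.out.println(Arrays.toString(splitOnWhitespace("sampl'e 'text'")));
    }
}
```

This keeps punctuation glued to the words (for example "Hi, there" gives "Hi," and "there"), so it is mostly useful when the input is already free of punctuation.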
