Java: Searching for a regex that splits a text into separate words, including letters, numbers and apostrophes

Time:01-14

At the moment I have: text.split("[^\\w ]"). But I also need to include words like Can't, but not something like 'HEART'.

I can't find a solution that splits a text into words, including letters, numbers, and the apostrophe when it sits between other letters. Thanks.

CodePudding user response:

If you want to match words using \w, you can match instead of split: use word boundaries and assert that there is no ' directly to the left or to the right of the word.

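\b(?<!')\w+(?:'\w+)*\b(?!')

In Java

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Word characters, optionally joined by inner apostrophes,
// with no apostrophe directly before or after the word.
String regex = "\\b(?<!')\\w+(?:'\\w+)*\\b(?!')";
String string = "Can't but not something like: 'HEART'";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);

// Print every match; the quoted 'HEART' is skipped.
while (matcher.find()) {
    System.out.println(matcher.group(0));
}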

Output

Can't
but
not
something
like
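
To make the pattern above easier to reuse, here is a minimal sketch that wraps it in a helper method. The class name WordExtractor and the method name extractWords are made up for illustration; the regex is the one from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class WordExtractor {
    // \b(?<!')      word boundary that is not preceded by an apostrophe
    // \w+(?:'\w+)*  word characters, optionally joined by inner apostrophes
    // \b(?!')       word boundary that is not followed by an apostrophe
    private static final Pattern WORD =
            Pattern.compile("\\b(?<!')\\w+(?:'\\w+)*\\b(?!')");

    // Collects every match of WORD in the given text, in order.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [Can't, but, not, something, like]
        System.out.println(extractWords("Can't but not something like: 'HEART'"));
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call. Note that a word wrapped in single quotes is dropped entirely by this pattern rather than returned without its quotes.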

CodePudding user response:

It may be simpler to first remove the single quotes/apostrophes that occur directly before or after a word, and then split on the original delimiter pattern, changed so that the apostrophe no longer counts as a delimiter:

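import java.util.Arrays;

String text = "Modern Talking's Hit:  'You're my heart, you're my soul', 1985";
// Strip quotes at word edges, then split on runs of anything that is neither a word character nor an apostrophe.
String[] words = text.replaceAll("(?:^|\\W)'|'(?:\\W|$)", "").split("[^\\w']+");
System.out.println(Arrays.toString(words));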

Output:

[Modern, Talking's, Hit, You're, my, heart, you're, my, soul, 1985]
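
One thing to watch with the replaceAll pattern above: it also removes the non-word character next to the quote, so an input such as "my 'heart'" would collapse to "myheart". A possible variant, sketched here as an assumption rather than as part of the original answer, uses lookarounds so that only the apostrophe itself is removed. The class name QuoteStripper and the method name splitWords are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class QuoteStripper {
    static List<String> splitWords(String text) {
        // Remove an apostrophe that has no word character before it,
        // or no word character after it. Lookarounds consume nothing,
        // so the neighbouring space or punctuation is left in place.
        String unquoted = text.replaceAll("(?<!\\w)'|'(?!\\w)", "");
        // Split on runs of characters that are neither word characters
        // nor apostrophes.
        return Arrays.asList(unquoted.split("[^\\w']+"));
    }

    public static void main(String[] args) {
        // Prints [say, hello, to, Bob's, friends]
        System.out.println(splitWords("say 'hello' to Bob's 'friends'"));
    }
}
```

If the text starts with a delimiter, split returns an empty first element, so you may want to filter that out.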

CodePudding user response:

Instead of splitting, you could use the Pattern and MatchResult classes to collect the words you want with the regex \w+('\w+)? (Matcher.results() requires Java 9 or later):

import java.util.regex.Pattern;
import java.util.regex.MatchResult;

String regex = "\\w+('\\w+)?";
String text = "sampl'e 'text'";

String[] words = Pattern.compile(regex)
                          .matcher(text)
                          .results()
                          .map(MatchResult::group)
                          .toArray(String[]::new);
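
For completeness, here is a self-contained sketch of the stream-based approach. The class name StreamWordFinder and the method name findWords are made up for illustration. It also swaps the optional group ('\w+)? for a repeated group (?:'\w+)*, which is my own change, so that words with more than one inner apostrophe stay in one piece.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of any library.
final class StreamWordFinder {
    // Word characters, optionally joined by any number of inner apostrophes.
    private static final Pattern WORD = Pattern.compile("\\w+(?:'\\w+)*");

    // Matcher.results() (Java 9+) streams one MatchResult per match.
    static List<String> findWords(String text) {
        return WORD.matcher(text)
                   .results()
                   .map(MatchResult::group)
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints [sampl'e, text]
        System.out.println(findWords("sampl'e 'text'"));
    }
}
```

Unlike the lookaround pattern from the first answer, this one keeps a quoted word and only leaves off the surrounding quotes.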

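You could also split on a whitespace character that is optionally preceded or followed by an apostrophe. Note that an apostrophe at the very start or end of the text has no whitespace next to it, so it stays attached to its word:

text.split("'?\\s'?");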