split text into array using regex in Java

Time:01-19

I need some help with a regex for the String split() method. The string I would like to split looks like this:

<code a="1234" n1="John doe" n2="1/3 forest" o1="game dev" id_fk="2"/>

It always starts with <code and ends with />

I would like the result to look like this:

a="1234"
n1="John doe"
n2="1/3 forest"
o1="game dev"
id_fk="2"

I tried to create a regex using regex101, and my expression at the moment is (\S+=".*?"). With split(), however, the strings I would like to have as the result are treated as delimiters, and I can't figure out how to negate this expression or write a different one that would give me the correct result.

Thank you for your help in advance

CodePudding user response:

The thing you pass to split is the separator. As in, the stuff in between the data you wanted.

Your regexp matches... the data you wanted, which means what you get back is the leftover text between the matches: the leading <code, the spaces, and the trailing />.
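
To see this concretely, here is a small sketch of what split() returns when the attribute pattern is used as the separator. The class and method names are made up for illustration, and the pattern is assumed to be the question's expression with the + quantifier after \S.

```java
import java.util.Arrays;

public class SplitAsSeparatorDemo {
    // Hypothetical helper: uses the attribute pattern as the split() separator.
    static String[] splitOnAttributes(String input) {
        // Assumed pattern: \S+=".*?" (the question's expression with '+').
        return input.split("\\S+=\".*?\"");
    }

    public static void main(String[] args) {
        String input = "<code a=\"1234\" n1=\"John doe\" n2=\"1/3 forest\" o1=\"game dev\" id_fk=\"2\"/>";
        String[] parts = splitOnAttributes(input);
        // Every attribute is consumed as a separator, so only the text
        // between the attributes survives.
        System.out.println(Arrays.toString(parts));
        System.out.println(parts.length + " pieces, none of them an attribute");
    }
}
```

For the sample input this should leave six pieces: "<code ", four single spaces, and "/>".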

You're using the wrong tool for the job, twice.

[A] This is HTML. The 'regular' in 'RegularExpression' is not a random name. Regexes were not invented by "Mr. Regular". They refer to a type of grammar: some grammars are 'regular' and some are not. You cannot parse non-regular grammars with regular expressions. HTML IS NOT REGULAR. I can craft valid HTML that will fail any regex you care to make. It's unlikely it'll happen perhaps, but them's the breaks.

[B] Even if you decide to throw that one to the wolves and keep going down this wrong road, split is not the right tool. Make a pattern object, make a matcher, repeatedly call find:

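import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \\w+ also covers attribute names with digits or underscores, such as n1 and id_fk
Pattern p = Pattern.compile("(\\w+)\\s*=\\s*\"([^\"]*)\"");
Matcher m = p.matcher("<code a=\"1234\"...>");
while (m.find()) {
  String key = m.group(1);   // attribute name
  String value = m.group(2); // attribute value without the surrounding quotes
  System.out.println(key);   // prints 'a' on the first iteration
  System.out.println(value); // prints '1234' on the first iteration
}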
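
As for point [A], one way to avoid regex entirely is to hand the tag to an XML parser. The following is only a sketch, assuming the input is always a single well-formed XML element, as in the question. It uses the JDK's built-in DOM parser; the class name XmlAttributeDemo and the helper readAttributes are hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XmlAttributeDemo {
    // Hypothetical helper: parses one XML element and returns its attributes.
    static Map<String, String> readAttributes(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element root = builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
            NamedNodeMap attrs = root.getAttributes();
            // NamedNodeMap does not promise source order, so a sorted map
            // is used here to keep the result predictable.
            Map<String, String> result = new TreeMap<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                result.put(attr.getNodeName(), attr.getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // Assumption: malformed input is treated as a caller error.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String input = "<code a=\"1234\" n1=\"John doe\" n2=\"1/3 forest\" o1=\"game dev\" id_fk=\"2\"/>";
        readAttributes(input).forEach((name, value) -> System.out.println(name + "=\"" + value + "\""));
    }
}
```

The attributes come back as name/value pairs sorted by name rather than in source order, which may or may not matter for your use case.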

CodePudding user response:

Your pattern works when you use it with a Matcher rather than split(); this code

String input = "<code a=\"1234\" n1=\"John doe\" n2=\"1/3 forest\" o1=\"game dev\" id_fk=\"2\"/>";
Matcher matcher = Pattern.compile("(\\S+=\"[^\"]+\")").matcher(input);
while(matcher.find()) {
    System.out.println(matcher.group(1));
}

prints

a="1234"
n1="John doe"
n2="1/3 forest"
o1="game dev"
id_fk="2"
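
Since the question asks for an array, the matches can be collected instead of printed. This is a minimal sketch along the same lines; the class name AttributeArrayDemo and the helper extractAttributes are made up for illustration, and the pattern uses [^"]* so that empty values such as x="" would also be picked up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeArrayDemo {
    // name="value" pairs; the value may be empty but may not contain a quote.
    private static final Pattern ATTRIBUTE = Pattern.compile("\\S+=\"[^\"]*\"");

    // Hypothetical helper: returns every name="value" pair found in the input.
    static String[] extractAttributes(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = ATTRIBUTE.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group()); // the whole match, quotes included
        }
        return found.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String input = "<code a=\"1234\" n1=\"John doe\" n2=\"1/3 forest\" o1=\"game dev\" id_fk=\"2\"/>";
        System.out.println(Arrays.toString(extractAttributes(input)));
    }
}
```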

CodePudding user response:

Keeping it simple, and assuming you only need to handle a single, top-level, non-nested XML tag, we can try:

String input = "<code a=\"1234\" n1=\"John doe\" n2=\"1/3 forest\" o1=\"game dev\" id_fk=\"2\"/>";
// Strip the leading "<code " and the trailing "/>", then split on whitespace that is
// directly followed by the next name=" pair, so spaces inside values are kept.
String[] parts = input.replaceAll("^<\\S+\\s+|\\s*/>$", "").split("\\s+(?=\\w+=\")");
System.out.println(Arrays.toString(parts));

// [a="1234", n1="John doe", n2="1/3 forest", o1="game dev", id_fk="2"]