Split string based on blank lines and create objects based on separator

Time:02-05

I'm trying to create a very rudimentary parser that would take a multi-line string and convert that into an array containing objects. The string would be formatted like this:

title: This is a title
description: Shorter text in one line
image: https://www.example.com

title: This is another title : with colon
description: Longer text that potentially
could span over several new lines,
even three or more
image: https://www.example.com


title: This is another title, where the blank lines above are two
description: Another description
image: https://www.example.com

The goal is to turn this into an array where each section separated by one or more empty lines would be an object containing key/value pairs with the colon as the separator in between the key and value, and one new line as the separator in between individual key/value pairs. So the input above should result in the following output:

[
  {
    title: "This is a title",
    description: "Shorter text in one line",
    image: "https://www.example.com"
  },
  {
    title: "This is another title : with colon",
    description: "Longer text that potentially could span over several new lines, even three or more",
    image: "https://www.example.com"
  },
  {
    title: "This is another title, where the blank lines above are two",
    description: "Another description",
    image: "https://www.example.com"
  }
]

I've started with this CodePen, but as you can see, the code currently has a few problems that need to be solved before it's complete.

  1. If colons are used in the value, they shouldn't be split. I somehow need to split on the first occurrence of a colon and then ignore additional colons in the value. This currently results in the following:
// Input:
//     title: This is another title : with colon
//     image: https://www.example.com

{
  image: " https",
  title: " This is another title "
}
  2. Some lines could contain a value that spans multiple lines. The line breaks in the value should be removed so that the value is concatenated into a single line, and they should not be treated as a separator for a new key/value pair. This currently results in the following:
// Input:
//     description: Longer text that potentially
//     could span over several new lines,
//     even three or more

{
  could span over several new lines,: undefined,
  description: " Longer text that potentially",
  even three or more: undefined
}

I would greatly appreciate any help with how to approach this given the code I have so far. Any suggestions on how to make the code more efficient are also very welcome.

CodePudding user response:

As a partial answer, the code below handles multiple colons on one line:

var input = `title: This is a title
description: Shorter text in one line
image: https://www.example.com

title: This is another title : with colon
description: Longer text that potentially
could span over several new lines,
even three or more
image: https://www.example.com


title: This is another title, where the blank lines above are two
description: Another description
image: https://www.example.com`;

var finalArray = [];
var first = input.split(/\n\s*\n/);

console.log("Array with sections split:", first);

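first.forEach(function (section) {
  var result = section.split("\n").reduce(function (o, line) {
    // Split on every colon, then re-join everything after the first one
    // so that colons inside the value are preserved.
    var parts = line.split(":");
    var key = parts.shift().trim();
    o[key] = parts.join(":").trim();
    return o;
  }, {});
  console.log(result);
  finalArray.push(result);
});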

console.log("Array of sections as objects:", finalArray);

This still doesn't handle multi-line values. The issue is that your schema has no way to determine when a new line means the start of a new property and when it is just the continuation of a value. You have already ruled out colon and comma separation, so you have no way left to solve your second issue.

I'd advise using a special character that you don't allow in the main text body to denote the end of a key-value pair and splitting based on that.
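
To illustrate the suggestion above, here is a minimal sketch that assumes every key/value pair ends with a `|` character. The `|` terminator, the sample input, and the `parseDelimited` name are all hypothetical choices made for this example, not part of the original format; any character that never appears in the text would do.

```javascript
// Hypothetical format: every key/value pair ends with "|" (an assumed
// terminator that must never appear in the text itself).
const input = `title: First title : with colon|
description: Longer text that
spans several lines|
image: https://www.example.com|

title: Second title|
description: Short|
image: https://www.example.com|`;

function parseDelimited(text, terminator = "|") {
  return text
    .split(/\n\s*\n/) // sections are separated by one or more blank lines
    .filter((section) => section.trim() !== "")
    .map((section) => {
      const record = {};
      for (const chunk of section.split(terminator)) {
        const pair = chunk.trim();
        if (pair === "") continue; // empty piece after the last terminator
        const colon = pair.indexOf(":"); // only the first colon separates key and value
        if (colon === -1) continue; // skip chunks that have no key
        const key = pair.slice(0, colon).trim();
        // Collapse line breaks inside the value into single spaces.
        const value = pair.slice(colon + 1).replace(/\s*\n\s*/g, " ").trim();
        record[key] = value;
      }
      return record;
    });
}

const records = parseDelimited(input);
console.log(records);
```

Because the terminator marks the end of each pair explicitly, line breaks inside a value no longer need to be guessed at, and using `indexOf(":")` keeps any later colons in the value.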

CodePudding user response:

There is a very simple rule when you work with text: always keep regular expressions in mind.

Try this approach:

const data = `title: This is a title
description: Shorter text in one line
image: https://www.example.com

title: This is another title : with colon
description: Longer text that potentially
could span over several new lines,
even three or more
image: https://www.example.com


title: This is another title, where the blank lines above are two
description: Another description
image: https://www.example.com`;

const bloks = data.split(/\n\s*\n/);

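const result = bloks.map((blok) => {
  // Each lookbehind/lookahead pair captures the text between two known keys.
  const title = blok.match(/(?<=title:)([\S\s]*\n?)(?=description:)/gm).join(' ').trim();
  // The description may span several lines, so line breaks become spaces.
  const description = blok.match(/(?<=description:)([\S\s]*\n?)(?=image:)/gm).join(' ').replaceAll('\n', ' ').trim();
  // image is the last key, so everything after it belongs to the value.
  const image = blok.match(/(?<=image:)([\S\s]*\n?)/gm).join(' ').trim();

  return { title, description, image };
});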

console.log(result);
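
The approach above relies on the three key names and their order being known in advance. As a possible variation, the sketch below makes a different assumption: a line starts a new pair only if it begins with a single word followed by a colon, and any other line continues the previous value. The `parseSections` name and the shortened sample input are hypothetical. Note that a continuation line that itself begins with something like `https:` would be mistaken for a new key under this assumption.

```javascript
const text = `title: This is another title : with colon
description: Longer text that potentially
could span over several new lines,
even three or more
image: https://www.example.com


title: Last title
description: Another description
image: https://www.example.com`;

function parseSections(source) {
  return source
    .split(/\n\s*\n/) // one or more blank lines separate sections
    .filter((section) => section.trim() !== "")
    .map((section) => {
      const record = {};
      let currentKey = null;
      for (const line of section.split("\n")) {
        // Assumption: a key is one word at the start of the line, then a colon.
        // Only this first colon is consumed; later colons stay in the value.
        const match = line.match(/^(\w+):\s*(.*)$/);
        if (match) {
          currentKey = match[1];
          record[currentKey] = match[2].trim();
        } else if (currentKey !== null) {
          // No key on this line, so append it to the previous value.
          record[currentKey] += " " + line.trim();
        }
      }
      return record;
    });
}

const parsed = parseSections(text);
console.log(parsed);
```

This processes each line once and does not depend on the key names, so added or reordered keys would still be picked up as long as they fit the single-word assumption.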