I'm counting how many times different words appear in a text using Regular Expressions in JavaScript. My problem is when I have quoted words: 'word' should be counted simply as word (without the quotes, otherwise they'll behave as two different words), while it's should be counted as a whole word.
(?<=\w)(')(?=\w)
This regex can identify apostrophes inside, but not around words. Problem is, I can't use it inside a character set such as [\w] .
(?<=\w)(')(?=\w)|[\w]
Will count it's a 'miracle' of nature as 7 words, instead of 5 (it, ', s becoming 3 different words). Also, the third word should be selected simply as miracle, and not as 'miracle'.
To make things even more complicated, I need to capture diacritics too, so I'm using [A-Za-zÀ-ÖØ-öø-ÿ] instead of \w.
How can I accomplish that?
CodePudding user response:
1) You can simply use /[^\s] /g regex
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s] /g);
console.log(result.length);
console.log(result);
<iframe name="sif1" sandbox="allow-forms allow-modals allow-scripts" frameborder="0"></iframe>
2) If you are calculating total number of words in a string then you can also use split as:
const str = `it's a 'miracle' of nature`;
const result = str.split(/\s /);
console.log(result.length);
console.log(result);
<iframe name="sif2" sandbox="allow-forms allow-modals allow-scripts" frameborder="0"></iframe>
3) If you want a word without quote at the starting and at the end then you can do as:
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s] /g).map((s) => {
s = s[0] === "'" ? s.slice(1) : s;
s = s[s.length - 1] === "'" ? s.slice(0, -1) : s;
return s;
});
console.log(result.length);
console.log(result);
<iframe name="sif3" sandbox="allow-forms allow-modals allow-scripts" frameborder="0"></iframe>
CodePudding user response:
You might use an alternation with 2 capture groups, and then check for the values of those groups.
(?<!\S)'(\S )'(?!\S)|(\S )
(?<!\S)'Negative lookbehind, assert a whitespace boundary to the left and match'(\S )Capture group 1, match 1 non whitespace chars'(?!\S)Match'and assert a whitespace boundary to the right|Or(\S )Capture group 2, match 1 non whitespace chars
See a regex demo.
const regex = /(?<!\S)'(\S )'(?!\S)|(\S )/g;
const s = "it's a 'miracle' of nature";
Array.from(s.matchAll(regex), m => {
if (m[1]) console.log(m[1])
if (m[2]) console.log(m[2])
});
<iframe name="sif4" sandbox="allow-forms allow-modals allow-scripts" frameborder="0"></iframe>

