How to find and remove first/starting word from an Arabic string having diacritics but maintaining t

Time:01-22

The aim is to find and remove a starting word from an Arabic string that may or may not carry diacritics, while preserving any and all diacritics in the remaining string.

There are many answers for removing the first/starting word from an English string on StackOverflow, but there is no existing solution to this problem found on StackOverflow that maintains the balance of the Arabic string in its original form.

If the original string is normalized (removing the diacritics, tanween, etc.) before processing it, then the remaining string returned will be the balance of the normalized string, not the balance of the original string.
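To illustrate the point, here is a minimal sketch that normalizes first and slices afterwards. The helper name stripMarks is hypothetical and not from the original post, and the sketch assumes an engine that supports Unicode property escapes (ES2018). The remainder it produces has lost the diacritics of the original string.

```javascript
// Hypothetical helper: removes every combining mark (Unicode category M),
// which covers the Arabic harakat and tanween.
const stripMarks = text => text.replace(/\p{M}+/gu, '');

const original = 'السَلام عليكمُ ورحمةُ الله';
const word = 'السلام';

// Normalize first, then slice: the slice is taken from the normalized copy.
const normalized = stripMarks(original);
const balance = normalized.startsWith(word)
  ? normalized.slice(word.length)
  : normalized;

// The balance no longer contains any of the marks present in `original`.
console.log(balance);
console.log(/\p{M}/u.test(balance)); // false
```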

Example. Assume the following original string which can be in any of the following forms (i.e. the same string but different diacritics):

1. "السلام عليكم ورحمة الله"

2. "السَلام عليكمُ ورحمةُ الله"

3. "السَلامُ عَليكمُ ورَحمةُ الله"

4. "السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله"

Now we want to remove the first/starting word "السلام" only if the string starts with that word (which it does), and return the balance of the "original" string with its original diacritics.

Of course, we are looking for the word "السلام" without diacritics because we don't know how the original string is formatted with diacritics.

So, in this case, the returned balance of each string must be:

1. " عليكم ورحمة الله"

2. " عليكمُ ورحمةُ الله"

3. " عَليكمُ ورَحمةُ الله"

4. " عَلَيْكُمُ وَرَحْمَةُ الله"

The following code works for an English string (there are many other solutions) but not for an Arabic string as explained above.

function removeStartWord(string, word) {
  if (string.startsWith(word)) string = string.slice(word.length);
  return string;
}

The above code slices the first word off the original string based on the word's length, which works fine for English text.

For an Arabic string, we don't know which diacritics the original string carries, so the length of the word as it appears in the original string differs from the length of the search word and is unknown.
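One way to cope with the unknown length, sketched here under stated assumptions rather than taken from the original post, is to build a regular expression from the undecorated word that tolerates any number of combining marks (\p{M}) after each base letter. The function name removeStartWordKeepingMarks is hypothetical. The sketch assumes the diacritics are stored as separate combining code points and that the engine supports Unicode property escapes.

```javascript
// Hypothetical helper, not part of the original question.
function removeStartWordKeepingMarks(text, word) {
  // Keep only the base letters of the search word.
  const baseLetters = [...word].filter(ch => !/\p{M}/u.test(ch));
  if (baseLetters.length === 0) return text;

  // Escape regex syntax characters so the word is matched literally.
  const escapeForRegex = ch => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

  // Each base letter may be followed by zero or more combining marks.
  // The lookahead rejects a longer word that merely begins with `word`.
  const pattern = new RegExp(
    '^' +
    baseLetters.map(ch => escapeForRegex(ch) + '\\p{M}*').join('') +
    '(?![\\p{L}\\p{M}])',
    'u'
  );

  // Only the matched prefix is removed; the rest keeps its diacritics.
  return text.replace(pattern, '');
}

console.log(removeStartWordKeepingMarks('السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله', 'السلام'));
console.log(removeStartWordKeepingMarks('أهلا ومرحبا', 'السلام'));
```

Because the match is anchored with ^ and only the prefix is replaced, the separator after the word stays in place, which is what the expected results listed above call for.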

CodePudding user response:

Working with regex Unicode property escapes might already be good enough for what the OP is looking for, though JavaScript does not support the shorthand script form \p{Arabic}; a script has to be spelled out, as in \p{Script=Arabic}.

A category-based pattern like /^[\p{L}\p{M}]+\p{Z}+/gmu together with replace already does exactly what the OP asked for ...

find and remove first starting word from an arabic string having diacritics

The pattern ... ^[\p{L}\p{M}]+\p{Z}+ ... reads like this ...

  • ^... starting at the beginning of a new line ...
  • [ ... ]+ ... find at least one character of the specified character class ...
    • \p{L} ... either any kind of Letter from any language,
    • \p{M} ... or a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
  • ... followed by \p{Z}+ ... at least one separator character (a space, line separator or paragraph separator; tab and line feed belong to a different category).
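The character classes used in the pattern can be checked in isolation. The following is a small sketch showing which Unicode general category a few characters fall into. The variable names are illustrative, and the code points are written as escapes so that the result does not depend on how the text was copied.

```javascript
const seen = '\u0633';   // ARABIC LETTER SEEN
const fatha = '\u064E';  // ARABIC FATHA, a combining mark
const shadda = '\u0651'; // ARABIC SHADDA, a combining mark

console.log(/\p{L}/u.test(seen));    // true: a letter
console.log(/\p{L}/u.test(fatha));   // false: not a letter
console.log(/\p{M}/u.test(fatha));   // true: a mark
console.log(/\p{M}/u.test(shadda));  // true: a mark
console.log(/\p{Z}/u.test(' '));     // true: SPACE is a separator
console.log(/\p{Z}/u.test('\n'));    // false: LINE FEED is a control character

// A letter followed by its marks is consumed as one run by [\p{L}\p{M}]+.
console.log(/^[\p{L}\p{M}]+$/u.test(seen + shadda + fatha)); // true
```

Since \p{Z} does not match a line feed, the separator part of the pattern cannot run past the end of a line in multiline input.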

console.log(`السلام عليكم ورحمة الله
السَلام عليكمُ ورحمةُ الله
السَلامُ عَليكمُ ورَحمةُ الله
السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله`.replace(/^[\p{L}\p{M}]+\p{Z}+/gmu, ''));

Edit

Since it is now clear what the OP really wants, the above approach stays and is extended with a replacer function whose comparison logic is based on an Intl.Collator object, which takes Arabic and base-letter comparison into account.

The collator is initialized as leniently as possible by passing, in addition to the 'ar' locale, an option that requests base sensitivity. Thus, when two similar (but not quite equal) strings such as 'السلام' and 'السَّلَامُ' are compared via the collator's compare method, they are considered equal despite the latter carrying (a lot of) diacritics.

proof / examples ...

const baseLetterCollator = new Intl.Collator('ar', { sensitivity: 'base' } );

console.log(
  "('السلام عليكم ورحمة الله' === 'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') ?..",
  ('السلام عليكم ورحمة الله' === 'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله')
);
console.log('\n');

console.log(`new Intl.Collator()
  .compare('السلام عليكم ورحمة الله' ,'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') === 0

  ?..`,
  new Intl.Collator()
    .compare('السلام عليكم ورحمة الله' ,'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') === 0
);
console.log(`new Intl.Collator('ar', { sensitivity: 'base' } )
  .compare('السلام عليكم ورحمة الله' ,'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') === 0

  ?..`,
  new Intl.Collator('ar', { sensitivity: 'base' } )
    .compare('السلام عليكم ورحمة الله' ,'السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله') === 0
);

Based on all of the above ... the final solution ...

function removeFirstMatchingWordFromEveryNewLine(search, multilineString) {
  const baseLetterCollator
    // - [ar]abic
    // - base sensitivity
    //   ... only strings that differ in base letters compare as unequal.
    = new Intl.Collator('ar', { sensitivity: 'base' } );

  const replacer = word => {
    return (baseLetterCollator.compare(search, word.trim()) === 0)
      ? ''    // - remove the matching word (whitespace included).
      : word; // - keep the word since there was no match. 
  }
  const regXFirstLineWord = /^[\p{L}\p{M}]+\p{Z}+/gmu;

  search = String(search).trim();

  return String(multilineString).replace(regXFirstLineWord, replacer);  
}
const sampleData = `السلام عليكم ورحمة الله
السَلام عليكمُ ورحمةُ الله
أهلا ومرحبا
السَلامُ عَليكمُ ورَحمةُ الله
السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله`;

console.log('sampleData ...', sampleData);
console.log(
  "removeFirstMatchingWordFromEveryNewLine('السلام', sampleData) ...",
  removeFirstMatchingWordFromEveryNewLine('السلام', sampleData)
);
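Note that the replacer above removes the separator together with the word, whereas the expected results in the question keep the leading space. A possible variation, sketched under the assumption that the separator should be preserved, captures the word and the separator in separate groups. The function name and the capture-group layout are illustrative and not the answerer's.

```javascript
// Hypothetical variant of the function above that keeps the separator.
function removeFirstMatchingWordKeepSeparator(search, multilineString) {
  const baseLetterCollator = new Intl.Collator('ar', { sensitivity: 'base' });
  // Group 1: the first word of a line; group 2: the separator(s) after it.
  const regXFirstLineWord = /^([\p{L}\p{M}]+)(\p{Z}+)/gmu;
  const needle = String(search).trim();

  return String(multilineString).replace(
    regXFirstLineWord,
    (match, word, separator) =>
      baseLetterCollator.compare(needle, word) === 0
        ? separator // drop the word, keep the whitespace that followed it
        : match     // no match: leave the line untouched
  );
}

console.log(
  removeFirstMatchingWordKeepSeparator('السلام', 'السَّلَامُ عَلَيْكُمُ\nأهلا ومرحبا')
);
```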

CodePudding user response:

I don't see what is wrong in your code, but here is another approach:

function removeStartWord(string, word) {
  // Strip everything except Latin and Arabic base letters from the first word before comparing it.
  return string.split(' ').filter((_word, index) => index !== 0 || _word.replace(/[^a-zA-Zء-ي]+/g, '') !== word).join(' ');
}

const sampleData = `السلام عليكم ورحمة الله
السَلام عليكمُ ورحمةُ الله
أهلا ومرحبا
السَلامُ عَليكمُ ورَحمةُ الله
السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله`;

console.log('sampleData ...', sampleData);
console.log(
  "removeStartWord(sampleData, 'السلام') ...",
  removeStartWord(sampleData,'السلام')
);

console.log(
  "removeStartWord('السلام عليكم ورحمة الله', 'السلام') ...",
  removeStartWord('السلام عليكم ورحمة الله', 'السلام')
);
console.log(
  "removeStartWord('السَلام عليكمُ ورحمةُ الله', 'السلام') ...",
  removeStartWord('السَلام عليكمُ ورحمةُ الله', 'السلام')
);
console.log(
  "removeStartWord('أهلا ومرحبا', 'السلام') ...",
  removeStartWord('أهلا ومرحبا', 'السلام')
);