Is there a better way to clean a string?

Time:01-06

Currently, this is my code.

function clean_string(raw_string) {
    var A =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 1234567890".split(
            ""
        );
    var cleaned_string = raw_string.toLowerCase();
    for (var i = 0; i < cleaned_string.length; i++) {
        if (!A.includes(cleaned_string[i])) {
            cleaned_string = setCharAt(cleaned_string, i, " ");
        }
    }
    cleaned_string = cleaned_string.replace(/\s\s+/g, " ");

    return cleaned_string;
}

function setCharAt(str, index, chr) {
    if (index > str.length - 1) return str;
    return str.substring(0, index) + chr + str.substring(index + 1);
}

I don't know regex and it'll probably be easier with regex. Here's what I want to do:

Input: Hello, David World 123!

Output: hello david world 123

.

Input: hELlo., <>;dAVId world .;- 123

Output: hello david world 123

.

Input: He.llo David, w!orld 123#

Output: he llo david w orld 123

.

Basically what I want to do is replace anything but a-z0-9 with a space and then remove double spaces. In other words, I only want a-z0-9 in my results. How can I do that?

P.S. The code works but I think it looks bad and pretty inefficient.

EDIT: Sorry, I meant I only want lowercase letters in my output. I'm dumb.

CodePudding user response:

A simple solution would be to convert all characters to lowercase, replace any character that isn't a-z, 0-9, or a space with a space character, and then replace multiple space characters with a single space character.

function sanitize(input) {
    return input
      .toLowerCase()
      .replace(/([^a-z\d\s]+)/g, ' ')
      .replace(/(\s+)/g, ' ');
}

console.log(sanitize('Hello, David World 123!'));
console.log(sanitize('hELlo.,     <>;dAVId  world  .;- 123'));
console.log(sanitize('He.llo     David,   w!orld 123#'));
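When the input ends in punctuation, replacing that punctuation with a space leaves a trailing space, which the expected outputs in the question do not have. The following is a minimal sketch of a variant that trims the result. The name `sanitizeTrimmed` is hypothetical, and the sketch assumes that leading and trailing spaces are unwanted. Because whitespace is itself non-alphanumeric, a single pattern can cover both replacement steps.

```javascript
// Hypothetical variant: lowercase, collapse every run of characters
// outside a-z and 0-9 (including whitespace) into one space, then trim.
function sanitizeTrimmed(input) {
    return input
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, ' ')
      .trim();
}

console.log(sanitizeTrimmed('Hello, David World 123!'));              // hello david world 123
console.log(sanitizeTrimmed('hELlo.,     <>;dAVId  world  .;- 123')); // hello david world 123
console.log(sanitizeTrimmed('He.llo     David,   w!orld 123#'));      // he llo david w orld 123
```

Lowercasing first means the character class only needs `a-z`, so uppercase letters never reach the replacement step.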

CodePudding user response:

Here is one approach using a regex callback:

var inputs = ["Hello, David World 123!", "hELlo.,     <>;dAVId  world  .;- 123", "He.llo     David,   w!orld 123#"];
for (var i=0; i < inputs.length; ++i) {
    var input = inputs[i];
    input = input.replace(/\w+/g, x => x.toLowerCase())
                 .replace(/[\W_]+/g, " ");
    console.log(input);
}

The strategy here is to do two regex replacements. The first finds every word in the input and converts it to lowercase. The second replaces each run of non-word characters (which includes whitespace) and underscores with a single space.
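The underscore needs separate handling because `\w` matches letters, digits, and the underscore, so `\W` alone leaves underscores in place. A small sketch of the difference follows. The helper names `keepsUnderscore` and `dropsUnderscore` are made up for illustration.

```javascript
// \w is equivalent to [A-Za-z0-9_], so "_" counts as a word character.
console.log(/\w/.test("_")); // true

// Hypothetical helper: \W+ alone never matches the underscore.
function keepsUnderscore(s) {
    return s.replace(/\W+/g, " ");
}

// Hypothetical helper: adding "_" to the class replaces it as well.
function dropsUnderscore(s) {
    return s.replace(/[\W_]+/g, " ");
}

console.log(keepsUnderscore("snake_case value")); // snake_case value
console.log(dropsUnderscore("snake_case value")); // snake case value
```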

CodePudding user response:

A simple regex to replace non-alphanumeric characters, followed by another to collapse runs of two or more spaces, should do the trick.

const clean = (input) => {
  const alphanumeric = input.replace(/[^a-zA-Z0-9]/g, ' ')
  const spaceless = alphanumeric.replace(/\s{2,}/g, ' ')
  
  console.log(spaceless.toLowerCase())
  return spaceless.toLowerCase()
}

clean("Hello, David World 123!")
clean("hELlo.,     <>;dAVId  world  .;- 123")
clean("He.llo     David,   w!orld 123#   ")

Of course, the function can be shortened to:

const clean = (input) => {
  return input.replace(/[^a-zA-Z0-9]/g, ' ').
               replace(/\s{2,}/g, ' ').
               toLowerCase()
}

Regex Explanation:

[^a-zA-Z0-9]: Match any single character that is not in a-z, A-Z, or 0-9 (anything non-alphanumeric)

\s{2,}: Match a run of two or more whitespace characters in a row
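To see what each pattern contributes, the two replacements can be applied one at a time. This is only an illustrative sketch: the sample string and the names `sample`, `step1`, and `step2` are not from the answer above.

```javascript
const sample = "Hello, World!";

// Step 1: every non-alphanumeric character becomes one space,
// so ", " turns into two spaces and "!" into a trailing space.
const step1 = sample.replace(/[^a-zA-Z0-9]/g, " ");

// Step 2: runs of two or more whitespace characters collapse to one space.
const step2 = step1.replace(/\s{2,}/g, " ");

// JSON.stringify makes the spaces visible in the console.
console.log(JSON.stringify(step1)); // "Hello  World "
console.log(JSON.stringify(step2)); // "Hello World "
```

The single trailing space survives step 2 because `\s{2,}` only matches runs of at least two whitespace characters.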
