Simplify regex code in C#: Add a space between a digit/decimal and unit

Time:01-14

I have some regex code written in C# that adds a space between a number and a unit, with a few exceptions:

dosage_value = Regex.Replace(dosage_value, @"(\d)\s+", @"$1");
dosage_value = Regex.Replace(dosage_value, @"(\d)%\s+", @"$1%");
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d+)?)", @"$1 ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+%", @"$1% ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+:", @"$1:");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+e", @"$1e");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+E", @"$1E");

Example:

10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05

should become

10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05

The exceptions are %, E, e and :. I have tried, but my regex knowledge is not top-notch. Could someone help me reduce this code while keeping the same expected results?

Thank you!

CodePudding user response:

For your example data, you might use 2 capture groups where the second group is in an optional part.

In the callback of the replace, check whether capture group 2 exists. If it does, use it in the replacement; otherwise add a space.

(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
  • ( Capture group 1
    • \d+(?:\.\d+)? Match 1+ digits with an optional decimal part
  • ) Close group 1
  • (?: Non-capture group, to match as a whole
    • \s*([%:eE]) Match optional whitespace chars, and capture one of % : e E in group 2
  • )? Close the non-capture group and make it optional

.NET regex demo

using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] strings = new string[]
{
    "10ANYUNIT",
    "10:something",
    "10 : something",
    "10 %",
    "40 e-5",
    "40 E-05",
};
string pattern = @"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s => 
    Regex.Replace(
        s, pattern, m => 
        // Keep the special character if group 2 matched, otherwise add a space.
        m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
    )
);

Array.ForEach(result.ToArray(), Console.WriteLine);

Output

10 ANYUNIT
10:something
10: something
10%
40e-5 
40E-05

In .NET, \d can also match digits from other scripts, \s can also match a newline, and the match might start in the middle of a longer word. A somewhat more precise pattern is:

\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
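
To see what the stricter pattern changes in practice, here is a minimal sketch that plugs it into the same replacement callback. The helper name `AddSpaceAfterNumber` and the extra inputs are made up for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Stricter pattern from the answer: ASCII digits only, a word boundary at the
// start, and only spaces/tabs (no newlines) allowed before % : e E.
const string StrictPattern = @"\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?";

// Hypothetical helper name, not part of the original answer.
string AddSpaceAfterNumber(string input) =>
    Regex.Replace(
        input,
        StrictPattern,
        // Keep the special character when group 2 matched, otherwise add a space.
        m => m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " "));

Console.WriteLine(AddSpaceAfterNumber("10ANYUNIT"));   // 10 ANYUNIT
Console.WriteLine(AddSpaceAfterNumber("10 %"));        // 10%
Console.WriteLine(AddSpaceAfterNumber("2.5mg"));       // 2.5 mg

// \d accepts any Unicode decimal digit (here ARABIC-INDIC DIGIT THREE),
// while [0-9] accepts ASCII digits only.
Console.WriteLine(Regex.IsMatch("\u0663", @"\d"));     // True
Console.WriteLine(Regex.IsMatch("\u0663", "[0-9]"));   // False
```

Note that the digits after an exponent sign are matched as a number of their own, so `40 e-5` comes out with a trailing space; calling `TrimEnd()` on the result is one way to deal with that if it matters.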

CodePudding user response:

I think you need something like this:

dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", @"$1$3 ");

Group 1 - (\d+(\.\d*)?)

Any number, like 123 or 1241.23.

Group 3 (the $3 in the replacement) - ((E|e|%|:)+)

One or more of the special symbols E, e, % and :.

Group 1 and Group 3 can be separated by any amount of whitespace.

If this doesn't work the way you expect, please provide some samples to test.
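
One way to collect such samples is to run the inputs from the question through the one-line replacement and print each result. This is only a sketch: the local function name `Apply` is an assumption, and the pattern and replacement are the ones from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern and replacement taken from the answer above.
// "Apply" is a hypothetical name used only for this check.
string Apply(string dosageValue) =>
    Regex.Replace(dosageValue, @"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", @"$1$3 ");

string[] samples =
{
    "10ANYUNIT",
    "10:something",
    "10 : something",
    "10 %",
    "40 e-5",
    "40 E-05",
};

// Print input and output side by side so that differences from the
// expected results in the question are easy to spot.
foreach (var sample in samples)
{
    Console.WriteLine($"'{sample}' -> '{Apply(sample)}'");
}
```

With this pattern, `10ANYUNIT` is left unchanged, because a match requires one of `E`, `e`, `%` or `:` after the number, and `40 e-5` becomes `40e -5`, because the replacement always appends a space after the symbol.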

CodePudding user response:

To me this is too complex to handle with a single regex, so I suggest splitting it into separate checks. The code example below uses four different regexes; the first is described in detail, and the rest can be deduced from that explanation :)

using System.Text.RegularExpressions;

var testStrings = new string[]
{
    "10mg",
    "10:something",
    "10  :   something",
    "10 %",
    "40 e-5",
    "40 E-05",
};

foreach (var testString in testStrings)
{
    Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}


string RegexReplace(string input)
{
    // First look for exponential notation.
    // Pattern is: match zero or more whitespaces \s*
    // Then match one or more digits and store them in the first capturing group (\d+)
    // Then match one or more whitespaces again.
    // Then match the exponent part ([eE][-+]?\d+) and store it in the second capturing group.
    // It will match a lower- or uppercase 'e' with an optional (due to the ? operator) minus/plus sign and one or more digits.
    // Then match zero or more whitespaces.
    var expForMatch = Regex.Match(input, @"\s*(\d+)\s+([eE][-+]?\d+)\s*");
    if(expForMatch.Success)
    {
        return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
    }

    var matchWithColon = Regex.Match(input, @"\s*(\d+)\s*:\s*(\w+)");
    if (matchWithColon.Success)
    {
        return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
    }

    var matchWithPercent = Regex.Match(input, @"\s*(\d+)\s*%");
    if (matchWithPercent.Success)
    {
        return $"{matchWithPercent.Groups[1].Value}%";
    }

    var matchWithUnit = Regex.Match(input, @"\s*(\d+)\s*(\w+)");
    if (matchWithUnit.Success)
    {
        return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
    }

    return input;
}

Output is

Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10  :   something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'