Delete underscores between Japanese characters from a string in Python

I need help deleting underscores from certain strings. That by itself is not difficult; the difficulty is that the strings contain Japanese characters.

E.g. I have these strings (among hundreds of thousands of others):

str1 = "3F_う_が_LOW_まい_が"
str2 = "A5_BB_合_ら"
str3 = "C1_だ_と_思"

What I want to get as a final result is this:

strFinal1 = "3F_うが_LOW_まいが"
strFinal2 = "A5_BB_合ら"
strFinal3 = "C1_だと思"

So essentially I want to delete an underscore only when it sits between two Japanese characters. How can I do this in Python?

CodePudding user response：

You should check built-in function ord:

>>> ord('a')
97
>>> ord('が')
12364

As you can see, ord returns a much higher number for a Japanese character than for an ASCII one, so you can use this difference like this:

# Where i is the index of an _ in the string
if ord(string[i + 1]) > 500 and ord(string[i - 1]) > 500:
    # The _ is between two characters outside the ASCII/Latin-1 range
    pass

This should work:

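chars = list(input())  # a list, so single characters can be removed

# Walk backwards so that deleting an element does not shift the indices
# still to be visited; the first and last positions are skipped because
# they lack a neighbour on one side.
for index in range(len(chars) - 2, 0, -1):
    # The _ is between two characters outside the ASCII/Latin-1 range
    if chars[index] == '_' and ord(chars[index + 1]) > 500 and ord(chars[index - 1]) > 500:
        del chars[index]

string = ''.join(chars)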
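
A quick check of that threshold, as a rough sketch: the sample characters and the above_threshold helper below are my own picks for illustration, not part of the question. It suggests that ord(...) > 500 does separate Japanese from ASCII, but also lets other scripts such as Greek or Cyrillic through, so the test is only safe if the data contains nothing but ASCII and Japanese.

```python
def above_threshold(ch, threshold=500):
    """Hypothetical helper wrapping the ord() > 500 test from the answer."""
    return ord(ch) > threshold

# Sample characters chosen for illustration only.
samples = ["a", "_", "α", "Я", "が", "思"]

for ch in samples:
    # True means the threshold test would treat the character like a Japanese one
    print(ch, ord(ch), above_threshold(ch))
```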

CodePudding user response：

I'm not familiar with the different sets of Japanese characters, but you should be able to identify Japanese characters based on their Unicode code points, which should lie within one of the following ranges:

Hiragana: 3040-309f
Katakana: 30a0-30ff
Kanji: 4e00-9faf

Note that different sources may also include other ranges. The ones I listed should definitely be included, but you should work out which other ranges you also want to cover, and then extend the is_japanese_char function shown below.

import re

def is_japanese_char(ch):
    assert(len(ch) == 1)  # only use this for single character strings
    if re.search("[\u3040-\u309f]", ch):
        return True  # is hiragana
    if re.search("[\u30a0-\u30ff]", ch):
        return True  # is katakana
    if re.search("[\u4e00-\u9faf]", ch):
        return True  # is kanji
    return False
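
If you expect to add more ranges later, a table of ranges may be easier to extend than a chain of regex checks. The following is only a sketch: the names JAPANESE_RANGES and in_japanese_ranges are made up for this example, and the half-width katakana entry is an assumed extra that you may or may not want.

```python
# Inclusive code point ranges treated as Japanese. The first three match the
# function above; the last one is an assumption added for illustration.
JAPANESE_RANGES = [
    (0x3040, 0x309F),  # hiragana
    (0x30A0, 0x30FF),  # katakana
    (0x4E00, 0x9FAF),  # kanji
    (0xFF65, 0xFF9F),  # half-width katakana (assumed extra range)
]

def in_japanese_ranges(ch):
    """Return True if the single character ch falls in one of the ranges."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in JAPANESE_RANGES)
```

Covering another range then only means appending one more tuple to the list.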

Now that you can identify Japanese characters, you can iterate over each character in the string, and remove all unwanted characters, like this:

def is_bad_underscore(ch, prev_ch, next_ch):
    if ch != "_":
        return False
    if not is_japanese_char(prev_ch):
        return False
    if not is_japanese_char(next_ch):
        return False
    return True


def remove_bad_underscores(s):
    if len(s) < 3:
        return s  # too short to hold an underscore between two characters
    new_string = s[0]
    for i, ch in enumerate(s[1:-1], start=1):  # skip first and last
        if not is_bad_underscore(ch, s[i-1], s[i+1]):
            new_string += ch
    return new_string + s[-1]

It's not the cleanest code, and can be optimized, but it works.

print(remove_bad_underscores("3F_う_が_LOW_まい_が") == "3F_うが_LOW_まいが") # True
print(remove_bad_underscores("A5_BB_合_ら") == "A5_BB_合ら") # True
print(remove_bad_underscores("C1_だ_と_思") == "C1_だと思") # True
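
With hundreds of thousands of strings, a single regular expression substitution may be more compact. This is a sketch, not a benchmarked solution: it reuses the same three code point ranges, and the names JAPANESE, BAD_UNDERSCORE and remove_bad_underscores_re are my own.

```python
import re

# Character class built from the same three ranges as is_japanese_char.
JAPANESE = r"[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9faf]"

# An underscore preceded and followed by a Japanese character. The lookbehind
# and lookahead do not consume the neighbours, so chains like だ_と_思 are
# handled in one pass.
BAD_UNDERSCORE = re.compile(rf"(?<={JAPANESE})_(?={JAPANESE})")

def remove_bad_underscores_re(s):
    """Remove every underscore that sits between two Japanese characters."""
    return BAD_UNDERSCORE.sub("", s)

print(remove_bad_underscores_re("3F_う_が_LOW_まい_が"))
```

Because the pattern is compiled once, the same object can be reused for every string in the data set.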

CodePudding user response：

To refine a bit on Alan Verresen’s answer: For slightly-more human-readable code, you can:

- use the third-party regex module rather than the built-in re module
- use Unicode script properties rather than explicitly specifying code-point ranges

import regex

def is_japanese_char(ch):
    assert(len(ch) == 1)  # only use this for single character strings
    # raw strings, so that Python does not treat \p as a string escape
    if regex.search(r"\p{Hiragana}", ch):
        return True  # is hiragana
    if regex.search(r"\p{Katakana}", ch):
        return True  # is katakana
    if regex.search(r"\p{Han}", ch):
        return True  # is kanji
    return False
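
If installing a third-party module is not an option, a rough standard-library stand-in is to look at the official Unicode character names. Note that this is a different technique from script properties, not an equivalent: it is a sketch that assumes the three name prefixes below are the ones you care about, and the names JAPANESE_NAME_PREFIXES and is_japanese_char_by_name are invented for this example.

```python
import unicodedata

# Prefixes of Unicode character names that this sketch treats as Japanese
# (an assumption; extend the tuple if you need more).
JAPANESE_NAME_PREFIXES = ("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH")

def is_japanese_char_by_name(ch):
    """Return True if the Unicode name of ch starts with one of the prefixes."""
    # unicodedata.name returns the second argument for unnamed characters
    return unicodedata.name(ch, "").startswith(JAPANESE_NAME_PREFIXES)
```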

The regex module supports that \p{} syntax but the re module doesn’t yet, as far as I know. For more info on matching other categories of Unicode properties, see also the answers at: