Match multiple strings by similarity or by dissimilarity (Python)

Say you have a list of strings, all of the same length. You want to match every string with, say, 1 or 2 other strings that are most similar (sharing the same character at the same position) or least similar (not sharing a character at the same position).

CodePudding user response：

This only finds exact matches rather than similar strings, and it is not the most efficient way, but you can get the values common to two lists like this:

>>> list_1 = ["hello", "world", "today is a good day", "have a nice day"]
>>> list_2 = ["cats", "dogs", "today is a good day", "have a nice day"]
>>> set(list_1) & set(list_2)
{'today is a good day', 'have a nice day'}

If order matters, you can use a list comprehension. Note that zip pairs elements by index, so this only finds strings that are equal at the same position in both lists:

>>> list_1 = ["hello", "world", "today is a good day", "have a nice day"]
>>> list_2 = ["cats", "dogs", "today is a good day", "have a nice day"]
>>> print([i for i, j in zip(list_1, list_2) if i == j])
['today is a good day', 'have a nice day']

CodePudding user response：

It depends on what you mean by "similar". I'd say two strings such as 'abcdefg' and 'gabcdef' are very similar, but under your definition they are completely different.
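To make the difference concrete, here is a small sketch that scores that pair both ways. The helper name positional_similarity is made up for illustration; difflib.SequenceMatcher is from the standard library and is only one possible alternative notion of similarity, not something the question asked for.

```python
from difflib import SequenceMatcher


def positional_similarity(str1, str2):
    # Hypothetical helper: counts positions where both strings
    # hold the same character (the definition used in the question).
    return sum(a == b for a, b in zip(str1, str2))


a, b = 'abcdefg', 'gabcdef'

# Position by position, no character lines up.
positional = positional_similarity(a, b)

# SequenceMatcher looks for common subsequences instead,
# so the shared run 'abcdef' counts heavily.
ratio = SequenceMatcher(None, a, b).ratio()

print(positional, round(ratio, 3))
```

Which measure is appropriate depends on whether a shift of the whole string should count as a small change or a large one.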

Here is some code that implements your idea.

The function most_similar_index returns the indices of the n strings in a list that are most similar to a given string.

import numpy as np

def similarity(str1, str2):
    # Count the positions where both strings have the same character.
    return sum(a == b for a, b in zip(str1, str2))

def most_similar_index(list_string, s, n):
    """
    list_string : list of all strings of same size
    s : string of same size as all of those in list_string
    n : number of indices to return

    returns indices of the n closest strings to the given string
    """
    
    # Score every candidate against s.
    scores = np.array([similarity(s, string) for string in list_string])

    # argsort is ascending, so read the last n indices in reverse order.
    return np.argsort(scores)[-1:-n-1:-1]

Result:

>>> list_string = ['abcde', 'abcdf', 'xbcde', 'xeeee', 'aeeef']
>>> s = 'abcff'
>>> most_similar_index(list_string, s, 3)
array([1, 0, 4], dtype=int64)
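The function above ranks a list against one given string. The question asks to match every string in the list with its most or least similar neighbours, so here is a minimal sketch of one way to extend the idea. The name match_all and its parameters are assumptions made for illustration, not part of the original answer. It uses plain Python sorting, which is stable, so ties keep their original list order.

```python
def similarity(str1, str2):
    # Count the positions where both strings have the same character.
    return sum(a == b for a, b in zip(str1, str2))


def match_all(strings, n=2, most_similar=True):
    # Hypothetical helper: maps each index to the indices of the n other
    # strings that are most (or least) similar to it.
    matches = {}
    for i, s in enumerate(strings):
        # Never match a string with itself.
        others = [j for j in range(len(strings)) if j != i]
        ranked = sorted(others,
                        key=lambda j: similarity(s, strings[j]),
                        reverse=most_similar)
        matches[i] = ranked[:n]
    return matches


list_string = ['abcde', 'abcdf', 'xbcde', 'xeeee', 'aeeef']
closest = match_all(list_string, n=2)
farthest = match_all(list_string, n=2, most_similar=False)

print(closest[0])   # [1, 2]
print(farthest[0])  # [3, 4]
```

This compares every pair, so the cost grows quadratically with the number of strings. That is fine for short lists but may need a vectorised or indexed approach for large ones.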