Match multiple strings by similarity or by dissimilarity (python)

Time:01-14

Say you have a list of strings that all have the same length. You want to match every string with, say, the 1 or 2 other strings that are most similar to it (sharing the same character at the same position) or least similar to it (not sharing a character at the same position).

CodePudding user response:

This is not the most efficient way, but you can get the values that appear in both of two lists (exact matches only) with a set intersection:

>>> list_1 = ["hello", "world", "today is a good day", "have a nice day"]
>>> list_2 = ["cats", "dogs", "today is a good day", "have a nice day"]
>>> set(list_1) & set(list_2)
{'today is a good day', 'have a nice day'}

If the order is important, you can use a list comprehension instead. Note that this only compares elements that sit at the same index in both lists:

>>> list_1 = ["hello", "world", "today is a good day", "have a nice day"]
>>> list_2 = ["cats", "dogs", "today is a good day", "have a nice day"]
>>> print([i for i, j in zip(list_1, list_2) if i == j])
['today is a good day', 'have a nice day']

CodePudding user response:

It depends on what you mean by "similar". I'd say two strings such as 'abcdefg' and 'gabcdef' are very similar, but under your definition they are completely different.

Here is some code that implements your idea.

The function most_similar_index returns the indices of the n strings in a list that are most similar to a given string:
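To make the difference concrete, here is a small sketch that compares the position-by-position count from the question with a shift-tolerant measure. Using difflib.SequenceMatcher from the standard library is my own choice for illustration, and positional_similarity is a hypothetical helper name; neither comes from the question.

```python
from difflib import SequenceMatcher

def positional_similarity(a, b):
    # Count positions where both strings hold the same character.
    return sum(x == y for x, y in zip(a, b))

a, b = 'abcdefg', 'gabcdef'

# The question's definition: no position lines up, so the count is 0.
positional = positional_similarity(a, b)

# A shift-tolerant ratio between 0 and 1; it is high here because
# 'abcdef' appears in both strings, only at a different offset.
ratio = SequenceMatcher(None, a, b).ratio()

print(positional, round(ratio, 3))
```

Which of the two is appropriate depends on whether a shifted copy should count as a match in your data.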

import numpy as np

def similarity(str1, str2):
    # number of positions at which both strings hold the same character
    return sum(c1 == c2 for c1, c2 in zip(str1, str2))

def most_similar_index(list_string, s, n):
    """
    list_string : list of all strings of same size
    s : string of same size as all of those in list_string
    n : number of indices to return

    returns indices of the n closest strings to the given string
    """

    # one similarity score per string in the list
    scores = np.array([similarity(s, string) for string in list_string])

    # argsort is ascending, so reverse it and keep the first n indices
    return np.argsort(scores)[::-1][:n]

Result:

>>> list_string = ['abcde', 'abcdf', 'xbcde', 'xeeee', 'aeeef']
>>> s = 'abcff'
>>> most_similar_index(list_string, s, 3)
array([1, 0, 4], dtype=int64)
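
The function above handles one query string and only the most similar case. Below is a rough sketch of how the same idea could be extended to match every string in the list against all the others, for either the most or the least similar ones. The names similarity_matrix and match_all are hypothetical, and breaking ties by lower index is an assumption of mine, not something the question specifies.

```python
import numpy as np

def similarity_matrix(strings):
    """Count matching positions for every pair of equal-length strings."""
    # One row per string, one column per character position.
    chars = np.array([list(s) for s in strings])
    # Broadcast every row against every other row, then count equal positions.
    return (chars[:, None, :] == chars[None, :, :]).sum(axis=2)

def match_all(strings, n=2, most_similar=True):
    """For each string, return the indices of its n most (or least) similar peers."""
    scores = similarity_matrix(strings).astype(float)
    # Push each string's comparison with itself to the end of the ordering.
    np.fill_diagonal(scores, -np.inf if most_similar else np.inf)
    # Stable sort so that ties go to the lower index (an assumed tie rule).
    key = -scores if most_similar else scores
    order = np.argsort(key, axis=1, kind="stable")
    return order[:, :n]

list_string = ['abcde', 'abcdf', 'xbcde', 'xeeee', 'aeeef']
closest = match_all(list_string, n=2)
farthest = match_all(list_string, n=2, most_similar=False)
print(closest)
print(farthest)
```

Row i of each result holds the indices matched to list_string[i]. The full pairwise comparison needs memory proportional to (number of strings)² × string length, so for a large list it may be better to process the strings in batches.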