I have two dataframe
One with codes and values need to map to other dataframe
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
"Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
Other datframe content lots of data with values and need to make a new column base on dataframe "B" column "Code".
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
"Window Glass trim",'Frame Window',"glass Window"],
'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})
A.loc[:,'NewName'] = A['Name']
So I'm using the below code to replace A['NewName'].
A['NewName']= A['NewName'].replace(B.set_index('Value')['Code'])
Test Name Value NewName
0 2.0000 House with indoor pool 50 a
1 34.0000 House with Gray_C_Door 100 b
2 12.0000 House with indoor pool and Porch 70 House with indoor pool and Porch
3 45.0000 Wood Raling 20 b
4 NaN Window Glass Tinted 15 Window Glass Tinted
5 34.0000 Windows Glass_with 75 Windows Glass_with
6 56.0000 Big Chandelier 50 c
7 23.0000 Frame Window 10 e
8 56.0000 NaN 10 NaN
9 87.0000 Window glass 34 Window glass
10 23.0000 House with indoor pool 5 a
11 67.0000 High column with 56 High column with
12 89.0000 Window Glass trim 12 c
13 123.000 Frame Window 83 e
14 NaN glass Window NaN glass Window
However, some A['NewName'] are not matching with B['Value'] and doesn't give an exact expected outcome.
Is there a way, I can match those values when It has partial matching with A['NewName'] and give the correct code? I mean for instance when A['NewName'] has "House with indoor pool and Porch" I want to match it with B['Value'] = 'House with indoor pool' and replace it with correct B['Code] = 'a'. I couldn't add that to the data frame B Values column because there are several ways it could change after "House with indoor pool" (for ex: "House with indoor pool_ with big glass door", "House with indoor pool and High railings" etc.)
Is this possible to do it in a map/replace function or any other method?
Thanks in advacne!
CodePudding user response:
Here's a way to do it using difflib. We effectively do the outer product of comparisons of Name and Value for which there's no exact match, to find the best fit (according to difflib) before making a selection.
To flag really poor matches, we can set a threshold for the difflib match ratio below which we return NaN. I have selected 0.5 in the code below, which catches the sample input Name "I don't match anything".
import pandas as pd
import numpy as np
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
"Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan,333],
'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
"Window Glass trim",'Frame Window',"glass Window","I don't match anything"],
'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan,'999']})
from difflib import SequenceMatcher
bValues = B['Value'].to_list()
bValuesDict = {value : idx for idx, value in enumerate(bValues)}
columnCode = B.columns.to_list().index('Code')
def getCode(x):
name = x['Name']
if not isinstance(name, str):
return np.nan
if name in bValuesDict:
idx = bValuesDict[name]
else:
bMatches = [SequenceMatcher(None, name, value).ratio() for value in bValues]
maxRatio = max(bMatches)
if maxRatio < 0.5:
return np.nan
idx = bMatches.index(maxRatio)
code = B.iloc[idx, columnCode]
return code
A['NewName'] = A.apply(getCode, axis=1)
print(A)
Output:
Test Name Value NewName
0 2.0 House with indoor pool 50 a
1 34.0 House with Gray_C_Door 100 b
2 12.0 House with indoor pool and Porch 70 a
3 45.0 Wood Raling 20 b
4 NaN Window Glass Tinted 15 c
5 34.0 Windows Glass_with 75 a
6 56.0 Big Chandelier 50 c
7 23.0 Frame Window 10 e
8 56.0 NaN 10 NaN
9 87.0 Window glass 34 a
10 23.0 House with indoor pool 5 a
11 67.0 High column with 56 b
12 89.0 Window Glass trim 12 c
13 123.0 Frame Window 83 e
14 NaN glass Window NaN e
15 333.0 I don't match anything 999 NaN
CodePudding user response:
You can use re (regular expression) in your own function and apply this function to A['Name'] (btw initializing 'Newname' is useless here):
import re
import pandas as pd
import numpy as np
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
"Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
"Window Glass trim",'Frame Window',"glass Window"],
'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})
B_series = B.set_index('Value')
def get_code_for_matching_value(val):
if type(val)==str:
if val in B_series.index:
return B_series.at[val, 'Code']
else:
for i in B_series.index:
if re.match(i, val):
return B_series.at[i, 'Code']
return val
A['NewName']= A['Name'].apply(get_code_for_matching_value)
print(A)
Output:
Test Name Value NewName
0 2.0 House with indoor pool 50 a
1 34.0 House with Gray_C_Door 100 b
2 12.0 House with indoor pool and Porch 70 a
3 45.0 Wood Raling 20 b
4 NaN Window Glass Tinted 15 a
5 34.0 Windows Glass_with 75 Windows Glass_with
6 56.0 Big Chandelier 50 c
7 23.0 Frame Window 10 e
8 56.0 NaN 10 NaN
9 87.0 Window glass 34 Window glass
10 23.0 House with indoor pool 5 a
11 67.0 High column with 56 High column with
12 89.0 Window Glass trim 12 c
13 123.0 Frame Window 83 e
14 NaN glass Window NaN glass Window
Note: you can improve matching by making it case insensitive (for example): if re.match(i.lower(), val.lower()):
