Finding an Intersection between two lists or dataframes while enforcing an ordering condition-CodePudding

I have two lists (columns from two separate pandas dataframes) and want to find the intersection of both lists while preserving the order, or ordering based on a condition. Consider the following example:

x = ['0 MO', '1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR',
     '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO', 'Country']
y = ['Industry', '3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
       '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR', '30 YR']

answer = set(x).intersection(y)

The variable answer yields the overlapping columns, yet the order is not preserved. Is there a way of sorting the solution such that the answer yields:

answer = ['3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
          '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR',
          '30 YR']

i.e first sorting the intersected list by month ("MO") and integers, and then by year ("YR") and its integers?

Alternatively, is there a pandas method to obtain the same result with two dataframes of overlapping columns (preserving or stating order)?

CodePudding user response：

You could simply with list comprehensions:

[this_name for this_name in x if this_name in y]

and

[this_name for this_name in y if this_name in x]

CodePudding user response：

I don't know what you are trying to do exactly, but my answer will be for the use case you described. If you want to work with pandas, I think the following code will do what you want. If you have more complex data, I think you might need to change the columns types to timedelta to have more flexibility. The sorting is working in this case because MO is alphabetically before YR.

import pandas as pd
df1 = pd.DataFrame({'x': ['0 MO', '1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR',
     '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO', 'Country']})
df2 = pd.DataFrame({'y': ['Industry', '3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
       '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR', '30 YR']})

# drop 'non-standard' data 
df1["x"] = df1["x"].apply(lambda x: x if x[0].isdigit() else None)
df2["y"] = df2["y"].apply(lambda x: x if x[0].isdigit() else None)
df1.dropna(inplace=True)
df2.dropna(inplace=True)

# make two columns to sort 
df1["value"] = df1["x"].apply(lambda x: int(x[:-2]))
df1["unit"] = df1["x"].apply(lambda x: x[-2:])

df2["value"] = df2["y"].apply(lambda x: int(x[:-2]))
df2["unit"] = df2["y"].apply(lambda x: x[-2:])

# sort by unit and value
df1 = df1.sort_values(by=["unit", "value"]).drop("x", axis=1)
df2 = df2.sort_values(by=["unit", "value"]).drop("y", axis=1)

# merge 
df = pd.merge(df1, df2, on=["unit", "value"])
df["result"] = df.apply(lambda x: str(x["value"])   " "   x["unit"], axis=1)
df.drop(["unit", "value"], axis=1, inplace=True)
df

CodePudding user response：

Use list comprehension to to check if items in x also exist in the set of y. This preserves the order each item appears in x while checking only for membership in y:

y_set = set(y)
answer = [item for item in x if item in y_set]

or use filter to do essentially the same job:

answer = list(filter(lambda i: i in y_set, x))

Output:

['1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR', '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO']