Remove rows containing partial columns matches pandas-CodePudding

This is my first attempt at posting a question. Sorry if it is not well described.

I have lagged/shifted datetime features using pandas and have created a correlation matrix of the results. I then output a dataframe of the top (90)% correlations and displayed them similar to shown below.

The data appears like this where the _number in c1 c2 is the lag/shift period. (_number = _lag/shift)

    c1      c2      %corr
0   Atype   Atype   1.000000
1   Ctype   Dtype   0.96
2   Btype_3 Etype_3 0.96
3   Btype   Etype   0.95
4   Ctype_1 Atype_2 0.93
5   Atype_1 Atype_1 0.93
6   Dtype_2 Etype_4 0.92
7   type_1  Atype_5 0.91

I am only interested in the values that are lagged/shifted. I wish to eliminate everything that is of the same lag/shift (including the 0 value - without any lag). To appear like this

    c1      c2      %corr
0   Ctype_1 Atype_2 0.93
1   Dtype_2 Etype_4 0.92
2   Etype_1 Atype_5 0.91

How can this be done?

CodePudding user response：

You can extract the lag with str.extract for each column and slice the rows with different values:

lag1 = df['c1'].str.extract('_(.*)', expand=False).fillna('')
lag2 = df['c2'].str.extract('_(.*)', expand=False).fillna('')

df2 = df[lag1 != lag2]

output:

        c1       c2  %corr
4  Ctype_1  Atype_2   0.93
6  Dtype_2  Etype_4   0.92
7   type_1  Atype_5   0.91

Alternative using str.replace:

lag1 = df['c1'].str.replace('^[^_] ', '', regex=True)
lag2 = df['c2'].str.replace('^[^_] ', '', regex=True)

df[lag1 != lag2]

NB. note that depending on your data, it might be more efficient to act upstream and to only compute the correlations for data of different lags