This is my first attempt at posting a question. Sorry if it is not well described.
I have lagged/shifted datetime features using pandas and have created a correlation matrix of the results. I then output a dataframe of the top (90)% correlations and displayed them similar to shown below.
The data appears like this where the _number in c1 c2 is the lag/shift period. (_number = _lag/shift)
c1 c2 %corr
0 Atype Atype 1.000000
1 Ctype Dtype 0.96
2 Btype_3 Etype_3 0.96
3 Btype Etype 0.95
4 Ctype_1 Atype_2 0.93
5 Atype_1 Atype_1 0.93
6 Dtype_2 Etype_4 0.92
7 type_1 Atype_5 0.91
I am only interested in the values that are lagged/shifted. I wish to eliminate everything that is of the same lag/shift (including the 0 value - without any lag). To appear like this
c1 c2 %corr
0 Ctype_1 Atype_2 0.93
1 Dtype_2 Etype_4 0.92
2 Etype_1 Atype_5 0.91
How can this be done?
CodePudding user response:
You can extract the lag with str.extract for each column and slice the rows with different values:
lag1 = df['c1'].str.extract('_(.*)', expand=False).fillna('')
lag2 = df['c2'].str.extract('_(.*)', expand=False).fillna('')
df2 = df[lag1 != lag2]
output:
c1 c2 %corr
4 Ctype_1 Atype_2 0.93
6 Dtype_2 Etype_4 0.92
7 type_1 Atype_5 0.91
Alternative using str.replace:
lag1 = df['c1'].str.replace('^[^_] ', '', regex=True)
lag2 = df['c2'].str.replace('^[^_] ', '', regex=True)
df[lag1 != lag2]
NB. note that depending on your data, it might be more efficient to act upstream and to only compute the correlations for data of different lags
