I have a big dataframe with more than 2500 columns but the structure is very similar than this:
A B C D E
0 1 0 8 0 0
1 0 0 0 0 5
2 1 2 3 0 0
3 0 2 0 1 0
I need to detect all the columns with a value greater than 0 and I have done with this:
df['X'] = df.gt(0).dot(df.columns ',')
And then I get this:
A B C D E X
0 1 0 8 0 0 A,C
1 0 0 0 0 5 E
2 1 2 9 0 0 A,B,C
3 0 3 0 1 0 B,D
The problem is that I don't need in 'X' the name of the columns but the value of the columns and I need to perform the following mathematical operation:
I want it to detect only the 2 lowest values and subtract the lowest value from the second lowest. In the case that there were not more than 2 values in 'X' it would be enough for me to show the value of the column only.
The final result in my example would be the following:
A B C D E X
0 1 0 8 0 0 7
1 0 0 0 0 5 5
2 1 2 9 0 0 1
3 0 3 0 1 0 2
Any idea how to solve it or any orientation?
CodePudding user response:
You can use apply and you have to specify axis=1 to apply the function row-wise. I have added this get_diff function without being 100% if that is exactly what you would need. I have also added the assign call to create a new dataframe with a new column X that holds the needed value
def get_diff(in_:pd.Series) -> int | float:
res = in_[in_ != 0].sort_values(ascending=False)
if len(res) == 0:
return 0 # Not sure if this is what you want to do in that case
return res[-2] - res[-1] if len(res) > 1 else res[0]
df = df.assign(X=lambda df: df.apply(get_diff, axis=1))
CodePudding user response:
We can do nsmallest then follow by np.ptp and condition for those row only have one value not equal to 0
df['new'] = df.apply(lambda x : np.ptp(pd.Series.nsmallest(x[x!=0],2)) if sum(x!=0) != 1 else x[x!=0].iloc[0],axis=1)
Out[520]:
0 7
1 5
2 1
3 1
dtype: int64
Or doing two steps
df['new'] = df[df.ne(0).sum(1)>1].apply(lambda x : np.ptp(pd.Series.nsmallest(x,2)),axis=1)
df['new'].fillna(df.max(1),inplace=True)
df
Out[530]:
A B C D E new
0 1 0 8 0 0 7.0
1 0 0 0 0 5 5.0
2 1 2 3 0 0 1.0
3 0 2 0 1 0 1.0
CodePudding user response:
I think you can simply use apply() as you want to do a row operation on each row.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
CodePudding user response:
One option is to identify values greater than 0, before sorting and trimming to get the values for X:
temp = df.where(df.gt(0), np.nan, axis = 0)
# you could use `np.partition` instead
# which should be more efficient
temp = np.sort(temp, axis = 1)[:, :2]
temp = np.nan_to_num(temp)
temp = np.ptp(temp, axis = 1)
df.assign(X = temp)
A B C D E X
0 1 0 8 0 0 7.0
1 0 0 0 0 5 5.0
2 1 2 3 0 0 1.0
3 0 2 0 1 0 1.0
