Second and third largest values within multiple columns in Pandas-CodePudding

How do I find second and third largest values from multiple columns? It's clear how to find max value, min and median, however I cannot extract the second and third largest values as new columns?

import pandas as pd

df = pd.read_csv(...)

df['max'] = df[["A1", "B1", "C1", "D1", "E1", "F1"]].max(axis=1)

df['min'] = df[["A1", "B1", "C1", "D1", "E1", "F1"]].min(axis=1)

# ? 
df['2nd_largest'] = df[["A1", "B1", "C1", "D1", "E1", "F1"]]

CodePudding user response：

I think it's fine enough to sort the dataframe once (wrt rows) and then extract those information. But we need to do the sorting in NumPy domain as it doesn't care about the column labels whilst sorting and so allows "independent" sorting of the rows:

# sort the subset frame over axis 1, i.e., columns change places (if at all)
cols_subset = ["A1", "B1", "C1", "D1", "E1", "F1"]
sorted_values = np.sort(df[cols_subset], axis=1)

# extract information
df["maximum"] = sorted_values[:, -1]  # since ascending sort, max is at the end
df["minimum"] = sorted_values[:, 0]
df["2nd largest"] = sorted_values[:, -2]
df["3rd largest"] = sorted_values[:, -3]

(The index is kept intact during sorting, so we don't worry about alignment issues when assigning.)

An example:

>>> np.random.seed(224)
>>> df = pd.DataFrame(np.random.randint(-5, 22, size=(6, 4)), columns=[*"ABDE"])
>>> df
    A   B   D   E
0   2   9  21   8
1  15   8  10  -2
2   0  18  -3   3
3   2  -3   6  -1
4  -4  17   9  10
5   8  -4  12  16

>>> cols_subset = ["A", "B", "E"]
>>> sorted_values = np.sort(df[cols_subset], axis=1)

# each row sorted
>>> sorted_values 
array([[ 2,  8,  9],
       [-2,  8, 15],
       [ 0,  3, 18],
       [-3, -1,  2],
       [-4, 10, 17],
       [-4,  8, 16]])

# after above assignments for maximum, minimum, ...
>>> df
    A   B   D   E  maximum  minimum  2nd largest  3rd largest
0   2   9  21   8        9        2            8            2
1  15   8  10  -2       15       -2            8           -2
2   0  18  -3   3       18        0            3            0
3   2  -3   6  -1        2       -3           -1           -3
4  -4  17   9  10       17       -4           10           -4
5   8  -4  12  16       16       -4            8           -4
# `minimum` and `3rd largest` coincide here since `cols_subset` is of length 3

CodePudding user response：

To find the second largest values of each row, you can use nlargest.

One way: Transpose the DataFrame; find the 2 highest values, transpose back and find the minimum across rows:

df['2nd_largest'] = df[["A1", "B1", "C1", "D1", "E1", "F1"]].T.nlargest(2, df.index).T.min(axis=1)

another way: apply a function to each row:

df['2nd_largest'] = df[["A1", "B1", "C1", "D1", "E1", "F1"]].apply(lambda row: row.nlargest(2).iat[-1], axis=1)

CodePudding user response：

If you values are in a list you can sort the values like:

df.sort()

and you can find the 2nd highest value like:

df[-2]

And alternative is a function I found on Get the second largest number in a list in linear time

def second_largest(numbers):
    count = 0
    m1 = m2 = float('-inf')
    for x in numbers:
        count  = 1
        if x > m2:
            if x >= m1:
                m1, m2 = x, m1            
            else:
                m2 = x
    return m2 if count >= 2 else None