More Efficient Way to find the Second Largest Item in a List in Python-CodePudding

I wrote this simple code for a simple task of finding the second largest item in a list of integers:

def second_largest(input_list):
  input_list.sort()
  return input_list[-2]

However, for large lists this function can be really uneffective, for example a running time of more than 1.5 seconds for one million items.

I understand that it's due to the fact that the function changes the very list itself (with the .sort method) which can be very inefficient for a long list. How can I perform this task without having to use inefficient methods that change the list?

Thank you all in advnace.

CodePudding user response：

How about the following:

lst = list(range(1000000))

largest, second_largest = sorted(lst[:2])
for x in lst[2:]:
    if x > largest:
        largest, second_largest = x, largest
    elif x > second_largest:
        second_largest = x
print(largest, second_largest) # 999999 999998

It traverses an iterable only once, so I hope it is efficient. (This assumes the list has at least two items.)

CodePudding user response：

I've found that a solution based on np.argpartition is the fastest. It does require Numpy though, and it doesn't take into account the conversion of your list to a numpy array. But, in the end, if you wish to do some further number crunching down the line you might want to use numpy arrays since operations on those are typically much faster than operations on list. So I'm assuming this is the case for you.

First you should convert your list to an array:

import numpy as np
v = list(range(10000000))
var = np.array(v)

We can then use np.argpartition

%%time
ind = np.argpartition(var, -2)[-2]

CPU times: user 46.3 ms, sys: 25.9 ms, total: 72.2 ms
Wall time: 71.7 ms

ind = 9999998, so the result seems correct.

Now if we compare with the other suggestions here:

%%time
v.pop(v.index(max(v)))
max(v)

CPU times: user 619 ms, sys: 1.96 ms, total: 621 ms
Wall time: 623 ms

About 10 times slower

%%time
heapq.nlargest(2,v)[1]

Well... much slower

And the last one

largest, second_largest = v[0], v[0]
for x in v:
    if x > largest:
        largest, second_largest = x, largest
    elif x > second_largest:
        second_largest = x
print(largest, second_largest) # 999999 999998

CPU times: user 1.41 s, sys: 0 ns, total: 1.41 s
Wall time: 1.41 s

Again, many times slower.

The np.argpartition solution seems the fastest if you plan to work with arrays somewhere down the line. If not, you will spend a lot of CPU time converting your list to an array that you won't be using again (see below). It still outperforms some other solutions though. If we take into account the conversion to an array, we get this result:

%%time
var=np.array(v)
ind = np.argpartition(var, -2)[-2]

CPU times: user 867 ms, sys: 59 ms, total: 926 ms
Wall time: 924 ms

This question explains pretty well why this solution is faster. The reason is that you only perform a partial sort using np.argpartition rather than fully sorting the list.

CodePudding user response：

Pop max, find max again:

my_list.pop(my_list.index(max(my_list)))
max(my_list)

CodePudding user response：

As mentioned by @juanpa.arrivillaga, we can use Heap queue algorithm's heapq.nlargest method

import heapq

data = list(range(100))
data.append(100)
print(heapq.nlargest(2, data)[1])

Output:

Disclaimer:

If the data contains repeating values, it will return the unique second largest element.