Which way is better to compare two strings in Python? I want to know which one is faster-CodePudding

Suppose two variables a and b stores two strings, I want to compare whether the two strings are same or not. In Python, the typical way should be a == b. I believe the time complexity of that is O(min(m,n)), where m and n are the length of string. But if I compare them in this way a in {b} (I add string b to a set and check if string a in that set), will the time complexity be O(1), regardless of the length of strings in a and b?

CodePudding user response：

No, as the in operator first computes the hash of the string (linear complexity) and then does another comparison if an element with the same hash is in the set (again, linear complexity). So, a == b is your best bet, as it skips the computation of the hash and avoids the construction of the set.

The constant time complexity of the in operator is with regards to the length of the set. It is still linear in regards to the length of the string.

However, the following results seem to disagree:

$ python3 -m timeit "'aaa' == 'aaa'"
20000000 loops, best of 5: 18.9 nsec per loop
$ python3 -m timeit "'aaa' == 'bbb'"
10000000 loops, best of 5: 22.1 nsec per loop
$ python3 -m timeit "'aaa' in {'aaa'}"
20000000 loops, best of 5: 17 nsec per loop
$ python3 -m timeit "'aaa' in {'bbb'}"
20000000 loops, best of 5: 16 nsec per loop

A slightly altered example seems to make more sense...

$ python3 -m timeit "a = 'aaa'; b = 'bbb'; a in {b}"
5000000 loops, best of 5: 63 nsec per loop
$ python3 -m timeit "a = 'aaa'; b = 'bbb'; a in {a}"
5000000 loops, best of 5: 55.7 nsec per loop
$ python3 -m timeit "a = 'aaa'; b = 'bbb'; a == b"
10000000 loops, best of 5: 30.1 nsec per loop
$ python3 -m timeit "a = 'aaa'; b = 'bbb'; a == a"
10000000 loops, best of 5: 27 nsec per loop

CodePudding user response：

No. This just hides the check. To check if a value is in a set, it first needs to hash the member it is checking. Then, if it finds a match for the hash it needs to call their implementation of equals to verify if the matching hash was just a false positive. So now you end up with O(m n min(m, n)).

That being said, since strings are such a core part of computer science they have a lot of various optimizations behind the scene. Some strings might have already been hashed ahead of time and that value was cached. Or maybe a and b point to the same string so it can exit early. Or their lengths are different so it can quickly exit. Attempting to add your own optimizations for something as important as strings will likely only make it more difficult for the system you are working to help you.

CodePudding user response：

Time complexity-wise, it will be faster to compare a is b for strings than a == b, given one can exclude that none of them is NaN.

Okay, I thought this is "smart" but see below a counterexample why using is is fatal - by @khelwood! The article suggested by @JohnColeman is also very interesting. (Thanks!)