String comparison time complexity

16,057

Solution 1

Time for string comparison is O(n), n being the length of the string.

However depending on the test data, you can manually optimize the matching algorithm. I have mentioned a few.

Optimization 1:

Check the size of both the strings, if unequal, return false. As this will stop the further O(n) comparison, and save time. Generally string data structure stores the size in memory, rather than calculating it each time. This allows O(1) time access to the string size.

Practically this is a huge optimization. I will explain how, by calculating the amortized time complexity.

If your string data structure can have a string of max size x, then there can be a total of (x + 1) possible string sizes (0, 1, 2, ... , x).

There are (x + 1) choose 2 ways of selecting two strings = x * (x + 1) / 2

If you use optimization 1, then you would need to compare the whole length only when two strings are of equal length. There will be only x + 1 such cases. Number of operations done will be 0 + 1 + 2 + .... + x = x * (x + 1) / 2 .

Remaining (x + 1) * (x - 2) / 2 cases will be calculated in O(1) time.

Hence total computations = x * (x + 1) / 2 + (x + 1) * (x - 2) / 2 = (x + 1) * (x - 1) which is O(n^2). Since we are doing x * (x + 1) / 2 string comparisons, hence amortized time complexity per comparison is O(1).

Where as without any optimization, there will be

0 + 1 * (x) * 1 + 2 * (x - 1) * 2 + 3 * (x - 3) * 3 + .... + x/2 * x/2 * x/2 calculations. Which will be without any doubt more than O(n^3). And amortized time complexity will be more than O(n).

Optimization 2:

Since you database contains web links, it is possible that they belong to the same website, hence their first few characters will always be same. This will lead to redundant CPU time usage. Hence better to check from the end for this case, as relative links will differ only from the end.

Note Theoretically speaking, we are not developing an algorithm that will change the worst case time complexity, it is still O(n). We are just optimizing the algorithm.

Solution 2

String comparisons typically do a linear scan of the characters, returning false at the first index where characters do not match.

The time complexity is O(N) and the actual time taken depends on how many characters need to be scanned before differences statistically emerge. If every one of your strings starts with http://, there will be a constant overhead to scan those first 7 characters (without tailoring the comparison algorithm to your specialized data).

If you have long strings, a tendency for the beginning of many strings to have the same starting characters, and extreme performance requirements you can consider hashing the strings, comparing the hashes first, and only doing a linear comparison of the strings if the hashes match (in order to rule out the possibility of a hash collision). If you do your initial comparison using hashes, which are shorter than the supposed long strings, you may be able to reduce the IO and RAM requirements of the system by carefully designing your query strategy.

Share:
16,057
Zip
Author by

Zip

Just a noob Java beginner.

Updated on June 04, 2022

Comments

  • Zip
    Zip almost 2 years

    Which comparison would take longer time?

    a = helloworldhelloworldhelloworld
    b = https://www.somerandomurls.com/directory/anotherdirectory/helloworld.html
    if a != b: doThis()

    versus

    a=one, b=two
    if a != b: doThis()

    I often need to check this against my database which has thousands of rows. I am not looking for any specific programming language. I just want to know which comparison takes faster. As you see, the value of b is longer string on the first example and shorter on the second example. So I wonder if that might make any difference on comparison.

  • jtschoonhoven
    jtschoonhoven almost 4 years
    Since string lengths can be compared in constant time, shouldn't this only apply to strings of equal length? I would expect the time complexity of comparing two arbitrary strings to amortize to O(1) since lengths will vary in the average case.
  • Eric J.
    Eric J. almost 4 years
    Sometimes, though when it is true, the cost has been shifted to a different part of the algorithm. The C language stores strings as a null-terminated sequence of characters, so the algorithm you describe would not work. Many languages (e.g. C#) store information about the string length as metadata. However, at some point in the execution of that program, the characters of the string were counted to obtain the length. Your point becomes very valid when a given string is compared more than once during the runtime of a program. Storing the length becomes a useful optimization.