Why is Python recursion so expensive and what can we do about it?


Solution 1

A solution is a trampoline: instead of calling another function directly, the recursive function returns a zero-argument function that will make that call with the appropriate arguments. A loop one level higher keeps calling those returned functions until we have the final (non-callable) result.

The point is that this converts recursion to iteration. I don't think this is faster, and it may even be slower, but the recursion depth stays low.

An implementation could look like the one below. I split the pair x into a and b for clarity, then converted the recursive function into a version that carries a and b as accumulator arguments, making it tail-recursive.

def fib_acc(n, a, b):
    if n == 1:
        return (a, b)
    return lambda: fib_acc(n - 1, (a+b) % 997, (a+2*b) % 997)

def fib(n):
    x = fib_acc(n, 1, 2)
    while callable(x):
        x = x()
    return x

if __name__=='__main__':
    print(fib(50000)[0])

Solution 2

The issue is that Python has an internal limit on the number of recursive function calls.

That limit is configurable, as shown in Quentin Coumes' answer. However, too deep a chain of function calls will result in a stack overflow. This underlying limitation¹ applies to both C++ and Python, and it applies to all function calls, not just recursive ones.

In general: You should not write² algorithms whose recursion depth grows linearly with the input size, or worse. Logarithmically growing recursion is typically fine. Tail-recursive functions are trivial to rewrite as iterations. Other recursions may be converted to iteration using external data structures (usually, a dynamic stack).
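As an illustrative sketch of that stack-based conversion (my own example; `nested_sum` and its input are made up for illustration), a non-tail recursion over nested lists becomes a loop over an explicit stack:

```python
def nested_sum(items):
    """Sum numbers in arbitrarily nested lists, without recursion."""
    total = 0
    stack = [items]              # a dynamic stack replaces the call stack
    while stack:
        current = stack.pop()
        for x in current:
            if isinstance(x, list):
                stack.append(x)  # defer the nested list instead of recursing
            else:
                total += x
    return total

# Works at nesting depths far beyond any recursion limit:
deep = [1]
for _ in range(100_000):
    deep = [deep]
print(nested_sum(deep))  # 1
```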

A related rule of thumb is that you shouldn't have large objects with automatic storage. This is C++-specific since Python doesn't have the concept of automatic storage.


¹ The underlying limitation is the execution stack size. The default size differs between systems, and different function calls consume different amounts of memory, so the limit isn't specified as a number of calls but in bytes instead. This too is configurable on some systems. I wouldn't typically recommend touching that limit due to portability concerns.
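On CPython specifically, one common workaround (my own sketch, not from the answer) is to run the deep recursion in a worker thread whose stack size you request explicitly via `threading.stack_size`; the 64 MiB figure below is an arbitrary choice, and `fib` is the question's recursive pair function:

```python
import sys
import threading

def fib(n):
    # the question's recursive pair function, modulo 997
    if n == 1:
        return (1, 2)
    x = fib(n - 1)
    return ((x[0] + x[1]) % 997, (x[0] + 2 * x[1]) % 997)

def main():
    sys.setrecursionlimit(60000)
    print(fib(50000)[0])  # 151

# Request a 64 MiB stack for threads created after this call.
threading.stack_size(64 * 1024 * 1024)
t = threading.Thread(target=main)
t.start()
t.join()
```

This avoids editing the executable's header: only the worker thread gets the large stack, and the portability caveat about touching stack limits still applies.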

² Exceptions to this rule of thumb are certain functional languages that guarantee tail-recursion elimination - such as Haskell - where that rule can be relaxed in case of recursions that are guaranteed to be eliminated. Python is not such a language, and the function in question isn't tail-recursive. While C++ compilers can perform the elimination as an optimization, it isn't guaranteed, and is typically not optimized in debug builds. Hence, the exception doesn't generally apply to C++ either.

Disclaimer: The following is my hypothesis; I don't actually know their rationale: The Python limit is probably a feature that detects potentially infinite recursions, preventing likely unsafe stack overflow crashes and substituting a more controlled RecursionError.

Why are recursive calls so much cheaper in C++?

C++ is a compiled language. Python is interpreted. (Nearly) everything is cheaper in C++, except the translation from source code to an executable program.

Solution 3

Let me first answer your direct questions:

Why are recursive calls so much cheaper in C++?

Because C++ has no limit on recursive call depth other than the size of the stack. And being a fully compiled language, its loops (including recursion) are much faster than Python's (which is why special Python modules like numpy/scipy call C routines directly). Additionally, most C++ implementations use a special feature called tail recursion elimination (see later in this post) and optimize some recursive code into iterative equivalents. This is nice, but it is not guaranteed by the standard, so a different compilation could produce a program that crashes miserably; in any case, tail recursion is probably not involved here.

If the recursion is too deep and exhausts the available stack, you invoke the well-known undefined behaviour, where anything can happen: an immediate crash, or a program that silently gives wrong results (IMHO the latter is much worse, because it may go undetected...)

Can we somehow modify the Python compiler to make it more receptive to recursion?

No. The Python implementation deliberately never uses tail recursion elimination. You could increase the recursion limit, but that is almost always a bad idea (see later in this post why).

Now for the true explanation of the underlying rationale.

Deep recursion is evil, full stop. You should never use it. Recursion is a handy tool when you can make sure that the depth will stay in sane limits. Python uses a soft limit to warn the programmer that something is going wrong before crashing the system. On the other hand, optimizing C and C++ compilers often internally change tail recursion into an iterative loop. But relying on it is highly dangerous because a slight change could prevent that optimization and cause an application crash.
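A minimal demonstration of that soft limit (my own snippet): the interpreter raises a catchable RecursionError well before the underlying C stack would overflow:

```python
import sys

def down(n):
    # unbounded recursion: would run "forever" without the limit
    return down(n + 1)

try:
    down(0)
except RecursionError as exc:
    print("caught:", exc)
    print("limit was", sys.getrecursionlimit())
```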

As found in this other SO post, common Python implementations do not implement tail recursion elimination. So you should not use recursion at a depth of 5000; use an iterative algorithm instead.

As your underlying computation will need all Fibonacci numbers up to the specified one, it is not hard to compute them iteratively. Furthermore, it will be much more efficient!
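For the question's pair recurrence, the iterative version is only a few lines. A sketch (the name `fib_iter` is mine):

```python
def fib_iter(n):
    """Iterative rewrite of the question's recursive pair function."""
    a, b = 1, 2                                  # the n == 1 base case
    for _ in range(n - 1):
        a, b = (a + b) % 997, (a + 2 * b) % 997  # one step of the recurrence
    return (a, b)

print(fib_iter(500)[0])    # 996, same as the recursive version
print(fib_iter(50000)[0])  # 151, with no recursion limit involved
```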

Solution 4

"Both will find the answer 996 without issues"

I do see at least one issue: the answer should be 836, not 996.

It seems that both of your functions calculate Fibonacci(2n) % p, not Fibonacci(n) % p.

996 is the result of Fibonacci(1000) % 997.

Pick a more efficient algorithm

An inefficient algorithm stays an inefficient algorithm, regardless if it's written in C++ or Python.

In order to compute large Fibonacci numbers, there are much faster methods than simple recursion with O(n) calls (see related article).

For large n, this recursive O(log n) Python function should run in circles around your above C++ code:

from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n, p):
    "Calculate Fibonacci(n) modulo p"
    if n < 3:
        return [0, 1, 1][n]
    if n % 2 == 0:
        m = n // 2
        v1 = fibonacci(m - 1, p)
        v2 = fibonacci(m, p)
        return (2*v1 + v2) * v2 % p
    else:
        m = (n + 1) // 2
        v1 = fibonacci(m, p) ** 2
        v2 = fibonacci(m - 1, p) ** 2
        return (v1 + v2) % p


print(fibonacci(500, 997))
#=> 836
print(fibonacci(1000, 997))
#=> 996
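For reference, the two branches implement the standard fast-doubling identities (n = 2m in the even branch, n = 2m − 1 in the odd branch):

```latex
F(2m)   = \bigl(2F(m-1) + F(m)\bigr)\, F(m)
F(2m-1) = F(m)^2 + F(m-1)^2
```

They follow from squaring the 2×2 Fibonacci matrix, which is why the recursion depth is O(log n).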


It will happily calculate fibonacci(10_000_000_000_000_000, 997).

It's possible to add the recursion level as a parameter, in order to see how deep the recursion needs to go, and display it with indentation. Here's an example for n=500:

# Recursion tree:
500
  249
    124
      61
        30
          14
            6
              2
              3
                1
                2
            7
              4
          15
            8
        31
          16
      62
    125
      63
        32
  250

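One way to produce such a trace (a sketch of my own: the same recurrence, but with `lru_cache` replaced by an explicit dict so the tree can be printed; the exact shape of the printed tree may differ in detail from the one above, depending on how cache hits are displayed):

```python
def fibonacci_traced(n, p, depth=0, cache=None):
    "Calculate Fibonacci(n) modulo p, printing the recursion tree."
    if cache is None:
        cache = {}
    print('  ' * depth + str(n))      # indent by recursion level
    if n in cache:
        return cache[n]               # cache hit: no children printed
    if n < 3:
        result = [0, 1, 1][n]
    elif n % 2 == 0:
        m = n // 2
        v1 = fibonacci_traced(m - 1, p, depth + 1, cache)
        v2 = fibonacci_traced(m, p, depth + 1, cache)
        result = (2 * v1 + v2) * v2 % p
    else:
        m = (n + 1) // 2
        v1 = fibonacci_traced(m, p, depth + 1, cache) ** 2
        v2 = fibonacci_traced(m - 1, p, depth + 1, cache) ** 2
        result = (v1 + v2) % p
    cache[n] = result
    return result

print(fibonacci_traced(500, 997))  # 836
```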

Your examples would simply look like very long diagonals:

500
  499
    498
      ...
        ...
           1

Solution 5

For Windows executables, the stack size is specified in the header of the executable. For the Windows version of Python 3.7 x64, that size is 0x1E8480, or exactly 2,000,000 bytes.

That version crashes with

Process finished with exit code -1073741571 (0xC00000FD)

and if we look that up we find that it's a Stack Overflow.

What we can see on the (native) stack with a native debugger like WinDbg (enable child process debugging) is

[...]
fa 000000e9`6da1b680 00007fff`fb698a6e     python37!PyArg_UnpackStack+0x371
fb 000000e9`6da1b740 00007fff`fb68b841     python37!PyEval_EvalFrameDefault+0x73e
fc 000000e9`6da1b980 00007fff`fb698a6e     python37!PyArg_UnpackStack+0x371
fd 000000e9`6da1ba40 00007fff`fb68b841     python37!PyEval_EvalFrameDefault+0x73e
fe 000000e9`6da1bc80 00007fff`fb698a6e     python37!PyArg_UnpackStack+0x371
ff 000000e9`6da1bd40 00007fff`fb68b841     python37!PyEval_EvalFrameDefault+0x73e

2:011> ? 000000e9`6da1bd40 - 000000e9`6da1ba40 
Evaluate expression: 768 = 00000000`00000300

So Python will use 2 stack frames per method call, and there's an enormous 768-byte difference between the stack positions.

If you modify that value inside the EXE (make a backup copy first) with a hex editor to, say, 256 MB

[Image: Python.exe edited in 010 Editor]

you can run the following code

[...]
if __name__=='__main__':
    sys.setrecursionlimit(60000)
    print(fib(50000)[0])

and it will give 151 as the answer.


In C++, we can also force a stack overflow, e.g. by passing 500,000 as the parameter. While debugging, we get

0:000> .exr -1
ExceptionAddress: 00961015 (RecursionCpp!fib+0x00000015)
ExceptionCode: c00000fd (Stack overflow)
[...]
0:000> k
[...]
fc 00604f90 00961045     RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp @ 7] 
fd 00604fb0 00961045     RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp @ 7] 
fe 00604fd0 00961045     RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp @ 7] 
ff 00604ff0 00961045     RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp @ 7] 

0:000> ? 00604ff0 - 00604fd0
Evaluate expression: 32 = 00000020

which is just 1 stack frame per method call and only a 32-byte difference on the stack. Compared to Python, C++ can do 768/32 = 24× more recursions for the same stack size.

My Microsoft compiler created the executable with the default stack size of 1 MB (Release build, 32 Bit):

[Image: 010 Editor showing the C++ executable]

The 64-bit version has a stack difference of 64 bytes per call (also a Release build).


Tools used: WinDbg (native debugger) and 010 Editor (hex editor).


Author: jtallk

Updated on July 16, 2021

Comments

  • jtallk
    jtallk almost 3 years

    Suppose we want to compute some Fibonacci numbers, modulo 997.

    For n=500 in C++ we can run

    #include <iostream>
    #include <array>
    
    std::array<int, 2> fib(unsigned n) {
        if (!n)
            return {1, 1};
        auto x = fib(n - 1);
        return {(x[0] + x[1]) % 997, (x[0] + 2 * x[1]) % 997};
    }
    
    int main() {
        std::cout << fib(500)[0];
    }
    

    and in Python

    def fib(n):
        if n==1:
            return (1, 2)
        x=fib(n-1)
        return ((x[0]+x[1]) % 997, (x[0]+2*x[1]) % 997)
    
    if __name__=='__main__':
        print(fib(500)[0])
    

    Both will find the answer 996 without issues. We are taking modulos to keep the output size reasonable and using pairs to avoid exponential branching.

    For n=5000, the C++ code outputs 783, but Python will complain

    RecursionError: maximum recursion depth exceeded in comparison
    

    If we add a couple of lines

    import sys
    
    def fib(n):
        if n==1:
            return (1, 2)
        x=fib(n-1)
        return ((x[0]+x[1]) % 997, (x[0]+2*x[1]) % 997)
    
    if __name__=='__main__':
        sys.setrecursionlimit(5000)
        print(fib(5000)[0])
    

    then Python too will give the right answer.

    For n=50000 C++ finds the answer 151 within milliseconds while Python crashes (at least on my machine).

    Why are recursive calls so much cheaper in C++? Can we somehow modify the Python compiler to make it more receptive to recursion?

    Of course, one solution is to replace recursion with iteration. For Fibonacci numbers, this is easy to do. However, this will swap the initial and the terminal conditions, and the latter is tricky for many problems (e.g. alpha–beta pruning). So generally, this will require a lot of hard work on the part of the programmer.

    • Alan Birtles
      Alan Birtles almost 3 years
      sys.setrecursionlimit works for me, note that the limit needs to be 5002 not 5000
    • Caleth
      Caleth almost 3 years
@TedKleinBergman this isn't tail recursive. It looks like gcc can unroll this somewhat to use much less stack per call than CPython (intentionally) does
    • eerorika
      eerorika almost 3 years
@Caleth Pedantic technicality: I don't think unrolling would decrease stack use per call, and it might even increase it. But it would reduce the number of calls by a fraction, and hence the total stack use.
    • Caleth
      Caleth almost 3 years
      @eerorika per C++ function call, not per object code branch-with-link
    • Russell Borogove
      Russell Borogove almost 3 years
      @Jarod42 Languages aren't compiled or interpreted; language implementations are. The standard Python distribution (known as CPython) compiles Python source to bytecode representation, similar to how Java or C# do.
    • PM 2Ring
      PM 2Ring almost 3 years
      BTW, you can dramatically reduce the number of recursive calls by using a cache, eg the cache or lru_cache decorators from functools. And you can optimize that code slightly by assigning the result of the call to two names, eg u, v = fib(n-1), rather than assigning to x and then indexing x 4 times. Also, you might like my fast Fibonacci code: stackoverflow.com/a/40683466/4014959 ;)
    • Jivan
      Jivan almost 3 years
      @Jarod42 this has more to do with dynamic vs static typing
    • Jann Poppinga
      Jann Poppinga almost 3 years
      Check out Stackless Python. It doesn't use the call stack of the C runtime/OS, but manages its own. That might already be the modification you had in mind.
    • Konrad Rudolph
      Konrad Rudolph almost 3 years
I'm a bit confused why you're talking about Fibonacci numbers, when the functions you show clearly don't compute the Fibonacci series.
    • dan04
      dan04 almost 3 years
      Why the modulo-997 arithmetic?
    • OverLordGoldDragon
      OverLordGoldDragon almost 3 years
      I've not read everything but this may be relevant.
    • smci
      smci almost 3 years
      Your underlying unasked questions are "a) Why can't Python do tail recursion on function calls?" and "b) Why is cPython's stackframe size(/recursionlimit) so low?". Infinite (call-based) recursion is obviously going to crash, at some point. Tail recursion in g++ saves C++ from that. Modifying Fibonacci to be modulo 997 reduces this from an infinite problem to a finite one with a state space of at most 997**2, and makes it more amenable to caching, which Python certainly can do - and that's a one-line decorator call. This question's a moving target. It didn't ask "How to speed Python up?"
    • smci
      smci almost 3 years
      ...and Is there a Python caching library? to which the 3.x standard-library answer is functools.
    • MisterMiyagi
      MisterMiyagi almost 3 years
"Of course, one solution is to replace recursion with iteration. [...] So generally this will require a lot of hard work on the part of the programmer." The proper tool for generating sequences defined via recursion from a base case is corecursion. The corresponding feature in Python is aptly named a generator. It has the exact same initial and terminal conditions as the code shown in the question.
    • MisterMiyagi
      MisterMiyagi almost 3 years
Can you please clarify what exactly you are asking about? I see at least three separate topics in there: How can I compute "Fibonacci modulo 997" efficiently in Python? Why are calls cheaper in compiled versus interpreted programs? How can I make the "Python compiler" handle deeper recursion? The first seems answerable, the second is extremely broad, and the third should have several duplicates.
    • Eric Duminil
      Eric Duminil almost 3 years
@PM2Ring "you can dramatically reduce the number of recursive calls by using a cache". It's good advice in general, but I really don't think it applies in this case. Actually, adding @lru_cache(maxsize=None) to OP's code makes it fail faster, because it reaches maximum recursion depth. Cache would help if the method called itself twice at each level. With the current structure, fib(500) calls itself 500 times, even without cache.
    • PM 2Ring
      PM 2Ring almost 3 years
      @Eric That's a very good point. Simply decorating the OP's code with lru-cache isn't sufficient. As you say, we also need to use a better algorithm that gives us a binary tree of calls instead of the OP's linear list. That's what my fast_fib code does, which I linked in that same comment.
    • PM 2Ring
      PM 2Ring almost 3 years
      FWIW, here's a slightly improved version of that code, which handles modulus (set modulus to 0 for plain Fibonacci numbers). The code also shows when it's hitting the cache, and prints the hit stats and cache size at the end.
    • Eric Duminil
      Eric Duminil almost 3 years
      @PM2Ring Indeed, I love this method, and already wrote an answer based on it : stackoverflow.com/a/68017363/6419007 . OP is calculating F(2*n) % p, BTW, not F(n) % p.
    • PM 2Ring
      PM 2Ring almost 3 years
      @KonradRudolph Yes, it's a bit confusing that the OP's code is computing (Fib(2n), Fib(2n+1)) mod 997.
    • Jim Fell
      Jim Fell almost 3 years
      Generally speaking (virtually always) use of recursion is bad programming practice, including (especially) fibonacci calculations. Solution: avoid recursion.
    • Eric Duminil
      Eric Duminil almost 3 years
      @dan04 maybe modulo has been used to avoid requiring BigInts in C++?
    • Solomon Slow
      Solomon Slow almost 3 years
      You asked, "Why is [it] so expensive?" But I don't see anywhere you measured a cost. I only see that you showed that a particular Python implementation, by default, allocates room on the stack for fewer function activations than a particular C++ implementation allocates.
    • ggorlen
      ggorlen almost 3 years
      Recursion applied to lists vs trees in Python seems applicable here but I'm not sure the fundamental premise of this question makes sense, as others have pointed out. Python is a language, not an implementation, so it needs to specify an implementation such as the CPython interpreter to ensure correct scoping. Pypy, for example, doesn't have the same characteristics as CPython does when it comes to recursion. Within an implementation, there's a lot that can be done about recursion.
  • ti7
    ti7 almost 3 years
(responding to a now-deleted comment) I believe that practically, setting the limit low is encouragement not to write heavily-recursive code in Python, which is likely either an error or very inefficient due to creating lots of expensive intermediate objects
  • eerorika
    eerorika almost 3 years
    @jtallk Here it crashes before n=1'000'000. Is there a reason for this relative difference? Presumably, Python function calls consume more execution stack space.
  • jtallk
    jtallk almost 3 years
    I guess it depends on the specifics of the machine. I was using ideone.com/5KsxJD
  • Russell Borogove
    Russell Borogove almost 3 years
    Languages aren't compiled or interpreted; language implementations are. The standard Python distribution (known as CPython) compiles Python source to bytecode representation, similar to how Java or C# do. C++ is faster for many operations because of the semantics of the language, not because of compiling versus interpreting.
  • eerorika
    eerorika almost 3 years
@RussellBorogove That's mostly philosophical pedantry in my opinion. There exist only compiled implementations of C++ (I don't count "mostly conforming"), and in this context there is no practical difference between purely interpreted and a just-in-time byte-compiler or whatever CPython's definition is. Neither has the opportunity to spend the time on optimisation that an AOT compiler has. I do agree that language semantics help as well... although I would suspect that whether a language is designed to be AOT or JIT has an effect on what semantics the language is designed to have.
  • Russell Borogove
    Russell Borogove almost 3 years
    It's correctness, not pedantry, and not a matter of opinion. An interpreted Python would be vastly slower than CPython.
  • PM 2Ring
    PM 2Ring almost 3 years
eerorika, the CPython compiler is not JIT. The Python source is compiled to bytecode. That bytecode is then executed on a virtual machine. That is, the bytecode is the machine language of the virtual machine.
  • eerorika
    eerorika almost 3 years
    @PM2Ring compiler is not JIT. Is it not compiled to bytecode upon running the program?
  • PM 2Ring
    PM 2Ring almost 3 years
    @eerorika JIT compiling involves compilation during execution, and generally involves dynamic analysis to determine what parts of the code could benefit from being compiled. In contrast, CPython code is compiled immediately before execution commences, if necessary (the interpreter can use existing compiled images if their timestamp is later than that of the source code, if the system saves such images to disk).
  • Quelklef
    Quelklef almost 3 years
"Deep recursion is evil, full stop." -- This seems like a strong statement when some languages (e.g. pure FP langs) have no iteration at all but are still quite usable
  • user2357112
    user2357112 almost 3 years
    This just turns a clean Python stack overflow into a messy C stack overflow once you run out of C-level stack space.
  • Serge Ballesta
    Serge Ballesta almost 3 years
@Quelklef: I am sorry if I was not clear. I did not mean that recursion is evil. It is indeed a nice tool. What I meant is that when using it, a programmer should wonder whether the depth will be acceptable. In Python, a beginner should never try to increase the recursion limit, because they would do better to use a different algorithm. It is in fact a very advanced feature that should only be used in rare use cases, after a careful analysis to make sure that it is really worth it...
  • Serge Ballesta
    Serge Ballesta almost 3 years
... And in C++ (the other tagged language), even if tail recursion elimination is a fact with optimizing compilers, it is not guaranteed by the current language standard. So relying on it could result in an application that works fine or fails miserably depending on the compiler used, or even on compilation options. That is the reason why I strongly urge beginners to avoid deep recursion as much as they can.
  • theberzi
    theberzi almost 3 years
    Python is not a functional programming language. Serge Ballesta's statement may not apply to all languages, but it certainly does for Python.
  • jk.
    jk. almost 3 years
Here's one Python implementation: pypi.org/project/trampoline
  • Peter Mortensen
    Peter Mortensen almost 3 years
Re "Deep recursion is evil, full stop": But many problems naturally lend themselves to recursion, like traversing (binary) trees. Is implementing your own stack any better?
  • Edgar Bonet
    Edgar Bonet almost 3 years
    This answer does not address the question. The Fibonacci code is clearly a dummy example: the OP's finger pointed at the Moon. You were expected to look at the Moon, not at the finger.
  • Abigail
    Abigail almost 3 years
As you know that you will need all Fibonacci numbers below the searched one. No, you don't. The nth Fibonacci number is equal to the nearest integer to φ^n/√5, where φ is the golden ratio ((1 + √5)/2). Neither recursion nor iteration is needed to calculate Fibonacci numbers. (But calculating those numbers isn't the issue in the question)
  • Serge Ballesta
    Serge Ballesta almost 3 years
@Abigail: You are perfectly right from a mathematical point of view. And plain wrong when we come to large numbers in Python. The direct mathematical formula will use floating-point numbers, which are IEEE 754 64-bit binary numbers, with a precision of at most 15 or 16 decimal digits. Python integers, on the other hand, are multiprecision numbers with arbitrary precision.
  • Serge Ballesta
    Serge Ballesta almost 3 years
@EdgarBonet: I do not agree. What this answer means is that, except in very special use cases, if you find yourself having to increase the Python recursion limit, you should not do it unless no other path is possible. And even then, you should carefully analyze what the highest possible recursion limit would be in that specific use case. And a beginner should never do it without asking a more experienced programmer. But it is only my opinion and my advice. And I have only been programming for more than 40 years...
  • Edgar Bonet
    Edgar Bonet almost 3 years
The quality of your advice is not disputed. Yet, you failed to address the OP's question.
  • Arsenal
    Arsenal almost 3 years
    Is Cling also only mostly conforming? They use clang and claim it supports everything clang does.
  • Serge Ballesta
    Serge Ballesta almost 3 years
    @EdgarBonet: This is a good point. I have edited my post.
  • Ilmari Karonen
    Ilmari Karonen almost 3 years
    @PeterMortensen: Traversing a binary tree is not deep recursion in the sense I assume this answer means (and as eerorika's answer makes explicit): the recursion depth needed to traverse a balanced binary tree of n elements is O(log n). If your tree is small enough to fit in RAM (or even on disk) and not extremely unbalanced (to the point where even calling it a tree becomes questionable), Python's default recursion limit is more than enough to traverse it.
  • eerorika
    eerorika almost 3 years
@Arsenal According to root.cern/cling, their goal is to be backward compatible with CINT, which according to their documentation is subtly different from C++. So, I would guess that Cling also differs, but to be fair my analysis isn't very deep.
  • dan04
    dan04 almost 3 years
    @SergeBallesta: You can use the formula with a high-precision floating-point type or with Fraction. Then it's just a matter of calculating an accurate-enough approximation of φ (which can be done by iterating phi = 1 + 1 / phi) to make the integer part correct.
  • chepner
    chepner almost 3 years
    Scheme guarantees tail-recursion optimization; Haskell does not, nor does it need to, because a single function doesn't necessarily produce a single basic block.
  • Wayne Conrad
    Wayne Conrad almost 3 years
Beautiful. This would also be an excellent answer to stackoverflow.com/questions/189725/… , simpler and clearer than most of the examples there.
  • dan04
    dan04 almost 3 years
    Technically, any interpreted language can be made into a “compiled” language by generating executable files that contain both the interpreted code and the interpreter.
  • Konrad Rudolph
    Konrad Rudolph almost 3 years
    @dan04… and that's supposed to be faster than a simple recursion adding O(n) integers?!
  • Konrad Rudolph
    Konrad Rudolph almost 3 years
    The C++ code isn’t tail recursive, and neither GCC nor clang transform it into a tail recursive implementation. Consequently, neither perform TCO here (at -O2). And the C++ code does crash for larger n. This answer is completely unrelated to the actual reason why the code works in C++ but not in Python.
  • dan04
    dan04 almost 3 years
    @KonradRudolph: It depends on how many iterations are needed to compute φ to the necessary precision. I haven't done the math.
  • ShadowRanger
    ShadowRanger almost 3 years
    @ti7: The official reason is to throw a (friendly, handlable) exception before you blow the real (C) stack and the program crashes hard. As the OP observed, raising the limit too high means the program crashes hard; you couldn't catch the exception and handle it in a nice way even if you wanted to.
  • Daniel Wagner
    Daniel Wagner almost 3 years
    For a less academic argument that not all the Fibonacci numbers are needed, consider the following standard-ish algorithm: compute the matrix power [[1,1],[1,0]]^n by repeated squaring, and then return the top left element of the result. This only computes O(log(n)) Fibonacci numbers below the searched one.
  • smci
    smci almost 3 years
This isn't a solution, it just pushes the inevitable stack overflow due to infinite recursion further out. The OP's underlying unasked questions are "a) Why doesn't Python do tail recursion on function calls?" and "b) Why is CPython's stack frame size (/recursion limit) so low?"
  • MisterMiyagi
    MisterMiyagi almost 3 years
    This seems to just re-invent generators. How does it address the difference between C++ and Python?
  • Roel Schroeven
    Roel Schroeven almost 3 years
    @MisterMiyagi It addresses the "what can we do about it?" part of the question.
  • MisterMiyagi
    MisterMiyagi almost 3 years
    So what about the other parts of the question?
  • MisterMiyagi
    MisterMiyagi almost 3 years
    The question doesn't even compute actual fibonacci numbers. Even if all the other parts about comparing Python to C++ and the explicit request on making Python handle the existing recursion better were not there, this answer would still be missing the question.
  • MisterMiyagi
    MisterMiyagi almost 3 years
    Thanks for taking the comment gracefully. FWIW, I think it is the proper approach, just not for the question asked. It certainly doesn't seem worse than the other answers. I'm afraid this answer just missed the upvote rush hour.
  • Yakk - Adam Nevraumont
    Yakk - Adam Nevraumont almost 3 years
    @eerorika C++ compiled wasm produces wasm bytecode, which is interpreted by essentially the same engine as javascript bytecode is in a browser. The resulting wasm is much faster than JS for many purposes, because of the semantics of C++ make it amenable to optimization. Generally highly optimized "gc-style" languages take a 2x hit over non-gc languages, and languages that go further like python another 2x-5x hit; in C++, pair<int,int> is just 8 bytes, while in Python the equivalent is a much more complicated structure. There is nothing wrong with that cost, but it is semantics not JIT.
  • Eric Duminil
    Eric Duminil almost 3 years
    @MisterMiyagi: I'm here to learn new stuff and have interesting discussions, not (only) get upvotes. :D I just wasted half an hour trying to get the same result as OP. In vain, because both the functions are somehow broken, and don't return F(n) % 997. It should be 836, not 996.
  • Eric Duminil
    Eric Duminil almost 3 years
    @DanielWagner: Indeed.
  • Eric Duminil
    Eric Duminil almost 3 years
    @MisterMiyagi: I've found the problem and updated my answer.
  • Eric Duminil
    Eric Duminil almost 3 years
    Cool looking answer. It's nice that you manage to save the code without changing much. Just like OP, you actually compute fibonacci(2*n) % p, not fibonacci(n) % p, BTW.
  • PM 2Ring
    PM 2Ring almost 3 years
    @dan04 There are faster ways to compute phi, eg phi = (phi*phi + 1) / (2*phi - 1). Demo.
  • PM 2Ring
    PM 2Ring almost 3 years
    @Abigail That phi-based formula, which is derived from Binet's formula is certainly useful, if you know phi with sufficient precision. But it's not much use for computing huge modulus Fibonacci numbers, eg Fib(10**100) % (10**9), which equals 560546875.
  • ti7
    ti7 almost 3 years
    @ShadowRanger of course! the original comment was why it was so very low rather than something higher
  • Serge Ballesta
    Serge Ballesta almost 3 years
    @KonradRudolph: After your comment, and a second reading of the used algorithm I have changed my answer. You are absolutely right: tail recursion elimination is not involved here.
  • glaba
    glaba almost 3 years
    This doesn't answer half the question, which is why the recursion is so expensive in Python in the first place
  • chepner
    chepner almost 3 years
    @glaba Recursion (as typically implemented) involves a global name lookup, because the recursive call is just a free variable, not a direct reference to the current function. It could be sped up using various closure tricks to make the recursive reference a local variable, but that does nothing about the cost of calling any function in Python.
  • chepner
    chepner almost 3 years
    Also, functions are distinct objects that need to be interpreted by the Python runtime, not just blocks of code in memory that can be jumped to. Python's data model is just very, very different from C++'s.
  • Barmar
    Barmar almost 3 years
    @PM2Ring A virtual machine executing bytecode is essentially the same as an interpreter.
  • PM 2Ring
    PM 2Ring almost 3 years
    @Barmar Of course. It's the 2nd strategy for an interpreter described on Wikipedia. Whereas a JIT scheme is generally a combination of the 2nd & 3rd strategies, where bytecode may be translated to the machine code of the actual hardware CPU.
  • Roel Schroeven
    Roel Schroeven almost 3 years
    To @MisterMiyagi and glaba: I didn't answer the first half of the question because that was already covered in other answers. Should I have included an explanation for that part too to make my answer a complete answer?
  • Eric Duminil
    Eric Duminil almost 3 years
    @PM2Ring: I didn't notice your comment. You're right, it's not too clever to calculate F(n), F(n-1) and F(n+1) separately.
  • MisterMiyagi
    MisterMiyagi almost 3 years
    Be aware the question is not about plain Fibonacci numbers, but modulo 997.
  • abhinav omprakash
    abhinav omprakash almost 3 years
Thanks for pointing that out. Is this better?