What is tail call optimization?

Solution 1

Tail-call optimization is where you are able to avoid allocating a new stack frame for a function because the calling function will simply return the value that it gets from the called function. The most common use is tail-recursion, where a recursive function written to take advantage of tail-call optimization can use constant stack space.

Scheme is one of the few programming languages that guarantee in the spec that any implementation must provide this optimization, so here are two examples of the factorial function in Scheme:

(define (fact x)
  (if (= x 0) 1
      (* x (fact (- x 1)))))

(define (fact x)
  (define (fact-tail x accum)
    (if (= x 0) accum
        (fact-tail (- x 1) (* x accum))))
  (fact-tail x 1))

The first function is not tail recursive because when the recursive call is made, the function needs to keep track of the multiplication it needs to do with the result after the call returns. As such, the stack looks as follows:

(fact 3)
(* 3 (fact 2))
(* 3 (* 2 (fact 1)))
(* 3 (* 2 (* 1 (fact 0))))
(* 3 (* 2 (* 1 1)))
(* 3 (* 2 1))
(* 3 2)
6

In contrast, the stack trace for the tail recursive factorial looks as follows:

(fact 3)
(fact-tail 3 1)
(fact-tail 2 3)
(fact-tail 1 6)
(fact-tail 0 6)
6

As you can see, we only need to keep track of the same amount of data for every call to fact-tail because we are simply returning the value we get right through to the top. This means that even if I were to call (fact 1000000), I need only the same amount of space as (fact 3). This is not the case with the non-tail-recursive fact, and as such large values may cause a stack overflow.
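For readers more comfortable with Python than Scheme, the two factorial definitions above can be transliterated as follows. Note this is only an illustration of the *shape* of the recursion: CPython does not perform tail-call optimization, so the accumulator version still consumes one stack frame per call there.

```python
def fact(x):
    # Non-tail-recursive: the multiplication happens *after*
    # the recursive call returns, so the frame must be kept alive.
    if x == 0:
        return 1
    return x * fact(x - 1)

def fact_tail(x, accum=1):
    # Tail-recursive shape: the recursive call is the very last operation,
    # so a TCO-capable implementation could reuse the current stack frame.
    if x == 0:
        return accum
    return fact_tail(x - 1, x * accum)

print(fact(3), fact_tail(3))  # both compute 3! = 6
```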

Solution 2

Let's walk through a simple example: the factorial function implemented in C.

We start with the obvious recursive definition

unsigned fac(unsigned n)
{
    if (n < 2) return 1;
    return n * fac(n - 1);
}

A function ends with a tail call if the last operation before the function returns is another function call. If this call invokes the same function, it is tail-recursive.

Even though fac() looks tail-recursive at first glance, it is not, because what actually happens is

unsigned fac(unsigned n)
{
    if (n < 2) return 1;
    unsigned acc = fac(n - 1);
    return n * acc;
}

i.e. the last operation is the multiplication, not the function call.

However, it's possible to rewrite fac() to be tail-recursive by passing the accumulated value down the call chain as an additional argument and passing only the final result up again as the return value:

unsigned fac_tailrec(unsigned acc, unsigned n);

unsigned fac(unsigned n)
{
    return fac_tailrec(1, n);
}

unsigned fac_tailrec(unsigned acc, unsigned n)
{
    if (n < 2) return acc;
    return fac_tailrec(n * acc, n - 1);
}

Now, why is this useful? Because we immediately return after the tail call, we can discard the previous stackframe before invoking the function in tail position, or, in case of recursive functions, reuse the stackframe as-is.

The tail-call optimization transforms our recursive code into

unsigned fac_tailrec(unsigned acc, unsigned n)
{
TOP:
    if (n < 2) return acc;
    acc = n * acc;
    n = n - 1;
    goto TOP;
}

This can be inlined into fac() and we arrive at

unsigned fac(unsigned n)
{
    unsigned acc = 1;

TOP:
    if (n < 2) return acc;
    acc = n * acc;
    n = n - 1;
    goto TOP;
}

which is equivalent to

unsigned fac(unsigned n)
{
    unsigned acc = 1;

    for (; n > 1; --n)
        acc *= n;

    return acc;
}

As we can see here, a sufficiently advanced optimizer can replace tail-recursion with iteration, which is far more efficient as you avoid function call overhead and only use a constant amount of stack space.

Solution 3

TCO (Tail Call Optimization) is the process by which a smart compiler can make a call to a function and take no additional stack space. The only situation in which this happens is if the last instruction executed in a function f is a call to a function g (Note: g can be f). The key here is that f no longer needs stack space - it simply calls g and then returns whatever g would return. In this case the optimization can be made that g just runs and returns whatever value it would have to the thing that called f.

This optimization can make recursive calls take constant stack space, rather than explode.

Example: this factorial function is not TCOptimizable:

from dis import dis

def fact(n):
    if n == 0:
        return 1
    return n * fact(n-1)


dis(fact)
  2           0 LOAD_FAST                0 (n)
              2 LOAD_CONST               1 (0)
              4 COMPARE_OP               2 (==)
              6 POP_JUMP_IF_FALSE       12

  3           8 LOAD_CONST               2 (1)
             10 RETURN_VALUE

  4     >>   12 LOAD_FAST                0 (n)
             14 LOAD_GLOBAL              0 (fact)
             16 LOAD_FAST                0 (n)
             18 LOAD_CONST               2 (1)
             20 BINARY_SUBTRACT
             22 CALL_FUNCTION            1
             24 BINARY_MULTIPLY
             26 RETURN_VALUE

This function does things besides call another function in its return statement.

This below function is TCOptimizable:

def fact_h(n, acc):
    if n == 0:
        return acc
    return fact_h(n-1, acc*n)

def fact(n):
    return fact_h(n, 1)


dis(fact)
  2           0 LOAD_GLOBAL              0 (fact_h)
              2 LOAD_FAST                0 (n)
              4 LOAD_CONST               1 (1)
              6 CALL_FUNCTION            2
              8 RETURN_VALUE

This is because the last thing to happen in any of these functions is to call another function.
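Since CPython does not perform TCO (see the comments below for Guido's rationale), a common workaround for tail-recursive code like fact_h is a trampoline: the function returns a zero-argument thunk instead of calling itself, and a driver loop keeps invoking thunks until a plain value comes back, so the Python call stack never grows. A minimal sketch (the name trampoline is illustrative, not from the answer above):

```python
def trampoline(f, *args):
    # Call f; as long as the result is another zero-argument callable
    # (a thunk), keep calling it. The call stack stays at constant depth.
    result = f(*args)
    while callable(result):
        result = result()
    return result

def fact_h(n, acc):
    if n == 0:
        return acc
    # Return a thunk instead of making the recursive call directly.
    return lambda: fact_h(n - 1, acc * n)

def fact(n):
    return trampoline(fact_h, n, 1)

print(fact(5))  # 120
```

With this version, fact(2000) succeeds even though it is well past CPython's default recursion limit of 1000.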

Solution 4

Probably the best high level description I have found for tail calls, recursive tail calls and tail call optimization is the blog post

"What the heck is: A tail call"

by Dan Sugalski. On tail call optimization he writes:

Consider, for a moment, this simple function:

sub foo (int a) {
  a += 15;
  return bar(a);
}

So, what can you, or rather your language compiler, do? Well, what it can do is turn code of the form return somefunc(); into the low-level sequence pop stack frame; goto somefunc();. In our example, that means before we call bar, foo cleans itself up and then, rather than calling bar as a subroutine, we do a low-level goto operation to the start of bar. Foo's already cleaned itself out of the stack, so when bar starts it looks like whoever called foo has really called bar, and when bar returns its value, it returns it directly to whoever called foo, rather than returning it to foo which would then return it to its caller.

And on tail recursion:

Tail recursion happens if a function, as its last operation, returns the result of calling itself. Tail recursion is easier to deal with because rather than having to jump to the beginning of some random function somewhere, you just do a goto back to the beginning of yourself, which is a darned simple thing to do.

So that this:

sub foo (int a, int b) {
  if (b == 1) {
    return a;
  } else {
    return foo(a*a + a, b - 1);
  }
}

gets quietly turned into:

sub foo (int a, int b) {
  label:
    if (b == 1) {
      return a;
    } else {
      a = a*a + a;
      b = b - 1;
      goto label;
    }
}

What I like about this description is how succinct and easy it is to grasp for those coming from an imperative-language background (C, C++, Java).
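The transformation Sugalski describes is exactly what you would write by hand as a loop. In Python (chosen here only for brevity; his pseudocode is Perl-like), the goto-based rewrite of foo collapses to:

```python
def foo(a, b):
    # Direct loop equivalent of the 'goto label' rewrite:
    # each iteration replaces one recursive call, assuming b >= 1.
    while b != 1:
        a = a * a + a
        b = b - 1
    return a

print(foo(2, 3))  # 2 -> 6 -> 42
```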

Solution 5

GCC C minimal runnable example with x86 disassembly analysis

Let's see how GCC can automatically do tail call optimizations for us by looking at the generated assembly.

This will serve as an extremely concrete example of what was mentioned in other answers such as https://stackoverflow.com/a/9814654/895245 that the optimization can convert recursive function calls to a loop.

This in turn saves memory and improves performance, since memory accesses are often the main thing that makes programs slow nowadays.

As an input, we give GCC a non-optimized naive stack based factorial:

tail_call.c

#include <stdio.h>
#include <stdlib.h>

unsigned factorial(unsigned n) {
    if (n == 1) {
        return 1;
    }
    return n * factorial(n - 1);
}

int main(int argc, char **argv) {
    int input;
    if (argc > 1) {
        input = strtoul(argv[1], NULL, 0);
    } else {
        input = 5;
    }
    printf("%u\n", factorial(input));
    return EXIT_SUCCESS;
}

GitHub upstream.

Compile and disassemble:

gcc -O1 -foptimize-sibling-calls -ggdb3 -std=c99 -Wall -Wextra -Wpedantic \
  -o tail_call.out tail_call.c
objdump -d tail_call.out

where -foptimize-sibling-calls is GCC's name for the generalization of tail calls ("sibling calls"), according to man gcc:

   -foptimize-sibling-calls
       Optimize sibling and tail recursive calls.

       Enabled at levels -O2, -O3, -Os.

as mentioned at: How do I check if gcc is performing tail-recursion optimization?

I chose -O1 because:

  • the optimization is not done with -O0. I suspect that this is because required intermediate transformations are missing.
  • -O3 produces ungodly efficient code that would not be very educational, although it is also tail call optimized.

Disassembly with -fno-optimize-sibling-calls:

0000000000001145 <factorial>:
    1145:       89 f8                   mov    %edi,%eax
    1147:       83 ff 01                cmp    $0x1,%edi
    114a:       74 10                   je     115c <factorial+0x17>
    114c:       53                      push   %rbx
    114d:       89 fb                   mov    %edi,%ebx
    114f:       8d 7f ff                lea    -0x1(%rdi),%edi
    1152:       e8 ee ff ff ff          callq  1145 <factorial>
    1157:       0f af c3                imul   %ebx,%eax
    115a:       5b                      pop    %rbx
    115b:       c3                      retq
    115c:       c3                      retq

With -foptimize-sibling-calls:

0000000000001145 <factorial>:
    1145:       b8 01 00 00 00          mov    $0x1,%eax
    114a:       83 ff 01                cmp    $0x1,%edi
    114d:       74 0e                   je     115d <factorial+0x18>
    114f:       8d 57 ff                lea    -0x1(%rdi),%edx
    1152:       0f af c7                imul   %edi,%eax
    1155:       89 d7                   mov    %edx,%edi
    1157:       83 fa 01                cmp    $0x1,%edx
    115a:       75 f3                   jne    114f <factorial+0xa>
    115c:       c3                      retq
    115d:       89 f8                   mov    %edi,%eax
    115f:       c3                      retq

The key difference between the two is that:

  • the -fno-optimize-sibling-calls version uses callq, which is the typical non-optimized function call.

    This instruction pushes the return address to the stack, therefore increasing it.

    Furthermore, this version also does push %rbx, which pushes %rbx to the stack.

    GCC does this because it stores %edi, which holds the first function argument (n), in %ebx before calling factorial.

    GCC needs to do this because it is preparing for another call to factorial, which will use the new edi == n-1.

    It chooses %ebx because this register is callee-saved (see: What registers are preserved through a linux x86-64 function call), so the subcall to factorial won't clobber it and lose n.

  • the -foptimize-sibling-calls version does not use any instructions that push to the stack: it only performs goto-style jumps within factorial, using the instructions je and jne.

    Therefore, this version is equivalent to a while loop, without any function calls. Stack usage is constant.

Tested in Ubuntu 18.10, GCC 8.2.

Author: majelbstoat

Updated on October 19, 2021

Comments

  • majelbstoat
    majelbstoat over 2 years

    Very simply, what is tail-call optimization?

    More specifically, what are some small code snippets where it could be applied, and where not, with an explanation of why?

  • majelbstoat
    majelbstoat over 15 years
    The whole 'function g can be f' thing was a little confusing, but I get what you mean, and the examples really clarified things. Thanks a lot!
  • Kyle Cronin
    Kyle Cronin over 15 years
    If you want to learn more about this, I suggest reading the first chapter of Structure and Interpretation of Computer Programs.
  • Steve Gilham
    Steve Gilham almost 15 years
    Not necessarily true on a language by language basis -- the 64 bit C# compiler may insert tail opcodes whereas the 32-bit version will not; and F# release build will, but F# debug won't by default.
  • Vigneshwaran
    Vigneshwaran over 11 years
    I guess an inner function which will tail-recurse and an accumulator as parameter to that function are typical in most cases.
  • rmcc
    rmcc over 11 years
    Excellent example that illustrates the concept. Just take into account that the language you choose has to implement tail call elimination or tail call optimization. In the example, written in Python, if you enter a value of 1000 you get a "RuntimeError: maximum recursion depth exceeded" because the default Python implementation does not support Tail Recursion Elimination. See a post from Guido himself explaining why that is: neopythonic.blogspot.pt/2009/04/tail-recursion-elimination.html.
  • nomen
    nomen about 11 years
    3 hasn't been yet optimized. That is the unoptimized representation which the compiler transforms into iterative code which uses constant stack space instead of recursive code. TCO is not the cause of using the wrong recursion scheme for a data structure.
  • grillSandwich
    grillSandwich about 11 years
    "TCO is not the cause of using the wrong recursion scheme for a data structure" Please elaborate how this is relevant to the given case. The above example just points out an example of the frames are allotted on the call stack with and without TCO.
  • nomen
    nomen about 11 years
    You chose to use unfounded recursion to traverse (). That had nothing to do with TCO. eternity happens to be tail-call position, but tail-call position is not necessary: void eternity() { eternity(); exit(); }
  • nomen
    nomen about 11 years
    While we're at it, what is a "large scale recursion"? Why should we avoid goto's in the function? This is neither necessary nor sufficient to allow TCO. And what instruction overhead? The whole point of TCO is that the compiler replaces the function call in tail position by a goto.
  • grillSandwich
    grillSandwich about 11 years
    TCO is about optimizing the space used on call stack. By large scale recursion, I'm referring to size of frame. Everytime a recursion occurs, if I need to allot a huge frame on call stack above the callee function, TCO would be more helpful and allow me more levels of recursion. But in case my frame size is less, I can do without TCO and still run my program well (I'm not talking about infinite recursion here). If you are left with goto in the function, "tail" call is not actually tail call and TCO is not applicable.
  • grillSandwich
    grillSandwich about 11 years
    Instruction overhead is the compiler's extra effort to replace the current variables with the corresponding ones of the next frame. This is substitution to having a new frame over the callee's frame.
  • nomen
    nomen about 11 years
    void eternity() { goto dur; dur: eternity() }
  • grillSandwich
    grillSandwich about 11 years
    Oh yes. Thank you for the clarification on goto.
  • Totti
    Totti over 10 years
    "TCO applys to a special case of recursion". I'm afraid that is completely wrong. Tail calls apply to any call in tail position. Commonly discussed in the context of recursion but actually nothing specifically to do with recursion.
  • Totti
    Totti over 10 years
    Strictly speaking, tail call optimization does not necessarily replace the caller's stack frame with the callees but, rather, ensures that an unbounded number of calls in tail position require only a bounded amount of space. See Will Clinger's paper "Proper tail recursion and space efficiency": cesura17.net/~will/Professional/Research/Papers/tail.pdf
  • SexyBeast
    SexyBeast over 9 years
    I didn't get it, isn't the initial foo function tail call optimized? It is only calling a function as its last step, and it is simply returning that value, right?
  • SexyBeast
    SexyBeast over 9 years
    @Brian, look at the link @btiernay provided above. Isn't the initial foo method tail call optimized?
  • Will Ness
    Will Ness about 9 years
    @Cupidvogel correct, though it is not TCOptimized, but rather TCOptimizable.
  • TryinHard
    TryinHard over 8 years
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
  • btiernay
    btiernay over 8 years
    @TryinHard maybe not what you had in mind, but I updated it to give a gist of what it is about. Sorry, not going to repeat the whole article!
  • Sevin7
    Sevin7 over 8 years
    Thank you, this is simpler and more understandable than the most up-voted scheme example (not to mention, Scheme is not a common language that most developers understand)
  • Shasak
    Shasak over 8 years
    can you explain what a stackframe means precisely? Is there a difference between the call stack and stackframe?
  • Christoph
    Christoph over 8 years
    @Kasahs: a stack frame is the part of the call stack that 'belongs' to a given (active) function; cf en.wikipedia.org/wiki/Call_stack#Structure
  • dclowd9901
    dclowd9901 about 8 years
    Is this just a way to write recursive functions in a constant-space way? Because couldn't you achieve the same results using an iterative approach?
  • Will Ness
    Will Ness over 7 years
    "The only situation" is a bit too absolute; there's also TRMC, at least in theory, which would optimize (cons a (foo b)) or (+ c (bar d)) in tail position in the same way.
  • mcoolive
    mcoolive over 7 years
    @dclowd9901, TCO allows you to prefer a functional style rather than an iterative loop. You can prefer imperative style. Many languages (Java, Python) don't provide TCO, so you have to know that a function call costs memory... and the imperative style is preferred.
  • James Beninger
    James Beninger over 7 years
    As someone who rarely dives into functional languages, it's gratifying to see an explanation in "my dialect". There's an (understandable) tendency for functional programmers to evangelize in their language of choice, but coming from the imperative world I find it so much easier to wrap my head around an answer like this.
  • ColdCold
    ColdCold about 7 years
    @dclowd9901, this isn't about maintaining constant spacing in your source code. Only the 1st example above is source code. For further explanation, please read my following response to mcoolive, below.
  • ColdCold
    ColdCold about 7 years
    @mcoolive, this isn't about language features or a code style preference. In any language, a TCO implementation of recursion will run faster, while using a small, fixed-size block of memory. The 2nd code example implements recursion with iteration, where each function call is open, waiting for its result and memory usage grows larger on each loop. The 3rd example shows how TCO places the recursive call at the end of the code block (tail call), so on each recursion, the call completes and memory is reused.
  • mcoolive
    mcoolive about 7 years
    @ColdCold, maybe it's just a vocabulary issue, let's be more specific. TCO optimizes stack memory (some people discuss the word 'optimization' too...) In most cases (I think always) it will produce faster code, but it's not the first goal. You said "In any language, a TCO implementation of recursion will run faster". I agree (I would have formulated it differently), but as Java developer I have to know that my runtime doesn't perform this optimization.
  • Nithin
    Nithin over 6 years
    I liked your f and g approach better than the accepted answer, maybe because i am a maths person.
  • agm1984
    agm1984 over 6 years
    I just had fairly intense epiphany after reading this post after reading 2ality.com/2015/06/tail-call-optimization.html
  • mbomb007
    mbomb007 over 6 years
    It should be noted that support for TCO by the browsers is not guaranteed, and may never be supported. stackoverflow.com/a/42788286/2415524
  • Jacques Mathieu
    Jacques Mathieu over 6 years
    I think you mean TCOptimized. Saying it's not TCOptimizable implies that it can never be optimized (when in fact it can)
  • Al Sweigart
    Al Sweigart over 5 years
    Note that CPython doesn't support tail call optimization. This is a design choice for reasons explained here: stackoverflow.com/questions/13591970/…
  • Jose Quinteiro
    Jose Quinteiro almost 4 years
    Nice C iteration example
  • tonix
    tonix over 2 years
    Does it mean that in the tail recursive factorial the compiler sees this line (fact-tail (- x 1) (* x accum)) and knows that it can discard the current stack frame before allocating a new one for the next call to fact-tail? Correct? Thank you.