FFMPEG multiple outputs performance (Single instance vs Multiple instances)


A less obvious problem is that, depending on your input, output, or filters, ffmpeg may need to convert the pixel format internally. With parallel outputs this can become a bottleneck if the conversion is done separately for each stream.

The idea is to do the pixel format conversion once if possible, like:

-filter_complex '[0:v]format=yuv420p, split=3[s1][s2][s3]' \
-map '[s1]' ... \
-map '[s2]' ... \
-map '[s3]' ... \

Likewise, a filter that is applied identically to all outputs should run only once, before the split. Note that some filters may require a specific pixel format.
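Applied to the three crops from the question, the idea could look roughly like the sketch below. It only builds and prints the command, so it runs without ffmpeg installed. The helper names `build_filter_graph` and `print_command` and the labels `[s1]`/`[v1]` are made up for illustration, and `yuv420p` is simply the format used in the snippet above; whether this actually removes the bottleneck on a given machine is something to measure, not assume.

```shell
#!/bin/sh
# Build a filter graph that converts the pixel format once, splits the
# result into one branch per output, and crops each branch separately.
# Arguments: one crop expression (w:h:x:y) per output.
build_filter_graph() {
    n=$#
    labels=""
    crops=""
    i=1
    for crop in "$@"; do
        labels="${labels}[s${i}]"
        crops="${crops};[s${i}]crop=${crop}[v${i}]"
        i=$((i + 1))
    done
    printf '[0:v]format=yuv420p,split=%d%s%s\n' "$n" "$labels" "$crops"
}

# Print the full ffmpeg command (not executed) for the given crops.
# input.mp4, libx264 and -b:v 5M are taken from the question.
print_command() {
    graph=$(build_filter_graph "$@")
    printf 'ffmpeg -i input.mp4 -filter_complex "%s"' "$graph"
    i=1
    while [ "$i" -le "$#" ]; do
        printf ' \\\n  -map "[v%d]" -c:v libx264 -b:v 5M out_%d.mp4' "$i" "$i"
        i=$((i + 1))
    done
    printf '\n'
}

print_command 'iw/2:ih/2:0:0' 'iw/2:ih/2:iw/2:0' 'iw/2:ih/2:0:ih/2'
```

The printed command can be pasted into a shell once it looks right. Adding more crop regions only means passing more arguments to `print_command`.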

For other possible causes, see the short note at the bottom of the FFmpeg wiki page on creating multiple outputs:

Parallel encoding

Outputting and re encoding multiple times in the same FFmpeg process will typically slow down to the "slowest encoder" in your list. Some encoders (like libx264) perform their encoding "threaded and in the background" so they will effectively allow for parallel encodings, however audio encoding may be serial and become the bottleneck, etc. It seems that if you do have any encodings that are serial, it will be treated as "real serial" by FFmpeg and thus your FFmpeg may not use all available cores.
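If the single-process approach stays limited by a serial stage, the separate-process variant from the question can be scripted as below. This is only a sketch: `encode_cmd`, `run_all` and the `RUN` switch are hypothetical names, and by default the commands are printed rather than executed. Keep in mind that every process decodes the input again, which is exactly the cost the question wants to avoid.

```shell
#!/bin/sh
# Print the ffmpeg command for one crop region.
# $1 = crop expression (w:h:x:y), $2 = output file.
# -nostdin keeps background ffmpeg jobs from reading the terminal.
encode_cmd() {
    printf 'ffmpeg -nostdin -i input.mp4 -filter:v crop=%s -c:v libx264 -b:v 5M %s\n' "$1" "$2"
}

# Start one encode per crop expression.
# With RUN=1 each command runs as its own background process;
# otherwise the commands are only printed (dry run).
run_all() {
    i=1
    for crop in "$@"; do
        cmd=$(encode_cmd "$crop" "out_${i}.mp4")
        if [ "${RUN:-0}" = 1 ]; then
            sh -c "$cmd" &
        else
            echo "$cmd"
        fi
        i=$((i + 1))
    done
    wait    # returns once all background encodes have finished
}

run_all 'iw/2:ih/2:0:0' 'iw/2:ih/2:iw/2:0' 'iw/2:ih/2:0:ih/2'
```

Saved as a script (the file name is up to you), running it with `RUN=1` in the environment starts the three encodes in parallel; without it, the script just lists what it would run.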


Author: shalin

Updated on September 18, 2022

Comments

  • shalin
    shalin almost 2 years

    I am working on creating multiple encoded streams from a single input file (.mp4). The input stream has no audio. Each encoded stream is created by cropping a different part of the input and is then encoded at the same bit rate on a 32-core system.

    Here are the scenarios I am trying, as explained in the ffmpeg wiki for creating multiple outputs: https://trac.ffmpeg.org/wiki/Creating%20multiple%20outputs

    Scenario1 (Using single ffmpeg instance)

    ffmpeg -i input.mp4 \
    -filter:v crop=iw/2:ih/2:0:0 -c:v libx264 -b:v 5M out_1.mp4 \
    -filter:v crop=iw/2:ih/2:iw/2:0 -c:v libx264 -b:v 5M out_2.mp4 \
    -filter:v crop=iw/2:ih/2:0:ih/2 -c:v libx264 -b:v 5M out_3.mp4

    In this case, I am assuming that ffmpeg decodes the input only once and supplies the decoded frames to all the crop filters. Please correct me if that is not right.

    Scenario2 (Using multiple ffmpeg instances and hence three separate processes)

    ffmpeg -i input.mp4 -filter:v crop=iw/2:ih/2:0:0 -c:v libx264 -b:v 5M out_1.mp4

    ffmpeg -i input.mp4 -filter:v crop=iw/2:ih/2:iw/2:0 -c:v libx264 -b:v 5M out_2.mp4

    ffmpeg -i input.mp4 -filter:v crop=iw/2:ih/2:0:ih/2 -c:v libx264 -b:v 5M out_3.mp4

    In my case, I actually need to encode many more streams by cropping different sections of the input video. I am showing three here just to keep the example simple.

    Now, in terms of fps performance, I see that Scenario 2 performs better. It also uses the CPU to its maximum (more than 95% CPU utilization). Scenario 1 has lower fps and much lower CPU utilization (close to 65%). Also, in this case, as I increase the number of streams to be encoded, the CPU utilization does not increase linearly. It becomes almost 1.5x when I go from one stream to two, but after that the increments are very small (probably 10%, and even less with more streams).

    So my question is: I want to use a single ffmpeg instance because it avoids decoding multiple times, and because my input could be as big as 4K or even bigger. What should I do to get better CPU utilization (> 90%) and hence, hopefully, better fps? Also, why does the CPU utilization not increase linearly with the number of streams to be encoded? Why doesn't a single ffmpeg instance perform as well as multiple instances? It seems to me that with a single ffmpeg instance, the encodes are not truly running in parallel.

    Edit: Here's the simplest way I can reproduce and explain the issue, in case things are not so clear. Keep in mind that this is just an experiment to understand the issue.

    Single Instance: ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null - -c:v libx264 -x264opts threads=1 -b:v 1M -f null - -c:v libx264 -x264opts threads=1 -b:v 1M -f null -

    Multiple Instances: ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null - | ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null - | ffmpeg -y -i input.mp4 -c:v libx264 -x264opts threads=1 -b:v 1M -f null -

    Note that I am limiting x264 to a single thread. In the single-instance case, I would expect ffmpeg to create one encoding thread for each x264 encode and execute them in parallel. But I see that only one CPU core is fully utilized, which makes me believe that only one encode session is running at a time. With multiple instances, on the other hand, I see that three CPU cores are fully utilized, which I guess means that all three encodes are running in parallel.

    I really hope some experts can jump in and help with this.

    • shalin
      shalin about 7 years
      btw, I have searched extensively on this topic and none of the posts really explain why the single instance does not perform as well. The closest post I could find was this one (stackoverflow.com/questions/12465914/…), but it lacks the kind of detail I am looking for.
    • Pablo H
      Pablo H almost 3 years
      Have you tried looking at the ffmpeg source code (ffmpeg.c, perhaps?) to see how multiple outputs are implemented? Perhaps it's just a for... :-)
  • shalin
    shalin about 7 years
    So I modified the command line to include bufsize in the following way:

    ffmpeg -i input.mp4 \
    -filter:v crop=iw/2:ih/2:0:0 -c:v libx264 -b:v 5M -bufsize 50000k out_1.mp4 \
    -filter:v crop=iw/2:ih/2:iw/2:0 -c:v libx264 -b:v 5M -bufsize 50000k out_2.mp4 \
    -filter:v crop=iw/2:ih/2:0:ih/2 -c:v libx264 -b:v 5M -bufsize 50000k out_3.mp4

    But I don't see any improvement in fps performance or CPU utilization.
  • Akumaburn
    Akumaburn about 7 years
    Strange, my CPU usage improved significantly with the higher buffer size. What are the duration and file size of your video?
  • shalin
    shalin about 7 years
    I am using a Google Cloud instance with 64 CPUs and 416GB RAM, so we can easily rule out CPU, RAM, and disk issues. I have been using it for a while and it has very consistent and reliable performance in all benchmarks. The FFmpeg version I had was about 6 months old; I also tried the latest 3.3.2 build, but that didn't help. I also tried changing the number of threads, without any success.
  • shalin
    shalin about 7 years
    Now back to the main discussion: I tried running your command line. With that I get about 12fps with less than 50% CPU utilization. But when I run each encode as a separate ffmpeg process, I can get close to 30fps with almost 100% CPU utilization. Remember I have 64 cores, so the workload has to be really parallel and compute-heavy to reach full utilization. In your case, it's easy to reach 100% CPU because you have only 4 cores.
  • shalin
    shalin about 7 years
    I have tried this with varying file sizes (100MB to 1GB) and durations (1min to 10min). I really don't see why the file size or duration would have any effect on this. The input frame resolutions I have tried are 1080p, 4K and higher.
  • Akumaburn
    Akumaburn about 7 years
    I've tested this myself directly with libx264, albeit not through ffmpeg's command line. Maybe ffmpeg is expecting a different format? What happens if you try -bufsize 50M, any difference?
  • flolilo
    flolilo about 7 years
    Of course you're right about my 4-core setup not being comparable with your 64-core one. I am sorry, I totally missed the sentence that explained this. In about 72 hours I could test it on an i7-5820k CPU, but I think that's also not really comparable and therefore also useless. Have you tried it with the above-mentioned file? Also, does -an change anything? Other than that, I'm out of ideas. I'm sorry...
  • shalin
    shalin about 7 years
    Yes, I tried it with the file you gave me. Also, -an doesn't change anything. Thanks a lot for spending your time on this.
  • shalin
    shalin about 7 years
    btw, if you have time, you can check out the Edit section of my original question and try out both cases with your input file. I am sure you can reproduce what I am trying to explain on your 4-core platform.
  • flolilo
    flolilo about 7 years
    I checked that and I can reproduce the issue with -x264opts threads=1; however, that is to be expected as it reduces the threads. Typically, one should use 1 thread per core (or more). I let PowerShell create some CSVs with start and end times of the tests plus RAM and CPU stats. It's a huge pile of data. I fed it into a diagram in Excel and found that: a) x264opts threads are more "efficient" than -threads, b) everything above 1 thread at least has a chance of reaching 100% CPU, c) more threads = more RAM used, d) auto (meaning not stating anything about threads) works quite well.
  • flolilo
    flolilo about 7 years
    (I can't edit my last comment any more.) But I know that you know that. I would be very glad to share my test results (in the meantime, I even ran them on my 5820k), but I have no idea how to do that within the limits of superuser.com, as the diagram alone needs to be 200cm wide for one to see anything...