Fastest way to concatenate files


Solution 1

Nope, cat is surely the best way to do this. Why use Python when there is a program already written in C for this purpose? However, you might want to consider using xargs in case the command line length exceeds ARG_MAX and you need more than one cat invocation. Using GNU tools, this is equivalent to what you already have:

find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z |
  xargs -0 cat -- >>out

Solution 2

Allocating the space for the output file first may improve the overall speed as the system won't have to update the allocation for every write.

For instance, if on Linux:

size=$({ find . -maxdepth 1 -type f -name 'input_file*' -printf '%s+'; echo 0;} | bc)
fallocate -l "$size" out &&
  find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z | xargs -r0 cat 1<> out

Another benefit is that if there's not enough free space, the copy will not be attempted.

If on btrfs, you could cp --reflink=always the first file (which implies no data copy and would therefore be almost instantaneous), and append the rest. If there are 10000 files, that probably won't make much difference though, unless the first file is very big.
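
A rough sketch of that approach (input_file.0 is a placeholder for whichever file sorts first; adjust the name and the exclusion to match):

cp --reflink=always input_file.0 out &&
  find . -maxdepth 1 -type f -name 'input_file*' ! -name 'input_file.0' -print0 |
  sort -z | xargs -r0 cat >> out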

There's an API to generalise that to ref-copy all the files (the BTRFS_IOC_CLONE_RANGE ioctl), but I could not find any utility exposing that API, so you'd have to do it in C (or Python or other languages, provided they can call arbitrary ioctls).
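
For reference, here is a minimal Python sketch of that idea. The ioctl number and struct layout are assumptions copied from <linux/btrfs.h> (verify them on your system), and clone offsets must be block-aligned, so every input except possibly the last needs a size that is a multiple of the filesystem block size:

import fcntl
import os
import struct
import sys

# Assumed value of BTRFS_IOC_CLONE_RANGE, i.e.
# _IOW(0x94, 13, struct btrfs_ioctl_clone_range_args); check <linux/btrfs.h>.
BTRFS_IOC_CLONE_RANGE = 0x4020940D

def clone_append(out_fd, src_path, dest_offset):
    # Reflink the whole of src_path into out_fd at dest_offset; returns its size.
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        size = os.fstat(src_fd).st_size
        # struct btrfs_ioctl_clone_range_args {
        #     __s64 src_fd; __u64 src_offset, src_length, dest_offset; };
        # src_length == 0 means "up to the end of the source file".
        fcntl.ioctl(out_fd, BTRFS_IOC_CLONE_RANGE,
                    struct.pack('=qQQQ', src_fd, 0, 0, dest_offset))
        return size
    finally:
        os.close(src_fd)

out_fd = os.open('out', os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
offset = 0
for path in sorted(sys.argv[1:]):  # e.g. run as: python3 clone_cat.py input_file*
    offset += clone_append(out_fd, path, offset)
os.close(out_fd)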

If the source files are sparse or have large sequences of NUL characters, you could make a sparse output file (saving time and disk space) with (on GNU systems):

find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z | xargs -r0 cat | cp --sparse=always /dev/stdin out
Comments

  • fsperrle
    fsperrle almost 2 years

    I've got 10k+ files totaling over 20GB that I need to concatenate into one file.

    Is there a faster way than

    cat input_file* >> out
    

    ?

    The preferred way would be a bash command, Python is acceptable too if not considerably slower.

  • S edwards
    S edwards over 10 years
    Can you ensure in this case that your files will be read in order?
  • hayath786
    hayath786 over 10 years
    Yes, because the output of find is piped through sort. Without this, the files would be listed in an arbitrary order (defined by the file system, which could be file creation order).
  • S edwards
    S edwards over 10 years
    @scai I misread, sorry; with sort it's pretty obvious
  • Graeme
    Graeme over 10 years
    @Kiwy, the only case I can see is if the locale isn't properly set in the environment; then sort might behave differently from a bash glob. Otherwise I don't see any cases where xargs or cat would not behave as expected.
  • X Tian
    X Tian over 10 years
    love the pre-allocate, but should that be >>out instead of >out?
  • umläute
    umläute over 10 years
    But your example will not copy if fallocate fails to pre-allocate, e.g. because the filesystem does not support it (currently only btrfs, ext4, ocfs2, and xfs support fallocate); since there is little harm done if pre-allocation fails, I guess it's safer to use fallocate -l "$size" out; find . ...
  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
    @umläute, if fallocate fails because there's not enough space, then you'll waste time transferring everything. The solution assumes you're on a FS that supports fallocate. That's something you should be able to know beforehand.
  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
    @XTian, no, it should be neither > nor >>, but 1<>, as I said, to write into the file.
  • grebneke
    grebneke over 10 years
    @StephaneChazelas - I still don't get 1<>, could you please post a link to reference / explanation?
  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
    @grebneke, <> is the standard Bourne/POSIX read+write redirection operator. See your shell manual or the POSIX spec for details. The default fd is 0 for the <> operator (<> is short for 0<>, like < is short for 0< and > short for 1>), so you need the 1 to explicitly redirect stdout. Here, it's not so much that we need read+write (O_RDWR), but that we don't want O_TRUNC (as in >) which would deallocate what we've just allocated.
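    For example (a minimal illustration of the non-truncating behaviour, in any Bourne-like shell):
    printf abcdef > file
    echo AB 1<> file    # writes "AB" at offset 0 without truncating
    cat file            # prints "AB" then "def": the old tail survives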
  • grebneke
    grebneke over 10 years
    @StephaneChazelas - Any references to example usage of read+write in bash? Since there is no seek() in bash (?), what are common real-world usages, except for skipping O_TRUNC? man bash is really terse on the subject, and it's hard to usefully google bash <>
  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
    @grebneke, unix.stackexchange.com/search?q=user%3A22565+%22%3C%3E%22 will give you a few. ksh93 has seek operators BTW, and you can seek forward with dd or via reading.
  • grebneke
    grebneke over 10 years
    @StephaneChazelas - thanks a lot, your help and knowledge is deeply appreciated!
  • Graeme
    Graeme over 10 years
    I'm not convinced that there will be many cases where fallocate will negate the overhead of the extra find, even though it will be faster the second time round. btrfs certainly opens up some interesting possibilities though.
  • Marc van Leeuwen
    Marc van Leeuwen over 10 years
    Do I read your last sentence correctly: the displayed code is equivalent to the command OP gave ("what you already have"), and in particular does not solve a potential ARG_MAX problem? I can see only one call of cat in the displayed code, so it does not address that problem at all.
  • Stéphane Chazelas
    Stéphane Chazelas over 10 years
    @MarcvanLeeuwen, xargs will invoke cat as many times as necessary to avoid an E2BIG error from execve(2).
  • Graeme
    Graeme over 10 years
    @Marc, yes the man page for GNU xargs is pretty bad and misses a couple of major points of xargs operation.
  • SArcher
    SArcher over 3 years
    cat might be the best common way, but its single-threaded approach is not very fast.