Fastest way to concatenate files
Solution 1
Nope, cat is surely the best way to do this. Why use Python when there is a program already written in C for this purpose? However, you might want to consider using xargs in case the command line length exceeds ARG_MAX and you need more than one cat invocation. Using GNU tools, this is equivalent to what you already have:
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
sort -z |
xargs -0 cat -- >>out
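As a quick sanity check, for plain names like these the pipeline produces output byte-identical to the simple glob from the question (both sort the names the same way, locale caveats aside). A small sketch, with illustrative file names and contents:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
printf 'a\n' > input_file1
printf 'b\n' > input_file2
printf 'c\n' > input_file3

# The glob version from the question:
cat input_file* >> out_glob

# The find | sort | xargs version; sort -z orders the NUL-separated
# names the same way the glob does for names like these:
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z |
  xargs -0 cat -- >> out_xargs

cmp out_glob out_xargs && echo identical
```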
Solution 2
Allocating the space for the output file first may improve the overall speed as the system won't have to update the allocation for every write.
For instance, if on Linux:
size=$({ find . -maxdepth 1 -type f -name 'input_file*' -printf '%s+'; echo 0;} | bc)
fallocate -l "$size" out &&
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
sort -z | xargs -r0 cat 1<> out
Another benefit is that if there's not enough free space, the copy will not be attempted.
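The 1<> redirection is what makes this work: > opens the file with O_TRUNC and would throw away the pre-allocated blocks, while 1<> opens stdout read+write without truncating, so the write lands at offset 0 and the rest of the file is kept. A minimal sketch (it uses truncate -s as a stand-in for fallocate so it runs on any filesystem; truncate makes a sparse file rather than truly allocating, but the O_TRUNC point is the same):

```shell
set -e
dir=$(mktemp -d); cd "$dir"

# Pre-size the output (stand-in for fallocate).
truncate -s 1M out

# '>' opens with O_TRUNC: the pre-sized file is cut back to the 5 bytes written.
echo data > out
echo "after > : $(stat -c %s out) bytes"     # 5 bytes

truncate -s 1M out

# '1<>' opens read+write WITHOUT O_TRUNC: the write overwrites the first
# bytes in place and the pre-sized length is preserved.
echo data 1<> out
echo "after 1<>: $(stat -c %s out) bytes"    # still 1048576 bytes
```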
If on btrfs, you could cp --reflink=always the first file (which implies no data copy and would therefore be almost instantaneous) and append the rest. With 10000 files, that probably won't make much difference though, unless the first file is very big.
There's an API to generalise that to ref-copy all the files (the BTRFS_IOC_CLONE_RANGE ioctl), but I could not find any utility exposing that API, so you'd have to do it in C (or Python or other languages, provided they can call arbitrary ioctls).
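Short of calling the ioctl directly, a shell approximation of the "reflink the first file, append the rest" idea can be sketched as below. The --reflink=auto flag is an assumption here, chosen so the sketch degrades to a normal copy on filesystems without reflink support (=always would fail there); file names are illustrative:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
printf 'first\n'  > input_file1
printf 'second\n' > input_file2
printf 'third\n'  > input_file3

# Pick the first file in sorted order (GNU head/tail understand -z).
first=$(find . -maxdepth 1 -type f -name 'input_file*' -print0 |
        sort -z | head -zn1 | tr -d '\0')

# Clone it where the FS supports reflinks (btrfs, XFS); plain copy elsewhere.
cp --reflink=auto -- "$first" out

# Append everything after the first, in the same sorted order.
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z | tail -zn +2 | xargs -r0 cat -- >> out
```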
If the source files are sparse or have large sequences of NUL characters, you could make a sparse output file (saving time and disk space) with (on GNU systems):
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
sort -z | xargs -r0 cat | cp --sparse=always /dev/stdin out
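To see the sparse copy in action, the following sketch (sizes and names are illustrative) builds an input with a long NUL run and pipes it through cp --sparse=always, which turns NUL runs into holes where the filesystem supports them; the apparent size stays identical while the disk usage (compare du out with stat -c %s out) can be far smaller:

```shell
set -e
dir=$(mktemp -d); cd "$dir"

# Build an input with a large run of NULs in the middle.
{ printf 'head'; head -c 1M /dev/zero; printf 'tail'; } > input_file1

# Concatenate through cp --sparse=always, as in the answer above.
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z | xargs -r0 cat -- | cp --sparse=always /dev/stdin out

# Same bytes, but 'out' may occupy far fewer blocks.
cmp input_file1 out && echo identical
```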
fsperrle
Updated on September 18, 2022

Comments
- fsperrle almost 2 years: I've got 10k+ files totaling over 20 GB that I need to concatenate into one file. Is there a faster way than cat input_file* >> out? The preferred way would be a bash command; Python is acceptable too if not considerably slower.
- S edwards over 10 years: Can you ensure in this case that your files will be read in order?
- hayath786 over 10 years: Yes, because the output of find is piped through sort. Without this, the files would be listed in an arbitrary order (defined by the file system, which could be file creation order).
- S edwards over 10 years: @scai I misread, sorry; with sort it's pretty obvious.
- Graeme over 10 years: @Kiwy, the only case I can see is if the locale isn't properly set in the environment; then sort might behave differently from a bash glob. Otherwise I don't see any cases where xargs or cat would not behave as expected.
- X Tian over 10 years: Love the pre-allocate, but should that be >>out instead of >out?
- umläute over 10 years: But your example will not copy if fallocate fails to pre-allocate, e.g. because the filesystem does not support it (currently only btrfs, ext4, ocfs2, and xfs support fallocate); since there is little harm done if pre-allocation fails, I guess it's safer to use fallocate -l "$size" out; find . ...
- Stéphane Chazelas over 10 years: @umläute, if fallocate fails because there's not enough space, then you'll waste time transferring everything. The solution assumes you're on a FS that supports fallocate. That's something you should be able to know beforehand.
- Stéphane Chazelas over 10 years: @XTian, no, it should be neither > nor >>, but 1<>, as I said, to write into the file.
- grebneke over 10 years: @StephaneChazelas - I still don't get 1<>; could you please post a link to a reference / explanation?
- Stéphane Chazelas over 10 years: @grebneke, <> is the standard Bourne/POSIX read+write redirection operator. See your shell manual or the POSIX spec for details. The default fd is 0 for the <> operator (<> is short for 0<>, like < is short for 0< and > is short for 1>), so you need the 1 to explicitly redirect stdout. Here, it's not so much that we need read+write (O_RDWR), but that we don't want O_TRUNC (as in >), which would deallocate what we've just allocated.
- Stéphane Chazelas over 10 years: @grebneke, <> was in the Bourne shell from the start (1979) but initially not documented.
- grebneke over 10 years: @StephaneChazelas - Any references to example usage of read+write in bash? Since there is no seek() in bash (?), what are common real-world usages, except for skipping O_TRUNC? man bash is really terse on the subject, and it's hard to usefully google "bash <>".
- Stéphane Chazelas over 10 years: @grebneke, unix.stackexchange.com/search?q=user%3A22565+%22%3C%3E%22 will give you a few. ksh93 has seek operators BTW, and you can seek forward with dd or via reading.
- grebneke over 10 years: @StephaneChazelas - thanks a lot, your help and knowledge is deeply appreciated!
- Graeme over 10 years: I'm not convinced that there will be many cases where fallocate will negate the overhead of the extra find, even though it will be faster the second time round. btrfs certainly opens up some interesting possibilities though.
- Marc van Leeuwen over 10 years: Do I read your last sentence correctly: the displayed code is equivalent to the command the OP gave ("what you already have"), and in particular does not solve a potential ARG_MAX problem? I can see only one call of cat in the displayed code, so it does not address that problem at all.
- Stéphane Chazelas over 10 years: @MarcvanLeeuwen, xargs will call as many cat invocations as necessary to avoid an E2BIG error from execve(2).
- Graeme over 10 years: @Marc, yes, the man page for GNU xargs is pretty bad and misses a couple of major points of xargs operation.
- SArcher over 3 years: cat might be the best common way, but its single-threaded approach is not very fast.