cat a very large number of files together in correct order
Solution 1
Using find, sort and xargs:
find . -maxdepth 1 -type f -name 'file_*.pdb' -print0 |
sort -zV |
xargs -0 cat >all.pdb
The find command finds all relevant files, then prints their pathnames to sort, which does a "version sort" to get them into the right order (if the numbers in the filenames had been zero-filled to a fixed width, we would not have needed -V). xargs takes this list of sorted pathnames and runs cat on them in as large batches as possible.
This should work even if the filenames contain strange characters such as newlines and spaces. We use -print0 with find to give sort nul-terminated names to sort, and sort handles these with its -z flag. xargs, too, reads nul-terminated names with its -0 flag.
Note that I'm writing the result to a file whose name does not match the pattern file_*.pdb.
The above solution uses some non-standard flags for some utilities. These are supported by the GNU implementations of these utilities, and at least by the OpenBSD and macOS implementations. The non-standard flags used are:
-maxdepth 1, to make find only enter the top-most directory but no subdirectories. POSIXly, use find . ! -name . -prune ... instead.
-print0, to make find output nul-terminated pathnames (this was considered by POSIX but rejected). One could use -exec printf '%s\0' {} + instead.
-z, to make sort take nul-terminated records. There is no POSIX equivalent.
-V, to make sort sort e.g. 200 after 3. There is no POSIX equivalent, but it could be replaced by a numeric sort on a specific part of the filename if the filenames have a fixed prefix.
-0, to make xargs read nul-terminated records. There is no POSIX equivalent. POSIXly, one would need to quote the file names in a format recognised by xargs.
If the pathnames are well behaved, and if the directory structure is flat (no subdirectories), then one could make do without these flags, except for -V with sort.
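Under those assumptions, the POSIX replacements mentioned above can be combined into one pipeline. A sketch, assuming well-behaved names (no newlines or blanks) and sorting numerically on the part of each pathname after the underscore; the -k2n key is my choice for these particular file_N.pdb names, not something from the answer above:

```shell
# Mostly-POSIX variant for well-behaved names in a flat directory:
# sort numerically on the field after "_" instead of using -V/-z/-0.
find . ! -name . -prune -type f -name 'file_*.pdb' |
  sort -t_ -k2n |
  xargs cat > all.pdb
```

This breaks as soon as a name contains whitespace or quote characters, which is exactly what the nul-terminated version avoids.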
Solution 2
With zsh (which is where that {1..15000} operator comes from):
autoload zargs # best in ~/.zshrc
zargs file_{1..15000}.pdb -- cat > file_all.pdb
Or, for all file_<digits>.pdb files in numerical order:
zargs file_<->.pdb(n) -- cat > file_all.pdb
(where <x-y> is a glob operator that matches decimal numbers x to y; with no x nor y, it matches any decimal number. It is equivalent to extendedglob's [0-9]## or kshglob's +([0-9]), i.e. one or more digits.)
With ksh93, using its builtin cat command (so not affected by that limit of the execve() system call, since there's no execution):
command /opt/ast/bin/cat file_{1..15000}.pdb > file_all.pdb
With bash/zsh/ksh93 (which support zsh's {x..y} and have a printf builtin):
printf '%s\n' file_{1..15000}.pdb | xargs cat > file_all.pdb
On a GNU system or compatible, you could also use seq:
seq -f 'file_%.17g.pdb' 15000 | xargs cat > file_all.pdb
For the xargs-based solutions, special care must be taken with file names that contain blanks, single or double quotes, or backslashes. For instance, for -It's a trickier filename - 12.pdb, use:
seq -f "\"./-It's a trickier filename - %.17g.pdb\"" 15000 |
xargs cat > file_all.pdb
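An alternative to quoting each whole name is to backslash-escape every character, which xargs also undoes when splitting its input (embedded newlines still won't survive). A sketch of that approach, using sed to do the escaping; this is my variation, not the quoting used above:

```shell
# Escape every character so blanks, quotes and backslashes in the
# names pass through xargs literally (newlines in names still break).
printf '%s\n' "./-It's a trickier filename - "{1..15000}".pdb" |
  sed 's/./\\&/g' |
  xargs cat > file_all.pdb
```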
Solution 3
A for loop is possible, and very simple.
for i in file_{1..15000}.pdb; do cat "$i" >> file_all.pdb; done
The downside is that you invoke cat a hell of a lot of times. But if you can't remember exactly how to do the stuff with find, and the invocation overhead isn't too bad in your situation, then it's worth keeping in mind.
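A minor variation on the loop redirects once for the whole loop instead of opening the output file for append on every iteration (which also avoids appending to leftovers from a previous run):

```shell
# One redirection for the whole loop instead of one open/append per file;
# quoting "$i" keeps the loop safe should a name ever contain blanks.
for i in file_{1..15000}.pdb; do
  cat "$i"
done > file_all.pdb
```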
Solution 4
seq 1 15000 | awk '{print "file_"$0".pdb"}' | xargs cat > file_all.pdb
Solution 5
Premise
You shouldn't run into that error for only 15k files with that specific name format [1,2].
If you are running that expansion from another directory and you have to add the path to each file, the size of your command will be bigger, and of course the error can occur.
Solution: run the command from that directory.
(cd That/Directory ; cat file_{1..2000}.pdb >> file_all.pdb )
Best solution: if instead I guessed wrong and you run it from the directory the files are in...
IMHO the best solutions are Stéphane Chazelas' ones:
seq -f 'file_%.17g.pdb' 15000 | xargs cat > file_all.pdb
with printf or seq; tested on 15k files with only their number inside, pre-cached, it is even the fastest one (at present, and excepting the OP's command run from the same directory the files are in).
Some words more
You should be able to pass longer command lines to your shell. Your command line is 213914 characters long and contains 15003 words:
echo cat file_{1..15000}.pdb " > file_all.pdb" | wc
...and even adding 8 bytes for each word it is 333938 bytes (0.3M), far below the 2097142 (2.1M) reported by ARG_MAX on a kernel 3.13.0, or the slightly smaller 2088232 reported as "Maximum length of command we could actually use" by xargs --show-limits.
Have a look on your system at the output of:
getconf ARG_MAX
xargs --show-limits
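To see how close a given expansion comes to the limit, you can tally the argument bytes yourself. A rough sketch; the 8-byte pointer slot per argument is an assumption for a 64-bit system, matching the estimate above:

```shell
# Estimate argv usage: each argument costs its length + a NUL terminator
# + (assumed) an 8-byte pointer slot on a 64-bit system.
args=(file_{1..15000}.pdb)
total=0
for a in "${args[@]}"; do
  total=$(( total + ${#a} + 1 + 8 ))
done
echo "estimated $total bytes; ARG_MAX is $(getconf ARG_MAX)"
```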
Laziness guided solution
In cases like this I prefer to work with blocks, not least because a time-efficient solution usually comes out of it.
The logic (if any) is: I'm far too lazy to write 1..1000, 1001..2000, etc., so I ask a script to do it for me.
Only after I've checked that the output is correct do I redirect it to the shell.
...but laziness is a state of mind.
Since I'm allergic to xargs (I really should have used xargs here) and I do not want to check how to use it, I invariably end up reinventing the wheel, as in the examples below (tl;dr).
Note that since the file names are controlled (no spaces, newlines...) you can easily go with something like the scripts below.
tl;dr
Version 1: pass as optional parameters the first file number, the last, the block size, and the output file name
#!/bin/bash
StartN=${1:-1}          # first file number
EndN=${2:-15000}        # last file number
BlockN=${3:-100}        # files in a block
OutFile=${4:-"all.pdb"} # output file name

CurrentStart=$StartN
for i in $(seq "$StartN" "$BlockN" "$EndN")
do
    CurrentEnd=$i
    cat $(seq -f 'file_%.17g.pdb' "$CurrentStart" "$CurrentEnd") >> "$OutFile"
    CurrentStart=$(( CurrentEnd + 1 ))
done
# A last iteration may be needed for the part cut off by seq
[[ $EndN -ge $CurrentStart ]] &&
    cat $(seq -f 'file_%.17g.pdb' "$CurrentStart" "$EndN") >> "$OutFile"
Version 2
Calling bash for the expansion (a bit slower in my tests, ~20%).
#!/bin/bash
StartN=${1:-1}          # first file number
EndN=${2:-15000}        # last file number
BlockN=${3:-100}        # files in a block
OutFile=${4:-"all.pdb"} # output file name

CurrentStart=$StartN
for i in $(seq "$StartN" "$BlockN" "$EndN")
do
    CurrentEnd=$i
    echo cat file_{$CurrentStart..$CurrentEnd}.pdb | /bin/bash >> "$OutFile"
    CurrentStart=$(( CurrentEnd + 1 ))
done
# A last iteration may be needed for the part cut off by seq
[[ $EndN -ge $CurrentStart ]] &&
    echo cat file_{$CurrentStart..$EndN}.pdb | /bin/bash >> "$OutFile"
Of course you can go further and get rid of seq [3] (from coreutils) completely, working directly with the variables in bash, or use python, or compile a C program to do it [4]...
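Sticking to bash alone, a sketch of what "getting rid of seq" could look like, reusing the same block logic and default numbers; the names are assumed well behaved, as above:

```shell
#!/bin/bash
# Pure-bash sketch (no seq): build each block of names with arithmetic loops.
StartN=1 EndN=15000 BlockN=100 OutFile=all.pdb
for (( s = StartN; s <= EndN; s += BlockN )); do
  e=$(( s + BlockN - 1 ))
  (( e > EndN )) && e=$EndN
  names=()
  for (( i = s; i <= e; i++ )); do names+=( "file_$i.pdb" ); done
  cat "${names[@]}"
done > "$OutFile"
```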
sodiumnitrate
Updated on September 18, 2022
Comments
-
sodiumnitrate almost 2 years
I have about 15,000 files that are named file_1.pdb, file_2.pdb, etc. I can cat a few thousand of these in order by doing:
cat file_{1..2000}.pdb >> file_all.pdb
However, if I do this for 15,000 files, I get the error
-bash: /bin/cat: Argument list too long
I have seen this problem being solved with find . -name xx -exec xx, but this wouldn't preserve the order in which the files are joined. How can I achieve this?
-
roaima over 6 years
What is the tenth file named? (Or any file with more than a single digit in its number.)
-
roaima over 6 years
I (now) have 15,000 of these files in a directory and your cat file_{1..15000}.pdb construct works fine for me.
-
ilkkachu over 6 years
The limit depends on the system; getconf ARG_MAX should tell.
-
John Hughes over 6 years
Consider changing your question to "thousands of" or "a very large number of" files. Might make the question easier to find for other people with a similar problem.
-
-
Kevin over 6 years
You don't need nonstandard nul termination for this. These filenames are exceedingly boring and the POSIX tools are entirely capable of handling them.
-
ssola over 6 years
You could also write this more succinctly with the asker's specification as printf 'file_%d.pdb\0' {1..15000} | xargs -0 cat, or even with Kevin's point, echo file_{1..15000}.pdb | xargs cat. The find solution has considerably more overhead since it has to search the file system for those files, but it is more useful when some of the files may not exist.
-
John Hughes over 6 years
@Kevin while what you are saying is true, it's arguably better to have an answer that applies in more general circumstances. Of the next thousand people that have this question, it's likely that some of them will have spaces or whatever in their file names.
-
Kusalananda over 6 years
@chrylis A redirection is never part of a command's arguments, and it's xargs rather than cat that is redirected (each cat invocation will use xargs' standard output). If we had said xargs -0 sh -c 'cat >all.pdb', then it would have made sense to use >> instead of >, if that's what you're hinting at.
-
Kusalananda over 6 years
@kojiro Yes, possibly. It's a matter of taste, I suppose. I tend to want to work with the names in the directory rather than with constructed names that I don't even know exist. Also, my private view is that brace expansions are kinda ugly; {1..15000} will expand to all integers between 1 and 15000, which will use memory wastefully (about 80 KB worth of memory for a list just to iterate over).
-
Stéphane Chazelas over 6 years
POSIX sort doesn't support non-text input. -t is to separate fields, not records.
-
Kusalananda over 6 years
@StéphaneChazelas Thanks. Yes, trying to process all sorts of filenames on a strictly POSIX system will probably need another solution.
-
Stéphane Chazelas over 6 years
awk can do seq's job here and seq can do awk's job: seq -f file_%.10g.pdb 15000. Note that seq is not a standard command.
-
Stéphane Chazelas over 6 years
Note that %g is short for %.6g. It would represent 1,000,000 as 1e+06, for instance.
-
Stéphane Chazelas over 6 years
Really lazy people use the tools designed for the task of working around that E2BIG limitation, like xargs, zsh's zargs or ksh93's command -x.
Stéphane Chazelas over 6 years
seq is not a bash builtin; it's a command from GNU coreutils. seq -f %g 1000000 1000000 outputs 1e+06 even in the latest version of coreutils.
-
Hastur over 6 years
@StéphaneChazelas Laziness is a state of mind. Strange to say, but I feel cosier when I can see (and visually check the output of) a serialized command and only then redirect it to execution. That construction makes me think less than xargs... but I understand it is personal and maybe related only to me.
-
Hastur over 6 years
@StéphaneChazelas Gotcha, right... Fixed. Thanks. I tested only with the 15k files given by the OP, my bad.
-
Hastur over 6 years
The seq -f | xargs cat > is the most elegant and effective solution (IMHO).
-
Hastur over 6 years
Check the trickier filename... maybe '"./-It'\''s a trickier filename - %.17g.pdb"'?
-
ilkkachu over 6 years
@Hastur, while you're right that those 15k files might very well fit in a command line of a modern Linux system, do note that they never said they were running Linux. There are other systems too, and e.g. the Mac I'm on appears to have the limit at 256k. Which is more than those 213898 bytes you mention, but note that there's some per-argument overhead too (at least the size of a pointer, so probably 4 or 8 bytes).
-
Stéphane Chazelas over 6 years
@Hastur, oops! Yes, thanks, I've changed it to an alternative quoting syntax. Yours would work as well.
-
Hastur over 6 years
@ilkkachu I mainly wanted to emphasize the fact that this error may occur just because you choose a different location from which to run your command. Let's see: touch {1..146542} is the highest number that did not throw that error on this particular system... so echo touch {1..146542} | wc gives me 914695 characters and 146543 words (of course)... mumble... the length of the command is 5 + 1 space... mumble mumble... (2097142 - 914695 - 6) / 146542... mumble, something more than 8... so 8 bytes for each parameter... Wow, I think I just found out I have a 64-bit system! ;-)
-
phuclv over 6 years
I think it's better to use sort -n instead of sort -V
-
Kusalananda over 6 years
@LưuVĩnhPhúc It depends on whether you need to sort the strings file3, file200 and file10 into the correct order or not. sort -n will not sort these into the correct order.
-
Scott - Слава Україні over 6 years
It looks like sort -n -k1.6 would work (for the original file_nnn filenames, or sort -n -k1.5 for the ones without the underscore).
-
LarryC over 6 years
Thanks Stéphane -- I think seq -f is a great way to do this; will remember that.
-
Rolf over 6 years
I often add an echo $i; in the loop body as a "progress indicator".