Using find, sort and xargs:

find . -maxdepth 1 -type f -name 'file_*.pdb' -print0 |
sort -zV |
xargs -0 cat >all.pdb

The find command finds all relevant files and passes their pathnames to sort, which does a "version sort" to get them in the right order (if the numbers in the filenames had been zero-filled to a fixed width we would not have needed -V). xargs takes this list of sorted pathnames and runs cat on them in batches that are as large as possible.

This should work even if the filenames contain strange characters such as newlines and spaces. We use -print0 with find to give sort nul-terminated names, and sort handles these using -z. xargs too reads nul-terminated names with its -0 flag.

Note that I'm writing the result to a file whose name does not match the pattern file_*.pdb.
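To try the pipeline safely, here is a small sketch that builds a throwaway directory and checks the resulting order. The scratch directory and the sample file contents are made up for illustration, and GNU find, sort and xargs are assumed.

```shell
# Scratch directory with made-up sample files; nothing here touches real data.
dir=$(mktemp -d) && cd "$dir" || exit 1
for i in 1 2 3 10 20 100; do
  printf 'record %s\n' "$i" > "file_$i.pdb"
done

# Same pipeline as above: nul-terminated names, version sort, batched cat.
find . -maxdepth 1 -type f -name 'file_*.pdb' -print0 |
  sort -zV |
  xargs -0 cat > all.pdb

cat all.pdb
```

The records should come out in the order 1, 2, 3, 10, 20, 100; a plain sort -z without -V would place 10 and 100 before 2.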

The above solution uses some non-standard flags for some utilities. These are supported by the GNU implementations of these utilities and at least by the OpenBSD and macOS implementations.

The non-standard flags used are

  • -maxdepth 1, to make find only enter the top-most directory but no subdirectories. POSIXly, use find . ! -name . -prune ...
  • -print0, to make find output nul-terminated pathnames (this was considered by POSIX but rejected). One could use -exec printf '%s\0' {} + instead.
  • -z, to make sort take nul-terminated records. There is no POSIX equivalent.
  • -V, to make sort sort e.g. 200 after 3. There is no POSIX equivalent, but it could be replaced by a numeric sort on specific parts of the filename if the filenames have a fixed prefix.
  • -0, to make xargs read nul-terminated records. There is no POSIX equivalent. POSIXly, one would need to quote the file names in a format recognised by xargs.

If the pathnames are well behaved, and if the directory structure is flat (no subdirectories), then one could make do without these flags, except for -V with sort.
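For the well-behaved case, here is a sketch of what that could look like with a numeric sort in place of -V. It assumes names of the form file_<number>.pdb with no blanks, quotes, backslashes or newlines, and it runs in a made-up scratch directory. printf is a builtin in most shells, so the long glob expansion should not have to go through execve().

```shell
# Scratch directory with made-up sample files.
dir=$(mktemp -d) && cd "$dir" || exit 1
for i in 1 2 3 10 20 100; do
  printf 'record %s\n' "$i" > "file_$i.pdb"
done

# Split each name on "_" and sort numerically on the second field
# ("10.pdb" is read as the number 10).
printf '%s\n' file_*.pdb |
  sort -t _ -k 2n |
  xargs cat > all.pdb

cat all.pdb
```

As before, the output name all.pdb is chosen so that it does not match the file_*.pdb pattern.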

With zsh (where that {1..15000} operator comes from):

autoload zargs # best in ~/.zshrc
zargs file_{1..15000}.pdb -- cat > file_all.pdb

Or for all file_<digits>.pdb files in numerical order:

zargs file_<->.pdb(n) -- cat > file_all.pdb

(where <x-y> is a glob operator that matches on decimal numbers x to y. With no x nor y, it's any decimal number. Equivalent to extendedglob's [0-9]## or kshglob's +([0-9]) (one or more digits)).

With ksh93, using its builtin cat command (so not affected by that limit of the execve() system call since there's no execution):

command /opt/ast/bin/cat file_{1..15000}.pdb > file_all.pdb

With bash/zsh/ksh93 (which support zsh's {x..y} and have printf builtin):

printf '%s\n' file_{1..15000}.pdb | xargs cat > file_all.pdb

On a GNU system or compatible, you could also use seq:

seq -f 'file_%.17g.pdb' 15000 | xargs cat > file_all.pdb

For the xargs-based solutions, special care would have to be taken for file names that contain blanks, single or double quotes or backslashes.

For example, for files named like -It's a trickier filename - 12.pdb, use:

seq -f "\"./-It's a trickier filename - %.17g.pdb\"" 15000 |
  xargs cat > file_all.pdb
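A scaled-down sketch of the same quoting trick, using three made-up files in a scratch directory, shows that each double-quoted line reaches cat as a single argument. GNU seq is assumed.

```shell
# Scratch directory with three awkwardly named sample files.
dir=$(mktemp -d) && cd "$dir" || exit 1
for i in 1 2 3; do
  printf 'part %s\n' "$i" > "./-It's a trickier filename - $i.pdb"
done

# Each generated line is wrapped in double quotes, so xargs treats it as one
# argument; the leading ./ keeps cat from reading the name as an option.
seq -f "\"./-It's a trickier filename - %.17g.pdb\"" 3 |
  xargs cat > file_all.pdb

cat file_all.pdb
```

Inside double quotes xargs takes the single quote literally, which is why the apostrophe in the name needs no further escaping here.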

A for loop is possible, and very simple.

for i in file_{1..15000}.pdb; do cat "$i" >> file_all.pdb; done

The downside is that you invoke cat a hell of a lot of times. But if you can't remember exactly how to do the stuff with find and the invocation overhead isn't too bad in your situation, then it's worth keeping in mind.
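A variation on the loop, as a sketch: redirecting the whole loop opens the output file once instead of once per iteration, and a plain counter avoids relying on brace expansion. The count n and the scratch files are made up for the demonstration.

```shell
# Scratch directory with made-up sample files.
dir=$(mktemp -d) && cd "$dir" || exit 1
n=12   # stand-in for 15000
i=1
while [ "$i" -le "$n" ]; do
  printf 'record %s\n' "$i" > "file_$i.pdb"
  i=$((i + 1))
done

# Still one cat per file, but a single redirection for the whole loop.
i=1
while [ "$i" -le "$n" ]; do
  cat "file_$i.pdb"
  i=$((i + 1))
done > file_all.pdb

wc -l < file_all.pdb
```

Note that > truncates file_all.pdb before the loop starts, whereas the >> form appends to whatever is already there.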

seq 1 15000 | awk '{print "file_"$0".pdb"}' | xargs cat > file_all.pdb

You shouldn't run into that error for only 15k files with that specific name format [1,2].

If you are running that expansion from another directory and you have to add the path to each file, the command line gets longer, and then the error can indeed occur.

Solution: run the command from that directory.

(cd That/Directory ; cat file_{1..2000}.pdb >> file_all.pdb )

Best solution: if instead I guessed wrong and you are running it from the directory in which the files are...
IMHO the best solution is Stéphane Chazelas's:

seq -f 'file_%.17g.pdb' 15000 | xargs cat > file_all.pdb

with printf or seq; tested on 15k files that contain only their own number, with the files pre-cached, it is even the fastest one (at present, and apart from the OP's command run from the directory in which the files are).

A few more words

Your shell should be able to accept longer command lines than that.
Your command line is 213914 characters long and contains 15003 words:
echo cat file_{1..15000}.pdb " > file_all.pdb" | wc

...even adding 8 bytes for each word gives 333 938 bytes (0.3M), far below the 2097142 (2.1M) reported by ARG_MAX on a 3.13.0 kernel, or the slightly smaller 2088232 reported as "Maximum length of command we could actually use" by xargs --show-limits.

On your system, have a look at the output of

getconf ARG_MAX
xargs --show-limits
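The estimate above can be reproduced without building the huge command line, as a rough sketch: seq writes one name per line, so the byte count from wc equals the length of the names plus one terminator each. The 8 bytes per argument is the same pointer-size assumption as above. The environment also counts against the limit and is ignored here, and the exact accounting depends on the kernel.

```shell
n=15000
# Bytes taken by the file names, counting one terminator per name.
bytes=$(seq -f 'file_%.17g.pdb' "$n" | wc -c)
# Add an assumed 8 bytes per argument for the pointer array.
total=$((bytes + 8 * n))
limit=$(getconf ARG_MAX)
echo "names: $bytes bytes, with pointers: $total bytes, ARG_MAX: $limit"
```

This counts only the file names, so it comes out slightly below the figure for the full command line.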

Laziness guided solution

In cases like this I prefer to work in blocks, partly because that usually turns out to be a time-efficient solution.
The logic (if any) is that I'm far too lazy to write 1..1000, 1001..2000 and so on,
so I ask a script to do it for me.
Only after I've checked that the output is correct do I redirect it to a script.

... but laziness is a state of mind.
Since I'm allergic to xargs (I really should have used xargs here) and I do not want to check how to use it, I invariably end up reinventing the wheel, as in the examples below (tl;dr).

Note that since the file names are controlled (no spaces, newlines...) you can easily get by with something like the script below.


Version 1: pass as optional parameters the first file number, the last file number, the block size and the output file

StartN=${1:-1}          # First file number
EndN=${2:-15000}        # Last file number
BlockN=${3:-100}        # files in a Block 
OutFile=${4:-"all.pdb"} # Output file name

CurrentStart=$StartN    # First file of the current block
for i in $(seq $StartN $BlockN $EndN)
do
  CurrentEnd=$i ;  
    cat $(seq -f file_%.17g.pdb $CurrentStart $CurrentEnd)  >> $OutFile;
  CurrentStart=$(( CurrentEnd + 1 )) 
done
# Here you may need to do a last iteration for the part cut from seq
[[ $EndN -ge $CurrentStart ]] && 
    cat $(seq -f file_%.17g.pdb $CurrentStart $EndN)  >> $OutFile;

Version 2

Calling bash for the expansion (about 20% slower in my tests).

StartN=${1:-1}          # First file number
EndN=${2:-15000}        # Last file number
BlockN=${3:-100}        # files in a Block 
OutFile=${4:-"all.pdb"} # Output file name

CurrentStart=$StartN    # First file of the current block
for i in $(seq $StartN $BlockN $EndN)
do
  CurrentEnd=$i ;
    echo  cat file_{$CurrentStart..$CurrentEnd}.pdb | /bin/bash  >> $OutFile;
  CurrentStart=$(( CurrentEnd + 1 )) 
done
# Here you may need to do a last iteration for the part cut from seq
[[ $EndN -ge $CurrentStart ]] && 
    echo  cat file_{$CurrentStart..$EndN}.pdb | /bin/bash  >> $OutFile;

Of course you can go further and get rid of seq [3] (from coreutils) completely and work directly with the variables in bash, or use python, or compile a C program to do it [4]...
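For comparison, the same block-wise idea can be sketched with xargs doing the batching: -n caps the number of file names passed to each cat. The parameter names mirror the scripts above. The scratch files and the small values are made up for the demonstration, and the file names are assumed to be well behaved.

```shell
# Scratch directory with made-up sample files.
dir=$(mktemp -d) && cd "$dir" || exit 1
StartN=1; EndN=25; BlockN=10; OutFile=all.pdb   # small stand-in values
for i in $(seq "$StartN" "$EndN"); do
  printf 'record %s\n' "$i" > "file_$i.pdb"
done

# xargs runs cat once per block of at most $BlockN names, in input order.
seq -f 'file_%.17g.pdb' "$StartN" "$EndN" |
  xargs -n "$BlockN" cat > "$OutFile"

wc -l < "$OutFile"
```

Because the redirection applies to xargs as a whole, the output file is opened once and the blocks land in it in sequence, with no leftover final block to handle by hand.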

    I have about 15,000 files that are named file_1.pdb, file_2.pdb, etc. I can cat about a few thousand of these in order by doing:

    cat file_{1..2000}.pdb >> file_all.pdb

    However, if I do this for 15,000 files, I get the error

    -bash: /bin/cat: Argument list too long

    I have seen this problem being solved by doing find . -name xx -exec xx but this wouldn't preserve the order with which the files are joined. How can I achieve this?

