Combining a large number of files
Solution 1
If you have root permissions on that machine you can temporarily increase the "maximum number of open file descriptors" limit:
ulimit -Hn 10240 # The hard limit
ulimit -Sn 10240 # The soft limit
And then
paste res.* >final.res
Afterwards you can set them back to the original values.
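To restore the original values you need to record them first; ulimit -Sn and ulimit -Hn with no argument print the current limits. A minimal sketch of the save-and-restore pattern (the 10240 figure is illustrative, and raising the hard limit needs root, so the raising steps are shown commented out):

```shell
# Record the current soft limit so it can be restored later
orig_soft=$(ulimit -Sn)
echo "original soft limit: $orig_soft"

# ulimit -Hn 10240          # raise the hard limit (root only)
# ulimit -Sn 10240          # raise the soft limit
# paste res.* > final.res   # do the merge while the limit is raised

ulimit -Sn "$orig_soft"     # put the soft limit back
```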
A second solution, if you cannot change the limit:
for f in res.*; do cat final.res | paste - $f >temp; cp temp final.res; done; rm temp
This calls paste once for each file; at the end there is one huge file with all the columns (it takes a while).
Edit: Useless use of cat... Not!
As mentioned in the comments, the use of cat here (cat final.res | paste - $f >temp) is not useless. The first time the loop runs, the file final.res doesn't exist yet; paste alone would then fail, and the file would never be created or filled. With this solution only cat fails the first time, with "No such file or directory", and paste just reads an empty stream from stdin and continues. The error can be ignored.
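As suggested in the comments, you can also avoid cat entirely by creating an empty final.res before the loop. Handling the first file specially additionally avoids the leading delimiter that pasting against an empty file would produce. A sketch, run on three small illustrative files in a scratch directory:

```shell
cd "$(mktemp -d)"                      # scratch directory for the demo
for i in 1 2 3; do printf '%s\n' "a$i" "b$i" > "res.$i"; done

: > final.res                          # pre-create, so nothing fails on pass one
first=1
for f in res.*; do
  if [ "$first" = 1 ]; then
    cp "$f" final.res                  # first file becomes the initial column
    first=0
  else
    paste final.res "$f" > temp && mv temp final.res
  fi
done
cat final.res
```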
Solution 2
If chaos' answer isn't applicable (because you don't have the required permissions), you can batch up the paste calls as follows:
ls -1 res.* | split -l 1000 -d - lists
for list in lists*; do paste $(cat $list) > merge${list##lists}; done
paste merge* > final.res
This lists the files 1000 at a time into files named lists00, lists01, etc., then pastes the corresponding res. files into files named merge00, merge01, etc., and finally merges all the resulting partially merged files.
As mentioned by chaos, you can increase the number of files used at once; the limit is the value given by ulimit -n, minus however many files you already have open, so you'd say
ls -1 res.* | split -l $(($(ulimit -n)-10)) -d - lists
to use the limit minus ten.
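Here is the same three-step batching run end-to-end on a handful of files, with an artificially small batch size of 3 so that both stages are exercised (the file names and counts are illustrative, and split -d assumes GNU split):

```shell
cd "$(mktemp -d)"                                 # scratch directory for the demo
for i in 1 2 3 4 5; do printf '%s\n' "r${i}a" "r${i}b" > "res.$i"; done

ls res.* | split -l 3 -d - lists                  # produces lists00, lists01
for list in lists*; do
  paste $(cat "$list") > "merge${list##lists}"    # produces merge00, merge01
done
paste merge* > final.res
head -n 1 final.res
```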
If your version of split doesn't support -d, you can remove it: all it does is tell split to use numeric suffixes. By default the suffixes are aa, ab, etc. instead of 00, 01, etc.
If there are so many files that ls -1 res.* fails ("argument list too long"), you can replace it with find, which avoids that error:
find . -maxdepth 1 -type f -name res.\* | split -l 1000 -d - lists
(As pointed out by don_crissti, -1 shouldn't be necessary when piping ls's output, but I'm leaving it in to handle cases where ls is aliased with -C.)
Solution 3
Try executing it this way:
ls res.*|xargs paste >final.res
(Note that if the file list exceeds a single command line, xargs will invoke paste more than once, and the later outputs will be appended as rows rather than columns.) You can also split the batch into parts and try something like:
paste `echo res.{1..100}` >final.100
paste `echo res.{101..200}` >final.200
...
and at the end combine the final files:
paste final.* >final.res
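A miniature version of the same batch-then-combine idea, pasting two files per batch (explicit file names are used here; the brace-expansion ranges above need a shell such as bash, and all names are illustrative):

```shell
cd "$(mktemp -d)"                # scratch directory for the demo
for i in 1 2 3 4; do printf '%s\n' "v$i" > "res.$i"; done

paste res.1 res.2 > final.002    # first batch
paste res.3 res.4 > final.004    # second batch
paste final.* > final.res        # combine the partial results
cat final.res
```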
Solution 4
Given the number of files, line sizes, etc. involved, I think it would exceed the default limits of the usual tools (awk, sed, paste, etc.).
I would create a small program for this. It neither keeps 10,000 files open at once, nor builds a line hundreds of thousands of characters long (10,000 files times 10 characters, the maximum line size in the example). It only needs an array of ~10,000 integers to store the number of bytes read so far from each file. The disadvantage is that it uses only one file descriptor, reopened for each file for each line, and this could be slow.
The definitions of FILES and ROWS should be changed to the actual exact values. The output is sent to the standard output.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FILES 10000 /* number of files */
#define ROWS 500    /* number of rows */

int main() {
    int positions[FILES + 1];
    FILE *file;
    int r, f;
    char filename[100];
    size_t linesize = 100;
    char *line = (char *) malloc(linesize * sizeof(char));

    for (f = 1; f <= FILES; positions[f++] = 0); /* set the initial positions to zero */

    for (r = 1; r <= ROWS; ++r) {
        for (f = 1; f <= FILES; ++f) {
            sprintf(filename, "res.%d", f);                  /* create the name of the current file */
            file = fopen(filename, "r");                     /* open the current file */
            fseek(file, positions[f], SEEK_SET);             /* seek to the saved position */
            positions[f] += getline(&line, &linesize, file); /* read a line and save the new position */
            line[strlen(line) - 1] = 0;                      /* remove the newline */
            printf("%s ", line);                             /* print it to standard output, plus a single space */
            fclose(file);                                    /* close the current file */
        }
        printf("\n"); /* after taking a line from each file, print a newline */
    }
}
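The same row-by-row, one-descriptor-at-a-time idea can be sketched in plain shell; it is even slower (each file is re-read from the start for every row via sed), but needs nothing compiled. The file names and the FILES/ROWS counts below are illustrative:

```shell
cd "$(mktemp -d)"   # scratch directory for the demo
FILES=3 ROWS=2
for i in 1 2 3; do printf '%s\n' "${i}a" "${i}b" > "res.$i"; done

r=1
while [ "$r" -le "$ROWS" ]; do
  row=
  f=1
  while [ "$f" -le "$FILES" ]; do
    line=$(sed -n "${r}p" "res.$f")   # fetch row r of file f
    row="$row$line "                  # space-separated, like the C version
    f=$((f+1))
  done
  printf '%s\n' "$row"
  r=$((r+1))
done > final.res
cat final.res
```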
Solution 5
i=0
{ paste res.? res.?? res.???
while paste ./res."$((i+=1))"[0-9][0-9][0-9]
do :; done; } >outfile
I don't think this is as complicated as all that - you've already done the hard work by ordering the filenames. Just don't open all of them at the same time, is all.
Another way:
pst() if shift "$1"
then paste "$@"
fi
set ./res.*
while [ -n "${1024}" ] ||
! paste "$@"
do pst "$(($#-1023))" "$@"
shift 1024
done >outfile
...but I think that does them backwards... This might work better:
i=0; echo 'while paste \'
until [ "$((i+=1))" -gt 1023 ] &&
printf '%s\n' '"${1024}"' \
do\ shift\ 1024 done
do echo '"${'"$i"'-/dev/null}" \'
done | sh -s -- ./res.* >outfile
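A plainer chunked variant of the same shift-through-the-argument-list idea, with a chunk size of 2 so it is easy to follow (in practice you would use a number near your open-file limit; the part.* names are made up for this sketch):

```shell
cd "$(mktemp -d)"   # scratch directory for the demo
for i in 1 2 3 4 5; do printf '%s\n' "x$i" > "res.$i"; done

set -- res.*
k=0
while [ "$#" -gt 0 ]; do
  k=$((k+1))
  if [ "$#" -ge 2 ]; then
    paste "$1" "$2" > "part.$k"   # paste one chunk of 2 files
    shift 2
  else
    paste "$1" > "part.$k"        # last, short chunk
    shift
  fi
done
paste part.* > outfile            # merge the partial results
cat outfile
```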
And here is yet another way:
tar --no-recursion -c ./ |
{ printf \\0; tr -s \\0; } |
cut -d '' -f-2,13 |
tr '\0\n' '\n\t' >outfile
That allows tar to gather all of the files into a null-delimited stream for you, parses out all of its header metadata but the filename, and transforms all lines in all files to tabs. It relies on the input being actual text files, though - meaning each ends with a newline and there are no null bytes in the files. Oh - and it also relies on the filenames themselves being newline-free (though that might be handled robustly with GNU tar's --xform option). Given these conditions are met, it should make very short work of any number of files - and tar will do almost all of it.
The result is a set of lines that look like:
./fname1
C1\tC2\tC3...
./fname2
C1\tC2\t...
And so on.
I tested it by first creating 5 test files. I didn't really feel like generating 10000 files just now, so I just went a little bigger for each - and also ensured that the file lengths differed by a great deal. This is important when testing tar scripts, because tar will block out input to fixed lengths - if you don't try at least a few different lengths you'll never know whether you'll actually handle only the one.
Anyway, for the test files I did:
for f in 1 2 3 4 5; do : >./"$f"
seq "${f}000" | tee -a [12345] >>"$f"
done
ls afterward reported:
ls -sh [12345]
68K 1 68K 2 56K 3 44K 4 24K 5
...then I ran...
tar --no-recursion -c ./ |
{ printf \\0; tr -s \\0; }|
cut -d '' -f-2,13 |
tr '\0\n' '\n\t' | cut -f-25
...just to show only the first 25 tab-delimited fields per line (because each file is a single line - there are a lot)...
The output was:
./1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
./2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
./3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
./4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
./5
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
mats
Updated on September 18, 2022

Comments
-
mats over 1 year
I have ±10,000 files (res.1 - res.10000), all consisting of one column and an equal number of rows. What I want is, in essence, simple: merge all files column-wise into a new file final.res. I have tried using:
paste res.*
However, although this seems to work for a small subset of result files, this gives the following error when performed on the whole set: Too many open files.
There must be an 'easy' way to get this done, but unfortunately I'm quite new to unix. Thanks in advance!
PS: To give you an idea of what (one of my) datafile(s) looks like:
0.5 0.5 0.03825 0.5 10211.0457 10227.8469 -5102.5228 0.0742 3.0944 ...
-
mats almost 9 years @Romeo Ninov This gives the same error as I mentioned in my initial question: Too many open files
-
Atul Vekariya almost 9 years @mats, in that case have you considered splitting the batch into parts? I will edit my answer to give you the idea
-
mats almost 9 yearsThanks! Any idea how I can check what the original values are?
-
chaos almost 9 years Just ulimit -Sn for the soft limit and ulimit -Hn for the hard limit
-
Atul Vekariya almost 9 years Right, @StephenKitt, I'll edit my answer
-
mats almost 9 years Thanks, this partially works. However, for another set of files I get the following error: -bash: /usr/bin/paste: Argument list too long. Ideas how to solve this? Sorry for bothering you guys.
-
chaos almost 9 years @mats It seems your kernel doesn't allow more arguments; you can check it with getconf ARG_MAX. You can only increase that value by recompiling the kernel. You may try my second solution?
-
mats almost 9 years getconf ARG_MAX gives: 262144. I'll try the 2nd solution too!
-
chaos almost 9 years @mats If all arguments in paste res.* ... together exceed ARG_MAX characters, the command cannot be executed. Use echo res.* | wc -c in that directory with many files to see approximately how many characters the arguments would take.
-
Toby Speight almost 9 yearsCongratulations, @chaos - have a Useless Use of Cat Award!
-
Toby Speight almost 9 years@chaos Thanks for the explanation - that wasn't obvious to me (and the error message could upset the user who wasn't expecting it). The alternative is to pre-create an empty file before the loop - two equally valid solutions.
-
Toby Speight almost 9 years To avoid the temporary files, consider making the final.x00 files be pipes - either as named FIFOs, or implicitly, using process substitution (if your shell supports it, e.g. bash). This isn't fun to write by hand, but may well suit a makefile.
-
mikeserv almost 9 years @TobySpeight - agreed. for f do paste ./f.res; done 3>./f.res
-
-
Barmar almost 9 years Instead of using cat every time through the loop, you could start by creating an empty final.res file. This is probably a good idea anyway, in case there's already a final.res file there.