Combining a large number of files

Solution 1

If you have root permissions on that machine you can temporarily increase the "maximum number of open file descriptors" limit:

ulimit -Hn 10240 # The hard limit
ulimit -Sn 10240 # The soft limit

And then

paste res.* >final.res

After that you can set it back to the original values.
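
For example, here is a minimal sketch of raising the limits and restoring them afterwards (the variable names are just for illustration):

old_soft=$(ulimit -Sn)   # remember the original soft limit
old_hard=$(ulimit -Hn)   # remember the original hard limit
ulimit -Hn 10240
ulimit -Sn 10240
paste res.* >final.res
ulimit -Sn "$old_soft"   # restore the original values
ulimit -Hn "$old_hard"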


A second solution, if you cannot change the limit:

for f in res.*; do cat final.res | paste - $f >temp; cp temp final.res; done; rm temp

It calls paste once for each file, and at the end there is a huge file with all columns (it takes a while).

Edit: Useless use of cat... Not!

As mentioned in the comments, the use of cat here (cat final.res | paste - $f >temp) is not useless. The first time the loop runs, the file final.res doesn't exist yet; paste on its own would then fail, and the file would never be created or filled. With my solution only cat fails the first time, with No such file or directory, while paste just reads an empty file from stdin and carries on. The error can be ignored.
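
If you'd rather not rely on that, a variant along the lines suggested in the comments below is to pre-create an empty final.res before the loop, for example:

: >final.res; for f in res.*; do paste final.res "$f" >temp; cp temp final.res; done; rm temp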

Solution 2

If chaos' answer isn't applicable (because you don't have the required permissions), you can batch up the paste calls as follows:

ls -1 res.* | split -l 1000 -d - lists
for list in lists*; do paste $(cat $list) > merge${list##lists}; done
paste merge* > final.res

This lists the files 1000 at a time in files named lists00, lists01 etc., then pastes the corresponding res. files into files named merge00, merge01 etc., and finally merges all the resulting partially merged files.

As mentioned by chaos, you can increase the number of files used at once; the limit is the value given by ulimit -n minus however many files you already have open, so you'd say

ls -1 res.* | split -l $(($(ulimit -n)-10)) -d - lists

to use the limit minus ten.
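
To see roughly how many descriptors your shell already has open (on Linux, assuming /proc is available), you can count the entries in its fd directory:

ls /proc/$$/fd | wc -l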

If your version of split doesn't support -d, you can remove it: all it does is tell split to use numeric suffixes. By default the suffixes will be aa, ab etc. instead of 01, 02 etc.

If there are so many files that ls -1 res.* fails ("argument list too long"), you can replace it with find which will avoid that error:

find . -maxdepth 1 -type f -name res.\* | split -l 1000 -d - lists

(As pointed out by don_crissti, -1 shouldn't be necessary when piping ls's output; but I'm leaving it in to handle cases where ls is aliased with -C.)

Solution 3

Try executing it this way:

ls res.*|xargs paste >final.res

You can also split the batch in parts and try something like:

paste `echo res.{1..100}` >final.100
paste `echo res.{101..200}` >final.200
...

and at the end combine the final files:

paste final.* >final.res
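
If you don't want to type out the ranges by hand, here is a rough sketch of the same batching as a loop, assuming GNU seq (for -f) and files named res.1 through res.10000:

for start in $(seq 1 100 10000); do
    end=$((start + 99))
    # zero-pad the batch name so the final glob expands in numeric order
    paste $(seq -f 'res.%g' "$start" "$end") >"final.$(printf '%05d' "$start")"
done
paste final.0* >final.res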

Solution 4

Given the number of files, line sizes, etc. involved, I think this will exceed the default limits of the usual tools (awk, sed, paste, the shell's *, etc.).

I would create a small program for this. It would neither keep 10,000 files open at once, nor build a line hundreds of thousands of characters long (10,000 files times roughly 10 characters, the maximum line length in the example). It only requires an array of about 10,000 integers to store the number of bytes already read from each file. The disadvantage is that it uses only one file descriptor, reused for each file and for each line, which could be slow.

The definitions of FILES and ROWS should be changed to the actual exact values. The output is sent to the standard output.

#define _POSIX_C_SOURCE 200809L /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FILES 10000 /* number of files */
#define ROWS 500    /* number of rows  */

int main() {
   int positions[FILES + 1];
   FILE *file;
   int r, f;
   char filename[100];
   size_t linesize = 100;
   char *line = (char *) malloc(linesize * sizeof(char));

   for (f = 1; f <= FILES; positions[f++] = 0); /* sets the initial positions to zero */

   for (r = 1; r <= ROWS; ++r) {
      for (f = 1; f <= FILES; ++f) {
         sprintf(filename, "res.%d", f);                  /* creates the name of the current file */
         file = fopen(filename, "r");                     /* opens the current file */
         fseek(file, positions[f], SEEK_SET);             /* set position from the saved one */
         positions[f] += getline(&line, &linesize, file); /* reads line and saves the new position */
         line[strlen(line) - 1] = 0;                      /* removes the newline */
         printf("%s ", line);                             /* prints in the standard ouput, and a single space */
         fclose(file);                                    /* closes the current file */
      }
      printf("\n");  /* after getting the line from each file, prints a new line to standard output */
   }
}
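
Assuming the source is saved as, say, merge.c (the name is just an example), compiling and running it could look like this:

cc -O2 -o merge merge.c
./merge >final.res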

Solution 5

i=0
{ paste res.? res.?? res.???
while paste ./res."$((i+=1))"[0-9][0-9][0-9]
do :; done; } >outfile

I don't think this is as complicated as all that - you've already done the hard work by ordering the filenames. Just don't open all of them at the same time, is all.

Another way:

pst()      if   shift "$1"
           then paste "$@"
           fi
set ./res.*
while  [ -n "${1024}" ] ||
     ! paste "$@"
do     pst "$(($#-1023))" "$@"
       shift 1024
done >outfile

...but I think that does them backwards... This might work better:

i=0;  echo 'while paste \'
until [ "$((i+=1))" -gt 1023 ] &&
      printf '%s\n' '"${1024}"' \
      do\ shift\ 1024 done
do    echo '"${'"$i"'-/dev/null}" \'
done | sh -s -- ./res.* >outfile

And here is yet another way:

tar --no-recursion -c ./ |
{ printf \\0; tr -s \\0; }    |
cut -d '' -f-2,13              |
tr '\0\n' '\n\t' >outfile

That allows tar to gather all of the files into a null-delimited stream for you, parses out all of its header metadata but the filename, and transforms all lines in all files to tabs. It relies on the input being actual text-files though - meaning each ends w/ a newline and there are no null-bytes in the files. Oh - and it also relies on the filenames themselves being newline-free (though that might be handled robustly with GNU tar's --xform option). Given these conditions are met, it should make very short work of any number of files - and tar will do almost all of it.

The result is a set of lines that look like:

./fname1
C1\tC2\tC3...
./fname2
C1\tC2\t...

And so on.

I tested it by first creating 5 testfiles. I didn't really feel like genning 10000 files just now, so I just went a little bigger for each - and also ensured that the file lengths differed by a great deal. This is important when testing tar scripts because tar will block out input to fixed lengths - if you don't try at least a few different lengths you'll never know whether you'll actually handle only the one.

Anyway, for the test files I did:

for f in 1 2 3 4 5; do : >./"$f"
seq "${f}000" | tee -a [12345] >>"$f"
done

ls afterward reported:

ls -sh [12345]
68K 1 68K 2 56K 3 44K 4 24K 5

...then I ran...

tar --no-recursion -c ./ |
{ printf \\0; tr -s \\0; }|
cut -d '' -f-2,13          |
tr '\0\n' '\n\t' | cut -f-25

...just to show only the first 25 tab-delimited fields per line (because each file is a single line - there are a lot)...

The output was:

./1
1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
./2
1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
./3
1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
./4
1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
./5
1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25

Comments

  • mats
    mats over 1 year

    I have ±10,000 files (res.1 - res.10000) all consisting of one column, and an equal number of rows. What I want is, in essence, simple; merge all files column-wise in a new file final.res. I have tried using:

    paste res.*

    However, although this seems to work for a small subset of result files, this gives the following error when performed on the whole set: Too many open files.

    There must be an 'easy' way to get this done, but unfortunately I'm quite new to unix. Thanks in advance!

    PS: To give you an idea of what (one of my) datafile(s) looks like:

    0.5
    0.5
    0.03825
    0.5
    10211.0457
    10227.8469
    -5102.5228
    0.0742
    3.0944
    ...
    
  • mats
    mats almost 9 years
    @Romeo Ninov This gives the same error as I mentioned in my initial question: Too many open files
  • Atul Vekariya
    Atul Vekariya almost 9 years
    @mats, in such a case, have you considered splitting the batch into parts? I will edit my answer to give you an idea.
  • mats
    mats almost 9 years
    Thanks! Any idea how I can check what the original values are?
  • chaos
    chaos almost 9 years
    Just ulimit -Sn for the soft limit and ulimit -Hn for the hard limit
  • Atul Vekariya
    Atul Vekariya almost 9 years
    Right, @StephenKitt, I edited my answer
  • mats
    mats almost 9 years
    Thanks, this partially works. However, for another set of files I get the following error: -bash: /usr/bin/paste: Argument list too long. Any ideas how to solve this? Sorry for bothering you guys.
  • chaos
    chaos almost 9 years
    @mats It seems your kernel doesn't allow more arguments; you can check it with getconf ARG_MAX. You can only increase that value by recompiling the kernel. You may want to try my second solution.
  • mats
    mats almost 9 years
    getconf ARG_MAX gives: 262144. I'll try the 2nd solution too!
  • chaos
    chaos almost 9 years
    @mats If all arguments in paste res.* ... are together more than ARG_MAX characters, the command cannot be executed. Use echo res.* | wc -c in that dir with many files to see how many characters the arguments would be approximately.
  • Toby Speight
    Toby Speight almost 9 years
    Congratulations, @chaos - have a Useless Use of Cat Award!
  • Toby Speight
    Toby Speight almost 9 years
    @chaos Thanks for the explanation - that wasn't obvious to me (and the error message could upset the user who wasn't expecting it). The alternative is to pre-create an empty file before the loop - two equally valid solutions.
  • Toby Speight
    Toby Speight almost 9 years
    To avoid the temporary files, consider making the intermediate final.x00 files pipes - either as named FIFOs, or implicitly, using process substitution (if your shell supports it - e.g. bash); see the sketch after this comment list. This isn't fun to write by hand, but may well suit a makefile.
  • mikeserv
    mikeserv almost 9 years
    @TobySpeight - agreed. for f do paste ./f.res; done 3>./f.res
  • Barmar
    Barmar almost 9 years
    Instead of using cat every time through the loop, you could start by creating an empty final.res file. This is probably a good idea any way, in case there's already a final.res file there.
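
For illustration, here is a tiny two-batch sketch of the process-substitution idea from Toby Speight's comment above (bash syntax; for the full set you would extend the pattern, or generate it, e.g. from a makefile):

paste <(paste res.{1..100}) <(paste res.{101..200}) >final.res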