How to extract one column from multiple files, and paste those columns into one file?

Solution 1

Here's one way using awk and a sorted glob of files:

awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)

Results:

1 8 a
2 9 b
3 10 c
4 11 d
5 12 e
6 13 f
7 14 g

Explanation:

  • For each line of input of each input file:

    • Use the file's line number (FNR) as the array key and store column 5 as its value.

    • (a[FNR] ? a[FNR] FS : "") is a ternary operation that builds up the array's value as a record. It checks whether the array already holds a non-empty value for this line number. If so, it takes that value followed by the field separator (FS) before appending the fifth column. Otherwise it prepends nothing, and the entry simply becomes the fifth column.

  • At the end of the script:

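    • Use a C-style loop to iterate through the array, printing each of the array's values. The loop bound FNR is the line count of the last file read, so this relies on all files having the same number of lines.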
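
As a quick check of the explanation above, here is a minimal sketch that recreates the first three lines of each sample file from the question in a scratch directory and runs the same awk program on them. The file names are passed explicitly instead of through ls -1v, and the variable names tmpdir and result are only for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# First three lines of each sample file from the question.
printf '%s\n' '1 1 20 20 1' '1 7 21 21 2' '3 1 22 22 3'    > sample_problem1_part1.txt
printf '%s\n' '1 1 88 88 8' '1 1 89 89 9' '2 1 90 90 10'   > sample_problem1_part2.txt
printf '%s\n' '1 4 330 30 a' '3 4 331 31 b' '1 4 332 32 c' > sample_problem2_part1.txt

# a[FNR] collects column 5 of line FNR from every file, separated by FS.
result=$(awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 }
              END { for (i = 1; i <= FNR; i++) print a[i] }' \
         sample_problem1_part1.txt sample_problem1_part2.txt sample_problem2_part1.txt)

printf '%s\n' "$result"
```

This should print the first three rows of the results shown above (1 8 a, 2 9 b, 3 10 c).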

Solution 2

Try this one. My script assumes that every file has the same number of lines.

# get the number of lines (reading from stdin keeps the file name out of wc's output)
lines=$(wc -l < sample_problem1_part1.txt)

for ((i=1; i<=lines; i++)); do
  for file in sample_problem*; do
    # print line number $i with everything up to the last space deleted,
    # i.e. only the last column, followed by a space and no newline
    printf '%s ' "$(sed -n "${i}s/.* //p" "$file")"
  done
  echo
done

This works. For 4800 files, each 7 lines long, it took 2 minutes 57.865 seconds on an AMD Athlon(tm) X2 Dual Core Processor BE-2400.

PS: The time for my script increases linearly with the number of lines, so merging files with 1000 lines would take a very long time. You should consider learning awk and using the script from Steve. I tested it: for 4800 files, each with 1000 lines, it took only 65 seconds!

Solution 3

For only ~4000 files, you should be able to do:

 find . -name 'sample_problem*_part*.txt' | xargs paste

If find is giving names in the wrong order, pipe it to sort:

 find . -name 'sample_problem*_part*.txt' | sort ... | xargs paste
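
Note that paste joins whole lines, so the fifth columns still have to be picked out afterwards. A minimal sketch of one way to do that, under several assumptions: every file has exactly five space-separated fields, GNU sort -V is available for numeric ordering of the names, and the file list is short enough for a single paste call and stays below the per-process open-files limit. The variable names tmpdir and result are only for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# First three lines of each sample file from the question.
printf '%s\n' '1 1 20 20 1' '1 7 21 21 2' '3 1 22 22 3'    > sample_problem1_part1.txt
printf '%s\n' '1 1 88 88 8' '1 1 89 89 9' '2 1 90 90 10'   > sample_problem1_part2.txt
printf '%s\n' '1 4 330 30 a' '3 4 331 31 b' '1 4 332 32 c' > sample_problem2_part1.txt

# Join the files side by side with spaces, then keep fields 5, 10, 15, ...
result=$(find . -name 'sample_problem*_part*.txt' | sort -V |
         xargs paste -d ' ' |
         awk '{ out = $5; for (i = 10; i <= NF; i += 5) out = out FS $i; print out }')

printf '%s\n' "$result"
```

If xargs has to split a very long file list across several paste invocations, the results are stacked instead of placed side by side, so this only suits a modest number of files.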

Solution 4

# print filenames in sorted order
find . -name 'sample*.txt' | sort |
# extract 5-th column from each file and print it on a single line
xargs -I{} sh -c '{ cut -s -d " " -f 5 "$0" | tr "\n" " "; echo; }' {} |
# transpose
python transpose.py '?'

where transpose.py:

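#!/usr/bin/env python
"""Write lines from stdin as columns to stdout."""
import sys
try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

# placeholder for missing cells; the first command-line argument overrides it
missing_value = sys.argv[1] if len(sys.argv) > 1 else '-'
for row in zip_longest(*[column.split() for column in sys.stdin],
                       fillvalue=missing_value):
    print(" ".join(row))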

Output

1 8 a
2 9 b
3 10 c
4 11 d
5 ? e
6 ? f
? ? g

This assumes the first and second files have fewer lines than the third one (missing values are replaced by '?').
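
The same padding can also be sketched in awk alone, which avoids the separate transpose step. This is a minimal sketch under assumptions, not part of the answer above: the array name cell, the counters nfiles and maxrow, and the variable fill are made up for illustration, the sample files are shortened stand-ins, and every cell is held in memory until the end.

```shell
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Stand-in files with 6, 4 and 7 lines; only column 5 carries real data.
printf 'x x x x %s\n' 1 2 3 4 5 6   > sample_problem1_part1.txt
printf 'x x x x %s\n' 8 9 10 11     > sample_problem1_part2.txt
printf 'x x x x %s\n' a b c d e f g > sample_problem2_part1.txt

result=$(awk -v fill='?' '
  FNR == 1 { nfiles++ }                  # a new input file has started
  {
    cell[FNR, nfiles] = $5               # remember column 5 by (line, file)
    if (FNR > maxrow) maxrow = FNR       # track the longest file
  }
  END {
    for (r = 1; r <= maxrow; r++) {
      line = ""
      for (f = 1; f <= nfiles; f++) {
        v = ((r, f) in cell) ? cell[r, f] : fill   # pad missing cells
        line = (f > 1 ? line FS : "") v
      }
      print line
    }
  }' sample_problem1_part1.txt sample_problem1_part2.txt sample_problem2_part1.txt)

printf '%s\n' "$result"
```

An empty input file never triggers FNR == 1, so it would be skipped rather than shown as a column of placeholders.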

Author: user1687130

Updated on July 08, 2022

Comments

  • user1687130
    user1687130 almost 2 years

    I want to extract the 5th column from multiple files, named in a numerical order, and paste those columns in sequence, side by side, into one output file.

    The file names look like:

    sample_problem1_part1.txt
    sample_problem1_part2.txt
    
    sample_problem2_part1.txt
    sample_problem2_part2.txt
    
    sample_problem3_part1.txt
    sample_problem3_part2.txt
    ......
    

    Each problem file (1,2,3...) has two parts (part1, part2). Each file has the same number of lines. The content looks like:

    sample_problem1_part1.txt
    1 1 20 20 1
    1 7 21 21 2
    3 1 22 22 3
    1 5 23 23 4
    6 1 24 24 5
    2 9 25 25 6
    1 0 26 26 7
    
    sample_problem1_part2.txt
    1 1 88 88 8
    1 1 89 89 9
    2 1 90 90 10
    1 3 91 91 11
    1 1 92 92 12
    7 1 93 93 13
    1 5 94 94 14
    
    sample_problem2_part1.txt
    1 4 330 30 a
    3 4 331 31 b
    1 4 332 32 c
    2 4 333 33 d
    1 4 334 34 e
    1 4 335 35 f
    9 4 336 36 g
    

    The output should look like: (in a sequence of problem1_part1, problem1_part2, problem2_part1, problem2_part2, problem3_part1, problem3_part2,etc.,)

    1 8 a ...
    2 9 b ...
    3 10 c ...
    4 11 d ...
    5 12 e ...
    6 13 f ...
    7 14 g ...
    

    I was using:

     paste sample_problem1_part1.txt sample_problem1_part2.txt > \
         sample_problem1_partall.txt
     paste sample_problem2_part1.txt sample_problem2_part2.txt > \
         sample_problem2_partall.txt
     paste sample_problem3_part1.txt sample_problem3_part2.txt > \
         sample_problem3_partall.txt
    

    And then:

    for i in `find . -name "sample_problem*_partall.txt"`
    do
        l=`echo $i | sed 's/sample_/extracted_col_/'`
        awk '{print $5, $10}' "$i" > "$l"
    done
    

    And:

    paste extracted_col_problem1_partall.txt \
          extracted_col_problem2_partall.txt \
          extracted_col_problem3_partall.txt > \
        extracted_col_problemall_partall.txt
    

    It works fine with a few files, but it's a crazy method when the number of files is large (over 4000). Could anyone help me with simpler solutions that are capable of dealing with multiple files, please? Thanks!

  • erik
    erik about 11 years
    No, that does not work. paste: ./sample_problem1_part2.txt78123: Too many open files
  • erik
    erik about 11 years
    »head -n 1 test | tr '\t' ' ' | sed 's%[0-9a-z]%%g' | wc -c« gave me only 480 columns. So paste (GNU coreutils) 8.15 seems to have some limitations.
  • erik
    erik about 11 years
    Wow, that is fast. It took only 0.430 seconds for the same 4800 files I used for my answer. I have never used awk, but now I want to. Could you explain your command a bit? I didn’t know ls -v. Nice way of sorting by version.
  • Steve
    Steve about 11 years
    @erik: I've added a quick explanation for you. Please let me know if something is not clear or poorly explained. HTH.
  • erik
    erik about 11 years
    Thank you steve. Very good explanation. I would like to vote up your answer a second time but it isn’t possible. ;-)
  • erik
    erik about 11 years
    31 seconds is not bad. But awk is shorter and less to write. Ok, you have the extra functionality to add »?« for missing lines.
  • William Pursell
    William Pursell about 11 years
    It's not a limitation of paste so much as a system limit. A process can only have so many files open at once, and that number is usually pretty small (~1024).
  • user1687130
    user1687130 about 11 years
    May I ask how I can write the output into a file instead of printing it on the screen, please? Thanks again!
  • erik
    erik about 11 years
    At the end of the script, append > output.txt to the last line. In this example the last line would be done > output.txt. But you would do better to use Steve's answer. It works the same way there: at the end of the line, write > output.txt to put the data into a file called output.txt.
  • user1687130
    user1687130 about 11 years
    Thanks Steve! That's impressive, and awk is doing magic here. I really appreciate it! I think the code is a little beyond me even though I've read the explanation several times. Any suggestions for learning awk for scripting beginners would be highly appreciated. Maybe a good start is to read awk questions posted by others on this site and practice. Thanks again! :)
  • Steve
    Steve about 11 years
    No problem. Glad I could help :-) I often recommend the Grymoire site (www.grymoire.com/Unix/Awk.html) for learning awk, but you may find bashsell.net a little easier. Certainly, answering questions on SO is also a great way to reinforce your learning. Also, don't forget to accept an answer by clicking on the tick next to your favorite one. Cheers!
  • ivivek_ngs
    ivivek_ngs over 8 years
    How can I get the output if the files have different numbers of rows? I would like to know whether it is possible to build the matrix when the files have different numbers of lines.