Combine multiple files with different count values

awk
177

Solution 1

you can try (if, I understood correctly)

awk '!($1 in d){d[$1]=$2; next}
     {d[$1]+=$2}
     END{for(key in d) print key, d[key]; }' *.sam

you get:

__too_low_aQual 3
mir-671 19
mir-8 2
__not_aligned 1
Y_RNA 4
mir-10 5

Solution 2

this script will do what you need

 t0=mktemp; touch t0; 
 for f in prefix*.csv; 
     do paste t0 <(cut -d" " -f2 $f) > t1 && mv t1 t0; 
     done; 
 tr '\t' ' ' <t0 && rm t0

use cut/paste to collect second columns in a temp file; when done remove the temp file after printing results.

To preserve the first column change touch t0 to cut -d" " -f1 oneofthefiles.csv > t0

Alternatively, awk to the rescue!

awk '
    {a[FNR]=a[FNR]?a[FNR] OFS $2:$1} 
 END{for(i=1;i<=FNR;i++) print a[i]}
    ' prefix*.csv

combine second field from all files based on the line number, preserve the first field from first file.

Share:
177

Related videos on Youtube

BioMan
Author by

BioMan

Updated on November 29, 2022

Comments

  • BioMan
    BioMan over 1 year

    I would like to combine 96 files by taking the second column from each files and keep the first column which is similar between all files. I tried to do this in R, but figued it would be better in the terminal. Does it work using awk?

    Sample data:

    DMED7013:Rfam robinm$ head Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput402R.sam
    Seq_../trimmed/402R.tally.fasta __not_aligned
    __too_low_aQual 3
    mir-10 5
    Y_RNA 4
    __too_low_aQual 0
    __too_low_aQual 0
    __not_aligned 1
    mir-8 2
    mir-671 3
    mir-671 16
    

    The files:

    DMED7013:Rfam robinm$ ls -l  
    -rw-r--r--   1 robinm  staff  1711388 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput100G.sam
    -rw-r--r--   1 robinm  staff  1712778 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput100R.sam
    -rw-r--r--   1 robinm  staff  1709703 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput106G.sam
    -rw-r--r--   1 robinm  staff  1707486 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput106R.sam
    -rw-r--r--   1 robinm  staff  1704757 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput122G.sam
    -rw-r--r--   1 robinm  staff  1705471 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput122R.sam
    .....
    
    • sat
      sat over 8 years
      where is 1st and 2nd column? sample data?
    • Jose Ricardo Bustos M.
      Jose Ricardo Bustos M. over 8 years
      which is your expected output?
    • BioMan
      BioMan over 8 years
      Column 2 is the counts. Columns 1 is __too_low_aQual, mir-10 etc....There is no header for column 2
    • karakfa
      karakfa over 8 years
      Are the number of lines and the order the same in all files? Since the first field is not unique it's not possible to join them 1 to 1 otherwise.
    • BioMan
      BioMan over 8 years
      yes, number of lines and the order are the same
    • Ed Morton
      Ed Morton over 8 years
      With every question it helps us to see testable sample input and expected output. If you want us to help you combine columns from multiple files then you need to provide multiple input files so that would mean at least 2, right? Maybe 3 would be better/necessary to show various use cases? Then you'd provide the one output file that you'd want to see given those input files. THEN we can start working on helping you solve your problem because before that we're just guessing about several aspects of what you want and we can't test a solution. Get rid of the directory listing, it's irrelevant.
  • George Udosen
    George Udosen over 7 years
    Does this answer the question ?