Combine multiple files with different count values

awk

177

Solution 1

you can try (if, I understood correctly)

awk '!($1 in d){d[$1]=$2; next}
     {d[$1]+=$2}
     END{for(key in d) print key, d[key]; }' *.sam

you get:

__too_low_aQual 3
mir-671 19
mir-8 2
__not_aligned 1
Y_RNA 4
mir-10 5

Solution 2

this script will do what you need

 t0=mktemp; touch t0; 
 for f in prefix*.csv; 
     do paste t0 <(cut -d" " -f2 $f) > t1 && mv t1 t0; 
     done; 
 tr '\t' ' ' <t0 && rm t0

use cut/paste to collect second columns in a temp file; when done remove the temp file after printing results.

To preserve the first column change touch t0 to cut -d" " -f1 oneofthefiles.csv > t0

Alternatively, awk to the rescue!

awk '
    {a[FNR]=a[FNR]?a[FNR] OFS $2:$1} 
 END{for(i=1;i<=FNR;i++) print a[i]}
    ' prefix*.csv

combine second field from all files based on the line number, preserve the first field from first file.

177

BioMan

Updated on November 29, 2022

Comments

BioMan over 1 year
I would like to combine 96 files by taking the second column from each files and keep the first column which is similar between all files. I tried to do this in R, but figued it would be better in the terminal. Does it work using awk?

Sample data:
```
DMED7013:Rfam robinm$ head Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput402R.sam
Seq_../trimmed/402R.tally.fasta __not_aligned
__too_low_aQual 3
mir-10 5
Y_RNA 4
__too_low_aQual 0
__too_low_aQual 0
__not_aligned 1
mir-8 2
mir-671 3
mir-671 16
```
The files:
```
DMED7013:Rfam robinm$ ls -l  
-rw-r--r--   1 robinm  staff  1711388 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput100G.sam
-rw-r--r--   1 robinm  staff  1712778 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput100R.sam
-rw-r--r--   1 robinm  staff  1709703 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput106G.sam
-rw-r--r--   1 robinm  staff  1707486 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput106R.sam
-rw-r--r--   1 robinm  staff  1704757 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput122G.sam
-rw-r--r--   1 robinm  staff  1705471 Sep 22 19:12 Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput122R.sam
.....
```
- sat over 8 years
  
  where is 1st and 2nd column? sample data?
- Jose Ricardo Bustos M. over 8 years
  
  which is your expected output?
- BioMan over 8 years
  
  Column 2 is the counts. Columns 1 is __too_low_aQual, mir-10 etc....There is no header for column 2
- karakfa over 8 years
  
  Are the number of lines and the order the same in all files? Since the first field is not unique it's not possible to join them 1 to 1 otherwise.
- BioMan over 8 years
  
  yes, number of lines and the order are the same
- Ed Morton over 8 years
  
  With every question it helps us to see testable sample input and expected output. If you want us to help you combine columns from multiple files then you need to provide multiple input files so that would mean at least 2, right? Maybe 3 would be better/necessary to show various use cases? Then you'd provide the one output file that you'd want to see given those input files. THEN we can start working on helping you solve your problem because before that we're just guessing about several aspects of what you want and we can't test a solution. Get rid of the directory listing, it's irrelevant.
George Udosen over 7 years

Does this answer the question ?