Combine two files with awk

Solution 1

The answer below is based on a similar Q&A on Stack Overflow, with some relevant modifications:

$ awk 'FNR==NR {dict[$1]=$2; next} {$2=($2 in dict) ? dict[$2] : $2}1' file2.txt file1.txt 
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE

The idea is to read file2.txt first (the FNR==NR block), build an associative array indexed by its first column, and then use that array as a dictionary to translate the second column of file1.txt.
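
For readability, here is the same logic spelled out as a commented awk script (a sketch; the behavior is identical to the one-liner):

# combine.awk -- usage: awk -f combine.awk file2.txt file1.txt
FNR == NR {              # true only while reading the first file (file2.txt)
    dict[$1] = $2        # e.g. dict["carA"] = "platA"
    next                 # done with this line; skip the block below
}
{                        # reached only for lines of file1.txt
    if ($2 in dict)      # translate column 2 when a mapping exists
        $2 = dict[$2]
    print                # the bare "1" at the end of the one-liner does this implicitly
}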

For the second question you asked in your comment (what should be changed if the relevant column of file1.txt is the sixth instead of the second):

If the input file looks like file1b.txt:

item1 A5 B C D carA
item2 A4 1 2 3 carB
item3 A3 2 3 4 carC
item4 A2 4 5 6 platD
item5 A1 7 8 9 carE

The following command will do it:

$ awk 'FNR==NR {dict[$1]=$2; next} {$2=($6 in dict) ? dict[$6] : $6;$3="";$4="";$5="";$6=""}1' file2.txt file1b.txt 
item1 platA    
item2 platB    
item3 platC    
item4 platD    
item5 platE    
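
Note that blanking $3 through $6 leaves their field separators behind, which is why the lines above end in trailing spaces. If that matters, a variation that prints only the two wanted columns (a sketch of the same dictionary idea) should avoid it:

$ awk 'FNR==NR {dict[$1]=$2; next} {print $1, (($6 in dict) ? dict[$6] : $6)}' file2.txt file1b.txt
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE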

Solution 2

I know you said awk, but there is a join command for this purpose...

{
  join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)     
  join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt) 
} | sort -k 1

The first join command alone would be sufficient if it weren't for this line:

item4   platD

The first command says: join based on the second column of the first file (-1 2) and the first column of the second file (-2 1), and output the first column of the first file and the second column of the second file (-o 1.1,2.2). That shows only the lines that paired. The second command says almost the same thing, except that it shows the lines from the first file that couldn't be paired (-v 1) and outputs the first and second columns of the first file (-o 1.1,1.2). Then we sort the combined output. sort -k 1 means sort on the first column, and sort -k 2 means sort on the second; it's important to sort each file on its join column before passing it to join.
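
To make this concrete, with the example files from the question, the first join alone should print the four lines that paired, and the second join the one that didn't:

$ join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
item1 platA
item2 platB
item3 platC
item5 platE
$ join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
item4 platD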

Now, I wrote the sorting twice because I don't like to litter my directories with files if I can help it. However, as David Foerster said, depending on the size of the files you might want to sort them once and save the results, to avoid having to sort each file twice.
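
A sketch of that variant, with hypothetical names for the intermediate files, so that each input is sorted only once:

sort -k 2 File1.txt > File1.sorted    # each input sorted on its join column once
sort -k 1 File2.txt > File2.sorted
{
  join -o 1.1,2.2 -1 2 -2 1 File1.sorted File2.sorted
  join -v 1 -o 1.1,1.2 -1 2 -2 1 File1.sorted File2.sorted
} | sort -k 1
rm File1.sorted File2.sorted          # remove the temporary files afterwards

To give an idea of the sizes involved, here is the time it takes to sort 1 million and 10 million lines on my computer: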

$ ruby -e '(1..1000000).each {|i| puts "item#{i}   plat#{i}"}' | shuf > 1million.txt 
$ ruby -e '(1..10000000).each {|i| puts "item#{i}   plat#{i}"}' | shuf > 10million.txt 
$ head 10million.txt 
item530284   plat530284
item7946579   plat7946579
item1521735   plat1521735
item9762844   plat9762844
item2289811   plat2289811
item6878181   plat6878181
item7957075   plat7957075
item2527811   plat2527811
item5940907   plat5940907
item3289494   plat3289494
$ TIMEFORMAT=%E
$ time sort 1million.txt >/dev/null
1.547
$ time sort 10million.txt >/dev/null
19.187

That's 1.5 seconds for 1 million lines, and 19 seconds for 10 million lines.

Comments

  • pawana over 1 year

    File1.txt

    item1   carA
    item2   carB
    item3   carC
    item4   platD
    item5   carE
    

    File2.txt

    carA  platA
    carB  platB
    carC  platC
    carE  platE
    

    Wanted output:

    item1   platA
    item2   platB
    item3   platC
    item4   platD
    item5   platE
    

    How can I do it?

  • Yaron about 6 years
    @pawana - I've updated my answer to also solve the second question from your comment. If it answers your question, please accept it.
  • David Foerster about 6 years
    In this case it would be better to store the sorted input data in (temporary) intermediate files, because sorting takes quite a long time for non-trivially sized data sets. Otherwise +1.
  • JoL about 6 years
    @David It's a good point. Personally, I really dislike having to create intermediate files, but I'm also impatient with long running processes. I wondered what "trivially sized" would be, and so I made a small benchmark, and added it to the answer along with your suggestion.
  • David Foerster about 6 years
    Sorting 1 million records is fast enough on reasonably modern desktop computers. With 2 or 3 orders of magnitude more, things start to become interesting. In any case, elapsed (real) time (the %E in the time format) is less interesting for measuring computational performance. User-mode CPU time (%U, or simply an unset TIMEFORMAT variable) would be much more meaningful.
  • JoL about 6 years
    @David I'm not really familiar with the use cases for the different times. Why is it more interesting? Elapsed time is what coincides with the time that I'm actually waiting. For the 1.5 second command, I'm getting 4.5 seconds with %U.
  • David Foerster about 6 years
    Elapsed time is affected by time spent waiting on other tasks running on the same system and blocking I/O requests. (User) CPU time is not. Usually when comparing the speed of computationally bound algorithms one wants to disregard I/O and avoid measurements errors due to other background tasks. The important question is "How much computation does this algorithm require on that data set?" instead of "How much time did my computer spend on all its tasks while it waited for that computation to complete?"