Combine two files with awk
Solution 1
The answer below is based on a similar Q&A on Stack Overflow, with some relevant modifications:
$ awk 'FNR==NR {dict[$1]=$2; next} {$2=($2 in dict) ? dict[$2] : $2}1' file2.txt file1.txt
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE
The idea is to read file2.txt into an associative array and use it as a dictionary for the lookups.
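As a self-contained check, the sample files from the question can be recreated and the command run end-to-end:

```shell
# Recreate the sample files (contents taken from the question).
cat > file1.txt <<'EOF'
item1 carA
item2 carB
item3 carC
item4 platD
item5 carE
EOF

cat > file2.txt <<'EOF'
carA platA
carB platB
carC platC
carE platE
EOF

# First pass (FNR==NR is true only for the first file) loads file2.txt
# into the dict array; the second pass replaces $2 of file1.txt when a
# mapping exists, and the trailing 1 prints every (possibly modified) line.
awk 'FNR==NR {dict[$1]=$2; next} {$2=($2 in dict) ? dict[$2] : $2}1' file2.txt file1.txt
```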
For the second question you asked in your comment (what should change if the second column of file1.txt becomes the sixth column):
If the input file looks like file1b.txt:
item1 A5 B C D carA
item2 A4 1 2 3 carB
item3 A3 2 3 4 carC
item4 A2 4 5 6 platD
item5 A1 7 8 9 carE
The following command will do it:
$ awk 'FNR==NR {dict[$1]=$2; next} {$2=($6 in dict) ? dict[$6] : $6;$3="";$4="";$5="";$6=""}1' file2.txt file1b.txt
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE
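One subtlety: assigning empty strings to $3-$6 leaves their field separators behind, so each output line ends with trailing spaces. A minimal variant that prints only the two wanted fields sidesteps this (sample data recreated from the question):

```shell
# Recreate the sample files (contents taken from the question).
cat > file1b.txt <<'EOF'
item1 A5 B C D carA
item2 A4 1 2 3 carB
item3 A3 2 3 4 carC
item4 A2 4 5 6 platD
item5 A1 7 8 9 carE
EOF

cat > file2.txt <<'EOF'
carA platA
carB platB
carC platC
carE platE
EOF

# Print only the first field and the mapped sixth field; no fields are
# blanked, so no stray separators are left in the output.
awk 'FNR==NR {dict[$1]=$2; next} {print $1, (($6 in dict) ? dict[$6] : $6)}' file2.txt file1b.txt
```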
Solution 2
I know you said awk, but there is a join command for this purpose...
{
join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
} | sort -k 1
The first join command alone would suffice if it weren't for this line:
item4 platD
The command basically says: join based on the second column of the first file (-1 2) and the first column of the second file (-2 1), and output the first column of the first file and the second column of the second file (-o 1.1,2.2). That shows only the lines that paired. The second join command says almost the same thing, but it shows the lines from the first file that couldn't be paired (-v 1), and outputs the first column and the second column of the first file (-o 1.1,1.2). Then we sort the combined output. sort -k 1 means sort based on the first column, and sort -k 2 means sort based on the second. It's important to sort the files on the join column before passing them to join.
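Putting the pieces together with the question's sample data (recreated below; the process substitutions require bash):

```shell
# Recreate the sample files (contents taken from the question).
cat > File1.txt <<'EOF'
item1 carA
item2 carB
item3 carC
item4 platD
item5 carE
EOF

cat > File2.txt <<'EOF'
carA platA
carB platB
carC platC
carE platE
EOF

{
  # Paired lines: look up column 2 of File1.txt in column 1 of File2.txt
  join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
  # Unpaired lines of File1.txt (here: item4 platD), kept as-is
  join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
} | sort -k 1
```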
Now, I wrote the sorting twice, because I don't like to litter my directories with files if I can help it. However, as David Foerster said, depending on the size of the files, you might want to sort the files and save them first, so you don't have to wait for each to be sorted twice. To give an idea of sizes, here is the time it takes to sort 1 million and 10 million lines on my computer:
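A sketch of that intermediate-file approach, using mktemp so the directory isn't littered with named files (sample data recreated from the question):

```shell
# Recreate the sample files (contents taken from the question).
cat > File1.txt <<'EOF'
item1 carA
item2 carB
item3 carC
item4 platD
item5 carE
EOF

cat > File2.txt <<'EOF'
carA platA
carB platB
carC platC
carE platE
EOF

# Sort each input once and reuse the sorted copies for both join passes.
tmp1=$(mktemp); tmp2=$(mktemp)
sort -k 2 File1.txt > "$tmp1"
sort -k 1 File2.txt > "$tmp2"
{
  join -o 1.1,2.2 -1 2 -2 1 "$tmp1" "$tmp2"    # paired lines
  join -v 1 -o 1.1,1.2 -1 2 -2 1 "$tmp1" "$tmp2"  # unpaired lines from File1.txt
} | sort -k 1
rm -f "$tmp1" "$tmp2"
```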
$ ruby -e '(1..1000000).each {|i| puts "item#{i} plat#{i}"}' | shuf > 1million.txt
$ ruby -e '(1..10000000).each {|i| puts "item#{i} plat#{i}"}' | shuf > 10million.txt
$ head 10million.txt
item530284 plat530284
item7946579 plat7946579
item1521735 plat1521735
item9762844 plat9762844
item2289811 plat2289811
item6878181 plat6878181
item7957075 plat7957075
item2527811 plat2527811
item5940907 plat5940907
item3289494 plat3289494
$ TIMEFORMAT=%E
$ time sort 1million.txt >/dev/null
1.547
$ time sort 10million.txt >/dev/null
19.187
That's 1.5 seconds for 1 million lines, and 19 seconds for 10 million lines.
pawana
Updated on September 18, 2022
Comments
-
pawana over 1 year
File1.txt
item1 carA
item2 carB
item3 carC
item4 platD
item5 carE
File2.txt
carA platA
carB platB
carC platC
carE platE
Wanted output:
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE
How can I do it?
-
Yaron about 6 years@pawana - I've updated my answer to also solve the second question in your comment. If I've answered your question, please accept it.
-
David Foerster about 6 yearsIn this case it would be better to store the sorted input data in (temporary) intermediate files, because sorting takes quite a while for non-trivially sized data sets. Otherwise +1.
-
JoL about 6 years@David It's a good point. Personally, I really dislike having to create intermediate files, but I'm also impatient with long running processes. I wondered what "trivially sized" would be, and so I made a small benchmark, and added it to the answer along with your suggestion.
-
David Foerster about 6 yearsSorting 1 million records is fast enough on reasonably modern desktop computers. With 2 or 3 orders of magnitude more, things start to become interesting. In any case, elapsed (real) time (the %E in the time format) is less interesting for measuring computational performance. User-mode CPU time (%U, or simply an unset TIMEFORMAT variable) would be much more meaningful.
-
JoL about 6 years@David I'm not really familiar with the use cases for the different times. Why is it more interesting? Elapsed time is what coincides with the time that I'm actually waiting. For the 1.5-second command, I'm getting 4.5 seconds with %U.
-
David Foerster about 6 yearsElapsed time is affected by time spent waiting on other tasks running on the same system and blocking I/O requests. (User) CPU time is not. Usually when comparing the speed of computationally bound algorithms one wants to disregard I/O and avoid measurements errors due to other background tasks. The important question is "How much computation does this algorithm require on that data set?" instead of "How much time did my computer spend on all its tasks while it waited for that computation to complete?"