How to remove the lines which appear on file B from another file A?
Solution 1
If the files are sorted (they are in your example):
comm -23 file1 file2
-23 suppresses the lines that are in both files (-3) or only in file2 (-2). If the files are not sorted, pipe them through sort first.
See the man page here
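For unsorted inputs, the sort step can be sketched as below. The working directory, file names, and sample lines are made up for illustration, and temporary files are used instead of process substitution so the sketch also runs in a plain POSIX shell. Note that the result comes out in sorted order, not in the original order of file1.

```shell
# Hypothetical working directory and sample data, for illustration only.
workdir=$(mktemp -d)
printf 'C\nA\nB\n' > "$workdir/file1"   # unsorted input
printf 'E\nB\nD\n' > "$workdir/file2"   # unsorted input

# comm expects sorted input, so sort both files first.
sort "$workdir/file1" > "$workdir/file1.sorted"
sort "$workdir/file2" > "$workdir/file2.sorted"

# -2 drops lines found only in file2, -3 drops lines common to both;
# what remains is the lines found only in file1.
comm -23 "$workdir/file1.sorted" "$workdir/file2.sorted" > "$workdir/result"
cat "$workdir/result"
```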
Solution 2
grep -Fvxf <lines-to-remove> <all-lines>
- works on non-sorted files (unlike comm)
- maintains the order
- is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
- -F: use literal strings instead of the default BRE
- -x: only consider matches that match the entire line
- -v: print non-matching lines
- -f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash function for in-place operation:
remove-lines() (
remove_lines="$1"
all_lines="$2"
tmp_file="$(mktemp)"
grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
mv "$tmp_file" "$all_lines"
)
usage:
remove-lines lines-to-remove remove-from-this-file
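One pitfall with whole-line matching, reported in a comment below, is mismatched line endings: a pattern file saved with Windows (CRLF) endings will not match lines that end in a plain LF. A minimal sketch of the problem and one possible cleanup, using made-up file names and data (dos2unix, mentioned in that comment, would do the same job as tr here):

```shell
# Hypothetical working directory and sample data, for illustration only.
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/A"   # Unix line endings
printf 'b\r\n' > "$workdir/B"       # Windows line endings

# Without cleanup, the pattern is "b" plus a carriage return, which is
# not the same string as "b", so nothing is removed.
grep -Fvxf "$workdir/B" "$workdir/A" > "$workdir/naive"

# Strip carriage returns from the pattern file, then filter again.
tr -d '\r' < "$workdir/B" > "$workdir/B.unix"
grep -Fvxf "$workdir/B.unix" "$workdir/A" > "$workdir/cleaned"
cat "$workdir/cleaned"
```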
Solution 3
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
The NR==FNR{a[$0];next} idiom stores the first file in an associative array as keys for a later "contains" test.
NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the current file's line counter (FNR).
a[$0] adds the current line to the associative array as a key. Note that this behaves like a set: there won't be any duplicate keys.
!($0 in a) applies once we're in the next file(s). in is a "contains" test; here it checks whether the current line is in the set we populated from the first file, and ! negates the condition. What is missing here is the action, which by default is {print} and is usually not written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
With a slight change, it can clean multiple lists and create cleaned versions.
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > FILENAME".clean"}' bad file1 file2 file3 ...
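As a comment below points out, the NR==FNR test misfires when the first file is empty: awk moves straight on to the second file, where NR and FNR are still equal, so every line is stored and nothing is printed. One possible guard, sketched here with made-up file names and data, is to compare FILENAME with the first argument instead:

```shell
# Hypothetical working directory and sample data, for illustration only.
workdir=$(mktemp -d)
printf 'A\nB\nC\n' > "$workdir/fileA"
: > "$workdir/fileB"   # empty list of lines to remove

# NR==FNR is true for every line of fileA here, because the empty
# fileB contributes no records; the output is empty.
awk 'NR==FNR{a[$0];next} !($0 in a)' "$workdir/fileB" "$workdir/fileA" > "$workdir/naive"

# Testing FILENAME against ARGV[1] only stores lines read from fileB,
# so fileA is printed unchanged when fileB is empty.
awk 'FILENAME==ARGV[1]{a[$0];next} !($0 in a)' "$workdir/fileB" "$workdir/fileA" > "$workdir/guarded"
cat "$workdir/guarded"
```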
Solution 4
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
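A sketch of the same idea without process substitution, so it also runs in a plain POSIX shell. The file names and data are made up, and two assumptions are worth stating: lines contain no whitespace (join compares only the first field by default, which suits one email address per line), and sort and join run under the same locale so they agree on ordering (LC_ALL=C is used here as one way to ensure that).

```shell
# Hypothetical working directory and sample data, for illustration only.
workdir=$(mktemp -d)
printf 'C\nA\nB\n' > "$workdir/fileA"
printf 'E\nB\nD\n' > "$workdir/fileB"

# Sort both inputs under the same locale that join will use.
LC_ALL=C sort "$workdir/fileA" > "$workdir/fileA.sorted"
LC_ALL=C sort "$workdir/fileB" > "$workdir/fileB.sorted"

# -v 1 prints the lines of the first file that have no match in the second.
LC_ALL=C join -v 1 "$workdir/fileA.sorted" "$workdir/fileB.sorted" > "$workdir/result"
cat "$workdir/result"
```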
Solution 5
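You can do this with diff whether or not your files are sorted, although diff compares lines in order, so a line that appears in both files at different relative positions may still be printed. Write the result to a temporary file first; redirecting straight to file-a would truncate it before diff reads it.
diff --new-line-format="" --old-line-format="%L" --unchanged-line-format="" file-a file-b > file-a.tmp
mv file-a.tmp file-a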
- --new-line-format is for lines that are in file b but not in a
- --old-line-format is for lines that are in file a but not in b
- --unchanged-line-format is for lines that are in both
- %L makes it so the line is printed exactly
See man diff for more details.
slhck
Updated on December 21, 2021
Comments
-
slhck over 2 years
I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.
Which command would I use to remove all the addresses that appear in file B from the file A.
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now I know this is a question that might have been asked more often, but I only found one command online that gave me an error with a bad delimiter.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not the shell expert.
-
tripleee over 9 years
Possible duplicate of Deleting lines from one file which are in another file
-
tripleee over 9 years
Most of the answers here are for sorted files, and the most obvious one is missing, which of course isn't your fault, but that makes the other one more generally useful.
-
Admin almost 10 years
comm -23 file1 file2 > file3 will output contents in file1 not in file2, to file3. And then mv file3 file1 would finally clear redundant contents in file1.
-
Carlos Macasaet almost 9 years
You say this will work unless the files are sorted. What problems occur if they are sorted? What if they are partially sorted?
-
twobob over 8 years
Full marks on this. To use this on the command line in GnuWin32 in Windows replace the single nibbles with double quotes. Works a treat. Many thanks.
-
twobob over 8 years
This is tougher to use in a corner-case cross platform scenario than the other one liner. However hats off for the performance effort
-
Admin about 8 years
That was in response to the solution above that suggested usage of the comm command. comm requires the files to be sorted, so if they are sorted you can use that solution as well. You can use this solution regardless of whether the file is sorted or not, though.
-
Anand Builders over 7 years
This works, but how will I be able to redirect the output to fileA in the form of A (with a new line) B
-
karakfa over 7 years
I guess you mean A\nC. Write to a temp file first and overwrite the original file: ... > tmp && mv tmp fileA
-
Socowi over 6 years
Alternatively, use comm -23 file1 file2 | sponge file1. No cleanup needed.
-
MitchellK about 5 years
Full marks on this from me too. This awk takes all of 1 second to process a file with 104,000 entries :+1:
-
Felix Rabe about 5 years
Man page link is not loading for me – alternative: linux.die.net/man/1/comm
-
Felix Rabe about 5 years
@Socowi What is sponge? I don't have that on my system. (macos 10.13)
-
Alexander Aleksandrovič Klimov about 5 years
@FelixRabe, well, that's tiresome. Replaced with your link. Thanks
-
Socowi about 5 years
@FelixRabe sponge is a program that fully consumes stdin before writing it to a file. On Linux it is usually installed from a package called moreutils.
-
Peter Nowee over 4 years
When using this in scripts, make sure to first check that fileB is not empty (0 bytes long), because if it is, you will get an empty result instead of the expected contents of fileA. (Cause: FNR==NR will apply to fileA then.)
-
Alexander Aleksandrovič Klimov about 4 years
All of these were already given in other answers. Your grep one needs a -F, or you'll get odd results when the lines look like regexps
-
4b0 about 3 years
It's good practice on StackOverflow to add an explanation as to why your solution should work.
-
tripleee about 3 years
This doesn't really add anything over the accepted answer, except perhaps the tangential tip on how to use a process substitution to sort files which aren't already sorted.
-
Tomme almost 3 years
Did not work for me at all :-( All duplicate lines are still present in the output.
-
Alexander Aleksandrovič Klimov almost 3 years
@Jeroen-bartEngelen, did you sort the files first? It certainly works (comm has been around for 40+ years...)
-
Tomme almost 3 years
@TheArchetypalPaul I figured it out. It was line-endings. It's always line-endings in Linux :-) I edited and sorted both files on my Windows desktop, but for some reason the line-endings were saved differently. Dos2unix helped.