How to delete all occurrences of a list of words from a text file?
Solution 1
For your actual use case I recommend terdon's answer using Perl.
However, the simple version, without handling words that are substrings of other words (e.g. removing "king" from "hiking"), is to use one Sed command to generate the command run by a different Sed instance on your actual file.
In this case, with wordfile containing "king" and "queen" and textfile containing your text:
sed -e "$(sed 's:.*:s/&//ig:' wordfile)" textfile
Note that the "i" ("ignore case") flag is a GNU extension, not standard.
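To make the two-step idea concrete, here is a minimal sketch that writes the generated commands to a file and passes it to sed with -f instead of embedding it in the command line. The file name script.sed and the scratch directory are assumptions made for illustration, not part of the original answer. With a very long word list, the generated script may be too long to pass as a single command-line argument, which is one reason to prefer a script file.

```shell
# Work in a scratch directory so the example does not touch real files.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' king queen > wordfile
printf '%s\n' 'Both the king and queen are monarchs. Queen, it is!' > textfile

# Step 1: turn every line of wordfile into a substitution command.
# script.sed is a hypothetical name chosen for this sketch.
sed 's:.*:s/&//ig:' wordfile > script.sed
cat script.sed
# s/king//ig
# s/queen//ig

# Step 2: run the generated script on the text.
# The "i" flag requires GNU sed.
result=$(sed -f script.sed textfile)
printf '%s\n' "$result"
```

As in the one-liner, words are treated as regular expressions and substrings are matched too, so this is only suitable for word lists without special characters.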
Solution 2
The simple but inefficient way is to process the file multiple times, once for each input word:
$ while read w; do sed -i "s/$w//ig" file2 ; done < file1
$ cat file2
Both the and are monarchs. Will the live? , it is!
That can be very slow for large files though (and also matches substrings). You could do it in a single pass with Perl:
perl -lpe 'BEGIN{open(A,"file1"); chomp(@k = <A>)}
for $w (@k){s/\b\Q$w\E\b//ig}' file2
The \b make sure we only match on word boundaries, and \Q\E make sure $w is taken literally. This will stop the script from matching hiking, but it will still match high-king. To avoid that, you need to explicitly list the characters that define a word:
perl -Mopen=locale -Mutf8 -lpe '
BEGIN{open(A,"file1"); chomp(@k = <A>)}
for $w (@k){s/(^|[ ,.—_;-])\Q$w\E([ ,.—_;-]|$)/$1$2/ig}' file2
That — non-ASCII character above needs to be entered in UTF-8 encoding, as we're telling perl the code is written in UTF-8 with -Mutf8. We're using -Mopen=locale for the content of the files and stdout to be decoded/encoded in the locale's character set.
Admin
Updated on September 18, 2022
Comments
-
Admin over 1 year
I have a file containing a list of words. I want to remove all occurrences of all the words in this file from a big text file.
Example:
File 1
queen
king
Text file sample
Both the king and queen are monarchs. Will the queen live? Queen, it is!
This is what I have tried:
sed -i 's/queen/ /g' page.txt
sed -i 's/Queen/ /g' page.txt
Output
Both the and are monarchs. Will the live? , it is!
The list of words I have is big (over 50000 words). How can I do this without having to specify the pattern in the command line?
-
Admin over 7 years
What have you tried so far? Where did you get stuck? See unix.stackexchange.com/q/112023/135943 if you're just starting out. (Show some effort and you're more likely to get help; there is a wealth of information on this site already, very easily searchable.)
-
Admin over 7 years
So you need this to be i) case insensitive and ii) ignore punctuation (queen, matches queen)? How about substrings? Should king match hiking? Or high-king?
-
Admin over 7 years
@terdon Just full string matches. Punctuation is ignored.
-
Admin over 7 years
@Wildcard I have been able to use sed to remove all occurrences of a single word specified in the command line. I am not sure how to do that if there are multiple words, to be read from a file.
-
Admin over 7 years
Excellent. Please edit your question and add the command you've used and how it fails in this case.
-
Admin over 7 years
As for "How can I do this without having to specify the pattern in the command line?": sed allows expressions to be supplied from a file via its -f option. You could even write sed "programs" by prepending #!/usr/bin/sed -f to such a file and giving it execute permissions. However, with 50000 expressions I agree that sed seems to be an inappropriate choice.
-
Serge Stroobandt over 7 years
The sed solution to this problem is rarely found on the web. Feels like finding gold dust…