How to delete all occurrences of a list of words from a text file?

text-processing sed grep text

23,991

Solution 1

For your actual use case I recommend terdon's answer using Perl.

However, the simple version, which does not protect words that merely contain a listed word (e.g. it will also strip "king" out of "hiking"), is to use one Sed command to generate the script that a second Sed instance then runs on your actual file.

In this case, with wordfile containing "king" and "queen" and textfile containing your text:

sed -e "$(sed 's:.*:s/&//ig:' wordfile)" textfile

Note that the i ("ignore case") flag of the s command is a GNU extension, not part of standard (POSIX) sed.
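
To make the two steps visible, here is a minimal sketch using the sample data from the question. It assumes GNU sed; remove.sed is a hypothetical file name chosen for illustration. Instead of passing the generated script on the command line, it stores it in a file and runs it with -f, which seems the safer variant for a 50000-word list, since a script that large may exceed the operating system's limit on the size of a single command-line argument.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

# Sample data from the question (file names as used in this answer).
printf '%s\n' queen king > wordfile
printf '%s\n' 'Both the king and queen are monarchs. Will the queen live? Queen, it is!' > textfile

# Step 1: the inner sed turns every line of wordfile into an s command.
sed 's:.*:s/&//ig:' wordfile
# s/queen//ig
# s/king//ig

# Step 2: store the generated script and run it with -f
# (remove.sed is a hypothetical name; the i flag needs GNU sed).
sed 's:.*:s/&//ig:' wordfile > remove.sed
sed -f remove.sed textfile
# Both the  and  are monarchs. Will the  live? , it is!
```

As with the one-liner, words containing / or regex metacharacters would need escaping before they could be used this way.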

Solution 2

The simple but inefficient way is to process the file multiple times, once for each input word:

$ while IFS= read -r w; do sed -i "s/$w//ig" file2 ; done < file1
$ cat file2
Both the  and  are monarchs. Will the  live? , it is!

That can be very slow for large files though (and also matches substrings). You could do it in a single pass with Perl:

perl -lpe 'BEGIN{open(A,"file1"); chomp(@k = <A>)} 
                 for $w (@k){s/\b\Q$w\E\b//ig}' file2

The \b anchors make sure we only match on word boundaries, and \Q...\E makes sure $w is taken literally. This will stop the script from matching hiking, but it will still match high-king, because - is not a word character. To avoid that, you need to explicitly list the characters that may delimit a word:

perl -Mopen=locale -Mutf8 -lpe '
  BEGIN{open(A,"file1"); chomp(@k = <A>)}
  # - is deliberately not listed, so that high-king is left alone
  for $w (@k){s/(^|[ ,.—_;])\Q$w\E([ ,.—_;]|$)/$1$2/ig}' file2

That — non-ASCII character above needs to be entered in UTF-8 encoding, as we're telling perl the code is written in UTF-8 with -Mutf8. We're using -Mopen=locale for the content of the files and stdout to be decoded/encoded in the locale's character set.


Admin

Updated on September 18, 2022

Comments

Admin over 1 year
I have a file containing a list of words. I want to remove all occurrences of all the words in this file from a big text file.

Example:

File 1
```
queen
king
```
Text file sample
```
Both the king and queen are monarchs. Will the queen live? Queen, it is!
```
This is what I have tried:
```
sed -i 's/queen/ /g' page.txt
sed -i 's/Queen/ /g' page.txt
```
Output
```
Both the and are monarchs. Will the live? , it is!
```
The list of words I have is big (over 50000 words). How can I do this without having to specify the pattern in the command line?
- Admin over 7 years
  
  What have you tried so far? Where did you get stuck? See unix.stackexchange.com/q/112023/135943 if you're just starting out. (Show some effort and you're more likely to get help; there is a wealth of information on this site already, very easily searchable.)
- Admin over 7 years
  
  So you need this to be i) case insensitive and ii) ignore punctuation (queen, matches queen)? How about substrings? Should king match hiking? Or high-king?
- Admin over 7 years
  
  @terdon Just full string matches. punctuation is ignored.
- Admin over 7 years
  
  @Wildcard I have been able to use sed to remove all occurrences of a single word specified in the command line. I am not sure how to do that if there are multiple words, to be read from a file.
- Admin over 7 years
  
  Excellent. Please edit your question and add the command you've used and how it fails in this case.
- Admin over 7 years
  
  As for "How can I do this without having to specify the pattern in the command line?": sed can read its expressions from a file via the -f option. You could even write sed "programs" by putting #!/usr/bin/sed -f on the first line of such a file and giving it execute permissions. However, with 50000 expressions I agree that sed seems to be an inappropriate choice.
Serge Stroobandt over 7 years

The sed solution to this problem is rarely found on the web. — Feels like finding gold dust…