How to delete all occurrences of a list of words from a text file?


Solution 1

For your actual use case I recommend terdon's answer using Perl.

However, the simple version, which does not guard against words that are substrings of other words (e.g. it removes "king" from "hiking"), is to use one sed command to generate the script that a second sed instance then runs on your actual file.

In this case, with wordfile containing "king" and "queen" and textfile containing your text:

sed -e "$(sed 's:.*:s/&//ig:' wordfile)" textfile

Note that the "ignore case" flag is a GNU extension, not standard.
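As a rough sketch of the same idea, the generated commands can be written to a script file, inspected, and then run with sed -f. This variant also wraps each word in \b so that "hiking" is left alone. It assumes GNU sed (both \b and the I flag are GNU extensions) and words that contain no "/" or regex metacharacters. The file name remove.sed and the sample text are made up for illustration.

```shell
# Scratch directory with a sample word list and text (illustrative data).
workdir=$(mktemp -d)
printf '%s\n' queen king > "$workdir/wordfile"
printf '%s\n' 'Both the king and queen are monarchs. Queen, keep hiking!' > "$workdir/textfile"

# Turn every word into one substitution command, e.g. s/\bking\b//Ig
# (\\b in the replacement yields a literal \b in the generated script).
sed 's:.*:s/\\b&\\b//Ig:' "$workdir/wordfile" > "$workdir/remove.sed"

# Inspect the generated script, then run it against the text.
cat "$workdir/remove.sed"
result=$(sed -f "$workdir/remove.sed" "$workdir/textfile")
printf '%s\n' "$result"
```

Keeping the generated script in a file makes it easy to check before anything is modified. With a very long word list, sed still applies every substitution to every line, so this may be slow.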

Solution 2

The simple but inefficient way is to process the file multiple times, once for each input word:

$ while read w; do sed -i "s/$w//ig" file2 ; done < file1
$ cat file2
Both the  and  are monarchs. Will the  live? , it is!

That can be very slow for large files though (and also matches substrings). You could do it in a single pass with Perl:

perl -lpe 'BEGIN{open(A,"file1"); chomp(@k = <A>)} 
                 for $w (@k){s/\b\Q$w\E\b//ig}' file2 

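The \b anchors make sure we only match at word boundaries, and \Q...\E makes sure $w is taken literally. This stops the script from matching hiking, but it will still match high-king, because - is not a word character. To avoid that, you need to explicitly list the characters that delimit a word (every character in the list counts as a delimiter, including the trailing - in the example below, so drop it if high-king must survive):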

perl -Mopen=locale -Mutf8 -lpe '
  BEGIN{open(A,"file1"); chomp(@k = <A>)} 
  for $w (@k){s/(^|[ ,.—_;-])\Q$w\E([ ,.—_;-]|$)/$1$2/ig}' file2 

That non-ASCII character above needs to be entered in UTF-8 encoding, as we're telling perl the code is written in UTF-8 with -Mutf8. We're using -Mopen=locale for the content of the files and stdout to be decoded/encoded in the locale's character set.

Admin
Author by

Admin

Updated on September 18, 2022

Comments

  • Admin
    Admin over 1 year

    I have a file containing a list of words. I want to remove all occurrences of all the words in this file from a big text file.

    Example:

    File 1

    queen
    king
    

    Text file sample

    Both the king and queen are monarchs. Will the queen live? Queen, it is!
    

    This is what I have tried:

    sed -i 's/queen/ /g' page.txt
    sed -i 's/Queen/ /g' page.txt
    

    Output

    Both the and are monarchs. Will the live? , it is!
    

    The list of words I have is big (over 50000 words). How can I do this without having to specify the pattern in the command line?

    • Admin
      Admin over 7 years
      What have you tried so far? Where did you get stuck? See unix.stackexchange.com/q/112023/135943 if you're just starting out. (Show some effort and you're more likely to get help; there is a wealth of information on this site already, very easily searchable.)
    • Admin
      Admin over 7 years
      So you need this to be i) case insensitive and ii) ignore punctuation (queen, matches queen)? How about substrings? Should king match hiking? Or high-king?
    • Admin
      Admin over 7 years
      @terdon Just full string matches. punctuation is ignored.
    • Admin
      Admin over 7 years
      @Wildcard I have been able to use sed to remove all occurrences of a single word specified in the command line. I am not sure how to do that if there are multiple words, to be read from a file.
    • Admin
      Admin over 7 years
      Excellent. Please edit your question and add the command you've used and how it fails in this case.
    • Admin
      Admin over 7 years
      As for "How can I do this without having to specify the pattern in the command line?": sed can read its expressions from a file via the -f option. You could even write sed "programs" by putting #!/usr/bin/sed -f on the first line of such a file and giving it execute permission. However, with 50000 expressions I agree that sed seems to be an inappropriate choice.
  • Serge Stroobandt
    Serge Stroobandt over 7 years
    The sed solution to this problem is rarely found on the web. — Feels like finding gold dust…