Find all occurrences of a string in a file and print their line numbers in Perl

Solution 1

Here is a solution that matches every occurrence of all keywords:

#!/usr/bin/perl
use strict;
use warnings;

# Lexical filehandles are preferred; always check open for errors.
open my $keywords,    '<', 'keywords.txt' or die "Can't open keywords: $!";
open my $search_file, '<', 'search.txt'   or die "Can't open search file: $!";

my $keyword_or = join '|', map {chomp;qr/\Q$_\E/} <$keywords>;
my $regex = qr|\b($keyword_or)\b|;

while (<$search_file>)
{
    while (/$regex/g)
    {
        print "$.: $1\n";
    }
}

keywords.txt:

hello
foo
bar

search.txt:

plonk
food is good
this line doesn't match anything
bar bar bar
hello world
lalalala
hello everyone

Output:

4: bar
4: bar
4: bar
5: hello
7: hello

Explanation:

This creates a single regex that matches all of the keywords in the keywords file.

<$keywords> - when this is used in list context, it returns a list of all lines of the file.

map {chomp;qr/\Q$_\E/} - this removes the newline from each line and compiles the line with the \Q...\E quote-literal escape. This ensures that a keyword like "foo.bar" treats the dot as a literal character, not as a regex metacharacter.

join '|', - join the resulting list into a single string, separated by pipe characters.

my $regex = qr|\b($keyword_or)\b|; - create a regex that behaves like this one (each alternative is actually a compiled qr// fragment, but the effect is the same):

/\b(\Qhello\E|\Qfoo\E|\Qbar\E)\b/

This regex will match any of your keywords. \b is the word boundary marker, ensuring that only whole words match: food no longer matches foo. The parentheses capture the specific keyword that matched in $1. This is how the output prints the keyword that matched.
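To see the effect of the escaping and of \b in isolation, here is a minimal sketch that builds the same kind of alternation from an in-memory list rather than from keywords.txt. The subs build_keyword_regex and find_matches are hypothetical helpers, not part of the solution above, and quotemeta is used as the function form of \Q...\E.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build one whole-word regex from a list of keywords.
sub build_keyword_regex {
    my @keywords = @_;
    # quotemeta is the function form of \Q...\E: it escapes metacharacters.
    my $alternation = join '|', map { quotemeta($_) } @keywords;
    return qr/\b($alternation)\b/;
}

# Hypothetical helper: return "line_number: keyword" for every match.
sub find_matches {
    my ($regex, @lines) = @_;
    my @found;
    for my $i (0 .. $#lines) {
        # /g in scalar context resumes after the previous match on this line.
        while ($lines[$i] =~ /$regex/g) {
            push @found, ($i + 1) . ": $1";
        }
    }
    return @found;
}

my $regex = build_keyword_regex('foo', 'a.b');
print "$_\n" for find_matches($regex, 'food is good', 'foo and a.b', 'axb');
```

With these sample lines, only line 2 produces matches: "food" is rejected by the word boundary, and "axb" is rejected because the dot in "a.b" is escaped.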

I updated the solution to match each keyword on a given line and to only match complete words.

Solution 2

Is this part of something bigger? If not, this is a one-liner with grep:

grep -n hello filewithlotsalines.txt

grep -n "hello world" filewithlotsalines.txt

-n makes grep print the line number before each matching line. See man grep for more options.

I am assuming here that you are on Linux or another Unix-like system.

Solution 3

I have a different interpretation of your request. It seems that you may want to maintain a list of line numbers where certain entries from a lookup table are found on lines of a 'keyword' file. Here's a sample lookup table:

hello world
hello
perl
hash
Test
script

And a tab-delimited 'keyword' file, where multiple keywords may be found on a single line:

programming tests
hello   everyone
hello   hello world perl
scripting   scalar
test    perl    script
hello world perl    script  hash

Given the above, consider the following solution:

use strict;
use warnings;

my %lookupTable;

print "Enter the file path of lookup table: \n";
chomp( my $lookupTableFile = <STDIN> );    # read from STDIN even if arguments are given

print "Enter the file path that contains keywords: \n";
chomp( my $keywordsFile = <STDIN> );

open my $ltFH, '<', $lookupTableFile or die $!;

while (<$ltFH>) {
    chomp;
    $lookupTable{$_} = [];    # start each entry with an empty list
}

close $ltFH;

open my $kfFH, '<', $keywordsFile or die $!;

while (<$kfFH>) {
    chomp;
    for my $keyword ( split /\t+/ ) {
        push @{ $lookupTable{$keyword} }, $. if defined $lookupTable{$keyword};
    }
}

close $kfFH;

open my $slFH, '>', 'SampleLineNum.txt' or die $!;

print $slFH "$_: @{ $lookupTable{$_} }\n"
  for sort { lc $a cmp lc $b } keys %lookupTable;

close $slFH;

print "Done!\n";

Output to SampleLineNum.txt:

hash: 6
hello: 2 3
hello world: 3 6
perl: 3 5 6
script: 5 6
Test: 

The script uses a hash of arrays (HoA), where each key is an entry from the lookup table and the associated value is a reference to a list of the line numbers on which that entry was found in the 'keyword' file. Each key in %lookupTable is initialized with a reference to an empty list.

Each line of the 'keyword' file is split on the delimiting tab, and if a field has a corresponding entry in %lookupTable, the line number is pushed onto that entry's list. When done, the %lookupTable keys are sorted case-insensitively and written out to SampleLineNum.txt, along with the list of line numbers where each entry was found, if any.
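The hash-of-arrays bookkeeping can be tried without any files. The following is only a sketch under that assumption: index_keywords is a hypothetical helper that takes the lookup entries and the tab-delimited lines as array references, and the sample data is a shortened version of the files shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map each lookup entry to the line numbers where it
# appears as a complete tab-separated field.
sub index_keywords {
    my ($entries, $lines) = @_;    # two array references
    my %line_numbers;
    $line_numbers{$_} = [] for @$entries;    # every entry starts empty

    my $line_no = 0;
    for my $line (@$lines) {
        $line_no++;
        for my $field (split /\t+/, $line) {
            # Whole-field hash lookup, so "hello" never matches "hello world".
            push @{ $line_numbers{$field} }, $line_no
                if exists $line_numbers{$field};
        }
    }
    return \%line_numbers;
}

my $index = index_keywords(
    [ 'hello world', 'hello', 'Test' ],
    [ "hello\teveryone", "hello\thello world\tperl", "test\tperl" ],
);
print "$_: @{ $index->{$_} }\n" for sort { lc $a cmp lc $b } keys %$index;
```

Because the comparison is a hash lookup on the complete field, matching is exact and case-sensitive, which is why "Test" stays empty even though "test" appears in the data.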

There are no sanity checks on the file names entered, so consider adding those.
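One possible shape for such checks, as a sketch only: input_file_problem is a hypothetical helper that uses Perl's built-in file test operators (-e, -f, -r) and returns a description of the first problem it finds, or undef if the path looks usable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return an error message, or undef if the path looks usable.
sub input_file_problem {
    my ($path) = @_;
    return 'no file name given'          unless defined $path && length $path;
    return "'$path' does not exist"      unless -e $path;
    return "'$path' is not a plain file" unless -f _;    # _ reuses the last stat
    return "'$path' is not readable"     unless -r _;
    return;    # no problem found
}

# The path below is an assumed example of a file that does not exist.
my $problem = input_file_problem('no/such/file.txt');
print defined $problem ? "Problem: $problem\n" : "OK\n";
```

Calling this once for each entered path before the open calls lets the script report a clearer message than the bare $! from die.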

Hope this helps!

Author: Sishanth

Updated on August 04, 2020

Comments

  • Sishanth
    Sishanth over 3 years

    I have a large file of 400,000 lines; each line contains many keywords separated by tabs.

    I also have a file that contains the list of keywords to be matched. This file acts as a lookup table.

    For each keyword in the lookup table, I need to find all of its occurrences in the large file and print the line number of each occurrence.

    I have tried this

    #!usr/bin/perl
    use strict;
    use warnings;
    
    my $linenum = 0;
    
    print "Enter the file path of lookup table:";
    my $filepath1 = <>;
    
    print "Enter the file path that contains keywords :";
    my $filepath2 = <>;
    
    open( FILE1, "< $filepath1" );
    open FILE2, "< $filepath2" ;
    
    open OUT, ">", "SampleLineNum.txt";
    
    while( $line = <FILE1> )
    {
        while( <FILE2> ) 
        {
            $linenum = $., last if(/$line/);
        }
        print OUT "$linenum ";
    }
    
    close FILE1;
    

    This gives only the first occurrence of each keyword, but I need all occurrences, and the keyword should match exactly.

    The problem I am facing with exact matching is this: suppose I have the keywords "hello" and "hello world".

    If I search for "hello", the script also returns the line numbers of lines that contain "hello world". It should match only "hello" and give its line number.

  • Sishanth
    Sishanth over 11 years
    Can you please give me more explanation?
  • Karthik T
    Karthik T over 11 years
    @Sishanth You can see an example with grep
  • dan1111
    dan1111 over 11 years
    This is fine for a single keyword, but the OP wanted to match a whole list of keywords from a file.
  • dan1111
    dan1111 over 11 years
    @KarthikT, fair enough. But once you add a loop and the logic to get the keywords from a file, the grep solution won't be any shorter than a Perl solution.
  • mpe
    mpe over 11 years
    @dan1111: Wrong. grep -n -f keywords.txt filewithlotsalines.txt takes the keywords from a file to search the big file.