Unix script to search within a compressed .gz file

The essence of how to accomplish this is to get the names of the files within the tarball to search over, and extract their content to be searched, while not extracting anything else. Because we don't want to write to the file system, we can use the -O flag to extract to standard output instead.

This pipeline:

tar -tzf file.tar.gz | grep '\.txt$' | xargs tar -Oxzf file.tar.gz | grep -B 3 "string-or-regex"

will concatenate all of the files in the .tar.gz whose names end in ".txt", and grep them for the given string, also outputting the 3 previous lines. It won't tell you which file in the tarball any match came from, and the "three previous lines" may in fact come from the previous file. Member names containing whitespace will also be split apart by xargs.

You can instead do:

tar -tzf file.tar.gz | grep '\.txt$' | while IFS= read -r file; do
    tar -Oxzf file.tar.gz "$file" | grep -B 3 --label="$file" -H "string-or-regex"
done

which will respect file boundaries and report the file names, but is much less efficient, because the archive is decompressed and scanned again for every matching file.

(-z tells tar the archive is gzip-compressed. -t lists the contents. -x extracts. -O writes extracted data to standard output rather than the file system. Older tars may not have the -O or -z flag, and may want the flags without the leading -, e.g. tar tzf file.tar.gz)
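If a tar lacks -z, one common workaround is to decompress with gzip and let tar read the archive from standard input. The following is a minimal sketch under that assumption; the scratch directory and the names demo.tar.gz, notes.txt and data.csv are made up for the demonstration, and a tar without -O would still need a different approach.

```shell
# Build a tiny demo archive in a scratch directory (hypothetical file names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'alpha\nbeta\n' > notes.txt
printf 'gamma\n' > data.csv
tar -czf demo.tar.gz notes.txt data.csv
rm notes.txt data.csv

# List the members without tar's -z: gzip decompresses, tar reads from stdin ("-").
listing=$(gzip -dc demo.tar.gz | tar tf -)
echo "$listing"

# Extract a single member to standard output the same way (this still relies on -O).
content=$(gzip -dc demo.tar.gz | tar xOf - notes.txt)
echo "$content"
```

Nothing is written to disk by the last two commands apart from the demo archive itself, which matters when the real archive is too large to unpack.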

Okay, so your grep doesn't support -B. We can fix that with awk!

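#!/usr/bin/awk -f
# Print every line matching /pattern/ together with the `context` lines before it.
# The buffer holds context + 1 lines: the match itself plus the preceding ones.
BEGIN { context = 3; size = context + 1 }
{ add_buffer($0) }
/pattern/ { print_buffer() }
# Store the current line in a circular buffer indexed by line number.
function add_buffer(line)
{
    buffer[NR % size] = line
}
# Print the buffered window, oldest line first (i is declared local).
function print_buffer(    i)
{
    for (i = max(1, NR - context); i <= NR; i++) {
        print buffer[i % size]
    }
}
function max(a, b)
{
    if (a > b) { return a } else { return b }
}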

This will not coalesce adjacent matches, unlike grep -B, and can thus repeat lines that are within 3 lines of two different matches.
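To tie the two ideas together, here is a rough sketch that runs the same rolling-buffer technique once per archive member, so context never crosses file boundaries and each line is prefixed with the member name. It assumes a POSIX awk that accepts -v; the variable names context, pattern and label, and the demo files a.txt and b.txt, are illustrative and not part of the original answer.

```shell
# Build a tiny demo archive in a scratch directory (hypothetical file names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\ntwo\nthree\nfour\nneedle here\nsix\n' > a.txt
printf 'needle first\nother\n' > b.txt
tar -czf demo.tar.gz a.txt b.txt
rm a.txt b.txt

# Rolling buffer: keep the last (context + 1) lines, print them when a line matches.
prog='
{ buf[NR % (context + 1)] = $0 }
$0 ~ pattern {
    start = NR - context; if (start < 1) start = 1
    for (i = start; i <= NR; i++) print label ":" buf[i % (context + 1)]
}'

# One awk run per .txt member, so NR and the buffer restart for every file.
result=$(
    tar -tzf demo.tar.gz | grep '\.txt$' | while IFS= read -r file; do
        tar -Oxzf demo.tar.gz "$file" |
            awk -v context=3 -v pattern='needle' -v label="$file" "$prog"
    done
)
echo "$result"
```

As with the grep loop, the archive is decompressed once per member, so this trades speed for correct boundaries and file names.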

Author: CFUser

Updated on June 05, 2022

Comments

  • CFUser
    CFUser almost 2 years

    I want to get a few lines from a file which is in a compressed .gz file.

    The .gz file contains many txt files. I want to search for a string in all of these txt files and get the previous 3 lines as output, including the current line (where the search string is present).

    I tried zgrep and got the line number, but when I use the head or tail command it gives some garbage values. I think we cannot use the head or tail commands with compressed files containing multiple files.

    Please suggest a simple way to do this, if there is one.

  • CFUser
    CFUser over 13 years
    Yes, it's a gzip of a tar file. I cannot extract it because it contains huge files and I would run into disk space problems.
  • wnoise
    wnoise over 13 years
    Does it support -C? Is it a problem to get 3 lines after as well?
  • CFUser
    CFUser over 13 years
    Unfortunately, no -C either :(
  • SourceSeeker
    SourceSeeker over 13 years
    @CFUser: Without -B support in grep, you'll have to use awk, sed or Perl to hold a moving window of lines which are output when your match is found. GNU tar supports --wildcards, which makes the first tar|grep in each of the versions unnecessary. Other versions of tar may or may not support globbing and may or may not require a switch to enable it.
  • Conrad Meyer
    Conrad Meyer over 13 years
    As long as you want GNU tar, why not just install GNU tar and GNU grep and use gtar/ggrep? But in general, I like the awk answer =).
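The --wildcards suggestion from the comments can be sketched as follows. This assumes GNU tar specifically (other tars may not accept the option), and the demo archive and file names are invented for illustration. Like the first pipeline in the answer, it concatenates all selected members, so matches are not attributed to a file.

```shell
# Build a tiny demo archive in a scratch directory (hypothetical file names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'first\nneedle in a\n' > a.txt
printf 'needle in log\n' > c.log
printf 'needle in b\n' > b.txt
tar -czf demo.tar.gz a.txt c.log b.txt
rm a.txt c.log b.txt

# GNU tar only: pick members by glob while extracting to standard output,
# which replaces the separate "tar -t | grep | xargs" listing step.
matches=$(tar -Oxzf demo.tar.gz --wildcards '*.txt' | grep 'needle')
echo "$matches"
```

The c.log member is skipped because it does not match the glob, so only the two .txt files are searched.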